The rapid expansion of e-commerce has turned online shopping into a complex ecosystem in which consumers must process diverse types of information simultaneously. Previous research has typically examined images and text either separately or in terms of their similarity; the interactive effects between them, and in particular the specific types of information each modality conveys, have been less explored. This study examines how differences in the information presented by images and text influence online purchase decisions. Using deep learning models, we analyzed over 15 million transactions and 705,056 images from a leading e-commerce platform. Our econometric analysis reveals that images with rich contextual information diminish the positive impact of textual information, and that this effect varies by product type, as explained by the Elaboration Likelihood Model. We further find that image-text similarity negatively affects sales, highlighting the benefits of providing distinct information through each modality. This research contributes foundational insights into multimodal data analytics in e-commerce.