An artificial intelligence (AI) system capable of simulating functions of a human brain, such as recognition and judgment, by using a machine learning algorithm such as deep learning, and an application thereof, are provided. A method of learning multi-modal data, performed by the AI system and the application thereof, includes: obtaining first context information representing a characteristic of a first signal and second context information representing a characteristic of a second signal by using a first learning network model; obtaining hidden layer information based on the first context information and the second context information by using a second learning network model; obtaining a correlation value representing a degree of relation between pieces of the hidden layer information by using the second learning network model; and learning the hidden layer information for which the correlation value is derived as a maximum value.
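
The correlation-maximizing step described above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the method of the disclosure: the context information of each signal is assumed to be already available as a list of two-dimensional vectors, the second learning network model is replaced by a single linear projection per signal, the correlation value is assumed to be the Pearson correlation, and the maximum is found by a coarse grid search rather than by training. The names `pearson`, `project`, and `best_projection` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def project(contexts, angle):
    """Hypothetical stand-in for a hidden layer: project each 2-D context
    vector onto the unit direction (cos(angle), sin(angle))."""
    wx, wy = math.cos(angle), math.sin(angle)
    return [wx * c[0] + wy * c[1] for c in contexts]


def best_projection(ctx_a, ctx_b, steps=36):
    """Grid-search the two projection angles and keep the pair whose hidden
    values have the largest correlation (assumption: exhaustive search is
    used here in place of gradient-based learning)."""
    best = None
    for i in range(steps):
        angle_a = 2 * math.pi * i / steps
        h_a = project(ctx_a, angle_a)
        for j in range(steps):
            angle_b = 2 * math.pi * j / steps
            h_b = project(ctx_b, angle_b)
            r = pearson(h_a, h_b)
            if best is None or r > best[0]:
                best = (r, angle_a, angle_b)
    return best


# Made-up paired context vectors for two signals. Both share a latent
# factor t; the other coordinate of each vector is unrelated noise.
t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise_a = [0.3, -0.8, 0.5, -0.1, 0.9, -0.4]
noise_b = [-0.6, 0.2, 0.7, -0.9, 0.1, 0.4]
ctx_a = [(ti, na) for ti, na in zip(t, noise_a)]
ctx_b = [(nb, 2 * ti + 1) for ti, nb in zip(t, noise_b)]

r, angle_a, angle_b = best_projection(ctx_a, ctx_b)
print(f"max correlation {r:.4f} at angles {angle_a:.3f}, {angle_b:.3f}")
```

In this toy data the shared factor sits in the first coordinate of one signal and the second coordinate of the other, so the search should settle on projections that isolate those coordinates and report a correlation close to 1. A practical system would presumably learn nonlinear hidden layers by optimizing the correlation as an objective instead of enumerating candidates.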