DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Kweon, In So | - |
dc.contributor.advisor | 권인소 | - |
dc.contributor.author | Senocak, Arda | - |
dc.date.accessioned | 2023-06-23T19:33:30Z | - |
dc.date.available | 2023-06-23T19:33:30Z | - |
dc.date.issued | 2022 | - |
dc.identifier.uri | http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1007849&flag=dissertation | en_US |
dc.identifier.uri | http://hdl.handle.net/10203/309064 | - |
dc.description | Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology : School of Electrical Engineering, 2022.8, [ix, 73 p.] | - |
dc.description.abstract | If we consider how we, as human beings, experience the world around us, we realize that we continuously use all of our senses. From these different sensory signals, we learn about and understand scenes. Whether a person sees an image of a lion, hears a lion roaring, or hears someone say the word "lion", the same response is triggered in the human brain. Although human perception uses multimodal information, most existing models for understanding the scene around us deal with only a single modality, such as vision. Developing machine perception that uses multimodal data is therefore essential. Among these sensory signals, the most dominant are inarguably vision and audition. Sound is not only complementary to visual information but also correlated with visual events: when we see a car moving, we hear its engine at the same time. In this thesis, I introduce computational models that find correspondences and complementary information between audio and visual signals. I present several tasks that benefit from correspondence information, such as sound source localization, audio-visual cross-modal retrieval, and audio-visual important-moment selection in videos. I propose effective self-supervised, semi-supervised, and weakly-supervised methods for learning audio-visual correspondence. I also discuss the different relationships between audio and visual signals, as they do not follow a single type of relationship, and leverage the two signals as complementary information to each other in video understanding tasks by following different forms of audio-visual composition. | - |
dc.language | eng | - |
dc.publisher | 한국과학기술원 | - |
dc.subject | Audio-visual learning; Self-supervision; Multimodal learning | - |
dc.subject | Audio-visual learning; Self-supervised learning; Multimodal learning (Korean keywords) | - |
dc.title | Learning audio-visual relationships and correspondences in the visual scenes | - |
dc.title.alternative | 시각과 청각 정보 간의 관련성 학습 기법 | - |
dc.type | Thesis (Ph.D.) | - |
dc.identifier.CNRN | 325007 | - |
dc.description.department | 한국과학기술원 : 전기및전자공학부 | - |
dc.contributor.alternativeauthor | 세노자크 아르다 | - |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.