Delving into human speech understanding through multimodal representation learning

DC Field : Value
dc.contributor.advisor: 노용만
dc.contributor.author: Hong, Joanna
dc.contributor.author: 홍요안나
dc.date.accessioned: 2024-08-08T19:31:33Z
dc.date.available: 2024-08-08T19:31:33Z
dc.date.issued: 2024
dc.identifier.uri: http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1100039&flag=dissertation (en_US)
dc.identifier.uri: http://hdl.handle.net/10203/322139
dc.description: Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology: School of Electrical Engineering, 2024.2, [iv, 54 p.]
dc.description.abstract: Speech perception is inherently multimodal: in human communication, visual information is routinely used and readily integrated with auditory speech. Aligned with human perception, machines can also better comprehend human communication by considering multiple modalities, and it is widely known that using complementary information from different modalities is effective for understanding speech. In this research, we address several issues that commonly arise in speech understanding techniques and provide solutions for the specific task of speech recognition using multimodal audio-visual information. First, we deal with human communication in noisy environments. Since visual information is not affected by acoustic noise, we design a noise-robust audio-visual speech recognition system that enhances noisy input audio using audio-visual correspondence. Second, we consider the case where both audio and visual information are corrupted: in real life, clean visual inputs are not always accessible and can even be corrupted by occluded lip regions or noise. We first show that previous speech recognition models are not robust to such corruption of the multimodal input streams; we then design multimodal input corruption modeling and develop an audio-visual speech recognition model that is robust to both audio and visual corruption. Third, we extend the study to a multilingual viewpoint, where existing multilingual techniques face a critical problem of data imbalance among languages. Motivated by the human cognitive ability to intuitively distinguish different languages without any conscious effort or guidance, we design a model that captures and recognizes which language is given as input speech. Overall, the proposed research aims to bridge the gaps caused by the insufficiency of certain modalities in communication, allowing for a more comprehensive understanding of human communication processes. The effectiveness of the proposed methods is evaluated through comprehensive experiments.
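The corruption-robust training described in the abstract hinges on synthesizing corrupted audio-visual pairs. Below is a minimal Python sketch of what such multimodal input corruption modeling could look like; it is an illustrative assumption, not the thesis's actual pipeline, and every name, the SNR range, and the fixed lip-region box are hypothetical. It mixes a noise clip into the audio at a target SNR and blacks out the lip region of the video frames, the two corruption types the abstract mentions.

```python
import numpy as np

def corrupt_audio(audio: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix a noise clip into clean audio at a target SNR in dB."""
    noise = np.resize(noise, audio.shape)  # tile/crop the noise to match the audio length
    p_signal = np.mean(audio ** 2) + 1e-10
    p_noise = np.mean(noise ** 2) + 1e-10
    # Choose a gain so that p_signal / (gain**2 * p_noise) == 10 ** (snr_db / 10).
    gain = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return audio + gain * noise

def occlude_lips(frames: np.ndarray, box: tuple) -> np.ndarray:
    """Mask a fixed lip-region box (y0, y1, x0, x1) in a (T, H, W, C) video clip."""
    y0, y1, x0, x1 = box
    out = frames.copy()
    out[:, y0:y1, x0:x1, :] = 0.0  # simple black-patch occlusion of the mouth area
    return out

def corrupt_pair(audio, frames, noise, rng=np.random.default_rng()):
    """Randomly corrupt one or both modalities of an audio-visual training pair."""
    if rng.random() < 0.5:
        audio = corrupt_audio(audio, noise, snr_db=rng.uniform(-5.0, 10.0))
    if rng.random() < 0.5:
        frames = occlude_lips(frames, box=(40, 70, 30, 66))  # hypothetical lip box
    return audio, frames
```

In a robustness-oriented training loop, a function like corrupt_pair would be applied on the fly, so the recognizer sees clean, singly corrupted, and doubly corrupted examples of the same utterance.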
dc.language: eng
dc.publisher: 한국과학기술원 (Korea Advanced Institute of Science and Technology)
dc.subject: Multimodal; Audio-visual; Speech processing; Speech understanding; Audio-visual speech recognition; Multilingual speech recognition
dc.title: Delving into human speech understanding through multimodal representation learning
dc.title.alternative: 멀티모달 표현 학습을 통한 인간 음성 이해 연구 (A study of human speech understanding through multimodal representation learning)
dc.type: Thesis (Ph.D.)
dc.identifier.CNRN: 325007
dc.description.department: Korea Advanced Institute of Science and Technology, School of Electrical Engineering
dc.contributor.alternativeauthor: Ro, Yong Man
Appears in Collection:
EE-Theses_Ph.D. (Ph.D. theses)
Files in This Item:
There are no files associated with this item.
