Delving into human speech understanding through multimodal representation learning

Abstract
Speech perception is inherently multimodal. In human communication, visual information is routinely utilized and readily integrated with auditory speech. Aligned with human perception, machines can also better comprehend human communication by considering multiple modalities, and it is widely known that exploiting complementary information from different modalities is effective for understanding speech. In this research, we address several issues that commonly arise in speech understanding techniques and provide solutions for the specific task of speech recognition using multimodal audio-visual information. First, we deal with the issue of human communication in noisy environments. Since visual information is not affected by acoustic noise, we design a noise-robust audio-visual speech recognition system that enhances a noisy input audio speech using audio-visual correspondence. Second, we consider the case where both audio and visual information are corrupted: in real life, clean visual inputs are not always accessible and can even be corrupted by occluded lip regions or noise. We first show that previous speech recognition models are not robust to the corruption of multimodal input streams. We then design multimodal input corruption modeling and develop an audio-visual speech recognition model that is robust to both audio and visual corruption. Third, we extend the work to challenges from a multilingual viewpoint, where existing multilingual techniques face a critical problem of data imbalance among languages. Motivated by the human cognitive ability to intuitively distinguish different languages without any conscious effort or guidance, we design a model that can capture and recognize which language is spoken in the input speech. Overall, the proposed research aims to bridge the gaps caused by the insufficiency of certain modalities in communication, allowing for a more comprehensive understanding of human communication processes. The effectiveness of the proposed methods is evaluated with comprehensive experiments.
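To illustrate the first contribution, below is a minimal, hypothetical sketch of using audio-visual correspondence to enhance noisy audio: visual (lip) features, which are unaffected by acoustic noise, produce a gate that attenuates noise-dominated audio feature dimensions. The gating formulation, shapes, and function names are illustrative assumptions, not the thesis architecture.

```python
# Minimal sketch: gate noisy audio features by their agreement with the
# (noise-free) visual stream. All shapes and the gating form are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def av_enhance(audio_feat, visual_feat, w_a, w_v):
    """Gate each audio feature dimension by audio-visual correspondence.

    audio_feat : (T, D) noisy audio features
    visual_feat: (T, Dv) lip-motion features (unaffected by acoustic noise)
    w_a, w_v   : projections into a shared space (random stand-ins here)
    """
    # Project both streams into a shared space; their combined response
    # yields a per-time, per-dimension gate in (0, 1).
    gate = sigmoid(audio_feat @ w_a + visual_feat @ w_v)
    return gate * audio_feat  # attenuate dimensions the gate deems noisy

T, D, Dv = 25, 64, 32
audio_feat = rng.standard_normal((T, D))
visual_feat = rng.standard_normal((T, Dv))
enhanced = av_enhance(audio_feat, visual_feat,
                      rng.standard_normal((D, D)), rng.standard_normal((Dv, D)))
```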
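The second contribution names "multimodal input corruption modeling". A plausible reading is that both streams are randomly corrupted during training so the recognizer learns to tolerate either one failing. The sketch below follows that reading; the SNR range, occlusion sizes, and probabilities are assumptions for illustration only.

```python
# Hypothetical corruption modeling: inject audio noise at a random SNR and
# occlude random patches of the lip region. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def corrupt_audio(waveform: np.ndarray, snr_db_range=(-5.0, 20.0)) -> np.ndarray:
    """Add noise at a random signal-to-noise ratio drawn from snr_db_range."""
    snr_db = rng.uniform(*snr_db_range)
    signal_power = np.mean(waveform ** 2) + 1e-12
    noise = rng.standard_normal(waveform.shape)
    # Scale noise so that 10*log10(signal_power / noise_power) == snr_db.
    scale = np.sqrt(signal_power / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return waveform + scale * noise

def corrupt_video(frames: np.ndarray, occlude_prob=0.5) -> np.ndarray:
    """Zero out a random rectangular patch of each lip frame with some probability."""
    frames = frames.copy()
    t, h, w = frames.shape
    for i in range(t):
        if rng.random() < occlude_prob:
            ph = int(h * rng.uniform(0.2, 0.5))
            pw = int(w * rng.uniform(0.2, 0.5))
            y = rng.integers(0, h - ph + 1)
            x = rng.integers(0, w - pw + 1)
            frames[i, y:y + ph, x:x + pw] = 0.0  # simulated occlusion
    return frames

# Example: a 1-second 16 kHz waveform and 25 grayscale 88x88 lip frames.
noisy_audio = corrupt_audio(rng.standard_normal(16000))
occluded_video = corrupt_video(rng.random((25, 88, 88)))
```

Training on such paired corruptions is a standard way to expose a model to degraded inputs in both streams at once, which matches the robustness claim in the abstract.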
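For the third contribution, one simple realization of a model that "recognizes which language is given" is an utterance-level language-identification head whose prediction can condition the recognizer. The pooling, label set, and classifier below are assumptions, not the thesis design.

```python
# Hypothetical language-identification head: mean-pool frame features and
# classify the utterance's language. Label set and layout are illustrative.
import numpy as np

rng = np.random.default_rng(2)
LANGUAGES = ["en", "es", "fr", "it", "pt"]  # example label set

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def identify_language(speech_feat, w, b):
    """Return the predicted language and its probability distribution."""
    pooled = speech_feat.mean(axis=0)   # (D,) utterance-level summary
    probs = softmax(pooled @ w + b)     # distribution over LANGUAGES
    return LANGUAGES[int(np.argmax(probs))], probs

D = 64
speech_feat = rng.standard_normal((25, D))
w, b = rng.standard_normal((D, len(LANGUAGES))), np.zeros(len(LANGUAGES))
lang, probs = identify_language(speech_feat, w, b)
```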
Advisors
노용만 (Yong Man Ro)
Description
Korea Advanced Institute of Science and Technology : School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2024
Identifier
325007
Language
eng
Description

Doctoral thesis (Ph.D.) - Korea Advanced Institute of Science and Technology : School of Electrical Engineering, 2024.2, [iv, 54 p.]

Keywords

Multimodal; Audio-visual; Speech processing; Speech understanding; Audio-visual speech recognition; Multilingual speech recognition

URI
http://hdl.handle.net/10203/322139
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1100039&flag=dissertation
Appears in Collection
EE-Theses_Ph.D. (Doctoral Theses)
Files in This Item
There are no files associated with this item.
