Delving into human speech understanding through multimodal representation learning

Abstract
Speech perception is inherently multimodal. In human communication, visual information is routinely utilized and readily integrated with auditory speech. Aligned with human perception, machines can also better comprehend human communication by considering multiple modalities, and it is widely known that exploiting complementary information from different modalities is effective for understanding speech. In this research, we address several issues that commonly arise in speech understanding techniques and provide solutions for the specific task of speech recognition using multimodal audio-visual information. First, we deal with the issue of human communication in noisy environments. Since visual information is not affected by acoustic noise, we design a noise-robust audio-visual speech recognition system that enhances a noisy input audio speech using audio-visual correspondence. Second, we consider the case where both audio and visual information are corrupted: in real life, clean visual inputs are not always accessible and can even be corrupted by occluded lip regions or noise. We first show that previous speech recognition models are not robust to the corruption of multimodal input streams. We then design multimodal input corruption modeling and develop an audio-visual speech recognition model that is robust to both audio and visual corruption. Third, we extend the work to challenges from a multilingual viewpoint, where existing multilingual techniques face a critical problem of data imbalance among languages. Motivated by the human cognitive ability to intuitively distinguish different languages without any conscious effort or guidance, we design a model that can capture and recognize which language is spoken in the input speech. Overall, the proposed research aims to bridge the gaps caused by the insufficiency of certain modalities in communication, allowing for a more comprehensive understanding of human communication processes. The effectiveness of the proposed methods is evaluated with comprehensive experiments.
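To illustrate the first contribution, below is a minimal, hypothetical sketch of using audio-visual correspondence to enhance noisy audio: visual (lip) features, which are unaffected by acoustic noise, produce a gate that attenuates noise-dominated audio feature dimensions. The gating formulation, shapes, and function names are illustrative assumptions, not the thesis architecture.

```python
# Minimal sketch: gate noisy audio features by their agreement with the
# (noise-free) visual stream. All shapes and the gating form are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def av_enhance(audio_feat, visual_feat, w_a, w_v):
    """Gate each audio feature dimension by audio-visual correspondence.

    audio_feat : (T, D) noisy audio features
    visual_feat: (T, Dv) lip-motion features (unaffected by acoustic noise)
    w_a, w_v   : projections into a shared space (random stand-ins here)
    """
    # Project both streams into a shared space; their combined response
    # yields a per-time, per-dimension gate in (0, 1).
    gate = sigmoid(audio_feat @ w_a + visual_feat @ w_v)
    return gate * audio_feat  # attenuate dimensions the gate deems noisy

T, D, Dv = 25, 64, 32
audio_feat = rng.standard_normal((T, D))
visual_feat = rng.standard_normal((T, Dv))
enhanced = av_enhance(audio_feat, visual_feat,
                      rng.standard_normal((D, D)), rng.standard_normal((Dv, D)))
```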
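The second contribution names "multimodal input corruption modeling". A plausible reading is that both streams are randomly corrupted during training so the recognizer learns to tolerate either one failing. The sketch below follows that reading; the SNR range, occlusion sizes, and probabilities are assumptions for illustration only.

```python
# Hypothetical corruption modeling: inject audio noise at a random SNR and
# occlude random patches of the lip region. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def corrupt_audio(waveform: np.ndarray, snr_db_range=(-5.0, 20.0)) -> np.ndarray:
    """Add noise at a random signal-to-noise ratio drawn from snr_db_range."""
    snr_db = rng.uniform(*snr_db_range)
    signal_power = np.mean(waveform ** 2) + 1e-12
    noise = rng.standard_normal(waveform.shape)
    # Scale noise so that 10*log10(signal_power / noise_power) == snr_db.
    scale = np.sqrt(signal_power / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return waveform + scale * noise

def corrupt_video(frames: np.ndarray, occlude_prob=0.5) -> np.ndarray:
    """Zero out a random rectangular patch of each lip frame with some probability."""
    frames = frames.copy()
    t, h, w = frames.shape
    for i in range(t):
        if rng.random() < occlude_prob:
            ph = int(h * rng.uniform(0.2, 0.5))
            pw = int(w * rng.uniform(0.2, 0.5))
            y = rng.integers(0, h - ph + 1)
            x = rng.integers(0, w - pw + 1)
            frames[i, y:y + ph, x:x + pw] = 0.0  # simulated occlusion
    return frames

# Example: a 1-second 16 kHz waveform and 25 grayscale 88x88 lip frames.
noisy_audio = corrupt_audio(rng.standard_normal(16000))
occluded_video = corrupt_video(rng.random((25, 88, 88)))
```

Training on such paired corruptions is a standard way to expose a model to degraded inputs in both streams at once, which matches the robustness claim in the abstract.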
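For the third contribution, one simple realization of a model that "recognizes which language is given" is an utterance-level language-identification head whose prediction can condition the recognizer. The pooling, label set, and classifier below are assumptions, not the thesis design.

```python
# Hypothetical language-identification head: mean-pool frame features and
# classify the utterance's language. Label set and layout are illustrative.
import numpy as np

rng = np.random.default_rng(2)
LANGUAGES = ["en", "es", "fr", "it", "pt"]  # example label set

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def identify_language(speech_feat, w, b):
    """Return the predicted language and its probability distribution."""
    pooled = speech_feat.mean(axis=0)   # (D,) utterance-level summary
    probs = softmax(pooled @ w + b)     # distribution over LANGUAGES
    return LANGUAGES[int(np.argmax(probs))], probs

D = 64
speech_feat = rng.standard_normal((25, D))
w, b = rng.standard_normal((D, len(LANGUAGES))), np.zeros(len(LANGUAGES))
lang, probs = identify_language(speech_feat, w, b)
```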
Advisors
노용만 (Yong Man Ro)
Description
Korea Advanced Institute of Science and Technology : School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2024
Identifier
325007
Language
eng
Description

Doctoral thesis (Ph.D.) - Korea Advanced Institute of Science and Technology : School of Electrical Engineering, 2024.2, [iv, 54 p.]

Keywords

Multimodal; Audio-visual; Speech processing; Speech understanding; Audio-visual speech recognition; Multilingual speech recognition

URI
http://hdl.handle.net/10203/322139
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1100039&flag=dissertation
Appears in Collection
EE-Theses_Ph.D. (Doctoral Theses)
Files in This Item
There are no files associated with this item.
