Multimodal Language Processing by Employing Phonetic and Discrete Characteristics of Speech Units

When humans communicate with each other, they naturally draw on multimodal information such as visual, audio, and text cues. This multimodal information helps humans better understand the intent and content of an ongoing conversation, because the human brain excels at modeling the relationships among different multimodal features. We explore how a machine can be made to understand the relationships between different modalities. However, because each modality takes a different data form, building a separate module for each data type is difficult: audio is a continuous-time signal, video and images are 2-dimensional signals that may carry optional temporal information, and text is a discrete signal without temporal structure. To extract common representations from the audio speech, visual speech, and text modalities, we explore a discretized speech representation, the speech unit. Speech units are obtained by clustering (i.e., discretizing) features extracted from a pre-trained self-supervised speech model. Because they are discrete, continuous audio and visual signals can be expressed with discrete representations, while the phonetic information of the speech is preserved. By exploiting these two characteristics of speech units, phonetic content and discreteness, we show that we can improve several multimodal translation systems: visual speech-to-text translation, speech-to-speech translation, and text-to-speech translation. First, in visual speech-to-text translation, we show that speech units allow us to learn general visual speech knowledge that does not depend on a specific language, improving Visual Speech Recognition (VSR) performance for languages with scarce VSR resources. Second, in speech-to-speech and text-to-speech translation, the discrete nature of speech units lets us train a machine translation system the same way text-based systems are trained; that is, we treat speech units as pseudo text and show that speech-to-speech translation across multiple languages becomes possible. The effectiveness of the proposed methods is evaluated with extensive experiments, including comparisons with state-of-the-art methods, ablation studies, and qualitative analysis.
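The unit-extraction step described in the abstract can be illustrated with a minimal sketch: frame-level features from a pre-trained self-supervised model are quantized with k-means, and the resulting cluster indices form the discrete "pseudo text". The specific extractor (torchaudio's HuBERT-Base bundle), the layer choice, the cluster count K = 100, and the run-length deduplication below are illustrative assumptions, not the thesis's exact configuration; in practice the k-means codebook would be fit over features pooled from a large corpus rather than a single utterance.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

# Pre-trained self-supervised speech model (HuBERT-Base, as an example).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def extract_features(waveform: torch.Tensor) -> torch.Tensor:
    """Return frame-level features of shape (frames, dim).

    Uses the last transformer layer; the layer is a hyperparameter in
    practice and may differ from what the thesis uses.
    """
    with torch.no_grad():
        features, _ = model.extract_features(waveform)
    return features[-1].squeeze(0)

# Placeholder for a real 16 kHz utterance (~10 s of audio).
waveform = torch.randn(1, 160000)
feats = extract_features(waveform)

# Fit k-means; each cluster index then serves as one discrete speech unit.
K = 100  # unit vocabularies of roughly 50-2000 are common in the literature
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(feats.numpy())

# Quantize an utterance into a sequence of unit IDs ("pseudo text").
units = kmeans.predict(extract_features(waveform).numpy())

# Collapsing consecutive repeats is a common post-processing step before
# feeding the units to a text-style translation model.
deduped = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(deduped[:20])
```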
Advisors
노용만 (Yong Man Ro)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2024
Identifier
325007
Language
eng
Description

Doctoral thesis - Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering, 2024.2, [vi, 63 p.]

Keywords

Multimodal speech processing; multimodal processing; discretized self-supervised representation; speech unit; speech token; visual speech recognition; speech-to-speech translation; text-to-speech translation

URI
http://hdl.handle.net/10203/322185
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1100091&flag=dissertation
Appears in Collection
EE-Theses_Ph.D. (Doctoral Theses)
Files in This Item
There are no files associated with this item.
