One-shot multi-speaker text-to-speech using RawNet3 speaker representation
(Korean title: A study on one-shot multi-speaker speech synthesis based on speaker characteristics extracted with RawNet3)

Recent advances in text-to-speech (TTS) technology have significantly improved the quality and naturalness of synthesized speech, to the point where it closely resembles human speech. Consequently, TTS systems are used in various fields such as automated response systems (ARS), mobile phone voice assistants, AI tutors, advertisements, movie and content dubbing, and even the development of models for language disorder therapy. There is therefore a growing need for TTS models that can exhibit diverse acoustic characteristics and synthesize speech from a single speech file (one-shot) of an unseen speaker while preserving that speaker’s unique characteristics. To this end, this thesis proposes a one-shot multi-speaker TTS model that combines the FastSpeech2 acoustic model and the HiFi-GAN vocoder with an additional speaker encoder. The speaker encoder uses a pre-trained RawNet3 model to extract speaker-related information, so that the speaker’s characteristics are incorporated into both the training and synthesis processes. This enables the generation of speech with the unique voice attributes of speakers unseen during training. Objective and subjective evaluations show that the proposed model outperforms the comparison models in terms of both naturalness and speaker similarity. Furthermore, the thesis extends the proposed approach beyond an English one-shot multi-speaker TTS model to a Korean counterpart.
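As a rough illustration of the speaker-conditioning idea in the abstract, the sketch below projects a fixed-size speaker embedding to the text encoder's hidden size and adds it to every time step. This is only a minimal sketch under stated assumptions: the record does not say how the thesis injects the embedding, so the broadcast-add scheme, the toy dimensions, and the names `project`, `condition_on_speaker`, and `cosine_similarity` are all hypothetical. Random vectors stand in for the RawNet3 output and the FastSpeech2 encoder states. Cosine similarity is shown only as one common way to compare speaker embeddings, not necessarily the metric used in the thesis.

```python
import math
import random


def project(embedding, weights):
    """Linear projection of a speaker embedding to the encoder hidden size."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def condition_on_speaker(encoder_states, speaker_vector):
    """Add the same speaker vector to every time step of the encoder output."""
    return [[h + s for h, s in zip(state, speaker_vector)]
            for state in encoder_states]


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy sizes; real models use far larger dimensions.
EMB_DIM, HIDDEN_DIM, NUM_PHONEMES = 4, 3, 5
rng = random.Random(0)

# Stand-in for an embedding extracted from one reference utterance.
speaker_embedding = [rng.uniform(-1, 1) for _ in range(EMB_DIM)]
# Stand-in for a learned projection matrix (HIDDEN_DIM x EMB_DIM).
projection = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)]
              for _ in range(HIDDEN_DIM)]
# Stand-in for text encoder outputs, one hidden vector per phoneme.
encoder_states = [[0.0] * HIDDEN_DIM for _ in range(NUM_PHONEMES)]

speaker_vector = project(speaker_embedding, projection)
conditioned = condition_on_speaker(encoder_states, speaker_vector)
```

In this sketch the conditioned states would then feed the rest of the acoustic model, and the vocoder would turn the predicted mel-spectrogram into a waveform. Because the speaker vector comes from a single utterance, no per-speaker fine-tuning is assumed.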
Advisors
Kim, Hoirin (김회린)
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2024
Identifier
325007
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering, 2024.2, [vi, 54 p.]

Keywords

Speech Synthesis; Multi-Speaker TTS; Speaker Embedding; Speaker Adaptation; One-Shot Speech Synthesis

URI
http://hdl.handle.net/10203/321651
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1097229&flag=dissertation
Appears in Collection
EE-Theses_Master(석사논문)
Files in This Item
There are no files associated with this item.
