DSpace at KOASAS: One-shot multi-speaker text-to-speech using RawNet3 speaker representation

One-shot multi-speaker text-to-speech using RawNet3 speaker representation
(Korean title, translated: A study on one-shot multi-speaker speech synthesis based on speaker characteristics extracted with RawNet3)

Han, Sohee / 한소희

Recent advances in text-to-speech (TTS) technology have significantly improved the quality and naturalness of synthesized speech, to the point where it can produce natural-sounding voices that closely resemble human speech. Consequently, TTS systems are used in many fields, such as automated response systems (ARS), mobile phone voice assistants, AI tutors, advertisements, movie and content dubbing, and even the development of models for language disorder therapy. There is therefore a growing need for TTS models that can exhibit diverse acoustic characteristics and synthesize a voice from a single speech file (one-shot) of an unseen speaker while preserving that speaker's unique characteristics. To this end, this thesis proposes a one-shot multi-speaker TTS model that combines the FastSpeech2 acoustic model and the HiFi-GAN vocoder with an additional speaker encoder. The speaker encoder uses a pre-trained RawNet3 model to extract speaker-related information, so that the speaker's characteristics are incorporated into both training and synthesis. This enables the generation of speech with the voice attributes of speakers who were not seen during training. Objective and subjective evaluations show that the proposed model outperforms the comparison models in both naturalness and speaker similarity. Furthermore, the thesis extends the proposed approach beyond an English one-shot multi-speaker TTS model to a Korean counterpart.
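The abstract does not say how the RawNet3 speaker embedding is injected into the acoustic model. The sketch below illustrates one common option, assumed here purely for illustration: project a fixed-size speaker embedding to the encoder's hidden size and add it to every encoder time step. All names (`linear`, `condition_on_speaker`), the toy dimensions, and the hand-picked weights are hypothetical. In a real system the embedding would come from the pre-trained RawNet3 model and the projection would be learned.

```python
# Minimal sketch of additive speaker conditioning (an assumption, not the
# thesis's confirmed method). Pure Python, toy dimensions.

def linear(vec, weight, bias):
    """Compute y = W x + b for one vector; `weight` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def condition_on_speaker(encoder_states, speaker_embedding, weight, bias):
    """Add the projected speaker embedding to every encoder time step."""
    projected = linear(speaker_embedding, weight, bias)
    return [[h + p for h, p in zip(state, projected)]
            for state in encoder_states]

# Toy stand-in for a speaker embedding (dimension 4). A real embedding would
# be extracted from a single reference utterance by the speaker encoder.
speaker_embedding = [0.5, -0.5, 1.0, 1.0]

# Hypothetical projection from embedding size 4 to hidden size 3.
weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
bias = [0.0, 0.0, 0.0]

# Toy encoder output: 2 time steps, hidden size 3.
encoder_states = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]

conditioned = condition_on_speaker(encoder_states, speaker_embedding, weight, bias)
print(conditioned)
```

Because the same projected vector is added at every time step, the speaker identity acts as a global bias on the text encoding. A new speaker can then be handled at synthesis time by swapping in a different embedding, without retraining the acoustic model.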

Advisors: Kim, Hoirin (김회린)

Description: Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering

Publisher: Korea Advanced Institute of Science and Technology (KAIST)

Issue Date: 2024

Identifier: 325007

Language: eng

Description: Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering, 2024.2, [vi, 54 p.]

Keywords: Speech Synthesis; Multi-Speaker TTS; Speaker Embedding; Speaker Adaptation; One-Shot Speech Synthesis

URI: http://hdl.handle.net/10203/321651

Link: http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1097229&flag=dissertation

Appears in Collection: EE-Theses_Master (Master's Theses)

Files in This Item: There are no files associated with this item.
