A text-to-speech (TTS) system converts arbitrary text into synthetic speech. As TTS systems are incorporated into a growing range of applications, such as e-mail readers and language-education systems, users' desire for higher-quality systems is increasing. Recently, large corpus-based concatenative speech synthesis has become the most popular approach for constructing TTS systems. With this method, it should be possible to synthesize more natural-sounding speech than can be produced with a small set of controlled units. Although the intelligibility of TTS systems built with this method is extremely good, and certainly good enough for many real applications, the lack of natural prosody remains the main barrier to meeting users' expectations. Prosody, therefore, is the feature of TTS systems most in need of improvement. In this thesis, we develop a large corpus-based Korean TTS system and propose prosody control methods for the system to improve the naturalness of synthetic speech.
The implemented TTS system uses the triphone as its basic unit for concatenation. Its speech corpus contains 400,042 triphone instances covering 16,072 unique triphone types. Since a triphone includes context information, it can represent all possible allophones. However, using a triphone as the basic synthesis unit poses two problems. One is the absence or sparsity of some triphone types; the other is the size of the search space caused by triphone types that have too many instances. To avoid these problems in the text selection process, where the set of sentences for recording is prepared, we use a greedy algorithm with a score table designed to account for both triphone coverage and the balance of instances. After recording the speech corpus, we use bottom-up clustering and three backing-off trees to solve the sparsity problem. To reduce the search space for real-time processing, we use pre-selected candidate unit lists, and the performance te...
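The greedy text selection described above can be illustrated with a minimal sketch. It is not the score table used in this work, whose exact design is not given here. The sketch assumes that each candidate sentence is already a list of phone symbols, that utterance boundaries are padded with a hypothetical `sil` symbol, and that the score of a triphone falls as 1/(1+n) with the number of instances n already collected, dropping to zero at a hypothetical per-type target `cap`. The names `triphones`, `gain`, and `greedy_select` are illustrative only.

```python
from collections import Counter


def triphones(phones):
    """Return the (left, center, right) triphones of a phone sequence.

    Assumption: utterance boundaries are padded with a "sil" symbol.
    """
    padded = ["sil"] + list(phones) + ["sil"]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]


def gain(tris, counts, cap):
    """Score a sentence: rare triphones score high, saturated ones score zero."""
    seen = Counter()  # instances contributed by this sentence so far
    total = 0.0
    for t in tris:
        n = counts[t] + seen[t]
        if n < cap:  # hypothetical balance target per triphone type
            total += 1.0 / (1 + n)
        seen[t] += 1
    return total


def greedy_select(sentences, max_sentences, cap=3):
    """Greedily pick sentence ids that add new or under-represented triphones.

    sentences: dict mapping sentence id -> list of phone symbols.
    Returns the selected ids in pick order and the resulting triphone counts.
    """
    counts = Counter()
    remaining = {sid: triphones(p) for sid, p in sentences.items()}
    selected = []
    while remaining and len(selected) < max_sentences:
        best = max(remaining, key=lambda sid: gain(remaining[sid], counts, cap))
        if gain(remaining[best], counts, cap) == 0.0:
            break  # nothing left improves coverage or balance
        selected.append(best)
        counts.update(remaining.pop(best))
    return selected, counts


if __name__ == "__main__":
    corpus = {"a": ["k", "a", "m"], "b": ["k", "a", "m"], "c": ["p", "o"]}
    picked, counts = greedy_select(corpus, max_sentences=2)
    print(picked, len(counts))
```

In this toy corpus the second pick is the shorter sentence `c` rather than the duplicate `b`, because triphones that are already covered contribute less to the score. This is how a single score can favor coverage while discouraging over-represented types.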