Automatic speech recognition (ASR) is one of the key techniques for human-machine interaction through voice and has recently been deployed in voice search, car navigation, and artificial intelligence speakers. Although ASR accuracy has been greatly improved by deep-learning-based techniques, its consistency still cannot be guaranteed in real environments owing to unpredictable speaking timing, background noise, reverberation, and interfering speakers. To build ASR systems that are robust in real environments, various front-end systems have been studied for decades, such as voice activity detection, speech enhancement, dereverberation, and source separation. Conventionally, most of them have depended on signal processing techniques; although they have contributed to the robustness of ASR, they still have limitations owing to their modeling assumptions about speech and noise environments. Recently, deep-learning-based front-end systems have outperformed signal-processing-based ones.
In this dissertation, we study and develop deep-learning-based techniques for two major sub-disciplines of front-end systems: single-microphone voice activity detection (VAD) and single-microphone speech enhancement (SE). Specifically, we focus on improving the utilization of the context information within speech signals in our VAD and SE models, as context information is known to be a crucial asset for deep-learning-based, speech-related applications.
For VAD, the context information (CI) of the speech signal has been considered key to detecting speech in a noisy signal. Although CI is a relevant asset for VAD, its usefulness can vary in unpredictable noise environments; that is, depending on the noise type, the importance of long- versus short-term CI can change. Therefore, its usage should be adaptively adjustable to the noise type. This dissertation improves the use of context information through an adaptive context attention model (ACAM) with a novel training strategy for effective attention, which weights the most crucial parts of the context for proper classification. Experiments in real-world scenarios demonstrate that the proposed ACAM-based VAD outperforms the baseline VAD methods.
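The core idea of attention-based VAD can be illustrated with a minimal sketch: attention weights score each frame in a context window, the frames are pooled into a weighted summary, and a classifier maps the summary to a speech probability. This is a toy illustration only, not the actual ACAM architecture; the function and parameter names (`attentive_vad_score`, `w_att`, `w_clf`) are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_vad_score(context, w_att, w_clf, b_clf):
    """Toy sketch of attention-weighted VAD over a context window.

    context: (T, D) array of T frame features around the current frame.
    w_att:   (D,) scoring vector producing one attention logit per frame.
    w_clf:   (D,) classifier weights; b_clf: scalar bias.
    Returns the speech probability for the center frame.
    """
    att = softmax(context @ w_att)        # (T,) weights over context frames
    pooled = att @ context                # (D,) attention-weighted summary
    logit = pooled @ w_clf + b_clf
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid -> P(speech)
```

In the real model, the attention scorer and classifier are learned jointly, so frames of the context that matter for the current noise type receive larger weights.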
For SE, a novel neural network architecture called the two-stage network (TSN), combined with a multi-objective learning (MOL) method for an efficient boosting strategy (BS), is proposed to exploit various CI at a reasonable computational cost. BS is an ensemble method that uses multiple base predictions (MBPs) to obtain a better final prediction. Because MBPs are required, the computational cost and model size of BS-based methods exceed those of a single model. In this regard, the TSN first obtains MBPs from different CI using a single deep neural network. Then, to obtain a better final prediction, the convolution layers of the TSN aggregate not only the MBPs but also auxiliary information such as contextual information, while adaptively filtering out unnecessary information, e.g., poor base predictions. During training, MOL enables all stages of the TSN to learn jointly, while allowing the TSN framework to embed a BS. Our experimental results confirm that the embedded BS leads the TSN to outperform the other baseline methods with a reasonably low computational cost and model size.
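The two-stage idea can be sketched as follows: stage 1 produces several base predictions from different context sizes, and stage 2 fuses them into one output. The sketch below is a deliberately simplified stand-in under assumed names (`base_predictions`, `aggregate`): the "base predictions" are just moving-average smoothings of a 1-D signal, and the fusion is a softmax-weighted sum rather than the TSN's learned convolutional aggregation.

```python
import numpy as np

def base_predictions(noisy, context_sizes):
    """Stage 1 (sketch): one smoothed estimate per context size.

    noisy: (T,) 1-D signal; context_sizes: list of window lengths.
    Returns an (n_bases, T) array of base predictions.
    """
    preds = []
    for c in context_sizes:
        kernel = np.ones(c) / c                         # moving-average window
        preds.append(np.convolve(noisy, kernel, mode="same"))
    return np.stack(preds)

def aggregate(preds, weights):
    """Stage 2 (sketch): weighted fusion of the base predictions,
    standing in for the TSN's learned convolutional aggregation.
    """
    w = np.exp(weights) / np.exp(weights).sum()         # softmax over bases
    return w @ preds                                    # (T,) final prediction
```

Because both stages would share one network and be trained jointly (as MOL does in the TSN), the ensemble benefit of boosting is obtained without the cost of training and storing several independent models.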
Furthermore, we propose auxiliary methods to translate the improvement in VAD into an improvement in ASR. As VAD is a frame-level classifier, it must be converted into an utterance-level classifier for ASR. To achieve this, an additional state transition model (STM) that cooperates with VAD is proposed; VAD combined with an STM is often referred to as end-point detection (EPD). Finally, we carry out an in-depth empirical analysis of the effect of the proposed EPD and SE on speech recognition performance.
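A common way to realize such a state transition model is a small finite-state machine with onset and hangover counters over the frame-level VAD decisions. The sketch below is a generic illustration of this idea, not the dissertation's specific STM; the function name and thresholds (`detect_endpoint`, `onset`, `hangover`) are hypothetical.

```python
def detect_endpoint(frame_vad, onset=3, hangover=5):
    """Toy EPD state machine over binary frame-level VAD decisions.

    Speech starts after `onset` consecutive speech frames; the utterance
    ends after `hangover` consecutive non-speech frames following speech.
    Returns (start_idx, end_idx) of the utterance, or None if no speech.
    """
    state, run, start = "silence", 0, None
    for i, v in enumerate(frame_vad):
        if state == "silence":
            run = run + 1 if v else 0          # count consecutive speech frames
            if run >= onset:
                state, start, run = "speech", i - onset + 1, 0
        else:
            run = run + 1 if not v else 0      # count consecutive silence frames
            if run >= hangover:
                return start, i - hangover     # index of the last speech frame
    return (start, len(frame_vad) - 1) if state == "speech" else None
```

The onset counter suppresses spurious triggers from isolated noisy frames, while the hangover counter prevents short pauses inside an utterance from being mistaken for its end-point.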