The hidden Markov model (HMM) has recently become the predominant approach to speech recognition. Although the conventional HMM is good at modeling the stationary and sequential characteristics of speech signals, it has two inherent drawbacks: poor duration modeling and weak discrimination between competing classes. In this dissertation, we present several methods to improve acoustic modeling in speech recognition based on continuous-density HMMs.
First, we propose to model and incorporate context-dependent word duration information to reduce insertion and deletion errors in connected digit recognizers. The proposed method differs from the conventional postprocessing-based method in that the duration information is incorporated directly into the Viterbi decoding algorithm. Experimental results show that the proposed method reduces word error rates by as much as 10% for unknown-length decoding, while the postprocessing method does not achieve significant improvements over a baseline system. Simple duration modeling with a bounded uniform distribution achieves, at low complexity, performance improvements comparable to those of detailed duration modeling with a gamma or Gaussian distribution, and it is therefore a good compromise between performance and complexity.
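To make the comparison of duration models concrete, the following is a minimal sketch of the three log-duration scores and of one way such a score could be added to a path score inside a decoder. It is an illustration under stated assumptions, not the dissertation's implementation: durations are assumed to be counted in frames, the score is assumed to be applied when a word hypothesis ends, and the function names, the `weight` parameter, and all parameter values are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_uniform_bounded(d, d_min, d_max):
    """Log-probability of duration d under a uniform model on [d_min, d_max] frames.

    Durations outside the bounds get -inf, so the path is pruned.
    """
    if d_min <= d <= d_max:
        return -math.log(d_max - d_min + 1)
    return NEG_INF

def log_gaussian(d, mean, std):
    """Log-density of duration d under a Gaussian duration model."""
    z = (d - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def log_gamma(d, shape, scale):
    """Log-density of duration d under a gamma duration model."""
    if d <= 0:
        return NEG_INF
    return ((shape - 1.0) * math.log(d) - d / scale
            - math.lgamma(shape) - shape * math.log(scale))

def word_exit_score(path_loglik, duration_logprob, weight=1.0):
    """Combine an accumulated path log-likelihood with a duration score.

    Assumption: the duration term is added (with a hypothetical weight)
    when a word hypothesis ends during decoding.
    """
    if duration_logprob == NEG_INF:
        return NEG_INF
    return path_loglik + weight * duration_logprob

# Hypothetical context-dependent bounds, keyed by (previous word, word).
bounds = {("one", "two"): (12, 40)}
d_min, d_max = bounds[("one", "two")]
score_ok = word_exit_score(-250.0, log_uniform_bounded(25, d_min, d_max))
score_pruned = word_exit_score(-250.0, log_uniform_bounded(5, d_min, d_max))
```

The bounded uniform model needs only two stored values per context and a range check at decoding time, which is the sense in which it is cheaper than the gamma or Gaussian models sketched here.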
Second, we propose a supersegment-based postprocessing approach to improve accuracy in connected digit recognition. A supersegment of a string is a concatenation of one or more segments whose begin- and end-points match those of the other strings within some tolerance. In this approach, N-best candidate strings are generated by a conventional recognizer and string-matched so that they are all represented by the same number of supersegments. We obtain total log likelihoods by combining the scores of the conventional first-stage recognizer and the supersegment-based second-stage postprocessor. Experimental results show that connected digit recognizers using the supersegment-based postprocessing method achieve about 20% decr...