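Robust acoustic word representation for personalized wake-up word detection
(Korean title, translated: A study on improving the robustness of acoustic word representations for personalized wake-up word detection)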

Wake-up word detection (WWD) is one of the most widely used speech applications; it manages resources efficiently by activating the device only when needed. In particular, personalized WWD, in which users customize their devices by registering a preferred wake-up word themselves, has recently received much attention for its flexibility and individuality. WWD attempts to detect the occurrence of a wake-up word in an incoming audio stream, so it requires the ability to discriminate against ordinary speech utterances that do not contain the specific word. Personalized WWD demands more care because it must handle an arbitrary wake-up word defined by the user. Accordingly, many studies have focused on suitable word representations, beginning with traditional hidden Markov model (HMM)-based and template-based approaches. More recently, embedding-based approaches have been proposed, in which a word is represented by a fixed-dimensional vector. This simple form of representation reduces the computational cost of WWD, helping it meet the constraint that it must always run on the device. Meanwhile, WWD can suffer performance degradation from interfering factors such as noise and reverberation that occur in real-world environments. To overcome these difficulties, this dissertation proposes embedding-based acoustic word representations that are robust to such environments.
First, we propose the interlayer selective attention network (ISAN), which pursues robustness of an acoustic word embedding by improving its ability to discriminate words. Inspired by the notion of selective attention, the method strengthens the representational power of an embedding by emphasizing the components that are relevant to certain characters in the word, where "relevant" and "irrelevant" are determined by the proposed interlayer selective attention mechanism. As a result, the embedding has an improved ability to distinguish words, allowing it to cope effectively with environmental factors such as noise and reverberation as well as with unpredictable wake-up words. Second, in contrast to the first approach, we propose a new training method, cross-informed domain adversarial training (CiDAT), that reduces disturbing environmental factors more directly. The proposed method improves on the existing domain adversarial training (DAT) method by introducing paths that explicitly remove irrelevant information. Experimental results showed that CiDAT outperformed the baselines, including DAT, regardless of noise type, with an overall relative improvement of over 70%. Finally, we present a way to combine the two methods. We integrated them sequentially: the environmental effect is reduced first, and the word discrimination ability of the resulting acoustic word embedding is then improved. In experiments under the same scenario as before, the integrated model outperformed each individual model, confirming that the two methods are complementary.
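To make the embedding-based approach mentioned in the abstract concrete, the following is a minimal sketch of how a fixed-dimensional word embedding can be used for detection. It is not the dissertation's method: the abstract does not say how embeddings are compared, so the cosine-similarity score, the averaging rule in `enroll`, the `threshold` value of 0.8, and all function names are assumptions made for illustration. The embeddings themselves would come from a trained acoustic model, which is not shown.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def enroll(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average several enrollment embeddings into one template.

    Averaging is an assumed enrollment rule, chosen only for illustration.
    """
    if not embeddings:
        raise ValueError("at least one enrollment embedding is required")
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def detect(template: Sequence[float],
           segment_embedding: Sequence[float],
           threshold: float = 0.8) -> bool:
    """Report a detection when the segment is close enough to the template.

    The threshold is a hypothetical value; in practice it would be tuned
    to trade off false alarms against missed detections.
    """
    return cosine_similarity(template, segment_embedding) >= threshold
```

Because each word is a single vector, detection costs one dot product and two norms per audio segment, which is the kind of low per-segment cost that suits always-on, on-device operation.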
Advisors
Kim, Hoirin (김회린)
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2020
Identifier
325007
Language
eng
Description

Doctoral dissertation (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering, 2020.8, [iv, 76 p.]

Keywords

Personalized wake-up word detection; robust acoustic word embedding; interlayer selective attention network; cross-informed domain adversarial training

URI
http://hdl.handle.net/10203/284434
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=924522&flag=dissertation
Appears in Collection
EE-Theses_Ph.D. (doctoral theses)
Files in This Item
There are no files associated with this item.
