Binaural sound event localization and detection neural network based on HRTF localization cues for humanoid robots

In order for a humanoid robot to recognize a situation through sound, it must simultaneously estimate the type and direction of surrounding sound events. In addition, to be applicable to hearing aids or to human-robot interaction technologies such as telepresence, the system must operate with a two-channel input, as humans do. However, with a horizontal two-channel input it is difficult to estimate the elevation of a sound event, and front-back confusion occurs when estimating the azimuth. To solve this problem, a binaural sound event localization and detection (BiSELD) neural network is proposed, which can simultaneously estimate the class and direction of each sound event by learning the time-frequency pattern and head-related transfer function (HRTF) localization cues of sound events from a binaural input feature. For training, HRTFs were measured by establishing clear standards for origin transfer function measurement and non-causality compensation, and a binaural dataset was constructed by synthesizing the measured HRTFs with collected sound event databases. In particular, based on an analysis of HRTF localization cues, the binaural time-frequency feature (BTFF) was proposed as the input feature for BiSELDnet. A BTFF consists of eight feature-map channels: left and right mel-spectrograms; left and right V-maps, showing the rate of change over time of each frequency component; an ITD-map, estimating the interaural time difference (ITD) below 1.5 kHz; an ILD-map, representing the interaural level difference (ILD) above 5 kHz, whose front-back asymmetry provides a clue for resolving front-back confusion; and left and right SC-maps, providing spectral cues (SC) above 5 kHz for the elevation estimation of sound events. The effectiveness of BTFF was confirmed by evaluating its detection and localization performance for sound events arriving from the omnidirectional, horizontal, and median planes. Using BTFF as the input feature, a variety of BiSELDnets were implemented that output a time series of direction vectors for each sound event class. The magnitude and direction of each vector represent the activity and direction of the corresponding sound event class, allowing simultaneous detection and localization of sound events. Among them, the BiSELDnet based on the Trinity module was selected, as it achieved the best performance with a small number of parameters. Built on depthwise separable convolution, which suits the low cross-channel correlation of BTFF, the Trinity module factorizes each of three concatenated kernels of size 3×3, 5×5, and 7×7 into stacks of 3×3 kernels, which allows feature maps of various receptive-field sizes to be extracted from the input feature with few parameters. In addition, vector activation map (VAM) visualization was proposed to show what BiSELDnet has learned and to check which parts of the input feature contribute to the final detection and localization decisions. Through VAM visualization, it was confirmed that BiSELDnet focuses on the N1 notch frequency when estimating the elevation of a sound event. Finally, the detection and localization performance of the BiSELD model and a state-of-the-art (SOTA) SELD model were compared for sound events in the horizontal or median plane under urban background noise at various signal-to-noise ratios. The comparison results demonstrate that the proposed BiSELD model outperforms the existing SOTA SELD model under binaural input conditions.
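The direction-vector output format described in the abstract, where each vector's magnitude encodes activity and its orientation encodes direction, can be sketched as follows. This is a minimal illustrative decoder, not code from the thesis: the (time, class, 3) output shape, the Cartesian axis convention (x front, y left, z up), and the 0.5 activity threshold are all assumptions for the sketch.

```python
import numpy as np

def decode_biseld_output(vectors, threshold=0.5):
    """Decode per-class direction vectors into (active, azimuth, elevation).

    vectors: array of shape (time, classes, 3) holding Cartesian direction
    vectors; the vector norm encodes event activity. The 0.5 threshold and
    axis convention are illustrative choices, not values from the thesis.
    """
    norm = np.linalg.norm(vectors, axis=-1)            # activity per (t, class)
    active = norm > threshold                          # detection decision
    safe = np.maximum(norm, 1e-9)[..., None]           # avoid divide-by-zero
    unit = vectors / safe                              # unit direction vectors
    azimuth = np.degrees(np.arctan2(unit[..., 1], unit[..., 0]))     # x front, y left
    elevation = np.degrees(np.arcsin(np.clip(unit[..., 2], -1, 1)))  # z up
    return active, azimuth, elevation

# One frame, one class: an event straight ahead in the horizontal plane.
v = np.array([[[0.9, 0.0, 0.0]]])
act, az, el = decode_biseld_output(v)  # active, azimuth 0 deg, elevation 0 deg
```

Encoding activity and direction in a single vector is what lets one output head serve both the detection and the localization task at once.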
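The kernel factorization in the Trinity module rests on a standard receptive-field identity: a stack of n 3×3 convolutions (stride 1, no dilation) covers the same (2n+1)×(2n+1) receptive field as one larger kernel, with fewer weights. The small calculation below illustrates that identity in plain Python; it is a sketch of the general principle, not the Trinity module's actual wiring.

```python
def stacked_3x3_receptive_field(n_layers):
    """Receptive field of n stacked 3x3 convolutions (stride 1, no dilation)."""
    rf = 1
    for _ in range(n_layers):
        rf += 2  # each 3x3 layer grows the receptive field by kernel_size - 1
    return rf

def params_single_vs_stacked(k, channels):
    """Per-channel weight counts (depthwise case) for one k x k kernel versus
    the stack of 3x3 kernels with the same receptive field."""
    n = (k - 1) // 2                     # number of 3x3 layers needed
    return k * k * channels, n * 3 * 3 * channels

# A 5x5 kernel factors into two 3x3 layers, a 7x7 into three.
assert stacked_3x3_receptive_field(2) == 5
assert stacked_3x3_receptive_field(3) == 7
```

Per channel, the stacked form needs 18 weights instead of 25 for the 5×5 case and 27 instead of 49 for the 7×7 case, which is why the module can extract multi-scale feature maps cheaply.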
Advisors
박용화
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2024
Identifier
325007
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology: Department of Mechanical Engineering, 2024.2, [xii, 179 p.]

Keywords

Humanoid robot; Binaural sound event localization and detection (BiSELD); Head-related transfer function (HRTF); Binaural time-frequency feature (BTFF); Trinity module; Depthwise separable convolution; Vector activation map (VAM)

URI
http://hdl.handle.net/10203/321931
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1097766&flag=dissertation
Appears in Collection
ME-Theses_Ph.D. (Doctoral theses)
Files in This Item
There are no files associated with this item.
