Improving speech emotion recognition by fusing self-supervised learning and spectral features via mixture of experts

Speech Emotion Recognition (SER) is an important area of research in speech processing that aims to identify and classify the emotional states conveyed through speech signals. Recent studies have reported considerable performance gains in SER by exploiting deep contextualized speech representations from self-supervised learning (SSL) models. However, SSL models pre-trained on clean speech data may not perform well on emotional speech data because of the domain shift problem. To address this problem, this paper proposes a novel approach that simultaneously exploits an SSL model and a domain-agnostic spectral feature (SF) through the Mixture of Experts (MoE) technique. The proposed approach achieves state-of-the-art weighted accuracy among the compared methods on the IEMOCAP dataset. Moreover, this paper demonstrates that the domain shift problem of SSL models exists in the SER task.
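The abstract does not specify the fusion architecture, so the following is only a minimal sketch of what a two-expert MoE fusion could look like, not the authors' method. It assumes one expert per feature type (SSL embedding and spectral feature), each producing class logits, and a softmax gate that weights the two experts per utterance. All names (`moe_fuse`, `w_ssl`, `w_sf`, `w_gate`), the toy dimensions, the four-class setup, and the random weights are hypothetical and chosen purely for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights):
    """Apply a bias-free linear layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def moe_fuse(ssl_emb, sf_emb, w_ssl, w_sf, w_gate):
    """Fuse two utterance-level feature vectors with a two-expert softmax gate.

    Assumption: each expert is a single linear layer, and the gate sees the
    concatenation of both feature vectors. A real system may differ.
    """
    logits_ssl = linear(ssl_emb, w_ssl)  # expert 1: SSL representation
    logits_sf = linear(sf_emb, w_sf)     # expert 2: spectral feature
    gate = softmax(linear(ssl_emb + sf_emb, w_gate))  # one weight per expert
    fused = [gate[0] * a + gate[1] * b for a, b in zip(logits_ssl, logits_sf)]
    return fused, gate


def random_matrix(rng, rows, cols):
    """Small random weights standing in for trained parameters (hypothetical)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_SSL, D_SF, N_CLASSES = 8, 5, 4  # toy sizes, not taken from the paper

ssl_emb = [rng.gauss(0.0, 1.0) for _ in range(D_SSL)]
sf_emb = [rng.gauss(0.0, 1.0) for _ in range(D_SF)]
w_ssl = random_matrix(rng, N_CLASSES, D_SSL)
w_sf = random_matrix(rng, N_CLASSES, D_SF)
w_gate = random_matrix(rng, 2, D_SSL + D_SF)

fused, gate = moe_fuse(ssl_emb, sf_emb, w_ssl, w_sf, w_gate)
probs = softmax(fused)  # class probabilities for one utterance
```

In this sketch the gate can shift weight toward the spectral expert when the SSL representation is less reliable for a given input, which is one plausible way to read the motivation stated in the abstract.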
Publisher
ELSEVIER
Issue Date
2024-03
Language
English
Article Type
Article
Citation

DATA & KNOWLEDGE ENGINEERING, v.150

ISSN
0169-023X
DOI
10.1016/j.datak.2023.102262
URI
http://hdl.handle.net/10203/320087
Appears in Collection
CS-Journal Papers(저널논문)