CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition

Cited 16 times in Web of Science; cited 0 times in Scopus
Visual Speech Recognition (VSR) is the task of recognizing speech as text from the external appearance of the face (i.e., the lips). Because visual lip movements alone do not fully represent the underlying speech, VSR is considered a challenging problem. One way to address this is to additionally exploit audio, which carries rich information for speech recognition. However, audio is not always available, for example at long distances or in crowded environments. It is therefore necessary to provide sufficient information for speech recognition from visual inputs alone. In this paper, we alleviate the information insufficiency of visual lip movements by proposing a cross-modal memory augmented VSR framework with a Visual-Audio Memory (VAM). The proposed framework exploits complementary audio information even when audio inputs are not provided at inference time. Concretely, the VAM learns to imprint short clip-level audio features into a memory network using the corresponding visual features. To this end, the VAM contains two memories: a lip-video key memory and an audio value memory. The audio value memory is guided to imprint the audio features, and the lip-video key memory is guided to memorize the locations of the imprinted audio. The VAM can thus retrieve rich audio information by accessing the memory with visual inputs only, so the proposed VSR framework can refine its predictions with the imprinted audio information at inference time, when no audio inputs are provided. We validate the proposed method on popular benchmark databases: LRW, LRW-1000, GRID, and LRS2. Experimental results show that the proposed method achieves state-of-the-art performance on both word- and sentence-level visual speech recognition. In addition, by examining and visualizing the learned representations inside the VAM, we verify that they contain meaningful information for VSR.
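The key/value memory read described in the abstract can be understood as a soft attention lookup: a visual query is matched against the lip-video key slots, and the resulting weights blend the imprinted audio value slots into a reconstructed audio feature. The following is a minimal, dependency-free sketch of that idea, not the authors' implementation; the function names, the cosine-similarity addressing, and the toy two-slot memory are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def read_memory(visual_query, key_mem, value_mem):
    """Soft key-value memory read: address the lip-video key slots with
    a visual feature, then return the weighted sum of audio value slots."""
    weights = softmax([cosine(visual_query, k) for k in key_mem])
    dim = len(value_mem[0])
    return [sum(w * v[d] for w, v in zip(weights, value_mem)) for d in range(dim)]

# Toy memory: two key slots and their imprinted audio values (illustrative).
key_mem = [[1.0, 0.0], [0.0, 1.0]]
value_mem = [[10.0, 0.0], [0.0, 10.0]]

# A visual query aligned with the first key retrieves mostly the first audio value.
audio_feat = read_memory([1.0, 0.0], key_mem, value_mem)
```

At training time the value memory would be guided toward the true audio features and the key memory toward the matching visual features; at inference, only `read_memory` with the visual query is needed, which is how the framework uses audio knowledge without audio input.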
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Issue Date
2022-10
Language
English
Article Type
Article
Citation

IEEE TRANSACTIONS ON MULTIMEDIA, v.24, pp.4342 - 4355

ISSN
1520-9210
DOI
10.1109/TMM.2021.3115626
URI
http://hdl.handle.net/10203/300263
Appears in Collection
EE-Journal Papers (Journal Papers)
Files in This Item
There are no files associated with this item.
