Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Cited 26 times in Web of Science; cited 0 times in Scopus
  • Hits: 431
  • Downloads: 367
DC Field | Value | Language
dc.contributor.author | Senocak, Arda | ko
dc.contributor.author | Oh, Tae-Hyun | ko
dc.contributor.author | Kim, Junsik | ko
dc.contributor.author | Yang, Ming-Hsuan | ko
dc.contributor.author | Kweon, In-So | ko
dc.date.accessioned | 2021-04-19T01:50:07Z | -
dc.date.available | 2021-04-19T01:50:07Z | -
dc.date.created | 2019-11-26 | -
dc.date.issued | 2021-05 | -
dc.identifier.citation | IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, v.43, no.5, pp.1605 - 1619 | -
dc.identifier.issn | 0162-8828 | -
dc.identifier.uri | http://hdl.handle.net/10203/282411 | -
dc.description.abstract | Visual events in our daily lives are usually accompanied by sounds. But can machines learn to correlate a visual scene with its sound, and to localize the sound source, merely by observing them as humans do? To investigate this empirical learnability, we first present a novel unsupervised algorithm for localizing sound sources in visual scenes. To this end, we develop a two-stream network, one stream per modality, with an attention mechanism for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, we develop a new sound-source dataset for performance evaluation. Our empirical evaluation shows, however, that the unsupervised method draws false conclusions in some cases, and that these cannot be fixed without human prior knowledge, owing to the well-known mismatch between correlation and causality. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., in a semi-supervised setup. Finally, we demonstrate the versatility of the learned audio and visual embeddings on cross-modal content alignment, and we incorporate the proposed algorithm into sound-saliency-based automatic camera-view panning in 360° videos. | -
dc.language | English | -
dc.publisher | IEEE COMPUTER SOC | -
dc.title | Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications | -
dc.type | Article | -
dc.identifier.wosid | 000637533800009 | -
dc.identifier.scopusid | 2-s2.0-85103800881 | -
dc.type.rims | ART | -
dc.citation.volume | 43 | -
dc.citation.issue | 5 | -
dc.citation.beginningpage | 1605 | -
dc.citation.endingpage | 1619 | -
dc.citation.publicationname | IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | -
dc.identifier.doi | 10.1109/TPAMI.2019.2952095 | -
dc.contributor.localauthor | Kweon, In-So | -
dc.contributor.nonIdAuthor | Oh, Tae-Hyun | -
dc.contributor.nonIdAuthor | Yang, Ming-Hsuan | -
dc.description.isOpenAccess | Y | -
dc.type.journalArticle | Article | -
dc.subject.keywordAuthor | Visualization | -
dc.subject.keywordAuthor | Videos | -
dc.subject.keywordAuthor | Task analysis | -
dc.subject.keywordAuthor | Correlation | -
dc.subject.keywordAuthor | Deep learning | -
dc.subject.keywordAuthor | Network architecture | -
dc.subject.keywordAuthor | Unsupervised learning | -
dc.subject.keywordAuthor | Audio-visual learning | -
dc.subject.keywordAuthor | sound localization | -
dc.subject.keywordAuthor | self-supervision | -
dc.subject.keywordAuthor | multi-modal learning | -
dc.subject.keywordAuthor | cross-modal retrieval | -
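
The abstract above describes a two-stream network, one stream per modality, whose attention mechanism produces the sound-source localization map without human annotation. As a rough illustration of that idea, here is a minimal PyTorch sketch; the backbones, layer sizes, and cosine-similarity attention are assumptions made for this example, not the authors' exact architecture, so the paper itself should be consulted for the real design.

    # A minimal sketch of a two-stream audio-visual attention model in the
    # spirit of the abstract. All layer sizes and the cosine-similarity
    # attention are illustrative assumptions, not the published architecture.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoStreamLocalizer(nn.Module):
        def __init__(self, embed_dim: int = 512):
            super().__init__()
            # Visual stream: a small conv net standing in for a pretrained
            # backbone, producing a spatial feature map of shape (B, D, H, W).
            self.visual = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, embed_dim, 3, stride=2, padding=1),
            )
            # Audio stream: 1-D convolutions over a raw waveform, pooled to a
            # single embedding of shape (B, D).
            self.audio = nn.Sequential(
                nn.Conv1d(1, 64, 65, stride=4, padding=32), nn.ReLU(),
                nn.Conv1d(64, 128, 33, stride=4, padding=16), nn.ReLU(),
                nn.Conv1d(128, embed_dim, 17, stride=4, padding=8),
                nn.AdaptiveAvgPool1d(1),
            )

        def forward(self, image: torch.Tensor, sound: torch.Tensor):
            v = self.visual(image)             # (B, D, H, W)
            a = self.audio(sound).squeeze(-1)  # (B, D)
            # Attention: cosine similarity between the audio embedding and
            # each spatial location, softmax-normalized into a map over H x W.
            v_norm = F.normalize(v, dim=1)
            a_norm = F.normalize(a, dim=1)
            sim = torch.einsum('bdhw,bd->bhw', v_norm, a_norm)   # (B, H, W)
            attn = F.softmax(sim.flatten(1), dim=1).view_as(sim)
            # Attention-weighted visual vector, usable for audio-visual
            # correspondence training.
            z = torch.einsum('bdhw,bhw->bd', v, attn)
            return attn, z, a

    if __name__ == '__main__':
        model = TwoStreamLocalizer()
        img = torch.randn(2, 3, 224, 224)   # a batch of video frames
        wav = torch.randn(2, 1, 48000)      # the paired audio waveforms
        attn_map, visual_vec, audio_vec = model(img, wav)
        print(attn_map.shape)               # torch.Size([2, 28, 28])

In the unsupervised setting the abstract describes, a model of this shape would typically be trained with a correspondence or contrastive objective that pulls the attended visual vector and the audio embedding together for matching video-audio pairs and pushes them apart for mismatched pairs; the attention map then serves directly as the localization output, and a small amount of labeled data can be added on top for the semi-supervised setup.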
Appears in Collection
EE-Journal Papers (Journal Papers)
Files in This Item
111887.pdf (11.95 MB)