Text-guided distillation learning to diversify video embeddings for text-video retrieval

Conventional text-video retrieval methods typically match a video with a text in a one-to-one manner. However, a single video can contain diverse semantics, and text descriptions can vary significantly. Therefore, such methods fail to match a video with multiple texts simultaneously. In this paper, we propose a novel approach to tackle this one-to-many correspondence problem in text-video retrieval. We devise diverse temporal aggregation and a multi-key memory to address temporal and semantic diversity, thereby constructing multiple video embedding paths from a single video. Additionally, we introduce text-guided distillation learning that enables each video path to acquire meaningful, distinct competencies in representing varied semantics. Our video embedding approach is text-agnostic, so the prepared video embeddings can be reused for any new text query. Experiments show that our method outperforms existing methods on four datasets. We further validate the effectiveness of our designs with ablation studies and analyses on diverse video embeddings.
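
The one-to-many matching described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each video has already been encoded offline into several embeddings (one per path), and that a video is scored by the best cosine similarity between the text query and any of its embeddings. The max-over-paths rule, the names `cosine`, `video_score`, and `rank_videos`, and the toy 2-D vectors are all hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def video_score(text_emb, video_embs):
    # Assumption: a video is scored by its best-matching embedding path.
    return max(cosine(text_emb, emb) for emb in video_embs)

def rank_videos(text_emb, index):
    """Return video ids sorted from best to worst match for the query."""
    scores = {vid: video_score(text_emb, embs) for vid, embs in index.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy index: video embeddings are computed once, independently of any text
# (text-agnostic), and reused for every incoming query.
index = {
    "video_a": [[1.0, 0.0], [0.0, 1.0]],     # two paths, two distinct semantics
    "video_b": [[-1.0, 0.0], [-0.6, -0.8]],
}

query_1 = [1.0, 0.0]   # matches the first path of video_a
query_2 = [0.0, 2.0]   # matches the second path of video_a
print(rank_videos(query_1, index))
print(rank_videos(query_2, index))
```

In this toy setup, two different queries both retrieve `video_a`, each through a different embedding path, which is the behavior a single embedding per video cannot provide.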
Publisher
ELSEVIER SCI LTD
Issue Date
2024-12
Language
English
Article Type
Article
Citation

PATTERN RECOGNITION, v.156

ISSN
0031-3203
DOI
10.1016/j.patcog.2024.110754
URI
http://hdl.handle.net/10203/321168
Appears in Collection
EE-Journal Papers (journal papers)