DSpace at KOASAS: Phrase-frames alignment network with contrastive attention loss for video description

College of Engineering > School of Electrical Engineering > EE-Theses_Master (Master's Theses)

Phrase-frames alignment network with contrastive attention loss for video description
(Korean title, translated: Video description through semantics-centered phrase-frame alignment and contrastive attention loss)

Ryu, Hobin

This paper considers a video caption generating network referred to as the Phrase-Frames Alignment Network (PFAN), which addresses the information redundancy of successively sampled frames, a problem prevalent in most video captioning algorithms. Because consecutive sampled frames are unlikely to provide unique information, prior methods have focused on encoding a compact video representation from the input video, for example by using a hierarchical encoder or by learning to sample informative frames. The PFAN attempts to compactly encode the input video using not only the visual features of the frames but also the semantics of the partially decoded caption. The PFAN (1) forms semantic groups by aligning each video frame feature with discriminating word phrases of the partially decoded caption and then (2) decodes the semantic groups to predict the next word of the partially decoded caption. In contrast to prior methods, the continuous feedback from decoded words enables the PFAN to dynamically update the video representation so that it adapts to the partially decoded caption. Furthermore, a contrastive attention loss is proposed to facilitate accurate alignment between word phrases and video frame features without requiring any manual annotations. The PFAN achieves state-of-the-art performance, outperforming runner-up methods by margins of 2.1% and 2.4% in CIDEr-D score on the MSVD and MSR-VTT datasets, respectively. Extensive experiments are conducted to demonstrate the effectiveness and interpretability of the PFAN.
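
The abstract does not give the PFAN equations, so the following is only a minimal illustrative sketch of the two ideas it names: soft alignment of frame features to phrase embeddings, and an annotation-free contrastive objective on the attention weights. Everything here is an assumption made for illustration and not the thesis's actual formulation: the function names (`align_frames_to_phrases`, `contrastive_attention_loss`), the scaled dot-product scoring, and the choice of treating frames from an unrelated video as negatives.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_frames_to_phrases(frames, phrases):
    """Hypothetical alignment step: each phrase attends over all frames.

    Returns (attention, groups), where attention[i] holds the weights of
    phrase i over the frames and groups[i] is the attention-weighted frame
    feature, standing in for a "semantic group".
    """
    dim = len(frames[0])
    attention, groups = [], []
    for p in phrases:
        weights = softmax([dot(p, f) / math.sqrt(len(p)) for f in frames])
        group = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        attention.append(weights)
        groups.append(group)
    return attention, groups


def contrastive_attention_loss(phrase, pos_frames, neg_frames):
    """Hypothetical contrastive loss (assumed form, not from the thesis).

    The phrase attends over frames of its own video (positives) and frames
    of an unrelated video (negatives); the loss is the negative log of the
    attention mass that lands on the positives.
    """
    scores = [dot(phrase, f) / math.sqrt(len(phrase)) for f in pos_frames + neg_frames]
    weights = softmax(scores)
    pos_mass = sum(weights[: len(pos_frames)])
    return -math.log(pos_mass)


# Toy 2-D features: the phrase points along the first axis.
phrase = [1.0, 0.0]
own_video = [[1.0, 0.0], [0.9, 0.1]]
other_video = [[0.0, 1.0]]

attention, groups = align_frames_to_phrases(own_video + other_video, [phrase])
loss_matched = contrastive_attention_loss(phrase, own_video, other_video)
loss_mismatched = contrastive_attention_loss(phrase, other_video, own_video)
```

In this toy setup the loss is lower when the phrase's attention concentrates on frames from its own video, which is the behavior a contrastive attention objective is meant to encourage. No frame-level labels are needed because the negatives come from a different video.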

Advisors: Yoo, Chang Dong (유창동)

Description: Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering

Publisher: Korea Advanced Institute of Science and Technology (KAIST)

Issue Date: 2020

Identifier: 325007

Language: eng

Description: Thesis (Master's) - Korea Advanced Institute of Science and Technology: School of Electrical Engineering, 2020.8, [iv, 25 p.]

Keywords: Deep Learning; Computer Vision; Video Captioning; Multi-Modal Alignment; Contrastive Attention Mechanism

URI: http://hdl.handle.net/10203/285058

Link: http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=925222&flag=dissertation

Appears in Collection: EE-Theses_Master (Master's Theses)

Files in This Item: There are no files associated with this item.

Knowledge Service Development Team, KAIST 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. T. 82-42-350-4493 Email. koasas@kaist.ac.kr
Copyright © 2016. Korea Advanced Institute of Science and Technology. All Rights Reserved.