Simple but effective attention calibration for CLIP-guided diffusion models

While the Contrastive Language-Image Pre-training (CLIP) model has significantly advanced text-to-image generation, we uncover two notable issues in its application to diffusion models, particularly in the use of local embeddings. First, the model disproportionately focuses on word embeddings that carry less information from the input prompt. Second, local embeddings disrupt the image geometry established by global embeddings at the initial timesteps, risking misalignment with the original prompt. To mitigate these issues, we introduce two adjustments to cross-attention: sequence-dependent and time-dependent attention calibration. Our method employs simple numerical operations, for which we provide the values, ensuring easy implementation. In the sequence-dependent attention calibration, constants are added to the logits in the cross-attention layer to counterbalance the diminishing attention across the word sequence. The time-dependent attention adjustment enhances the attention towards global embeddings in the initial stages, facilitating better geometry formation. Our experiments on various datasets show that this simple method significantly improves the performance of Stable Diffusion, yielding images that more accurately depict the input prompts.
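The two calibrations described in the abstract can be sketched as plain operations on the cross-attention logits. This is a minimal illustration, not the thesis's implementation: the bias schedule (`seq_bias_scale`), the linear decay of the global boost, and the assumption that the global embedding sits at token index 0 are all placeholders; the thesis provides the actual constants.

```python
import numpy as np

def calibrate_attention_logits(logits, t, t_max=50,
                               seq_bias_scale=0.1, global_boost=1.0,
                               global_token_idx=0):
    """Hypothetical sketch of the two cross-attention calibrations.

    logits: array of shape (num_queries, seq_len) — raw cross-attention
            logits between image queries and text-token keys.
    t:      current denoising step (0 = start of sampling).
    """
    num_q, seq_len = logits.shape
    out = logits.astype(float).copy()

    # Sequence-dependent calibration: add constants that grow with token
    # position to counterbalance the attention that diminishes across the
    # word sequence (the linear schedule here is an assumption).
    out += seq_bias_scale * np.arange(seq_len)

    # Time-dependent calibration: boost attention toward the global
    # embedding early in sampling, decaying to zero by t_max so local
    # embeddings dominate once the geometry is formed.
    decay = max(0.0, 1.0 - t / t_max)
    out[:, global_token_idx] += global_boost * decay
    return out

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)
```

Because both adjustments are additive constants applied before the softmax, they slot into any cross-attention layer without retraining, which is what makes the method easy to implement.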
Advisors
김창익
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering
Publisher
KAIST (Korea Advanced Institute of Science and Technology)
Issue Date
2024
Identifier
325007
Language
eng
Description

Thesis (Master's) - KAIST: School of Electrical Engineering, 2024.2, [vi, 31 p.]

Keywords

CLIP; Diffusion; Cross-attention

URI
http://hdl.handle.net/10203/321607
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1097179&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's theses)
Files in This Item
There are no files associated with this item.
