Recently, many studies have constructed multimodal dialogue datasets containing image-sharing behavior, which is vital for building social bonds with interlocutors in open-domain conversation. In this paper, we report empirical results on whether CLIP can understand the alignment between dialogue history and images, conducting experiments on (1) zero-shot transferability, (2) the effect of dialogue history, and (3) robustness. Our experiments demonstrate that there is still room for improving the zero-shot performance of CLIP on multimodal dialogue datasets. In addition, CLIP benefits from more informative text (i.e., the full dialogue history) rather than the last utterance alone.
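The central measurement in these experiments is CLIP's image-text similarity score, computed either against the full dialogue history or against the last utterance alone. Below is a minimal sketch of that comparison, assuming the HuggingFace transformers CLIP API; the dialogue, image path, and checkpoint name are illustrative placeholders, not data or settings from the paper.

```python
# Sketch: scoring dialogue-image alignment with CLIP (zero-shot),
# comparing full dialogue history vs. last utterance as the text query.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative dialogue ending in an image-sharing turn.
dialogue = [
    "A: We finally adopted a puppy last weekend!",
    "B: That's wonderful. What breed is it?",
    "A: A golden retriever. Let me show you a picture.",
]
full_history = " ".join(dialogue)   # condition (2): more informative text
last_utterance = dialogue[-1]       # baseline: last utterance only

image = Image.open("shared_photo.jpg")  # placeholder for the shared image

inputs = processor(
    text=[full_history, last_utterance],
    images=image,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarity between the image
# and each candidate text; higher means better alignment.
scores = outputs.logits_per_image.squeeze(0)
print(f"history: {scores[0].item():.2f}, last utterance: {scores[1].item():.2f}")
```

Note that CLIP's text encoder accepts at most 77 tokens, so a long dialogue history must be truncated (hence `truncation=True` above); how much history fits into this budget is a practical constraint on the history-vs-utterance comparison.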