Image captioning is the task of generating a textual description of a given image, requiring techniques from both computer vision and natural language processing. Recent models have applied deep learning to this task and achieved notable performance improvements. However, these models can neither distinguish the objects that are more important than others in a given image, nor explain why specific words were selected when generating a caption. To overcome these limitations, this paper proposes an explainable image captioning model that generates a caption by attending to specific objects in a given image and provides visual explanations based on them. The model has been evaluated on the MSCOCO, Flickr8K, and Flickr30K datasets, and qualitative results are presented to show its effectiveness.