Deep learning-based recognition systems have achieved high performance on a variety of tasks. Most of them rely on a single modality, typically camera input alone, and are therefore vulnerable to look-alike fraud inputs. Such inputs are especially likely to be abused in systems that reward users, such as reverse vending machines. Jointly using multi-modal inputs can mitigate this problem, since each modality carries different information about the target task. In this work, we propose a deep neural network that uses multi-modal inputs together with an attention mechanism and a correspondence learning scheme. The attention mechanism lets the network learn better feature representations across multiple modalities; the correspondence learning scheme lets it learn intermodal relationships and thereby detect fraud inputs whose modalities do not correspond to one another. We evaluate the proposed approach in a reverse vending machine system, where the task is to classify inputs into three classes (cans, PET bottles, and glass bottles) and to reject any suspicious input. Three modalities (image, ultrasound, and weight) are used. Our results show that the proposed model effectively learns to detect fraud inputs while maintaining high accuracy on the classification task.
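
As a rough illustration of the two ideas named above, the following sketch fuses per-modality feature vectors with softmax attention weights and rejects an input when the modalities disagree. It is not the proposed network: the feature vectors, the attention logits, the class prototypes, the cosine-similarity correspondence score, and the threshold are all hypothetical stand-ins chosen for illustration, whereas a trained model would learn these quantities.

```python
import math
from itertools import combinations

CLASSES = ["can", "PET bottle", "glass bottle"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, attn_logits):
    """Attention-weighted sum of per-modality feature vectors.

    `features` maps modality name -> feature vector (same length each).
    `attn_logits` maps modality name -> scalar logit (hypothetical; a real
    model would compute these from the features themselves).
    """
    names = sorted(features)
    weights = softmax([attn_logits[n] for n in names])
    dim = len(features[names[0]])
    return [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def correspondence_score(features):
    """Lowest pairwise similarity between modalities.

    Assumption: cosine similarity stands in for a learned correspondence
    head that would be trained on matched and mismatched modality pairs.
    """
    names = sorted(features)
    return min(
        cosine(features[a], features[b]) for a, b in combinations(names, 2)
    )


def predict(features, attn_logits, prototypes, threshold=0.5):
    """Reject when modalities disagree; otherwise classify the fused vector."""
    if correspondence_score(features) < threshold:
        return "reject"
    fused = attention_fuse(features, attn_logits)
    sims = [cosine(fused, p) for p in prototypes]
    return CLASSES[sims.index(max(sims))]
```

With toy inputs, calling `predict` on three mutually similar vectors for image, ultrasound, and weight returns one of the three class names, while replacing the ultrasound vector with one orthogonal to the image vector drives the correspondence score below the threshold and returns `"reject"`. Checking correspondence before classification reflects the intuition that a look-alike object may fool the image branch yet still produce inconsistent ultrasound or weight readings.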