In recent years, research on first-person (egocentric) images has become increasingly important in computer vision, driven by the development of wearable cameras and the growing interest in life logging. However, first-person images are difficult to analyze because the user's hands appear in widely varying ways and camera motion is mixed into the scene motion. As a general approach, Convolutional Neural Network (CNN) based learning methods are widely used for vision tasks such as classification and recognition because they learn rich latent representations of an image. However, for vision tasks involving video data, CNN-based models have the disadvantage that they struggle to capture long-term dependencies across a sequence. To overcome this limitation, we propose a deep network combining a CNN and an LSTM (Long Short-Term Memory) for action recognition in first-person video data. Our model rests on two main ideas. First, object information and motion information are each learned through a convolutional network split into two streams. Second, the latent features obtained from each stream are fed to an LSTM, which learns their temporal dependencies under a multi-task learning objective. We evaluated our model on the GTEA dataset and compared its performance with previous studies.
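The two-stream fusion and temporal modeling described above can be sketched as follows. This is a minimal illustration only: the feature dimensions, class counts, and randomly initialized weights are hypothetical stand-ins, not the paper's actual configuration, and a plain NumPy LSTM cell stands in for the trained model.

```python
# Sketch: per-frame appearance (object) and motion features from the two
# CNN streams are concatenated and passed through an LSTM; the final
# hidden state feeds two task heads (multi-task learning).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """A single LSTM cell with the four gates stacked in one weight matrix."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim
        self.W = rng.standard_normal((4 * hidden_dim, input_dim + hidden_dim)) * 0.1
        self.b = np.zeros(4 * hidden_dim)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g      # update cell state
        h = o * np.tanh(c)     # new hidden state
        return h, c

def two_stream_lstm(obj_feats, mot_feats, cell, heads):
    """Fuse per-frame object/motion features, run the LSTM, apply task heads."""
    h = np.zeros(cell.hidden_dim)
    c = np.zeros(cell.hidden_dim)
    for x_obj, x_mot in zip(obj_feats, mot_feats):
        h, c = cell.step(np.concatenate([x_obj, x_mot]), h, c)
    return [W @ h for W in heads]  # one logit vector per task

# Toy run: 8 frames, 64-dim features per stream, 32 hidden units, and
# two task heads (10 and 15 classes -- made-up numbers for illustration).
T, D, H = 8, 64, 32
rng = np.random.default_rng(1)
obj = rng.standard_normal((T, D))
mot = rng.standard_normal((T, D))
cell = LSTMCell(input_dim=2 * D, hidden_dim=H)
heads = [rng.standard_normal((10, H)), rng.standard_normal((15, H))]
verb_logits, noun_logits = two_stream_lstm(obj, mot, cell, heads)
```

In practice each stream's features would come from a trained CNN (e.g. RGB frames for the object stream and optical flow for the motion stream), and the heads would be trained jointly under the multi-task objective.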