Long short-term memory (LSTM) is a type of recurrent neural network that is efficient at encoding spatio-temporal features in dynamic sequences. Recent work has shown that the LSTM retains information related to the mode of variation in the input dynamic sequence, which reduces the discriminability of the encoded features. To encode features that are robust to unseen modes of variation, we devise an LSTM adaptation named attentive mode variational LSTM. The proposed model uses attention to separate the input dynamic sequence into two parts: (1) task-relevant dynamic sequence features and (2) task-irrelevant static sequence features. The task-relevant dynamic features are used to encode and emphasize the dynamics in the input sequence, while the task-irrelevant static features are used to encode the mode of variation in the input sequence. Finally, the attentive mode variational LSTM suppresses the effect of mode variation with a shared output gate, yielding a spatio-temporal feature that is robust to unseen variations. We verify the effectiveness of the proposed method on two tasks: facial expression recognition and human action recognition. Extensive experiments show that the proposed method encodes spatio-temporal features that are robust to variations unseen during training.
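
As a rough illustration of the two ideas named above (an attention-based split of each input frame and a single output gate shared by both streams), the following minimal sketch may help. It is not the paper's formulation: the function names `attentive_split` and `shared_gate`, the per-feature sigmoid mask, and the way the gate is applied are all assumptions made for illustration, and the attention and gate logits are passed in by hand rather than learned. How the model actually combines the two streams to suppress mode variation is not specified in this abstract and is left out.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def attentive_split(x, scores):
    """Split one frame's features into dynamic and static parts.

    `scores` are hypothetical per-feature attention logits; a trained
    model would produce them. The soft mask a routes a * x to the
    dynamic stream and (1 - a) * x to the static stream, so the two
    parts always sum back to the original features.
    """
    mask = [sigmoid(s) for s in scores]
    dynamic = [a * v for a, v in zip(mask, x)]
    static = [(1.0 - a) * v for a, v in zip(mask, x)]
    return dynamic, static


def shared_gate(c_dynamic, c_static, gate_logits):
    """Apply the same output gate to both cell states (assumed form).

    Mirrors the standard LSTM output step h = o * tanh(c), but reuses
    one gate vector o for the dynamic and the static cell state.
    """
    gate = [sigmoid(g) for g in gate_logits]
    h_dynamic = [o * math.tanh(c) for o, c in zip(gate, c_dynamic)]
    h_static = [o * math.tanh(c) for o, c in zip(gate, c_static)]
    return h_dynamic, h_static


if __name__ == "__main__":
    frame = [2.0, -4.0, 0.5]
    dyn, sta = attentive_split(frame, [3.0, -3.0, 0.0])
    print("dynamic:", dyn)
    print("static: ", sta)
    print("gated:  ", shared_gate(dyn, sta, [0.0, 0.0, 0.0]))
```

Under these assumptions, a large positive attention logit sends most of a feature to the dynamic stream and a large negative one sends it to the static stream. Because the mask is soft, no information is discarded at the split.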