This paper presents a 3D convolutional neural network (CNN) for action recognition that learns which spatial and temporal regions are most important through bidirectional long short-term memory (bi-LSTM) attention. First- and second-order differences of the spatially most relevant C3D features (sp-m-C3D) are computed and concatenated with the sp-m-C3D features, and a bi-LSTM uses this concatenation to generate temporal attention over the sp-m-C3D features. The temporally most relevant sp-m-C3D features are then fed into a second bi-LSTM for action recognition. We evaluate the network on two public action recognition datasets: UCF-101 (YouTube Action) and HMDB51. The proposed network outperforms other state-of-the-art networks.
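The difference-and-concatenation step that feeds the temporal attention can be sketched in a few lines. This is only an illustrative sketch under several assumptions that the abstract does not confirm: differences are taken along the time axis with zero padding at the first steps, attention weights come from a softmax over per-step scores, and a simple linear scorer (the hypothetical `scorer_weights`) stands in for the bi-LSTM. All function names are hypothetical.

```python
import math


def temporal_differences(features):
    """Return first- and second-order differences of a T x D feature sequence.

    Assumption: differences run along the time axis and the leading steps
    are zero-padded so both outputs keep T time steps.
    """
    T, D = len(features), len(features[0])
    first = [[0.0] * D for _ in range(min(1, T))]
    first += [
        [features[t][d] - features[t - 1][d] for d in range(D)]
        for t in range(1, T)
    ]
    second = [[0.0] * D for _ in range(min(2, T))]
    second += [
        [features[t][d] - 2.0 * features[t - 1][d] + features[t - 2][d]
         for d in range(D)]
        for t in range(2, T)
    ]
    return first, second


def concat_inputs(features):
    """Concatenate each feature vector with its two differences (T x 3D)."""
    first, second = temporal_differences(features)
    return [f + d1 + d2 for f, d1, d2 in zip(features, first, second)]


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(features, scorer_weights):
    """Weight each time step of `features` by a softmax attention score.

    Assumption: a linear scorer over the concatenated input replaces the
    bi-LSTM described in the text; it only illustrates the data flow.
    """
    inputs = concat_inputs(features)
    scores = [sum(w * x for w, x in zip(scorer_weights, row)) for row in inputs]
    alphas = softmax(scores)
    attended = [[a * v for v in f] for a, f in zip(alphas, features)]
    return alphas, attended


# Toy sequence: 3 time steps, 2 feature dimensions (made-up values).
features = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
alphas, attended = temporal_attention(features, [0.0] * 6)
```

In a fuller version, the linear scorer would be replaced by a bi-LSTM that reads the concatenated sequence in both directions, and the attended features would be passed to the second bi-LSTM for classification.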