In this article, we introduce a new benchmark dataset for the challenging writing-in-the-air (WiTA) task: an elaborate task bridging computer vision and natural language processing (NLP). WiTA implements an intuitive and natural writing method based on finger movement for human-computer interaction (HCI). Our WiTA dataset will facilitate the development of data-driven WiTA systems, which have thus far displayed unsatisfactory performance owing to the lack of datasets as well as the traditional statistical models they have adopted. Our dataset consists of five subdatasets in two languages (Korean and English) and amounts to 209,926 video instances from 122 participants. We capture finger movement for WiTA with red-green-blue (RGB) cameras to ensure wide accessibility and cost-efficiency. Next, we propose spatio-temporal residual network architectures inspired by 3-D ResNet. These models perform unconstrained text recognition from finger movement, guarantee real-time operation (over 100 frames per second (FPS)), and will serve as an evaluation standard.
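To make the idea of a spatio-temporal residual network concrete, the sketch below shows the core building block: a 3-D convolution over a video clip (time, height, width) with a skip connection, in the spirit of 3-D ResNet. This is a minimal illustrative toy in NumPy, not the paper's actual architecture; the single-channel layout, kernel sizes, and weight initialization are all assumptions made for clarity.

```python
import numpy as np


def conv3d(x, w):
    """Naive 'same'-padded single-channel 3-D convolution.

    x: (T, H, W) video volume; w: (kt, kh, kw) kernel.
    A real model would use an optimized multi-channel conv layer.
    """
    kt, kh, kw = w.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    xp = np.pad(x, ((pt, pt), (ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=float)
    T, H, W = x.shape
    for t in range(T):
        for i in range(H):
            for j in range(W):
                out[t, i, j] = np.sum(xp[t:t + kt, i:i + kh, j:j + kw] * w)
    return out


def residual_block(x, w1, w2):
    """Spatio-temporal residual block: two 3-D convs plus an identity skip."""
    h = np.maximum(conv3d(x, w1), 0.0)          # ReLU after first conv
    return np.maximum(conv3d(h, w2) + x, 0.0)   # add skip connection, ReLU


# Toy clip: 8 frames of 16x16 pixels (hypothetical sizes for illustration).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w1 = rng.standard_normal((3, 3, 3)) * 0.1
w2 = rng.standard_normal((3, 3, 3)) * 0.1
y = residual_block(x, w1, w2)
print(y.shape)  # (8, 16, 16)
```

The 3-D kernels convolve across frames as well as pixels, which is what lets such models read a character from the temporal trajectory of the fingertip rather than from any single frame.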