An always-on, video-based human action recognition (HAR) system-on-chip (SoC) integrated with a CMOS image sensor (CIS) is proposed for Internet of Things (IoT) devices. The proposed SoC is the first always-on integrated circuit (IC) to perform the full HAR process on a single chip. To address the large power consumption of the vision sensor and of compute-intensive DNN operations, the proposed SoC operates in two modes: 1) In adaptive frame resolution based human action recognition (AFR-HAR) mode, a CIS resolution prediction algorithm and a self-adjustable CIS reduce readout power by 42.9-91.8% by adaptively adjusting the frame resolution. 2) In motion event detection (MED) mode, the motion event detection unit (MEDU) monitors motion events and skips unnecessary imaging and DNN computation, yielding a power saving of over 99%. The proposed HAR SoC is simulated in 65-nm CMOS technology and occupies 8.56 mm². It consumes only 0.82 μW when no motion is detected and 0.31-8.52 mW when evaluating human actions on the ActivityNet dataset.
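As a rough consistency check on the figures above, the following Python sketch compares the quoted MED-mode power with the quoted AFR-HAR power range and estimates a time-averaged power. The function names, the linear duty-cycle model, and the 5% motion duty cycle are illustrative assumptions and do not come from the proposed design; mode-transition overhead is ignored.

```python
# Power figures quoted in the abstract, converted to microwatts.
P_MED_UW = 0.82          # MED mode, no motion detected (0.82 uW)
P_HAR_MIN_UW = 0.31e3    # AFR-HAR mode, lower bound (0.31 mW)
P_HAR_MAX_UW = 8.52e3    # AFR-HAR mode, upper bound (8.52 mW)


def med_saving(p_har_uw: float, p_med_uw: float = P_MED_UW) -> float:
    """Fraction of power saved by idling in MED mode instead of AFR-HAR mode."""
    return 1.0 - p_med_uw / p_har_uw


def average_power_uw(duty: float, p_har_uw: float,
                     p_med_uw: float = P_MED_UW) -> float:
    """Time-averaged power for an assumed motion duty cycle in [0, 1].

    Assumes a simple linear mix of the two modes (hypothetical model).
    """
    if not 0.0 <= duty <= 1.0:
        raise ValueError("duty must be between 0 and 1")
    return duty * p_har_uw + (1.0 - duty) * p_med_uw


if __name__ == "__main__":
    print(f"saving vs. 0.31 mW: {med_saving(P_HAR_MIN_UW):.2%}")
    print(f"saving vs. 8.52 mW: {med_saving(P_HAR_MAX_UW):.2%}")
    # Assumed example: motion is present 5% of the time.
    avg = average_power_uw(0.05, P_HAR_MAX_UW)
    print(f"average at 5% duty, 8.52 mW active: {avg:.1f} uW")
```

Under these assumptions, the MED-mode saving exceeds 99% at both ends of the AFR-HAR range, which agrees with the stated figure; the averaged value depends entirely on the assumed duty cycle.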