Advances in RGB-D cameras capable of tracking human body movement in the form of a skeleton have contributed to growing interest in skeleton-based human action recognition. However, the tracking performance of a single camera is prone to occlusion and is view dependent. In this study, we use fused skeletal data obtained from two views to recognize human actions. We perform a substitutive fusion based on joint tracking status and build a view-invariant action recognition system. The resulting fused skeletal data are transformed into a histogram of cubes as a frame-level feature. Clustering is applied to build a dictionary of frame representatives, and actions are encoded as sequences of frame representatives. Finally, recognition is performed as a sequence-matching task using Dynamic Time Warping with a K-nearest neighbor classifier. Experimental results show that fused skeletal data consistently give better recognition performance than their single-view counterparts.
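
To illustrate the substitutive fusion step described above, the following Python sketch keeps each joint that is reliably tracked in the primary view and substitutes the corresponding joint from the second view otherwise. The tracking-status codes, joint layout, and the assumption that both skeletons are already expressed in a common coordinate frame are illustrative assumptions, not the exact conventions of the paper.

```python
import numpy as np

# Assumed tracking-status codes (modeled on Kinect-style output):
# 2 = tracked, 1 = inferred, 0 = not tracked.
TRACKED = 2

def substitutive_fusion(joints_a, status_a, joints_b, status_b):
    """Fuse two skeletons, each of shape (num_joints, 3).

    Joints reliably tracked in view A are kept; otherwise the
    corresponding joint from view B (assumed to be transformed into
    the same coordinate frame) is substituted when its tracking
    status is at least as good.
    """
    fused = joints_a.copy()
    for j in range(joints_a.shape[0]):
        if status_a[j] < TRACKED and status_b[j] >= status_a[j]:
            fused[j] = joints_b[j]
    return fused

# Minimal usage example with a hypothetical 20-joint skeleton.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    skel_a = rng.normal(size=(20, 3))
    skel_b = rng.normal(size=(20, 3))
    stat_a = np.full(20, TRACKED)
    stat_a[5] = 0          # joint 5 occluded in view A
    stat_b = np.full(20, TRACKED)
    fused = substitutive_fusion(skel_a, stat_a, skel_b, stat_b)
    assert np.allclose(fused[5], skel_b[5])
```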