This paper presents an unsupervised learning algorithm that automatically discovers the key behavior patterns characterizing a complex video scene. Given behavior features (bFs) extracted at multiple spatial-temporal scales, an optimization problem is formulated that clusters the bFs within each scale while establishing a collaborative nonlinear relationship, in the form of a kernel regression function, among the clustered bFs across different spatial scales. This relationship allows features extracted at one scale to serve as contextual information in the analysis of another scale. The optimization problem is solved using linear programming to reduce computational complexity. The proposed algorithm is evaluated on four crowded traffic scenes and two sports video data sets. Experimental results show that it achieves higher video segmentation accuracy than current state-of-the-art algorithms.
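
To make the idea of a kernel regression function linking two scales concrete, the following is a minimal sketch and not the formulation used in this paper. It assumes a plain Nadaraya-Watson estimator with a Gaussian kernel, and it does not include the clustering step or the linear programming solver. The names `gaussian_kernel`, `kernel_regress`, and `bandwidth`, and the toy feature values, are hypothetical and chosen only for illustration.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) similarity between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_regress(query, coarse_feats, fine_feats, bandwidth=1.0):
    """Nadaraya-Watson estimate of a fine-scale feature from a coarse-scale query.

    Each coarse-scale training feature votes for its paired fine-scale
    feature, weighted by its kernel similarity to the query.
    """
    weights = [gaussian_kernel(query, c, bandwidth) for c in coarse_feats]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from all training features")
    dim = len(fine_feats[0])
    return [
        sum(w * f[d] for w, f in zip(weights, fine_feats)) / total
        for d in range(dim)
    ]


# Hypothetical toy data: three paired (coarse-scale, fine-scale) features.
coarse = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
fine = [[0.0], [1.0], [4.0]]

# The query coincides with the second coarse feature, so the estimate is
# dominated by that feature's paired fine-scale value.
pred = kernel_regress([1.0, 1.0], coarse, fine, bandwidth=0.5)
print(pred)
```

In this sketch the coarse-scale feature plays the role of contextual information for the fine scale: the estimate for a query is a similarity-weighted average of the fine-scale features paired with nearby coarse-scale features.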