Text in scene images conveys the semantic context of those images, so robust extraction of text regions is essential for scene text understanding. However, separating text pixels from scene images remains challenging because of uncontrolled lighting conditions and complex backgrounds. In this paper, we propose a two-stage conditional random field (TCRF) approach to robustly extract text regions from scene images. The proposed approach models the spatial and hierarchical structures of scene text and finds text regions based on this model. In the first stage, the system generates multiple character proposals for the given image by using multiple image segmentations and a local CRF model. In the second stage, the system selectively integrates the generated character proposals to determine proper character regions by using a holistic CRF model. Through the TCRF approach, we cast the scene text separation problem as a probabilistic labeling problem, whose solution is the label configuration of pixels that maximizes the conditional probability of the labels given the image. Experimental results indicate that our framework performs well on public databases.
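
As an illustration of the probabilistic labeling view, a generic pairwise CRF can be written as below. This is the standard textbook form, not the specific local or holistic model of the proposed approach. The symbols are assumptions introduced here: $\mathbf{x}$ is the observed image, $\mathbf{y}$ the label configuration (for example, text or background per site), $\mathcal{V}$ the set of sites, $\mathcal{E}$ the set of neighboring site pairs, and $\psi_u$, $\psi_p$ hypothetical unary and pairwise potentials.

```latex
% Generic pairwise CRF (illustrative; the potentials are placeholders)
\begin{align}
  P(\mathbf{y} \mid \mathbf{x})
    &= \frac{1}{Z(\mathbf{x})} \exp\bigl(-E(\mathbf{y}, \mathbf{x})\bigr), \\
  E(\mathbf{y}, \mathbf{x})
    &= \sum_{i \in \mathcal{V}} \psi_u(y_i, \mathbf{x})
     + \sum_{(i,j) \in \mathcal{E}} \psi_p(y_i, y_j, \mathbf{x}), \\
  \mathbf{y}^{*}
    &= \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})
     = \arg\min_{\mathbf{y}} E(\mathbf{y}, \mathbf{x}).
\end{align}
```

Because the partition function $Z(\mathbf{x})$ does not depend on $\mathbf{y}$, maximizing the conditional probability is equivalent to minimizing the energy $E$.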