This paper proposes a method for jointly inferring road layouts and semantically segmenting urban scenes by exploiting spatial context. The method rests on the conjecture that the elements of an urban environment stand in consistent locational relationships to one another. These relationships can be modeled as a location prior and as label co-occurrence statistics, both of which help segment an image more accurately. To exploit them, a special coordinate system, referred to as road-normal coordinates, is defined on the inferred road layout. The coordinates are fixed by selecting the best-fitting road layout according to the marginal probabilities produced by an existing segmentation algorithm. All image segments that have depth information from a lidar sensor are projected into the road-normal coordinates, and the learned location prior and label co-occurrence statistics are applied to each segment as additional potentials of a conditional random field model. The proposed method is evaluated on a publicly available urban dataset that includes images and their corresponding point clouds.
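
As an illustrative sketch only, and not necessarily the exact formulation used in this paper, the additional potentials could enter a conditional random field energy as shown below. All symbols are assumptions introduced here for illustration: $x_i$ is the label of segment $i$, $S$ is the set of all segments, $S_d \subseteq S$ is the subset with lidar depth, $\mathcal{N}$ is the set of neighboring segment pairs, $\mathbf{r}_i$ is the position of segment $i$ in road-normal coordinates, and $\lambda_{\mathrm{loc}}$, $\lambda_{\mathrm{co}}$ are hypothetical weights.

\[
E(\mathbf{x}) = \sum_{i \in S} \psi^{\mathrm{u}}_i(x_i)
 + \sum_{(i,j) \in \mathcal{N}} \psi^{\mathrm{p}}_{ij}(x_i, x_j)
 + \lambda_{\mathrm{loc}} \sum_{i \in S_d} \psi^{\mathrm{loc}}(x_i \mid \mathbf{r}_i)
 + \lambda_{\mathrm{co}} \sum_{\substack{i, j \in S_d \\ i < j}} \psi^{\mathrm{co}}(x_i, x_j)
\]

One plausible choice, again an assumption, is $\psi^{\mathrm{loc}}(x_i \mid \mathbf{r}_i) = -\log P(x_i \mid \mathbf{r}_i)$, with $P$ taken from the learned location prior, and a co-occurrence term $\psi^{\mathrm{co}}$ that penalizes label pairs rarely observed together in the training data. Under this reading, segments without depth information contribute only the first two terms, which would come from the existing segmentation model.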