Three-dimensional city models are becoming a valuable resource because of their close geospatial, geometrical, and visual relationship with the physical world. However, ground-oriented applications in virtual reality, 3D navigation, and civil engineering require a novel modeling approach, because the existing large-scale 3D city modeling methods do not provide rich visual information at ground level. This paper proposes a new framework for generating 3D city models that satisfy both the visual and the physical requirements for ground-oriented virtual reality applications. To ensure its usability, the framework must be cost-effective and allow for automated creation. To achieve these goals, we leverage a mobile mapping system that automatically gathers high-resolution images and supplements sensor information such as the position and direction of the captured images. To resolve problems stemming from sensor noise and occlusions, we develop a fusion technique to incorporate digital map data. This paper describes the major processes of the overall framework and the proposed techniques for each step and presents experimental results from a comparison with an existing 3D city model.