High-fidelity image and video synthesis are long-standing problems in the computer vision literature. These tasks have mostly been tackled with deep generative models, such as VAEs and GANs, which learn to represent and synthesize images and videos. In this thesis, we review two recent instantiations of the VAE-GAN family that combine VAEs and GANs hierarchically for high-fidelity image and video synthesis. The core idea of the reviewed methods is two-fold. First, the synthesis task is decomposed into two separate yet successive learning problems: representation learning and synthesis. Then, each problem is tackled by a VAE and a GAN, respectively, in a hierarchical manner. This thesis considers two challenging synthesis scenarios to understand the benefits of such generative models: (1) megapixel image synthesis with interpretable, disentangled representations and (2) persistent long-term video prediction. In the chapter devoted to each topic, we provide a detailed motivation and description of the hierarchically decomposed generative model, followed by comparisons with other state-of-the-art generative models that demonstrate the effectiveness of such hierarchical decomposition.