In this thesis, we proposed a semantic video scene detection method built on a hierarchical video semantics model. At the scene level, the model defines three concepts: ``Dialog``, ``Adult``, and ``Miscellaneous``. Each scene concept is characterized by concepts drawn from a few key-frames, and these key-frame level concepts carry the video semantics. There are five key-frame level concepts: ``Dressed Upper Body``, ``Human Full Face``, ``Naked Part of Body``, ``Naked Whole Body``, and ``Miscellaneous``. To learn the key-frame level concepts, skin region-preserved key-frames are first created by discarding the non-skin-colored regions of the key-frames. Multiple visual features are then extracted from both the original key-frames and the skin region-preserved key-frames. Each concept is modeled by Support Vector Machines (SVMs) trained on the extracted visual features, and these SVMs are boosted into a strong classifier with the AdaBoost method. The score of the AdaBoosted SVMs is used as the final confidence value for each key-frame level concept. As a result, a concept vector of five confidence scores is generated and used for scene concept modeling. The video scene concepts are then modeled by a Hidden Markov Model (HMM). To simplify HMM training, we defined several states and their transition probabilities for each video scene. The HMM, with these predefined states and transitions, is applied on top of the AdaBoosted SVMs to model the scene concepts. In the experiments, key-frame level concept detection achieved its best performance, 84.8%, for the ``Naked Part of Body`` concept. ``Adult`` scene detection achieved a recall of 92.09% with the scene concept models, up from 79.22% with multi-modal feature-based detection.
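The scene-level step can be illustrated with a minimal sketch in Python. It is not the implementation used in this thesis: the state layouts, transition probabilities, concept profiles, and the emission score below are all hypothetical values chosen for illustration, and the names ``SCENE_MODELS``, ``emission``, ``forward_score``, and ``classify_scene`` are assumptions. The sketch only shows how a sequence of five-dimensional concept vectors (one per key-frame) could be scored by per-scene HMMs with predefined states and transitions, using the scaled forward algorithm.

```python
import math

# Order of the five key-frame level concepts inside every concept vector.
CONCEPTS = ["Dressed Upper Body", "Human Full Face", "Naked Part of Body",
            "Naked Whole Body", "Miscellaneous"]

# Hypothetical scene models. Each state has a "profile": a distribution over
# the five concepts it is expected to show. All numbers are illustrative
# assumptions, not values from the thesis.
SCENE_MODELS = {
    "Dialog": {
        "start": [0.5, 0.5],
        "trans": [[0.5, 0.5], [0.5, 0.5]],
        "profiles": [
            [0.30, 0.60, 0.02, 0.02, 0.06],  # face-dominated key-frame
            [0.60, 0.30, 0.02, 0.02, 0.06],  # upper-body-dominated key-frame
        ],
    },
    "Adult": {
        "start": [0.5, 0.5],
        "trans": [[0.7, 0.3], [0.3, 0.7]],
        "profiles": [
            [0.05, 0.05, 0.60, 0.25, 0.05],  # naked-part-dominated key-frame
            [0.05, 0.05, 0.25, 0.60, 0.05],  # naked-whole-dominated key-frame
        ],
    },
    "Miscellaneous": {
        "start": [1.0],
        "trans": [[1.0]],
        "profiles": [[0.10, 0.10, 0.10, 0.10, 0.60]],
    },
}


def emission(profile, concept_vector):
    """Hypothetical emission score: overlap between a state's concept profile
    and the confidence scores of one key-frame, normalized to sum to 1."""
    total = sum(concept_vector)
    if total <= 0.0:
        return 0.0
    return sum(p * c for p, c in zip(profile, concept_vector)) / total


def forward_score(model, sequence):
    """Scaled forward algorithm; returns the log score of the sequence."""
    start, trans, profiles = model["start"], model["trans"], model["profiles"]
    n = len(start)
    alpha = list(start)
    log_score = 0.0
    for t, vector in enumerate(sequence):
        if t == 0:
            alpha = [start[j] * emission(profiles[j], vector) for j in range(n)]
        else:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n))
                * emission(profiles[j], vector)
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_score += math.log(scale)          # accumulate the scaling factors
        alpha = [a / scale for a in alpha]    # rescale to avoid underflow
    return log_score


def classify_scene(sequence):
    """Pick the scene concept whose model scores the sequence highest."""
    return max(SCENE_MODELS,
               key=lambda name: forward_score(SCENE_MODELS[name], sequence))
```

In this sketch, ``classify_scene`` takes the list of concept vectors produced for the key-frames of one scene and returns the label of the best-scoring model. In a full system the emission term would be replaced by whatever observation model is actually fitted to the AdaBoosted SVM scores.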