Recognizing dynamic textures and scenes is a fundamental problem in natural scene understanding: categorizing moving scenes such as a forest fire, a landslide, or an avalanche. Over the past decade, considerable effort has been devoted to this problem. While existing methods focus on reliably capturing the spatial and temporal information of moving patterns, few works have explored frame selection strategies, even though a sequence is likely to include irrelevant frames that appear suddenly or rarely within a particular texture or scene category. In this dissertation, we propose a codebook-based dynamic texture descriptor that aggregates salient features on three orthogonal planes. Given a sequence, only those frame features that are highly correlated with each visual word are selected and aggregated, from the perspective of non-Euclidean geometry. The proposed descriptor discards features from outlier frames that appear suddenly or rarely in a particular context, thereby strengthening the emphasis on salient features.

Extending this study, we also propose a dynamic scene recognition framework based on deep convolutional neural networks. Instead of using whole frames, random frames, or partially consecutive frames as in conventional approaches, we use 'key frames' and 'key segments.' A small number of key frames that reflect the feature distribution of the sequence are used to capture salient static appearances. Key segments, taken from the area around each key frame, provide additional discriminative power through dynamic patterns. Features from a fully connected layer of a deep convolutional neural network are used to select the key frames and key segments, while features from a convolutional layer describe them. The features from key frames and key segments are then aggregated separately and combined into a compact video-level descriptor.
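The per-word selection step can be illustrated with a minimal sketch: assign each frame feature to its nearest visual word, keep only the features whose cosine similarity with that word exceeds a threshold (dropping outlier frames), and aggregate the survivors. This is a hypothetical simplification: the function name, the hard assignment, the threshold `tau`, and the Euclidean mean are all assumptions for illustration, whereas the dissertation's aggregation operates from a non-Euclidean geometric perspective.

```python
import numpy as np

def aggregate_salient(frame_feats, codebook, tau=0.9):
    """Illustrative sketch (not the dissertation's exact method):
    for each visual word, keep only frame features highly correlated
    with it (cosine similarity > tau), then average the kept features
    into a per-word descriptor and concatenate."""
    # normalize rows so the dot product equals cosine similarity
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sims = f @ c.T                       # (n_features, n_words)
    words = sims.argmax(axis=1)          # hard assignment to nearest word
    desc = np.zeros_like(codebook, dtype=float)
    for j in range(len(codebook)):
        # drop features from outlier frames weakly correlated with word j
        sel = (words == j) & (sims[:, j] > tau)
        if sel.any():
            desc[j] = frame_feats[sel].mean(axis=0)
    return desc.ravel()                  # video-level descriptor
```

A feature that lies between two words (e.g., equally similar to both) falls below the threshold for either and is excluded, which is the intended outlier-suppression behavior.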
Evaluations on public dynamic texture and dynamic scene datasets demonstrate that the proposed methods achieve state-of-the-art performance.
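The key-frame selection described above can be sketched as follows, under stated assumptions: per-frame features come from a fully connected layer (here just a NumPy array), and "frames that reflect the feature distribution with a small number" is approximated by clustering the features and taking the frame nearest each cluster center. The function name and the k-means criterion are illustrative choices, not the dissertation's exact algorithm.

```python
import numpy as np

def select_key_frames(fc_feats, k, iters=20, seed=0):
    """Pick k key-frame indices from per-frame FC-layer features
    (shape: n_frames x dim): cluster the features with k-means and
    return the frame nearest each centroid. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n = fc_feats.shape[0]
    # farthest-point initialization for stable, well-separated seeds
    idx = [int(rng.integers(n))]
    for _ in range(1, k):
        d = np.linalg.norm(fc_feats[:, None] - fc_feats[idx][None], axis=2)
        idx.append(int(d.min(axis=1).argmax()))
    centroids = fc_feats[idx].astype(float)
    for _ in range(iters):
        # assign each frame to its nearest centroid, then update centroids
        d = np.linalg.norm(fc_feats[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = fc_feats[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # key frame = the actual frame closest to each centroid
    d = np.linalg.norm(fc_feats[:, None] - centroids[None], axis=2)
    return sorted(set(d.argmin(axis=0)))
```

Key segments would then be short windows of frames taken around each returned index, described by convolutional-layer features rather than FC-layer ones.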