Automatic engagement recognition is a technique that is used to measure the engagement level of people in a specific task. Although previous research has utilized expensive and intrusive devices such as physiological sensors and pressure-sensing chairs, methods using RGB video cameras have become the most common because of the cost efficiency and noninvasiveness of video cameras. Automatic engagement recognition methods using video cameras are usually based on hand-crafted features and a statistical temporal dynamics modeling algorithm. This paper proposes a data-driven convolutional neural networks (CNNs)-based engagement recognition method that uses only facial images from input videos. As the amount of data in a dataset of children's engagement is insufficient for deep learning, pre-trained CNNs are utilized for low-level feature extraction from each video frame. In particular, a new layer combination for temporal dynamics modeling is employed to extract high-level features from low-level features. Experimental results on a database created using images of children from kindergarten demonstrate that the performance of the proposed method is superior to that of previous methods. The results indicate that the engagement level of children can be gauged automatically via deep learning even when the available database is deficient.