In this thesis, we identify two major problems through a structural analysis of continuous sign language recognition (CSLR) datasets: (1) because constructing a CSLR dataset is expensive, additional annotations (e.g., pose, optical flow, and frame-level gloss labels) are difficult to obtain; and (2) diverse background environments are not considered during dataset construction. To address the first problem, we propose a lightweight backbone network that independently extracts non-manual (gaze direction, facial expression, and lip pattern) and manual (hand shape and movement) features without any additional annotations, together with a method that generates more accurate pseudo-labels by combining the model output with the ground-truth gloss sequence. To address the second problem, we first construct a sign language dataset containing diverse background scenes and further propose a disentanglement module that effectively separates the signer from the background in a sign video. Through extensive quantitative and qualitative evaluations, we verify that the proposed methods are highly effective in overcoming the limitations of existing CSLR datasets.