In machine learning, model generalization refers to the ability to accurately classify data not seen during training. For an acoustic model that processes acoustic features in a speech recognition system, generalization is especially important because environment and speaker variations create large differences between training and test data. In particular, generative modeling-based acoustic models, which focus on the distribution of the data, such as the Gaussian mixture model, generalize poorly to data not used for training. To address this problem, researchers have made extensive efforts to improve the generalization of acoustic models through methods such as margin maximization, model adaptation, and feature selection.
Recently, acoustic modeling has moved away from the Gaussian mixture model and achieved dramatic improvements in recognition rates through deep learning-based modeling. Deep learning greatly improves acoustic model generalization through its ability to transform acoustic input features into linearly separable representations. However, this improvement stems from the structural merits of the deep learning model itself; training with the cross entropy (CE) criterion still depends only on the training data, just as in the classical methods.
Therefore, this dissertation proposes various training methods to improve the generalization of deep learning-based acoustic models. First, we reinterpret the machine learning technique of improving model generalization through margin maximization and propose a method to apply it to training deep neural network (DNN)-based acoustic models. Instead of considering margins directly, this method enlarges the margins through a regularization technique that maps the last hidden layer outputs densely around the centroid of each class. For this purpose, we also propose an $L_2$ distance-based output layer that performs classification through the centroids. Second, we propose a new speaker adaptation technique for DNN-based acoustic models. In this method, we introduce a closed-form solution-based training method in a linear output network framework, instead of the stochastic gradient descent (SGD)-based approach, which must account for various training conditions. The proposed method uses the aforementioned $L_2$ distance-based output layer so that the linearly transformed last hidden layer outputs are mapped close to the center of each class. Finally, we propose a feature selection technique based on classification contribution, which can be used in deep learning-based acoustic modeling. The proposed method extends the idea of conventional feature selection techniques and assigns a weight between 0 and 1 to each element of the input features through a DNN framework. All of the proposed methods showed consistent performance improvements over CE criterion-based training, which demonstrates their effectiveness.
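The $L_2$ distance-based output layer shared by the first two methods can be illustrated with a minimal NumPy sketch. This is an assumed, simplified formulation, not the dissertation's actual implementation: logits are taken as negative squared distances to learned class centroids, and a regularization term pulls each hidden output toward its own class centroid, which indirectly enlarges the inter-class margins.

```python
import numpy as np

def l2_distance_logits(h, centroids):
    """Logit for class k as the negative squared L2 distance between the
    last hidden layer output h and the class centroid c_k:
    logit_k = -||h - c_k||^2 (an assumed, illustrative formulation)."""
    # h: (batch, dim), centroids: (num_classes, dim)
    diff = h[:, None, :] - centroids[None, :, :]   # (batch, K, dim)
    return -np.sum(diff ** 2, axis=-1)             # (batch, K)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def centroid_regularizer(h, labels, centroids):
    """Penalty mapping each hidden output densely around its own class
    centroid; adding this to the CE loss is the margin-enlarging idea."""
    return np.mean(np.sum((h - centroids[labels]) ** 2, axis=-1))

# Toy usage: two classes with hidden outputs lying near their centroids.
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])
h = np.array([[0.1, -0.1], [3.9, 4.2]])
labels = np.array([0, 1])

probs = softmax(l2_distance_logits(h, centroids))
reg = centroid_regularizer(h, labels, centroids)
```

In a full training setup, `reg` would be added (with a tunable weight) to the CE loss computed from `probs`, so that classification and centroid attraction are optimized jointly; the centroids themselves would be trainable parameters.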