Generalization of neural network on unseen acoustic environment and sentence for spoken dialog system

A spoken dialog system must respond appropriately to diverse user queries. This dissertation discusses the generalization of a spoken dialog system to unseen user queries, that is, queries arising from unseen sentences and unseen acoustic environments.

In the first part, we address two general problems of conventional neural sentence representations: (1) estimating embeddings for rare words and (2) the absence of inter-sentence dependency. Both problems are addressed simultaneously with the hierarchical composition recurrent network (HCRN), which consists of a three-level hierarchy: character-to-word, word-to-sentence, and sentence-to-context. The method is tested on the dialog act classification task with the DAMSL database. Compared to a conventional word-to-sentence hierarchy model, the word embeddings built by the character-to-word hierarchy form morphologically and semantically similar clusters, and the sentence-to-context hierarchy reduces dialog act classification error, especially for sentences with omissions.

In the second part, we aim at speech enhancement without clean speech as the target, since clean speech is generally unobtainable in real environments and is available only for simulated data. We propose acoustic and adversarial supervision (AAS) for clean-free speech enhancement. Acoustic supervision drives the enhanced speech to maximize its likelihood under a pre-trained acoustic model; the enhanced speech therefore preserves phonetic characteristics but contains artifacts as a consequence of over-fitting. Adversarial supervision drives the enhanced speech toward the general characteristics of clean speech, but the result is often irrelevant to the noisy input as a consequence of mode collapse. With a proper combination of supervision weights, the two supervisions make up for each other's limitations. The method is tested on the Librispeech+DEMAND and CHiME-4 databases. By visualizing the enhanced speech under different supervision combinations, we confirm the pros and cons of each supervision described above. Compared to an enhancement method that uses a clean speech target, AAS achieves a lower word error rate, although its distance from clean speech is larger.

In the third part, we aim at source and position robustness of the enhancement model. For source robustness, we remove the source dependency of the model by using the inter-mic ratio and the demixing weight as its input and output: the demixing weight is inherently source-independent, and the inter-mic ratio is approximately source-independent when the analysis window is much longer than the impulse response. For position robustness, we propose a frequency-wise complex multi-layer perceptron, motivated by a prior analysis showing that the position sensitivity of the demixing weight increases from low to high frequency. Moreover, the target demixing weight varies with model size, initialization, and the training data in a minibatch, since the global optimum of the demixing weight is not uniquely determined; we propose reference position regularization, which reduces training-target variance by uniquely determining the true demixing weight. The proposed method is tested on a simulated reverberant dataset with varying source positions while the room and microphones are fixed. Compared to conventional source-dependent training, the proposed source-independent method achieves a higher signal-to-distortion ratio, especially when the number of training sources is small. While the proposed model tends to overfit to the training positions, reference position regularization alleviates the signal-to-distortion-ratio drop at out-of-training positions.
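
To make the first part concrete, here is a minimal sketch of the character-word-sentence-context hierarchy. The class name `HCRNSketch`, the choice of GRU cells, and all dimensions are illustrative assumptions, not the dissertation's exact configuration.

```python
import torch
import torch.nn as nn

class HCRNSketch(nn.Module):
    def __init__(self, n_chars, n_acts, char_dim=16, word_dim=64,
                 sent_dim=128, ctx_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Level 1: characters -> word embedding; rare or out-of-vocabulary
        # words still get an embedding, composed from their characters.
        self.char_rnn = nn.GRU(char_dim, word_dim, batch_first=True)
        # Level 2: word embeddings -> sentence embedding.
        self.word_rnn = nn.GRU(word_dim, sent_dim, batch_first=True)
        # Level 3: sentence embeddings -> dialog context, supplying the
        # inter-sentence dependency that word-to-sentence models lack.
        self.ctx_rnn = nn.GRU(sent_dim, ctx_dim, batch_first=True)
        self.classifier = nn.Linear(ctx_dim, n_acts)

    def forward(self, dialog):
        # dialog: list of sentences; each sentence is a list of 1-D
        # character-index tensors, one tensor per word.
        sent_vecs = []
        for sentence in dialog:
            word_vecs = []
            for word_chars in sentence:
                _, h = self.char_rnn(self.char_emb(word_chars).unsqueeze(0))
                word_vecs.append(h[-1])              # (1, word_dim)
            _, h = self.word_rnn(torch.cat(word_vecs).unsqueeze(0))
            sent_vecs.append(h[-1])                  # (1, sent_dim)
        ctx, _ = self.ctx_rnn(torch.cat(sent_vecs).unsqueeze(0))
        return self.classifier(ctx.squeeze(0))       # one dialog act per sentence

# Toy usage: a two-sentence dialog, each word given as character indices.
model = HCRNSketch(n_chars=30, n_acts=5)
dialog = [[torch.tensor([1, 2, 3]), torch.tensor([4, 5])],
          [torch.tensor([2, 2])]]
print(model(dialog).shape)  # torch.Size([2, 5])
```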
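For the second part, the following self-contained toy shows one way the two supervisions could be combined into a single enhancement loss. Every module is a tiny stand-in, and the frame-level phone NLL is an illustrative simplification (a sequence-level likelihood such as CTC may be what the dissertation actually uses); the weight `lambda_adv` is likewise an assumed hyper-parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, N_PHONES = 100, 40                      # frames per utterance, phone classes

enhancer = nn.Linear(T, T)                 # toy enhancer over feature frames
acoustic_model = nn.Sequential(            # pre-trained and frozen in practice
    nn.Linear(1, N_PHONES), nn.LogSoftmax(dim=-1))
discriminator = nn.Linear(T, 1)            # judges "clean" vs. "enhanced"

for p in acoustic_model.parameters():      # the acoustic model stays fixed
    p.requires_grad_(False)

def aas_loss(noisy, phone_targets, lambda_adv=0.1):
    enhanced = enhancer(noisy)                          # (batch, T)
    # Acoustic supervision: maximize transcript likelihood under the
    # pre-trained acoustic model (here, frame-wise phone NLL). Preserves
    # phonetic content, but alone it over-fits and leaves artifacts.
    log_probs = acoustic_model(enhanced.unsqueeze(-1))  # (batch, T, N_PHONES)
    acoustic = F.nll_loss(log_probs.reshape(-1, N_PHONES),
                          phone_targets.reshape(-1))
    # Adversarial supervision: make enhanced speech look "clean" to the
    # discriminator. Alone, it can mode-collapse away from the noisy input.
    adv = F.binary_cross_entropy_with_logits(
        discriminator(enhanced), torch.ones(noisy.size(0), 1))
    # A proper weighting lets the two terms cover each other's failure modes.
    return acoustic + lambda_adv * adv

noisy = torch.randn(4, T)
phones = torch.randint(0, N_PHONES, (4, T))
print(aas_loss(noisy, phones).item())
```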
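For the third part, this is a minimal sketch of a frequency-wise multi-layer perceptron: one small MLP per frequency bin maps inter-mic ratios to demixing weights, with complex values represented as stacked real/imaginary parts rather than complex-valued layers. The per-bin sizes and the `FreqWiseMLP` name are assumptions; reference position regularization is indicated only in a comment.

```python
import torch
import torch.nn as nn

class FreqWiseMLP(nn.Module):
    def __init__(self, n_freq, n_mics, hidden=32):
        super().__init__()
        in_dim = 2 * (n_mics - 1)       # inter-mic ratios, re/im stacked
        out_dim = 2 * n_mics            # demixing weights, re/im stacked
        # One separate MLP per frequency bin: the position sensitivity of
        # the demixing weight grows with frequency, so bins are not shared.
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, out_dim))
            for _ in range(n_freq))

    def forward(self, imr):
        # imr: (batch, n_freq, 2*(n_mics-1)) inter-mic ratios per bin.
        # Both input and output are (approximately) source-independent.
        return torch.stack(
            [mlp(imr[:, f]) for f, mlp in enumerate(self.mlps)], dim=1)

# Reference position regularization (not shown) would additionally fix a
# reference so the target demixing weight is uniquely determined, reducing
# the training-target variance described in the abstract.
model = FreqWiseMLP(n_freq=257, n_mics=2)
w = model(torch.randn(4, 257, 2))       # -> (4, 257, 4) demixing weights
print(w.shape)
```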
Advisors
Kim, Daeshik; Lee, Soo-Young
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2020
Identifier
325007
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering, 2020.2, [vii, 101 p.]

Keywords

Sentence representation; Out-of-vocabulary; Dialog context; Clean-free speech enhancement; Source/Position robustness

URI
http://hdl.handle.net/10203/284231
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=909488&flag=dissertation
Appears in Collection
EE-Theses_Ph.D. (Doctoral Theses)
