Generalization of neural network on unseen acoustic environment and sentence for spoken dialog system

A spoken dialog system must respond appropriately to diverse user queries. This dissertation discusses the generalization of a spoken dialog system to unseen user queries, that is, queries arising from unseen sentences and unseen acoustic environments.

In the first part, we address two general problems of conventional neural sentence representations: (1) estimating embeddings for rare words and (2) the absence of inter-sentence dependency. Both problems are addressed simultaneously with the hierarchical composition recurrent network (HCRN), which consists of a three-level hierarchy: character-to-word, word-to-sentence, and sentence-to-context. The method is tested on the dialog act classification task with the DAMSL database. Compared to a conventional word-to-sentence hierarchy model, the word embeddings built by the character-to-word hierarchy form morphologically and semantically similar clusters, and the sentence-to-context hierarchy reduces dialog act classification error, especially for sentences with omissions.

In the second part, we aim at speech enhancement without clean speech as the target, since clean speech is generally unobtainable in real environments and is available only for simulated data. We propose acoustic and adversarial supervision (AAS) for clean-free speech enhancement. Acoustic supervision drives the enhanced speech to maximize its likelihood under a pre-trained acoustic model; the enhanced speech therefore preserves phonetic characteristics but contains artifacts as a consequence of over-fitting. Adversarial supervision drives the enhanced speech toward the general characteristics of clean speech, but the result is often irrelevant to the noisy input as a consequence of mode collapse. With a proper combination of supervision weights, the two supervisions make up for each other's limitations. The method is tested on the Librispeech+DEMAND and CHiME-4 databases. By visualizing the enhanced speech under different supervision combinations, we confirm the pros and cons of each supervision described above. Compared to an enhancement method that uses a clean speech target, AAS achieves a lower word error rate, although its distance from clean speech is larger.

In the third part, we aim at source and position robustness of the enhancement model. For source robustness, we remove the source dependency of the model by using the inter-mic ratio and the demixing weight as its input and output: the demixing weight is inherently source-independent, and the inter-mic ratio is approximately source-independent when the analysis window is much longer than the impulse response. For position robustness, we propose a frequency-wise complex multi-layer perceptron, motivated by a prior analysis showing that the position sensitivity of the demixing weight increases from low to high frequency. Moreover, the target demixing weight varies with model size, initialization, and the training data in a minibatch, since the global optimum of the demixing weight is not uniquely determined; we propose reference position regularization, which reduces training-target variance by uniquely determining the true demixing weight. The proposed method is tested on a simulated reverberant dataset with varying source positions while the room and microphones are fixed. Compared to conventional source-dependent training, the proposed source-independent method achieves a higher signal-to-distortion ratio, especially when the number of training sources is small. While the proposed model tends to overfit to the training positions, reference position regularization alleviates the signal-to-distortion-ratio drop at out-of-training positions.
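
To make the first part concrete, here is a minimal sketch of the character-word-sentence-context hierarchy. The class name `HCRNSketch`, the choice of GRU cells, and all dimensions are illustrative assumptions, not the dissertation's exact configuration.

```python
import torch
import torch.nn as nn

class HCRNSketch(nn.Module):
    def __init__(self, n_chars, n_acts, char_dim=16, word_dim=64,
                 sent_dim=128, ctx_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Level 1: characters -> word embedding; rare or out-of-vocabulary
        # words still get an embedding, composed from their characters.
        self.char_rnn = nn.GRU(char_dim, word_dim, batch_first=True)
        # Level 2: word embeddings -> sentence embedding.
        self.word_rnn = nn.GRU(word_dim, sent_dim, batch_first=True)
        # Level 3: sentence embeddings -> dialog context, supplying the
        # inter-sentence dependency that word-to-sentence models lack.
        self.ctx_rnn = nn.GRU(sent_dim, ctx_dim, batch_first=True)
        self.classifier = nn.Linear(ctx_dim, n_acts)

    def forward(self, dialog):
        # dialog: list of sentences; each sentence is a list of 1-D
        # character-index tensors, one tensor per word.
        sent_vecs = []
        for sentence in dialog:
            word_vecs = []
            for word_chars in sentence:
                _, h = self.char_rnn(self.char_emb(word_chars).unsqueeze(0))
                word_vecs.append(h[-1])              # (1, word_dim)
            _, h = self.word_rnn(torch.cat(word_vecs).unsqueeze(0))
            sent_vecs.append(h[-1])                  # (1, sent_dim)
        ctx, _ = self.ctx_rnn(torch.cat(sent_vecs).unsqueeze(0))
        return self.classifier(ctx.squeeze(0))       # one dialog act per sentence

# Toy usage: a two-sentence dialog, each word given as character indices.
model = HCRNSketch(n_chars=30, n_acts=5)
dialog = [[torch.tensor([1, 2, 3]), torch.tensor([4, 5])],
          [torch.tensor([2, 2])]]
print(model(dialog).shape)  # torch.Size([2, 5])
```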
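For the second part, the following self-contained toy shows one way the two supervisions could be combined into a single enhancement loss. Every module is a tiny stand-in, and the frame-level phone NLL is an illustrative simplification (a sequence-level likelihood such as CTC may be what the dissertation actually uses); the weight `lambda_adv` is likewise an assumed hyper-parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, N_PHONES = 100, 40                      # frames per utterance, phone classes

enhancer = nn.Linear(T, T)                 # toy enhancer over feature frames
acoustic_model = nn.Sequential(            # pre-trained and frozen in practice
    nn.Linear(1, N_PHONES), nn.LogSoftmax(dim=-1))
discriminator = nn.Linear(T, 1)            # judges "clean" vs. "enhanced"

for p in acoustic_model.parameters():      # the acoustic model stays fixed
    p.requires_grad_(False)

def aas_loss(noisy, phone_targets, lambda_adv=0.1):
    enhanced = enhancer(noisy)                          # (batch, T)
    # Acoustic supervision: maximize transcript likelihood under the
    # pre-trained acoustic model (here, frame-wise phone NLL). Preserves
    # phonetic content, but alone it over-fits and leaves artifacts.
    log_probs = acoustic_model(enhanced.unsqueeze(-1))  # (batch, T, N_PHONES)
    acoustic = F.nll_loss(log_probs.reshape(-1, N_PHONES),
                          phone_targets.reshape(-1))
    # Adversarial supervision: make enhanced speech look "clean" to the
    # discriminator. Alone, it can mode-collapse away from the noisy input.
    adv = F.binary_cross_entropy_with_logits(
        discriminator(enhanced), torch.ones(noisy.size(0), 1))
    # A proper weighting lets the two terms cover each other's failure modes.
    return acoustic + lambda_adv * adv

noisy = torch.randn(4, T)
phones = torch.randint(0, N_PHONES, (4, T))
print(aas_loss(noisy, phones).item())
```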
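For the third part, this is a minimal sketch of a frequency-wise multi-layer perceptron: one small MLP per frequency bin maps inter-mic ratios to demixing weights, with complex values represented as stacked real/imaginary parts rather than complex-valued layers. The per-bin sizes and the `FreqWiseMLP` name are assumptions; reference position regularization is indicated only in a comment.

```python
import torch
import torch.nn as nn

class FreqWiseMLP(nn.Module):
    def __init__(self, n_freq, n_mics, hidden=32):
        super().__init__()
        in_dim = 2 * (n_mics - 1)       # inter-mic ratios, re/im stacked
        out_dim = 2 * n_mics            # demixing weights, re/im stacked
        # One separate MLP per frequency bin: the position sensitivity of
        # the demixing weight grows with frequency, so bins are not shared.
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, out_dim))
            for _ in range(n_freq))

    def forward(self, imr):
        # imr: (batch, n_freq, 2*(n_mics-1)) inter-mic ratios per bin.
        # Both input and output are (approximately) source-independent.
        return torch.stack(
            [mlp(imr[:, f]) for f, mlp in enumerate(self.mlps)], dim=1)

# Reference position regularization (not shown) would additionally fix a
# reference so the target demixing weight is uniquely determined, reducing
# the training-target variance described in the abstract.
model = FreqWiseMLP(n_freq=257, n_mics=2)
w = model(torch.randn(4, 257, 2))       # -> (4, 257, 4) demixing weights
print(w.shape)
```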
Advisors
Kim, Daeshik; Lee, Soo-Young
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2020
Identifier
325007
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering, 2020.2, [vii, 101 p.]

Keywords

Sentence representation; Out-of-vocabulary; Dialog context; Clean-free speech enhancement; Source/Position robustness

URI
http://hdl.handle.net/10203/284231
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=909488&flag=dissertation
Appears in Collection
EE-Theses_Ph.D. (Doctoral Theses)
