DSpace at KOASAS: Voice activity detection and speech enhancement based on deep neural network with improved utilization of context information

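Voice activity detection and speech enhancement based on deep neural network with improved utilization of context information
(Korean title, translated: A study on voice activity detection and speech enhancement through improved utilization of context information in deep neural networks)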


Author: Kim, Juntae

Automatic speech recognition (ASR) is one of the key techniques for human-machine interaction through the human voice and has recently been deployed in voice search, car navigation, and artificial intelligence speakers. Although ASR accuracy has been greatly improved by deep-learning-based techniques, its consistency still cannot be guaranteed in real environments owing to unpredictable speaking timing, background noise, reverberation, and interfering speakers. To build robust ASR for real environments, various front-end systems, such as voice activity detection, speech enhancement, de-reverberation, and source separation, have been studied for decades. Conventionally, most of them depend on signal processing techniques, which have contributed to the robustness of ASR but still have limitations due to their modeling assumptions about the speech and noise environments. Recently, deep-learning-based front-end systems have outperformed the signal-processing-based ones. In this dissertation, we study and develop deep-learning-based techniques for two major sub-disciplines of front-end systems: single-microphone voice activity detection (VAD) and single-microphone speech enhancement (SE). Specifically, we focus on improving how our VAD and SE models utilize the context information within the speech signal, as context information is known to be a crucial asset for deep-learning-based, speech-related applications.

For VAD, the context information (CI) of the speech signal has been considered one of the key sources of information for detecting speech in a noisy signal. Although CI is a relevant asset for VAD, its usefulness can vary in unpredictable noise environments; that is, the relative importance of long-term and short-term CI can change with the noise type. Therefore, its usage should adapt to the noise type. This dissertation improves the use of context information with an adaptive context attention model (ACAM) and a novel training strategy for effective attention, which weights the most crucial parts of the context for proper classification. Experiments in real-world scenarios demonstrate that the proposed ACAM-based VAD outperforms the baseline VAD methods.

For SE, a novel neural network architecture called the two-stage network (TSN), with a multi-objective learning (MOL) method for an efficient boosting strategy (BS), is proposed to exploit various CI at a reasonable computational cost. BS is an ensemble method that uses multiple base predictions (MBPs) to obtain a better final prediction. Because MBPs are required, the computational cost and model size of BS-based methods exceed those of a single model. To address this, TSN first obtains MBPs from different CI using a single deep neural network. Then, to obtain a better final prediction, the convolution layers of TSN aggregate not only the MBPs but also auxiliary information such as contextual information, while adaptively filtering out unnecessary information, e.g., poor base predictions. In the training phase, MOL enables all stages of TSN to learn jointly, while allowing the TSN framework to embed a BS. Our experimental results confirm that the embedded BS leads the TSN to outperform other baseline methods with a reasonably low computational cost and model size.

Further, we propose auxiliary methods to carry the improvement in VAD over to ASR. As VAD is a frame-level classifier, it must be converted to an utterance-level classifier for ASR. To achieve this, an additional state transition model (STM) that cooperates with VAD is proposed; VAD combined with an STM is often referred to as end-point detection (EPD). Finally, we carry out an in-depth empirical analysis of the effect of the proposed EPD and SE on speech recognition performance.
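
The step from frame-level VAD decisions to utterance-level end points can be illustrated with a minimal sketch. This is not the STM proposed in the dissertation, whose details are not given in this record; it is a generic two-state machine with onset and hangover counters, a common textbook scheme. The function name `detect_endpoints` and the parameters `threshold`, `min_speech`, and `hangover` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def detect_endpoints(
    frame_probs: Sequence[float],
    threshold: float = 0.5,   # assumed decision threshold on the VAD output
    min_speech: int = 3,      # assumed number of consecutive speech frames to open a segment
    hangover: int = 5,        # assumed number of non-speech frames tolerated inside a segment
) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into (start, end) frame indices.

    Generic illustration only; both indices are inclusive.
    """
    segments: List[Tuple[int, int]] = []
    in_speech = False
    start = 0
    run = 0       # consecutive speech frames seen while in the silence state
    silence = 0   # consecutive non-speech frames seen while in the speech state

    for i, p in enumerate(frame_probs):
        is_speech = p >= threshold
        if not in_speech:
            run = run + 1 if is_speech else 0
            if run >= min_speech:
                # Enough evidence: open a segment at the first frame of the run.
                in_speech = True
                start = i - run + 1
                silence = 0
        else:
            silence = 0 if is_speech else silence + 1
            if silence > hangover:
                # Hangover exceeded: close the segment at the last speech frame.
                segments.append((start, i - silence))
                in_speech = False
                run = 0

    if in_speech:
        # Input ended while still in speech: close at the last speech frame.
        segments.append((start, len(frame_probs) - 1 - silence))
    return segments


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.9, 0.8, 0.9, 0.3, 0.9, 0.1, 0.1, 0.1, 0.1]
    print(detect_endpoints(probs, min_speech=2, hangover=2))
```

In this sketch the short dip at frame 5 is bridged by the hangover counter, so the frames are grouped into one utterance instead of two. A learned or probabilistic transition model could replace the fixed counters.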

Advisors: Hahn, Minsoo (한민수)

Description: Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering

Publisher: Korea Advanced Institute of Science and Technology (KAIST)

Issue Date: 2019

Identifier: 325007

Language: eng

Description: Doctoral thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering, 2019.8, [iv, 83 p.]

Keywords: voice activity detection; end-point detection; speech enhancement; speech recognition (Korean keywords, translated: voice detector; sound quality enhancement; speech recognition)

URI: http://hdl.handle.net/10203/283309

Link: http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=871484&flag=dissertation

Appears in Collection: EE-Theses_Ph.D. (Doctoral theses)

Files in This Item: There are no files associated with this item.
