Embedding approach based speech enhancement robust for noise and speaker variability

Noise and speaker variations can degrade the performance of deep learning-based speech enhancement (SE) systems. One way to overcome this is to train the SE model adaptively, supplying information about the background noise or the speaker during training so that the model can produce results suited to unseen noise or speakers at inference. To make the SE system robust to these variations, we propose two types of embedding that represent information about the background noise and the speaker. To improve the representational ability of each embedding, voice activity detection (VAD) is applied ahead of SE. The speech presence probability obtained from VAD is used to focus on non-speech frames when extracting the noise-related embedding, the dynamic noise embedding (DNE), and on speech frames when extracting the speaker-related embedding, the deep speaker embedding (DSE). This approach also flexibly resolves the chicken-and-egg problem associated with the order in which VAD and SE are applied. For VAD, three types of attention module are proposed to improve performance. The temporal attention (TA) and frequential attention (FA) modules produce attention vectors over the temporal and frequency dimensions, respectively. These attention vectors improve VAD performance by concentrating on the important components of the hidden states. The dual attention (DA) module, which uses both, shows the best results and is used to show the correlation between VAD and SE. Experiments are conducted on the TIMIT dataset for a single-channel denoising task, with a convolutional recurrent neural network (CRNN) as the baseline. Experimental results show that the DNE and DSE play an important role in the SE model, increasing the quality and intelligibility of the corrupted speech signal even when the noise and speaker are unseen.
In addition, ablation studies show that applying the proposed attention modules to the VAD model improves not only VAD performance but also SE performance.
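The way the speech presence probability steers the two embeddings can be illustrated with a small sketch. This is not the thesis's implementation: the record does not describe the embedding networks, so the example assumes the simplest possible form, a weighted average of frame-level features, and the function name `spp_weighted_pooling` and the toy data are hypothetical.

```python
import numpy as np


def spp_weighted_pooling(frames, weights, eps=1e-8):
    """Average frame-level features, weighting each frame by `weights`.

    Hypothetical helper for illustration only.
    frames:  array of shape (T, D), one feature vector per frame.
    weights: array of shape (T,), one non-negative weight per frame.
    """
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Broadcast the per-frame weight over the feature dimension, then normalize.
    return (weights[:, None] * frames).sum(axis=0) / (weights.sum() + eps)


# Toy input (assumed values): 6 frames of 4-dimensional features and a
# speech presence probability (SPP) per frame, as a VAD model might output.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
spp = np.array([0.05, 0.10, 0.90, 0.95, 0.85, 0.10])

# Speaker-related summary: emphasize frames that are likely speech.
speaker_summary = spp_weighted_pooling(features, spp)
# Noise-related summary: emphasize frames that are likely non-speech.
noise_summary = spp_weighted_pooling(features, 1.0 - spp)
```

Because the weights are soft probabilities rather than hard speech/non-speech decisions, an uncertain VAD output only shifts the emphasis instead of discarding frames outright.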
Advisors
Kim, Hoirin (김회린)
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2021
Identifier
325007
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering, 2021.2, [iv, 52 p.]

Keywords

Speech Enhancement; Voice Activity Detection; Noise Embedding; Speaker Embedding; Attention

URI
http://hdl.handle.net/10203/295968
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=948741&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's theses)
