A unified deep learning framework for short-duration speaker verification in adverse environments

Speaker verification (SV) has recently attracted considerable research interest due to the growing popularity of speech-based virtual assistants. At the same time, the requirements on an SV system are increasing: it should be robust to short speech segments, especially in noisy and reverberant environments. In this dissertation, we consider one more requirement that matters in practical applications: the SV system should be robust to audio streams containing long non-speech segments, to which voice activity detection (VAD) has not been applied. To meet these requirements, we propose Group Speaker, feature pyramid module (FPM)-based multi-scale aggregation (MSA), and self-adaptive soft VAD (SAS-VAD). To deal with short speech segments in noisy and reverberant environments, we present Group Speaker and FPM-based MSA. First, Group Speaker incorporates group information about speakers into deep speaker embedding learning by learning group embeddings. After aggregating multiple group embeddings into a single embedding vector, we combine it with a deep speaker embedding to generate the final speaker embedding, called the group-aware speaker embedding. With this additional group information, we reduce the set of speaker candidates that the speaker embedding must discriminate among, which makes short utterances easier to handle. Second, MSA, which utilizes multi-scale features from different layers of the feature extractor, was recently introduced and shows superior performance on utterances of variable duration. To further increase robustness to utterances of arbitrary duration, we improve MSA with the FPM. The module enhances the speaker-discriminative information in features from multiple layers via a top-down pathway and lateral connections. We then extract speaker embeddings from the enhanced features, which contain rich speaker information at different time scales. Third, we use SAS-VAD to increase robustness to long non-speech segments. SAS-VAD combines soft VAD with self-adaptive VAD. The soft VAD performs a soft selection of the frame-level features extracted by the speaker feature extractor: each frame-level feature is weighted by the speech posterior estimated for that frame by the DNN-based VAD, and the weighted features are aggregated into a speaker embedding. The self-adaptive VAD fine-tunes the pre-trained VAD on the speaker verification data to reduce domain mismatch. Fourth, we apply a masking-based speech enhancement (SE) method to further improve robustness to acoustic distortions (i.e., noise and reverberation). Finally, we combine the SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner. To the best of our knowledge, this is the first work to combine these three models in a single deep learning framework. We conduct experiments on the Korean indoor (KID) and VoxCeleb datasets corrupted by noise and reverberation. The results show that the proposed methods are effective for SV under these challenging conditions and outperform the baseline i-vector and deep speaker embedding systems.
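The abstract describes the FPM only at a high level (a top-down pathway with lateral connections that enhances multi-layer features before embedding extraction). The following is a minimal, hypothetical PyTorch sketch of such a module; the class name, channel sizes, and use of 1x1 convolutions with nearest-neighbor upsampling are illustrative assumptions, not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramidModule(nn.Module):
    """Hypothetical sketch: enhance multi-layer features via lateral
    1x1 convolutions and a top-down pathway, in the spirit of FPM-based MSA."""

    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # one lateral projection per feature-extractor layer (low level first)
        self.laterals = nn.ModuleList(
            nn.Conv1d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, features):
        # features: list of (batch, channels_i, time_i) maps, low level first
        laterals = [proj(f) for proj, f in zip(self.laterals, features)]
        outs = [laterals[-1]]  # start from the highest-level (coarsest) map
        for lat in reversed(laterals[:-1]):
            # upsample the coarser map to the finer time resolution and add it
            top_down = F.interpolate(outs[0], size=lat.shape[-1], mode="nearest")
            outs.insert(0, lat + top_down)
        return outs  # enhanced multi-scale features for embedding extraction
```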
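The soft VAD's weighted aggregation can likewise be summarized in a few lines. Below is a minimal sketch assuming PyTorch tensors; the function name and tensor shapes are illustrative rather than taken from the dissertation.

```python
import torch

def soft_vad_pooling(frame_feats, speech_posteriors):
    """Aggregate frame-level features into an utterance-level embedding,
    weighting each frame by its VAD speech posterior (soft selection)."""
    # frame_feats: (batch, time, dim) features from the speaker feature extractor
    # speech_posteriors: (batch, time) per-frame speech probabilities from the DNN-based VAD
    weights = speech_posteriors.unsqueeze(-1)           # (batch, time, 1)
    weighted_sum = (weights * frame_feats).sum(dim=1)   # emphasize speech frames
    embedding = weighted_sum / weights.sum(dim=1).clamp(min=1e-6)
    return embedding                                    # (batch, dim)
```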
Advisors
Kim, Hoirin (김회린)
Description
Korea Advanced Institute of Science and Technology (KAIST) : School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Doctoral dissertation - Korea Advanced Institute of Science and Technology : School of Electrical Engineering, 2022.2, [iv, 59 p.]

URI
http://hdl.handle.net/10203/309098
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=996261&flag=dissertation
Appears in Collection
EE-Theses_Ph.D.(박사논문)
Files in This Item
There are no files associated with this item.
