Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition

Cited 14 times in Web of Science; cited 0 times in Scopus
In text-independent speaker recognition, each utterance is composed of different phonemes depending on the spoken text. Conventional neural networks for speaker recognition are static models, so they do not reflect this phoneme-varying characteristic well. To overcome this limitation, we propose an adaptive convolutional neural network (ACNN) for text-independent speaker recognition. The utterance is divided along the time axis into short segments within which the phonemes fluctuate little. Frame-level features are extracted by applying input-dependent kernels that adapt to each segment. Time average pooling and linear layers then perform utterance-level embedding extraction and speaker recognition. Adaptive VGG-M with 0.356-second segmentation outperforms the baseline models, achieving a Top-1 accuracy of 86.51% and an EER of 5.68%. It extracts more accurate frame-level embeddings for vowel and nasal phonemes than the conventional method, without overfitting or a large number of parameters. This framework effectively exploits the phoneme- and text-varying characteristics of speech for text-independent speaker recognition.
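The segment-wise adaptive convolution described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration of the idea, not the authors' implementation: the kernel-generating network (a pooled segment summary followed by a linear layer), the channel sizes, the 36-frame segment length, and the speaker-classifier head are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConv1d(nn.Module):
    """1-D convolution whose kernels are predicted from the input segment."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, kernel_size
        # Hypothetical kernel generator: pooled segment summary -> linear map.
        self.kernel_gen = nn.Linear(in_ch, out_ch * in_ch * kernel_size)

    def forward(self, seg):                      # seg: (B, in_ch, T)
        b = seg.size(0)
        summary = seg.mean(dim=-1)               # (B, in_ch)
        w = self.kernel_gen(summary)             # (B, out_ch*in_ch*k)
        w = w.view(b * self.out_ch, self.in_ch, self.k)
        # Grouped-conv trick: apply a different kernel set to each batch
        # element in a single conv1d call.
        x = seg.reshape(1, b * self.in_ch, -1)
        y = F.conv1d(x, w, padding=self.k // 2, groups=b)
        return y.view(b, self.out_ch, -1)        # (B, out_ch, T)

class ACNNEmbedder(nn.Module):
    """Segment-wise adaptive conv -> time average pooling -> linear layers."""
    def __init__(self, in_ch=40, out_ch=64, kernel_size=3,
                 seg_len=36, emb_dim=128, n_speakers=1211):
        super().__init__()
        self.seg_len = seg_len                   # frames per segment (assumed)
        self.adaptive_conv = AdaptiveConv1d(in_ch, out_ch, kernel_size)
        self.embed = nn.Linear(out_ch, emb_dim)  # utterance-level embedding
        self.classify = nn.Linear(emb_dim, n_speakers)

    def forward(self, feats):                    # feats: (B, in_ch, time)
        segs = feats.split(self.seg_len, dim=-1)
        # Frame-level features from kernels adapted to each segment.
        frames = torch.cat([self.adaptive_conv(s) for s in segs], dim=-1)
        utt = frames.mean(dim=-1)                # time average pooling
        emb = self.embed(utt)
        return self.classify(emb), emb

# Example: 2 utterances of 40-dim filterbank features, 300 frames each.
model = ACNNEmbedder()
logits, embedding = model(torch.randn(2, 40, 300))
print(logits.shape, embedding.shape)             # (2, 1211) (2, 128)
```

The grouped-convolution reshape lets every utterance in the batch use its own input-dependent kernels without a Python loop over batch elements; only the loop over segments remains.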
Publisher
International Speech Communication Association
Issue Date
2021-08-31
Language
English
Citation
INTERSPEECH 2021, pp. 641-645
ISSN
2308-457X
DOI
10.21437/Interspeech.2021-65
URI
http://hdl.handle.net/10203/291820
Appears in Collection
ME-Conference Papers (Conference Papers)