Analysis-Based Optimization of Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification

Cited 2 times in Web of Science, cited 0 times in Scopus
  • Hits: 86
  • Downloads: 0
DC Field | Value | Language
dc.contributor.author | Kim, Seong-Hu | ko
dc.contributor.author | Nam, Hyeonuk | ko
dc.contributor.author | Park, Yong-Hwa | ko
dc.date.accessioned | 2023-07-13T02:02:13Z | -
dc.date.available | 2023-07-13T02:02:13Z | -
dc.date.created | 2023-07-13 | -
dc.date.issued | 2023-06 | -
dc.identifier.citation | IEEE ACCESS, v.11, pp.60646 - 60659 | -
dc.identifier.issn | 2169-3536 | -
dc.identifier.uri | http://hdl.handle.net/10203/310480 | -
dc.description.abstract | Temporal dynamic convolutional neural networks (TDY-CNNs) extract speaker embeddings that account for the time-varying characteristics of speech, improving text-independent speaker verification performance. In this paper, we optimize TDY-CNNs based on a detailed analysis of the network architecture. The temporal dynamic convolution generates attention weights for its basis kernels from features formed by concatenating channel-averaged and frequency-averaged data, reducing network parameters by 26%. In addition, temporal dynamic convolutions replace vanilla convolutions only in the earlier layers, because the optimized temporal dynamic convolutions in later layers use a steady kernel regardless of the time-bin data. As a result, Opt-TDY-ResNet-34(x0.50) achieves the best speaker verification performance, with an EER of 1.07%, among speaker verification models trained without data augmentation, including ResNet-based baselines and other state-of-the-art networks. Moreover, we validate through several analyses that Opt-TDY-CNNs adapt to time-bin data. By comparing inter- and intra-phoneme distances of the attention weights, we confirmed that the temporal dynamic convolution uses different kernels depending on phoneme groups, which are directly related to the time-bin data. In addition, by applying gradient-weighted class activation mapping (Grad-CAM) to speaker verification to obtain speaker activation maps (SAMs), we showed that temporal dynamic convolution extracts speaker information from the frequency characteristics of time bins, such as phonemes' formant frequencies, whereas vanilla convolution extracts only a vague outline of the Mel-spectrogram. (Code sketches of both mechanisms follow this table.) | -
dc.language | English | -
dc.publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC | -
dc.title | Analysis-Based Optimization of Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification | -
dc.type | Article | -
dc.identifier.wosid | 001018648700001 | -
dc.identifier.scopusid | 2-s2.0-85162716004 | -
dc.type.rims | ART | -
dc.citation.volume | 11 | -
dc.citation.beginningpage | 60646 | -
dc.citation.endingpage | 60659 | -
dc.citation.publicationname | IEEE ACCESS | -
dc.identifier.doi | 10.1109/ACCESS.2023.3286034 | -
dc.contributor.localauthor | Park, Yong-Hwa | -
dc.description.isOpenAccess | N | -
dc.type.journalArticle | Article | -
dc.subject.keywordAuthor | Speaker verification | -
dc.subject.keywordAuthor | text-independent | -
dc.subject.keywordAuthor | temporal dynamic convolution | -
dc.subject.keywordAuthor | temporal data-dependent kernel | -
dc.subject.keywordPlus | RECOGNITION | -
dc.subject.keywordPlus | EMBEDDINGS | -
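The abstract above outlines two mechanisms that a short sketch can make concrete. First, the temporal dynamic convolution: per-time-bin attention weights over a set of basis kernels, with the attention input built by concatenating channel-averaged and frequency-averaged features. The following is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; the class name, the hidden size of 32, the default of six basis kernels, and the softmax normalization are illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDynamicConv2d(nn.Module):
    # Illustrative sketch of a temporal dynamic convolution, assuming
    # input shape (batch, in_ch, freq, time).
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=6, freq_bins=40):
        super().__init__()
        self.padding = kernel_size // 2
        # K basis kernels; a per-time-bin attention mixes their outputs.
        self.weight = nn.Parameter(
            0.02 * torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size))
        # Attention input per time bin: the channel-averaged spectrum
        # (freq_bins values) concatenated with the frequency-averaged
        # channels (in_ch values), as described in the abstract.
        self.attention = nn.Sequential(
            nn.Linear(freq_bins + in_ch, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, num_kernels),
        )

    def forward(self, x):
        chan_avg = x.mean(dim=1)                       # (B, F, T)
        freq_avg = x.mean(dim=2)                       # (B, C, T)
        feat = torch.cat([chan_avg, freq_avg], dim=1)  # (B, F + C, T)
        attn = F.softmax(self.attention(feat.transpose(1, 2)), dim=-1)  # (B, T, K)
        # By linearity of convolution, mixing the K conv outputs per time
        # bin equals convolving each time bin with its own attention-
        # weighted mixture of basis kernels.
        outs = torch.stack(
            [F.conv2d(x, w, padding=self.padding) for w in self.weight], dim=-1
        )                                              # (B, out_ch, F, T, K)
        return (outs * attn[:, None, None, :, :]).sum(dim=-1)

A quick shape check under these assumptions: TemporalDynamicConv2d(in_ch=1, out_ch=32, freq_bins=40)(torch.randn(2, 1, 40, 100)) returns a tensor of shape (2, 32, 40, 100).

Second, the abstract applies Grad-CAM to speaker verification to obtain speaker activation maps (SAMs). Below is a generic Grad-CAM sketch, again illustrative rather than the authors' code: model, target_layer, and score_fn are hypothetical placeholders for a speaker embedder, one of its convolutional layers, and a function that reduces the model output to the scalar score being explained (e.g., similarity to an enrolled speaker embedding).

import torch
import torch.nn.functional as F

def speaker_activation_map(model, target_layer, mel, score_fn):
    # Standard Grad-CAM: pool the gradients over (freq, time) to get
    # per-channel weights, take a weighted sum of the activations,
    # apply ReLU, and upsample back to the Mel-spectrogram resolution.
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = score_fn(model(mel))   # scalar score to explain
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    a, g = acts[0], grads[0]                         # (B, C, F', T')
    w = g.mean(dim=(2, 3), keepdim=True)             # channel importance
    cam = F.relu((w * a).sum(dim=1, keepdim=True))   # (B, 1, F', T')
    return F.interpolate(cam, size=mel.shape[-2:], mode="bilinear",
                         align_corners=False)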
Appears in Collection
ME-Journal Papers (Journal Papers)
Files in This Item
There are no files associated with this item.
This item is cited by other documents in WoS
  • Detail information in Web of Science®
  • Cited 2 items in WoS
