Analysis-Based Optimization of Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification

Cited 2 times in Web of Science, cited 0 times in Scopus
  • Hits: 86
  • Downloads: 0
DC Field | Value | Language
dc.contributor.author | Kim, Seong-Hu | ko
dc.contributor.author | Nam, Hyeonuk | ko
dc.contributor.author | Park, Yong-Hwa | ko
dc.date.accessioned | 2023-07-13T02:02:13Z | -
dc.date.available | 2023-07-13T02:02:13Z | -
dc.date.created | 2023-07-13 | -
dc.date.issued | 2023-06 | -
dc.identifier.citation | IEEE ACCESS, v.11, pp.60646 - 60659 | -
dc.identifier.issn | 2169-3536 | -
dc.identifier.uri | http://hdl.handle.net/10203/310480 | -
dc.description.abstract | Temporal dynamic convolutional neural networks (TDY-CNNs) extract speaker embeddings that account for the time-varying characteristics of speech, improving text-independent speaker verification performance. In this paper, we optimize TDY-CNNs based on a detailed analysis of the network architecture. The temporal dynamic convolution generates attention weights for its basis kernels from features formed by concatenating channel-averaged and frequency-averaged data, reducing network parameters by 26%. In addition, temporal dynamic convolutions replace vanilla convolutions only in the earlier layers, because the optimized temporal dynamic convolutions in later layers use a steady kernel regardless of the time-bin data. As a result, Opt-TDY-ResNet-34(x0.50) achieves the best speaker verification performance, with an EER of 1.07%, among speaker verification models trained without data augmentation, including ResNet-based baselines and other state-of-the-art networks. Moreover, we validate through several analyses that Opt-TDY-CNNs adapt to time-bin data. By comparing inter- and intra-phoneme distances of the attention weights, we confirmed that the temporal dynamic convolution uses different kernels depending on phoneme groups, which are directly related to the time-bin data. In addition, by applying gradient-weighted class activation mapping (Grad-CAM) to speaker verification to obtain speaker activation maps (SAMs), we showed that temporal dynamic convolution extracts speaker information from the frequency characteristics of time bins, such as phonemes' formant frequencies, whereas vanilla convolution extracts only a vague outline of the Mel-spectrogram. (Code sketches of both mechanisms follow this table.) | -
dc.language | English | -
dc.publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC | -
dc.title | Analysis-Based Optimization of Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification | -
dc.type | Article | -
dc.identifier.wosid | 001018648700001 | -
dc.identifier.scopusid | 2-s2.0-85162716004 | -
dc.type.rims | ART | -
dc.citation.volume | 11 | -
dc.citation.beginningpage | 60646 | -
dc.citation.endingpage | 60659 | -
dc.citation.publicationname | IEEE ACCESS | -
dc.identifier.doi | 10.1109/ACCESS.2023.3286034 | -
dc.contributor.localauthor | Park, Yong-Hwa | -
dc.description.isOpenAccess | N | -
dc.type.journalArticle | Article | -
dc.subject.keywordAuthor | Speaker verification | -
dc.subject.keywordAuthor | text-independent | -
dc.subject.keywordAuthor | temporal dynamic convolution | -
dc.subject.keywordAuthor | temporal data-dependent kernel | -
dc.subject.keywordPlus | RECOGNITION | -
dc.subject.keywordPlus | EMBEDDINGS | -
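The abstract above outlines two mechanisms that a short sketch can make concrete. First, the temporal dynamic convolution: per-time-bin attention weights over a set of basis kernels, with the attention input built by concatenating channel-averaged and frequency-averaged features. The following is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; the class name, the hidden size of 32, the default of six basis kernels, and the softmax normalization are illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDynamicConv2d(nn.Module):
    # Illustrative sketch of a temporal dynamic convolution, assuming
    # input shape (batch, in_ch, freq, time).
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=6, freq_bins=40):
        super().__init__()
        self.padding = kernel_size // 2
        # K basis kernels; a per-time-bin attention mixes their outputs.
        self.weight = nn.Parameter(
            0.02 * torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size))
        # Attention input per time bin: the channel-averaged spectrum
        # (freq_bins values) concatenated with the frequency-averaged
        # channels (in_ch values), as described in the abstract.
        self.attention = nn.Sequential(
            nn.Linear(freq_bins + in_ch, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, num_kernels),
        )

    def forward(self, x):
        chan_avg = x.mean(dim=1)                       # (B, F, T)
        freq_avg = x.mean(dim=2)                       # (B, C, T)
        feat = torch.cat([chan_avg, freq_avg], dim=1)  # (B, F + C, T)
        attn = F.softmax(self.attention(feat.transpose(1, 2)), dim=-1)  # (B, T, K)
        # By linearity of convolution, mixing the K conv outputs per time
        # bin equals convolving each time bin with its own attention-
        # weighted mixture of basis kernels.
        outs = torch.stack(
            [F.conv2d(x, w, padding=self.padding) for w in self.weight], dim=-1
        )                                              # (B, out_ch, F, T, K)
        return (outs * attn[:, None, None, :, :]).sum(dim=-1)

A quick shape check under these assumptions: TemporalDynamicConv2d(in_ch=1, out_ch=32, freq_bins=40)(torch.randn(2, 1, 40, 100)) returns a tensor of shape (2, 32, 40, 100).

Second, the abstract applies Grad-CAM to speaker verification to obtain speaker activation maps (SAMs). Below is a generic Grad-CAM sketch, again illustrative rather than the authors' code: model, target_layer, and score_fn are hypothetical placeholders for a speaker embedder, one of its convolutional layers, and a function that reduces the model output to the scalar score being explained (e.g., similarity to an enrolled speaker embedding).

import torch
import torch.nn.functional as F

def speaker_activation_map(model, target_layer, mel, score_fn):
    # Standard Grad-CAM: pool the gradients over (freq, time) to get
    # per-channel weights, take a weighted sum of the activations,
    # apply ReLU, and upsample back to the Mel-spectrogram resolution.
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = score_fn(model(mel))   # scalar score to explain
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    a, g = acts[0], grads[0]                         # (B, C, F', T')
    w = g.mean(dim=(2, 3), keepdim=True)             # channel importance
    cam = F.relu((w * a).sum(dim=1, keepdim=True))   # (B, 1, F', T')
    return F.interpolate(cam, size=mel.shape[-2:], mode="bilinear",
                         align_corners=False)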
Appears in Collection
ME-Journal Papers (Journal Papers)
Files in This Item
There are no files associated with this item.
This item is cited by other documents in WoS
  • Detail information in Web of Science®
  • Cited 2 items in WoS
