Comparison and Analysis of SampleCNN Architectures for Audio Classification

Cited 41 times in Web of Science · Cited 35 times in Scopus
  • Hits: 762
  • Downloads: 0
DC Field: Value (Language)
dc.contributor.author: Kim, Taejun (ko)
dc.contributor.author: Lee, Jongpil (ko)
dc.contributor.author: Nam, Juhan (ko)
dc.date.accessioned: 2019-06-12T07:50:19Z
dc.date.available: 2019-06-12T07:50:19Z
dc.date.created: 2019-06-12
dc.date.issued: 2019-05
dc.identifier.citation: IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, v.13, no.2, pp.285 - 297
dc.identifier.issn: 1932-4553
dc.identifier.uri: http://hdl.handle.net/10203/262580
dc.description.abstract: End-to-end learning with convolutional neural networks (CNNs) has become a standard approach in image classification. However, in audio classification, CNN-based models that use time-frequency representations as input are still popular. A recently proposed CNN architecture called SampleCNN takes raw waveforms directly and has very small filter sizes. The architecture has proven effective in music classification tasks. In this paper, we scrutinize SampleCNN further by comparing it with spectrogram-based CNNs and changing the subsampling operation in three different audio domains: music, speech, and acoustic scene sound. Also, we extend SampleCNN to more advanced versions using components from residual networks and squeeze-and-excitation networks. The results show that the squeeze-and-excitation block is particularly effective among them. Furthermore, we analyze the trained models to provide a better understanding of the architectures. First, we visualize hierarchically learned features to see how the filters with small granularity adapt to audio signals from different domains. Second, we observe the squeeze-and-excitation block by plotting the distribution of excitation in several different ways. This analysis shows that the excitation tends to become increasingly class-specific with depth, but the first layer, which takes raw waveforms directly, can already be highly class-specific, particularly in music data. We examine this further and show that the excitation in the first layer is sensitive to loudness, an acoustic characteristic that distinguishes different genres of music.
dc.language: English
dc.publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
dc.title: Comparison and Analysis of SampleCNN Architectures for Audio Classification
dc.type: Article
dc.identifier.wosid: 000468435500009
dc.identifier.scopusid: 2-s2.0-85065982131
dc.type.rims: ART
dc.citation.volume: 13
dc.citation.issue: 2
dc.citation.beginningpage: 285
dc.citation.endingpage: 297
dc.citation.publicationname: IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING
dc.identifier.doi: 10.1109/JSTSP.2019.2909479
dc.contributor.localauthor: Nam, Juhan
dc.description.isOpenAccess: N
dc.type.journalArticle: Article
dc.subject.keywordAuthor: Audio classification
dc.subject.keywordAuthor: end-to-end learning
dc.subject.keywordAuthor: convolutional neural networks
dc.subject.keywordAuthor: residual networks
dc.subject.keywordAuthor: squeeze-and-excitation networks
dc.subject.keywordAuthor: interpretability
dc.subject.keywordPlus: CONVOLUTIONAL NEURAL-NETWORKS
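
For readers who want a concrete picture of the architecture the abstract describes, below is a minimal PyTorch sketch of a SampleCNN-style building block operating on raw waveforms, extended with a squeeze-and-excitation (SE) module. The filter size of 3 and pooling factor of 3 follow the SampleCNN convention; the channel widths, SE reduction ratio, and the SE module's placement before the pooling layer are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
# Minimal sketch of a SampleCNN-style block with a squeeze-and-excitation
# (SE) extension, as characterized in the abstract. Illustrative only:
# channel widths and SE placement are assumptions, not the authors' code.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: pool each channel globally, then rescale it."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        squeeze = x.mean(dim=2)                     # global average over time
        excitation = self.fc(squeeze).unsqueeze(2)  # per-channel gates in (0, 1)
        return x * excitation                       # reweight the feature maps


class SampleCNNBlock(nn.Module):
    """Conv(filter size 3) -> BN -> ReLU [-> SE] -> max-pool(3)."""

    def __init__(self, in_ch: int, out_ch: int, use_se: bool = True):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.se = SEBlock(out_ch) if use_se else nn.Identity()
        self.pool = nn.MaxPool1d(3)  # subsampling: time axis shrinks 3x per block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.se(self.relu(self.bn(self.conv(x)))))


if __name__ == "__main__":
    wave = torch.randn(1, 1, 16000)  # one second of 16 kHz raw waveform
    model = nn.Sequential(
        nn.Conv1d(1, 64, kernel_size=3, stride=3),  # strided "frame" layer on samples
        SampleCNNBlock(64, 64),
        SampleCNNBlock(64, 128),
    )
    print(model(wave).shape)  # torch.Size([1, 128, 592])
```

Stacking many such blocks shrinks the time axis geometrically, which is what lets sample-level filters cover long waveform inputs; the residual variant mentioned in the abstract would add a skip connection around the convolutional path.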
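
The abstract also describes analyzing trained models by plotting the distribution of excitation values. One straightforward way to capture those per-channel gates at inference time is a PyTorch forward hook; the sketch below assumes the SEBlock and SampleCNNBlock classes from the previous snippet and is, again, an illustration rather than the authors' analysis code.

```python
# Capturing SE excitation values with forward hooks, for the kind of
# per-layer excitation analysis the abstract describes. Assumes SEBlock
# and SampleCNNBlock from the previous sketch are in scope; illustrative only.
import torch
import torch.nn as nn


def collect_excitations(model: nn.Module, wave: torch.Tensor) -> dict:
    """Run `wave` through `model` and return each SE module's gate values."""
    records, hooks = {}, []
    for name, module in model.named_modules():
        if isinstance(module, SEBlock):
            # Hook the inner fc stack so the captured output is the sigmoid
            # gates themselves, shaped (batch, channels).
            def hook(mod, inputs, output, name=name):
                records[name] = output.detach()
            hooks.append(module.fc.register_forward_hook(hook))
    with torch.no_grad():
        model(wave)
    for h in hooks:
        h.remove()
    return records


if __name__ == "__main__":
    model = nn.Sequential(
        nn.Conv1d(1, 64, kernel_size=3, stride=3),
        SampleCNNBlock(64, 64),
        SampleCNNBlock(64, 128),
    )
    for name, gates in collect_excitations(model, torch.randn(1, 1, 16000)).items():
        print(name, gates.mean().item())  # average gate per SE layer
```

Averaging these gates over many inputs of the same class, and comparing the per-class averages layer by layer, is the sort of excitation-distribution plot the abstract refers to.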
Appears in Collection
GCT-Journal Papers (저널논문)
Files in This Item
There are no files associated with this item.