Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging

Music auto-tagging is often handled in a similar manner to image classification by regarding the two-dimensional audio spectrogram as image data. However, music auto-tagging differs from image classification in that the tags are highly diverse and have different levels of abstraction. Considering this issue, we propose a convolutional neural network (CNN)-based architecture that embraces multi-level and multi-scale features. The architecture is trained in three steps. First, we conduct supervised feature learning to capture local audio features using a set of CNNs with different input sizes. Second, we extract audio features from each layer of the pretrained convolutional networks separately and aggregate them all together, given a long audio clip. Finally, we feed them into fully connected networks and make final predictions of the tags. Our experiments show that combining multi-level and multi-scale features is highly effective in music auto-tagging and that the proposed method outperforms the previous state-of-the-art methods on the MagnaTagATune dataset and the Million Song Dataset. We further show that the proposed architecture is useful in transfer learning.
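The second step (per-layer feature extraction followed by aggregation over a long clip) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify how features are pooled, so average pooling over segments is an assumption here, and the function names (`average_pool`, `aggregate_clip`), the dictionary layout, and the toy values are all hypothetical.

```python
def average_pool(segment_features):
    """Average a list of per-segment feature vectors into one vector.

    Assumption: average pooling; the abstract does not name the pooling method.
    """
    n_segments = len(segment_features)
    n_dims = len(segment_features[0])
    return [
        sum(segment[d] for segment in segment_features) / n_segments
        for d in range(n_dims)
    ]


def aggregate_clip(features):
    """Build one clip-level vector from multi-scale, multi-level features.

    features[scale][layer] holds the per-segment feature vectors taken from
    one layer of the CNN pretrained with that input size (hypothetical layout).
    Each (scale, layer) group is pooled over segments, then all pooled
    vectors are concatenated in sorted key order.
    """
    clip_vector = []
    for scale in sorted(features):
        for layer in sorted(features[scale]):
            clip_vector.extend(average_pool(features[scale][layer]))
    return clip_vector


# Toy example: two input scales, two layers each, made-up activations.
toy_features = {
    "scale_a": {
        "layer1": [[1.0, 3.0], [3.0, 5.0]],  # two segments, 2-dim features
        "layer2": [[0.0], [2.0]],            # two segments, 1-dim features
    },
    "scale_b": {
        "layer1": [[4.0, 4.0]],              # one longer segment
        "layer2": [[6.0]],
    },
}

clip_vector = aggregate_clip(toy_features)
print(clip_vector)
```

In the third step described above, a vector such as `clip_vector` would be the input to the fully connected networks that predict the tags; that classifier is left out of this sketch.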
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Issue Date
2017-08
Language
English
Citation

IEEE SIGNAL PROCESSING LETTERS, v.24, no.8, pp.1208 - 1212

ISSN
1070-9908
DOI
10.1109/LSP.2017.2713830
URI
http://hdl.handle.net/10203/225076
Appears in Collection
GCT-Journal Papers