Learning hierarchical representations for music classification

Music auto-tagging is often handled in a similar manner to image classification by regarding the 2D audio spectrogram as image data. However, music auto-tagging differs from image classification in that the tags are highly diverse and have different levels of abstraction. Considering this issue, we propose a convolutional neural network (CNN)-based architecture that embraces multi-level and multi-scale features. The architecture is trained in three steps. First, we conduct supervised feature learning to capture local audio features using a set of CNNs with different input sizes. Second, we extract audio features from each layer of the pre-trained convolutional networks separately and aggregate them over a long audio clip. Finally, we feed them into fully connected networks and make final predictions of the tags. Our experiments show that the combination of multi-level and multi-scale features is highly effective in music auto-tagging, and the proposed method outperforms previous state-of-the-art methods. We further show that the proposed architecture is useful in transfer learning.

Furthermore, the end-to-end approach that learns hierarchical representations from raw data using deep convolutional neural networks has recently been explored successfully in the image, text, and speech domains. This approach has also been applied to musical signals, but has not been fully explored yet. To this end, we propose sample-level deep convolutional neural networks that learn representations from very small grains of waveforms (e.g., 2 or 3 samples), going beyond typical frame-level input representations. Our experiments show how deep architectures with sample-level filters improve accuracy in music auto-tagging and provide results comparable to previous state-of-the-art performance on the MagnaTagATune dataset and the Million Song Dataset. In addition, we visualize the filters learned in each layer of a sample-level DCNN to identify hierarchically learned features, and show that they become sensitive to log-scaled frequency along the layers, similar to the mel-frequency spectrogram that is widely used in music classification systems.
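As a rough illustration of the sample-level idea described above (not the exact thesis architecture), the following is a minimal PyTorch sketch of a 1-D CNN that stacks many small, length-3 filters directly on raw waveform samples and downsamples by 3 at every block; the block count, filter width, channel count, tag count, and input length are assumptions made purely for illustration.

    import torch
    import torch.nn as nn

    class SampleLevelCNN(nn.Module):
        """Sketch of a sample-level DCNN on raw waveforms.
        Assumed hyperparameters: 9 conv blocks, length-3 filters,
        128 channels, 50 output tags, 59049-sample input."""
        def __init__(self, n_tags=50, n_filters=128, n_blocks=9):
            super().__init__()
            blocks, in_ch = [], 1
            for _ in range(n_blocks):
                blocks += [
                    # small length-3 filter over raw samples
                    nn.Conv1d(in_ch, n_filters, kernel_size=3, stride=1, padding=1),
                    nn.BatchNorm1d(n_filters),
                    nn.ReLU(),
                    nn.MaxPool1d(3),  # downsample the time axis by 3 per block
                ]
                in_ch = n_filters
            self.features = nn.Sequential(*blocks)
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool1d(1),
                nn.Flatten(),
                nn.Linear(n_filters, n_tags),
                nn.Sigmoid(),  # independent per-tag probabilities (multi-label)
            )

        def forward(self, x):
            # x: (batch, 1, 59049) raw waveform samples
            return self.head(self.features(x))

    # Usage sketch: predict tag probabilities for a batch of random waveforms.
    tags = SampleLevelCNN()(torch.randn(4, 1, 59049))  # -> shape (4, 50)

The key design point conveyed by the thesis is that the filters operate on a handful of raw samples rather than on frame-level spectrogram columns, so hierarchical frequency selectivity must be learned across the stacked layers.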
Advisors
Nam, Ju Han
Description
Korea Advanced Institute of Science and Technology : Graduate School of Culture Technology
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2017
Identifier
325007
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology : Graduate School of Culture Technology, 2017.8, [iv, 28 p.]

Keywords

convolutional neural networks; feature aggregation; music auto-tagging; transfer learning; raw waveforms; sample-level filters

URI
http://hdl.handle.net/10203/242939
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=718574&flag=dissertation
Appears in Collection
GCT-Theses_Master (Master's theses)
Files in This Item
There are no files associated with this item.
