Semantic analysis and applications of vocal characteristics in music using deep learning

The human voice has been the primary musical instrument throughout human history. As the expressive and perceptual capacity of the singing voice has developed, its ability to convey complex and subtle feelings has come to play the most important role in most music genres, especially popular music. It has therefore been a significant topic in the music information retrieval (MIR) community. However, traditional studies of the high-level characteristics of the singing voice have struggled to model its complex space, mainly because of the multifaceted nature of music and the lack of vocal-specific data. The recent advent of deep learning and its application to MIR have made it possible to capture complex information in robust representations, but they have also created a need for extensive data. This dissertation covers nearly the entire deep learning research process, from foundations to practical applications, to demonstrate the necessity of singing voice information.

The research consists of three consecutive parts. The first part concerns the construction of a semantic tag dataset of K-pop songs that focuses on the singing voice. Because human feelings involve multiple levels of abstraction and complex cognitive processes, an effective dataset must represent the qualitative characteristics of the singing voice in a human-centered way. A popular approach is to describe the complex information in music with semantic annotations, or tags. However, previous music tag datasets do not provide extensive vocal-specific labels for commercial songs. The presented dataset has several notable advantages for singing voice research: its tags describe vocal characteristics, and this focus was made explicit during labeling; its labels are assigned to segments roughly 10 seconds long, capturing temporal variation; and its tags and artists were collected from professional vocal reviews and selected by experts.
The dataset is, to our knowledge, the first and only extensive dataset focused on the singing voice. Its suitability and advantages are demonstrated by the analysis and experiments that follow.

The second part presents a statistical analysis of the dataset and tag prediction tests with deep neural network (DNN) models. This part aims to verify the sanity and appropriateness of the dataset and to reveal its other significant properties. The statistical analysis computes global statistics, such as tag frequency and agreement among annotators, as well as within-song characteristics, such as temporal activation and intra-song frequency, to uncover the static and dynamic aspects of each tag. DNN models are then trained to predict tag activations from audio input. The results show that the characteristics of the singing voice can be learned with deep learning, and the properties found in these tests agree with the earlier statistical analysis and with human understanding of the tags. Possible applications of the dataset and models are also suggested.

The third part presents a deeper exploration of deep music representations that incorporate vocal information. Deep representation learning trains a highly complex model on massive amounts of data to construct a generalized representation that contains the essential information of a specific data domain. Research on deep representations in MIR is still at an early stage, and recent work suggests that general representations may fail to capture multifaceted aspects of music, such as the subtleties of the singing voice, because they favor overall musical information. Three novel and readily applicable ideas for augmenting deep music representations with singing voice information are proposed and applied to representations developed with deep metric learning. They were evaluated on two different target tasks.
The results show that the proposed techniques can improve music representations on vocal-related target tasks.

This dissertation takes a holistic approach to applying modern machine learning methods to the qualitative characteristics of the singing voice. It demonstrates the value of singing voice research and the prospects for its applications, and it provides foundations for related research. It combines human-centered and data-driven approaches by building a semantic dataset and exploring deep representations. The results show that the characteristics of the singing voice can be modeled with deep learning methods and that these models can be improved by the proposed techniques.
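The abstract names tag frequency, temporal activation, and intra-song frequency as statistics computed in the second part, but does not define them. The sketch below is a minimal illustration under assumed definitions: global frequency is the share of all segments carrying a tag, intra-song frequency is the mean share of a song's segments carrying the tag (over songs where it appears at all), and temporal activation is a 0/1 curve over a song's consecutive segments. The data layout, the function names, and the tag names ("breathy", "husky", "powerful") are hypothetical, not taken from the dissertation.

```python
from collections import defaultdict


def global_frequency(songs):
    """Fraction of all segments in the dataset that carry each tag."""
    counts = defaultdict(int)
    total = 0
    for segments in songs.values():
        for tags in segments:
            total += 1
            for tag in tags:
                counts[tag] += 1
    return {tag: n / total for tag, n in counts.items()}


def intra_song_frequency(songs):
    """Mean fraction of a song's segments carrying each tag, averaged only
    over the songs in which that tag appears at least once."""
    ratios = defaultdict(list)
    for segments in songs.values():
        per_song = defaultdict(int)
        for tags in segments:
            for tag in tags:
                per_song[tag] += 1
        for tag, n in per_song.items():
            ratios[tag].append(n / len(segments))
    return {tag: sum(r) / len(r) for tag, r in ratios.items()}


def temporal_activation(segments, tag):
    """0/1 activation of one tag over a song's consecutive segments."""
    return [1 if tag in tags else 0 for tags in segments]


# Hypothetical toy data: song -> list of ~10-second segments -> set of tags.
songs = {
    "song_a": [{"breathy", "husky"}, {"breathy"}, set(), {"powerful"}],
    "song_b": [{"breathy"}, {"powerful"}],
}
print(global_frequency(songs)["breathy"])               # 3 of 6 segments -> 0.5
print(intra_song_frequency(songs)["powerful"])          # mean of 1/4 and 1/2 -> 0.375
print(temporal_activation(songs["song_a"], "breathy"))  # [1, 1, 0, 0]
```

Under these assumed definitions, a tag with high global frequency but low intra-song frequency would be common across the dataset yet appear only briefly within each song, which is one way the static and dynamic aspects of a tag can be told apart.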
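The third part builds its representations with deep metric learning, but the abstract does not state which objective is used or how the three augmentation ideas work. As a hedged illustration of that family of methods only, the sketch below implements the triplet margin loss, a common deep metric learning objective, together with label-based triplet sampling. Grouping segments by singer identity is a hypothetical labeling choice made here for illustration. It is not presented as one of the dissertation's proposed techniques. The margin value of 0.2 is also an assumption.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: zero once the negative is at least `margin`
    farther from the anchor than the positive is, positive otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def sample_triplet(items, rng):
    """Pick an anchor and a positive that share a label, plus a negative
    with a different label. `items` is a list of (embedding, label) pairs;
    the labeling criterion (e.g. singer identity) is a hypothetical choice."""
    by_label = {}
    for emb, label in items:
        by_label.setdefault(label, []).append(emb)
    usable = [lab for lab, embs in by_label.items() if len(embs) >= 2]
    label = rng.choice(usable)
    anchor, positive = rng.sample(by_label[label], 2)
    negative = rng.choice([emb for emb, lab in items if lab != label])
    return anchor, positive, negative


# Toy 2-D embeddings labeled by a hypothetical singer ID.
items = [([0.0, 0.0], "singer_1"), ([0.0, 1.0], "singer_1"), ([3.0, 0.0], "singer_2")]
a, p, n = sample_triplet(items, random.Random(0))
print(triplet_loss(a, p, n))  # 0.0: d(a,p)=1 and d(a,n) >= 3, so 1 - 3 + 0.2 < 0
```

In a real model the embeddings would come from a neural network, and the loss would be averaged over many sampled triplets and minimized by gradient descent. Changing which segments count as positives is one simple lever for steering a representation toward or away from vocal information.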
Advisors
Nam, Juhan (남주한)
Description
Korea Advanced Institute of Science and Technology (KAIST): Graduate School of Culture Technology
Country
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2021
Identifier
325007
Language
eng
Article Type
Thesis (Ph.D.)
URI
http://hdl.handle.net/10203/294534
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=956564&flag=dissertation
Appears in Collection
GCT-Theses_Ph.D. (Ph.D. dissertations)
Files in This Item
There are no files associated with this item.
