EMOTIONAL VOICE CONVERSION USING MULTITASK LEARNING WITH TEXT-TO-SPEECH

Cited 14 time in webofscience Cited 0 time in scopus
  • Hit : 75
  • Download : 0
Voice conversion (VC) is a task that alters the voice of a person to suit different styles while conserving the linguistic content. Previous state-of-the-art technology used in VC was based on the sequence-to-sequence (seq2seq) model, which could lose linguistic information. There was an attempt to overcome this problem using textual supervision; however, this required explicit alignment, and therefore the benefit of using seq2seq model was lost. In this study, a voice converter that utilizes multitask learning with text-to-speech (TTS) is presented. By using multitask learning, VC is expected to capture linguistic information and preserve the training stability. This method does not require explicit alignment for capturing abundant text information. Experiments on VC were performed on a male-Korean-emotional-text-speech dataset to convert the neutral voice to emotional voice. It was shown that multitask learning helps to preserve the linguistic content.
Publisher
IEEE
Issue Date
2020-05
Language
English
Citation

IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.7774 - 7778

ISSN
1520-6149
DOI
10.1109/ICASSP40776.2020.9053255
URI
http://hdl.handle.net/10203/288365
Appears in Collection
EE-Conference Papers(학술회의논문)
Files in This Item
There are no files associated with this item.
This item is cited by other documents in WoS
⊙ Detail Information in WoSⓡ Click to see webofscience_button
⊙ Cited 14 items in WoS Click to see citing articles in records_button

qr_code

  • mendeley

    citeulike


rss_1.0 rss_2.0 atom_1.0