Multi-speaker Emotional Acoustic Modeling for CNN-based Speech Synthesis

Cited 25 times in Web of Science; cited 24 times in Scopus
In this paper, we investigate multi-speaker emotional acoustic modeling methods for a convolutional neural network (CNN) based speech synthesis system. For emotion modeling, we extend the speech synthesis system to learn a latent embedding space of emotion derived from a desired emotional identity, using either an emotion code or a mel-frequency spectrogram as the emotion identity. To model speaker variation in a text-to-speech (TTS) system, we use speaker representations such as a trainable speaker embedding and a speaker code. We implemented speech synthesis systems combining these speaker and emotion representations and compared them experimentally. Experimental results demonstrate that the multi-speaker emotional speech synthesis approach using a trainable speaker embedding and an emotion representation derived from the mel spectrogram achieves higher performance than the other approaches in terms of naturalness, speaker similarity, and emotion similarity.
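
As an illustration of the conditioning scheme summarized in the abstract, the sketch below shows one possible way to combine a trainable speaker embedding with an emotion embedding extracted from a reference mel spectrogram to condition a convolutional acoustic model. This is a minimal PyTorch sketch, not the authors' implementation; all module names, layer sizes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a trainable speaker embedding plus an
# emotion embedding derived from a reference mel spectrogram conditions a toy
# CNN acoustic model. Dimensions and layer choices are illustrative only.
import torch
import torch.nn as nn


class EmotionReferenceEncoder(nn.Module):
    """Summarizes a reference mel spectrogram into a fixed-size emotion vector."""

    def __init__(self, n_mels: int = 80, emotion_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(128, emotion_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, emotion_dim)
        h = self.conv(mel).mean(dim=-1)  # average-pool over time
        return torch.tanh(self.proj(h))


class MultiSpeakerEmotionalTTS(nn.Module):
    """Toy CNN acoustic model conditioned on speaker and emotion embeddings."""

    def __init__(self, n_symbols: int = 100, n_speakers: int = 10,
                 text_dim: int = 256, speaker_dim: int = 64,
                 emotion_dim: int = 64, n_mels: int = 80):
        super().__init__()
        self.text_embedding = nn.Embedding(n_symbols, text_dim)
        # Trainable lookup-table speaker embedding
        self.speaker_embedding = nn.Embedding(n_speakers, speaker_dim)
        self.emotion_encoder = EmotionReferenceEncoder(n_mels, emotion_dim)
        in_dim = text_dim + speaker_dim + emotion_dim
        self.decoder = nn.Sequential(  # stand-in for the CNN decoder
            nn.Conv1d(in_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, text_ids, speaker_ids, reference_mel):
        # text_ids: (batch, T), speaker_ids: (batch,), reference_mel: (batch, n_mels, frames)
        text = self.text_embedding(text_ids)               # (batch, T, text_dim)
        spk = self.speaker_embedding(speaker_ids)          # (batch, speaker_dim)
        emo = self.emotion_encoder(reference_mel)          # (batch, emotion_dim)
        cond = torch.cat([spk, emo], dim=-1).unsqueeze(1)  # broadcast over time
        x = torch.cat([text, cond.expand(-1, text.size(1), -1)], dim=-1)
        return self.decoder(x.transpose(1, 2))             # (batch, n_mels, T)


if __name__ == "__main__":
    model = MultiSpeakerEmotionalTTS()
    mel = model(torch.randint(0, 100, (2, 40)),  # dummy text ids
                torch.tensor([0, 3]),            # speaker ids
                torch.randn(2, 80, 120))         # reference mel spectrogram
    print(mel.shape)  # torch.Size([2, 80, 40])
```

In this sketch the speaker and emotion vectors are simply concatenated with the text embedding at every time step before the convolutional decoder; the paper compares such combinations of speaker representations (trainable embedding vs. speaker code) and emotion representations (emotion code vs. mel-spectrogram-derived embedding).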
Publisher
Institute of Electrical and Electronics Engineers Inc.
Issue Date
2019-05-17
Language
English
Citation

44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019, pp. 6950-6954

DOI
10.1109/ICASSP.2019.8683682
URI
http://hdl.handle.net/10203/269426
Appears in Collection
EE-Conference Papers (Conference Papers)
Files in This Item
There are no files associated with this item.