Meta-StyleSpeech : multi-speaker adaptive text-to-speech generation다중 화자 적응형 음성 합성 시스템을 위한 합성 방식 연구

Cited 0 time in webofscience Cited 0 time in scopus
  • Hit : 237
  • Download : 0
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, that are also short in length. However, existing methods either require to fine-tune the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN) which aligns gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech’s adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker’s voice with single short-duration (1~3 sec) speech audio, significantly outperforming baselines.
Advisors
Hwang, Sung Juresearcher황성주researcher
Description
한국과학기술원 :김재철AI대학원,
Publisher
한국과학기술원
Issue Date
2022
Identifier
325007
Language
eng
Description

학위논문(석사) - 한국과학기술원 : 김재철AI대학원, 2022.2,[iii, 26 p. :]

URI
http://hdl.handle.net/10203/308232
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=997690&flag=dissertation
Appears in Collection
AI-Theses_Master(석사논문)
Files in This Item
There are no files associated with this item.

qr_code

  • mendeley

    citeulike


rss_1.0 rss_2.0 atom_1.0