ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

DC Field: Value (Language)
dc.contributor.author: Kang, Minki (ko)
dc.contributor.author: Han, Wooseok (ko)
dc.contributor.author: Hwang, Sung Ju (ko)
dc.contributor.author: Yang, Eunho (ko)
dc.date.accessioned: 2023-12-12T07:00:53Z
dc.date.available: 2023-12-12T07:00:53Z
dc.date.created: 2023-12-08
dc.date.issued: 2023-08-22
dc.identifier.citation: 24th International Speech Communication Association, Interspeech 2023, pp.4339 - 4343
dc.identifier.uri: http://hdl.handle.net/10203/316285
dc.description.abstract: Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/.
dc.language: English
dc.publisher: International Speech Communication Association
dc.title: ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models
dc.type: Conference
dc.identifier.scopusid: 2-s2.0-85171580094
dc.type.rims: CONF
dc.citation.beginningpage: 4339
dc.citation.endingpage: 4343
dc.citation.publicationname: 24th International Speech Communication Association, Interspeech 2023
dc.identifier.conferencecountry: IE
dc.identifier.conferencelocation: Dublin
dc.identifier.doi: 10.21437/Interspeech.2023-754
dc.contributor.localauthor: Hwang, Sung Ju
dc.contributor.localauthor: Yang, Eunho
dc.contributor.nonIdAuthor: Kang, Minki
dc.contributor.nonIdAuthor: Han, Wooseok
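The abstract above names domain adversarial learning as one of the mechanisms that lets ZET-Speech generalize emotion control to unseen speakers. The paper's own implementation is not reproduced in this record; as a minimal, hypothetical sketch only, the PyTorch snippet below illustrates the generic gradient reversal construction commonly used for domain adversarial training. All names here (GradientReversal, AdversarialSpeakerHead, lambda_grl, the feature and speaker counts) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; reverses and scales gradients on backward.

    This is the standard building block of domain adversarial training: a
    classifier tries to predict the domain (here, hypothetically, speaker
    identity), while the reversed gradients push the upstream encoder to
    remove that information from its representation.
    """

    @staticmethod
    def forward(ctx, x, lambda_grl):
        ctx.lambda_grl = lambda_grl
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the encoder, scaled by lambda_grl.
        return -ctx.lambda_grl * grad_output, None


class AdversarialSpeakerHead(nn.Module):
    """Hypothetical speaker classifier attached through gradient reversal."""

    def __init__(self, feat_dim: int, num_speakers: int, lambda_grl: float = 1.0):
        super().__init__()
        self.lambda_grl = lambda_grl
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_speakers),
        )

    def forward(self, style_embedding: torch.Tensor) -> torch.Tensor:
        reversed_feat = GradientReversal.apply(style_embedding, self.lambda_grl)
        return self.classifier(reversed_feat)


if __name__ == "__main__":
    # Toy usage: a batch of 4 style embeddings of dimension 128, 10 training speakers.
    head = AdversarialSpeakerHead(feat_dim=128, num_speakers=10)
    emb = torch.randn(4, 128, requires_grad=True)
    logits = head(emb)
    loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (4,)))
    loss.backward()  # gradients reaching `emb` were reversed by GradientReversal
    print(logits.shape)  # torch.Size([4, 10])
```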
Appears in Collection
AI-Conference Papers (학술대회논문)
Files in This Item
There are no files associated with this item.
