DSpace at KOASAS: V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

DC Field	Value	Language
dc.contributor.author	Choi, Jeongsoo	ko
dc.contributor.author	Kim, Ji-Hoon	ko
dc.contributor.author	Li, Jinyu	ko
dc.contributor.author	Chung, Joon Son	ko
dc.contributor.author	Liu, Shujie	ko
dc.date.accessioned	2025-11-24T07:00:12Z	-
dc.date.available	2025-11-24T07:00:12Z	-
dc.date.created	2025-11-24	-
dc.date.issued	2025-04-10	-
dc.identifier.citation	2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025	-
dc.identifier.uri	http://hdl.handle.net/10203/336089	-
dc.description.abstract	In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground truth utterances.	-
dc.language	English	-
dc.publisher	Institute of Electrical and Electronics Engineers Inc.	-
dc.title	V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow	-
dc.type	Conference	-
dc.type.rims	CONF	-
dc.citation.publicationname	2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025	-
dc.identifier.conferencecountry	II	-
dc.identifier.doi	10.1109/ICASSP49660.2025.10889780	-
dc.contributor.localauthor	Chung, Joon Son	-
dc.contributor.nonIdAuthor	Li, Jinyu	-
dc.contributor.nonIdAuthor	Liu, Shujie	-
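
The abstract above mentions a rectified flow matching decoder that maps random noise to the target speech distribution. The following is a minimal Python sketch of the general rectified flow idea only, assuming the standard straight-line interpolation between a noise sample and a data sample and a fixed-step Euler solver. It is not the authors' implementation: the function names, the three-element toy vector, and the `oracle_velocity` stand-in for a trained Transformer are all hypothetical and chosen purely for illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for a velocity model: constant along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data: a single known "target" vector standing in for a speech feature frame.
target = [0.5, -1.0, 2.0]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]


def oracle_velocity(x, t):
    # Hypothetical stand-in for a trained network: returns the exact
    # straight-line velocity for this one (noise, target) pair.
    return velocity_target(noise, target)


sample = euler_sample(oracle_velocity, noise)
```

Because the velocity is constant along a straight path, the Euler solver in this toy setting reaches the target up to floating-point error regardless of the step count. A learned velocity field would only approximate this behaviour.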

Appears in Collection: EE-Conference Papers(학술회의논문)

Files in This Item: There are no files associated with this item.

KOASAS

Academic Information Development Team, KAIST 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. T. 82-42-350-4493 Email. koasas@kaist.ac.kr
Copyright © 2016. Korea Advanced Institute of Science and Technology. All Rights Reserved.