| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Choi, Jeongsoo | ko |
| dc.contributor.author | Kim, Ji-Hoon | ko |
| dc.contributor.author | Li, Jinyu | ko |
| dc.contributor.author | Chung, Joon Son | ko |
| dc.contributor.author | Liu, Shujie | ko |
| dc.date.accessioned | 2025-11-24T07:00:12Z | - |
| dc.date.available | 2025-11-24T07:00:12Z | - |
| dc.date.created | 2025-11-24 | - |
| dc.date.issued | 2025-04-10 | - |
| dc.identifier.citation | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 | - |
| dc.identifier.uri | http://hdl.handle.net/10203/336089 | - |
| dc.description.abstract | In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground truth utterances. | - |
| dc.language | English | - |
| dc.publisher | Institute of Electrical and Electronics Engineers Inc. | - |
| dc.title | V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow | - |
| dc.type | Conference | - |
| dc.type.rims | CONF | - |
| dc.citation.publicationname | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 | - |
| dc.identifier.conferencecountry | II | - |
| dc.identifier.doi | 10.1109/ICASSP49660.2025.10889780 | - |
| dc.contributor.localauthor | Chung, Joon Son | - |
| dc.contributor.nonIdAuthor | Li, Jinyu | - |
| dc.contributor.nonIdAuthor | Liu, Shujie | - |