AdaptVC: High Quality Voice Conversion with Adaptive Learning

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost the synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
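The abstract names a conditional flow matching decoder without giving its formulation. As a rough illustration only, the sketch below shows the generic flow matching recipe with a straight-line path from noise to data: a training pair is built by interpolation, the regression target is the path velocity, and sampling integrates the learned velocity with Euler steps. The function names, the straight-line path, and the omission of the content and speaker conditioning (including the cross-attention) are assumptions made for this sketch, not details taken from the paper.

```python
import random

def cfm_training_pair(x1, t, rng):
    """Build one flow matching training example (hypothetical helper).

    x1  : target feature vector, e.g. one acoustic frame, as a list of floats
    t   : time in [0, 1]
    rng : random.Random instance used to draw the noise sample
    Returns (x_t, u_t): the interpolated point and its target velocity.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # straight-line path
    u_t = [b - a for a, b in zip(x0, x1)]                   # constant path velocity
    return x_t, u_t

def cfm_loss(predicted_velocity, u_t):
    """Mean squared error between predicted and target velocity."""
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, u_t)) / len(u_t)

def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps.

    In a real decoder, velocity_fn would be a neural network that also
    receives content and speaker features; here it is any callable.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this sketch a model would be trained to minimize `cfm_loss` over random `t` and random noise, and few Euler steps at inference are what makes this family of decoders efficient in general.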
Publisher
Institute of Electrical and Electronics Engineers Inc.
Issue Date
2025-04-08
Language
English
Citation

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

DOI
10.1109/ICASSP49660.2025.10889396
URI
http://hdl.handle.net/10203/336112
Appears in Collection
EE-Conference Papers