Semantically complex audio to video generation with audio source separation

Recent advancements in artificial intelligence for audio-to-video generation have shown the ability to generate high-quality videos from audio, particularly by focusing on temporal semantics and magnitude. However, existing works struggle to capture all of the semantics in audio, as real-world audio often consists of mixed sources, making it challenging to generate semantically aligned videos. To solve this problem, we present a novel multi-source audio-to-video generation framework that incorporates decomposed multiple audio sources into video generative models. Specifically, our proposed Attention Mosaic directly maps each decomposed audio feature to the corresponding spatial attention feature. In addition, our condition injection module helps produce more natural contexts with non-audible objects by leveraging the knowledge of existing generative models. Our experiments show that the proposed framework achieves state-of-the-art performance against both multi- and single-source audio-to-video generation methods.
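The core idea described in the abstract, conditioning spatial attention on separated audio sources, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, shapes, and the use of plain scaled dot-product cross-attention (spatial patch queries attending over K separated source features) are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def audio_cross_attention(patch_queries, source_feats):
    """Hypothetical sketch: each spatial (patch) query attends over
    the K separated audio-source features, so different sources can
    influence different spatial regions. Names/shapes are assumed,
    not taken from the paper.
    patch_queries: (P, d)  -- P spatial patches
    source_feats:  (K, d)  -- K decomposed audio sources
    """
    d = patch_queries.shape[-1]
    scores = patch_queries @ source_feats.T / np.sqrt(d)  # (P, K)
    weights = softmax(scores, axis=-1)                    # rows sum to 1
    return weights @ source_feats                         # (P, d)

rng = np.random.default_rng(0)
P, K, d = 16, 3, 8  # 16 patches, 3 separated sources
queries = rng.standard_normal((P, d))
sources = rng.standard_normal((K, d))
out = audio_cross_attention(queries, sources)
print(out.shape)  # (16, 8)
```

Under this reading, each row of the attention weights selects which separated source drives a given spatial location, which matches the abstract's description of mapping decomposed audio features to spatial attention features.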
Publisher
PERGAMON-ELSEVIER SCIENCE LTD
Issue Date
2025-06
Language
English
Article Type
Article
Citation

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, v.149

ISSN
0952-1976
DOI
10.1016/j.engappai.2025.110457
URI
http://hdl.handle.net/10203/328762
Appears in Collection
AI-Journal Papers (Journal Papers); GCT-Journal Papers (Journal Papers)
Files in This Item
There are no files associated with this item.
