Sound-Guided Semantic Video Generation

Cited 2 times in Web of Science · Cited 0 times in Scopus
  • Hits: 110
  • Downloads: 0
DC Field                          Value                                         Language
dc.contributor.author             Lee, Seung Hyun                               ko
dc.contributor.author             Yoon, Sang Ho                                 ko
dc.contributor.author             Kim, Sangpil                                  ko
dc.contributor.author             Kim, Jinkyu                                   ko
dc.contributor.author             Oh, Gyeongrok                                 ko
dc.contributor.author             Byeon, Wonmin                                 ko
dc.contributor.author             Kim, Chanyoung                                ko
dc.contributor.author             Ryoo, Won Jeong                               ko
dc.contributor.author             Bae, Jihyun                                   ko
dc.contributor.author             Cho, Hyunjun                                  ko
dc.date.accessioned               2022-11-17T02:01:00Z                          -
dc.date.available                 2022-11-17T02:01:00Z                          -
dc.date.created                   2022-11-17                                    -
dc.date.issued                    2022-10-23                                    -
dc.identifier.citation            2022 European Conference on Computer Vision, pp. 34-50   -
dc.identifier.issn                978-3-031                                     -
dc.identifier.uri                 http://hdl.handle.net/10203/299776            -
dc.description.abstract           The recent success of StyleGAN demonstrates that the pre-trained StyleGAN latent space is useful for realistic video generation. However, the motion in the generated video is usually not semantically meaningful, because it is difficult to determine the direction and magnitude of movement in the StyleGAN latent space. In this paper, we propose a framework that generates realistic videos by leveraging a multimodal (sound-image-text) embedding space. Since sound provides the temporal context of a scene, our framework learns to generate a video that is semantically consistent with the sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate the CLIP-based multimodal embedding space to further capture audio-visual relationships. Finally, the proposed frame generator learns to find a trajectory in the latent space that is coherent with the corresponding sound, and generates a video in a hierarchical manner. We provide a new high-resolution landscape video dataset (audio-visual pairs) for the sound-guided video generation task. Experiments show that our model outperforms state-of-the-art methods in terms of video quality. We further demonstrate several applications, including image and video editing, to verify the effectiveness of our method. (See the illustrative sketch below this record.)   -
dc.language                       English                                       -
dc.publisher                      Springer                                      -
dc.title                          Sound-Guided Semantic Video Generation        -
dc.type                           Conference                                    -
dc.identifier.wosid               000904106100003                               -
dc.identifier.scopusid            2-s2.0-85142678353                            -
dc.type.rims                      CONF                                          -
dc.citation.beginningpage         34                                            -
dc.citation.endingpage            50                                            -
dc.citation.publicationname       2022 European Conference on Computer Vision   -
dc.identifier.conferencecountry   IL                                            -
dc.identifier.conferencelocation  Tel Aviv                                      -
dc.identifier.doi                 10.1007/978-3-031-19790-1_3                   -
dc.contributor.localauthor        Yoon, Sang Ho                                 -
dc.contributor.nonIdAuthor        Lee, Seung Hyun                               -
dc.contributor.nonIdAuthor        Kim, Sangpil                                  -
dc.contributor.nonIdAuthor        Kim, Jinkyu                                   -
dc.contributor.nonIdAuthor        Oh, Gyeongrok                                 -
dc.contributor.nonIdAuthor        Byeon, Wonmin                                 -
dc.contributor.nonIdAuthor        Kim, Chanyoung                                -
dc.contributor.nonIdAuthor        Ryoo, Won Jeong                               -
dc.contributor.nonIdAuthor        Bae, Jihyun                                   -
dc.contributor.nonIdAuthor        Cho, Hyunjun                                  -
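
The abstract describes a three-stage pipeline: invert the audio into the StyleGAN latent space, align it with a CLIP-based sound-image-text embedding, and walk a latent trajectory to synthesize frames. Below is a minimal PyTorch sketch of the first and last stages, assuming a 512-dimensional W space and toy stand-ins for the pre-trained generator and the sound inversion module; none of the module names, dimensions, or architectures come from the paper.

# Illustrative sketch only: toy stand-ins for the paper's sound inversion
# module and StyleGAN generator. All names and dimensions are assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 512      # StyleGAN W-space dimensionality (assumed)
N_MELS = 128          # mel-spectrogram bins (assumed)
N_FRAMES = 16         # number of video frames to generate

class SoundInverter(nn.Module):
    """Maps an audio spectrogram to a per-frame trajectory of latent offsets."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, 256, batch_first=True)   # temporal context
        self.to_latent = nn.Linear(256, LATENT_DIM)

    def forward(self, mel):                   # mel: (B, T, N_MELS)
        h, _ = self.rnn(mel)                  # (B, T, 256)
        return self.to_latent(h)              # (B, T, LATENT_DIM) offsets

class ToyGenerator(nn.Module):
    """Stand-in for a pre-trained StyleGAN synthesis network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 3 * 64 * 64), nn.Tanh())

    def forward(self, w):                     # w: (B, LATENT_DIM)
        return self.net(w).view(-1, 3, 64, 64)   # one low-res frame

inverter, generator = SoundInverter(), ToyGenerator()
w_anchor = torch.randn(1, LATENT_DIM)         # latent of the source image
mel = torch.randn(1, N_FRAMES, N_MELS)        # placeholder audio features

offsets = inverter(mel)                       # sound-guided latent trajectory
frames = [generator(w_anchor + offsets[:, t]) for t in range(N_FRAMES)]
video = torch.stack(frames, dim=1)            # (1, T, 3, 64, 64)
print(video.shape)

In the actual method, w_anchor would be obtained by inverting a source image and ToyGenerator would be a pre-trained StyleGAN guided by the CLIP-based multimodal embedding; the random tensors here only keep the sketch self-contained and runnable.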
Appears in Collection
GCT - Conference Papers
Files in This Item
There are no files associated with this item.