Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding

To understand the world around us, our brains are continuously inundated with multisensory information and its complex interactions at any given moment. While processing this information may seem effortless for humans, building a machine that performs similar tasks is challenging, because complex audio-visual interactions cannot be handled by a single type of integration and instead require more sophisticated approaches. In this paper, we propose a simple new method for multisensory integration in video understanding. Unlike previous works that use a single fusion type, we design a multi-head model with individual event-specific layers that handle different audio-visual relationships, enabling different ways of audio-visual fusion. Experimental results show that our event-specific layers can discover unique properties of the audio-visual relationships in videos, e.g., semantically matched moments and rhythmic events. Moreover, although our network is trained with single labels, our multi-head design inherently outputs additional, semantically meaningful multi-labels for a video. As an application, we demonstrate that our proposed method can expose the extent of event characteristics in popular benchmark datasets.
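The multi-head design described above lends itself to a compact illustration. The following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: each assumed event-specific head applies its own audio-visual fusion (here a simple gated element-wise product) and produces its own score, so stacking the head outputs yields per-event predictions that can also be read as multi-labels. All module names, the particular fusion operation, and the feature dimensions are assumptions made for illustration only.

import torch
import torch.nn as nn

class EventSpecificFusionHead(nn.Module):
    """One fusion head; each head may model a different audio-visual relationship."""
    def __init__(self, dim):
        super().__init__()
        self.audio_proj = nn.Linear(dim, dim)
        self.visual_proj = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, audio, visual):
        # Fuse projected audio and visual features (here: gated element-wise product),
        # then score the fused representation for this head's event type.
        fused = torch.tanh(self.audio_proj(audio)) * torch.tanh(self.visual_proj(visual))
        return self.classifier(fused).squeeze(-1)

class MultiHeadAVFusion(nn.Module):
    """Multi-head model: one event-specific fusion layer per assumed event type."""
    def __init__(self, dim=512, num_event_types=4):
        super().__init__()
        self.heads = nn.ModuleList(
            EventSpecificFusionHead(dim) for _ in range(num_event_types)
        )

    def forward(self, audio, visual):
        # Each head produces its own score; stacking them yields per-event-type
        # predictions that can be interpreted as multi-label outputs.
        return torch.stack([head(audio, visual) for head in self.heads], dim=-1)

if __name__ == "__main__":
    model = MultiHeadAVFusion(dim=512, num_event_types=4)
    audio = torch.randn(2, 512)   # batch of pooled audio features (assumed shape)
    visual = torch.randn(2, 512)  # batch of pooled visual features (assumed shape)
    scores = model(audio, visual)
    print(scores.shape)  # torch.Size([2, 4]): one score per event-specific head

In practice, a single ground-truth event label can supervise the corresponding head during training, while at inference the full vector of head scores provides the additional multi-label view mentioned in the abstract.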
Publisher
Institute of Electrical and Electronics Engineers Inc.
Issue Date
2023-01
Language
English
Citation

23rd IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2023), pp. 2236-2246

DOI
10.1109/WACV56688.2023.00227
URI
http://hdl.handle.net/10203/305987
Appears in Collection
EE-Conference Papers (Conference Papers)
