If we think about how we, as human beings, experience the world around us, we realize that we continuously
use all of our senses. Through these different sensory signals, we learn about and understand the scenes we encounter. Regardless of
whether a person sees an image of a “lion”, hears a “lion roaring” sound, or hears someone say the word “lion”,
the same response is triggered inside the human brain. Although human perception is inherently multimodal,
most existing models for understanding the scene around us deal with only a single modality, such as vision.
Thus, developing machine perception that exploits multimodal data is essential. Among these sensory signals,
the most dominant ones are arguably vision and audition. Sound is not only complementary to visual
information but also correlated with visual events: when we see a car moving, we hear its engine at
the same time.
In this thesis, I introduce computational models that discover the correspondence and the complementary information
between audio and visual signals. I present several tasks that benefit from such correspondence,
including sound source localization, audio-visual cross-modal retrieval, and audio-visually driven important moment
selection in videos. I propose effective self-supervised, semi-supervised, and weakly-supervised methods for
learning audio-visual correspondence. I also discuss the different kinds of relationships between audio and visual signals, since they do not
follow a single type of relationship, and leverage the two signals as complementary information to each other in
video understanding tasks by exploiting these different forms of audio-visual association.