Perfect Match: Self-Supervised Embeddings for Cross-Modal Retrieval

Cited 8 times in Web of Science; 0 times in Scopus
This paper proposes a new strategy for learning effective cross-modal joint embeddings using self-supervision. We set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant data in one domain given input in another. The method builds on recent advances in learning representations from cross-modal self-supervision using contrastive or binary cross-entropy loss functions. To investigate the robustness of the proposed learning strategy across multi-modal applications, we perform experiments on two tasks: audio-visual synchronisation and cross-modal biometrics. The audio-visual synchronisation task requires temporal correspondence between modalities to obtain a joint representation of phonemes and visemes, and the cross-modal biometrics task requires a common representation of speakers given their face images and audio tracks. Experiments show that the performance of systems trained with the proposed method far exceeds that of existing methods on both tasks, whilst allowing significantly faster training.
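The abstract describes learning joint embeddings so that the matching pair across two modalities scores highest among a set of candidates, trained with contrastive or cross-entropy objectives. The following is a minimal PyTorch sketch of one such multi-way cross-modal matching loss; it is not the authors' implementation, and the function and variable names (cross_modal_matching_loss, emb_a, emb_b, audio_emb, visual_emb) are illustrative assumptions.

import torch
import torch.nn.functional as F

def cross_modal_matching_loss(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Multi-way matching loss over a batch of paired cross-modal embeddings.

    emb_a: (N, D) embeddings from modality A (e.g. audio segments).
    emb_b: (N, D) embeddings from modality B (e.g. video frames or face images).
    Row i of emb_a is assumed to correspond to row i of emb_b.
    """
    # L2-normalise so that dot products are cosine similarities.
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)

    # (N, N) similarity matrix: entry (i, j) compares sample i of A with sample j of B.
    logits = emb_a @ emb_b.t()

    # The correct match for row i is column i; treat each row as an
    # N-way classification problem and apply softmax cross-entropy.
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for the outputs of two encoders.
audio_emb = torch.randn(8, 512)
visual_emb = torch.randn(8, 512)
loss = cross_modal_matching_loss(audio_emb, visual_emb)
print(loss.item())

In a sketch like this, the negative examples come for free from the other pairs in the batch, which is one reason such multi-way matching objectives can train faster than methods that mine negatives separately.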
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Issue Date
2020-03
Language
English
Article Type
Article
Citation

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, v.14, no.3, pp.568 - 576

ISSN
1932-4553
DOI
10.1109/JSTSP.2020.2987720
URI
http://hdl.handle.net/10203/289579
Appears in Collection
EE-Journal Papers (Journal Papers)
Files in This Item
There are no files associated with this item.
