CE-BART: Cause-and-Effect BART for Visual Commonsense Generation

Cited 1 time in Web of Science; cited 0 times in Scopus
  • Hit : 386
  • Download : 36
DC Field: Value (Language)
dc.contributor.author: Kim, Junyeong (ko)
dc.contributor.author: Hong, Ji Woo (ko)
dc.contributor.author: Yoon, Sunjae (ko)
dc.contributor.author: Yoo, Chang-Dong (ko)
dc.date.accessioned: 2022-12-22T03:01:10Z
dc.date.available: 2022-12-22T03:01:10Z
dc.date.created: 2022-12-21
dc.date.issued: 2022-12
dc.identifier.citation: SENSORS, v.22, no.23
dc.identifier.issn: 1424-8220
dc.identifier.uri: http://hdl.handle.net/10203/303483
dc.description.abstract: “A picture is worth a thousand words.” Given an image, humans can deduce various cause-and-effect captions of past, current, and future events beyond the image. The task of visual commonsense generation aims to generate three cause-and-effect captions for a given image: (1) what needed to happen before, (2) what the current intent is, and (3) what will happen after. However, this task is challenging for machines owing to two limitations of existing approaches: they (1) directly utilize conventional vision–language transformers to learn relationships between input modalities and (2) ignore relations among the target cause-and-effect captions, considering each caption independently. Herein, we propose Cause-and-Effect BART (CE-BART), which is based on (1) a structured graph reasoner that captures intra- and inter-modality relationships among visual and textual representations and (2) a cause-and-effect generator that produces cause-and-effect captions by considering the causal relations among inferences. We demonstrate the validity of CE-BART on the VisualCOMET and AVSD benchmarks. CE-BART achieved state-of-the-art (SOTA) performance on both benchmarks, while an extensive ablation study and qualitative analysis demonstrated the performance gain and improved interpretability. © 2022 by the authors.
dc.language: English
dc.publisher: MDPI
dc.title: CE-BART: Cause-and-Effect BART for Visual Commonsense Generation
dc.type: Article
dc.identifier.wosid: 000896359500001
dc.identifier.scopusid: 2-s2.0-85143667087
dc.type.rims: ART
dc.citation.volume: 22
dc.citation.issue: 23
dc.citation.publicationname: SENSORS
dc.identifier.doi: 10.3390/s22239399
dc.contributor.localauthor: Yoo, Chang-Dong
dc.contributor.nonIdAuthor: Kim, Junyeong
dc.description.isOpenAccess: Y
dc.type.journalArticle: Article
dc.subject.keywordAuthor: AVSD
dc.subject.keywordAuthor: deep learning
dc.subject.keywordAuthor: video-grounded dialogue
dc.subject.keywordAuthor: visual commonsense generation
dc.subject.keywordAuthor: VisualCOMET
dc.subject.keywordAuthor: visual–language reasoning
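
The abstract above describes visual commonsense generation as producing three cause-and-effect captions (before, intent, after) for one image. The sketch below only illustrates that task format using a generic, off-the-shelf BART model from Hugging Face Transformers; it is a minimal, hypothetical stand-in, not the CE-BART implementation (no structured graph reasoner, no visual features, and no causal decoding across inferences), and the event text and prompt format are assumptions made for illustration.

    # Minimal sketch of the VisualCOMET-style task format: one event
    # description, three relations (before / intent / after), one generated
    # caption per relation. Plain BART stand-in, not the CE-BART architecture.
    from transformers import BartTokenizer, BartForConditionalGeneration

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    # Hypothetical event text standing in for the image/visual features.
    event = "A man in a suit is running through a crowded airport terminal."

    for relation in ("before", "intent", "after"):
        # Encode the event together with the relation we want to infer.
        prompt = f"event: {event} relation: {relation}"
        inputs = tokenizer(prompt, return_tensors="pt")
        # Beam-search decoding; outputs are only meaningful after the model
        # has been fine-tuned on cause-and-effect data such as VisualCOMET.
        output_ids = model.generate(**inputs, max_new_tokens=20, num_beams=4)
        caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        print(f"{relation}: {caption}")
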
Appears in Collection
EE-Journal Papers (Journal Papers)
Files in This Item
  • 127443.pdf (1.68 MB)