DC Field | Value | Language |
---|---|---|
dc.contributor.author | Kim, Junyeong | ko |
dc.contributor.author | Hong, Ji Woo | ko |
dc.contributor.author | Yoon, Sunjae | ko |
dc.contributor.author | Yoo, Chang-Dong | ko |
dc.date.accessioned | 2022-12-22T03:01:10Z | - |
dc.date.available | 2022-12-22T03:01:10Z | - |
dc.date.created | 2022-12-21 | - |
dc.date.issued | 2022-12 | - |
dc.identifier.citation | SENSORS, v.22, no.23 | - |
dc.identifier.issn | 1424-8220 | - |
dc.identifier.uri | http://hdl.handle.net/10203/303483 | - |
dc.description.abstract | “A Picture is worth a thousand words”. Given an image, humans are able to deduce various cause-and-effect captions of past, current, and future events beyond the image. The task of visual commonsense generation aims to generate three cause-and-effect captions for a given image: (1) what needed to happen before, (2) what the current intent is, and (3) what will happen after. However, this task is challenging for machines, owing to two limitations: existing approaches (1) directly utilize conventional vision–language transformers to learn relationships between input modalities and (2) ignore relations among the target cause-and-effect captions, considering each caption independently. Herein, we propose Cause-and-Effect BART (CE-BART), which is based on (1) a structured graph reasoner that captures intra- and inter-modality relationships among visual and textual representations and (2) a cause-and-effect generator that generates cause-and-effect captions by considering the causal relations among inferences. We demonstrate the validity of CE-BART on the VisualCOMET and AVSD benchmarks. CE-BART achieved SOTA performance on both benchmarks, while an extensive ablation study and qualitative analysis demonstrated the performance gain and improved interpretability. © 2022 by the authors. | - |
dc.language | English | - |
dc.publisher | MDPI | - |
dc.title | CE-BART: Cause-and-Effect BART for Visual Commonsense Generation | - |
dc.type | Article | - |
dc.identifier.wosid | 000896359500001 | - |
dc.identifier.scopusid | 2-s2.0-85143667087 | - |
dc.type.rims | ART | - |
dc.citation.volume | 22 | - |
dc.citation.issue | 23 | - |
dc.citation.publicationname | SENSORS | - |
dc.identifier.doi | 10.3390/s22239399 | - |
dc.contributor.localauthor | Yoo, Chang-Dong | - |
dc.contributor.nonIdAuthor | Kim, Junyeong | - |
dc.description.isOpenAccess | Y | - |
dc.type.journalArticle | Article | - |
dc.subject.keywordAuthor | AVSD | - |
dc.subject.keywordAuthor | deep learning | - |
dc.subject.keywordAuthor | video-grounded dialogue | - |
dc.subject.keywordAuthor | visual commonsense generation | - |
dc.subject.keywordAuthor | VisualCOMET | - |
dc.subject.keywordAuthor | visual–language reasoning | - |