Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Cited 34 times in Web of Science; cited 0 times in Scopus
  • Hits: 449
  • Downloads: 0
DC Field / Value / Language
dc.contributor.author: Moon, Jong Hak (ko)
dc.contributor.author: Lee, Hyungyung (ko)
dc.contributor.author: Shin, Woncheol (ko)
dc.contributor.author: Kim, Young-Hak (ko)
dc.contributor.author: Choi, Yoonjae (ko)
dc.date.accessioned: 2022-12-15T09:00:11Z
dc.date.available: 2022-12-15T09:00:11Z
dc.date.created: 2022-12-03
dc.date.issued: 2022-12
dc.identifier.citation: IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, v.26, no.12, pp.6070 - 6080
dc.identifier.issn: 2168-2194
dc.identifier.uri: http://hdl.handle.net/10203/303063
dc.description.abstract: Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a BERT-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (diagnosis classification, medical image-report retrieval, medical visual question answering) and a vision-language generation task (radiology report generation). By statistically and rigorously evaluating the proposed model on four downstream tasks with three radiographic image-report datasets (MIMIC-CXR, Open-I, and VQA-RAD), we empirically demonstrate the superior downstream task performance of MedViLL against various baselines, including task-specific architectures. (An illustrative sketch of such an attention masking scheme appears after the metadata listing below.)
dc.language: English
dc.publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
dc.title: Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
dc.type: Article
dc.identifier.wosid: 000894943300028
dc.identifier.scopusid: 2-s2.0-85139447655
dc.type.rims: ART
dc.citation.volume: 26
dc.citation.issue: 12
dc.citation.beginningpage: 6070
dc.citation.endingpage: 6080
dc.citation.publicationname: IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS
dc.identifier.doi: 10.1109/JBHI.2022.3207502
dc.contributor.localauthor: Choi, Yoonjae
dc.contributor.nonIdAuthor: Lee, Hyungyung
dc.contributor.nonIdAuthor: Shin, Woncheol
dc.contributor.nonIdAuthor: Kim, Young-Hak
dc.description.isOpenAccess: N
dc.type.journalArticle: Article
dc.subject.keywordAuthor: Healthcare
dc.subject.keywordAuthor: medical
dc.subject.keywordAuthor: multimodal learning
dc.subject.keywordAuthor: representation learning
dc.subject.keywordAuthor: self-supervised learning
dc.subject.keywordAuthor: vision-and-language
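
Illustrative sketch of a multi-modal attention mask. The abstract refers to a multi-modal attention masking scheme that lets one BERT-style model handle both understanding and generation tasks. The sketch below is a minimal illustration (not the authors' code) of how such a mask can be built over a joint [image tokens; text tokens] sequence: a fully bidirectional mask for understanding tasks and a sequence-to-sequence mask, causal over the text positions, for report generation. The function name, token layout, and exact masking rules are assumptions for illustration and may differ from the scheme used by MedViLL.

import numpy as np

def build_multimodal_attention_mask(num_image_tokens: int,
                                    num_text_tokens: int,
                                    mode: str = "bidirectional") -> np.ndarray:
    """Return an (N, N) mask where 1 means "may attend" and 0 means "blocked".

    The joint sequence is assumed to be [image tokens ; text tokens].
    - "bidirectional": every token attends to every token (understanding tasks).
    - "seq2seq": image tokens attend only within the image region, while each
      text token attends to all image tokens and to text tokens up to and
      including its own position, enabling autoregressive report generation.
    """
    n = num_image_tokens + num_text_tokens
    if mode == "bidirectional":
        return np.ones((n, n), dtype=np.int64)

    if mode == "seq2seq":
        mask = np.zeros((n, n), dtype=np.int64)
        # Image tokens see the full image region.
        mask[:num_image_tokens, :num_image_tokens] = 1
        # Text tokens see all image tokens...
        mask[num_image_tokens:, :num_image_tokens] = 1
        # ...and only earlier (and current) text tokens, i.e. a causal mask.
        causal = np.tril(np.ones((num_text_tokens, num_text_tokens), dtype=np.int64))
        mask[num_image_tokens:, num_image_tokens:] = causal
        return mask

    raise ValueError(f"unknown mode: {mode!r}")

if __name__ == "__main__":
    # Example: 3 image tokens followed by 4 report tokens.
    print(build_multimodal_attention_mask(3, 4, mode="seq2seq"))

In practice, such a 0/1 mask is converted into an additive mask (0 for allowed positions, a large negative value for blocked ones) and added to the attention logits, so the same transformer weights yield bidirectional encoding for classification and retrieval and left-to-right decoding over the report tokens for generation.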
Appears in Collection
AI-Journal Papers (Journal Papers)
Files in This Item
There are no files associated with this item.