Toward Robust Visual Question Answering

Visual Question Answering (VQA) is an important task within the field of vision-and-language research. As a multimodal task, VQA is valuable in research as a means of measuring an AI model's joint understanding of visual input and natural language, and as a tool for diagnosing deep learning models. Beyond that, VQA has economic and social importance as an assistive technology for the situationally and visually impaired. However, even with the current advances of deep learning, deployable VQA models remain uncommon. In this thesis, motivated by the limits of VQA's real-world applicability, we study the task from the perspective of robustness. VQA models have been shown to be heavily affected by language priors and model biases, and a large portion of these biases originate in the data, introduced either by the collection process or unintentionally. We therefore first explore a data-centric approach to robustness, specifically the Active Learning framework. Unlike uni-modal tasks, where labeling a sample requires a single annotation, VQA datasets require considerably more effort to label: adding a modality increases the labeling cost more than linearly. To tackle this issue, we propose an Active Learning approach tailored to VQA models and data that minimizes labeling cost. Exploiting the properties of VQA models and data, we leverage each modality separately by attaching simple auxiliary networks that identify the most informative samples. Inspired by the multimodality of VQA, we further design an Active Learning approach with auxiliary networks for uni-modal tasks. We then explore robustness in VQA through debiasing.
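The per-modality acquisition idea above can be sketched as follows. This is a hypothetical illustration, not the thesis's actual method: the function names (`acquisition_scores`, `select_for_labeling`) and the choice of entropy as the uncertainty measure are assumptions; the thesis only states that simple auxiliary networks score each modality separately.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of a batch of probability distributions."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=axis)

def acquisition_scores(vision_probs, question_probs):
    """Combine per-modality uncertainties into one informativeness score.

    vision_probs / question_probs: (N, C) answer distributions predicted by
    auxiliary heads that each see only one modality. Samples where either
    single-modality head is uncertain are treated as informative.
    """
    return entropy(vision_probs) + entropy(question_probs)

def select_for_labeling(vision_probs, question_probs, budget):
    """Return indices of the `budget` most informative unlabeled samples."""
    scores = acquisition_scores(vision_probs, question_probs)
    return np.argsort(-scores)[:budget]
```

Under this sketch, a sample whose image-only and question-only heads both predict near-uniform answer distributions would be queried before one that either modality already answers confidently.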
Within debiasing approaches, we first propose a data-centric approach using a counterfactual method: an automatic procedure that generates a large number of counterfactual images and questions, increasing the model's exposure to such samples and inhibiting shortcut learning. We further propose a model-centric debiasing approach. To overcome the limitations of modeling biases through static data distributions or static model outputs, we propose a novel method that stochastically models the biases a model and dataset may exhibit, using Gaussian noise as an input. In dealing with bias in VQA, we find that methods focused heavily on debiasing sacrifice in-distribution performance in order to increase out-of-distribution performance. To mitigate this trade-off, we propose a simple, lightweight adapter module that achieves both in-distribution and out-of-distribution performance at the same time. Inspired by parameter-efficient fine-tuning of large-scale foundation models, we reinterpret the fine-tuning of large language models as a simple distribution shift. Applying adapters to various state-of-the-art debiasing methods, we show that a single linear layer is enough to match both in-distribution and out-of-distribution VQA performance with a single model. In addition, we discuss the implications of this finding for the VQA community and suggest the use of adapters for targeted debiasing in future work. Lastly, with two separate tracks of techniques with different strengths and focuses, we explore the plausibility of combining them for robustness. Based on our experimental results, we discuss directions for robust VQA and summarize our findings.
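The "single linear layer as adapter" idea can be sketched minimally as below. This is an assumption-laden illustration, not the thesis's implementation: the class name `LinearAdapter`, the residual form, and the near-zero initialization are choices made here to show how an adapter can model a distribution shift over frozen backbone features.

```python
import numpy as np

class LinearAdapter:
    """A single residual linear layer inserted into a frozen backbone.

    Hypothetical sketch: the frozen model's hidden features are shifted by
    a learned affine map, treating the in-distribution to out-of-distribution
    change as a simple distribution shift while the backbone stays untouched.
    """

    def __init__(self, dim, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        # Near-zero initialization so the adapter starts as (almost) identity.
        self.W = rng.normal(scale=0.01, size=(dim, dim))
        self.b = np.zeros(dim)

    def __call__(self, h):
        # Residual connection: at init the output is close to the input,
        # so inserting the adapter does not disturb in-distribution behavior;
        # training W and b then learns the shift toward the OOD setting.
        return h + h @ self.W + self.b
```

The appeal of this design is that the backbone can serve in-distribution queries as-is, while the small adapter, trained separately, recovers out-of-distribution behavior with one shared model.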
Advisors
권인소 (In So Kweon)
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2023
Identifier
325007
Language
eng
Description

Doctoral thesis (Ph.D.) - Korea Advanced Institute of Science and Technology: School of Electrical Engineering, 2023.8, [x, 93 p.]

Keywords

Visual Question Answering; Active Learning; Biased Data; Biased Models; Robust Models

URI
http://hdl.handle.net/10203/320947
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1047242&flag=dissertation
Appears in Collection
EE-Theses_Ph.D. (Doctoral Theses)
Files in This Item
There are no files associated with this item.
