High-level scene understanding with relational and linguistic priors

High-level scene understanding is the task of describing the content of a scene with natural sentences. It can provide informative prior knowledge for a wide range of practical vision-and-language applications, such as social media posting, language-based image search, video summarization, navigation, vehicle control, and assisting visually impaired people. One of the fundamental challenges in high-level scene understanding, especially in image captioning, is the low diversity of the captions generated by models, which is the main issue addressed in this dissertation. We explore several factors that harm the diversity of high-level scene understanding, such as data bias and lack of data.

First, we introduce the novel concept of explicitly leveraging the co-occurrence among words for visual relation classes to address the bias problem in the training dataset. We name this prior knowledge action co-occurrence priors. We also propose two orthogonal ways to exploit action co-occurrence priors: a hierarchical architecture and visual relationship label expansion via knowledge distillation. The resulting model is consistently advantageous compared to previous state-of-the-art techniques. While traditional works mostly focused on the network architecture, the proposed co-occurrence priors can be easily obtained from annotation statistics (a sketch of such a computation follows this abstract) and utilized with negligible overhead while improving performance.

Next, we find that the performance improvements from existing methods for improving the diversity of high-level scene understanding remain somewhat limited. We therefore tackle the fundamental problem of the high-level scene understanding task itself by devising a novel image captioning framework. We introduce dense relational image captioning, a new image captioning task that generates multiple captions grounded in relational information between objects in an image. This framework provides a significantly denser, more diverse, richer, and more informative image representation. To build a dataset for the new task, we also propose a technique that leverages existing visual relationship detection (VRD) labels and visual attribute labels to automatically synthesize relational captioning labels, which significantly reduces the effort needed to construct our "Relational Captioning dataset." Moreover, to effectively learn relational captions, we propose the multi-task triple-stream network (MTTSNet), which leverages part-of-speech (POS) tags as prior knowledge to guide the generation of the correct word in a caption. We also introduce several applications of our framework, including "caption graph" generation and sentence-based image region-pair retrieval.

Finally, constructing human-labeled datasets for high-level scene understanding frameworks is hugely laborious and time-consuming. In contrast to manually annotating all training samples, collecting unpaired images and captions separately from the web is far easier. We propose a novel framework for training an image captioner with unpaired image-caption data and a small amount of paired data, and we devise a new semi-supervised learning approach based on a novel use of the GAN discriminator. We theoretically and empirically show the effectiveness of our method in various challenging image captioning setups, including our scarcely-paired COCO dataset, compared to strong competing methods.
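The sketch below is a minimal, illustrative example (not the dissertation's exact formulation) of how a co-occurrence prior over action classes could be estimated from annotation statistics. The function name, the `annotations` input (a hypothetical list of per-sample action label sets), and the conditional-probability normalization are all assumptions made for illustration.

```python
# Minimal sketch: estimating an action co-occurrence prior from label statistics.
# Assumption: `annotations` is a hypothetical list of per-sample sets of action
# (visual relation) class indices; prior[i, j] approximates P(action j | action i).
import numpy as np

def build_cooccurrence_prior(annotations, num_actions):
    """Estimate a conditional co-occurrence matrix from per-sample label sets."""
    counts = np.zeros((num_actions, num_actions), dtype=np.float64)
    occurrences = np.zeros(num_actions, dtype=np.float64)
    for labels in annotations:
        labels = list(set(labels))          # ignore duplicate labels within a sample
        for i in labels:
            occurrences[i] += 1.0
            for j in labels:
                if i != j:
                    counts[i, j] += 1.0
    # Normalize each row by how often the conditioning action appears;
    # guard against division by zero for actions that never occur.
    prior = counts / np.maximum(occurrences[:, None], 1.0)
    return prior

if __name__ == "__main__":
    # Toy example with 4 action classes: 0=hold, 1=sip, 2=ride, 3=straddle
    toy_annotations = [
        {0, 1},   # hold + sip (e.g., a cup)
        {0, 1},
        {2, 3},   # ride + straddle (e.g., a bicycle)
        {0},
    ]
    prior = build_cooccurrence_prior(toy_annotations, num_actions=4)
    print(prior)  # e.g., prior[1, 0] == 1.0: "sip" always co-occurs with "hold"
```

Such a matrix could then serve as prior knowledge during training, for example to group or expand relation labels; how the priors are actually injected into the model is described in the dissertation itself.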
Advisors
Kweon, In So (권인소)
Description
Korea Advanced Institute of Science and Technology: School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2021
Identifier
325007
Language
eng
Description

Ph.D. dissertation - Korea Advanced Institute of Science and Technology: School of Electrical Engineering, 2021.8, [ix, 100 p.]

Keywords

Scene understanding; Visual context; Image captioning; Dense captioning; Visual relationship; Relational analysis; Diversity

URI
http://hdl.handle.net/10203/295605
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=962464&flag=dissertation
Appears in Collection
EE-Theses_Ph.D.(박사논문)
Files in This Item
There are no files associated with this item.
