Learning dense pixel features for video understanding and processing

Videos offer something that images cannot: motion information, which facilitates visual processing and understanding in human vision. Undoubtedly, the capability to model and process spatio-temporal data is essential to many computer vision tasks such as video editing and segmentation. However, research in the video domain has lagged significantly behind its image counterpart. Given our dynamic visual world, we study whether existing image-based tasks and algorithms can be extended to videos, especially since motion is an indispensable cue that comes for free for learning visual representations. In this dissertation, we propose a 3D-2D encoder-decoder architecture that can produce dense and pixel-precise results for a suite of video tasks.

First, we start with video completion problems: caption removal and object inpainting. Video completion aims to fill in spatio-temporal holes in videos with plausible content. Despite tremendous progress on deep learning-based inpainting of a single image, it is still challenging to extend these methods to the video domain because of the additional time dimension. We propose a recurrent temporal aggregation framework for fast deep video inpainting. In particular, we construct an encoder-decoder model in which the encoder takes multiple reference frames that provide visible pixels revealed by the scene dynamics. These hints are aggregated and fed into the decoder. We apply recurrent feedback in an auto-regressive manner to enforce temporal consistency in the video results. We propose two architectural designs based on this framework. Our first model is a blind video decaptioning network (BVDNet) that automatically removes and inpaints text overlays in videos without any mask information. BVDNet won first place in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. Our second model is a network for more general video inpainting (VINet) that handles more arbitrary and larger holes. Video results demonstrate the advantage of our framework over state-of-the-art methods both qualitatively and quantitatively.

Then, we propose and study video panoptic segmentation (VPS), a task that requires assigning semantic classes and track identities to all pixels in a video. A holistic understanding of dynamic scenes is of fundamental importance in real-world computer vision problems such as autonomous driving, augmented reality, and spatio-temporal reasoning. To support this new benchmark, we present two datasets, Cityscapes-VPS and VIPER, together with a new evaluation metric, video panoptic quality (VPQ). We also propose a strong video panoptic segmentation network (VPSNet), which simultaneously performs classification, detection, segmentation, and tracking of all identities in videos. Specifically, VPSNet builds upon a top-down panoptic segmentation network by adding Fuse and Track heads, which respectively learn pixel-level and object-level correspondences between consecutive frames. We further explore the effectiveness of stronger backbones and propose VPSNet++, with novel modifications to the fuse, track, and panoptic heads, each achieving performance gains over the base VPSNet and state-of-the-art results on the Cityscapes-VPS dataset. We further adapt our method to a modern anchor-free detector, which avoids proposal generation and crop-and-resize operations.
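The abstract does not give implementation details, so the following is only a minimal, hypothetical sketch of the recurrent temporal aggregation idea described above: a 3D-convolutional encoder aggregates multiple reference frames, a 2D-convolutional decoder produces the completed frame, and the previously generated frame is fed back auto-regressively. Module names, channel sizes, and the mean-pooling aggregation are illustrative assumptions, not the released BVDNet/VINet code.

```python
import torch
import torch.nn as nn

class TinyVideoCompletionNet(nn.Module):
    """Illustrative 3D-2D encoder-decoder with recurrent feedback (not the authors' model)."""

    def __init__(self, ch=16):
        super().__init__()
        # 3D encoder: input is (B, 3, T, H, W) with T reference frames
        self.encoder3d = nn.Sequential(
            nn.Conv3d(3, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, kernel_size=(3, 3, 3), padding=(0, 1, 1)),
        )
        # recurrent feedback: the previously generated frame is encoded in 2D
        self.feedback2d = nn.Conv2d(3, ch, kernel_size=3, padding=1)
        # 2D decoder produces the current completed frame
        self.decoder2d = nn.Sequential(
            nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, kernel_size=3, padding=1),
        )

    def forward(self, ref_frames, prev_output):
        # ref_frames: (B, 3, T, H, W); prev_output: (B, 3, H, W)
        feat = self.encoder3d(ref_frames)   # (B, ch, T', H, W)
        feat = feat.mean(dim=2)             # temporal aggregation -> (B, ch, H, W)
        fb = self.feedback2d(prev_output)
        return self.decoder2d(torch.cat([feat, fb], dim=1))

net = TinyVideoCompletionNet()
refs = torch.randn(1, 3, 5, 64, 64)      # 5 reference frames
prev = torch.randn(1, 3, 64, 64)         # previously completed frame
out = net(refs, prev)                    # (1, 3, 64, 64) completed current frame
```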
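For reference, the VPQ metric mentioned above extends panoptic quality (PQ) from single frames to spatio-temporal tubes. The sketch below shows its commonly cited form: matches (true positives) are computed on tubes spanning a k-frame window, and the final score averages over several window sizes. The exact window sizes used in the dissertation are not stated in this abstract.

```latex
\mathrm{VPQ}^{k} \;=\; \frac{1}{N_{\mathrm{classes}}}
  \sum_{c}
  \frac{\sum_{(u,\hat{u}) \in \mathrm{TP}_c} \mathrm{IoU}(u,\hat{u})}
       {\lvert \mathrm{TP}_c \rvert + \tfrac{1}{2}\lvert \mathrm{FP}_c \rvert + \tfrac{1}{2}\lvert \mathrm{FN}_c \rvert},
\qquad
\mathrm{VPQ} \;=\; \frac{1}{K} \sum_{k} \mathrm{VPQ}^{k}
```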
Finally, we propose an end-to-end clip-level video segmentation network inspired by the Transformer architecture. We present TubeFormer, the first attempt to tackle multiple core video segmentation tasks in a unified manner. Different video segmentation tasks (e.g., video semantic/instance/panoptic segmentation) are usually considered distinct problems; the state-of-the-art models adopted in these separate communities have diverged, and radically different approaches dominate each task. By contrast, we make the crucial observation that video segmentation tasks can be generally formulated as the problem of assigning different predicted labels to video tubes (where a tube is obtained by linking segmentation masks along the time axis), and the labels may encode different values depending on the target task. This observation motivates us to develop TubeFormer, a simple and effective mask-transformer-based model that is widely applicable to multiple video segmentation tasks. TubeFormer directly predicts video tubes with task-specific labels (either pure semantic categories, or both semantic categories and instance identities), which not only significantly simplifies video segmentation models but also advances the state of the art on multiple video segmentation benchmarks.
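To make the tube formulation above concrete, here is a minimal NumPy sketch of what "assigning task-specific labels to video tubes" could look like. The data structure and the simple overwrite rule for overlaps are illustrative assumptions, not TubeFormer's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class VideoTube:
    """A tube: per-frame binary masks linked along time, plus a task-dependent label."""
    masks: np.ndarray            # (T, H, W) boolean masks, one per frame
    semantic_class: int          # semantic category id
    instance_id: Optional[int]   # None for "stuff" / pure semantic segmentation

def render(tubes: List[VideoTube], shape: Tuple[int, int, int]):
    """Rasterize tubes into per-frame (class, instance) maps.

    Later tubes simply overwrite earlier ones where masks overlap; a real
    model would resolve overlaps by confidence, which is omitted here.
    """
    class_map = np.zeros(shape, dtype=np.int32)
    inst_map = np.zeros(shape, dtype=np.int32)
    for tube in tubes:
        class_map[tube.masks] = tube.semantic_class
        inst_map[tube.masks] = tube.instance_id if tube.instance_id is not None else 0
    return class_map, inst_map

# Example: a single "car" instance tube over a 2-frame, 4x4 clip.
tube = VideoTube(masks=np.zeros((2, 4, 4), dtype=bool), semantic_class=13, instance_id=1)
tube.masks[:, 1:3, 1:3] = True
class_map, inst_map = render([tube], (2, 4, 4))
# Video panoptic segmentation keeps both maps; video semantic segmentation
# would use only class_map; video instance segmentation keeps only "thing" tubes.
```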
Advisors
Kweon, In So (권인소)
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Doctoral thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST): School of Electrical Engineering, 2022.2, [viii, 84 p.]

URI
http://hdl.handle.net/10203/309065
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=996262&flag=dissertation
Appears in Collection
EE-Theses_Ph.D. (Doctoral theses)
Files in This Item
There are no files associated with this item.
