Robust holistic scene understanding via multi-task learning and sensor fusion

Accurate mathematical modeling of the surrounding environment plays a vital role in the safe operation of unmanned systems such as autonomous vehicles and mobile and aerial robots. Such modeling of dynamic scenes can be performed with centralized or distributed sensor systems, which are expected to provide accurate attribute representations in diverse weather conditions. Given these constraints, modern sensor systems leverage LIDAR, RADAR, and visible-spectrum cameras to capture and represent different scene properties. Data-driven approaches are then used to process these raw measurements and extract complex patterns represented as high-level attributes, such as object detections, semantic and road-marking segmentation, depth estimates, and multi-object tracks. These attributes are aggregated into a holistic scene understanding, based on which path planning and control can be performed to ensure safe operation. However, current approaches to holistic scene understanding rely on multiple task-specific algorithms, resulting in a computationally expensive solution due to redundant computation. Furthermore, these task-specific algorithms are sensitive to domain gaps arising from varying sensor properties or configurations, since a sensor stack is constructed according to the requirements of the end application: a mobile robot requires short-range, wide field-of-view (FoV) surround perception, whereas adaptive cruise control in an autonomous vehicle or ADAS requires long-range forward perception with a narrow FoV. Hence, a well-annotated training dataset is required for each new sensor stack or domain, which is prohibitively expensive.

This dissertation focuses on performing holistic scene understanding for autonomous vehicles using heterogeneous sensor systems whose calibration parameters are known. We define holistic scene understanding as estimating attributes such as road markings and unique object instances in 3D space. Towards this objective, we note that standard perception systems focus on either surround perception with a ring camera or long-range forward perception with a stereo camera, in addition to sensors such as RADAR and LIDAR. Given the strengths of the different sensors, combining their signals provides the robustness needed across scenarios; however, because the output signals are incompatible, they cannot be aggregated directly. We therefore propose a two-stage mechanism that solves multi-modal data fusion while extracting meaningful information: the first stage extracts attributes from camera images and lifts them into point-cloud space, and the second stage integrates the different sensor signals in point-cloud space and performs downstream perception tasks such as 3D object detection. As vision sensors are widely used as primary sensors owing to their ability to densely capture scene information, we focus on devising resource-friendly algorithms to extract scene attributes such as scene semantics, road attributes, and object detections. To extract these attributes without excessive computational overhead, we propose utilizing Multi-Task (MT) networks. While such an approach is theoretically sound, the practical performance of any data-driven system depends on the quality of its training data, which, despite playing a critical role, is usually overlooked.
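To make the first fusion stage concrete, the minimal sketch below (illustrative only, not code from the dissertation) "paints" a LIDAR point cloud with per-pixel camera attributes using the known intrinsic and extrinsic calibration; the function and variable names (e.g., paint_point_cloud) are hypothetical.

```python
import numpy as np

def paint_point_cloud(points_xyz, attr_map, K, T_cam_from_lidar):
    """Attach per-pixel camera attributes (e.g., semantic class scores) to
    LIDAR points by projecting the points into the image with the known
    calibration. Points behind the camera or outside the image keep a
    zero attribute vector.

    points_xyz       : (N, 3) LIDAR points in the LIDAR frame
    attr_map         : (H, W, C) per-pixel attributes from the camera network
    K                : (3, 3) camera intrinsic matrix
    T_cam_from_lidar : (4, 4) extrinsic transform, LIDAR frame -> camera frame
    """
    N = points_xyz.shape[0]
    H, W, C = attr_map.shape

    # Homogeneous transform of the points into the camera frame.
    pts_h = np.hstack([points_xyz, np.ones((N, 1))])      # (N, 4)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]        # (N, 3)

    # Perspective projection with the intrinsics.
    z = pts_cam[:, 2]
    uvw = (K @ pts_cam.T).T                                # (N, 3)
    u = uvw[:, 0] / np.clip(z, 1e-6, None)
    v = uvw[:, 1] / np.clip(z, 1e-6, None)

    # Keep only points in front of the camera and inside the image bounds.
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    attrs = np.zeros((N, C), dtype=attr_map.dtype)
    attrs[valid] = attr_map[v[valid].astype(int), u[valid].astype(int)]

    # The "painted" point cloud concatenates geometry and camera attributes,
    # ready for a downstream 3D detector operating in point-cloud space.
    return np.hstack([points_xyz, attrs])                  # (N, 3 + C)
```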
In addition, one caveat of the multi-task framework is that it requires task-specific ground truth for every input. However, the current state of the art (SoTA) primarily comprises multiple task-specific datasets that focus on distinct operating conditions and tasks. Furthermore, because each dataset source has a non-identical sensor setup, these datasets cannot be used directly in a multi-task setting. To overcome this critical requirement for well-annotated datasets, we develop domain-invariant task-specific networks that provide high-quality pseudo ground truth labels for training the deep-learning-based multi-task algorithm (an illustrative sketch of this training setup follows the list below). The contributions of this dissertation can be summarized as follows:
• We propose a multi-modal multi-task pipeline for holistic scene understanding that generalizes to a wide variety of sensor configurations.
• We demonstrate that this pipeline is computationally efficient and robust to weather variations compared to task-specific networks.
• To ensure optimal training without requiring additional annotated labels, we develop domain-invariant approaches that provide pseudo ground truth labels.
• To further improve the performance of the MT network, we propose a blind image restoration algorithm that restores image regions affected by weather variations.
• Finally, we validate the performance and robustness of the proposed framework on publicly available datasets for downstream 3D perception tasks such as object detection.
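As an illustration of the multi-task and pseudo-label training described above, the sketch below shows a shared encoder with task-specific heads and a weighted multi-task loss computed against pseudo ground truth. The backbone, the particular heads, and the loss weights are assumptions for illustration, not the networks developed in the dissertation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    """Shared encoder with lightweight task-specific heads, so that semantics,
    road markings, and depth are predicted in a single forward pass."""

    def __init__(self, num_classes=19, num_marking_classes=3):
        super().__init__()
        # Shared backbone: computed once per image, reused by every head.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Task-specific heads (per-pixel semantics, road markings, depth).
        self.semantic_head = nn.Conv2d(64, num_classes, 1)
        self.marking_head = nn.Conv2d(64, num_marking_classes, 1)
        self.depth_head = nn.Conv2d(64, 1, 1)

    def forward(self, image):
        feat = self.encoder(image)
        return {
            "semantics": self.semantic_head(feat),
            "markings": self.marking_head(feat),
            "depth": self.depth_head(feat),
        }

def multitask_loss(preds, pseudo_labels, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task losses. The targets are pseudo ground truth
    produced by domain-invariant task-specific networks, resized to match
    the prediction resolution."""
    l_sem = F.cross_entropy(preds["semantics"], pseudo_labels["semantics"])
    l_mark = F.cross_entropy(preds["markings"], pseudo_labels["markings"])
    l_depth = F.l1_loss(preds["depth"], pseudo_labels["depth"])
    return weights[0] * l_sem + weights[1] * l_mark + weights[2] * l_depth
```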
Advisors
Kim, Kyung-Soo (김경수); Yoon, Kuk-Jin (윤국진)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST), Department of Mechanical Engineering, 2022.8, [ix, 131 p.]

Keywords

Holistic Scene Understanding; Multi-Task Learning; Blind Image Enhancement; Sensor Fusion

URI
http://hdl.handle.net/10203/307837
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1007762&flag=dissertation
Appears in Collection
ME-Theses_Ph.D.(박사논문)
Files in This Item
There are no files associated with this item.
