Learning from demonstrations under transition dynamic mismatch

Demonstrations alleviate the difficulties of Reinforcement Learning (RL). Learning from Demonstrations (LfD) is the problem of seeking optimal policies without the true reward signal that RL normally requires, and demonstrations can also speed up learning a new task in RL. Two practical challenges arise when handling demonstrations: (1) the environments of the agent and the demonstrator (in particular, their transition dynamics functions) may differ, and (2) the demonstrations may be suboptimal or too few. The prior art, Indirect Imitation Learning (I2L), overcomes dynamics mismatch by matching state-only distributions instead of state-action distributions; however, its performance is limited by that of the demonstrator. In contrast, Trajectory-ranked Reward Extrapolation (TREX) outperforms the demonstrator by inferring a high-quality reward function from ranked demonstrations, but its learnt reward model inevitably performs poorly under dynamics mismatch. Likewise, behavioral priors learnt from diverse demonstrations can accelerate RL, but they are not useful in a new environment with different dynamics. First, this thesis proposes a novel algorithm that handles both challenges: it learns a reward function from ranked demonstrations while accounting for domain mismatch via the I2L algorithm. Additionally, I2L in the proposed method is replaced with Adversarial Inverse Reinforcement Learning (AIRL) for environments without dynamics mismatch, which provides a data-augmentation effect when demonstrations are few. In experiments on continuous physical locomotion tasks, the proposed method outperforms the I2L and TREX baselines by up to 330%. The method is robust to transition dynamics mismatch between the agent and the demonstrator and obtains good policies from suboptimal demonstrations, and the AIRL variant outperforms the baselines when there is no dynamics mismatch. Second, the thesis proposes a method for accelerating RL that incorporates past observations collected under dynamics different from those of the new task.
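
The abstract's reward-learning component builds on TREX-style learning from ranked demonstrations. Below is a minimal sketch of that published ranking objective (the Bradley-Terry pairwise loss over trajectory returns), not the thesis's actual code: the network architecture, optimizer, observation dimension, and placeholder trajectories are illustrative assumptions.

```python
# Minimal sketch of TREX-style reward learning from ranked demonstrations.
# Assumptions (not from the thesis): a small MLP reward network on states only,
# Adam optimizer, Hopper-like observation size, and random placeholder trajectories.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):            # states: (T, obs_dim)
        return self.net(states).sum()     # predicted return of the whole trajectory

def ranking_loss(reward_net, traj_low, traj_high):
    """Bradley-Terry loss: the higher-ranked trajectory should receive
    the larger predicted return."""
    returns = torch.stack([reward_net(traj_low), reward_net(traj_high)])
    # Cross-entropy with the preferred trajectory (index 1) as the label.
    return nn.functional.cross_entropy(returns.unsqueeze(0), torch.tensor([1]))

# Usage: sample pairs of ranked demonstration trajectories and minimize the loss.
obs_dim = 11                               # e.g. Hopper-like observations (assumed)
reward_net = RewardNet(obs_dim)
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-4)
traj_low = torch.randn(200, obs_dim)       # placeholder for a suboptimal trajectory
traj_high = torch.randn(200, obs_dim)      # placeholder for a better-ranked trajectory
loss = ranking_loss(reward_net, traj_low, traj_high)
opt.zero_grad(); loss.backward(); opt.step()
```

In the proposed method, this ranked-reward learning is combined with I2L's state-only distribution matching (or AIRL when dynamics match) rather than used in isolation.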
Advisors
Kim, Tae-Kyun (김태균)
Description
Korea Advanced Institute of Science and Technology (KAIST): School of Computing
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology: School of Computing, 2022.8, [iii, 18 p.]

Keywords

Learning from Demonstrations; Imitation Learning; Reward Learning; Reinforcement Learning

URI
http://hdl.handle.net/10203/309520
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1008397&flag=dissertation
Appears in Collection
CS-Theses_Master(석사논문)
Files in This Item
There are no files associated with this item.
