Novel entropy frameworks for sample-efficient exploration in off-policy reinforcement learning

In this thesis, we investigate novel entropy frameworks for sample-efficient exploration in off-policy reinforcement learning in continuous action-space environments. In addition, we provide an off-policy generalization of PPO for better sample exploitation. The thesis consists of four parts, summarized as follows.

In the first part, sample-aware policy entropy regularization is proposed to enhance conventional policy entropy regularization for better exploration. Exploiting the sample distribution obtainable from the replay buffer, the proposed sample-aware entropy regularization maximizes the entropy of the weighted sum of the policy action distribution and the sample action distribution from the replay buffer, yielding sample-efficient exploration. A practical algorithm named diversity actor-critic (DAC) is developed by applying policy iteration to the objective function with the proposed sample-aware entropy regularization. Numerical results show that DAC significantly outperforms recent reinforcement learning algorithms.

In the second part, we propose a max-min entropy framework for reinforcement learning (RL) to overcome a limitation of the maximum entropy RL framework in model-free, sample-based learning. Whereas the maximum entropy RL framework guides the policy toward states with high entropy in the future, the proposed max-min entropy framework learns to visit states with low entropy and to maximize the entropy of those low-entropy states to promote exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework based on the disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields drastic performance improvements over current state-of-the-art RL algorithms.

In the third part, a new adaptive multi-batch experience replay scheme that reuses the batch samples of past policies is proposed for proximal policy optimization (PPO) in continuous action control. The proposed scheme adaptively determines the number of past batches to reuse based on the average importance sampling (IS) weight. Combining PPO with the proposed scheme preserves the advantages of the original PPO while keeping the bias small owing to low IS weights. Numerical results show that the proposed method significantly improves performance on various continuous control tasks compared with the original PPO.

In the last part, we address the problem that IS weights are typically clipped to avoid large variance in IS-based RL algorithms such as PPO: policy updates from clipped statistics can induce large bias, and this bias makes it difficult to reuse old samples. We therefore improve PPO with dimension-wise IS weight clipping (DISC), which clips the IS weight of each action dimension separately to avoid large bias and adaptively controls the IS weight. This technique enables efficient learning for tasks with high-dimensional action spaces and allows old samples to be reused to increase sample efficiency. Numerical results show that the proposed algorithm outperforms PPO and other RL algorithms on various OpenAI Gym tasks.
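To make the weighted-entropy idea from the first part concrete, the following is a minimal, discretized sketch: it compares the entropy of the policy alone with the entropy of a mixture of the policy action distribution and the empirical buffer action distribution. The mixture weight `alpha`, the four-bin action discretization, and the function names are illustrative assumptions for this toy example, not the thesis's actual DAC implementation (which operates on continuous actions).

```python
# Toy, discretized sketch of sample-aware entropy regularization:
# maximize the entropy of a weighted mixture of the policy's action
# distribution and the empirical action distribution in the replay buffer.
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution p."""
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(p * np.log(p + eps))

def sample_aware_entropy(policy_probs, buffer_probs, alpha=0.5):
    """Entropy of the mixture alpha*pi + (1-alpha)*q over a discretized
    action space; a proxy for the regularizer described in the abstract."""
    mixture = alpha * np.asarray(policy_probs) + (1.0 - alpha) * np.asarray(buffer_probs)
    return entropy(mixture)

# Example: a peaked policy and a buffer distribution concentrated elsewhere.
pi = np.array([0.70, 0.20, 0.05, 0.05])  # current policy over 4 action bins
q  = np.array([0.05, 0.05, 0.20, 0.70])  # empirical buffer distribution

print("policy entropy      :", entropy(pi))
print("sample-aware entropy:", sample_aware_entropy(pi, q, alpha=0.5))
# The mixture entropy exceeds the plain policy entropy when the policy covers
# actions under-represented in the buffer, which is the exploration signal
# the sample-aware regularizer rewards.
```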
Advisors
Sung, Youngchul (성영철)
Description
Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2021
Identifier
325007
Language
eng
Description

Doctoral thesis (Ph.D.) - Korea Advanced Institute of Science and Technology: School of Electrical Engineering, 2021.8, [viii, 98 p.]

Keywords

Entropy framework; Sample-efficient exploration; Reinforcement learning; Off-policy learning; Continuous control

URI
http://hdl.handle.net/10203/295656
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=962461&flag=dissertation
Appears in Collection
EE-Theses_Ph.D.(박사논문)
Files in This Item
There are no files associated with this item.
