DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Yoo, Changdong | - |
dc.contributor.advisor | 유창동 | - |
dc.contributor.author | Luu, Minh Tung | - |
dc.date.accessioned | 2021-05-13T19:39:13Z | - |
dc.date.available | 2021-05-13T19:39:13Z | - |
dc.date.issued | 2020 | - |
dc.identifier.uri | http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=925214&flag=dissertation | en_US |
dc.identifier.uri | http://hdl.handle.net/10203/285050 | - |
dc.description | Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST) : School of Electrical Engineering, 2020.8, [iii, 24 p.] | - |
dc.description.abstract | Reinforcement learning (RL) agents successively update their parameters by recalling past experience through experience replay. Strongly correlated updates violate the assumptions underlying many stochastic gradient-based algorithms; experience replay breaks these temporal correlations by mixing more and less recent experience in each update, and it also allows rare experience to be reused. It is well known that prioritizing experience judiciously can improve sample efficiency. This thesis proposes Hindsight Goal Ranking (HGR), a method for prioritizing replay experience in off-policy RL that addresses a limitation of Hindsight Experience Replay (HER), namely that HER generates hindsight goals by uniform sampling. HGR samples the states visited in an episode with higher probability when their temporal difference (TD) error is larger, the TD error serving as a proxy for how much the RL agent can learn from an experience. Sampling for large TD error is performed in two steps (see the sketch after this record): first, an episode is sampled from the replay buffer according to the average TD error of its experiences; then, within the sampled episode, a hindsight goal is sampled from the future visited states, with higher probability given to goals yielding larger TD error. Combined with Deep Deterministic Policy Gradient (DDPG), an off-policy model-free actor-critic algorithm, the proposed method learns significantly faster than the same algorithm without prioritization on four challenging simulated robotic manipulation tasks, and the empirical results show that HGR uses samples more efficiently than previous methods on all four tasks. A video of the experimental results is available at https://youtu.be/KKqQ3aDzk1A. | - |
dc.language | eng | - |
dc.publisher | Korea Advanced Institute of Science and Technology (KAIST) | - |
dc.subject | Multi-Goal Reinforcement Learning; Sparse Reward; Sample Efficiency; Hindsight Goal Ranking | - |
dc.subject | 다중 목표 강화학습; 드문 보상; 표본 효율성; 사후 평가 목표 순위 | - |
dc.title | Hindsight goal ranking on replay buffer for sparse reward environment | - |
dc.title.alternative | 희소 보상 환경을 위한 재생 버퍼의 사후 목표 랭킹 방법 | - |
dc.type | Thesis (Master) | - |
dc.identifier.CNRN | 325007 | - |
dc.description.department | Korea Advanced Institute of Science and Technology : School of Electrical Engineering | - |
dc.contributor.alternativeauthor | Luu, Minh Tung | - |
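
For illustration, the two-step sampling described in the abstract can be sketched as follows. This is a minimal sketch, not the author's code: the class name `EpisodicReplayBuffer`, the flat-array priorities (a real implementation would more likely use a sum-tree), the prioritization exponent `alpha`, and the omission of importance-sampling corrections are all assumptions made here for brevity.

```python
import numpy as np

class EpisodicReplayBuffer:
    """Episodic buffer with two-step, TD-error-based prioritized sampling.

    A hypothetical sketch of the scheme in the abstract: episodes are
    sampled by their average |TD error|, then a hindsight goal is sampled
    from the future states of that episode, again favoring large |TD error|.
    """

    def __init__(self, alpha=0.6, eps=1e-6):
        self.episodes = []   # each entry: {"steps": [...], "td": np.ndarray}
        self.alpha = alpha   # prioritization strength (alpha = 0 -> uniform)
        self.eps = eps       # keeps every priority strictly positive

    def add_episode(self, steps):
        # New experience starts with maximal priority so it is replayed soon.
        self.episodes.append({"steps": steps, "td": np.ones(len(steps))})

    def sample(self, rng):
        # Step 1: sample an episode in proportion to the average |TD error|
        # of its experiences.
        prio = np.array([(ep["td"].mean() + self.eps) ** self.alpha
                         for ep in self.episodes])
        ep_idx = int(rng.choice(len(self.episodes), p=prio / prio.sum()))
        ep = self.episodes[ep_idx]

        # Step 2: pick a transition t, then a hindsight goal among the
        # states visited at or after t, favoring larger |TD error|.
        t = int(rng.integers(len(ep["steps"])))
        future = np.arange(t, len(ep["steps"]))
        g_prio = (ep["td"][future] + self.eps) ** self.alpha
        goal = int(future[rng.choice(len(future), p=g_prio / g_prio.sum())])
        return ep_idx, t, goal

    def update_td(self, ep_idx, step_idx, td_error):
        # Refresh the stored priority after each critic update.
        self.episodes[ep_idx]["td"][step_idx] = abs(td_error)

rng = np.random.default_rng(0)
buf = EpisodicReplayBuffer()
buf.add_episode(["s0", "s1", "s2", "s3"])
print(buf.sample(rng))  # e.g. (0, 1, 3): episode 0, step 1, goal at step 3
```

`sample` returns an episode index, a transition index, and the index of the hindsight goal; `update_td` would be called after each learning step to refresh priorities, mirroring how prioritized experience replay keeps its estimates current.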