DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Yoo, Changdong | - |
dc.contributor.advisor | 유창동 | - |
dc.contributor.author | Luu, Minh Tung | - |
dc.date.accessioned | 2021-05-13T19:39:13Z | - |
dc.date.available | 2021-05-13T19:39:13Z | - |
dc.date.issued | 2020 | - |
dc.identifier.uri | http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=925214&flag=dissertation | en_US |
dc.identifier.uri | http://hdl.handle.net/10203/285050 | - |
dc.description | Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST) : School of Electrical Engineering, 2020.8, [iii, 24 p.] | - |
dc.description.abstract | Reinforcement learning (RL) agents successively update their parameters by recalling past experience through experience replay. Strongly correlated updates violate the assumptions underlying many stochastic gradient-based algorithms; experience replay breaks these temporal correlations by mixing more and less recent experience in each update, and it also allows rare experience to be reused. It is well known that prioritizing experience judiciously can improve sample efficiency. This thesis proposes Hindsight Goal Ranking (HGR), a method for prioritizing replay experience in off-policy RL that addresses a limitation of Hindsight Experience Replay (HER), namely that HER generates hindsight goals by uniform sampling. HGR samples the states visited in an episode with higher probability when their temporal difference (TD) error is larger, the TD error serving as a proxy for how much the RL agent can learn from an experience. Sampling for large TD error is performed in two steps (see the sketch after this record): first, an episode is sampled from the replay buffer according to the average TD error of its experiences; then, within the sampled episode, a hindsight goal is sampled from the future visited states, with higher probability given to goals yielding larger TD error. Combined with Deep Deterministic Policy Gradient (DDPG), an off-policy model-free actor-critic algorithm, the proposed method learns significantly faster than the same algorithm without prioritization on four challenging simulated robotic manipulation tasks, and the empirical results show that HGR uses samples more efficiently than previous methods on all four tasks. A video of the experimental results is available at https://youtu.be/KKqQ3aDzk1A. | - |
dc.language | eng | - |
dc.publisher | Korea Advanced Institute of Science and Technology (KAIST) | - |
dc.subject | Multi-Goal Reinforcement Learning; Sparse Reward; Sample Efficiency; Hindsight Goal Ranking | - |
dc.subject | 다중 목표 강화학습; 드문 보상; 표본 효율성; 사후 평가 목표 순위 | - |
dc.title | Hindsight goal ranking on replay buffer for sparse reward environment | - |
dc.title.alternative | 희소 보상 환경을 위한 재생 버퍼의 사후 목표 랭킹 방법 | - |
dc.type | Thesis (Master) | - |
dc.identifier.CNRN | 325007 | - |
dc.description.department | Korea Advanced Institute of Science and Technology : School of Electrical Engineering | - |
dc.contributor.alternativeauthor | Luu, Minh Tung | - |
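
For illustration, the two-step sampling described in the abstract can be sketched as follows. This is a minimal sketch, not the author's code: the class name `EpisodicReplayBuffer`, the flat-array priorities (a real implementation would more likely use a sum-tree), the prioritization exponent `alpha`, and the omission of importance-sampling corrections are all assumptions made here for brevity.

```python
import numpy as np

class EpisodicReplayBuffer:
    """Episodic buffer with two-step, TD-error-based prioritized sampling.

    A hypothetical sketch of the scheme in the abstract: episodes are
    sampled by their average |TD error|, then a hindsight goal is sampled
    from the future states of that episode, again favoring large |TD error|.
    """

    def __init__(self, alpha=0.6, eps=1e-6):
        self.episodes = []   # each entry: {"steps": [...], "td": np.ndarray}
        self.alpha = alpha   # prioritization strength (alpha = 0 -> uniform)
        self.eps = eps       # keeps every priority strictly positive

    def add_episode(self, steps):
        # New experience starts with maximal priority so it is replayed soon.
        self.episodes.append({"steps": steps, "td": np.ones(len(steps))})

    def sample(self, rng):
        # Step 1: sample an episode in proportion to the average |TD error|
        # of its experiences.
        prio = np.array([(ep["td"].mean() + self.eps) ** self.alpha
                         for ep in self.episodes])
        ep_idx = int(rng.choice(len(self.episodes), p=prio / prio.sum()))
        ep = self.episodes[ep_idx]

        # Step 2: pick a transition t, then a hindsight goal among the
        # states visited at or after t, favoring larger |TD error|.
        t = int(rng.integers(len(ep["steps"])))
        future = np.arange(t, len(ep["steps"]))
        g_prio = (ep["td"][future] + self.eps) ** self.alpha
        goal = int(future[rng.choice(len(future), p=g_prio / g_prio.sum())])
        return ep_idx, t, goal

    def update_td(self, ep_idx, step_idx, td_error):
        # Refresh the stored priority after each critic update.
        self.episodes[ep_idx]["td"][step_idx] = abs(td_error)

rng = np.random.default_rng(0)
buf = EpisodicReplayBuffer()
buf.add_episode(["s0", "s1", "s2", "s3"])
print(buf.sample(rng))  # e.g. (0, 1, 3): episode 0, step 1, goal at step 3
```

`sample` returns an episode index, a transition index, and the index of the hindsight goal; `update_td` would be called after each learning step to refresh priorities, mirroring how prioritized experience replay keeps its estimates current.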