Off-Policy Reinforcement Learning with Loss Function Weighted by Temporal Difference Error

Training agents with off-policy deep reinforcement learning algorithms requires a replay memory that stores past experiences, which are sampled uniformly or non-uniformly to create training batches. When calculating the loss function, off-policy algorithms commonly assume that all samples are equally important. We introduce a novel algorithm that assigns unequal importance to each experience in the training objective, in the form of a weighting factor derived from the distribution of its temporal difference (TD) error. With uniform sampling, experiments in eight environments of the OpenAI Gym suite show that the proposed algorithm achieves, in one environment, a 10% increase in convergence speed with a similar success rate and, in the other seven environments, 3%–46% increases in success rate or 3%–14% increases in cumulative reward with similar convergence speed. The algorithm can also be combined with an existing prioritization method that employs non-uniform sampling; the combined technique achieves a 20% increase in convergence speed compared to the prioritization method alone.
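The abstract describes weighting each sample's loss term by a factor derived from the batch's TD-error distribution. The sketch below illustrates that general idea for a DQN-style update in PyTorch; the softmax-over-|TD error| weighting, the function name td_weighted_loss, and the temperature parameter are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact method): per-sample TD-error-based
# weighting of a DQN-style loss over a uniformly sampled batch.
import torch
import torch.nn as nn


def td_weighted_loss(q_net, target_net, batch, gamma=0.99, temperature=1.0):
    """Weighted TD loss; `batch` is assumed to hold tensors keyed by
    'state', 'action', 'reward', 'next_state', 'done'."""
    states, actions = batch["state"], batch["action"]
    rewards, next_states, dones = batch["reward"], batch["next_state"], batch["done"]

    # Q(s, a) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    td_errors = targets - q_values

    # Weighting factor from the batch distribution of |TD error|: samples with
    # larger TD error contribute more to the objective. Detached so the weights
    # themselves carry no gradient; rescaled so the mean weight stays near 1.
    weights = torch.softmax(td_errors.abs().detach() / temperature, dim=0)
    weights = weights * td_errors.numel()

    per_sample_loss = nn.functional.smooth_l1_loss(q_values, targets, reduction="none")
    return (weights * per_sample_loss).mean()


if __name__ == "__main__":
    # Minimal usage with random data: a 4-dimensional state, 2 discrete actions.
    net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
    tgt = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
    tgt.load_state_dict(net.state_dict())
    batch = {
        "state": torch.randn(8, 4),
        "action": torch.randint(0, 2, (8,)),
        "reward": torch.randn(8),
        "next_state": torch.randn(8, 4),
        "done": torch.zeros(8),
    }
    loss = td_weighted_loss(net, tgt, batch)
    loss.backward()
    print(loss.item())
```

Detaching the TD errors before computing the weights keeps the weighting factor from contributing gradients of its own, so only the weighted per-sample loss drives the update.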
Publisher
Springer Science and Business Media Deutschland GmbH
Issue Date
2023-08-10
Language
English
Citation
19th International Conference on Intelligent Computing, ICIC 2023
DOI
10.1007/978-981-99-4761-4_51
URI
http://hdl.handle.net/10203/316720
Appears in Collection
GT-Conference Papers (Conference Papers)
