Batch Reinforcement Learning with Hyperparameter Gradients

We consider the batch reinforcement learning problem, where the agent must learn from a fixed batch of data without further interaction with the environment. In this setting, the optimized policy should be prevented from deviating too much from the data collection policy, since the estimation otherwise becomes highly unstable due to the off-policy nature of the problem. Imposing this requirement too strongly, however, yields a policy that merely imitates the data collection policy. Unlike prior work, where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), which performs gradient-based optimization of the hyperparameter on held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms on tabular and continuous control tasks by finding a good balance in the trade-off between adhering to the data collection policy and pursuing possible policy improvement.
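
The abstract describes two nested optimizations: a policy fit to the batch under a regularization penalty toward the data collection policy, and the penalty coefficient itself tuned by gradient ascent on a held-out performance estimate. The numpy sketch below illustrates that idea on a hypothetical 3-armed bandit with a KL penalty; the toy setup, all names, and the finite-difference hypergradient are illustrative assumptions standing in for the paper's analytic hyperparameter gradient, not the authors' implementation.

```python
import numpy as np

# Hypothetical toy: 3-armed bandit with a fixed batch collected by a known
# behavior policy beta. This sketches hyperparameter tuning on held-out data;
# it is NOT the authors' BOPAH implementation, and every name is illustrative.

rng = np.random.default_rng(0)
n_actions = 3
true_q = np.array([1.0, 0.5, 0.0])    # unknown true action values
behavior = np.array([0.2, 0.3, 0.5])  # data collection policy beta

def sample_batch(n):
    """Draw actions from beta and noisy rewards around the true values."""
    a = rng.choice(n_actions, size=n, p=behavior)
    return a, true_q[a] + rng.normal(0.0, 1.0, size=n)

def q_estimate(a, r):
    """Empirical mean reward per action (0 for unseen actions)."""
    return np.array([r[a == i].mean() if np.any(a == i) else 0.0
                     for i in range(n_actions)])

def fit_policy(q_hat, alpha, steps=500, lr=0.5):
    """Inner loop: maximize E_pi[q_hat] - alpha * KL(pi || beta) over logits."""
    theta = np.zeros(n_actions)
    for _ in range(steps):
        log_pi = theta - theta.max() - np.log(np.exp(theta - theta.max()).sum())
        pi = np.exp(log_pi)
        # dJ/dpi, pushed through the softmax Jacobian on the next line
        score = q_hat - alpha * (log_pi - np.log(behavior) + 1.0)
        theta += lr * pi * (score - pi @ score)
    return np.exp(theta - theta.max()) / np.exp(theta - theta.max()).sum()

def heldout_value(pi, a, r):
    """Importance-sampled off-policy value estimate on held-out data."""
    return np.mean(pi[a] / behavior[a] * r)

a_tr, r_tr = sample_batch(200)  # training split
a_ho, r_ho = sample_batch(200)  # held-out split
q_hat = q_estimate(a_tr, r_tr)

alpha, eps, outer_lr = 1.0, 0.05, 0.5
for _ in range(50):
    # finite-difference stand-in for the analytic hypergradient
    v_hi = heldout_value(fit_policy(q_hat, alpha + eps), a_ho, r_ho)
    v_lo = heldout_value(fit_policy(q_hat, alpha - eps), a_ho, r_ho)
    alpha = max(0.1, alpha + outer_lr * (v_hi - v_lo) / (2 * eps))  # keep alpha positive

print("tuned alpha:", round(alpha, 3))
print("resulting policy:", fit_policy(q_hat, alpha).round(3))
```

Even in this toy, a small alpha chases noise in the training value estimates while a large alpha pins the policy to the behavior policy; the held-out gradient step seeks the point between the two, which is the trade-off the abstract describes.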
Publisher
International Conference on Machine Learning
Issue Date
2020-07-16
Language
English
Citation
The 37th International Conference on Machine Learning (ICML 2020), pp. 5681-5691
ISSN
2640-3498
URI
http://hdl.handle.net/10203/278163
Appears in Collection
RIMS Conference Papers