Sample Efficient Reinforcement Learning via Large Vision Language Model Distillation

Recent research highlights the potential of multimodal foundation models in tackling complex decision-making challenges. However, their large parameter counts make real-world deployment resource-intensive and often impractical for constrained systems. Reinforcement learning (RL) shows promise for training task-specific agents but suffers from high sample complexity, limiting practical applications. To address these challenges, we introduce LVLM to Policy (LVLM2P), a novel framework that distills knowledge from large vision-language models (LVLMs) into more efficient RL agents. Our approach leverages the LVLM as a teacher, providing instructional actions based on trajectories collected by the RL agent. This reduces unproductive exploration in the early stages of learning, significantly accelerating the agent's progress. Additionally, by leveraging the LVLM to suggest actions directly from visual observations, we eliminate the need for manual textual descriptions of the environment, enhancing applicability across diverse tasks. Experiments show that LVLM2P significantly improves the sample efficiency of baseline RL algorithms. The code is available at https://github.com/i22024/LVLM2P.
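The core idea of the abstract — using teacher-suggested actions as supervision for the student policy — can be sketched as a cross-entropy distillation update. This is a minimal illustrative sketch, not the paper's implementation: `lvlm_teacher_action` is a hypothetical stub standing in for the real LVLM query, and the actual method combines this distillation signal with the baseline RL objective.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over action logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def lvlm_teacher_action(obs):
    # Hypothetical stub for the LVLM teacher: in the paper, a
    # vision-language model suggests an action from the visual
    # observation. Here a toy rule stands in for that query.
    return int(obs.sum() > 0)

def distill_step(logits, obs, lr=0.5):
    # One distillation update: cross-entropy loss pushing the
    # student policy toward the teacher's suggested action.
    a_t = lvlm_teacher_action(obs)
    p = softmax(logits)
    grad = p.copy()
    grad[a_t] -= 1.0  # d(CE)/d(logits) for target action a_t
    return logits - lr * grad, a_t

logits = np.zeros(2)   # 2-action student policy, uniform at init
obs = np.ones(4)       # dummy "visual observation"
for _ in range(50):
    logits, a_t = distill_step(logits, obs)
# Probability mass shifts toward the teacher's suggested action.
print(softmax(logits))
```

In the full framework this supervised term would be added to the RL loss, so the teacher only shapes early exploration rather than replacing environment reward.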
Publisher
Institute of Electrical and Electronics Engineers Inc.
Issue Date
2025-04
Language
English
Citation

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

DOI
10.1109/ICASSP49660.2025.10888998
URI
http://hdl.handle.net/10203/336599
Appears in Collection
EE-Conference Papers (학술회의논문; Conference Papers)