Algorithm and application of imitation learning and reinforcement learning for sequential decision making problems with multiple agents

In systems engineering, a system is characterized by a large number of interrelated elements organized to achieve predefined objectives. Accordingly, the general purpose of a sequential decision problem defined on an engineering system is to operate the system toward those objectives through decisions based on the information recognized at each decision epoch. Because a system consists of many interrelated components, many decision problems in a system must be treated as sequential decision-making problems with multiple agents. In this dissertation, we study various sequential decision-making problems with multiple agents, using the emergency medical service system as the major application domain. We mainly use Markov decision process (MDP), decentralized partially observable MDP (dec-POMDP), and stochastic game (SG) models, and apply multi-agent reinforcement learning (MARL) and imitation learning algorithms to problems that are difficult to solve. A representative problem throughout the dissertation is the selective patient admission problem at an emergency department (ED) after a mass-casualty incident.

In Chapter 2, we formulate and analyze an MDP model for the selective patient admission problem, focusing on a single ED. We review the structural properties of the optimal policy of the MDP model and identify how the optimal policy varies with the characteristics of the input functions that represent external factors affecting decision making.

In Chapter 3, we propose a solution method for partially observable multi-agent problems in disaster response operations. A dec-POMDP model suits sequential decision-making problems in disaster response because it assumes a situation where multiple decision-makers choose actions based on partial information. Our method combines MARL with the behavior cloning (BC) technique of imitation learning: it draws on reference policies from previous research on disaster response, and through BC this domain knowledge pretrains the policy network and value network that are subsequently used in reinforcement learning. As a case study, we generalize the mathematical model for the selective patient admission problem to a dec-POMDP model. The proposed method reduces computation time significantly compared with an MARL algorithm that does not use pretraining, and it obtains a near-optimal dec-POMDP policy whose performance is close to the upper-bound value of the problem. Moreover, numerical experiments show that the method remains effective in inherently partially observable environments and in cases where decisions in the prehospital phase affect the performance of the selective patient admission strategy.

In Chapter 4, we propose a method that improves a cooperative MARL algorithm using imitation learning. The method exploits a reference policy obtained from a decision environment with more information than the dec-POMDP problem assumes: it collects demonstrations from the solution of a multi-agent MDP (MMDP) or multi-agent POMDP (MPOMDP) model and mixes them into the training of the policy network in an MARL algorithm.
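Both the pretraining in Chapter 3 and the demonstration mixing in Chapter 4 build on behavior cloning, i.e., supervised learning that maps the observations visited by a reference policy to the actions it chose. The following is a minimal sketch in PyTorch; the network architecture, the names PolicyNet and pretrain_with_bc, the synthetic demonstrations, and all hyperparameters are illustrative assumptions, not the dissertation's implementation.

```python
# Minimal behavior-cloning sketch (PyTorch). Shapes, names, and
# hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

OBS_DIM, N_ACTIONS = 16, 4  # hypothetical observation/action space sizes

class PolicyNet(nn.Module):
    """Maps an agent's partial observation to action logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, obs):
        return self.net(obs)

def pretrain_with_bc(policy, demos, epochs=10, lr=1e-3):
    """Supervised pretraining on (observation, action) pairs collected
    from a reference policy; the pretrained weights then initialize the
    policy network used by the MARL algorithm."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for obs, act in DataLoader(demos, batch_size=64, shuffle=True):
            opt.zero_grad()
            loss_fn(policy(obs), act).backward()
            opt.step()
    return policy

# Synthetic demonstrations stand in for a reference disaster-response policy.
demos = TensorDataset(torch.randn(1024, OBS_DIM),
                      torch.randint(0, N_ACTIONS, (1024,)))
policy = pretrain_with_bc(PolicyNet(), demos)
```

The same cross-entropy imitation objective can either initialize the networks before MARL training (Chapter 3) or be mixed into the MARL loss itself (Chapter 4).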
Experiments on benchmark dec-POMDP problems show that the baseline MARL algorithm obtains a better dec-POMDP policy when demonstrations from the solution of a centralized model are mixed in. A comparison test shows that mixing demonstrations is more effective than an alternative method of using demonstrations to improve an MARL algorithm. We also find that, when a reference centralized policy is not provided, investing computational budget in learning a centralized policy during the earlier training steps is effective.
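One way to realize this demonstration-mixing idea is to add a behavior-cloning term on centralized demonstrations to a standard policy-gradient loss and decay its weight over training, which echoes the finding that the imitation signal matters most in the earlier training steps. The sketch below is a hedged illustration under that assumption; the mixed loss, the 0.99 decay factor, the stand-in linear policy, and the synthetic batches are hypothetical, not the dissertation's exact algorithm.

```python
# Hedged sketch: mixing centralized-model demonstrations into a policy
# update. All constants and helpers here are illustrative assumptions.
import torch
import torch.nn.functional as F

OBS_DIM, N_ACTIONS = 16, 4
policy = torch.nn.Linear(OBS_DIM, N_ACTIONS)  # stand-in policy network
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def mixed_policy_loss(rl_batch, demo_batch, bc_weight):
    """Policy-gradient term on the agent's own rollouts plus a
    behavior-cloning term on centralized (MMDP/MPOMDP) demonstrations."""
    obs, act, adv = rl_batch
    log_prob = F.log_softmax(policy(obs), dim=-1)
    chosen = log_prob.gather(1, act.unsqueeze(1)).squeeze(1)
    rl_loss = -(chosen * adv).mean()                        # RL objective
    demo_obs, demo_act = demo_batch
    bc_loss = F.cross_entropy(policy(demo_obs), demo_act)   # imitation term
    return rl_loss + bc_weight * bc_loss

# Synthetic batches stand in for on-policy rollouts and demonstrations.
rl_batch = (torch.randn(64, OBS_DIM),
            torch.randint(0, N_ACTIONS, (64,)),
            torch.randn(64))
demo_batch = (torch.randn(64, OBS_DIM), torch.randint(0, N_ACTIONS, (64,)))

bc_weight = 1.0
for step in range(100):
    opt.zero_grad()
    mixed_policy_loss(rl_batch, demo_batch, bc_weight).backward()
    opt.step()
    bc_weight *= 0.99  # lean on demonstrations early, pure RL later
```

Annealing bc_weight toward zero lets the learner exploit the centralized demonstrations early while optimizing the decentralized return on its own later.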
Advisors
Lee, Taesik (이태식)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2020
Identifier
325007
Language
eng
Description

Doctoral dissertation - Korea Advanced Institute of Science and Technology: Department of Industrial and Systems Engineering, 2020.2, [v, 89 p.]

Keywords

Sequential decision making model; Disaster response system; Multi-agent reinforcement learning; Imitation learning; Emergency medical services system

URI
http://hdl.handle.net/10203/283609
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=908365&flag=dissertation
Appears in Collection
IE-Theses_Ph.D.(박사논문)
Files in This Item
There are no files associated with this item.
