Deep multi-agent reinforcement learning

Research Area: Machine Learning

Abstract:

Many real-world problems, such as the control of autonomous vehicles and drones or packet delivery, involve a number of agents that must act on local observations and can thus be formulated in the multi-agent reinforcement learning (MARL) setting. Furthermore, as more machine learning systems are deployed in the real world, they will start to affect each other, effectively turning most decision-making problems into multi-agent problems. In this thesis we develop and evaluate novel deep multi-agent RL (DMARL) methods that address the unique challenges that arise in these settings. These challenges include learning to collaborate, to communicate, and to reciprocate amongst agents. In most of these real-world use cases, the final policies can rely only on local observations during decentralised execution. However, in many cases it is possible to carry out centralised training, for example when training policies on a simulator or when using extra state information and free communication between agents during the training process.
The first part of the thesis investigates the challenges that arise when multiple agents need to learn to collaborate to achieve a common objective. One difficulty is multi-agent credit assignment: since the actions of all agents affect the reward of an episode, it is difficult for any individual agent to isolate the impact of its own actions on the reward. In this thesis we propose Counterfactual Multi-Agent Policy Gradients (COMA) to address this issue. In COMA, each agent estimates the impact of its action on the team return by comparing the estimated return with a counterfactual baseline.
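As a rough illustration of the counterfactual comparison in COMA, the sketch below assumes a critic that gives, for one agent, an estimated team return for each of that agent's possible actions while the other agents' actions are held fixed. The baseline is taken here as the policy-weighted average of those estimates. The function name, inputs, and numbers are hypothetical and are not taken from the thesis.

```python
def counterfactual_advantage(q_values, policy, action):
    """Advantage of one agent's chosen action against a counterfactual baseline.

    q_values[i]: assumed critic estimate of the team return if this agent
                 had taken action i, with all other agents' actions fixed.
    policy[i]:   probability this agent's policy assigns to action i.
    action:      index of the action the agent actually took.
    """
    # Baseline: expected return when only this agent's action is marginalised out.
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[action] - baseline


# Hypothetical numbers for a single agent with three actions.
q_values = [1.0, 3.0, 2.0]
policy = [0.5, 0.25, 0.25]
advantage = counterfactual_advantage(q_values, policy, action=1)
```

A positive value suggests the chosen action did better than the agent's average behaviour in that situation, which is the signal each agent would use in its policy-gradient update.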
We also investigate the importance of common knowledge for learning coordinated actions: in Multi-Agent Common Knowledge Reinforcement Learning (MACKRL) we use a hierarchy of controllers that condition on the common knowledge of subgroups of agents in order to either act in the joint-action space of the group or delegate to smaller subgroups that have more common knowledge. The key insight here is that all policies can still be executed in a fully decentralised fashion, since each agent can independently compute the common knowledge of the group.
In MARL, since all agents are learning at the same time, the world appears non-stationary from the perspective of any given agent. This can lead to learning difficulties in off-policy reinforcement learning, which relies on replay buffers. To overcome this problem we propose and evaluate a metadata fingerprint that effectively disambiguates training episodes in the replay buffer based on the time of collection and the randomness of policies at that time.
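One way to picture the fingerprint idea is to append two extra numbers to each stored observation: a marker of when the experience was collected and a measure of how random the policies were at that time. The sketch below assumes the training iteration and an exploration rate epsilon play those two roles. The function name, the normalisation, and the values are hypothetical.

```python
def add_fingerprint(observation, train_iteration, epsilon, max_iterations):
    """Return the observation with a two-element fingerprint appended.

    train_iteration / max_iterations stands in for the time of collection,
    and epsilon (an assumed exploration rate) for the randomness of the
    policies at that time.
    """
    return list(observation) + [train_iteration / max_iterations, epsilon]


# Hypothetical transition collected a quarter of the way through training.
augmented = add_fingerprint([0.2, -0.1], train_iteration=250,
                            epsilon=0.9, max_iterations=1000)
```

A value function trained on such augmented observations can, in principle, tell old replayed experience from recent experience and so attribute differences in outcomes to the changing behaviour of the other agents.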

Name of the Researcher: Jakob N Foerster

Name of the Supervisor(s): Shimon Whiteson

Year of Completion: 2018

University: University of Oxford
