For this project, I explore the application of DDPG, DDPG with parameter-space noise exploration (PSNE), TD3, and PPO.
These algorithms are applied to the Reacher and Crawler environments. Two versions of Reacher are tried: one with a single agent, the other with multiple agents.
Algorithms such as A3C and D4PG (based on DDPG) take a distributed approach to environments with multiple agents. However, in this work I focus on PPO, TD3, DDPG, and DDPG with parameter-space noise exploration (PSNE), with small adaptations for multi-agent environments.
In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many time steps as possible.
The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm. Each action is a vector of four numbers, corresponding to the torque applicable to the two joints. Every entry in the action vector is a number between -1 and 1. This is a multi-agent version (20 agents), where several identical agents each operate in their own copy of the environment.
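As a rough illustration of the shapes involved (a minimal NumPy sketch, not tied to the actual environment wrapper), the 20 parallel agents produce a (20, 33) observation matrix and expect a (20, 4) action matrix whose entries lie in [-1, 1]:

```python
import numpy as np

num_agents, obs_dim, act_dim = 20, 33, 4

# One 33-dimensional observation per agent: shape (20, 33)
observations = np.zeros((num_agents, obs_dim), dtype=np.float32)

# One 4-dimensional torque vector per agent, every entry clipped to [-1, 1]
actions = np.clip(np.random.randn(num_agents, act_dim), -1.0, 1.0)
```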
Note: The old report includes the single-agent experiment as well. It is located under 'notebooks/old'.
PPO achieves the highest score on Reacher (approximately 75), while the other algorithms reach approximately 35.
The tasks in this project are episodic, that is, the agent(s) run for a finite number of steps in the environment.
- After each episode, add up the rewards that each agent received (without discounting) to get a score for each agent. This yields N (potentially different) scores.
- Take the average of these scores yielding an average score for each episode (where the average is over all the agents).
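A minimal sketch of this scoring procedure (the function and variable names are illustrative, not from the repository):

```python
import numpy as np

def episode_score(step_rewards):
    """step_rewards: list of length T; each entry is an array of shape
    (num_agents,) with the undiscounted reward each agent got at that step."""
    per_agent_returns = np.sum(step_rewards, axis=0)  # one score per agent
    return per_agent_returns.mean()                   # average over all agents
```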
Deep Deterministic Policy Gradients (DDPG)
DDPG and DDPG + PSNE are implemented. The README is here.
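The core idea behind PSNE is to perturb the actor's weights rather than its actions. A minimal sketch of that perturbation step (the helper below is illustrative, not the repository's implementation):

```python
import copy
import torch

def perturb_actor(actor, sigma):
    """Return a copy of the actor whose parameters are perturbed with
    Gaussian noise of standard deviation sigma (parameter-space noise)."""
    noisy_actor = copy.deepcopy(actor)
    with torch.no_grad():
        for param in noisy_actor.parameters():
            param.add_(torch.randn_like(param) * sigma)
    return noisy_actor
```

Layer Normalization in the actor (see the network topology below) helps keep the effect of a given sigma comparable across layers.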
Twin Delayed Deep Deterministic Policy Gradient (TD3)
README located here.
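A minimal sketch of TD3's clipped double-Q target with target-policy smoothing (tensor and network names are illustrative; the critics are assumed to take a state-action pair):

```python
import torch

def td3_target(critic1_t, critic2_t, actor_t, next_states, rewards, dones,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Bootstrap target: smooth the target action with clipped noise,
    then bootstrap from the minimum of the two target critics."""
    with torch.no_grad():
        target_action = actor_t(next_states)
        noise = (torch.randn_like(target_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (target_action + noise).clamp(-1.0, 1.0)
        q_min = torch.min(critic1_t(next_states, next_action),
                          critic2_t(next_states, next_action))
        return rewards + gamma * (1.0 - dones) * q_min
```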
Proximal Policy Optimization (PPO)
In this particular version, the policy-gradient loss computation (and clipping) follows the PPO2 approach found in OpenAI's Baselines implementation (in TensorFlow). Theoretical explanatory material coming soon...
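The clipped surrogate objective at the heart of PPO2 can be sketched as follows (a PyTorch illustration, not the repository's exact loss code):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO2-style policy loss: clip the probability ratio to
    [1 - eps, 1 + eps] and take the pessimistic (minimum) surrogate."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()  # negate: optimizers minimize
```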
Used for DDPG and its variants, as well as TD3. It is a fully connected network with:
- 3 hidden layers of 256, 256, and 128 units
- ReLU nonlinearities for the hidden layers
- 1 output layer with tanh activation
- Layer Normalization used in the hidden layers (helps with the PSNE approach)
- Input is batch normalized
Located in agents/topologies/actor.py
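A minimal PyTorch sketch of this topology (the class name, state/action dimensions, and exact placement of the normalization layers are assumptions; the actual implementation is in agents/topologies/actor.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DDPGActor(nn.Module):
    """Batch-normalized input, three hidden layers (256, 256, 128) with
    Layer Normalization and ReLU, and a tanh output layer."""
    def __init__(self, state_dim=33, action_dim=4):
        super().__init__()
        self.input_bn = nn.BatchNorm1d(state_dim)
        self.fc1, self.ln1 = nn.Linear(state_dim, 256), nn.LayerNorm(256)
        self.fc2, self.ln2 = nn.Linear(256, 256), nn.LayerNorm(256)
        self.fc3, self.ln3 = nn.Linear(256, 128), nn.LayerNorm(128)
        self.out = nn.Linear(128, action_dim)

    def forward(self, state):
        x = self.input_bn(state)
        x = F.relu(self.ln1(self.fc1(x)))
        x = F.relu(self.ln2(self.fc2(x)))
        x = F.relu(self.ln3(self.fc3(x)))
        return torch.tanh(self.out(x))
```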
Used for PPO, it is a fully connected feed-forward network that shares its latent features between the Gaussian distribution parameter estimation (i.e. mu) and the value function (Vf). So this is a two-headed network: the Gaussian head serves as the stochastic policy, and Vf is further used for advantage estimation. Latent feature body:
- 2 hidden layers for the shared body of latent features, each with 256 units
- ReLU non-linearity used for the hidden feature layers
- Layer Normalization used between the hidden layers
Gaussian head (i.e. policy):
- One fully connected layer with tanh activation
- The sigma parameter is populated at runtime depending on the use: during training it is the annealed value, while at test time it is a very small value (effectively greedy).
- The parameters are used in a Gaussian distribution object which is then sampled accordingly
Value function (Vf) head:
- Fully connected layer
- Linear activation
Located in agents/topologies/actor.py
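A minimal PyTorch sketch of the two-headed topology (the class name, dimensions, and normalization placement are assumptions; the actual code is in agents/topologies/actor.py):

```python
import torch
import torch.nn as nn

class PPOActorCritic(nn.Module):
    """Shared latent body (2 x 256, LayerNorm + ReLU) feeding a tanh
    Gaussian-mean head (the policy) and a linear value-function head."""
    def __init__(self, state_dim=33, action_dim=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 256), nn.LayerNorm(256), nn.ReLU(),
            nn.Linear(256, 256), nn.LayerNorm(256), nn.ReLU(),
        )
        self.mu_head = nn.Linear(256, action_dim)  # tanh applied in forward
        self.vf_head = nn.Linear(256, 1)           # linear activation

    def forward(self, state, sigma):
        latent = self.body(state)
        mu = torch.tanh(self.mu_head(latent))
        # sigma is supplied at runtime: annealed during training,
        # near zero (effectively greedy) at test time
        dist = torch.distributions.Normal(mu, sigma)
        return dist, self.vf_head(latent)
```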
- python: 3.5
- tensorboardX: 1.4
- tensorboard: 1.7.0
- pytorch: 0.4.1
- numpy: 1.15.2
- Linux / OSX
To test the agents on the Reacher environment, run reacher_test_agent.py with the following option:
- -a : algorithm of choice: TD3|PPO|DDPG|DDPG_PSNE
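For example, assuming the script is invoked with the Python interpreter: python reacher_test_agent.py -a TD3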
Trained models are under:
/models/<algorithm>