
Continuous Control

Introduction

For this project, I explore the application of DDPG, DDPG with a parameter-space noise variant (PSNE), TD3, and PPO.

These algorithms are applied to the Reacher and Crawler environments. Two versions of Reacher are tried: one with a single agent, the other with multiple agents.

Algorithms such as A3C and D4PG (based on DDPG) take a distributed approach to environments with multiple agents. However, in this work I focus on PPO, TD3, DDPG, and DDPG with parameter-space noise (PSNE), with small adaptations for multi-agent environments.

Environment Description

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of the agent is to maintain its position at the target location for as many time steps as possible.

Reacher

The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm. Each action is a vector of four numbers, corresponding to the torque applied to the two joints. Every entry in the action vector is a number between -1 and 1. This is a multi-agent version (20 agents), where several identical agents each have their own copy of the environment.

Note: The old report includes the single-agent experiment as well. It is located under notebooks/old.

Training Traces
Train Comparisons

PPO achieves the highest score on Reacher (approximately 75), while the other algorithms reach approximately 35.

Scoring

The tasks in this project are episodic; that is, the agent(s) run for a finite number of steps in the environment.

  • After each episode, add up the rewards that each agent received (without discounting) to get a score for each agent. This yields N (potentially different) scores.
  • Take the average of these scores yielding an average score for each episode (where the average is over all the agents).
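As a minimal sketch of this scoring rule, assuming `episode_rewards` is a hypothetical list of per-step reward vectors (one entry per agent):

```python
import numpy as np

def episode_score(episode_rewards):
    """Score for one episode: average over agents of each agent's
    undiscounted reward sum (as described in the list above)."""
    rewards = np.asarray(episode_rewards)       # shape: (num_steps, num_agents)
    per_agent_scores = rewards.sum(axis=0)      # N (potentially different) scores
    return per_agent_scores.mean()              # average over the N agents

# Example: 3 steps, 2 agents -> each agent accumulates 0.2, so the score is ~0.2
print(episode_score([[0.1, 0.0], [0.1, 0.1], [0.0, 0.1]]))
```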

Policy-gradient continuous control algorithms implemented

Deep Deterministic Policy Gradient (DDPG)

DDPG and DDPG + PSNE are implemented. The readme is here
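For illustration, parameter-space noise perturbs a copy of the actor's weights rather than the actions themselves. A minimal sketch of that idea (the `perturb_actor` helper and the `sigma` scale are illustrative, not this repo's exact API):

```python
import copy
import torch

def perturb_actor(actor, sigma=0.1):
    """Return a copy of the actor whose weights are perturbed with Gaussian
    noise (parameter-space noise), leaving the original actor intact."""
    noisy_actor = copy.deepcopy(actor)
    with torch.no_grad():
        for param in noisy_actor.parameters():
            param.add_(sigma * torch.randn_like(param))
    return noisy_actor
```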

Twin Delayed Deep Deterministic Policy Gradient (TD3)

The readme is located here
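As a reminder of the core TD3 ingredients (twin critics, target-policy smoothing, clipped double-Q targets), here is a hedged sketch of the target computation; the tensor names and the noise/clipping constants are illustrative defaults, not necessarily the values used in this repo:

```python
import torch

def td3_target(rewards, next_states, dones, target_actor,
               target_critic1, target_critic2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Clipped double-Q target with target-policy smoothing (TD3)."""
    with torch.no_grad():
        # Target-policy smoothing: add clipped Gaussian noise to the target action.
        actions = target_actor(next_states)
        noise = (noise_std * torch.randn_like(actions)).clamp(-noise_clip, noise_clip)
        next_actions = (actions + noise).clamp(-1.0, 1.0)

        # Clipped double-Q: take the minimum of the two target critics.
        q_next = torch.min(target_critic1(next_states, next_actions),
                           target_critic2(next_states, next_actions))

        return rewards + gamma * (1.0 - dones) * q_next
```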

Proximal Policy Optimization (PPO)

In this particular version, the policy-gradient loss computation (and clipping) follows the PPO2 approach found in OpenAI's Baselines implementation (in TensorFlow). Theoretical explanatory material coming soon ...
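The PPO2-style clipped surrogate objective boils down to the following; this is a generic PyTorch sketch of the clipping, not a verbatim excerpt of this repo or of OpenAI Baselines:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss (PPO2-style). Returns a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```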

Networks

Deterministic Actor

Used for DDPG (and its variants) and TD3. It is a fully connected network with the following layout (a sketch follows the list):

  • 3 hidden layers of 256,256,128 units
  • ReLU nonlinearities for the hidden layers
  • 1 output layer with tanh activation
  • Layer Normalization used in hidden layers (helps with PSNE approach)
  • Input is batch normalized

Located in agents/topologies/actor.py
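A minimal PyTorch sketch of such a network, assuming the Reacher sizes (33 observations, 4 actions); layer names are illustrative, see agents/topologies/actor.py for the actual implementation:

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """Fully connected actor: batch-normalized input, three hidden layers
    (256, 256, 128) with LayerNorm + ReLU, and a tanh output in [-1, 1]."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.input_bn = nn.BatchNorm1d(state_size)
        self.fc1, self.ln1 = nn.Linear(state_size, 256), nn.LayerNorm(256)
        self.fc2, self.ln2 = nn.Linear(256, 256), nn.LayerNorm(256)
        self.fc3, self.ln3 = nn.Linear(256, 128), nn.LayerNorm(128)
        self.out = nn.Linear(128, action_size)

    def forward(self, state):
        x = self.input_bn(state)
        x = torch.relu(self.ln1(self.fc1(x)))
        x = torch.relu(self.ln2(self.fc2(x)))
        x = torch.relu(self.ln3(self.fc3(x)))
        return torch.tanh(self.out(x))
```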
Stochastic Actor

Used for PPO, it is a fully connected feed-forward network in which the latent features are shared between the Gaussian distribution parameter estimation (i.e. mu) and the value function (Vf). So this is a two-headed network. The Gaussian head serves as the stochastic policy, and Vf is used for the advantage estimation. A sketch of the full network follows the lists below. Latent feature body:

  • 2 hidden layers for the shared body of latent features. Each layer of 256 units
  • ReLU non-linearity used for hidden feature layers
  • Layer Normalization used between the hidden layers

Gaussian head (i.e. policy):

  • One fully connected layer
  • tanh activation
  • Sigma parameter is populated at runtime depending on the use: during training it is the annealed value, while at test time it is a very small value (--> near-greedy).
  • Parameters are used in a Gaussian distribution object which is then sampled accordingly

Value function (VF) head

  • Fully connected layer
  • Linear activation

Located in agents/topologies/actor.py
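A minimal sketch of the two-headed network described above, with the layer sizes from the lists; the sigma handling here is simplified to a value passed in at call time rather than the repo's annealing schedule:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class StochasticActorCritic(nn.Module):
    """Shared latent body (2 x 256, ReLU + LayerNorm) with a tanh-squashed
    Gaussian-mean head (the policy) and a linear value head (Vf)."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1, self.ln1 = nn.Linear(state_size, 256), nn.LayerNorm(256)
        self.fc2, self.ln2 = nn.Linear(256, 256), nn.LayerNorm(256)
        self.mu_head = nn.Linear(256, action_size)  # Gaussian mean, tanh activation
        self.v_head = nn.Linear(256, 1)             # state value, linear activation

    def forward(self, state, sigma):
        x = torch.relu(self.ln1(self.fc1(state)))
        x = torch.relu(self.ln2(self.fc2(x)))
        mu = torch.tanh(self.mu_head(x))
        value = self.v_head(x)
        # sigma: annealed during training, very small at test time
        dist = Normal(mu, sigma)
        return dist, value
```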

Dependencies

  • python: 3.5
  • tensorboardX: 1.4
  • tensorboard: 1.7.0
  • pytorch: 0.4.1
  • numpy: 1.15.2
  • Linux / OSX
Testing the algorithms

To test the agents on Reacher, run reacher_test_agent.py with the following option (see the example below):

  • -a : algorithm of choice: TD3|PPO|DDPG|DDPG_PSNE
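For example, to evaluate the TD3 agent:

```
python reacher_test_agent.py -a TD3
```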
Pre-trained Models

Trained models are under:

/models/<algorithm>
