This tutorial reviews the State Of The Art (SOTA) of RL using the Deep RL Bootcamp. Techniques will be benchmarked using OpenAI gym-based environments.
(video)
The lesson covers how to solve MDPs using exact methods:
- Value iteration
- Policy iteration
Limitations are:
- Iteration over/storage for all states, actions and rewards requires a small and discrete state-action space
- Update equations require access to the dynamical model (of the robot or whatever the agent is)
$Q^*(s,a)$ = expected utility/reward/value starting in state $s$, taking action $a$ and (thereafter) acting optimally.
This Q-value function satisfies the Bellman Equation:
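For reference, the standard form of this equation is:

$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$$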
For solving for $Q^*$, an iterative method named "Q-value iteration" is proposed.
Very simple: initialize all estimates to 0. Within each iteration we replace $Q^*$ by the current estimate $Q_k$ and compute $Q_{k+1}$.
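A minimal sketch of this iteration in Python/numpy, assuming a small tabular MDP given as arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (rewards); the names and shapes are illustrative, not from the lecture:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.95, n_iters=100):
    """Tabular Q-value iteration.

    P: (S, A, S) array with P[s, a, s'] = P(s'|s,a)
    R: (S, A, S) array with R[s, a, s'] = R(s,a,s')
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))          # initial estimate: all zeros
    for _ in range(n_iters):
        V = Q.max(axis=1)         # V_k(s') = max_a' Q_k(s', a')
        # Q_{k+1}(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V_k(s'))
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q
```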
Improving the policy over time.
- Remember, value iteration:
In policy evaluation we don't get to choose the action; it's frozen to $\pi(s)$. We compute the value of that policy, $V^\pi$. We are now in a different MDP where, in every state, the action has been chosen for you; there's no choice.
(video)
Given the limitations of the previous lesson, we'll address them as follows:
- Iteration over/storage for all states, actions and rewards requires a small and discrete state-action space -> sampling-based approximations
- Update equations require access to the dynamical model (of the robot or whatever the agent is) -> Q/V function fitting methods.
Given Q-value iteration as:
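$$Q_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right]$$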
This equation is pretty expensive from a computational perspective. Consider a robot: you'd need to know each of the possible future states and compute the corresponding probability of reaching them.
Rewrite it as an expectation:
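$$Q_{k+1}(s,a) \leftarrow \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right]$$

The expectation can then be approximated with samples, using a learning rate $\alpha$:

$$Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a')\right]$$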
This results in an algorithm called "tabular Q-learning"; a sketch follows:
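A minimal sketch of tabular Q-learning with $\epsilon$-greedy exploration, assuming a small discrete environment such as gym's `FrozenLake-v0` and the older gym API; the environment name and hyperparameters are illustrative:

```python
import numpy as np
import gym

env = gym.make("FrozenLake-v0")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration

for episode in range(5000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        a = env.action_space.sample() if np.random.rand() < epsilon else Q[s].argmax()
        s2, r, done, _ = env.step(a)
        # sample-based approximation of the Q-value iteration backup
        target = r + gamma * (0.0 if done else Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
```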
During the initial phases of learning, choosing actions greedily isn't optimal. If you only choose actions greedily you restrict yourself from exploring alternative strategies.
Q-learning converges to the optimal policy -- even if you're acting suboptimally! This is called off-policy learning. Caveats:
- You have to explore enough
- You have to eventually make the learning rate small enough
- ... but not decrease it too fast, otherwise it will converge to wrong values.
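For reference, the classical (Robbins-Monro) conditions on the learning rate $\alpha_t$ are:

$$\sum_t \alpha_t = \infty \quad\text{and}\quad \sum_t \alpha_t^2 < \infty$$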
The reinforcement learning performed by the human race is a collective learning practice: we're not just looking at our own experience but also at the experience of other human beings.
Getting tabular Q-learning to work is infeasible for most environments, so "Approximate Q-learning" is introduced. The Q-function is typically represented through a parametrized function $Q_\theta(s,a)$, which can be a linear function or a more complicated neural net.
In a sense, "tabular Q-learning" is a special case of "Approximate Q-learning" (one parameter per state-action pair).
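In the approximate case, the standard (semi-gradient) update on the parameters takes the form:

$$\theta \leftarrow \theta + \alpha \left(r + \gamma \max_{a'} Q_\theta(s',a') - Q_\theta(s,a)\right) \nabla_\theta Q_\theta(s,a)$$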
(video)
The whole idea behind DQN is to make Q-learning look like it's supervised learning.
Two ideas to address the correlation between updates:
- Experience replay -> better data efficiency
- Use an older set of weights to compute the targets (target network). This helps because if we fix the target weights we effectively have a fixed network producing our target estimates, and the experience replay buffer acts like a dataset we sample from; together, the Q-updates start to look like regression against an (almost) fixed set of targets.
When one runs DQN on a new problem, it might take many transitions to get some reasonable results.
- Initialize replay memory $D$ to capacity $N$
- Initialize action-value function $Q$ with random weights $\theta$
- Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
- For episode $= 1, M$ do:
  - Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
  - For $t = 1, T$ do:
    - With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = argmax_a Q(\phi(s_t),a;\theta)$
    - Execute action $a_t$ in the emulator and observe reward $r_t$ and image $x_{t+1}$
    - Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
    - Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$
    - Sample a random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$
    - Set $y_j = r_j$ if the episode terminates at step $j+1$, otherwise $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$
    - Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to the network parameters $\theta$
    - Every $C$ steps reset $\hat{Q} = Q$
  - End For
- End For
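A minimal sketch of the two key ingredients above (replay buffer and frozen-target computation); `target_q` stands for whatever callable evaluates the target network on a batch of states, and all names are illustrative rather than from the lecture:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, r, s2, done) transitions."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, done = map(np.array, zip(*batch))
        return s, a, r, s2, done

def dqn_targets(r, s2, done, target_q, gamma=0.99):
    """y_j = r_j + gamma * max_a' Q_target(s2_j, a'), or just r_j at terminal states."""
    return r + gamma * (1.0 - done) * target_q(s2).max(axis=1)
```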
Preprocessed elements correspond to 4 frames of the game stacked together as an input to the network representing the state (this worked for them).
Value-based methods tend to be more robust parameter-wise, much more so than policy gradient methods. People at DeepMind run these algorithms with barely any variation in the hyperparameters.
Double DQN, an upgrade of DQN, exploits the fact that you have two networks: the online network and the target network. The idea is to use your online network to select the best action, but then use the target network to get the value estimate; you separate action selection from value estimation.
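Continuing the sketch above, the Double DQN target only changes how the backed-up action is chosen (illustrative names again):

```python
import numpy as np

def double_dqn_targets(r, s2, done, online_q, target_q, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best_a = online_q(s2).argmax(axis=1)                    # action selection: online network
    q_eval = target_q(s2)[np.arange(len(best_a)), best_a]   # value estimate: target network
    return r + gamma * (1.0 - done) * q_eval
```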
Dueling DQN makes two separate channels:
- a channel that outputs a single number, the value
- and another channel that outputs one number per action, the advantage

Summing these up (to get the output estimate of the Q-value) works much better in practice.
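A sketch of how the two streams are combined; subtracting the mean advantage is the identifiability trick from the dueling-architecture paper (not stated explicitly in these notes):

```python
def dueling_q(value, advantage):
    """Combine a value stream of shape (batch, 1) with an advantage stream of
    shape (batch, n_actions) into Q-values of shape (batch, n_actions)."""
    return value + advantage - advantage.mean(axis=1, keepdims=True)
```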
One can go further with exploration: a good way of doing exploration with NNs is to add noise to the parameters of the NNs.
(video)
Learning $Q$ or $V$ can be really complicated, e.g. in a robotic grasp.
- Q-learning with continuous or high-dimensional action spaces is tricky. Reason: the argmax computation is tricky. DQN has an output per possible action, but how can we deal with it if we have a continuous set of actions?
DQN has one output per action and picks the one with the highest score. In the continuous case we can't have an output per action, so we then need to take the action as an input: the input is the action and the state, and the output is the value of that action/state combination. To pick actions we then need to solve a difficult optimization problem.
Methods described will be on-policy. Dynamic programming methods (e.g. DQN) explore better and are more sample efficient.
Typically, policy optimization is easier to get to work.
The derivation done shows that it's valid even when the reward function is discontinuous or even unknown.
The sign of the reward plays a relevant role in policy gradient methods, since the gradient (see the estimator below):
- increases the probability of paths with positive reward R.
- decreases the probability of paths with negative reward R.
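The gradient referred to is the likelihood-ratio policy gradient, in its standard form:

$$\nabla_\theta \mathbb{E}_\tau\left[R(\tau)\right] = \mathbb{E}_\tau\left[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\right] = \mathbb{E}_\tau\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right) R(\tau)\right]$$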
A baseline $b$ is introduced to improve the formulation. Such a baseline doesn't affect the expectation of the gradient as long as it doesn't depend on the action. A natural choice is the value function, which tells you how much reward you expect to get.
Explanation at https://youtu.be/S_gwYj1Q-44?t=36m:
Algorithm: Vanilla Policy Gradient [Williams, 1992]
- Initialize policy (e.g. a NN) parameter $\theta$ and baseline $b$
- for iteration $= 1, 2, \ldots$ do
  - Collect a set of trajectories by executing the current policy
  - At each timestep $t$ in each trajectory, compute the return $R_t = \sum_{t'=t}^{T-1} \gamma^{t'-t}r_{t'}$, and the advantage estimate $\hat{A}_t = R_t - b(s_t)$.
  - Re-fit the baseline (recomputing the value function) by minimizing $|| b(s_t) - R_t||^2$, summed over all trajectories and timesteps.
  - Update the policy, using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log\pi(a_t | s_t,\theta)\hat{A}_t$
- end for
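A minimal numpy sketch of the return and advantage computation inside the loop above, for a single trajectory (the function name and signature are illustrative):

```python
import numpy as np

def returns_and_advantages(rewards, baseline_values, gamma=0.99):
    """R_t = sum_{t'>=t} gamma^(t'-t) * r_t'   and   A_t = R_t - b(s_t)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns, returns - np.asarray(baseline_values)
```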
Among the demos shown, a robot learns how to run in about 2000 episodes, which according to them could take 2 weeks of real training.
Simulation to real robot explained at https://youtu.be/S_gwYj1Q-44?t=50m20s. Simulating and transferring it into the real robot is an active research line. One interesting approach is to run experiments in a number of different randomized scenarios under the same policy (e.g.: using different physics engines).
Question asked about how to make a robot do specific things once it has learned basic behaviors (e.g.: walk). Answer was that problem had to be set up slightly differently. Instead of inputing to the NNs only states you would also input the direction of the goal. Paper "Hindsight Experience Replay" describes this method.
(video)
Accounts for the total amount of parameters at https://youtu.be/tqrcjHuNdmQ?t=3m34s as follows:
# 80x80 pixel images
# input layer to (first and only) hidden layer
# 200 neurons in the hidden layer
80*80*200 (weights) + 200 (biases)
# hidden to output layer
200*1 (weights) + 1 (biases)
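Adding these up: 80·80·200 + 200 + 200·1 + 1 = 1,280,401 parameters (roughly 1.3M).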
There's no way to learn motion from a single static frame, so what they do is concatenate frames together or use the difference between frames.
Interesting way of putting Supervised Learning and Reinforcement Learning:
Supervised Learning | Reinforcement Learning |
---|---|
We try to maximize $\sum_i \log p(y_i \mid x_i)$ for images $x_i$ and labels $y_i$ | we have no labels, so we sample $a_i \sim p(\cdot \mid x_i)$ |
| once we collect a batch of rollouts, we maximize $\sum_i A_i \log p(a_i \mid x_i)$ |
where $A_i$ is called the advantage ($-1$ or $+1$ depending on the result).
Discounting is a heuristic to modulate how much blame is assigned for a bad (or good) reward. Typically represented as $\gamma$.
Simple implementation of a policy gradient using numpy. The code at https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5 is explained nicely starting at https://youtu.be/tqrcjHuNdmQ?t=23m19s (a good exercise would be to implement it using Tensorflow).
(video)
Lesson about more advanced optimization methods to be used with Policy Gradients (PG).
Two limitations of "Vanilla Policy Gradient methods":
- hard to choose stepsizes (observation and reward distributions may change)
- sample efficiency (slow learning)
How to use all data available and compute the best policy:
- In Q-Learning we're optimizing the wrong objective. In Q-Learning you're trying to minimize some kind of Bellman error when what you care about is that your policy performs well.
- PG optimizes the thing we care about, but the downside is that it is not good at using all of the data we have available.
It appears that we can solve this constrained optimization problem (TRPO: maximize a surrogate objective subject to a KL-divergence constraint between the new and the old policy) efficiently by using conjugate gradient. This method is closely related to natural policy gradients, natural actor-critic and others.
Really theoretical. Summary of the theoretical methods at https://youtu.be/xvRrgxcpaHY?t=27m16s.
Limitations of TRPO:
- Hard to use with architectures with multiple outputs, e.g. if the policy outputs both action probabilities and the value function, then it's not clear what you should use instead of the KL divergence.
- Empirically it performs poorly on tasks requiring deep CNNs and RNNs
- Conjugate Gradient makes the implementation less flexible (a little bit more complicated)
Two alternatives to TRPO:
- ACKTR: apply the KFAC idea (a blockwise approximation to the Fisher Information Matrix, FIM) to A2C.
- PPO: use a penalty instead of the constraint.
The PPO algorithm outline:
- For iteration $= 1, 2, \ldots$ do
  - Run the policy for $T$ timesteps or $N$ trajectories
  - Estimate the advantage function at all timesteps
  - Do SGD on the objective for some number of epochs
- End For
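A sketch of the penalized surrogate objective that the SGD step above optimizes (the penalty form mentioned earlier; $\beta$ weights the KL term):

$$\max_\theta \; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\, \hat{A}_t \;-\; \beta\, KL\left[\pi_{\theta_{old}}(\cdot|s_t),\, \pi_\theta(\cdot|s_t)\right]\right]$$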
It's:
- Much better than TRPO for continuous control (robotics), and much better on Atari.
- Compatible with RNNs and multiple output networks.
Talk about tips on how to make decisions when working with RL.
Quick tips for new algorithms:
- Small environment
- Visualize everything
- If working with hierarchical RL, we should have a way to identify what the algorithm is doing.
- Don't over-optimize things when working with toy problems
- Intuition: Find out your own set of problems that you know very well and that allow you to determine if an algorithm is learning or not.
Quick tips for a new task:
- Make things easier
- Provide good input features. E.g.: start understanding the problem by switching to x-y coordinates rather than starting with full RGB images.
- shape the reward function: come up with a reward function that gives you fast feedback of whether you're doing the right thing or not.
POMDP design tips:
- Visualize what a random policy does; if it occasionally does something sort-of-OK, then RL is likely the right approach.
- Make sure everything (obs and rewards mainly) is reasonably scaled: roughly mean 0 and standard deviation 1.
Run your baselines:
- default parameters are useless
- Recommendation:
- Cross-entropy method (Wikipedia, Gist implementation)
- Well-tuned policy gradient method (e.g.: PPO)
- Well-tuned Q-learning + SARSA method
Usually things work better when you have more samples (per batch). Bigger batches are recommended.
Explore sensitivity to each parameter: if the result is too sensitive, you probably got lucky and the method is not robust.
Have a system for continually benchmarking your code. Run a battery of benchmarks occasionally. CI? Automate your experiments.
Standardizing data:
- Rescale observations with a running estimate of their mean and standard deviation (see the sketch below).
- Rescale rewards, but don't shift the mean, as that affects the agent's will to live.
- Standardizing prediction targets is more complicated.
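A minimal sketch of a running mean/std rescaler for observations, as hinted above (the class name and the Welford-style update are illustrative, not from the lecture):

```python
import numpy as np

class RunningStandardizer:
    """Keep a running estimate of mean/variance and rescale observations with it."""
    def __init__(self, shape, eps=1e-4):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, x):
        # incremental (Welford-style) update with a single observation
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / (np.sqrt(self.var) + 1e-8)
```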
Generally important parameters:
- Watch out with $\gamma$: it can effectively ignore rewards delayed by many timesteps (roughly more than $1/(1-\gamma)$ steps ahead).
- Action frequency: make sure the task is still human-solvable at the chosen action frequency.
General RL diagnostics:
- Look at episode returns min/max and stdev along with mean.
- Look at episode lengths (sometimes more informative than the reward); they can provide additional information (e.g.: solving the problem faster, losing the game more slowly).
Entropy as a diagnostic:
- If your entropy is going down really fast, it means the policy is becoming deterministic and it's not going to explore anything. Be careful also if it's not going down at all (the policy stays totally random). How do you measure entropy? For most policies you can compute the entropy analytically; for continuous Gaussian policies you can compute the differential entropy.
- Use entropy bonus
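For reference, for a discrete policy the entropy is $H(\pi(\cdot|s)) = -\sum_a \pi(a|s)\log\pi(a|s)$, and for a Gaussian policy with standard deviation $\sigma$ the differential entropy is $\frac{1}{2}\log(2\pi e\,\sigma^2)$ per action dimension.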
Baseline explained variance: check how well the value-function baseline predicts the empirical returns (see below).
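A common way to compute it (as used in popular RL codebases; sketch):

$$\text{EV} = 1 - \frac{\mathrm{Var}\left[R_t - \hat{V}(s_t)\right]}{\mathrm{Var}\left[R_t\right]}$$

Values near 1 mean the baseline predicts the returns well; values near 0 (or negative) mean it is no better than predicting the mean.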
Policy initialization is relevant, more so than in supervised learning. Use a zero or tiny final layer, to maximize entropy initially.
- Optimize memory usage (replay buffer)
- Learning rate matters; schedules are often useful
- DQN converges slowly and has a mysterious warmup period
- Read older textbooks and theses
- Don't get too stuck on problems
- DQN performs pretty poorly for continuous control
- Techniques from supervised learning don't necessarily work in RL (e.g.: batch norm, dropout, big networks)
- How long do you wait until you decide if the algorithm works or not? No straight answer. For PG, it typically learns fast.
- Do you use unit tests? Only for particular mathematical things.
- Do you have guidelines on how to match the algorithm to a particular task? People have found that PG methods are probably the way to go if you don't care about sample complexity; PG is probably the safest way to go. DQN is a bit indirect in what it's doing. If you care about sample complexity or need off-policy data, then DQN. Sample complexity is relevant if your simulator is expensive. People have found that DQN works well with images as inputs while PG methods work better in continuous control tasks, but that might be a historical accident.
- Recommendations on textbooks:
- Optimal control and dynamic programming
- ... (didn't catch them)
- Comments on evolution strategies: there are lots of PG methods, some complicated, and evolution strategies (a simple algorithm) are an alternative to PG. OpenAI (and others) claimed ES works as well as PG. His opinion is that the sample complexity is worse by a constant factor (1, 3, 100?). This might change between problems. ES might be a good alternative for problems with time dependencies.
- Framework for hyperparameter optimization: He uses uniform sampling. Works really well.
Two papers of interest for robotics regarding the SOTA:
The value function gives the expected future discounted return.
The Bellman equation in a slightly different form can be written as:

$$Q^\pi(s,a) = \mathbb{E}\left[R(s,a)\right] + \gamma\, \mathbb{E}\left[Q^\pi(s',a')\right]$$

$Z^\pi(s,a)$ will represent the distribution of the return of taking action $a$ in state $s$. The distributional Bellman equation can be expressed as:

$$Z^\pi(s,a) \overset{D}{=} R(s,a) + \gamma\, Z^\pi(S',A')$$

(capital letters represent random variables)
C51 is a specific instantiation of using this previous idea of distributions:
- C51 outputs a 51-way softmax for each Q-value.
- Bin the return into 51 equally-sized bins between -10 and 10
- The Q-value update becomes categorical (a classification-style update instead of the usual regression)
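A tiny sketch of the fixed support the 51-way softmax is defined over, using the numbers above (variable names are illustrative):

```python
import numpy as np

n_atoms, v_min, v_max = 51, -10.0, 10.0
z = np.linspace(v_min, v_max, n_atoms)       # the 51 return atoms ("bins")
delta_z = (v_max - v_min) / (n_atoms - 1)    # spacing between atoms

# Given the per-action softmax output p over the atoms,
# the Q-value is the expectation: Q(s, a) = sum_i z[i] * p[i]
```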
Pseudocode: Categorical Algorithm
... (complete algorithm, minute 8:42)
Results show that C51 improves on DQN by 700%. Furthermore, using categorical classification (rather than regression) gets good results. It was a huge advance.
Among the good things about Distributional RL, data efficiency is greatly improved. Why does it work so well? Don't really know. Is it learning to model uncertainty?
PG methods might obtain better results on the different environments, but the hyperparams need to be tuned for each environment.
Big expectations about it: they think it'll be pretty important, especially for cases where you have a simulator that's really expensive or you want to do something in the real world, like with robots.
An interesting new approach that includes the parametric approach (which is used in all typical methods) and a non-parametric one. Good results on the environments presented.
Intuitively appealing, but it hasn't led to massive breakthroughs just yet.
The basic idea is to capture the structure in the policy space. Goes back to 1993 with "Feudal Reinforcement Learning".
It feels like in the next few years, learning in robots will get us to the point of having robots do really cool stuff. All the methods they looked at so far were really data-inefficient.
He claimed we were close to breakthrough (or already on it). Two main approaches:
- Learning in simulation and then doing transfer learning, adding variation to the simulation to close the simulation-to-reality gap.
- Imitation learning will be important.
Small changes in the observations lead to the robot failing to generalize.
We can't build a system that doesn't make mistakes. But maybe we can learn from those mistakes.
What does it take for RL to learn from its mistakes in the real world?
- Focus on diversity rather than proficiency
- Self-supervised algorithms, they must have access to the reward.
- Learn quickly and efficiently
- Fix mistakes when they happen, adaptability.
Current benchmarks in robotics do not follow the goals of image benchmarks (e.g.: ImageNet): they are small-scale, emphasize mastery and are evaluated on performance.
By self-supervised learning with robots, Sergey means to have the robot learn a task by itself without human intervention (or with as little as possible).
The open loop (just looking once at the image) works much worse (33.7% failures) than the closed loop (17.5% failures). There's a big benefit to doing continuous control.
Training with multiple robots improves the performance even more (minute 16:30). They first trained the whole network on one robot and then on the other robot; the end effector is the same, the rest is different. Work by Julian Ibarz.
A few comments/ideas after watching this:
- Modular robots can extend their components seamlessly.
- This brings clear advantages for the construction of robots however training them with DRL becomes expensive due to the following reasons:
- Every small change in the physical structure of the robot will require a complete retrain.
- Building up the tools to train modular robots (simulation model, virtual drivers for the different actuators, etc.) becomes very expensive.
- Transferring the results to the real robot is complex given the flexibility of these systems.
- We present a framework/method that makes use of traditional tools in the robotics environment (e.g. ROS, Gazebo, etc.) which simplifies the process of building modular robots and their corresponding tools to be trained using DRL techniques. Besides integration with the traditional robotics tools, our framework/method includes baseline implementations for the most common DRL techniques for both value and policy iteration methods. Using this framework we present the results obtained benchmarking DRL methods in a modular robot. Configurations with 3 and 4 degrees of freedom are presented while performing the same task. In addition, we present our insights about the impact of the simulation acceleration in the final reward and conclude with our hypotheses about
Future work: a) generalization between different DOFs (refer to previous work from Julian Ibarz where the overall performance got better after training in two different robots), b) further work on transfer learning with simulation variance, c) inclusion of imitation learning to accelerate the process of learning a particular task, d) inclusion of images to get a much richer signal from where to learn.
Conclusions:
- Change titles to:
- DRL methods/framework for modular robots
- An information model for modular robots
- Participate in the Amazon Picking Challenge with a modular robot and DRL?
Semantic picking results presented. Pretty nice.
(imitation learning?)
...