Pole Balancer is a Python program that uses reinforcement learning (RL) to automatically learn a policy for the classic control problem of balancing a pole on a cart. Using the Markov decision process (MDP) framework, the program learns without any explicit knowledge of the physics of the underlying system, in our case the pole on the cart.
- Ubuntu 18.04+, macOS 10.15+ and Windows 10+ (64-bit)
- At least 5 GB of memory
- Anaconda/Miniconda
- Python 3.6 or above
- A Python IDE (Jupyter/PyCharm)
Install the following Python packages:
- matplotlib
- numpy
- scipy
- pillow
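These can be installed with pip (or the equivalent conda commands), for example:

pip install matplotlib numpy scipy pillow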
Clone the repository
git clone https://github.com/avrumnoor/PoleBalancer.git
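Then change into the cloned directory:

cd PoleBalancer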
Run the program
python polebalancer.py
A thin pole is hinged to a cart that moves laterally on a smooth table surface. The simulation fails if either the pole's angle deviates from the vertical by more than a set threshold (i.e., the pole falls over), or the cart's position goes out of bounds (i.e., it falls off the end of the table).
The goal is to balance the pole subject to these constraints by having the cart accelerate appropriately to the left or right.
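For concreteness, here is a minimal sketch of the failure test described above. The specific angle and position limits are illustrative assumptions; the actual thresholds are defined inside the simulator.

```python
import math

# Illustrative limits only; the real thresholds live in the simulator.
POLE_ANGLE_LIMIT = math.radians(12)   # assumed maximum deviation from vertical (rad)
CART_POSITION_LIMIT = 2.4             # assumed half-length of the track (m)

def has_failed(cart_position, pole_angle):
    """Return True if the pole has fallen over or the cart has left the table."""
    return abs(pole_angle) > POLE_ANGLE_LIMIT or abs(cart_position) > CART_POSITION_LIMIT
```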
- Estimate a model (i.e., transition probabilities and rewards) for the underlying MDP.
- Solve Bellman's equations for this estimated MDP to obtain a value function.
- Act greedily with respect to this value function.
- Initially, each state has estimated reward zero, and the estimated transition probabilities are uniform.
- As the program takes actions, it gathers observations of transitions and rewards, which it uses to improve its estimate of the MDP model.
- Store the state transitions and reward observations each time, and update the model and value function/policy only periodically.
- Each time a failure occurs, re-estimate the transition probabilities and rewards as the average of the observed values (if any).
- Repeat the previous steps until convergence: once several consecutive attempts to solve Bellman's equations (the number is set by the parameter NO_LEARNING_THRESHOLD) all converge in the first iteration, the estimated model has stopped changing significantly. A sketch of this loop is given below.
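The following is a minimal sketch of the learning loop outlined above. The state/action discretization, constants, and helper names (e.g. NUM_STATES, estimate_model) are assumptions for illustration and do not reflect the repository's actual code.

```python
import numpy as np

# Illustrative sizes and constants; the real program defines its own.
NUM_STATES, NUM_ACTIONS = 163, 2   # discretized states, {accelerate left, accelerate right}
GAMMA = 0.995                      # discount factor (assumed)
TOLERANCE = 0.01                   # value-iteration stopping tolerance (assumed)

# Running counts of observed transitions and rewards, stored after each step.
transition_counts = np.zeros((NUM_STATES, NUM_ACTIONS, NUM_STATES))
reward_sums = np.zeros(NUM_STATES)
reward_counts = np.zeros(NUM_STATES)

def estimate_model():
    """Re-estimate transition probabilities and rewards as the average of the
    observations; unobserved (state, action) pairs keep a uniform prior and zero reward."""
    probs = np.full((NUM_STATES, NUM_ACTIONS, NUM_STATES), 1.0 / NUM_STATES)
    totals = transition_counts.sum(axis=2)
    seen = totals > 0
    probs[seen] = transition_counts[seen] / totals[seen][:, None]
    rewards = np.zeros(NUM_STATES)
    visited = reward_counts > 0
    rewards[visited] = reward_sums[visited] / reward_counts[visited]
    return probs, rewards

def value_iteration(probs, rewards):
    """Solve Bellman's equations by value iteration; also report how many sweeps
    were needed, which drives the NO_LEARNING_THRESHOLD convergence test."""
    V = np.zeros(NUM_STATES)
    for sweep in range(1, 10000):
        Q = rewards[:, None] + GAMMA * np.einsum('sat,t->sa', probs, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < TOLERANCE:
            return V_new, sweep
        V = V_new
    return V, sweep

def greedy_action(state, probs, rewards, V):
    """Act greedily with respect to the current value function."""
    q = rewards[state] + GAMMA * probs[state] @ V
    return int(np.argmax(q))
```

In the full program, these pieces sit inside an outer loop that stores transition and reward observations after each action, re-estimates the model and re-solves for the value function after each failure, and stops once NO_LEARNING_THRESHOLD consecutive value-iteration solves each converge in a single sweep.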
Avrum Noor
Stanford Machine Learning Coursework