Reinforcement Learning Part 2

Part 2 of my Reinforcement Learning (RL) series. Throughout this series, I delve into the field of RL by applying various methods to video games to learn and understand how an algorithm can teach itself to play. My motivation for doing this series is pure interest and a desire to gain knowledge and experience in the field of Machine Learning.

The literature followed throughout this series is Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto. ISBN: 9780262039246.

Gambler's Problem

The Gambler's Problem is a straightforward problem which can be solved by applying the Value Iteration method described in part 1 of this series. A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as much money as he has staked; otherwise he loses his stake. The game ends when the gambler reaches 100 EUR or runs out of money. The reward is 1 on the transition that reaches 100 EUR and 0 on all other state transitions. There are two restrictions: (1) the gambler can't stake more money than he has, and (2) all stakes are positive integers.

Value Iteration

We define the state s of the environment as the gambler's current capital, and the set of actions as a ∈ {0, 1, ..., min(s, 100 - s)}. Here we've also set the probability of heads, ph, to 0.4.
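For reference, here is a minimal Python sketch of Value Iteration for this setup. It follows the problem description above, but the function and parameter names (value_iteration, p_heads, theta) are my own and may not match the code in this repository.

```python
import numpy as np

def value_iteration(p_heads=0.4, goal=100, theta=1e-9):
    # V[s] estimates the value of holding capital s; V[0] and V[goal] stay 0
    # because the +1 reward is collected on the transition into the goal.
    V = np.zeros(goal + 1)

    def action_returns(s):
        # Expected return of every stake a in {0, 1, ..., min(s, goal - s)}.
        returns = []
        for a in range(min(s, goal - s) + 1):
            win_reward = 1.0 if s + a == goal else 0.0
            returns.append(p_heads * (win_reward + V[s + a])
                           + (1.0 - p_heads) * V[s - a])
        return returns

    # Sweep over all states until the value function stops changing.
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(action_returns(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Greedy policy: the stake maximising expected return for each state;
    # np.argmax breaks ties by picking the first (smallest) stake.
    policy = [int(np.argmax(action_returns(s))) for s in range(1, goal)]
    return V, policy

if __name__ == "__main__":
    V, policy = value_iteration()
    print(policy[:10])
```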

Below are the final value function and the policy for the environment.

There are some things that can be said about the strange-looking policy. When selecting what action to take, the action with the highest expected return is the one chosen; this is done with the max() function. However, when multiple actions have the same expected return, the max() function selects the first of those actions because of how it is implemented. It turns out that one of the actions that often yields the maximum expected return is to stake 0, which also happens to be the first action in the list of actions. In these cases the algorithm stakes 0, which explains the almost noisy-looking policy. According to James Teow, here, the policy would be a lot smoother if the max() function instead chose the last index in the cases where the maximum estimated return is found for multiple actions.
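A small illustration of that tie-breaking effect (my own sketch, not taken from the repository): np.argmax returns the first index among tied maxima, which is often the stake-0 action, while picking the last tied index gives the smoother policy described above.

```python
import numpy as np

# Hypothetical expected returns for stakes 0..3 in some state, with ties.
returns = np.array([0.25, 0.10, 0.25, 0.25])

first_max = int(np.argmax(returns))                          # 0 -> stake 0
last_max = len(returns) - 1 - int(np.argmax(returns[::-1]))  # 3 -> largest tied stake

print(first_max, last_max)  # 0 3
```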
