
Workshop on Q-learning for Pong using C++ & SFML


Introduction

This is an implementation of the Q-learning algorithm for the classic game of Pong.
Only standard C++ libraries are used to implement the Q-learning itself;
no external machine learning libraries are needed, as the approach does not involve a neural network.

Setup

Prerequisites: SFML

$ git clone https://github.com/BouzoulasDimitrios/Qlearning-Pong-Workshop-in-Cpp.git
$ cd Qlearning-Pong-Workshop-in-Cpp/
$ bash launch.sh

Demo

Demo GIF: the model on the left vs. me on the right.

Detailed explanation

Summary

This model was created using a reinforcement learning algorithm called Q-learning.

Q-learning uses game states to describe what is happening in the game.

For example, the starting game state:

(Example game state.)

In any given state of the PONG game, we can describe what is happening by simply listing the coordinates of the game components on the game window.

At the start of the game, we can describe the game as follows:

Right paddle: x = 1000 , y = 228 
Left  paddle: x = 10   , y = 228 
Ball: x = 507, y = 251   

As a result, the whole game can be described at any state by just six numbers.
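For illustration, those six numbers could be collected into a plain struct like this (a sketch only; the struct and field names are illustrative and not taken from the repository's code):

// Illustrative only: the six numbers that fully describe one frame of Pong
struct FullGameState {
    float right_paddle_x, right_paddle_y;
    float left_paddle_x,  left_paddle_y;
    float ball_x,         ball_y;
};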

How decisions are made

In simple terms, Q-learning aims to create a table that gives us the correct move for every game state.

In Pong, our model has only 3 moves: UP, DOWN, or STAY.

As a result, we get a lookup table that looks like this:

|         | UP      | STAY    | DOWN  |
|---------|---------|---------|-------|
| state 0 | correct | wrong   | wrong |
| state 1 | wrong   | correct | wrong |
| state 2 | correct | wrong   | wrong |

But how does the algorithm know which action it should take for any given game state?
This is where training comes in: the algorithm takes random actions and gets rewarded or punished based on the actions it takes.
After training for a while, the lookup table ends up holding large positive values for the correct moves and negative values for the wrong ones.

|         | UP     | STAY   | DOWN   |
|---------|--------|--------|--------|
| state 0 |  0.929 | -0.892 | -0.932 |
| state 1 | -0.988 |  0.922 | -0.894 |
| state 2 |  0.981 | -0.878 | -0.825 |

Now, during the testing phase, the only thing our model has to do is look up the highest number for a given game state and take the corresponding action.
If it was trained correctly, it should be able to play the game.
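As a minimal sketch (not taken from the repository), picking the greedy action is just an argmax over that state's row of the table:

// Greedy action selection: index of the largest value in this state's row
// (requires <algorithm>, <iterator>, <vector>)
int best_action(const std::vector<float> &q_row){
    return std::distance(q_row.begin(), std::max_element(q_row.begin(), q_row.end()));
}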

Q-learning

Goal

Our goal is a model that can play Pong.
Due to the simplicity of the approach, it is difficult to make the model perform fancy moves
or teach it to hit the ball at nice angles.

As a result, the problem is approached with a simpler goal:
"make the model always hit the ball".
The reason for that is simple:
in Pong, if you always hit the ball you cannot lose, so sooner or later you will win.

Calculating game states

This is one of the first problems that comes up:
if we try to enumerate all possible game states, there are simply too many of them.

If we want to enumerate all of the game states, we have the following variables:

Right paddle: x, y
Left  paddle: x, y
Ball: x, y   

Some of these variables are useless, as the paddles do not move along the X axis.

As a result, we are left with the following game variables:

Right paddle: y
Left  paddle: y
Ball: x, y 

Y values can take 512 different values (0 to 511).
X values can take 1024 different values (0 to 1023).

Calculating the game states gives us the following number:

game_states = 512 * 512 * 512 * 1024 => 
game_states = 137438953472 

This leaves us with roughly 137 billion game states, which is an unpleasantly big number.

As a result, we should try to make it smaller.
Let's start with the least relevant part: the opponent's paddle height.
If my goal is to hit the ball, I do not need to know where the opponent's paddle is.

So let's calculate the number of game states again, leaving out the opponent's paddle Y position:

game_states = 512 * 512 * 1024 => 
game_states = 268435456 

Still, being in the hundreds of millions is not optimal, as we need to store 3 values for every game state,
leaving us with nearly 1 billion values (roughly 3 GB if stored as 4-byte floats).

So how do we optimize from here?

We could apply the following approach:
As we aim to always hit the ball and nothing else, the only thing we need to do is stay on the same level as the ball.

We could simply use the ball's Y value and the paddle's Y value.

Now let's calculate the game states again, without the ball's X:

game_states = 512 * 512 => 
game_states = 262144 

This leaves us with a number that is in the hundreds of thousands instead of hundreds of billions, which is a significant improvement.
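The arithmetic above can be checked with a few lines (a standalone sketch, not part of the repository):

#include <cstdint>
#include <iostream>

int main(){
    std::uint64_t all_states   = 512ULL * 512 * 512 * 1024; // both paddle heights + ball (x, y)
    std::uint64_t no_opponent  = 512ULL * 512 * 1024;       // drop the opponent's paddle height
    std::uint64_t heights_only = 512ULL * 512;              // ball height + our paddle height
    std::cout << all_states << " " << no_opponent << " " << heights_only << std::endl;
    // prints: 137438953472 268435456 262144
}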

Now what the model will see is just two heights like this:

Bar plot of paddle and ball.

The result is a model that tries to always match these heights, as in the middle "Same height" plot.
If the model's actions leave the paddle below or above the ball, it gets punished.

Now that we have the values that make up our states, we need a function to calculate the state.

Since we have only 2 variables to play with, we can approach this as a 2D matrix where:

X = Paddle's Y values
Y = Ball's Y values

And as a result, we get a table that looks like this:

States dataframe representation.

From this, we can assign a unique number to any game state, i.e. any combination of Y values.

In order to initialize and populate the matrix, the following can be used:

const int tableSize = 512;

std::vector<std::vector<int>> lookupTable(tableSize, std::vector<int>(tableSize, 0));


// Assign a unique, increasing state id to every (ball_y, paddle_y) combination
void init_state_table(std::vector<std::vector<int>> &lookupTable, int tableSize){

    int value = 0;

    for (int i = 0; i < tableSize; ++i) {
        for (int j = 0; j < tableSize; ++j) {
            lookupTable[i][j] = value;
            value++;
        }
    }

    // Print the lookup table (optional, useful for debugging)
    for (const auto& row : lookupTable) {
        for (int val : row) {
            std::cout << val << " ";
        }
        std::cout << std::endl;
    }

    return;
}
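Because the table is filled row by row, looking up a state id (once init_state_table has run) is equivalent to computing ball_y * tableSize + paddle_y directly; the ball_y and paddle_y values below are illustrative integer heights:

// Illustrative usage: both expressions yield the same state id
int ball_y = 100, paddle_y = 250;                 // example heights
int state_id   = lookupTable[ball_y][paddle_y];   // table lookup
int same_state = ball_y * tableSize + paddle_y;   // 100 * 512 + 250 = 51450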

Now there's only one problem left: if the ball is directly level with the middle of the paddle, how can the model know whether it should go UP or DOWN?

Example of the ball being in the middle of the paddle.

How should a decision be made in this state?
From what we see, the ball could be going up or down, and there is no way to tell.
But we can add one factor that changes everything: the ball's velocity.

Ball vector analysis

The ball's velocity is composed of two values: its speed along the X axis and its speed along the Y axis.

But how should we incorporate these values without creating too many game states?

The solution to this is the following:

1) We only use the ball's speed on the Y axis, as we primarily care about its height.
2) We convert that value from a float into a binary state:
    
    if( Y ball speed > 0 )
        select state for positive Y values
    else
        select state for negative Y values

As a result, we end up doubling our number of game states, going from 262144 to 524288,
which is still a manageable number of states.

Now we are ready to create a look-up table:

std::size_t range = 262144;
const int NUMBER_OF_ACTIONS = 3;

// 2*range rows (one per game state, covering both ball directions), NUMBER_OF_ACTIONS columns (one per action)
std::vector<std::vector<float>> Qtable = std::vector<std::vector<float>>(2*range, std::vector<float>(NUMBER_OF_ACTIONS, 0.0f));

As a result we get the following table:

   | 0 1 2
------------- 
1  | 0 0 0 
2  | 0 0 0 
3  | 0 0 0 
4  | 0 0 0 
5  | 0 0 0 
6  | 0 0 0 
7  | 0 0 0 
8  | 0 0 0 
9  | 0 0 0 
10 | 0 0 0 
11 | 0 0 0 
12 | 0 0 0 
13 | 0 0 0 
14 | 0 0 0 

where the numbers on the Y axis represent the game states and the numbers on the X axis the actions:

0 = Down
1 = Stay
2 = Up
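To keep those indices readable in code, they could be given names (a small sketch; the repository itself may simply use the raw integers):

// Action column indices in the Q-table
enum Action { ACTION_DOWN = 0, ACTION_STAY = 1, ACTION_UP = 2 };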

How does Q-learning work?

Q-learning works using the following steps:

1) Exploration: the model performs random actions and adjusts the values on the lookup table based on the
rewards or punishments it receives.

2) Exploitation: After the exploration stage, we enter the exploitation stage. The model keeps on learning;
the difference now is that its moves use the information gained during the exploration stage,
meaning it no longer performs random movements but selects the ones it perceives to be correct.

3) After spending time in the exploitation stage, our model should be able to select the correct action for any given game state and perform it. In our case, that means having the same Y value as the ball at all times.

Let's start with the exploration part as it comes first:

float exploration_rate = 1;
float min_exploration_rate = 0.01;
float max_exploration_rate = 1; 
float exploration_decay_rate = 0.01;

exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * exp(-exploration_decay_rate * (r_points + 1000));

The exploration rate starts out at 1 and keeps getting smaller as the game goes on,
until it reaches the minimum exploration rate that we've allowed for.

Updating the look-up table

  • In every game loop an action is chosen
  • We read the reward/punishment and the next state of the game
  • The lookup table values are updated:

The update happens using the standard Q-learning update rule:

Q(s, a) ← (1 − α) · Q(s, a) + α · ( r + γ · max_a′ Q(s′, a′) )

where s and a are the current state and action, s′ is the next state, r is the reward, α is the learning rate, and γ is the discount rate.

In code:

void update_qtable(std::vector<std::vector<float>> &Qtable, int game_state_end, int game_state_start, float learning_rate, float discount_rate, float reward, int action){

    // Highest Q-value reachable from the next state (the bootstrap estimate)
    float max_next_q = *std::max_element(Qtable[game_state_end].begin(), Qtable[game_state_end].end());

    // Blend the old value with the new estimate: Q(s,a) = (1-α)·Q(s,a) + α·(r + γ·max Q(s',·))
    Qtable[game_state_start][action] = Qtable[game_state_start][action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * max_next_q);

}
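As a quick sanity check of the rule, one can reward a single state-action pair repeatedly and watch its value converge (the 0.1 learning rate and 0.9 discount rate below are illustrative values, not necessarily the ones used in the repository):

// Reward action 0 in state 0 fifty times; its Q-value climbs toward +1
std::vector<std::vector<float>> tinyQ(2, std::vector<float>(3, 0.0f));

for (int step = 0; step < 50; ++step) {
    update_qtable(tinyQ, /*game_state_end=*/1, /*game_state_start=*/0,
                  /*learning_rate=*/0.1f, /*discount_rate=*/0.9f,
                  /*reward=*/1.0f, /*action=*/0);
}
std::cout << tinyQ[0][0] << std::endl; // roughly 0.99 after 50 positive updates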

Connecting it all

Now to put all of these pieces together:

First we need to initialize the state table and the look-up table.

std::vector<std::vector<float>> Qtable = std::vector<std::vector<float>>(2*range, std::vector<float>(NUMBER_OF_ACTIONS, 0.0f));

std::vector<std::vector<int>> lookupTable(tableSize, std::vector<int>(tableSize, 0));

init_state_table(lookupTable, tableSize);

Now that we have these, we can read in the game state. We create a struct that represents our game state
and retrieve the values we need from the game:

struct State {
  float paddle1, paddle2; // paddle Y positions
  float ball_x, ball_y;   // ball position
};

pos = pg.ball.getPosition();                  // current ball position
l_paddle_pos = pg.left_paddle.getPosition();  // the model's (left) paddle position
pos.x += velX, pos.y += velY;                 // project the ball one step ahead using its velocity

// starting state
start_state.ball_x = pos.x;
start_state.ball_y = abs(pos.y);
start_state.paddle1 = l_paddle_pos.y;

Now we have a representation of the game, and we need to calculate the game state:

int state_calc(std::vector<std::vector<int>> &lookupTable, struct State &state, float velocity, std::size_t range){

    // Table indices must be integers; the positions arrive as floats
    int ball_y   = static_cast<int>(state.ball_y);
    int paddle_y = static_cast<int>(state.paddle1);

    if(velocity > 0){
        // Positive Y velocity: offset the state id into the upper half of the range
        return range + lookupTable[ball_y][paddle_y];
    }else{
        // Non-positive Y velocity: mirror the state id into the lower half of the range
        return range - lookupTable[ball_y][paddle_y];
    }

}

game_state_start = state_calc(lookupTable,  start_state, velY, range);

Now that we have the state of the game, we need to select an action:

int select_action(int game_state, std::vector<std::vector<float>> &Qtable, float exploration_rate){

    // Roll a random number in [0, 1] and compare it against the exploration rate
    float roll = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);

    if(roll > exploration_rate){
        // Exploit: pick the action with the highest Q-value for this state
        auto best = std::max_element(Qtable[game_state].begin(), Qtable[game_state].end());
        return std::distance(Qtable[game_state].begin(), best);
    }

    // Explore: pick one of the three actions at random
    int action = rand() % 3;
    return action;

}

This function factors in the exploration rate: for the first part of training, the model behaves randomly and thus 'explores' the environment.
Later, when the exploration rate is low, it starts selecting actions based on what it has learned and continues from there.

Now we also need to calculate what would be the correct action:

int correct_move_calculator(State state1, float vel){

    if(vel >= 0){
        if(state1.paddle1 + 28 > state1.ball_y)
            return 1;
        else if(state1.paddle1 + 28 < state1.ball_y)
            return 2;
    }else{
        if(state1.paddle1 + 28 > state1.ball_y)
            return 0;
        else if(state1.paddle1 + 28 < state1.ball_y)
            return 1;
    }

    return 0; // avoid compiler warning
}

After we select our action and the game loop iteration is over, we compare it to the correct action:

    reward = (action == correct_move) ? 1 : -1;

If the action matches the correct one, the model is rewarded; otherwise it is punished.

After every game loop, we also update the lookup table values based on the reward/punishment:

update_qtable(Qtable, game_state_end, game_state_start, learning_rate, discount_rate, reward, action);
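Putting it all together, one training iteration looks roughly like this (a sketch only: the game-loop plumbing is omitted, and end_state is assumed to be a State struct filled the same way as start_state after the frame has been played):

// One training iteration (sketch): observe, act, observe again, learn
game_state_start = state_calc(lookupTable, start_state, velY, range);          // state before acting
action           = select_action(game_state_start, Qtable, exploration_rate);  // explore or exploit
// ... apply the chosen action to the left paddle and advance the game one frame ...
game_state_end   = state_calc(lookupTable, end_state, velY, range);            // state after acting
correct_move     = correct_move_calculator(start_state, velY);                 // reference action
reward           = (action == correct_move) ? 1 : -1;                          // reward or punish
update_qtable(Qtable, game_state_end, game_state_start, learning_rate, discount_rate, reward, action);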

After a certain number of loops, the model reaches a point where it can play the game on its own, always performing the correct move and staying level with the ball at all times.

With the Q-table values that I have currently saved, the model should be impossible to beat, but feel free to try.

Conclusions

A reinforcement learning model can be trained to play Pong with only 3 values from the game: the paddle's height, the ball's height, and the ball's speed along the Y axis.

All of the data needed can be retrieved from the game using the functions provided. Feel free to explore different ways to train the model in the least amount of time possible, and with the smallest number of game states possible.

Resources:

medium.com

video series