recotut committed Oct 24, 2021
1 parent 2d0282c commit be1a2db
Showing 8 changed files with 9 additions and 1 deletion.
1 change: 1 addition & 0 deletions docs/T256744_Real_Time_Bidding_in_Advertising.ipynb

{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"T532530 | Predicting rewards with the state-value and action-value function","provenance":[],"collapsed_sections":[],"toc_visible":true,"authorship_tag":"ABX9TyMxvE2iYamvfXD+fqlU2yGW"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","metadata":{"id":"3UFLh0qq8AAs"},"source":["# Predicting rewards with the state-value and action-value function"]},{"cell_type":"markdown","metadata":{"id":"eNUJVfOs8anT"},"source":["## Setup"]},{"cell_type":"code","metadata":{"id":"riJzjMYl8bpI"},"source":["!pip install -q numpy==1.19.2"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"M7cAHYFO8dYW"},"source":["import numpy as np"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Dl_lis7o8RJE"},"source":["## The environment - a simple grid world"]},{"cell_type":"markdown","metadata":{"id":"m5KD8ygN-I-_"},"source":["To demonstrate, let me use a rediculously simple grid-based environment. This consists of 5 squares, with a cliff on the left-hand side and a goal position on the right. Both are terminating states."]},{"cell_type":"code","metadata":{"id":"lv_jA3lj-JVI"},"source":["starting_position = 1 # The starting position\n","cliff_position = 0 # The cliff position\n","end_position = 5 # The terminating state position\n","reward_goal_state = 5 # Reward for reaching goal\n","reward_cliff = 0 # Reward for falling off cliff\n","\n","def reward(current_position) -> int:\n"," if current_position <= cliff_position:\n"," return reward_cliff\n"," if current_position >= end_position:\n"," return reward_goal_state\n"," return 0\n","\n","def is_terminating(current_position) -> bool:\n"," if current_position <= cliff_position:\n"," return True\n"," if current_position >= end_position:\n"," return True\n"," return False"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"_XiXx656-PQa"},"source":["## The Agent\n","In this simple environment, let us define an agent with a simple random strategy. On every step, the agent randomly decides to go left or right."]},{"cell_type":"code","metadata":{"id":"sTE6A2sT-Qbl"},"source":["def strategy() -> int:\n"," if np.random.random() >= 0.5:\n"," return 1 # Right\n"," else:\n"," return -1 # Left"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Nr6g4kix8FrC"},"source":["## State-value function"]},{"cell_type":"markdown","metadata":{"id":"gmxivI4C8I7B"},"source":["The state-value function is a view of the expected return with respect to each state.\n","\n","$V_{\\pi}(s) \\doteq \\mathbb{E}_{\\pi}[ G \\vert s] = \\mathbb{E}_{\\pi}\\bigg[ \\sum^{T}_{k=0} \\gamma^k r_{k} \\vert s \\bigg]$\n","\n","You could estimate the expectation in a few ways, but the simplest is to simply average over all of the observed rewards. To investigate how this equation works, you can perform the calculation on a simple environment that is easy to validate."]},{"cell_type":"markdown","metadata":{"id":"9qC3wJ_d-RkT"},"source":["### Experiment\n","Let’s iterate thousands of times and record what happens.\n","\n","The key to understanding this algorithm is to truly understand that we want to know the return, from a state, on average. Say that out loud. The return, from a state, on average.\n","\n","You’re not rewarding on every step. You’re only rewarding when the agent reaches a terminal state. 
### Experiment

Let's iterate thousands of times and record what happens.

The key to understanding this algorithm is to truly internalize that we want to know the return, from a state, on average. Say that out loud: the return, from a state, on average.

The agent is not rewarded on every step; it is only rewarded when it reaches a terminal state. But when it is in the middle of this environment, for example, there is a 50/50 chance of ending up at the goal and an equal chance of ending up off the cliff. So in this instance, the expected value of that state is halfway between the maximum reward, 5, and the minimum reward, 0.

Note that in this implementation 0 and 5 are terminating states and only 1-4 are valid states, so with four states the mid-point actually falls in between states. This will become clear when you inspect the values later.

To implement this, you need to keep track of which states have been visited and of the eventual, final reward. So the implementation below uses a simple buffer to log positions.

```python
np.random.seed(42)

# Global buffers to perform averaging later
value_sum = np.zeros(end_position + 1)
n_hits = np.zeros(end_position + 1)

n_iter = 10
for i in range(n_iter):
    position_history = []                 # A log of positions in this episode
    current_position = starting_position  # Reset
    while True:
        # Append position to log
        position_history.append(current_position)

        if is_terminating(current_position):
            break

        # Update current position according to strategy
        current_position += strategy()

    # Now the episode has finished, what was the reward?
    current_reward = reward(current_position)

    # Now add the reward to the buffers that allow you to calculate the average
    for pos in position_history:
        value_sum[pos] += current_reward
        n_hits[pos] += 1

    # Now calculate the running average and print
    expected_return = ', '.join(f'{q:.2f}' for q in value_sum / n_hits)
    print("[{}] Average reward: [{}]".format(i, expected_return))
```

```text
[0] Average reward: [0.00, 0.00, nan, nan, nan, nan]
[1] Average reward: [0.00, 3.33, 5.00, 5.00, 5.00, 5.00]
[2] Average reward: [0.00, 2.50, 5.00, 5.00, 5.00, 5.00]
[3] Average reward: [0.00, 2.00, 5.00, 5.00, 5.00, 5.00]
[4] Average reward: [0.00, 1.67, 5.00, 5.00, 5.00, 5.00]
[5] Average reward: [0.00, 1.43, 5.00, 5.00, 5.00, 5.00]
[6] Average reward: [0.00, 1.11, 3.75, 5.00, 5.00, 5.00]
[7] Average reward: [0.00, 0.91, 3.00, 5.00, 5.00, 5.00]
[8] Average reward: [0.00, 0.83, 3.00, 5.00, 5.00, 5.00]
[9] Average reward: [0.00, 0.77, 3.00, 5.00, 5.00, 5.00]
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:30: RuntimeWarning: invalid value encountered in true_divide
```

I've capped the number of episodes at 10 so you can see the evolution of the value estimates, but I encourage you to run this yourself and change it to 10,000.

Note that I have chosen a random seed that stumbles onto the goal on the second episode. In general you would expect it to reach the goal only about one episode in five. Try changing the seed to see what happens.

You can see that with each episode the value estimates get closer and closer to the true values (which should be integers from 0 to 5). For example, in the state next to the goal (the box next to the end) you would expect the agent to stumble towards the goal more often than not. Indeed, 4 out of 5 times it does, which means that the average return is 5 (the goal reward) multiplied by 4/5.
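As a sanity check that is not part of the original notebook, the true values of this random walk can be derived analytically. With no discounting, the chance of reaching the goal before the cliff from position $s$ is the classic fair gambler's-ruin probability, so

$$\Pr(\text{reach goal} \mid s) = \frac{s}{5}, \qquad V_{\pi}(s) = 5 \cdot \frac{s}{5} = s \quad \text{for } s \in \{1, 2, 3, 4\},$$

and with enough episodes the estimates for positions 1-4 should settle near 1, 2, 3 and 4.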
### Discussion

It's worth going through this code line by line; it truly is fundamental. This algorithm allows you to estimate the value of being in each state, purely by experiencing those states.

The key is to remember that the goal is to predict the value FROM each state. The goal is always to reach a point where you are maximizing rewards, so your agent needs to know how far away from optimal it is. This distinction can be tricky to get your head around, but once you have it, it's hard to think any other way.

You can even use this in your life. Imagine you wanted to achieve some goal. All you have to do is predict the expected return from being in each new state. For example, say you wanted to get into reinforcement learning. You could go back to university, read the books, or go and watch TV. Each of these has value, but with different costs and lengths of time. The expected return of watching TV is probably very low. The expected return of reading the books is high, but doesn't guarantee a job. Going back to university still doesn't guarantee a job; it might make it easier to get past HR, but it takes years to complete. Making decisions in this way is known as using the expected value framework and is useful throughout business and life.

## Action-value function

The action-value function is a view of the expected return with respect to a given state and action choice. The action represents an extra dimension over and above the state-value function. The premise is the same, but this time you need to iterate over all actions as well as all states. The equation is also similar, with the extra addition of an action, $a$:

$ Q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[ G \vert s, a ] = \mathbb{E}_{\pi}\bigg[ \sum^{T}_{k=0} \gamma^k r_{k} \vert s, a \bigg] $
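Although the notebook does not state it explicitly, the two functions are directly related: the state value is the policy-weighted average of the action values. For the 50/50 random strategy used here,

$$V_{\pi}(s) = \sum_{a} \pi(a \vert s)\, Q_{\pi}(s, a) = \tfrac{1}{2}\Big(Q_{\pi}(s, \text{left}) + Q_{\pi}(s, \text{right})\Big),$$

so the action-value view carries strictly more information: averaging over the actions recovers the state value.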
### Experiment

First off, there is far more exploration to do, because we are iterating not only over states but also over actions, so you'll need to run this for longer before it converges. We also have to store both the states and the actions in the buffer.

```python
np.random.seed(42)

# Global buffers to perform averaging later
# The second dimension is the actions
value_sum = np.zeros((end_position + 1, 2))
n_hits = np.zeros((end_position + 1, 2))

# A helper function to map the actions to valid buffer indices
def action_value_mapping(x): return 0 if x == -1 else 1

n_iter = 10
for i in range(n_iter):
    position_history = []                 # A log of (position, action) pairs in this episode
    current_position = starting_position  # Reset
    current_action = strategy()           # Sample one action; it is logged alongside each visited position
    while True:
        # Append position and action to log
        position_history.append((current_position, current_action))

        if is_terminating(current_position):
            break

        # Update current position according to strategy
        current_position += strategy()

    # Now the episode has finished, what was the reward?
    current_reward = reward(current_position)

    # Now add the reward to the buffers that allow you to calculate the average
    for pos, act in position_history:
        value_sum[pos, action_value_mapping(act)] += current_reward
        n_hits[pos, action_value_mapping(act)] += 1

    # Now calculate the running averages for both actions and print
    expect_return_0 = ', '.join(
        f'{q:.2f}' for q in value_sum[:, 0] / n_hits[:, 0])
    expect_return_1 = ', '.join(
        f'{q:.2f}' for q in value_sum[:, 1] / n_hits[:, 1])
    print("[{}] Average reward: [{} ; {}]".format(
        i, expect_return_0, expect_return_1))
```

```text
[0] Average reward: [nan, 5.00, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
[1] Average reward: [0.00, 3.33, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
[2] Average reward: [0.00, 2.50, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
[3] Average reward: [0.00, 2.50, 5.00, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
[4] Average reward: [0.00, 1.67, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
[5] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
[6] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
[7] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, 0.00, nan, nan, nan]
[8] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, 0.00, 0.00, nan, nan]
[9] Average reward: [0.00, 1.25, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, 0.00, 0.00, nan, nan]
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:37: RuntimeWarning: invalid value encountered in true_divide
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:39: RuntimeWarning: invalid value encountered in true_divide
```
### Discussion

I've capped the number of episodes at 10 again; I encourage you to run this yourself and change it to 10,000.

You can see that the results are similar, except that one of the actions (the action heading towards the cliff) is always zero, as you might expect.

So what's the point of this if the result is basically the same? The key is that enumerating the actions simplifies later algorithms. With the state-value function your agent has to figure out how to get to better states in order to maximize the expected return. However, if you have the action values at hand, you can simply pick the next best action!
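To make that last point concrete, here is a minimal sketch, not part of the original notebook, of turning the buffers from the action-value experiment above into a greedy action choice. It assumes `value_sum` and `n_hits` from that experiment are still in scope, and the helper name `greedy_action` is purely illustrative.

```python
def greedy_action(position: int) -> int:
    """Pick the step (-1 for left, +1 for right) with the highest estimated action value."""
    # Average return per action at this position; unvisited actions (n_hits == 0)
    # are treated as 0 rather than dividing by zero.
    q_estimates = np.divide(
        value_sum[position],
        n_hits[position],
        out=np.zeros_like(value_sum[position]),
        where=n_hits[position] > 0,
    )
    best_index = int(np.argmax(q_estimates))  # Index 0 or 1, as in action_value_mapping
    return -1 if best_index == 0 else 1

# Example: the greedy move from the starting position under the current estimates
print(greedy_action(starting_position))
```

A policy built this way is only as good as the estimates behind it, which is why the exploration and episode counts discussed above matter.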

Binary file added docs/_images/T729495_1.png
Binary file added docs/_images/T729495_2.png
Binary file added docs/_images/T798984_1.png
6 changes: 5 additions & 1 deletion docs/_toc.yml
@@ -35,4 +35,8 @@ parts:
- file: T219174_Recsim_Catalyst
- file: T079222_Solving_Multi_armed_Bandit_Problems
- file: T734685_Deep_Reinforcement_Learning_in_Large_Discrete_Action_Spaces
- file: T257798_Off_Policy_Learning_in_Two_stage_Recommender_Systems
- file: T257798_Off_Policy_Learning_in_Two_stage_Recommender_Systems
- file: T798984_Comparing_Simple_Exploration_Techniques:_ε_Greedy,_Annealing,_and_UCB
- file: T532530_Predicting_rewards_with_the_state_value_and_action_value_function
- file: T256744_Real_Time_Bidding_in_Advertising
- file: T729495_GAN_User_Model_for_RL_based_Recommendation_System
