\documentclass[scrartcl, 10.5 pt, conference]{ieeeconf}
\IEEEoverridecommandlockouts
\overrideIEEEmargins
\usepackage{bm}
\usepackage{bbm}
\usepackage{algorithm,algpseudocode}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{amsmath}
\newcommand{\B}[1]{\mathbf{#1}}
\title{\LARGE \bf
Dynamic actions in stochastic environments \\
\large \& application with neural network approximations-based reinforcement learning
}
\author{Thomas Vicente$^{1}$% <-this % stops a space
\thanks{*Candidate for the Master in Data Science at the Barcelona Graduate School of Economics.}% <-this % stops a space
}
\begin{document}
\maketitle
\thispagestyle{empty}
\pagestyle{empty}
\begin{abstract}
I propose an algorithm allowing an artificial entity to optimize its sequence of actions in a stochastic environment. It notably includes a signal function that should allow more flexible implementations than current approximation-based reinforcement learning algorithms. In the fifth section, I present an implementation of the algorithm: a tic-tac-toe robot player that recognizes underlying games from photographs thanks to pattern recognition via a convolutional neural network.
\end{abstract}
\section*{Acknowledgment}
I would like to thank my advisor, Professor Gábor Lugosi, for his help and for making himself available. I would also like to thank Gergely Neu, post-doctoral fellow at the Artificial Intelligence Research Group from the Pompeu Fabra University, for sharing his knowledge.
\section{Introduction}
A complete exercise would be to build an autonomous entity that is progressively able to take long-term-oriented decisions based on information coming from real-world sensors. This entity needs two abilities: learning from its own actions and recognizing useful patterns in the current state. The first ability typically corresponds to reinforcement learning (RL) algorithms. The second ability requires a regressor or a classifier function that is able to handle the complexity of the stochastic states' patterns, provided that such patterns exist.
\subsection{Learning by action}
RL stands apart from other Machine Learning techniques in that the training data is generated by the actions of the learning entity itself. Thus, the \textit{learner} uses its history of actions to evaluate what the best next move is. For this, it needs to have repeatedly faced identical or similar situations. Such a setting has its roots in the \textit{n-armed bandit problem}, where after each choice a numerical reward is received, drawn from a stationary probability distribution that depends on the selected action. The learner records which actions led to a maximal gain. In most cases the goal is to maximize a long-term gain by taking a correct course of actions.
An RL learner is thus particularly dependent on a state that retains all relevant information, a \textit{Markov} state. For example, a checkers position - the current configuration of all the pieces on the board - would serve as a Markov state because it summarizes everything important about the complete sequence of positions that led to it. Much of the information about the sequence is lost, but all that really matters for the future of the game is retained.
A reinforcement learning task that satisfies the Markov property is called a Markov Decision Process (MDP). If the state and action spaces are finite, the MDP framework notably formalizes the probability that a state occurs given a previous state and an action, and the probability of a reward after the new state has occurred.
\subsection{Pattern recognition}
Equipping the learner with an approximation function can enable it to evaluate an infinite set of states but also a finite set of underlying states having an infinite set of representations. In Figure~\ref{fig:surf}, I illustrate the fact that humans are able to recognize the same underlying situation and take the correct course of actions.
\begin{figure}
\begin{center}
\includegraphics[scale=.35]{"surf"}
\caption{Waves are a stochastic process but, with training, we are able to act consistently for a new wave}
\label{fig:surf}
\end{center}
\end{figure}
The \textit{imperfect representations} case is the one that interests us here. We would thus like our learner to overcome the imperfect information it receives and learn how to behave consistently during a whole episode in order to maximize its gain. This requires regression or classification functions that can efficiently process the patterns present in the different states.
In Section V, a modified version of the classical tic-tac-toe robot will provide a good example of this principle. It requires a regression function that can take particular advantage of the structure of image data.
\section{Literature review}
The proposed algorithm shares similarities with the concepts presented in this section. I am also going to show that, to my knowledge, it is a novel approach in approximation-based RL; this is notably due to the algorithm's minimal requirements for the monitoring of the states' values.
A classical RL technique that is close to the proposed algorithm is the State-Action-Reward-State-Action (SARSA) algorithm. It uses approximations to update the value of a whole set of actions called a \textit{policy}. SARSA is well suited when the value of each action can be monitored. It notably updates the parameters of the approximation function after each move thanks to the change in value from $t-1$ to $t$, known as a temporal difference. The update includes a gradient term, computed from the difference between the approximated state value and the real state value. The proposed algorithm is fundamentally close to SARSA, but it does not suppose the learning of a whole policy.
In that sense, the recent Deep Q-Networks (DQN) are closer to what is proposed. This \textit{off-policy} algorithm uses neural network approximations of the states' values to decide on an action, and as the learner makes better moves, the neural network is fed with more useful information. Thanks to its \textit{experience replay} feature, random mini-batches from the replay memory are used to train the network instead of only the most recent transition. Such an efficient implementation makes the DQN very adaptive to different tasks, as demonstrated by the learning of multiple arcade games with the same network architecture. I also implemented such a memory device for the application of the proposed algorithm. But I aim for a more flexible approach than this algorithm, which supposes the constant monitoring of the states' values.
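For concreteness, below is a minimal sketch of such a replay memory. It only illustrates the idea of drawing random mini-batches from stored transitions; the names and the default capacity are assumptions, and it does not reproduce the DQN implementation.
\begin{verbatim}
# Hedged sketch of an experience-replay memory in the
# spirit of DQN: transitions are stored and training
# samples random mini-batches from them rather than
# using only the most recent transition.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        # oldest transitions are dropped first
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward,
                            next_state))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the correlation
        # between consecutive moves.
        return random.sample(list(self.buffer),
                             min(batch_size,
                                 len(self.buffer)))
\end{verbatim}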
This is done thanks to a mechanism related to the True Online Temporal-Difference Learning of Seijen et al. The authors introduce a formal way to update the values backwards, for each state of an episode, once the latter ends. This is done in a discounted way, so that states farther from the end of an episode get a value of smaller magnitude. In the proposed algorithm, however, as mentioned earlier, we want to avoid a permanent control of the states' values during the learning phase. It thus differs from the mentioned paper in that we cannot update our parameters via traces. Instead, the improving estimation of new states' values is done exclusively via the approximation function.
The proposed algorithm hopefully offers better flexibility than those from the previous literature. Removing the assumption that the states' values are permanently monitored during the training phase is made possible by the implementation of a \textit{signal function} coupled with a \textit{discount function}. The role of the first function is to detect an objective signal thanks to a deterministic function or a well-tuned sensor. When triggered, it assigns a gain- or loss-related value to the final state of the episode. The second function then assigns values to each state of the episode. This mechanism is meant to allow a practical use of the algorithm when evaluating non-idealized states - thus not only video game screenshots - involving stochastic perturbations, recorded for instance via a visual sensor.
\section{Algorithm}
\subsection{Description}
We can summarize our setting as an MDP having the form
$$\langle \mathcal{X},\Omega, P(\B{x_{t+1}} \mid \B{x_c}), \B{y},s(\B{x_c}),d(\gamma, \Delta, \delta, s(\B{x_c})) \rangle$$
where $\mathcal{X}$ is the set of finite underlying states, $\Omega$ is the finite set of possible actions, $P(\B{x_{t+1}} \mid \B{x_c})$ is the probability that the process moves to a new state $\B{x_{t+1}}$ given the current state and the decision of a particular action, $\B{y}$ is the training target vector of state values, $s()$ is the signal function and $d()$ is the discount function.
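A possible in-code view of these components is sketched below; every name is an illustrative assumption, and states are represented as feature vectors in ${\rm I\!R^m}$.
\begin{verbatim}
# Illustrative type signatures for the tuple above.
# All names are assumptions; states live in R^m.
from typing import Callable, List
import numpy as np

State = np.ndarray
# Omega(x_t): simulated next states
PossibleActions = Callable[[State], List[State]]
# s(x_c): 0 while off, gain/loss value otherwise
SignalFunction = Callable[[State], float]
# d(gamma, Delta, delta, s(x_c))
DiscountFunction = Callable[
    [float, int, int, float], float]
\end{verbatim}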
I am going to describe chronologically what happens during an episode of the algorithm in action. Note that all the objects containing $X$ or $x$ in their names are $m$-dimensional objects, where $m$ is the number of features representing the considered states.
The learner observes a new stochastic state $\B{x_t}(p_t)$ - note that this stochastic state can be affected by previous moves from the learner. It then simulates actions within the finite set of possible actions $\Omega(\B{x_t})$. $\Omega(\B{x_t})$ can be set deterministically. For instance, in the context of a game, there is a well determined set of actions that can be tuned in advance. In the case of a mobile robot physically exploring the real 3D world, the set of possible actions has to be restricted somehow, possibly by the time allowed to go in a particular direction.
All the simulated actions are stored in a temporary matrix $\B{X'}$. Each observation in this matrix is evaluated by the approximation function $f(\B{w},\B{X'})$. This function can be a regressor or a classifier whose weights $\B{w}$ are randomly initialized before the first episode. It approximates each simulated state's value. Among the simulated states, the chosen state $\B{x_c}$ will be, with probability $1-\epsilon$, the one with the highest approximated value, and with probability $\epsilon$, a state chosen at random from $\B{X'}$. In both cases, we append the chosen state to the training set $\B{X}$.
The learner possesses a signal function $s(\B{x_c})$. It returns different values when a particular signal, a function of the state generated by the learner's last move, is triggered. Its value is positive if a \textit{gain} is sensed, and negative if a \textit{loss} is sensed. The signal function should return at least one possible value; a typical example for $\{gain; loss\}$ would be respectively $\{1; -1\}$. While this signal is off, the learner counts the number of actions since the signal was last triggered. When the signal is triggered, the current episode is over. Each action of the episode then receives a value given by the discount function $d(\gamma, \Delta, \delta, s(\B{x_c}))$, where $\gamma$ is a discount rate, $\Delta$ is the number of actions in the episode and $\delta$ is the index of the action in the episode - the last action in the episode thus has $\delta=\Delta$. The discount function should ensure that actions closer in time to the triggered signal get values closer to $s(\B{x_c})$. The values of the episode are then appended to the training target vector $\B{y}$.
We then obtain a training sample to which we apply our approximation function $f(\B{w}, \B{X})$, whose aim is to minimize the loss $l(f(\B{w}, \B{X}),\B{y})$. From this process, we retain the set of parameters corresponding to the minimal loss to update $\B{w}$. The updated $\B{w}$ is then used in the next episode to predict the value of simulated actions.
Below is presented a formal version of the proposed algorithm.
\begin{algorithm}
\begin{algorithmic}[1]
\vspace{3mm}
\While{the learning process is on}
\State record $\B{x_t}(p_t)$
\State $\B{X'} :=\B{X'_{\o}}$ empty $m$-columns matrix
\For{$\B{x'}$ in $\Omega(\B{x_t})$}
\State append $\B{x'}$ to $\B{X'}$
\EndFor
\State with probability $1-\epsilon$
\State \hspace{0.5cm} $\B{x_c}:=\arg\max_{\B{x'} \in \B{X'}} f(\B{w},\B{x'})$
\State with probability $\epsilon$
\State \hspace{0.5cm} $\B{x_c}$ := random sample $\B{x'} \in \B{X'}$
\State append $\B{x_c}$ to $\B{X}$
\If{$s(\B{x_c}) \neq 0$}
\For{$\delta$ in $1:\Delta$}
\State $v$ = $d(\gamma, \Delta, \delta, s(\B{x_c}))$
\State append $v$ to $\B{y}$
\EndFor
\State $\B{w} := \arg\min_{\B{w}}(l(f(\B{w},\B{X}),\B{y}))$
\State $\Delta := 0$
\Else
\State $\Delta := \Delta + 1$
\EndIf
\EndWhile
\end{algorithmic}
\caption{}
\label{alg:algo}
\vspace{3mm}
\begin{small} $m$: number of features, $\B{X}=\B{X_{\o}}$: empty $m$-column matrix, $\B{y}=\B{y_{\o}}$: empty target vector, $\B{w}=\B{w_0}$: random weights for the approximation function, $\gamma \in [0,1]$: discount rate, $\Delta = 0$: number of actions, $f()$: approximation function, $p_t(\B{x_c})$: stochastic process, function of the last existing chosen state, $\B{X'}$: $m$-column simulated states matrix, $\B{x_t}(p_t) \in {\rm I\!R^m}$: current state vector, $\Omega(\B{x_t})$: set of possible actions, $\epsilon$: randomness rate, $s(\B{x_c}) \in {\rm I\!R}$: signal value potentially triggered by the modified state \end{small}
\end{algorithm}
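To make the pseudocode concrete, below is a minimal Python sketch of Algorithm~\ref{alg:algo}. The environment hooks (\texttt{observe\_state}, \texttt{possible\_actions}, \texttt{signal}, \texttt{discount}) and the \texttt{fit}/\texttt{predict} interface of the model are assumptions made for illustration; they are not prescribed by the algorithm itself.
\begin{verbatim}
# Minimal sketch of Algorithm 1. The environment hooks
# and the scikit-learn-style model are assumptions.
import numpy as np

def run(model, observe_state, possible_actions, signal,
        discount, gamma=0.3, epsilon=0.1,
        n_episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []    # training features and targets
    delta_count = 0  # Delta: actions since last signal
    fitted = False
    episodes = 0
    while episodes < n_episodes:
        x_t = observe_state()         # stochastic state
        cand = possible_actions(x_t)  # Omega(x_t)
        if not fitted or rng.random() < epsilon:
            x_c = cand[rng.integers(len(cand))]
        else:
            vals = model.predict(np.stack(cand))
            x_c = cand[int(np.argmax(vals))]
        X.append(x_c)
        delta_count += 1
        s = signal(x_c)
        if s != 0:   # episode over: discount backwards
            for delta in range(1, delta_count + 1):
                y.append(discount(gamma, delta_count,
                                  delta, s))
            # w := argmin of the loss on the history
            model.fit(np.stack(X), np.array(y))
            fitted = True
            delta_count = 0
            episodes += 1
    return model
\end{verbatim}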
\subsection{Remarks}
\subsubsection{Approximation function}
One interesting aspect is the role of the regression function. Contrary to typical applications, the function's task is ultimately not to predict an action's value with maximum accuracy, even though accuracy is often related to our objective, but rather to give the ``real value'' of an action. Imagine for instance a game where an intermediary state can lead either to a gain or to a loss. With enough iterations, such states should receive a neutral, non-biased value from the function.
It is crucial to choose the approximation function based on the task at hand. What matters here is the type of patterns that we wish the learner to discover and their complexity, which is often proportional to the dimension $m$ of the states. If the states are approximately linearly related to the target value, a linear approximation function can be applied. But if the states are strongly affected by stochastic processes, other functions should be considered.
Typically in this paper, we consider real-world representations as states. In other words, our input could be images, sounds or texts. In that case, we might consider using neural networks, which are often celebrated as a particularly accurate family of algorithms for such data. Using neural networks, we can see our learner as imitating how humans would act when facing specific tasks. The challenge is then to recognize equivalent real-world states thanks to pattern recognition. Considering the algorithm described above, $\B{w}$ would in that case be the set of weights of the neural network's hidden layers.
\subsubsection{Signal function and discounted values}
The signal function makes it possible to reduce the assumptions on the basic abilities of the learner. While most approximation-based reinforcement learning algorithms suppose the permanent presence of a value-monitoring function, here we only assume the presence of discrete, time-independent signals that can take one or more values.
The use of discounted values enables each state to get a value at the end of each episode. In that process, tuning the $\gamma$ parameter correctly is important. In Figure~\ref{fig:discounts}, we represent a set of actions that leads to a gain - $s(\B{x_c}) = 1$ here - and the discount function that we apply is $\gamma^{\Delta-\delta}s(\B{x_c})$.
\begin{figure}
\begin{center}
\includegraphics[scale=.4]{"discounts"}
\caption{Different discount rates affecting intermediary states}
\label{fig:discounts}
\end{center}
\end{figure}
In the first case $\gamma=0.7$, in the second case $\gamma=0.3$. We see that with a larger discount factor, the final result of the episode weighs more heavily on the values of all the actions, including the early ones. If relevant, $\gamma$ should be set considering that the same intermediary action can lead to a gain in one episode and to a loss in another.
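As a small numerical illustration of Figure~\ref{fig:discounts}, the snippet below computes the values $\gamma^{\Delta-\delta}s(\B{x_c})$ assigned to a four-action episode ending in a gain, under both rates.
\begin{verbatim}
# Values gamma**(Delta - delta) * s for a 4-action
# episode ending in a gain (s = 1).
def episode_values(gamma, Delta, s):
    return [round(gamma ** (Delta - d) * s, 4)
            for d in range(1, Delta + 1)]

print(episode_values(0.7, 4, 1))
# [0.343, 0.49, 0.7, 1.0]  early moves keep credit
print(episode_values(0.3, 4, 1))
# [0.027, 0.09, 0.3, 1.0]  credit concentrates late
\end{verbatim}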
With this function, triggering a new signal can take time. But the learner still looks for the best discounted value given the current state. Thus, with fewer assumptions on the monitoring of the real value of the states, we still benefit from something common to RL algorithms: the discovery and use of intermediary objectives.
\section{Behavior of the algorithm}
Here we develop a simple insight into why and how the proposed algorithm might lead to optimal long-term decisions. For this, we summarize the important steps of the algorithm. Firstly, it evaluates the best possible action to take after observing a stochastic state $\B{x_t}(p_t)$. Secondly, it appends the latter, or a randomly chosen possible action, to the training features matrix $\B{X}$. If the signal function $s(\B{x^*})$ is triggered, it backwardly assigns values to the states of the episode thanks to the discount function $d(\gamma, \Delta, \delta, s(\B{x^*}))$. These values are finally appended to the training target vector $\B{y}$. Then the training set is used as an input for the approximation function $f()$ to find $\B{w} = \arg\min_{\B{w}}(l(f(\B{w},\B{X}),\B{y}))$. In the next episode, the algorithm evaluates a new stochastic state and uses the new $\B{w}$ to take the next decision.
An important assumption has to be satisfied for the algorithm to function correctly: the presence of patterns in the stochastic states. Without patterns, the approximation function is unable to optimize the value of $\B{w}$ over time, and it does not make sense to use the algorithm in such a setting. By contrast, an application of choice is when we have imperfect representations of the same underlying states, such as real-world representations of a particular task.
We also assume that $s(\B{x^*})$ is able to give objective information about the current state. This is a strong assumption, but we can imagine practical solutions. The signal function could result from a pre-trained supervised approximation function that is able to evaluate with high precision the feature values triggering each of the possible signals. Another solution, providing perfect information, is manual monitoring at the end of each episode.
Finally, we require a correct tuning of the algorithm's different elements: the approximation function should be complex enough to recognize the possible patterns in $\B{x_t}(p_t)$, and both $\Omega(\B{x_t})$ and the discount function $d(\gamma, \Delta, \delta, s(\B{x^*}))$ should be well designed.
With these conditions fulfilled, we firstly expect, as the training sample grows after each episode, a better evaluation of the different states. Secondly, we expect that, asymptotically, equivalent underlying states will get the same value. Formally, we anticipate:
$$\B{w} \xrightarrow[n \to \infty]{} \B{w^*}$$
where $n$ is the number of previously observed states and $\B{w^*}$ is the optimal set of weights. In most cases, we expect the convergence process to be non-linear.
Evaluating well the value of equivalent states is only a first step towards obtaining a useful learner in practice. To maximize long-term gains, the algorithm needs a relatively balanced experience between gains and losses. First, the tuning of $\gamma$ should allow the regression function to evaluate states faster, via the learner's actions. If necessary, we need to ensure that
$|d(\gamma, \Delta, \delta, s_{gain})| \simeq |d(\gamma, \Delta, \delta, s_{loss})|$
if the $\delta$th intermediary action possibly leads to a gain in one episode but also to a loss in another episode, so that such a state ends up with a neutral value. Second, tuning $\epsilon$ well will ensure that the learner explores potentially good actions fast enough; an episode-number-dependent value for $\epsilon$ might be desirable. If these final conditions are ensured, we expect:
$$\B{x_c} \xrightarrow[n \to \infty]{} \B{x^*}$$
where $\B{x^*}$ is the optimal action at a particular time $t$.
\section{Implementation for a tic-tac-toe robot player}
Tic-tac-toe is a two-player game in which the players are represented by the symbols X and O on a three-by-three grid. The players take turns marking the spaces with their symbol. The player who places three of their symbols in a horizontal, vertical, or diagonal row wins the game. A draw is a possible outcome.
Robots playing this game are a classical application in RL. But this time we want the learner to play according to what is seen on a real-world image. The learner will then evaluate game boards that are sequentially drawn with a pen. The information available to the robot corresponds to images transformed into vectors. Ultimately, the learner has to discover that the images shown in Figure~\ref{fig:gameA} represent the same game.
\begin{figure}
\begin{center}
\includegraphics[scale=.14]{"gameA"}
\includegraphics[scale=.35]{"gameA_"}
\caption{After enough games, the learner is able to play the same way after visualizing either of these images}
\label{fig:gameA}
\end{center}
\end{figure}
The images of the board are thus stochastic representations of a finite set of underlying states. The stochastic processes particularly affecting these representations are the variations in the handwriting and in the brightness level of the images.
\subsection{Particularities of the algorithm's implementation}
The general algorithm presented in the previous section had to be adapted for this application, in particular regarding the approximation function, the signal function and the discount function.
First note that the stochastic state $\B{x_t}(p_t)$ results from another player's move. The states are the fruit of \textit{rival} actions, either from a human player or from another robot. So in terms of strategy optimization, the learner has to anticipate a possibly unfavorable environment during the game and act accordingly. In particular, in this implementation, the learner plays after the rival, so it can only win with its third or fourth move.
The chosen approximation function is a convolutional neural network detailed in Figure~\ref{fig:neuralnet}. This regression function outputs real values. The choice of this regression function was driven by the complexity of the relationship between the representations of the states and the value assigned to these states. The convolutional layer greatly enhanced the precision of the approximations on this training set of vectorized images. Indeed, this architecture proved to capture well the patterns in the infinite number of possible representations of the $255168$ possible ways to finish the game.
\begin{figure}
\begin{center}
\includegraphics[scale=.3]{"NNview"}
\caption{Additionally, the number of iterations is 200, the learning rate is 0.0001, and the learning rule is RMSprop, which divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. The threshold under which the validation error change is assumed to be stable is 0.0001, the mini-batch size is 200 observations, and training stops after 10 iterations during which the validation error remains (near) constant. Note that the output is then transformed so it belongs to the range $[-1;1]$.}
\label{fig:neuralnet}
\end{center}
\end{figure}
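As a hedged sketch, a network of this kind could be instantiated along the following lines with a modern deep-learning library (Keras here). The layer sizes are illustrative assumptions; only the training settings and the output range mirror the caption of Figure~\ref{fig:neuralnet}.
\begin{verbatim}
# Hedged Keras sketch of a convolutional regressor for
# the 7x7 grayscale boards. Layer sizes are assumed;
# the optimizer, learning rate, batch size, early
# stopping and tanh output follow the caption above.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(7, 7, 1)),
    layers.Conv2D(16, 3, activation="relu",
                  padding="same"),  # assumed filters
    layers.Flatten(),
    layers.Dense(64, activation="relu"),  # assumed
    layers.Dense(1, activation="tanh"),   # in [-1, 1]
])
model.compile(
    optimizer=keras.optimizers.RMSprop(
        learning_rate=1e-4),
    loss="mse")

stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=10)
# model.fit(X, y, validation_split=0.1, epochs=200,
#           batch_size=200, callbacks=[stop])
\end{verbatim}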
The signal function is triggered when a game ends. It returns $1$ if the learner wins, $-0.1$ if there is a draw, and $-1$ if the learner loses or makes an illegal move. Notice that the latter corresponds only to playing on top of the other player's mark, as the learner knows where it last played.
In this implementation, the discount function $d(\gamma, \Delta, \delta, s(\B{x_c}))$ has the form
$$\gamma^{\Delta-\delta}s(\B{x_c})$$
More precisely, I picked $\gamma=0.3$ so that early moves get neutral values, close to $0$, regardless of the game's outcome. The possible discounted values for moves leading to a win thus are: $1$, $0.3$, $0.09$ and $0.027$. For a draw: $-0.1$, $-0.03$, $-0.009$ and $-0.0027$. For a loss: $-1$, $-0.3$, $-0.09$ and $-0.027$. When an illegal move happens, there is no discount computation and the move simply gets the value $-1$.
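A minimal sketch of these two functions, assuming a hypothetical \texttt{outcome} label produced by the game engine once a board is terminal, could read as follows.
\begin{verbatim}
# Tic-tac-toe signal and discount as described above.
# `outcome` is a hypothetical label from the game
# engine (None while the game is still running).
def signal(outcome):
    return {None: 0.0, "win": 1.0, "draw": -0.1,
            "loss": -1.0, "illegal": -1.0}[outcome]

def discount(gamma, Delta, delta, s, illegal=False):
    if illegal:
        return -1.0   # illegal moves: no discounting
    return gamma ** (Delta - delta) * s
\end{verbatim}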
Finally, I implemented an adaptive randomness factor. It has the very \textit{ad hoc} form
$$\epsilon = 20000/\log_{10}(S)^6,$$
where $S$ is the size of the training data in bytes. I chose this formula so that $\epsilon = 1\%$ after 10000 games, the number of games the learner was trained with. The opponent robot is penalized with $\epsilon = 300000/\log_{10}(S)^6$ so that our learner has time to experience wins in the beginning. This parameter has to be set sensibly for an optimized training that yields a perfect player.
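For reference, the schedule is a one-liner; the scale constant of $20000$ becomes $300000$ for the opponent robot.
\begin{verbatim}
# Ad hoc exploration schedule: epsilon shrinks as the
# training data grows. S is the data size in bytes.
import math

def epsilon_schedule(S, scale=20000):
    return scale / math.log10(S) ** 6

# e.g. a hypothetical 10 MB training set:
# epsilon_schedule(1e7) -> 20000 / 7**6 ~ 0.17
\end{verbatim}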
\subsection{Necessary adjustments}
A few adjustments had to be made. First, as mentioned earlier, the learning phase is done against a Q-learning robot. After each player's move, an image representing the current state of the board is generated. This image is then reduced, as shown in Figure~\ref{fig:transition}, and appended to the training set.
\begin{figure}
\begin{center}
\includegraphics[scale=.4]{"transition"}
\caption{Reduction of the original training images to grayscale 7x7 images}
\label{fig:transition}
\end{center}
\end{figure}
Second, our learner has the disadvantage of receiving imperfect information about the game compared to the other robot, which knows exactly where the moves are made. To compensate for this, our learner takes advantage of the equivalence between rotated states: after each move, it stores the four rotations of the game in the training matrix, as sketched below.
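A possible sketch of this preprocessing step, assuming Pillow and NumPy, reduces each photographed board to the $7\times7$ grayscale format of Figure~\ref{fig:transition} and returns its four rotations.
\begin{verbatim}
# Reduce a board image to a 7x7 grayscale array and
# return its four rotations as flattened feature
# vectors. The file path is hypothetical.
import numpy as np
from PIL import Image

def board_to_features(path):
    img = Image.open(path).convert("L").resize((7, 7))
    board = np.asarray(img, dtype=np.float32) / 255.0
    return [np.rot90(board, k).ravel()
            for k in range(4)]
\end{verbatim}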
Third, as we do not have an objective sensor at our disposal, the signal received by the learner is the same as the one coming from the opponent robot. It is objective information indicating the final state of the game when it is reached.
Finally, the learning process with the convolutional neural network does not happen after each game but after a certain number of games. This is purely due to computational constraints. However, it does not affect \textit{in fine} the performance of the learner.
\subsection{Performance analysis}
This section aims at assessing how this implementation of the proposed algorithm allows the learner to optimize its strategy.
The following plots show the evolution of the learner's performance during the learning phase. We also study how the proportions of wins and losses (and of draws, by deduction) vary according to the level of $\epsilon$ for each type of robot. Note that the robot updates its parameters every 500 games, and can therefore only improve after each such update.
The blue line corresponds to the proportion of won games for the last 500 games. The red line corresponds to the proportion of lost games for the last 500 games.
INCLUDE PLOTS
We see that a more exploratory learner improves its winning rate faster but is then relatively limited towards the last phases.
A more exploratory opponent helps the learner to win and thus discover what are the winning strategies. However, making the opponent too exploratory would limit the performance of the learner when playing against a trained human.
Notice that a learner trained against a random opponent is able, after 100 games, to play easy moves, typically the last move to win, against a human player.
The last plot shows the average number of moves before winning for the last 100 games.
PLOT AVG MOVES
The learner can only win in three or four moves. Against a random player, we see that the learner's average number of moves to win gets closer to three as more games are played.
\section{Conclusion}
The tic-tac-toe player implementation provided a good demonstration for the proposed algorithm. However, more work has to be done concerning the proof of convergence for the approximation function and, more importantly, to prove the asymptotic gain-maximizing behavior of the learner.
This work has been done considering recent advances in reinforcement learning and artificial neural networks. It also aims at anticipating a direction that applied machine learning could take in the next few years. I notably expect similar methods to have an impact in various industries and for private usages in the near future.
For instance, such a learning algorithm, adapted for augmented reality glasses, could help people achieve precise tasks. We could even imagine such a system helping doctors perform high-precision surgery - recently a robot repaired a pig's bowel more accurately than a doctor. This would require careful tuning of the reinforcement algorithm. A way to improve its structure and its functions' parameters could be to observe the way natural organisms evolve in their environment. Such inspiration has always been a good guide for building agnostic, highly adaptive machines.
\addtolength{\textheight}{-12cm} % This command serves to balance the column lengths
% on the last page of the document manually. It shortens
% the textheight of the last page by a suitable amount.
% This command does not take effect until the next page
% so it should come on the page before the last. Make
% sure that you do not shorten the textheight too much.
\begin{thebibliography}{99}
\bibitem{c1} Melo et al. (2008), An Analysis of Reinforcement Learning with Function Approximation
\bibitem{c2} Mnih et al. (2013), Playing Atari with Deep Reinforcement Learning, arXiv:1312.5602 [cs.LG]
\bibitem{c3} Mnih et al. (2015), Human-level control through deep reinforcement learning, Nature 518, 529–533
\bibitem{c4} Munos et al. (2016), Safe and Efficient Off-Policy Reinforcement Learning, arXiv:1606.02647v1 [cs.LG]
\bibitem{c5} Seijen et al. (2015), True Online Temporal-Difference Learning, arXiv:1512.04087v1 [cs.AI]
\bibitem{c6} Schaul et al. (2015), Universal Value Function Approximators, JMLR: W\&CP volume 37
\bibitem{c7} Sutton and Barto (2012), Reinforcement Learning: An Introduction, MIT Press
\end{thebibliography}
\end{document}