-
Notifications
You must be signed in to change notification settings - Fork 0
/
geneticNeuralNet.tex
268 lines (196 loc) · 14.7 KB
/
geneticNeuralNet.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
\documentclass[twocolumn]{article}
\usepackage{lipsum,scrextend,mathtools,float,xspace,algorithm,algpseudocode,tabularx,soul}
\usepackage[table]{xcolor}
\usepackage[backend=biber,sorting=none,style=ieee]{biblatex}
\addbibresource{geneticNeuralNet.bib}
\newcommand\code[1]{
\begin{minipage}{\textwidth}
\texttt{\begin{addmargin}[0ex]{0ex}\scriptsize#1\end{addmargin}}
\end{minipage}
}
\newcommand\pcode[5]{
\begin{algorithm}[H] \caption{#1} \label{pcode:#2} \begin{algorithmic}[1]
\Function{#3}{#4}
#5
\EndFunction
\end{algorithmic} \end{algorithm}
}
%figure(file name, label, scale, caption)
\newcommand\fig[5]{
\begin{figure}[H]
\begin{center}\includegraphics[scale=#3]{#1}\end{center}
\caption{#4}\label{fig:#2}
\end{figure}
}
\newcommand\figRef[1]{Figure \ref{fig:#1}\xspace}
\newcommand\todo[1]{{\sethlcolor{green}\hl{\textbf{TODO} #1}}}
\newcommand\fixme[1]{{\sethlcolor{yellow}\hl{\textbf{FIXME} #1}}}
\newcommand\xxx[1]{{\sethlcolor{red}\hl{\textbf{XXX} #1}}}
\title{jGeneticNeuralNet: Training Neural Networks with a Genetic Algorithm in Java}
\author{Version 0.1 \\ Charles Eric Wimberley}
\begin{document}
\maketitle
\begin{abstract}
Genetic neural networks, first characterized in 1989, have generally taken a back seat to training algorithms with a better time complexity. However, the more complex neural networks being used for image, video, and audio processing today use varying structures. Gradient descent algorithms are good at quantitative error reduction, but qualitative properties such as network structure must be determined with alternative algorithms.
Genetic algorithms are capable of evolving both quantitative properties of networks such as edge weights and bias, as well as quantitative properies such as which activation function to use and network structure. jGeneticNeuralNet is a Java implementation of a neural network and associated training algorithms for classification and regression that uses such a genetic algorithm to train networks.
\end{abstract}
\section{Introduction}
Genetic neural networks (GNNs) are learning algorithms that use neural networks to associate an input with an output. Miller et al were some of the first researchers to train neural networks with genetic algorithms~\cite{MillerToddHedge}. Rather than backpropagation, which determines the weights and biases within the network using partial derivatives, GNNs mutate the weights and biases randomly in a population of networks over many generations.
\fig{images/visualization.png}{networkVis}{0.18}{
A visualization of a classification network with 8 input variables (green) and 10 output classes (blue). There are 5 hidden layers of 10 nodes each (black);
}
Genetic algorithms (GAs) are algorithms that use random mutations of solutions to optimimize some function of a system. They mimic genetic evolution of organisms in nature. While a random search of the problem space may seem less than optimal, progress is saved as a set of the best solutions so far, and the process can be easily parallelized~\cite{Tanese:1989:DGA:915973}.
More recent implementations allow for the structure of the network to mutate~\cite{LamStructure}. This allows networks to form more optimal structures for a particular problem space, as well as to optimize networks for size and complexity. For example, a GNN could automatically form a Convolutional Neural Network (CNN), which is optimal for problems such as handwritten character recognition~\cite{ConvolutionalCharacterClassification}, and classification on other complex data.
\section{Network Model}
For extensibility and ease of implementation, neurons are implemented as objects. InputNeurons, HiddenNeurons, and OutputNeurons extend the neuron class. Each neuron has the following fields:
\begin{itemize}
\item A unique ID
\item A bias
\item The activation function to use
\item A map from IDs to input neurons
\item A map from IDs to output neurons
\item A map from IDs to weights (IDs correspond to output map above)
\end{itemize}
Input neurons also include a field to contain the input feature. The activation function is overridden to return this input feature without any calculation. A neuron calculates its output by summing the outputs of all input neurons multiplied by their respective weights, adding the bias, and passing the result to an activation function~\cite{Russell:2003:AIM:773294}.
$$\varphi((\sum_{n=1}^{N}n \times w_n)+bias)$$
A number of different activation functions are implemented. Functions with a range between 0.0 and 1.0 are particularly useful for probability regressions (the predicted probability of a class). Output neurons for the classification problem use these activation functions.
%y=(sin(x*pi-pi/2)+1)/2
$$\varphi_{sin}(x) = (sin(x*\pi-\pi/2)+1)/2$$
\fig{images/sin.png}{sinact}{0.5}{
A plot of the $\varphi_{sin}(x)$ activation function.
}
The arctan activation function can also be used as a bounded output function for classification or probability regressions.
%y=arctan(x)/Pi+0.5
$$\varphi_{arctan}(x) = arctan(x)/\pi+0.5$$
\fig{images/tan.png}{tanact}{0.5}{
A plot of the $\varphi_{arctan}(x)$ activation function.
}
A simple step function is implemented as follows.
\[
\varphi_{step}(x) =
\begin{dcases}
1,& \text{if } x\geq 0\\
0, & \text{otherwise}
\end{dcases}
\]
For regressions, some functions with a larger range are implemented, such as the linear and squred activations below.
$$\varphi_{linear}(x) = x$$
$$\varphi_{squared}(x) = x^2$$
\section{Genetic Algorithm}
Networks are trained with a genetic algorith, which first produces a large number of random networks. Each network is tested for fitness by determining its average error rate on the training data, and the population of networks is mutated to generate networks with improved fitness. By selecting only the networks with the lowest error at each step (the most fit networks), the population of networks slowly moves towards a low error rate.
\subsection{Mutation}
Mutation can affect edge weight, node bias, node activation function, and network structure. Numeric values are changed by a random fraction of the learning rate, which is set as a hyperparameter.
\fig{images/originalRegression.png}{originalStructure}{0.2}{
A simple regression network that is fully connected.
}
\figRef{originalStructure} above shows a fully connected network that could be used for a regression model. After a few generations of random mutation and fitness selection, the network is no longer fully connected (\figRef{mutatedStructure}).
\fig{images/mutatedRegression.png}{mutatedStructure}{0.2}{
A mutated network with a different connection structure.
}
The visualization in \figRef{enhancedVis} below shows a network that was trained on the Iris dataset~\cite{IrisDataset}. It shows the variety of edge weights, biases and activation functions that have been selected by the genetic algorithm.
\fig{images/enhancedVisualization.png}{enhancedVis}{0.2}{
A visualization with lowest edge weights in blue and highest edge weights in red. Biases and activation functions are shown as text on the individual nodes.
}
\subsection{Fitness}
While it is common for genetic algorithms to maximize a fitness variable, the GNN training algorithm minimizes error instead. This means that the ``fitness'' function is really the inverse of what is normally considered fitness.
\pcode{Fitness Algorithm}{fitness}{fitness}{$training$}{
\State $avgError = 0.0$
\For{$datum$ in $training$}
\State $e = abs(datum.out - predict(datum.in))$
\State $avgError = avgError + e$
\EndFor
\State return $averageError / len(training)$
}
\section{Optimizations}
Training genetic neural networks is computationally expensive. Thousands of networks must be evaluated during selection of the final model. Therefore, the network training algorithm employs several important optimizations:
\begin{itemize}
\item Each network is trained in a seperate thread
\item Memoization of neuron output
\item Memoization of average network error
\item Most fit networks produce more offspring
\item Random sub-sampling of the training data during training error estimation
\end{itemize}
\subsection{Multithreaded Training}
Networks are judged for fitness by computing the average error on training samples. This can be computed in an embarassingly parellel fashion by running one thread per network. As shown in the pseudocode below, a job to compute the fitness of a network is executed in a thread pool, which can be configured with a maximum number of threads depending on hardware.
\pcode{Training Algorithm}{train}{train}{$c$}{
\State $pool=$ fitness computing thread pool
\State $pop=$ network queue ordered by fitness
\State $offspring=$ job queue ordered by fitness
\State $survivors=$ network queue ordered by fitness
\For{$generation < c.maxGens$}
\For{$nework$ in $population$}
\State $pool.execute(new Job(network))$
\EndFor
\State $pool.shutdown()$
\While{$pool.isRunning()$}
\State $sleep(10)$
\EndWhile
\State i = 0
\While{$!pool.isEmpty()$}
\State $offspring.add(pool.getJob(i))$
\State i = i + 1
\EndWhile
\While{$len(survivors) < c.population$}
\State $survivors.add(offpring.poll().network)$
\EndWhile
\State $population = survivors$
\EndFor
}
Inside each thread the original network is cloned, mutated, and then evaluated for average error. This minimizes the amount of work that must be performed in the main thread.
\subsection{Activation Function Memoization}
Highly connected networks use the output from the same neuron for multiple neurons in the next layer. Rather than recalculating the output of these neurons, the activation function output is memoized. Each neuron only needs to store two variables in order to accomplish this: the previous activation output and a ``memoized'' flag. Note that this optimization is important for both training and production use of the network.
\fig{images/memoization.png}{memoization}{0.4}{
Two neurons use the output of neuron b. Instead of recomputing b each time, the output of b can be memoized. Note that $x_i$ refers to the bias for that neuron.
}
The structure from \figRef{memoization} occurs a large number of times in layered neural networks. While this optimization cannot compete with matrix multiplication with specialized hardware, it does mitigate some of the performance issues associated with object oriented neural networks. The instructions to compute the output of this network with and without memoization are shown below. \\
\noindent Without memoization: \\
\indent Step 1: Compute $\varphi(a+x1)$ \\
\indent Step 2: Compute $\varphi(b+x2)$ \\
\indent Step 3: Compute $\varphi(a+x1)$ \\
\indent Step 4: Compute $\varphi(b+x3)$ \\
\noindent With memoization: \\
\indent Step 1: Compute $\varphi(a+x1)$ \\
\indent Step 2: Compute $\varphi(b+x2)$ \\
\indent Step 3: Compute $\varphi(b+x3)$
For a fully connected neural network with $N$ layers and $M$ neurons per layer, memoization reduces predicion complexity from $O(M \times M \times N)$ to $O(M \times N)$. In other words, instead of computing the neuronal output for each conection for each layer, neuronal output is computed once per neuron. Not all networks are fully connected, but this is the worst case scenario for a non-memoized implementation.
\subsection{Fitness-Based Offspring Rate}
More fit networks produce more offspring than less fit networks. This is based on the idea that a mutant of a more fit network is more likely to produce a better network than a less fit network. Not only should this decrease the number of generations required to reach a sufficiently trained model, but it should reduce the number of total networks tested for fitness as well. This results in a shorter training time.
\section{Methods}
\subsection{Classification}
The Iris dataset~\cite{IrisDataset} was used to train classifiers. Future implementaions may use 10-fold cross validation, however the current algorithm simply leaves out 1/5th of the input data for testing. Models were judged based on their 3-class accuracy. Two-hundred networks with 3 hidden layers of 8 neurons each were trained for 15, 30, 60, 120, 240 and 480 generations respectively. The confusion matrix and accuracy for one such model is shown below.
\begin{center}
\begin{tabular}{ l c c c }
Expected & versicolor & virginica & setosa \\
versicolor & 8 & 1 & 0 \\
virginica & 0 & 10 & 0 \\
setosa & 0 & 0 & 12
\end{tabular}
\end{center}
Accuracy: 0.967741935483871
\subsection{Regression}
The $x^2$ function was used to generate training data for a regression network. Each network had 2 hidden layers and 6 neurons per layer. Like the classification benchmark, 1/5th of the input data was set aside for testing, and two-hundred networks were trained for 15, 30, 60, 120, 240 and 480 generations. Some expected outputs and predicted outputs are compared in the table below, along with the accompanying mean squared error.
\begin{center}
\begin{tabular}{ l c }
Expected & Predicted \\
0.0 & 1.20346 \\
1.0 & 2.06567 \\
4.0 & 5.16131 \\
9.0 & 10.0320 \\
16.0 & 16.9270
\end{tabular}
\end{center}
Mean squared error: 0.3971038018343635
\section{Results}
Classification accuracy for the 3 class Iris problem is shown in \figRef{classificationBenchmark} below. Accuracy increases as exepected with generation number until generation 240. However there was no noticable difference between 240 and 480 generations in terms of average or variance. Some training attempts appear to get stuck in local minima. The average accuracy seems to stabalize at around 0.9.
\fig{images/AccuracyByGeneration.png}{classificationBenchmark}{0.45}{
Classification accuracy after 15, 30, 60, 120, 240, and 480 generations.
}
Mean squared error for a regression of the $x^2$ function (\figRef{regressionBenchmark}) resulted in many more outliers than classification. However, the distribution of model error improves with increased generations.
\fig{images/ErrorByGeneration.png}{regressionBenchmark}{0.45}{
Regression mean squared error after 15, 30, 60, 120, 240, and 480 generations.
}
\section{Conclusion}
While a genetic algorithm is unlikely to beat backpropagation in terms of time complexity, it has the advantage of optimizing network structure and other qualitative network properties. The algorith was able to train high quality models on known good data sets. Future improvements could include cross-fold validation, computation of AUC ROC, and area under the precision/recall curve to give a more in depth evaluation of models.
Occasionally, the training algorithm appears to get stuck in a local minimum. It is unclear if there is a solution to this problem besides starting the training process over. Some enhancements need to be made to the library before production release, including a minimum training cuttoff option, cross-fold validation, and categorical input variables.
\printbibliography
\end{document}