Several of the core components (integrators, mass models, coupling
functions) have been targeted towards a C source code backend, which
allows simulations to be compiled to native code and loaded either as
a shared library accessed via the \texttt{ctypes} module or as CUDA
kernels accessed via PyCUDA \cite{PyCUDA}.
While such an approach can provide significant speed ups, it depends on the
presence of a C compiler and, in the case of the GPU, the CUDA toolkit and
a compatible graphics card; in the future, prepackaged versions of TVB
will include precompiled objects for most kinds of simulations.
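For concreteness, the \texttt{ctypes} route amounts to something like
the following sketch, where the shared object name and the exported
\texttt{integrate} function are hypothetical stand-ins for whatever
the code generator actually emits:
\begin{lstlisting}[caption={A hypothetical sketch of loading generated
native code via ctypes},label={lst:ctypes}]
import ctypes
import numpy as np

# load the compiled simulation code (file name is illustrative)
lib = ctypes.CDLL("./tvb_sim.so")

# declare the signature of the hypothetical exported entry point
lib.integrate.restype = None
lib.integrate.argtypes = [
    ctypes.c_int,                                         # number of steps
    np.ctypeslib.ndpointer(dtype=np.float32, flags="C"),  # state variables
]

state = np.zeros(2, dtype=np.float32)  # e.g. (v, w) of the oscillator
lib.integrate(1000, state)             # advance the state in native code
\end{lstlisting}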
A common approach to eliminating Python overhead in numerical
libraries is to rewrite code in Cython; however, an explicit goal in
the case of TVB was to employ thread parallelism on the GPU, with
plain C code as a fallback possibility.
The approach used to compile a simulation to native code therefore takes
advantage of the fact that the code generated for CUDA is quite similar to C,
so that a generic template can abstract much of the boilerplate shared
between the two.
For each part of the simulator, a generic function is customized with a
class-specific kernel; for example, in the case of a neural mass model, we
have in the Python class
\begin{lstlisting}[caption={Device code specification for the
Generic2dOscillator model},label={lst:g2dOscil}]
class Generic2dOscillator(Model):
    tau = FloatArray(...)
    # etc.
    device_info = model_device_info(
        pars=[tau, a, b, c, d, I],
        kernel="""
        float tau = P(0)
            , a   = P(1) // etc.
        // state variables
            , v   = X(0)
            , w   = X(1)
        // aux variables
            , c_0 = I(0) ;
        // derivatives
        DX(0) = d *
            (tau * (w - v*v*v + 3.0*v*v + I + c_0));
        DX(1) = d *
            ((a + b*v + c*v*v - w) / tau);
        """
    )
\end{lstlisting}
\noindent where the \texttt{device\_info} attribute specifies how the
class's mathematical description fits into the generic model function:
\begin{lstlisting}[caption={Generic device wrapper into which the
model-specific code is substituted},label={lst:wrapper}]
/* wrapper for model-specific code computing RHSs of diff-eqs */
__device__
void model_dfun(
    float *_dx, float *_x, float *mmpr, float *input)
{
#define X(i) _x[n_thr*i]
#define DX(i) _dx[n_thr*i]
#define P(i) mmpr[n_thr*i]
#define I(i) input[i]
    // begin model-specific code
    $model_dfun
    // end model-specific code
#undef X
#undef DX
#undef P
#undef I
}
\end{lstlisting}
\noindent where the C preprocessor definitions allow the model-specific
kernel to reference the correct elements of the per-thread arrays (in
the case of the GPU), which are strided by the number of threads,
\texttt{n\_thr}.
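The substitution of the model-specific kernel into this wrapper and
the runtime compilation can be sketched as follows; the device code
here is a cut-down stand-in for the generated simulator source, and
only the combination of \texttt{string.Template} with PyCUDA's
\texttt{SourceModule} is meant to illustrate the approach:
\begin{lstlisting}[caption={A sketch of template substitution and
runtime compilation with PyCUDA},label={lst:codegen}]
from string import Template

import numpy as np
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as drv
from pycuda.compiler import SourceModule

template = Template("""
#define n_thr ${n_thr}
#define X(i)  _x[n_thr*i]
#define DX(i) _dx[n_thr*i]
#define P(i)  mmpr[n_thr*i]

__device__ void model_dfun(float *_dx, float *_x, float *mmpr)
{
    ${model_dfun}
}

__global__ void step(float *_dx, float *_x, float *mmpr)
{
    // one thread per point of the parameter grid
    model_dfun(_dx + threadIdx.x, _x + threadIdx.x, mmpr + threadIdx.x);
}
""")

# stand-in for a model's device_info.kernel string
dfun = "float tau = P(0); DX(0) = (X(1) - X(0)) / tau;"

n_thr = 64
module = SourceModule(template.substitute(n_thr=n_thr, model_dfun=dfun))
step = module.get_function("step")

x   = np.random.rand(2 * n_thr).astype(np.float32)  # (v, w) per thread
dx  = np.zeros_like(x)
par = np.full(n_thr, 3.0, dtype=np.float32)         # tau per thread
step(drv.Out(dx), drv.In(x), drv.In(par), block=(n_thr, 1, 1))
\end{lstlisting}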
\begin{figure}
{\includegraphics[width=0.48\textwidth]{images/gpu_dxdt.png}}
\caption{
\textit{Right:} A typical parameter space exploration, a $32 \times 32$
grid of coupling strength (y-axis) vs.\ neural excitability (x-axis).
This grid of simulations was run on both TVB's Python/NumPy
implementation and the new GPU backend for 200 ms of simulation
time with otherwise default parameters. The former took approximately
2 hours and the latter approximately 1 minute. \textit{Left:}
Quantitative comparison of solutions and instantaneous derivatives,
shown for an even sampling of the parameter space across $k$ where
$a = -2$, because this slice showed the most error on the GPU.
}
\label{fig:gpu_dxdt}
\end{figure}
\begin{figure}
{\includegraphics[width=0.48\textwidth]{images/gpu_pse.png}}
\caption{Parameter space exploration performed with both the pure
Python NumPy simulator and the GPU backend, showing isocontours of the
average standard deviation across the parameter space.}
\label{fig:gpu_pse}
\end{figure}
\note[sk]{One figure or two, re captions}
\note[sk]{Specify hardware (GPU/CPU) and whether Numpy is mkl-linked, to
provide a more solid foundation for timing comparisons...}
\note[sk]{It would be interesting to see a longer run (say a few seconds)
to show if/how-quickly the error grows...}
As can be seen in Listings~\ref{lst:g2dOscil} and \ref{lst:wrapper},
the calculations in native code are performed with 32-bit floating
point numbers, and it is reasonable to ask whether this is numerically
accurate. In Fig.~\ref{fig:gpu_pse}, we present a parameter space
exploration performed with both the pure Python NumPy simulator and
the GPU simulator, showing the isocontours of average standard
deviation in the parameter space. Some deviation can be identified
visually in parts of the parameter space, and in
Fig.~\ref{fig:gpu_dxdt} we show in more detail time series of
the Python and GPU solutions.
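A minimal sketch of the kind of precision question involved, using a
simple Euler scheme and illustrative parameter values rather than
those of the figures, is to integrate the same right-hand sides in
single and double precision and measure the drift:
\begin{lstlisting}[caption={A sketch of a single- vs double-precision
comparison},label={lst:precision}]
import numpy as np

# Generic2dOscillator right-hand sides as in the first listing;
# the parameter values here are illustrative, not those of the figures
def dfun(v, w, tau=3.0, a=1.05, b=-1.0, c=0.0, d=0.02, I=0.0):
    dv = d * (tau * (w - v**3 + 3.0 * v**2 + I))
    dw = d * ((a + b * v + c * v**2 - w) / tau)
    return dv, dw

def euler(dtype, n_steps=2000, dt=0.1):
    # 2000 steps of 0.1 ms give 200 ms of simulation time
    v, w, h = dtype(0.1), dtype(0.1), dtype(dt)
    for _ in range(n_steps):
        dv, dw = dfun(v, w)
        v, w = v + h * dtype(dv), w + h * dtype(dw)
    return v

drift = abs(euler(np.float64) - float(euler(np.float32)))
print("drift in v after 200 ms:", drift)
\end{lstlisting}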
\begin{figure}
{\includegraphics[width=0.48\textwidth]{images/gpu_acceleration.png}}
\caption{Speedup of the GPU backend over the Python NumPy simulator as
a function of the number of simulations performed simultaneously on
the GPU.}
\label{fig:gpu_acceleration}
\end{figure}
On the GPU, this approach allows significant acceleration of parameter
sweeps by taking advantage of the fact that, in many cases, only
numerical values vary between threads, and not memory access patterns.
Where a dimension of a parameter sweep does imply changing memory
access patterns, for example the conduction speed, it is advantageous
to reorder the parameters so that such memory-varying parameters change
only between grids of GPU threads and not within them.
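As a sketch of the host-side layout implied by
\texttt{P(i) = mmpr[n\_thr*i]}, a two-parameter $32 \times 32$ sweep
flattens into a single strided array in which parameter $i$ of thread
$t$ lives at \texttt{mmpr[n\_thr*i + t]}; the parameter names below
are illustrative:
\begin{lstlisting}[caption={A sketch of the per-thread parameter
layout for a grid sweep},label={lst:parsweep}]
import numpy as np

coupling     = np.linspace(0.0, 1.0, 32)   # varied along one grid axis
excitability = np.linspace(-4.0, 0.0, 32)  # varied along the other

# one (coupling, excitability) pair per GPU thread
grid = np.stack(
    np.meshgrid(coupling, excitability, indexing="ij"), axis=0
).astype(np.float32)                       # shape (2, 32, 32)

n_thr = 32 * 32
mmpr = grid.reshape(2, n_thr).ravel()      # shape (2 * n_thr,)

# thread t reads its i-th parameter at mmpr[n_thr*i + t]
t = 5
assert mmpr[n_thr * 0 + t] == grid[0].ravel()[t]
assert mmpr[n_thr * 1 + t] == grid[1].ravel()[t]
\end{lstlisting}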
In Fig.~\ref{fig:gpu_acceleration}, we plot the speedup brought by the GPU
over the Python NumPy simulator as a function of the number of simulations
performed simultaneously on the GPU.