This is due at:
22:00, 14th Nov 2016
This exercise looks at less regular types of computation, and at situations where you might have to re-organise the code in order to make it faster.
This is another new coursework for this year, and is designed to act as a bit of a bridge between the highly structured CW1-CW3, and the more open-ended CW4 and CW5. It replaces the amazing makefile/pipeline coursework, which was great (I think), except that it was too time-consuming (plus lots of students had difficulty implementing FIRs). I've run through it a couple of times, and adjusted it down for time (made the GPU part at the end optional). Let me know where there are problems in the spec.
The overall application domain is a very simple multi-layer perceptron network. We only look at the feed-forward part, i.e. the classification part. At each layer a vector of data is presented, and the layer transforms it into a new vector of data. The output of one layer goes into the input of the next layer, and so on.
The purpose of this course is to accelerate and optimise things even if you don't fully understand them, so don't worry if you've never looked at neural networks. It is better to look at the code. So first build all the tools:
```
make tools
```
This should leave you with a number of programs in the `bin` directory:
- `bin/run_network` : This takes a list of layers (encoded
as binary files) then runs data through them.
- `bin/generate_sparse_layer` : This creates an instance of
a layer with a given input and output size, and writes
it to stdout as a binary file.
- `bin/print_layer_as_text` : Takes a binary layer description
and prints it as human readable text.
- `bin/print_network_as_dot` : Takes a list of layers (binary
descriptions) and prints them in the [GraphViz/dot](http://www.graphviz.org/) format.
In order to get a feel for this, create a layer with 8 inputs and 12 outputs:
```
mkdir -p w
bin/generate_sparse_layer 8 12 > w/layer_8_12.bin
```
This should have generated a file called `w/layer_8_12.bin`, which you can view as text:
```
bin/print_layer_as_text < w/layer_8_12.bin
```
You should see some information about the network, and in particular the list of synapses, or weights. There will be many lines looking like this:
```
98 : 10 <- 6 (0.660889)
```
This gives the connectivity of the graph. In this case it is saying that synapse `98` takes input `6` and transfers it to output `10` with weight `0.660889`. This particular layer is sparse, which means that even though there are 8*12=96 possible paths, only 9 of them exist, so the layer has a sparsity factor of 9/96 = 0.09375. Sparsity is often useful for reducing compute load, but the resulting irregularity makes it more difficult to optimise.
Optional: We can also view the network as a graph if GraphViz is installed. First generate a dot file from the graph:
```
bin/print_network_as_dot w/layer_8_12.bin > w/layer_8_12.dot
```
then render it as a png:
```
dot w/layer_8_12.dot -Tpng > w/layer_8_12.png
```
This results in the following image on my machine:
In this instance it is so sparse that some neurons aren't connected, but others have multiple connections.
A layer is applied to data by turning it into an instance of the `Layer` class, found in `include/layer.hpp`. Each instance of `Layer` contains the set of connections for one layer, and is able to execute that layer through `Layer::execute`. The reference version of `Layer::execute` is given by the class `SimpleLayer`, which is in `src/layers/simple_layer.hpp`. Looking in `SimpleLayer::execute`, the key code is:
```
void execute(
    const int8_t *pIn,  // Values of input neurons in -127..+127
    int8_t *pOut        // Values of output neurons in -127..+127
) const
{
    // Create a working vector of nOut values
    std::vector<int32_t> acc(m_nOut, 0);

    // Loop over all synapses (edges)
    for(unsigned i=0; i<m_synapses.size(); i++){
        int32_t contrib = m_synapses[i].weight * pIn[ m_synapses[i].src ];
        acc[ m_synapses[i].dst ] += contrib >> (23-16);
    }

    // Transform all outputs
    for(unsigned i=0; i<m_nOut; i++){
        pOut[i] = sigmoid( acc[i] );  // compress with sigmoid function
    }
}
```
There are two loops in here:

- Iterate over the synapses to transfer contributions from the input layer to the output layer.
- Compress the accumulated input values to a value between 0 and 1 using a sigmoid function, in this case `sigmoid(x) = 1/(1+exp(-x))`.
This example uses fixed-point rather than floating-point arithmetic, as it makes it easier to check whether we are getting the right results. If we used floating-point then the order of execution would affect the result, but with fixed-point it should always be exactly the same.
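To see concretely why the order matters for one but not the other, here is a tiny standalone example (not part of the coursework code):
```
#include <cstdio>

int main()
{
    // Floating-point addition is not associative, so regrouping the sum
    // (i.e. changing the execution order) can change the result.
    float a = 1e8f, b = -1e8f, c = 1.0f;
    std::printf("float: %f vs %f\n", (a + b) + c, a + (b + c));  // 1.000000 vs 0.000000

    // Integer (fixed-point) addition is associative, so any accumulation
    // order gives bit-identical results.
    int x = 100000000, y = -100000000, z = 1;
    std::printf("int  : %d vs %d\n", (x + y) + z, x + (y + z));  // 1 vs 1
    return 0;
}
```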
To run the application, generate 1024 bytes of random input data (updated as suggested by @awai54st):
```
cat /dev/urandom | head -c 1024 > w/random1024.bin
```
You can then run the data through it as a single-layer network:
```
cat w/random1024.bin | bin/run_network w/layer_8_12.bin > w/ref1024.bin
```
You'll have ended up with a 1536-byte output file, because the output layer is larger than the input layer. Internally `bin/run_network` cuts the data up into batches that match the input size of the network: here 1024/8 = 128 batches of 8 bytes, each producing 12 output bytes, giving 128*12 = 1536 bytes.
The idea is to have multiple layers, not just one layer, and this is what presents a performance bottleneck. Generate 16 layers, each of size 16x16:
```
for i in $(seq -w 0 15); do
    mkdir -p w
    bin/generate_sparse_layer 16 16 > w/n16_${i}.bin;
done
```
You should have ended up with 16 files, which can then be used as a network:
```
cat w/random1024.bin | bin/run_network \
    w/n16_00.bin w/n16_01.bin w/n16_02.bin w/n16_03.bin \
    w/n16_04.bin w/n16_05.bin w/n16_06.bin w/n16_07.bin \
    w/n16_08.bin w/n16_09.bin w/n16_10.bin w/n16_11.bin \
    w/n16_12.bin w/n16_13.bin w/n16_14.bin w/n16_15.bin \
    > w/ref_n16_1024.bin
```
This will have run the same data through a 16-layer network, rather than just one layer.
Optional: We can generate a dot graph of this network:
```
bin/print_network_as_dot \
    w/n16_00.bin w/n16_01.bin w/n16_02.bin w/n16_03.bin \
    w/n16_04.bin w/n16_05.bin w/n16_06.bin w/n16_07.bin \
    w/n16_08.bin w/n16_09.bin w/n16_10.bin w/n16_11.bin \
    w/n16_12.bin w/n16_13.bin w/n16_14.bin w/n16_15.bin \
    > w/n16.dot
```
then render it to a png:
```
dot w/n16.dot -Tpng > w/n16.png
```
I get this mess:
The overall computational goal is to make the networks deeper and wider, so in general we want more layers, and each layer should be bigger.
Task: Create a script called `scripts/create_n512.sh` with the following contents:
```
#!/bin/bash
for i in $(seq -w 0 31); do
    mkdir -p w
    bin/generate_sparse_layer 512 512 > w/n512_${i}.bin;
done
```
If you run this script it should generate 32 layers, each of which is 512 wide.
Task: Create a script called `scripts/run_n512_P1.sh` with the following contents:
```
#!/bin/bash
bin/run_network \
    w/n512_00.bin w/n512_01.bin w/n512_02.bin w/n512_03.bin \
    w/n512_04.bin w/n512_05.bin w/n512_06.bin w/n512_07.bin \
    w/n512_08.bin w/n512_09.bin w/n512_10.bin w/n512_11.bin \
    w/n512_12.bin w/n512_13.bin w/n512_14.bin w/n512_15.bin \
    w/n512_16.bin w/n512_17.bin w/n512_18.bin w/n512_19.bin \
    w/n512_20.bin w/n512_21.bin w/n512_22.bin w/n512_23.bin \
    w/n512_24.bin w/n512_25.bin w/n512_26.bin w/n512_27.bin \
    w/n512_28.bin w/n512_29.bin w/n512_30.bin w/n512_31.bin
```
Note that this script provides no input and no output for `bin/run_network`, so the data will come from the standard input and output of the script. To run it, generate a bigger chunk of input data:
```
cat /dev/urandom | head -c 1048576 > w/random1M.bin
```
and then pipe it through:
```
cat w/random1M.bin | scripts/run_n512_P1.sh > /dev/null
```
Depending on your system, this might take a while (up to minutes). To get a sense of what is going on, it can be useful to use the tool `pv`, which displays the progress of data through the stream:
```
cat w/random1M.bin | scripts/run_n512_P1.sh | pv > /dev/null
```
This shows you the bandwidth of the data flowing out of the pipeline, as well as the total amount of data that has passed through. We could alternatively have placed `pv` at the beginning:
```
cat w/random1M.bin | pv | scripts/run_n512_P1.sh > /dev/null
```
which for the P1 version will make little difference.
Task: Create a script called `scripts/run_n512_P2.sh` with the following contents:
```
#!/bin/bash
bin/run_network \
    w/n512_00.bin w/n512_01.bin w/n512_02.bin w/n512_03.bin \
    w/n512_04.bin w/n512_05.bin w/n512_06.bin w/n512_07.bin \
    w/n512_08.bin w/n512_09.bin w/n512_10.bin w/n512_11.bin \
    w/n512_12.bin w/n512_13.bin w/n512_14.bin w/n512_15.bin \
    | bin/run_network \
    w/n512_16.bin w/n512_17.bin w/n512_18.bin w/n512_19.bin \
    w/n512_20.bin w/n512_21.bin w/n512_22.bin w/n512_23.bin \
    w/n512_24.bin w/n512_25.bin w/n512_26.bin w/n512_27.bin \
    w/n512_28.bin w/n512_29.bin w/n512_30.bin w/n512_31.bin
```
Be careful with the line continuations - each of the `\` characters at the end of a line means that the command continues on the next line, and the pipe `|` character joins the two `bin/run_network` commands together.
Run your new pipeline, and see if the performance is any different:
```
cat w/random1M.bin | scripts/run_n512_P2.sh | pv > /dev/null
```
You can also look at the run with `time` to see what is going on, and you may also want to watch your task manager (e.g. `top` or `htop`).
Task: Create a script called `scripts/run_n512_P4.sh`, which splits the network into four stages.

Task: Create a script called `scripts/run_n512_P8.sh`, which splits the network into eight stages.

Task: Create a script called `scripts/run_n512_P16.sh`, which splits the network into sixteen stages.

Task: Create a script called `scripts/run_n512_P32.sh`, which splits the network into thirty-two stages.
Hopefully you'll see different results, and depending on your number of cores, you should see pretty decent speed-ups. Note that it depends on the amount of data flowing through - if there is not enough data, then eventually the speed-up will reduce.
Task: Generate a plot of P (i.e. which script was used) versus sustained bandwidth (i.e. bytes/sec) for these scripts on a c4.8xlarge AWS instance. Note that you may want to leave this until all of your development work is done, so that you don't keep starting and stopping instances. Save the plot as `results/pipeline_p_vs_bandwidth.pdf`.
The takeaway lesson from this should be that pipelines can be quite effective at load-balancing and parallelising sequential components. It is often easier to split a task up into a pipeline of stages that run in parallel, rather than trying to parallelise an individual stage. It's also possible to put things like decompressors and compressors on the front and end of the pipeline, reducing the size of the data on disk.
While pipeline parallelism in the shell can work well, there is a lot of overhead due to moving data between processes, so it is still worthwhile exploring how to make an irregular application faster using our current methods. At the moment we've only looked at simple dense iteration spaces, so some extra thought will be needed for this application.
Look at the existing implementation of `SimpleLayer::execute`, and consider the two loops:

- Is there any data-sharing between iterations?
- Do any iterations write to the same locations?
- Can they be parallelised?

(This is another "in your head" thing, you don't need to write it down.)
Task: create a new layer engine called `par_for_naive` which parallelises both loops without regard to correctness:

- Copy the implementation `src/layers/simple_layer.cpp` into a new file called `src/layers/par_for_naive_layer.cpp`.
- Change the class name from `SimpleLayer` to `ParForNaiveLayer`.
- Adjust the factory function at the bottom of the file (i.e. `CreateSimpleLayer` -> `CreateParForNaiveLayer`).
- Go into `src/layers/layer_factory.cpp`, and make sure the new engine is created if the engine type is "par_for_naive".
- Parallelise the two loops (a sketch of one possible approach follows this list).
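As a rough sketch of what the naive parallelisation might look like - assuming the same member names as `SimpleLayer`, a C++11 compiler, and TBB's lambda-friendly `tbb::parallel_for(first, last, func)` overload - note that the race on `acc` is left in deliberately, since that is the point of this engine:
```
#include "tbb/parallel_for.h"

void execute(
    const int8_t *pIn,  // Values of input neurons in -127..+127
    int8_t *pOut        // Values of output neurons in -127..+127
) const
{
    std::vector<int32_t> acc(m_nOut, 0);

    // Naive: parallelise over the synapses. Different synapses can share
    // the same dst, so this read-modify-write on acc[dst] is a data race.
    tbb::parallel_for(0u, (unsigned)m_synapses.size(), [&](unsigned i){
        int32_t contrib = m_synapses[i].weight * pIn[ m_synapses[i].src ];
        acc[ m_synapses[i].dst ] += contrib >> (23-16);
    });

    // Each output element is written by exactly one iteration, so this
    // loop is safe to parallelise as-is.
    tbb::parallel_for(0u, (unsigned)m_nOut, [&](unsigned i){
        pOut[i] = sigmoid( acc[i] );
    });
}
```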
Once you have got this set up, you should be able to do:
```
cat w/random1024.bin | \
    bin/run_network w/n512_00.bin:par_for_naive \
    > w/random1024_par_for_naive.out
```
in order to select the new engine (note the `:par_for_naive` suffix on the layer), or:
```
cat w/random1024.bin | \
    bin/run_network w/n512_00.bin:simple \
    > w/random1024_simple.out
```
to select the original (default) engine.
The `diff` command can be used to tell if two files are different. Use it to convince yourself that `w/random1024_par_for_naive.out` and `w/random1024_simple.out` are different (or at least that sometimes they are different).
There is a clear problem in the first for loop, where multiple tasks are writing to the `acc` array at the same time. One solution is to make the accesses atomic. Fortunately we are dealing with integer data, rather than floating-point, so you can just replace:
```
std::vector<int32_t> acc(m_nOut, 0);
```
with:
```
std::vector<tbb::atomic<int32_t> > acc(m_nOut, 0);
```
Task: create a new implementation called `ParForAtomic` with engine name `par_for_atomic`, and make the `acc` array atomic.
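For reference, a sketch of how the accumulation might then look (same assumptions as the earlier sketch; `+=` on `tbb::atomic` is an atomic fetch-and-add, so concurrent updates to the same output element no longer race, they just contend):
```
#include "tbb/atomic.h"
#include "tbb/parallel_for.h"

// Inside the par_for_atomic engine's execute:
std::vector<tbb::atomic<int32_t> > acc(m_nOut, 0);

tbb::parallel_for(0u, (unsigned)m_synapses.size(), [&](unsigned i){
    int32_t contrib = m_synapses[i].weight * pIn[ m_synapses[i].src ];
    acc[ m_synapses[i].dst ] += contrib >> (23-16);  // atomic add
});
```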
We'll benchmark them all at the end, but it is worth looking at how different the speed is, even informally. On my four-core machine I found the atomic version was about half the speed of the naive version.
The cost of the atomics can be attributed to two elements:

- An atomic add has a fixed penalty over a non-atomic add, even if there are no conflicting writes.
- There is a variable penalty which rises as parallel tasks tend to "fight" over access to the cache line holding particular values. This tends to get worse with more cores.
The best way of parallelising an irregular application is to transform it into a more regular form of safe parallelism. The algorithm as written combines two kinds of parallelism:

- Edge-oriented parallelism, which gives an iteration space over the edge array. This is problematic because each output neuron maps onto multiple edges, so the iterations are not independent.
- Output-oriented parallelism, where we loop over the `acc` array and map each element onto a single element of `pOut`.

If we could restructure the algorithm so that both loops mapped onto the output iteration space, then we could dispense with the locks.
Task: sketch the data-dependencies found in the `execute` function, including reads, writes, and operations. Save the graph as `results/dependency_sketch.pdf`.
You can use a very small layer with only a handful of nodes. I don't care how you do this, or what form it is in. Hand-drawn and photographed is fine as long as it is readable, and the file isn't too big (e.g. not more than a few hundred KB), and it is a pdf. I mainly want some evidence that you have thought about it, and it may give us something to talk about in the oral.
Task: draw a sketch of the "cone" of dependencies associated with one point in the output iteration space, including only those parts needed to calculate that output. So if you consider one output neuron as the "tip" of the cone, then the rest of the cone sweeps backwards to the relevant input neurons. Save the graph as `results/output_dependency_cone.pdf`.
For both these sketches you may be confused as to what exactly I'm asking for, but I'm not asking for anything specific. I certainly have an idea about what I would draw, but there are many other valid possibilities. All I want is something that captures something useful about the problem and how to solve it.
Hopefully the process of thinking about it makes it clear how to transform the problem such that you can parallelise over the output space.
Task: create a layer engine called `ClusteredLayer` which uses a single `for` loop over the outputs. Note that there will be another nested loop within the outer for loop, but the `execute` function will only directly contain one loop.

Hint: you probably want to pre-calculate information derived from the synapse vector, rather than using the synapse vector directly. What other data-structures could you build in the constructor?
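One possibility, purely as a sketch (the member and type names here, such as `Incoming` and `m_clusters`, are hypothetical, and the weight type is a guess - reuse whatever the real synapse struct provides), is to group the synapses by destination once in the constructor, so that each output iteration only reads shared data:
```
// Hypothetical members of ClusteredLayer: for each output neuron, the
// list of (weight, source) pairs that feed into it.
struct Incoming { int32_t weight; unsigned src; };
std::vector<std::vector<Incoming> > m_clusters;   // size m_nOut

// In the constructor: bucket the synapses by destination.
m_clusters.resize(m_nOut);
for(unsigned i=0; i<m_synapses.size(); i++){
    Incoming in = { m_synapses[i].weight, m_synapses[i].src };
    m_clusters[ m_synapses[i].dst ].push_back(in);
}

// In execute: a single loop over the outputs; iterations are independent.
for(unsigned i=0; i<m_nOut; i++){
    int32_t acc = 0;
    for(unsigned j=0; j<m_clusters[i].size(); j++){
        acc += (m_clusters[i][j].weight * pIn[ m_clusters[i][j].src ]) >> (23-16);
    }
    pOut[i] = sigmoid( acc );
}
```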
Task: create another engine called `ParForClusteredLayer` which adds the parallel_for loop.

The reason for doing the sequential version and then parallelising it is to make sure the logic is right. Try not to skip ahead until you're reasonably sure the sequential one works.
Use an AWS c4.8xlarge instance to evaluate the performance of the different implementations:

- `simple`
- `par_for_naive`
- `par_for_atomic`
- `clustered`
- `par_for_clustered`
Task: produce a plot exploring execution time (y-axis) against input ratio (input-layer-size / output-layer-size) for a single layer, called `results/single_layer_ratio_vs_time.pdf`.

By ratio I do mean `nIn/nOut` - the reason for varying it is that it changes the amount of work per output item, and also the amount of atomic contention.
Task: produce a plot which uses a single layer with nIn==nOut, and explores scaling of n (x-axis) versus execution time (y-axis) for each engine type, called `results/single_layer_n_vs_time.pdf`.
Your final github push before 22:00 on the 14th of Nov is considered to be your submitted version. I'll be pulling the repositories directly from there.

If you wish to submit something to blackboard, then feel free to submit the hash of the relevant commit as text.
Note: this is an optional suggested extra, as it makes sure that you have a very good handle on the OpenCL APIs. I took it out as a required element, as some people (not students) thought it was a bit complicated or time-consuming for this exercise.
Once you have the clustered version, you have an algorithm which works over the outputs with a single parallel_for loop. This means it has also been transformed into a form which will work on a GPU. However, you will need to think carefully about data-movement:

- Any data-structures describing the graph will need to be built and copied to the GPU in the constructor, otherwise there will be significant overhead in each call to `execute`.
- You can't use pointers-to-pointers or classes in the kernel, so anything like vectors of vectors or pointers to vectors will need to be turned into linear sequences of bytes (see the sketch after this list).
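As a rough illustration of the kind of flattening meant here (hypothetical names, building on the per-output `m_clusters` structure sketched earlier), a vector-of-vectors can be turned into CSR-style flat arrays, which are plain linear buffers that can be copied to the device once in the constructor:
```
#include <cstdint>
#include <vector>

// Hypothetical flattened form of the per-output clusters: three flat
// arrays that an OpenCL kernel can index directly.
struct FlatClusters {
    std::vector<uint32_t> offsets;  // size nOut+1; cluster i is [offsets[i], offsets[i+1])
    std::vector<int32_t>  weights;  // one entry per synapse
    std::vector<uint32_t> sources;  // source neuron index per synapse
};

// 'Incoming' is the hypothetical (weight, src) struct from the earlier sketch.
FlatClusters flatten(const std::vector<std::vector<Incoming> > &clusters)
{
    FlatClusters flat;
    flat.offsets.push_back(0);
    for(unsigned i=0; i<clusters.size(); i++){
        for(unsigned j=0; j<clusters[i].size(); j++){
            flat.weights.push_back(clusters[i][j].weight);
            flat.sources.push_back((uint32_t)clusters[i][j].src);
        }
        flat.offsets.push_back((uint32_t)flat.weights.size());
    }
    return flat;
}
```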
Apart from that, the process is straightforward, though it needs a lot of boiler-plate OpenCL code on the software side.
If you do this, don't expect an amazing speed-up. If you look at the way that it accesses memory then it doesn't map to a GPU very well. Another problem is that different work-items take different amounts of time...
The neuron pipeline is designed to allow different engines to be used together, so it is possible to mix and match TBB and OpenCL engines. Depending on the characteristics of each layer, it may be worth moving them to or from hardware...