Example: Basic Neural Network
This page describes design iterations from basic C code to a working FPGA implementation of a simple single-layer neural network inference computation (MNIST). This is meant to be an easy-to-understand demo, not a state-of-the-art/optimized design.
- C Code
- PipelineC Code
- Initial Simulation and Synthesis Results
- Lower Latency
- Higher Throughput
- Hardware Demo
Beginning with Andrew Carter's MNIST Neural Network in Plain C, the inference/hypothesis loop looks like so:
for (i = 0; i < MNIST_LABELS; i++) {
    activations[i] = network->b[i];
    for (j = 0; j < MNIST_IMAGE_SIZE; j++) {
        activations[i] += network->W[i][j] * PIXEL_SCALE(image->pixels[j]);
    }
}
For each label (ten digits 0-9) an activation is initialized to a trained bias value. Then for each pixel, the pixel value multiplied by a trained weight value is accumulated. Not shown is a softmax-like step that occurs after this loop to select the maximally activated neuron.
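For reference, that selection amounts to an argmax over the activation values. A minimal sketch in plain C (the function name and exact form here are assumed for illustration, not copied from the original repo):

uint8_t argmax(float activations[MNIST_LABELS])
{
    // Index of the maximally activated neuron
    uint8_t best_i = 0;
    uint8_t i;
    for (i = 1; i < MNIST_LABELS; i++) {
        if (activations[i] > activations[best_i]) {
            best_i = i;
        }
    }
    return best_i;
}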
A slightly expanded version of the original C code can be compiled by PipelineC:
// Loop computing activation for each label
for(i = 0; i < MNIST_LABELS; i+=1)
{
    float b = bias[i]; // Array lookup
    activation[i] = b; // Array write
    for(j = 0; j < MNIST_IMAGE_SIZE; j+=1)
    {
        pixel_t p = pixel[j]; // Array lookup
        float scaled_pixel = (float)p * (1.0/255.0); // FP mul
        float w = weight[i][j]; // Array lookup
        float act_inc = w * scaled_pixel; // FP mul
        activation[i] = activation[i] + act_inc; // Array RMW (FP add)
    }
}
However, as one would very slowly find out running the tool on that code: PipelineC unrolls all loops by default. Essentially the PipelineC tool is preparing combinatorial logic for pipelining. But in this case the pipeline of the entirely unrolled functionality above is massive: MNIST_LABELS*MNIST_IMAGE_SIZE = 7840 iterations. That means, for example, 7840 FP multipliers for just the w * scaled_pixel inner loop operation. (Again, not shown after the above loop is a softmax-like loop to return the i of the maximum value in the activation array.)
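For a sense of scale, the relevant MNIST dimensions (constants along the lines of those in the referenced C code) are:

#define MNIST_LABELS     10        // digits 0-9
#define MNIST_IMAGE_SIZE (28 * 28) // = 784 pixels per 28x28 grayscale image
// Fully unrolled: 10 * 784 = 7840 parallel multiply-accumulate chains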
Because of the scale of this computation and limited FPGA resources, it becomes apparent that some sort of iterative approach is appropriate: dividing the computation into chunks that execute repeatedly over time on fewer resources. This introduces the finite state machine (FSM) style of writing PipelineC code.
Writing in FSM style PipelineC requires at some point using the __clk() function to control what happens sequentially from one clock cycle to the next. See the FSM style docs for how this works, but summarizing: if a for loop is not unrolled, then there must be some marker of where 'this clock' ends when forming the combinatorial logic of each loop iteration (otherwise there could be combinatorial loops). This essentially means somewhere in the loop body(s) there needs to be one or more __clk() function calls. For example:
for(i = 0; i < MNIST_LABELS; i+=1)
{
    float b = bias[i]; // Array lookup
    activation[i] = b; // Array write
    for(j = 0; j < MNIST_IMAGE_SIZE; j+=1)
    {
        __clk(); // Combinatorial logic dividing marker
        pixel_t p = pixel[j]; // Array lookup
        float scaled_pixel = (float)p * (1.0/255.0); // FP mul
        float w = weight[i][j]; // Array lookup
        float act_inc = w * scaled_pixel; // FP mul
        activation[i] = activation[i] + act_inc; // Array RMW (FP add)
    }
}
meaning that roughly only the bias lookup

float b = bias[i]; // Array lookup
activation[i] = b; // Array write

the inner section of the loop

pixel_t p = pixel[j]; // Array lookup
float scaled_pixel = (float)p * (1.0/255.0); // FP mul
float w = weight[i][j]; // Array lookup
float act_inc = w * scaled_pixel; // FP mul
activation[i] = activation[i] + act_inc; // Array RMW (FP add)

(and the not shown final maximum finding operation after the loop) are each combinatorial logic.
However, there is yet one more step regarding memory/arrays/RAMs/etc before the design is really practical for synthesis.
Summarizing documentation on how arrays work: the weight lookup as written in basic C, weight[i][j], describes pipelineable combinatorial logic for a MNIST_LABELS*MNIST_IMAGE_SIZE = 7840 to one multiplexer, where i and j are used as the mux selection input. This is a huge chunk of logic for no particular reason; in reality this array 'memory' can be limited to a more efficient single port RAM/ROM structure.
PipelineC has built in RAM/ROM functions that are structured to map directly to RAMs/ROMs synthesis tools expect. For example, the weight lookup related code looks like:
// Flatten two indices into one dimensional array
static float weight[MNIST_LABELS*MNIST_IMAGE_SIZE] = ...;
...
uint32_t addr = i*MNIST_IMAGE_SIZE + j;
...
// RAM built in function
// Address, write data, write enable
weight_i_j = weight_RAM_SP_RF_0(addr, 0.0, 0);
// RAM Single Port Read First 0 Cycle (comb. logic)
// Write port is unused - is actually a ROM
Doing something similar for the bias and pixel lookups makes this design practical to synthesize:
// Loop computing activation for each label
for(i = 0; i < MNIST_LABELS; i+=1)
{
    float b = bias_RAM_SP_RF_0(i, 0.0, 0); // ROM lookup
    activation[i] = b; // Array write
    for(j = 0; j < MNIST_IMAGE_SIZE; j+=1)
    {
        __clk(); // Combinatorial logic dividing marker
        pixel_t p = pixel_RAM_SP_RF_0(j, 0, 0); // ROM lookup
        float scaled_pixel = (float)p * (1.0/255.0); // FP mul
        float w = weight_RAM_SP_RF_0(i*MNIST_IMAGE_SIZE + j, 0.0, 0); // ROM lookup
        float act_inc = w * scaled_pixel; // FP mul
        activation[i] = activation[i] + act_inc; // Array RMW (FP add)
    }
}
Do take note that the activation array was not changed into a RAM/ROM structure. All combinatorial logic (ex. the built in RAM functions) creates a new function instance per invocation/call in code. Using some activation_RAM_SP_RF_0() function multiple times, first for reads and then for later writes, does not work as expected / is not yet addressed here (see the more advanced single instance shared resources in the FSM style docs). Instead, for now, the relatively small activation array's MNIST_LABELS = 10 to one de/multiplexing logic written using basic C arrays is fine to use.
Starting up a simulation with --sim --comb --modelsim shows that each inner j loop iteration takes 1 clock cycle:
Since MNIST_LABELS*MNIST_IMAGE_SIZE = 7840, it will take ~7840 cycles to complete the state machine. But how long is each clock cycle?
When synthesized for a Xilinx xc7a35ticsg324-1l Artix 7 FPGA, the critical path limits the design to an FMAX of ~22MHz:
Source: ...[j][0]/C
Destination: ...[activation][0][20]/D
Data Path Delay: 41.428ns (logic 19.422ns (46.882%) route 22.006ns (53.118%))
Logic Levels: 48 (CARRY4=11 DSP48E1=4 LUT1=2 LUT2=2 LUT3=1 LUT4=3 LUT5=8 LUT6=16 RAMS32=1)
Meaning that the operating frequency is most limited by the amount of logic+routing it takes to determine an activation array value based on some j value. This is very likely the long read-modify-write path in the code (where act_inc depends on j):

activation[i] = activation[i] + act_inc; // Array RMW (FP add)
Resources used - hello bulky FP32 in FPGA:
Report Cell Usage:
+------+----------+------+
| |Cell |Count |
+------+----------+------+
|1 |CARRY4 | 144|
|2 |DSP48E1 | 4|
|3 |LUT1 | 71|
|4 |LUT2 | 123|
|5 |LUT3 | 240|
|6 |LUT4 | 655|
|7 |LUT5 | 690|
|8 |LUT6 | 1291|
|9 |MUXF7 | 120|
|10 |RAM128X1S | 32|
|11 |RAM16X1S | 40|
|12 |RAM256X1S | 984|
|13 |RAM32X1S | 32|
|14 |FDRE | 489|
+------+----------+------+
Rounding to a ~22MHz = 45 ns period clock, this FSM taking ~7840 cycles means the total latency of the state machine, the time to infer one MNIST image->label, is ~45ns * 7840 = 352.8 us (~2800 frames per second).
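A quick back-of-envelope check of those numbers:

// period  ~= 1 / 22 MHz    ~= 45 ns
// latency ~= 45 ns * 7840  ~= 352,800 ns = 352.8 us per image
// rate    ~= 1 / 352.8 us  ~= 2834 images/sec (~2800 fps)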
Computation can be traded off across space/time, i.e. resources/latency. The lowest possible latency implementation would have the entire unrolled computation in combinatorial logic - however, this would also use prohibitively many resources to implement in hardware.
Instead, consider how the existing inner loop does a single w * scaled_pixel multiplication per j loop iteration. If instead each j loop iteration does N times more work, i.e. N weight*pixel multiplications, then it would take N times fewer iterations of the j loop to complete the computation.
In order to do N weight*pixel multiplications in parallel you must change the code to operate on N pixels and weights at a time. For example, one resource intensive way to read multiple pixels at the same time:
// 2 pixels at a time with two copies of the pixel RAM
pixel_t pixel0 = pixel_RAM_SP_RF_0(j, 0, 0); // ROM lookup
pixel_t pixel1 = pixel_RAM_SP_RF_0(j+1, 0, 0); // ROM lookup
...
// loop j+=2
The above snippet invokes the pixel_RAM_SP_RF_0 built in function twice. This does read two pixel values in parallel, but wastefully uses two copies of the same single port RAM.
Instead, divide memory into groups of N values at a time, i.e. use structs that each contain N values; this way a single instance of a single port RAM can be used instead.
typedef struct n_pixels_t
{
    pixel_t data[N_PIXELS_PER_ITER];
}n_pixels_t;
...
static n_pixels_t pixel[(MNIST_IMAGE_SIZE/N_PIXELS_PER_ITER)];
...
for(j = 0; j < (MNIST_IMAGE_SIZE/N_PIXELS_PER_ITER); j+=1)
{
    ...
    // ROM lookup
    n_pixels_t pixels = pixel_RAM_SP_RF_0(j, null_pixels, 0);
    ...
}
In the above code there are N_PIXELS_PER_ITER times fewer j loop iterations because each iteration reads N_PIXELS_PER_ITER pixel values from a single RAM instance at the same time.
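For example, with N_PIXELS_PER_ITER = 4 (an illustrative value, not one prescribed above), the inner loop runs 784/4 = 196 iterations per label, so the whole FSM takes roughly 10 * 196 = 1960 cycles instead of ~7840.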
A similar structure is used for the weights RAM. After the N pixels and weights are read from memory, a loop is used to do the N parallel weight * pixel multiplications:
// Compute N activation increments, W*Pixel in parallel
float act_increments[N_PIXELS_PER_ITER];
uint32_t n_iter;
for(n_iter=0; n_iter<N_PIXELS_PER_ITER; n_iter+=1)
{
    float scaled_pixel = (float)pixels.data[n_iter] * (1.0/255.0); // FP mul
    float act_inc = weights.data[n_iter] * scaled_pixel; // FP mul
    act_increments[n_iter] = act_inc;
}
Since there are now N activation increment values, those must be summed into the activation[i] array index, similar to the original single increment activation[i] = activation[i] + act_inc.
// Sum the increments (built in binary tree function)
float act_inc_total = float_array_sum(act_increments);
// And do the final Array RMW (FP add)
activation[i] = activation[i] + act_inc_total;
The entire N-times wider code looks like:
// Loop computing activation for each label
for(i = 0; i < MNIST_LABELS; i+=1)
{
    float b = bias_RAM_SP_RF_0(i, 0.0, 0); // ROM lookup
    activation[i] = b; // Array write
    for(j = 0; j < (MNIST_IMAGE_SIZE/N_PIXELS_PER_ITER); j+=1)
    {
        __clk(); // Combinatorial logic dividing marker
        n_pixels_t pixels = pixel_RAM_SP_RF_0(j, null_pixels, 0); // ROM lookup
        // ROM lookup
        n_floats_t weights = weight_RAM_SP_RF_0(i*(MNIST_IMAGE_SIZE/N_PIXELS_PER_ITER) + j, null_floats, 0);
        // Compute N activation increments, W*Pixel in parallel
        float act_increments[N_PIXELS_PER_ITER];
        uint32_t n_iter;
        for(n_iter=0; n_iter<N_PIXELS_PER_ITER; n_iter+=1)
        {
            float scaled_pixel = (float)pixels.data[n_iter] * (1.0/255.0); // FP mul
            float act_inc = weights.data[n_iter] * scaled_pixel; // FP mul
            act_increments[n_iter] = act_inc;
        }
        // Sum the increments (built in binary tree function)
        float act_inc_total = float_array_sum(act_increments);
        // And do the final Array RMW (FP add)
        activation[i] = activation[i] + act_inc_total;
    }
}
Since there are now N weight*pixel multiplications existing as parallel combinatorial logic (used in each j loop iteration), it is expected that resources will scale roughly linearly with N. There are also additional addition operators used to sum the N activation increments: a binary tree via the built in float_array_sum function, which uses N-1 adders (so also roughly linear growth in resources) arranged in a tree of depth log2(N).
It's not just the amount of additional resources that is important but also the structure/dataflow of those resources. Each of the N weight*pixel multiplications occurs in parallel, meaning the total delay/depth of the weight*pixel combinatorial logic for the entire for(n_iter=0; n_iter<N_PIXELS_PER_ITER; n_iter+=1) loop does not vary (~in theory) with increasing N - it is always as deep/long as one multiply operation. That alone would keep the depth/critical path/FMAX of the circuit roughly the same as N increases. However, the extra addition operators used in float_array_sum do not all occur in parallel; they are arranged in a binary tree, and such a tree has a depth that grows like log2(N). This means we expect the critical path to grow / FMAX to decrease in a way that is more prominent for smaller N and less so for larger N.
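For intuition, here is the adder tree manually unrolled for N = 4 (a sketch only; float_array_sum is the actual built in function used):

// Binary adder tree for N = 4, unrolled by hand for illustration
float s01 = act_increments[0] + act_increments[1]; // level 1, in parallel
float s23 = act_increments[2] + act_increments[3]; // level 1, in parallel
float act_inc_total = s01 + s23;                   // level 2
// N-1 = 3 adders total, but only log2(4) = 2 adder delays deep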
Below is a table of results from trying various N (N_PIXELS_PER_ITER) values:
In this way it is possible to lower latency by scaling resource usage up to as much combinatorial logic as chip resources allow. More complex designs can use advanced FSM style concepts like shared resources and multiple FSM 'threads' to exploit additional parallelism across limited shared resources.
Future work can take advantage of PipelineC's built in autopipelining functionality: pipeline the N times wider logic and use it to execute multiple of these j loop iterations at once.
This section describes using an Arty dev board and free trial of Xilinx's Tri-Mode Ethernet MAC to do basic "real time" inference on MNIST images using the above described NN design.
This example is from a series of examples designed for the Arty Board. See that page for general instructions on using the Arty board with PipelineC generated files.
The Arty dev board does not have a camera interface built in/easily available. For that reason an IP Webcam connected to a host PC is used to stream pixels to the FPGA over a basic data-link layer only 10/100Mbps Ethernet connection (included on the dev board).
In previous work FPGA logic was developed that would allow receiving+filtering+sending of basic Ethernet frames. See 'software' Ethernet helper code in eth_sw.c that specifies the network interface to use and how to send and receive frames. The FPGA design uses a free trial of Xilinx's Tri-Mode Ethernet MAC.
In this demo, "video" of MNIST image pixels is sent to the FPGA, and a stream of inference output predictions is sent back to the host PC. Typically ~'streaming video' would not be suitable for this bandwidth; however, the small 8 bit grayscale MNIST image size of 28x28 pixels makes this possible.
Accessing a webcam video stream is most easily done in a high level language like Python. However, the PipelineC basic Ethernet demo supporting software, and autogenerated helper code is all in C. For that reason, a simple Python script called video_to_pixels.py is used to read from the webcam, crop, filter, resize as needed, and then send the 28x28 MNIST image pixels over a pipe (stdout in this case) to an awaiting C program. The C program pixels_to_eth.c reads in the pixel updates from the Python stream and writes them out to the FPGA over Ethernet.
Once pixels begin flowing, the FPGA is constantly repeating its inference state machine. The output predictions are sent out over Ethernet to the host PC where a C program eth_to_prediction.c reads frames from the FPGA and prints predictions to the console.
The main file for this work is eth_app.c. It follows the same RX and TX main function structure (including some filtering/buffering) as earlier Ethernet work like the loopback demo, but also includes some neural network specific logic:
Pixel updates are received over Ethernet (in the RX clock domain) and buffered inside of an async FIFO:
// FIFO to hold inputs buffered from the TEMAC AXIS stream
pixels_update_t inputs_fifo[16];
#include "clock_crossing/inputs_fifo.h" // Autogenerated
On the other side of that FIFO (in the user configured 'Neural Net' clock domain) resides the frame buffer RAM to hold pixels while the inference is happening.
In the original neural network code above, pixel memory was not really mentioned beyond being a RAM that pixels were read from during inference. But of course pixels need to get into this RAM somehow (weights+biases are still hard coded into ROM). For this reason the pixel memory needs a little extra work to support being written with data received over Ethernet while simultaneously being read by the neural net FSM doing its work.
First the pixel memory must be declared as being a dual port RAM (a separate read and write port):
static pixel_t pixel[MNIST_IMAGE_SIZE];
...
uint8_t rdata = pixel_RAM_DP_RF_0(raddr, waddr, wdata, we); // RAM lookup, built in function template
Then, in order to easily share this single pixel_RAM_DP_RF_0 instance simultaneously among multiple 'calling sites' in code (i.e. read side vs. write side, Ethernet data loading vs. neural net inference), the RAM can be globally wired into helper functions (design practices like wrapping a dual port RAM globally-visible like this should be built-in someday...):
// A shared single instance main function for the dual port pixel memory
// With global wires and helper functions for individual ports
// Read port
uint16_t pixel_mem_raddr;
#include "clock_crossing/pixel_mem_raddr.h" // Autogenerated
pixel_t pixel_mem_rdata;
#include "clock_crossing/pixel_mem_rdata.h" // Autogenerated
// Write port
uint16_t pixel_mem_waddr;
#include "clock_crossing/pixel_mem_waddr.h" // Autogenerated
pixel_t pixel_mem_wdata;
#include "clock_crossing/pixel_mem_wdata.h" // Autogenerated
uint1_t pixel_mem_we;
#include "clock_crossing/pixel_mem_we.h" // Autogenerated
MAIN_MHZ(shared_pixel_mem, NN_CLOCK_MHZ)
void shared_pixel_mem()
{
    static pixel_t pixel[MNIST_IMAGE_SIZE];
    // Read port
    uint16_t raddr;
    pixel_t rdata;
    // Write port
    uint16_t waddr;
    pixel_t wdata;
    uint1_t we;
    WIRE_READ(uint16_t, raddr, pixel_mem_raddr)
    WIRE_READ(uint16_t, waddr, pixel_mem_waddr)
    WIRE_READ(pixel_t, wdata, pixel_mem_wdata)
    WIRE_READ(uint1_t, we, pixel_mem_we)
    rdata = pixel_RAM_DP_RF_0(raddr, waddr, wdata, we); // RAM lookup, built in function template
    WIRE_WRITE(pixel_t, pixel_mem_rdata, rdata)
}
void pixel_mem_write(uint16_t addr, pixel_t data, uint1_t enable)
{
    WIRE_WRITE(uint16_t, pixel_mem_waddr, addr)
    WIRE_WRITE(pixel_t, pixel_mem_wdata, data)
    WIRE_WRITE(uint1_t, pixel_mem_we, enable)
}
pixel_t pixel_mem_read(uint16_t addr)
{
    WIRE_WRITE(uint16_t, pixel_mem_raddr, addr)
    pixel_t rdata;
    WIRE_READ(pixel_t, rdata, pixel_mem_rdata)
    return rdata;
}
The pixel_mem_write helper function is used by a new pixel_writer() state machine. This state machine waits for pixel updates to arrive in the receive FIFO, reads them out, and then writes them into pixel memory.
// Logic to read from inputs fifo and use the write port to write to pixel memory
void pixel_writer()
{
    // Inf loop
    while(1)
    {
        // Wait to get pixels update from FIFO
        pixels_update_t pixels_update;
        uint1_t got_pixels_update = 0;
        while(!got_pixels_update)
        {
            // Read incoming inputs from rx_main
            inputs_fifo_read_t input_read = inputs_fifo_READ_1(1);
            pixels_update = input_read.data[0];
            got_pixels_update = input_read.valid;
            __clk();
        }
        // Then write updated pixels
        // Ex. something like
        pixel_mem_write(addr, pixels_update.pixels, 1);
        __clk();
    }
}
This Ethernet based demo uses a slightly modified version of the original neural network code called neural_network_eth_app.c. It has been modified such that the read from pixel ROM is now replaced with use of the helper function:
pixel_t p = pixel_mem_read(j); // RAM lookup
Finally, predictions from the neural network need to be queued up for transmission to the host PC. An async FIFO is used to hold 'prediction response' outputs crossing from the user 'Neural Net' clock domain to the Ethernet transmit clock domain.
// Outputs fifo of prediction responses
pred_resp_t outputs_fifo[16];
#include "clock_crossing/outputs_fifo.h" // Autogenerated
The instantiation of the neural net state machine inside neural_network_eth_app.c is wired up to repeatedly run, putting its prediction responses into the output FIFO like so:
// Run repeatedly
inference_fsm_basic_INPUT_t i;
i.input_valid = 1;
i.output_ready = 1;
inference_fsm_basic_OUTPUT_t o = inference_fsm_basic_FSM(i);
// Write output predictions into fifo
pred_resp_t resp[1];
resp[0].pred = o.return_output;
outputs_fifo_write_t output_write = outputs_fifo_WRITE_1(resp, o.output_valid);
// (overflow ignored since writes are continuously occurring)
Follow the general steps for running the PipelineC tool, generating VHDL, and getting Arty board bitstreams built.
Build instructions for the C programs can be found at the top of each source file.
Once your IP webcam is online, the FPGA bitstream has been loaded, and the C programs are compiled: in one console begin the process that shows the video stream and sends pixels to the FPGA:
./video_to_pixels.py | ./pixels_to_eth
In another console start the process that reads predictions back from the FPGA:
./eth_to_pred