Example: Basic Neural Network


This page describes design iterations from basic C code to working FPGA implementation of a simple single layer neural net inference computation (MNIST). This is meant to be an easy to understand demo, not a state of the art/optimized design.

C Code

Beginning with Andrew Carter's MNIST Neural Network in Plain C, the inference/hypothesis loop looks like so:

for (i = 0; i < MNIST_LABELS; i++) {
    activations[i] = network->b[i];
    for (j = 0; j < MNIST_IMAGE_SIZE; j++) {
        activations[i] += network->W[i][j] * PIXEL_SCALE(image->pixels[j]);
    }
}

For each label (the ten digits 0-9), an activation is initialized to a trained bias value. Then, for each pixel, the pixel value multiplied by a trained weight value is accumulated. Not shown is a softmax that occurs after this loop to select the maximally activated neuron.
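For inference purposes that final selection amounts to an argmax over the activations (softmax is monotonic, so the maximally activated neuron is the same either way). A minimal sketch of that selection, not the exact code from the original repo, looks like:

int predicted_label = 0;
for (i = 1; i < MNIST_LABELS; i++) {
    if (activations[i] > activations[predicted_label]) {
        predicted_label = i;
    }
}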

PipelineC Code

Initial C copy

A slightly expanded version of the original C code can be compiled by PipelineC:

// Loop computing activation for each label
for(i = 0; i < MNIST_LABELS; i+=1) 
{
    float b = bias[i]; // Array lookup
    activation[i] = b; // Array write
    for(j = 0; j < MNIST_IMAGE_SIZE; j+=1)
    {
      pixel_t p = pixel[j]; // Array lookup
      float scaled_pixel = (float)p * (1.0/255.0); // FP mul
      float w = weight[i][j]; // Array lookup
      float act_inc = w * scaled_pixel; // FP mul
      activation[i] = activation[i] + act_inc; // Array RMW (FP add)
    }
}

However, as one would (very slowly) find out by running the tool on that code, PipelineC unrolls all loops by default. Essentially the PipelineC tool is preparing combinatorial logic for pipelining. But in this case the fully unrolled pipeline of the above functionality is massive: MNIST_LABELS*MNIST_IMAGE_SIZE = 10*784 = 7840 inner loop iterations, meaning ex. 7840 FP multipliers for just the w * scaled_pixel inner loop operation. (Again, not shown after the above loop is a softmax-like loop that returns the i of the maximum value in the activation array.)

Because of the scale of this computation and the limited FPGA resources available, it becomes apparent that some sort of iterative approach is appropriate: dividing the computation into chunks that execute repeatedly over time on fewer resources. This is where the finite state machine style of writing PipelineC code comes in.

Basic Finite State Machines

Writing FSM style PipelineC requires, at some point, using the __clk() function to control what happens sequentially from one clock cycle to the next. See the FSM style docs for how this works, but summarizing:

If a for loop is not unrolled, then there must be some marker of where 'this clock' ends when forming the combinatorial logic of each loop iteration (otherwise combinatorial loops could result). This essentially means that somewhere in the loop body(s) there needs to be one or more __clk() calls. For example:

for(i = 0; i < MNIST_LABELS; i+=1) 
{
    float b = bias[i]; // Array lookup
    activation[i] = b; // Array write
    for(j = 0; j < MNIST_IMAGE_SIZE; j+=1)
    {
      __clk(); // Combinatorial logic dividing marker
      pixel_t p = pixel[j]; // Array lookup
      float scaled_pixel = (float)p * (1.0/255.0); // FP mul
      float w = weight[i][j]; // Array lookup
      float act_inc = w * scaled_pixel; // FP mul
      activation[i] = activation[i] + act_inc; // Array RMW (FP add)
    }
}

meaning that roughly only the bias lookup

float b = bias[i]; // Array lookup
activation[i] = b; // Array write

and the inner section of the loop

pixel_t p = pixel[j]; // Array lookup
float scaled_pixel = (float)p * (1.0/255.0); // FP mul
float w = weight[i][j]; // Array lookup
float act_inc = w * scaled_pixel; // FP mul
activation[i] = activation[i] + act_inc; // Array RMW (FP add)

(along with the not shown final maximum finding operation after the loop) are each a chunk of combinatorial logic.

However, there is yet one more step regarding memory/arrays/RAMs/etc before the design is really practical for synthesis.

Synthesis Friendly RAMs/ROMs

Summarizing the documentation on how arrays work: the weight lookup as written in basic C, weight[i][j], describes pipelineable combinatorial logic for a MNIST_LABELS*MNIST_IMAGE_SIZE = 7840 to one multiplexer, where i and j are used as the mux selection input. This is a huge chunk of logic for no particular reason; in reality this array 'memory' can be limited to a more efficient single port RAM/ROM structure.
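For intuition, here is a rough sketch (an illustration only, not actual tool output) of the kind of logic a plain C array read describes, shown for a tiny 4 entry array; the weight[i][j] lookup is the same structure but with 7840 inputs:

// Each array element is one mux input, sel picks which one is returned
float mux4(float arr[4], uint32_t sel)
{
  float rv;
  if (sel == 0)      rv = arr[0];
  else if (sel == 1) rv = arr[1];
  else if (sel == 2) rv = arr[2];
  else               rv = arr[3];
  return rv;
}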

PipelineC has built in RAM/ROM functions that are structured to map directly to RAMs/ROMs synthesis tools expect. For example, the weight lookup related code looks like:

// Flatten two indices into one dimensional array
static float weight[MNIST_LABELS*MNIST_IMAGE_SIZE] = ...;
...
uint32_t addr = i*MNIST_IMAGE_SIZE + j;
...
// RAM built in function
//   Address, write data, write enable
weight_i_j = weight_RAM_SP_RF_0(addr, 0.0, 0);
// RAM Single Port Read First 0 Cycle (comb. logic)
// Write port is unused - is actually a ROM

Doing something similar for the bias and pixel lookups makes this design practical to synthesize:

// Loop computing activation for each label
for(i = 0; i < MNIST_LABELS; i+=1) 
{
    float b = bias_RAM_SP_RF_0(i, 0.0, 0); // ROM lookup
    activation[i] = b; // Array write
    for(j = 0; j < MNIST_IMAGE_SIZE; j+=1)
    {
      __clk(); // Combinatorial logic dividing marker
      pixel_t p = pixel_RAM_SP_RF_0(j, 0, 0); // ROM lookup
      float scaled_pixel = (float)p * (1.0/255.0); // FP mul
      float w = weight_RAM_SP_RF_0(i*MNIST_IMAGE_SIZE + j, 0.0, 0); // ROM lookup
      float act_inc = w * scaled_pixel; // FP mul
      activation[i] = activation[i] + act_inc; // Array RMW (FP add)
    }
}

Do take note that the activation array was not changed into a RAM/ROM structure. All combinatorial logic (ex. the built in RAM functions) creates a new function instance per invocation/call in code. Using some activation_RAM_SP_RF_0() function multiple times, first for reads and then for later writes, does not work as expected and is not yet addressed here (see the more advanced single instance shared resources in the FSM style docs). Instead, for now, the relatively small MNIST_LABELS = 10 to one de/multiplexing logic for activation, written using basic C arrays, is fine to use.

Initial Results

Simulation

Starting up a simulation with --sim --comb --modelsim shows that each inner j loop iteration takes 1 clock cycle (see the simulation waveform figure).

Since MNIST_LABELS*MNIST_IMAGE_SIZE=7840 it will take ~7840 cycles to complete the state machine. But how long is each clock cycle?

Synthesis

When synthesized for a Xilinx xc7a35ticsg324-1l Artix 7 FPGA, the critical path limits FMAX to ~22MHz:

Source:          ...[j][0]/C 
Destination:     ...[activation][0][20]/D
Data Path Delay: 41.428ns  (logic 19.422ns (46.882%)  route 22.006ns (53.118%))
Logic Levels:    48  (CARRY4=11 DSP48E1=4 LUT1=2 LUT2=2 LUT3=1 LUT4=3 LUT5=8 LUT6=16 RAMS32=1)

Meaning that the operating frequency is most limited by the amount of logic+routing it takes to determine an activation array value based on some j value. This is very likely the long read-modify-write path activation[i] = activation[i] + act_inc; // Array RMW (FP add) in code (where act_inc depends on j).

Resources used - hello bulky FP32 in FPGA

Report Cell Usage: 
+------+----------+------+
|      |Cell      |Count |
+------+----------+------+
|1     |CARRY4    |   144|
|2     |DSP48E1   |     4|
|3     |LUT1      |    71|
|4     |LUT2      |   123|
|5     |LUT3      |   240|
|6     |LUT4      |   655|
|7     |LUT5      |   690|
|8     |LUT6      |  1291|
|9     |MUXF7     |   120|
|10    |RAM128X1S |    32|
|11    |RAM16X1S  |    40|
|12    |RAM256X1S |   984|
|13    |RAM32X1S  |    32|
|14    |FDRE      |   489|
+------+----------+------+

Rounding to a ~22MHz clock (~45 ns period), this FSM taking ~7840 cycles means the total latency of the state machine, the time to infer one MNIST image->label, is ~45 ns * 7840 = 352.8 us (~2800 frames per second).

Lower Latency

Computation can be spread out across space/time, i.e. traded between resources and latency. The lowest possible latency implementation would have the entire unrolled computation in combinatorial logic - however, this also uses prohibitively many resources to implement in hardware.

Instead, consider how the existing inner loop does a single w * scaled_pixel multiplication per j loop iteration. If each iteration instead does N times more work, i.e. N weight*pixel multiplications, then it takes N times fewer iterations of the j loop to complete the computation.

In order to do N weight*pixel multiplications in parallel you must change the code to operate on N pixels and weights at a time.

For ex. one resource intensive way to read multiple pixels at the same time:

// 2 pixels at a time with two copies of the pixel RAM
pixel_t pixel0 = pixel_RAM_SP_RF_0(j, 0, 0); // ROM lookup
pixel_t pixel1 = pixel_RAM_SP_RF_0(j+1, 0, 0); // ROM lookup
...
// loop j+=2

The above snippet shows invoking the pixel_RAM_SP_RF_0 built in function twice. This does read two pixel values in parallel, but wastefully uses two copies of the same single port RAM.

Instead, divide the memory into groups of N values, i.e. use structs that each contain N values; this way a single instance of a single port RAM can be used instead.

typedef struct n_pixels_t
{
  pixel_t data[N_PIXELS_PER_ITER];
}n_pixels_t;
...
static n_pixels_t pixel[(MNIST_IMAGE_SIZE/N_PIXELS_PER_ITER)];
...
for(j = 0; j < (MNIST_IMAGE_SIZE/N_PIXELS_PER_ITER); j+=1)
{
  ...
  // ROM lookup
  n_pixels_t pixels = pixel_RAM_SP_RF_0(j, null_pixels, 0); 
  ...
}

In the above code there are N_PIXELS_PER_ITER times fewer j loop iterations because each iteration reads N_PIXELS_PER_ITER pixel values from a single RAM instance at the same time.

A similar structure is used for the weights RAM. After having N pixels and weights read from memory a loop is used to do the N parallel weight * pixel multiplications:

// Compute N activation increments, W*Pixel in parallel
float act_increments[N_PIXELS_PER_ITER];
uint32_t n_iter;
for(n_iter=0; n_iter<N_PIXELS_PER_ITER; n_iter+=1)
{
  float scaled_pixel = (float)pixels.data[n_iter] * (1.0/255.0); // FP mul
  float act_inc = weights.data[n_iter] * scaled_pixel; // FP mul
  act_increments[n_iter] = act_inc;
}

Since there are now N activation increment values, those must be summed into the activation[i] array index similar to the original single increment activation[i] = activation[i] + act_inc.

// Sum the increments (built in binary tree function)
float act_inc_total = float_array_sum(act_increments);
// And do the final Array RMW (FP add)
activation[i] = activation[i] + act_inc_total;

The entire N-times wider code looks like:

// Loop computing activation for each label
for(i = 0; i < MNIST_LABELS; i+=1) 
{
    float b = bias_RAM_SP_RF_0(i, 0.0, 0); // ROM lookup
    activation[i] = b; // Array write
    for(j = 0; j < (MNIST_IMAGE_SIZE/N_PIXELS_PER_ITER); j+=1)
    {
      __clk(); // Combinatorial logic dividing marker
      n_pixels_t pixels = pixel_RAM_SP_RF_0(j, null_pixels, 0); // ROM lookup
      // ROM lookup
      n_floats_t weights = weight_RAM_SP_RF_0(i*(MNIST_IMAGE_SIZE/N_PIXELS_PER_ITER) + j, null_floats, 0); 
      // Compute N activation increments, W*Pixel in parallel
      float act_increments[N_PIXELS_PER_ITER];
      uint32_t n_iter;
      for(n_iter=0; n_iter<N_PIXELS_PER_ITER; n_iter+=1)
      {
        float scaled_pixel = (float)pixels.data[n_iter] * (1.0/255.0); // FP mul
        float act_inc = weights.data[n_iter] * scaled_pixel; // FP mul
        act_increments[n_iter] = act_inc;
      }
      // Sum the increments (built in binary tree function)
      float act_inc_total = float_array_sum(act_increments);
      // And do the final Array RMW (FP add)
      activation[i] = activation[i] + act_inc_total;
    }
}


Synthesis Results

Since there are now N weight*pixel multiplications existing as parallel combinatorial logic (used in each j loop iteration), it is expected that resources will scale roughly linearly with N. There are also additional addition operators used to sum the N activation increments: a binary tree via the built in float_array_sum function, which contributes N-1 more adders (so also roughly linear in N).

It's not just the amount of additional resources that is important but also the structure/dataflow of those resources: each of the N weight*pixel multiplications occurs in parallel, meaning the total delay/depth of the weight*pixel combinatorial logic for the entire for(n_iter=0; n_iter<N_PIXELS_PER_ITER; n_iter+=1) loop does not vary (in theory) with increasing N - it is always as deep/long as one multiply operation. That alone would keep the depth/critical path/FMAX of the circuit roughly the same as N increases. However, the extra addition operators used in float_array_sum do not all occur in parallel; they are arranged in a binary tree, and such a tree has a depth that grows like log2(N). This means we expect the critical path to grow / FMAX to decrease in a way that is more prominent for smaller N and less so for larger N.
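To make that concrete, here is a rough software sketch of the binary tree reduction that float_array_sum describes (an illustration assuming N is a power of two, not the actual built in implementation): each pass halves the number of values, so there are log2(N) levels of adders in series (depth) while the total adder count is N-1.

float tree_sum(float vals[], uint32_t n)
{
  uint32_t width;
  for (width = n; width > 1; width = width / 2)
  {
    uint32_t k;
    for (k = 0; k < width / 2; k += 1)
    {
      vals[k] = vals[2*k] + vals[2*k + 1]; // one adder per pair at this level
    }
  }
  return vals[0];
}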

The table below (figure) shows results from trying various N_PIXELS_PER_ITER (N) values.

In this way it is possible to lower latency by scaling up resource usage, up to as much combinatorial logic as chip resources allow. More complex designs can use advanced FSM style concepts like shared resources and multiple FSM 'threads' to exploit additional parallelism across limited shared resources.

Higher Throughput

Autopipelining

Future work can take advantage of PipelineC's built in autopipelining functionality: pipeline the N times wider logic and use it to have multiple of these j loop iterations in flight at once.

Hardware Demo

This section describes using an Arty dev board and free trial of Xilinx's Tri-Mode Ethernet MAC to do basic "real time" inference on MNIST images using the above described NN design.

This example is from a series of examples designed for the Arty Board. See that page for general instructions on using the Arty board with PipelineC generated files.

Setup

Camera

The Arty dev board does not have a camera interface built in / easily available. For that reason, an IP webcam connected to a host PC is used to stream pixels to the FPGA over a basic data-link-layer-only 10/100Mbps Ethernet connection (included on the dev board).

Ethernet

In previous work FPGA logic was developed that would allow receiving+filtering+sending of basic Ethernet frames. See 'software' Ethernet helper code in eth_sw.c that specifies the network interface to use and how to send and receive frames. The FPGA design uses a free trial of Xilinx's Tri-Mode Ethernet MAC.

In this demo, "video" of MNIST image pixels is sent to the FPGA, and a stream of inference output predictions is sent back to the host PC. Typically 'streaming video' would not be suitable for this kind of bandwidth; however, the small 8 bit grayscale MNIST image size of 28x28 pixels makes this possible.
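Some rough back of the envelope numbers (ignoring Ethernet framing overhead):

28 x 28 pixels x 8 bits = 6272 bits (784 bytes) per image
10 Mbps / 6272 bits per image ≈ ~1600 images per second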

Camera Data: Python-C Data Pipes

Accessing a webcam video stream is most easily done in a high level language like Python. However, the PipelineC basic Ethernet demo supporting software, and autogenerated helper code is all in C. For that reason, a simple Python script called video_to_pixels.py is used to read from the webcam, crop, filter, resize as needed, and then send the 28x28 MNIST image pixels over a pipe (stdout in this case) to an awaiting C program. The C program pixels_to_eth.c reads in the pixel updates from the Python stream and writes them out to the FPGA over Ethernet.

Prediction Data:

Once pixels begin flowing, the FPGA is constantly repeating its inference state machine. The output predictions are sent out over Ethernet to the host PC where a C program eth_to_prediction.c reads frames from the FPGA and prints predictions to the console.

Code Structure

The main file for this work is eth_app.c. It follows the same RX and TX main function structure (with some filtering/buffering) as earlier Ethernet work like the loopback demo, but also includes some neural network specific logic:

Input FIFO

Pixel updates are received over Ethernet (in the RX clock domain) and buffered inside of an async FIFO:

// FIFO to hold inputs buffered from the TEMAC AXIS stream
pixels_update_t inputs_fifo[16];
#include "clock_crossing/inputs_fifo.h" // Autogenerated

On the other side of that FIFO (in the user configured 'Neural Net' clock domain) resides the frame buffer RAM to hold pixels while the inference is happening.

Shared Dual Port Pixel Memory

In the original neural network code above, pixel memory was not really mentioned beyond being a RAM that pixels were read from during inference. But of course pixels need to get into this RAM somehow (weights+biases are still hard coded into ROM). For this reason the pixel memory needs a little extra work to support being written with data received over Ethernet while simultaneously being read by the neural net FSM doing its work.

First the pixel memory must be declared as being a dual port RAM (a separate read and write port):

static pixel_t pixel[MNIST_IMAGE_SIZE];
...
pixel_t rdata = pixel_RAM_DP_RF_0(raddr, waddr, wdata, we); // Dual port RAM, built in function template

Then, in order to easily share this single pixel_RAM_DP_RF_0 instance simultaneously among multiple 'calling sites' in the code (i.e. the read side vs. the write side, Ethernet data loading vs. neural net inference), the RAM can be globally wired into helper functions (design practices like wrapping a dual port RAM to be globally visible like this should be built in someday...).

// A shared single instance main function for the dual port pixel memory
// With global wires and helper functions for individual ports
// Read port
uint16_t pixel_mem_raddr;
#include "clock_crossing/pixel_mem_raddr.h" // Autogenerated
pixel_t pixel_mem_rdata;
#include "clock_crossing/pixel_mem_rdata.h" // Autogenerated
// Write port
uint16_t pixel_mem_waddr;
#include "clock_crossing/pixel_mem_waddr.h" // Autogenerated
pixel_t pixel_mem_wdata;
#include "clock_crossing/pixel_mem_wdata.h" // Autogenerated
uint1_t pixel_mem_we;
#include "clock_crossing/pixel_mem_we.h" // Autogenerated
MAIN_MHZ(shared_pixel_mem, NN_CLOCK_MHZ)
void shared_pixel_mem()
{
    static pixel_t pixel[MNIST_IMAGE_SIZE];
    // Read port
    uint16_t raddr;
    pixel_t rdata;
    // Write port
    uint16_t waddr;
    pixel_t wdata;
    uint1_t we;
    WIRE_READ(uint16_t, raddr, pixel_mem_raddr)
    WIRE_READ(uint16_t, waddr, pixel_mem_waddr)
    WIRE_READ(pixel_t, wdata, pixel_mem_wdata)
    WIRE_READ(uint1_t, we, pixel_mem_we)
    rdata = pixel_RAM_DP_RF_0(raddr, waddr, wdata, we); // Dual port RAM, built in function template
    WIRE_WRITE(pixel_t, pixel_mem_rdata, rdata)
}
void pixel_mem_write(uint16_t addr, pixel_t data, uint1_t enable)
{
    WIRE_WRITE(uint16_t, pixel_mem_waddr, addr)
    WIRE_WRITE(pixel_t, pixel_mem_wdata, data)
    WIRE_WRITE(uint1_t, pixel_mem_we, enable)
}
pixel_t pixel_mem_read(uint16_t addr)
{
    WIRE_WRITE(uint16_t, pixel_mem_raddr, addr)
    pixel_t rdata;
    WIRE_READ(pixel_t, rdata, pixel_mem_rdata)
    return rdata;
}

The pixel_mem_write helper function is used by a new pixel_writer() state machine. This state machine waits for pixel updates to arrive in the receive FIFO, reads them out, and then writes them into pixel memory.

// Logic to read from inputs fifo and use the write port to write to pixel memory
void pixel_writer()
{
    // Inf loop
    while(1)
    {
        // Wait to get pixels update from FIFO
        pixels_update_t pixels_update;
        uint1_t got_pixels_update = 0;
        while(!got_pixels_update)
        {
            // Read incoming inputs from rx_main
            inputs_fifo_read_t input_read = inputs_fifo_READ_1(1); 
            pixels_update = input_read.data[0]; 
            got_pixels_update = input_read.valid;
            __clk();   
        }

        // Then write updated pixels
        // Ex. something like
        pixel_mem_write(addr, pixels_update.pixels, 1);
        __clk();    
    }
}

This Ethernet based demo uses a slightly modified version of the original neural network code called neural_network_eth_app.c. It has been modified such that the read from pixel ROM is now replaced with use of the helper function:

pixel_t p = pixel_mem_read(j); // RAM lookup

Output FIFO

Finally, predictions from the neural network need to be queued up for transmission to the host PC. An async FIFO is used to hold 'prediction response' outputs crossing from the user 'Neural Net' clock domain to the Ethernet transmit clock domain.

// Outputs fifo of prediction responses
pred_resp_t outputs_fifo[16];
#include "clock_crossing/outputs_fifo.h" // Autogenerated

The instantiation of the neural net state machine inside neural_network_eth_app.c is wired up to repeatedly run, putting its prediction responses into the output FIFO like so:

// Run repeatedly
inference_fsm_basic_INPUT_t i;
i.input_valid = 1; 
i.output_ready = 1;
inference_fsm_basic_OUTPUT_t o = inference_fsm_basic_FSM(i);
// Write output predictions into fifo
pred_resp_t resp[1];
resp[0].pred = o.return_output;
outputs_fifo_write_t output_write = outputs_fifo_WRITE_1(resp, o.output_valid);
// (overflow ignored since writes are continuously occurring)

Running the Demo

Follow general steps for running the PipelineC tool generating VHDL and getting Arty board bitstreams built.

Build instructions for the C programs can be found at the top of each source file.

Once your IP webcam is online, the FPGA bitstream has been loaded, and the C programs are compiled: in one console begin the process that shows the video stream and sends pixels to the FPGA:

./video_to_pixels.py | ./pixels_to_eth

In another console start the process that reads predictions back from the FPGA:

./eth_to_pred

Check out the demo video on Twitter!
