
This page describes creating a RISC-V CPU in PipelineC.

This is followed by accelerating a 'Game of Life' demo, doing more in raw hardware (mostly derived finite state machines) and less as compiled RISC-V CPU instructions.

Finally, lessons learned from the Game of Life demo about how to easily connect PipelineC hardware to the RISC-V CPU are summarized.

CPU Design Intent

  • Single cycle: pipelining is not an initial focus
    • PipelineC's autopipelining applies to feed-forward dataflow in pure functions; it does not apply to CPU computations with writeback/feedback signals, which would require specialized manual stall logic for each pipeline stage added.
  • 0 latency LUTRAM memories used
    • BRAMs aren't used yet since they typically have >=1 clock cycle of read latency
  • Shared instruction and data memory
    • Keeps linker script simple
  • Only partial RV32I instructions implemented
    • As much as was needed for demos
  • Primary focus is on integration of PipelineC hardware modules with software C functions

Top Level

The single cycle of execution is contained within the MAIN function risc_v() in risc-v.c:

// Empty start
uint32_t risc_v()
{
  static uint32_t pc = 0;
  ...
  // Dummy output the current
  // program counter value
  return pc;
}

Memory and Register File RAMs

In versions preceding what is documented here a few inefficient things were done. Using basic C arrays, the data memory and register file were implemented as simple collections of registers/signals + many muxes. This is highly resource inefficient because it lacks the proper structure for the synthesis tool to infer register files or LUT/block RAMs. The root of the problem comes from the register file, where the write-back data is not known until after the addresses for register reads have been supplied. See more details on this issue here.
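For illustration, the problematic naive approach looks roughly like the following sketch (illustrative, not the actual earlier source):

// Naive register file: a plain C array of registers
static uint32_t reg_file[32];
// Reads happen first...
uint32_t rd_data1 = reg_file[decoded.src1];
uint32_t rd_data2 = reg_file[decoded.src2];
// ...but the write-back data is computed from those reads,
// so the write lands at the end of the same cycle:
if(decoded.reg_wr){
  reg_file[decoded.dest] = wr_data;
}
// Synthesis sees individual registers + wide read/write muxes
// instead of inferring a LUTRAM register file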

The below code uses helper functions from reg_file.h and mem_decl.h. These modules essentially make one RAM module with multiple ports look like one or more separate modules/functions that can be used in different places (working around any feedback/apparent dependency ordering).

The register file is exposed as three helper functions, two read ports reg_read0 and reg_read1, and one write port reg_write:

// Register file reads, from port0 and port1
uint32_t rd_data1 = reg_read0(decoded.src1);
uint32_t rd_data2 = reg_read1(decoded.src2);
// Second half of reg file
// Reg file write, assemble inputs
reg_file_wr_in_t reg_wr_inputs;
reg_wr_inputs.wr_en = decoded.reg_wr;
reg_wr_inputs.wr_addr = decoded.dest;
// Determine data to write back
reg_wr_inputs.wr_data = ...;
reg_write(reg_wr_inputs);

The shared data+instruction memory is exposed as a read only port for instructions inst_read and a read|write port for general loads and stores mem_read_write:

// Read instruction at PC
uint32_t inst = inst_read(pc>>2);
// Memory stage, assemble inputs
mem_rw_in_t mem_rw_in;
mem_rw_in.addr = exe.result;
mem_rw_in.wr_data = rd_data2;
mem_rw_in.wr_en = decoded.mem_wr;
// Actual connection to memory read+write port
uint32_t data_mem_rd_val = mem_read_write(mem_rw_in);

Logical Stages (single cycle, not pipelined)

The body of that MAIN risc_v() C function maps directly to the classic CPU pipeline stages found in computer architecture books (though this design is not pipelined).

Program Counter

// Program counter
static uint32_t pc = 0;
uint32_t pc_plus4 = pc + 4;
printf("PC = 0x%X\n", pc);

Fetch Instruction

// Read instruction at PC
uint32_t inst = inst_read(pc>>2);
printf("Instruction: 0x%X\n", inst);

Decode Instruction

// Decode the instruction to control signals
decoded_t decoded = decode(inst);

Register File Read

// Register file reads, from port0 and port1
uint32_t rd_data1 = reg_read0(decoded.src1);
uint32_t rd_data2 = reg_read1(decoded.src2);
if(decoded.print_rs1_read){
  printf("Read RegFile[%d] = %d\n", decoded.src1, rd_data1);
}
if(decoded.print_rs2_read){
  printf("Read RegFile[%d] = %d\n", decoded.src2, rd_data2);
}

Execute / ALU

// Execute stage
execute_t exe = execute(pc, pc_plus4, decoded, rd_data1, rd_data2);

Memory (Read or Write)

// Memory stage, assemble inputs
mem_rw_in_t mem_rw_in;
mem_rw_in.addr = exe.result; // TEMP addr always from execute module
mem_rw_in.wr_data = rd_data2;
mem_rw_in.wr_en = decoded.mem_wr;
// Actual connection to memory read+write port
uint32_t data_mem_rd_val = mem_read_write(mem_rw_in);
if(decoded.mem_wr){
  printf("Write Mem[0x%X] = %d\n", mem_rw_in.addr, mem_rw_in.wr_data);
}
if(decoded.mem_rd){
  printf("Read Mem[0x%X] = %d\n", mem_rw_in.addr, data_mem_rd_val);
}  

Write Back to Registers

// Second half of reg file
// Reg file write, assemble inputs
reg_file_wr_in_t reg_wr_inputs;
reg_wr_inputs.wr_en = decoded.reg_wr;
reg_wr_inputs.wr_addr = decoded.dest;
// Determine data to write back
if(decoded.mem_to_reg){
  printf("Write RegFile: MemRd->Reg...\n");
  reg_wr_inputs.wr_data = data_mem_rd_val;
}else if(decoded.pc_plus4_to_reg){
  printf("Write RegFile: PC+4->Reg...\n");
  reg_wr_inputs.wr_data = pc_plus4;
}else{
  if(decoded.reg_wr)
    printf("Write RegFile: Execute Result->Reg...\n");
  reg_wr_inputs.wr_data = exe.result;
}
if(decoded.reg_wr){
  printf("Write RegFile[%d] = %d\n", decoded.dest, reg_wr_inputs.wr_data);
}
// Actual connection to reg file write port
reg_write(reg_wr_inputs);

Branch / Jump PC Logic

// Branch/Increment PC
if(decoded.exe_to_pc){
  printf("Next PC: Execute Result = 0x%X...\n", exe.result);
  pc = exe.result;
}else{
  // Default next pc
  printf("Next PC: Default = 0x%X...\n", pc_plus4);
  pc = pc_plus4;
}

How to build C -> RISC-V binary

Much of what is inside the gcc_test directory was so happily shown to me by the fantastic microrv32 project.

In particular, the make flow drives elf2bin.py, which converts standard Executable and Linkable Format (ELF) files to the initialization data needed inside the shared instruction and data memory.

Produce a mem_init.h file using the make flow that compiles main.c:

# From inside gcc_test examples dir
export PATH=$PATH:/opt/riscv/bin # your local RISC-V toolchain install
make hex
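
The generated mem_init.h defines the MEM_INIT value used when declaring the memory (shown in the next section). Roughly, it can be pictured like the following sketch (illustrative only; the actual contents are generated from the ELF by elf2bin.py):

// mem_init.h (illustrative sketch of the generated file)
#define MEM_INIT { \
  0x00000093, /* word 0: instruction at address 0x0 */ \
  0x00000113, /* word 1: instruction at address 0x4 */ \
  /* ...one uint32_t per 32b word of instruction+data memory... */ \
}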

How to build PipelineC -> RISC-V CPU hardware

The gcc_test/mem_init.h memory initialization header from the software C compile is used when declaring the shared data and instruction memory in mem_decl.h:

// Combined instruction and data memory initialized from gcc compile
#define MEM_SIZE_IN_BYTES 2048
#define MEM_NUM_WORDS (MEM_SIZE_IN_BYTES/4)
#include "gcc_test/mem_init.h"
// Need a RAM with one read port for instructions, one r/w port for data mem
DECL_RAM_DP_RW_R_0(
  uint32_t,
  the_mem,
  MEM_NUM_WORDS,
  MEM_INIT
)

Simulation

# From inside gcc_test examples dir
# Have main.c set up to use the fibonacci demo.
make hex # Compile the main.c code into CPU memory initialization mem_init.h
cd ../../../ # Into PipelineC repo
# Build the CPU hardware and use cocotb to launch a GHDL simulation
./src/pipelinec ./examples/risc-v/risc-v.c --comb --sim --cocotb --out_dir ~/pipelinec_output --makefile ./examples/risc-v/cocotb/Makefile
Clock: 348
PC = 0x000001B8
Next PC: Default = 0x000001BC...
Instruction: 0x00E7A023
SW: addr = r15 + 0, mem[addr] <- r14 
Read RegFile[15] = 268435456
Read RegFile[14] = 5
SW: addr = 268435456 + 0 = 268435456, mem[0x10000000] <- 5 
Write Mem[0x10000000] = 5
CPU Halted: returned 5
   348.50ns INFO     cocotb.regression                  run_cpu passed
   348.50ns INFO     cocotb.regression                  **************************************************************************************
                                                        ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                                        **************************************************************************************
                                                        ** test_top.run_cpu               PASS         348.50           2.90        120.00  **
                                                        **************************************************************************************
                                                        ** TESTS=1 PASS=1 FAIL=0 SKIP=0                348.50           3.36        103.71  **
                                                        **************************************************************************************

Synthesis Results

The design can be synthesized for specific FPGA targets by specifying a #pragma PART. For example, when synthesized for a Xilinx 7 Series Artix part (ex. #pragma PART "xc7a100tcsg324-1", with instruction+data memory size = 2048 bytes):

./src/pipelinec ./examples/risc-v/risc-v.c --comb
+------+----------+------+
|      |Cell      |Count |
+------+----------+------+
|1     |CARRY4    |    74|
|2     |LUT1      |     2|
|3     |LUT2      |   133|
|4     |LUT3      |    93|
|5     |LUT4      |   282|
|6     |LUT5      |   127|
|7     |LUT6      |   347|
|8     |RAM128X1D |   128|
|9     |RAM32M    |    12|
|10    |FDRE      |   142|
+------+----------+------+
Data Path Delay: 14.028ns  (logic 3.909ns (27.866%)  route 10.119ns (72.134%))
Logic Levels: 19  (CARRY4=3 LUT2=1 LUT4=4 LUT5=2 LUT6=7 MUXF7=1 RAMD64E=1)
Source:      risc_v_...pc_reg...              
Destination: mem_...the_ram_reg...
~FMAX: 68.306 MHz

Basic Memory Mapped Peripherals

In C code it is possible to mark a variable as residing at a specific address in data memory:

// The output/stop/halt peripheral
#define RETURN_OUTPUT_ADDR 0x10000000
static volatile uint32_t* RETURN_OUTPUT = (uint32_t*)RETURN_OUTPUT_ADDR;

The above example defines an address that is used similar to return x;: doing *RETURN_OUTPUT = x; is set up to stop the simulator and pass the return value on for special handling.
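
In C code this is just an ordinary store; for example (matching the "CPU Halted: returned 5" simulation output shown earlier):

// Acts like 'return 5;' for the whole program:
// the simulation prints the value and stops
*RETURN_OUTPUT = 5;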

These specially mapped addresses are handled inside the load-store memory unit code mem_map.c, specifically in the mem_map_module used inside the main mem module.

mem_map_module has a module IO that looks like so:

typedef struct mem_map_out_t
{
  uint1_t addr_is_mapped;
  uint32_t rd_data;
}mem_map_out_t;
mem_map_out_t mem_map_module(
  uint32_t addr,
  uint32_t wr_data,
  uint1_t wr_en)

It takes an address and write signals; it outputs whether the address was mapped to a peripheral, and any read data from that address.

Inside the mem module mem_map_module is used before the regular data mem RAM instantiation:

typedef struct mem_out_t
{
  uint32_t inst_read;
  uint32_t mem_read_write;
}mem_out_t;
typedef struct mem_rw_in_t{
  uint32_t addr;
  uint32_t wr_data;
  uint1_t wr_en;
}mem_rw_in_t;
mem_out_t mem(
  uint32_t rd_addr,
  mem_rw_in_t mem_rw
){
  // Check for write to memory mapped IO addresses
  mem_map_out_t mem_map_out = mem_map_module(
    mem_rw.addr,
    mem_rw.wr_data,
    mem_rw.wr_en);

  // Mem map write does not write actual RAM memory
  mem_rw.wr_en &= !mem_map_out.addr_is_mapped;

  ...etc

The normal write enable into data memory is gated off by this ~bypass to any memory mapped IO peripherals (since these addresses don't have to reside in real existing data memory ranges). Similarly, not shown in the snippet, the output read data can come from the data memory or from the peripheral if the address is mapped.
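
The read side can be pictured as a simple mux (a sketch; the ram_rd_data signal name is illustrative):

// Sketch of read data selection (signal names illustrative)
uint32_t rd_data = mem_map_out.addr_is_mapped ?
  mem_map_out.rd_data : // from the peripheral
  ram_rd_data; // from the actual data memory RAM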

Blinking LEDs demo

Four LEDs were memory mapped.

// LEDs
#define LEDS_ADDR 0x10000004
static volatile uint32_t* LEDS = (uint32_t*)LEDS_ADDR;

With this branch/entry added to the CPU's mem_map_module:

#include "leds/leds_port.c" // for 'leds'
...
  if(addr==LEDS_ADDR){
    // LEDs
    o.addr_is_mapped = 1;
    o.rd_data = leds;
    if(wr_en){
      leds = wr_data;
    }
  }

Finally, in C code (blink.c) you can blink the LEDs like so:

#include "../mem_map.h"
void main() {
  int count = 0;
  while(1){
     // 4b leds get slow changing upper bits
     *LEDS = count >> 22;
     count += 1;
  }
}
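
For a rough sense of the blink rate (assumed numbers, not measured): LED bit 0 is bit 22 of count, which changes every 2^22 ≈ 4.2 million increments; with the loop body taking a handful of instructions per increment on a CPU running at tens of MHz, that works out to a visible change roughly every half second.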

Game of Life Demo

The Game of Life demo is essentially copy-pasted example C code for the familiar core functionality:

// Demo application that plays Conway's Game of Life
// https://www.geeksforgeeks.org/program-for-conways-game-of-life/
// Returns the count of alive neighbours around r,c
int32_t count_live_neighbour_cells(int32_t r, int32_t c);
// Game of Life logic to determine if cell at x,y lives or dies
int32_t cell_next_state(int32_t x, int32_t y);

This implementation uses a single frame buffer and two line buffers to do the Game of Life computation. A snippet from main.c:

    for(y=2; y<FRAME_HEIGHT; y+=1)
    { 
      // Write line is 2 lines delayed
      y_write = y - 2;
      for(x=0; x<FRAME_WIDTH; x+=1)
      {
        cell_alive_next = line_buf_read(y_minus_2_line_sel, x);
        frame_buf_write(x, y_write, cell_alive_next);
      }

      // Use now available y_minus_2_line_sel line buffer
      // to store next state from current reads
      for(x=0; x<FRAME_WIDTH; x+=1)
      {
        cell_alive_next = cell_next_state(x, y);
        line_buf_write(y_minus_2_line_sel, x, cell_alive_next);
      }

      // Swap buffers (flip sel bits)
      y_minus_2_line_sel = y_minus_2_line_sel==0 ? 1 : 0;
      y_minus_1_line_sel = y_minus_1_line_sel==0 ? 1 : 0;
    }

Frame Buffer

The Game of Life demo uses a grid of 640x480 black and white pixels. frame_buffer.c contains the RAM that holds those pixels flattened into a 640*480=307200 element array.

// Helper function to flatten 2D x,y positions into 1D RAM addresses
uint32_t pos_to_addr(uint16_t x, uint16_t y)
{
  return (x*FRAME_HEIGHT) + y;
}

Hardware

Inside of frame_buffer.c the MAIN frame_buf_function() instance is used to create VGA timing signals and connect to the pixel RAM for reading pixels to display. This design is very similar to past examples done with VGA graphics.

  // VGA timing for fixed resolution
  vga_signals_t vga_signals = vga_timing();

  // Convert VGA timing position
  // and application x,y pos to RAM addresses
  uint32_t frame_buffer_addr = pos_to_addr(frame_buffer_x_in, frame_buffer_y_in);
  uint32_t vga_addr = pos_to_addr(vga_signals.pos.x, vga_signals.pos.y);

  // Do RAM lookup
  // First port is for user application, is read+write
  // Second port is read only for the frame buffer vga display
  frame_buf_ram_outputs_t frame_buf_outputs
    = frame_buf_ram(frame_buffer_addr, 
                         frame_buffer_wr_data_in, 
                         frame_buffer_wr_enable_in,
                         vga_addr);

Most important is the interface exposed for the CPU's memory map to use this hardware. The user supplies an x,y position and a single write pixel + enable bit, and as an output receives the pixel read from that position.

// Frame buffer application/user global wires for use once anywhere in code
//  Inputs
uint1_t frame_buffer_wr_data_in;
uint16_t frame_buffer_x_in;
uint16_t frame_buffer_y_in;
uint1_t frame_buffer_wr_enable_in; // 0=read
//  Outputs
uint1_t frame_buffer_rd_data_out;

These global wires can be used from anywhere else in the design.
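
For example, another part of the design could do a write through the wires like so (an illustrative sketch; each global wire can only be driven from one place in the design):

// Illustrative: driving the frame buffer port wires from elsewhere
frame_buffer_x_in = 320;
frame_buffer_y_in = 240;
frame_buffer_wr_data_in = 1; // pixel value to write
frame_buffer_wr_enable_in = 1; // 1=write (0=read)
// With the zero latency LUTRAM the read data out wire
// frame_buffer_rd_data_out is valid in the same cycle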

Memory Map

Inside the CPU's mem_map.c mem_map_module() the frame buffer entry looks like:

  static uint1_t frame_buffer_wr_data_in_reg;
  static uint16_t frame_buffer_x_in_reg;
  static uint16_t frame_buffer_y_in_reg;
  static uint1_t frame_buffer_wr_enable_in_reg;
  static uint1_t frame_buffer_rd_data_out_reg;
  ...
  // Connect frame buffer inputs from registers for better fmax
  frame_buffer_wr_data_in = frame_buffer_wr_data_in_reg;
  frame_buffer_x_in = frame_buffer_x_in_reg;
  frame_buffer_y_in = frame_buffer_y_in_reg;
  frame_buffer_wr_enable_in = frame_buffer_wr_enable_in_reg;
  ...
  // Defaults for single cycle pulses
  frame_buffer_wr_enable_in_reg = 0;
  ...
  else if(addr==FRAME_BUF_X_ADDR){
    // Frame buffer x
    o.addr_is_mapped = 1;
    o.rd_data = frame_buffer_x_in_reg;
    if(wr_en){
      frame_buffer_x_in_reg = wr_data;
    }
  }else if(addr==FRAME_BUF_Y_ADDR){
    // Frame buffer y
    o.addr_is_mapped = 1;
    o.rd_data = frame_buffer_y_in_reg;
    if(wr_en){
      frame_buffer_y_in_reg = wr_data;
    }
  }else if(addr==FRAME_BUF_DATA_ADDR){
    // Frame buffer data
    o.addr_is_mapped = 1;
    o.rd_data = frame_buffer_rd_data_out_reg;
    frame_buffer_wr_enable_in_reg = wr_en;
    if(wr_en){
      frame_buffer_wr_data_in_reg = wr_data;
    }
  }

Notice the declaration of the static _reg register variables. These break the path to+from the frame buffer, ensuring that the CPU is only limited by its own critical single cycle path from PC->Writeback.

Software

The user is expected to load the x,y values and then write or read a single pixel of data. The application C code from frame_buffer.h looks like so:

void frame_buf_write(int32_t x, int32_t y, int32_t wr_data){
  *FRAME_BUF_X = x;
  *FRAME_BUF_Y = y;
  *FRAME_BUF_DATA = wr_data;
}
int32_t frame_buf_read(int32_t x, int32_t y){
  *FRAME_BUF_X = x;
  *FRAME_BUF_Y = y;
  return *FRAME_BUF_DATA;
}

Given the above memory mapped interface, address values are written, and then a few cycles later the read/write data is used. The additional two cycles of latency that occurs by having frame buffer input and output registers in the memory map is not an issue since there is a delay between inputs (store word instruction) and outputs (load word instruction) of at least a few cycles already.

This frame buffer code uses entire 32b words for single bit pixel values. It's possible to pack the 640*480=307200 one bit pixels = 38400 bytes = 9600 32b words into data memory, and write/read words of 32 pixels|bits at a time through more software coordination. But this work is intentionally kept simple, not optimized for CPUs specifically; further custom hardware accelerators are the main goal of these experiments.
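
For reference, such packing would look roughly like this hypothetical sketch (not implemented in this demo; mem_word_read is a made-up helper):

// Hypothetical packed-pixel addressing (not used here)
uint32_t pixel_index = pos_to_addr(x, y); // 0..307199
uint32_t word_addr = pixel_index >> 5; // /32 -> 9600 word addresses
uint32_t bit_index = pixel_index & 31; // bit position within 32b word
uint32_t word = mem_word_read(word_addr); // made-up word read helper
uint32_t pixel = (word >> bit_index) & 1; // extract the single pixel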

Finally - the Game of Life demo also includes two more small RAMs, one for each of the two lines buffered while computing the next frame. These line buffers are used instead of having two full frame buffers. The code/interfaces/etc for using these line buffers look just like those for the frame buffer RAM for the most part.

Zero Cycle Latency LUTRAM

At first, similar to the CPU register file and memory, a zero latency LUTRAM was used for the single bit pixel frame buffer, depth = 640*480 = 307200.

Additional input and output registers were added to the signals to and from the frame buffer LUTRAM. These registers were defined where the RAM signals are used, in mem_map.c (also leaving less to deal with inside frame_buf_function), and look like so:

  // In+out registers for frame buffer wires, for better fmax
  static frame_buffer_inputs_t cpu_frame_buffer_in_reg;
  static frame_buffer_outputs_t cpu_frame_buffer_out_reg;
...
  // Connect frame buffer inputs from registers for better fmax
  cpu_frame_buffer_in = cpu_frame_buffer_in_reg;
  // Some defaults for single cycle pulses
  cpu_frame_buffer_in_reg.valid = 0;
...
  // Connect frame buffer outputs to registers for better fmax
  cpu_frame_buffer_out_reg = cpu_frame_buffer_out;
  line_bufs_out_reg = line_bufs_out;

As synthesized for a Xilinx 7 Series Artix part (ex. #pragma PART "xc7a100tcsg324-1", with instruction+data LUTRAM = 2048 bytes):

Source:          top_inst/risc_v.../pc
Destination:     reg_file.../..the_ram
Data Path Delay: 16.419ns  (logic 2.895ns (17.632%)  route 13.524ns (82.368%))
Logic Levels:    16  (LUT3=1 LUT4=2 LUT6=11 MUXF7=1 RAMD64E=1)

With an FMAX ~= 55MHz, rendering was a slow ~3.5 seconds per Game of Life frame.

+------+----------+------+
|      |Cell      |Count |
+------+----------+------+
|1     |CARRY4    |   153|
|2     |DSP48E1   |     2|
|3     |LUT1      |   105|
|4     |LUT2      |   483|
|5     |LUT3      |   252|
|6     |LUT4      |   560|
|7     |LUT5      |   366|
|8     |LUT6      |  4312|
|9     |MUXF7     |   607|
|10    |MUXF8     |   297|
|11    |RAM128X1D |  2528|
|12    |RAM128X1S |     2|
|13    |RAM16X1D  |     6|
|14    |RAM256X1S |     4|
|15    |RAM32M    |    31|
|16    |FDRE      |   837|
|17    |FDSE      |     6|
+------+----------+------+

Extra Frame Buffer Ports

The frame buffer started with just two ports: 1) a read+write port for the CPU and 2) a read-only port for VGA continuous pixel streaming. One easy way to make room for another user of the RAM is to add a third port (a more complicated way involves multiplexing a few ports at a faster clock rate to mimic more ports).

Similar to the other ports connected to the frame buffer RAM frame_buf_ram, a third port was added to the frame_buf_function in frame_buffer.c.

// Global wires connected to frame buffer extra ports
frame_buffer_inputs_t frame_buffer_extra_in_ports[FRAME_BUFFER_NUM_EXTRA_PORTS];
frame_buffer_outputs_t frame_buffer_extra_out_ports[FRAME_BUFFER_NUM_EXTRA_PORTS];
...
  // Do RAM lookups
  // First port is for user CPU application, is read+write
  // Second port is read only for the frame buffer vga display
  // Third+ ports are generic extra ports
  frame_buf_ram_outputs_t ram_outputs
    = frame_buf_ram(// Port0
                    ...
                    // Port1   
                    ...
                    // Port2, first extra port
                    frame_buffer_extra_in_ports[0].addr,
                    frame_buffer_extra_in_ports[0].wr_en,
                    frame_buffer_extra_in_ports[0].wr_data,
                    frame_buffer_extra_in_ports[0].valid
                    );

Internally, to create this triple port RAM, the synthesis tool is likely creating a duplicate copy of the RAM for the extra port. Because of this, the frame buffer as zero latency LUTRAM is no longer practical to fit in the medium sized demo Artix 7 100t part.

One Cycle Latency Block RAM

The dataflow through the frame_buf_function in frame_buffer.c was originally written for a zero latency frame buffer RAM lookup.

To accommodate the extra cycles of delay through the frame buffer when FRAME_BUFFER_LATENCY>=1, additional registers are added to align input signals with the output from the RAM. Ex. the address+valid bit is delayed to align with the output read data from the RAM.

  #if FRAME_BUFFER_LATENCY > 0
  // Delay input signals to match ram output latency
  static frame_buffer_inputs_t cpu_signals_delay_regs[FRAME_BUFFER_LATENCY] ;
  frame_buffer_inputs_t cpu_signals_into_delay = cpu_signals;
  // Drive signals from end/output delay regs
  cpu_signals = cpu_signals_delay_regs[FRAME_BUFFER_LATENCY-1];
  // Shift array up to make room, and then put new at buf[0]
  ARRAY_SHIFT_UP(cpu_signals_delay_regs, FRAME_BUFFER_LATENCY, 1)
  cpu_signals_delay_regs[0] = cpu_signals_into_delay;
  #endif

Finally, when FRAME_BUFFER_LATENCY=1 the total latency for the CPU to access the frame buffer is FRAME_BUFFER_LATENCY + 2 cycles of input+output registers = 3 CPU cycles. Luckily this is still no longer than the few cycles that already elapse between inputs (store word instruction) and outputs (load word instruction) when using the memory mapped frame_buf_write|read implementations in frame_buffer.h - so no changes to C code for extra stalling are needed.

Frame Buffer from FSMs

A derived state machine PipelineC function called frame_buf_read_write was written inside frame_buffer.c. It uses the frame_buffer_extra_in|out_ports global wires connected to the frame buffer's extra RAM ports in order to do reads+writes from derived FSM style code.

#define FRAME_BUFFER_RW_FSM_STYLE_PORT 0
#define frame_buf_read_write_in_port frame_buffer_extra_in_ports[FRAME_BUFFER_RW_FSM_STYLE_PORT]
#define frame_buf_read_write_out_port frame_buffer_extra_out_ports[FRAME_BUFFER_RW_FSM_STYLE_PORT]
// Helper FSM func for read/writing the frame buffer using global wires
uint1_t frame_buf_read_write(uint16_t x, uint16_t y, uint1_t wr_data, uint1_t wr_en)
{
  // Start transaction by signalling valid inputs into RAM for a cycle
  frame_buf_read_write_in_port.valid = 1;
  frame_buf_read_write_in_port.wr_data = wr_data;
  frame_buf_read_write_in_port.x = x;
  frame_buf_read_write_in_port.y = y;
  frame_buf_read_write_in_port.wr_en = wr_en;
  // (0 latency data is available now)
  __clk();
  // Cleared in the next cycle
  frame_buf_read_write_in_port.valid = 0;

  // Now is 1 cycle latency, maybe need extra cycle 
  #if FRAME_BUFFER_LATENCY == 2
  __clk();
  #endif

  // Finally account for additional IO regs
  #ifdef FRAME_BUFFER_EXTRA_PORTS_HAVE_IO_REGS
  __clk(); // In delay
  __clk(); // Out delay
  #endif

  // Data is ready now
  return frame_buf_read_write_out_port.rd_data;
}
// Derived fsm from func, there can only be a single instance of it
#include "frame_buf_read_write_SINGLE_INST.h"
// Use macros to help avoid nested multiple instances of function by inlining
#define frame_buf_read(x, y) frame_buf_read_write(x, y, 0, 0)
#define frame_buf_write(x, y, wr_data) frame_buf_read_write(x, y, wr_data, 1)

CPU Clock vs Frame Buffer Clock

The frame buffer was run at the 480p pixel clock = 25MHz. To keep the initial design simple, the CPU was also run on the pixel clock. However, as the design gets more complex, the CPU may not always want to run on the pixel clock - likely wanting to run faster. So the design was modified to allow for any arbitrarily slow pixel clock rate: a faster-than-pixel-clock clock is used for the CPU and frame buffer RAM, allowing memory bandwidth to be as fast as the CPU FMAX allows.

Because the CPU clock will now be faster than pixel clock, a simple asynchronous FIFO clock crossing is used to buffer pixels that have been read in advance and are waiting to be displayed at pixel clock rate.

VGA timing is generated and pixel RAM lookups occur as they did before, except this time running on a faster clock. The result of the RAM lookup is not immediately streamed out for display, but is instead written into an ASYNC FIFO. The read side of the FIFO, on the slow pixel clock, is constantly reading a stream of pixels to drive the display.

The code for this setup can be found in vga/vga_pmod_async_fifo.c, which replaced the simple vga/vga_pmod.c.

Counting Neighbor Cells

Using extra frame buffer ports, accelerators specific to the Game of Life algorithm can be added. Counting live neighbor cells is the first unit of functionality to move into hardware. Reading from the frame buffer is made possible via the FSM function frame_buf_read_write discussed above.

Derived Hardware FSM

int32_t count_live_neighbour_cells(int32_t r, int32_t c){
  // Software implementation (which can also be used to derive HW FSM)
  // https://www.geeksforgeeks.org/program-for-conways-game-of-life/
  int32_t i, j;
  int32_t count=0;
  for(i=r-1; i<=r+1; i+=1){
      for(j=c-1;j<=c+1;j+=1){
          if(!( ((i==r) and (j==c)) or ((i<0) or (j<0)) 
             or ((i>=FRAME_WIDTH) or (j>=FRAME_HEIGHT)))){
            int32_t cell_alive = frame_buf_read(i, j);
            if(cell_alive==1){
              count+=1;
            }
          }
      }
  }
  return count;
}

The above code from the Game of Life main.c file can be compiled and run by the CPU, but it can also be used to derive a hardware state machine. Currently in count_neighbors_hw.c there is a bit of ugly wrapper code needed to make a stand-alone instance of a derived state machine function:

// Hooks to derive hardware FSMs from C code
// Derived fsm from func
#include "count_live_neighbour_cells_FSM.h"
// Wrap up main FSM as top level
#pragma MAIN count_live_neighbour_cells_wrapper
// With global wires for IO
count_live_neighbour_cells_INPUT_t count_neighbors_fsm_in;
count_live_neighbour_cells_OUTPUT_t count_neighbors_fsm_out;
void count_live_neighbour_cells_wrapper()
{
  count_neighbors_fsm_out = count_live_neighbour_cells_FSM(count_neighbors_fsm_in);
}

Global wires count_neighbors_fsm_in|out are presented as the interface to+from the hardware instance of the count_live_neighbour_cells state machine.

Memory Map

For convenience, input and output struct types were added to count_neighbors_hw.h:

typedef struct count_neighbors_hw_in_t{
  int32_t valid;
  // Inputs
  int32_t x;
  int32_t y;
}count_neighbors_hw_in_t;
typedef struct count_neighbors_hw_out_t{
  int32_t valid;
  // Outputs
  int32_t count;
}count_neighbors_hw_out_t;

A variable of each type is memory mapped to allow the CPU access to the counting neighbors hardware module inputs and outputs:

// Memory map variables
static volatile count_neighbors_hw_in_t* COUNT_NEIGHBORS_HW_IN 
    = (count_neighbors_hw_in_t*)COUNT_NEIGHBORS_HW_IN_ADDR;
static volatile count_neighbors_hw_out_t* COUNT_NEIGHBORS_HW_OUT 
    = (count_neighbors_hw_out_t*)COUNT_NEIGHBORS_HW_OUT_ADDR;

The CPU's memory map module reads+writes those count_neighbors_hw_in|out_t structs via conversions to-from memory byte arrays:

else if( (addr>=COUNT_NEIGHBORS_HW_IN_ADDR) & (addr<(COUNT_NEIGHBORS_HW_IN_ADDR+sizeof(count_neighbors_hw_in_t))) ){
    o.addr_is_mapped = 1;
    // Convert to bytes
    count_neighbors_hw_in_t_bytes_t count_neighbors_hw_in_bytes
      = count_neighbors_hw_in_t_to_bytes(count_neighbors_in_reg);
    uint32_t bytes_offset = addr - COUNT_NEIGHBORS_HW_IN_ADDR;
    // Assemble rd data bytes
    uint32_t i;
    uint8_t rd_bytes[4];
    for(i=0;i<4;i+=1){
      rd_bytes[i] = count_neighbors_hw_in_bytes.data[bytes_offset+i];
    }
    o.rd_data = uint8_array4_le(rd_bytes);
    // Drive write bytes
    for(i=0;i<4;i+=1){
      if(wr_byte_ens[i]){
        count_neighbors_hw_in_bytes.data[bytes_offset+i] = wr_data >> (i*8);
      }
    }
    // Convert back to type
    count_neighbors_in_reg = bytes_to_count_neighbors_hw_in_t(count_neighbors_hw_in_bytes);
...
   // Same for output signals too

Memory mapped registers are connected to the hardware FSM module instance inputs and outputs using the globally visible count_neighbors_fsm_in|out wires from count_neighbors_hw.c. These connections include auto-clearing valid input bits so software can be sure inputs were received, and can know when outputs are ready without confusing them with results from previous invocations of the FSM/function.

  static count_neighbors_hw_in_t count_neighbors_in_reg;
  static count_neighbors_hw_out_t count_neighbors_out_reg;
  count_neighbors_fsm_in.input_valid = count_neighbors_in_reg.valid;
  count_neighbors_fsm_in.r = count_neighbors_in_reg.x;
  count_neighbors_fsm_in.c = count_neighbors_in_reg.y;
  // Clear input to fsm once accepted ready
  if(count_neighbors_fsm_out.input_ready){
    count_neighbors_in_reg.valid = 0;
  }
...
  // Connect outputs from fsm to regs
  count_neighbors_fsm_in.output_ready = 1;
  if(count_neighbors_fsm_in.input_valid & count_neighbors_fsm_out.input_ready){
    // Starting fsm clears output regs valid bit
    count_neighbors_out_reg.valid = 0;
  }else if(count_neighbors_fsm_out.output_valid){
    // Output from FSM updated return values and sets valid flag
    count_neighbors_out_reg.count = count_neighbors_fsm_out.return_output;
    count_neighbors_out_reg.valid = 1;
  }

Finally inside count_neighbors_hw.c the CPU's hardware accelerated version of the count_live_neighbour_cells function using memory mapped registers now looks like:

// Software access to hardware count neighbors fsm
int32_t count_live_neighbour_cells(int32_t r, int32_t c){
  // Set all input values
  COUNT_NEIGHBORS_HW_IN->x = r;
  COUNT_NEIGHBORS_HW_IN->y = c;
  // Set valid bit
  COUNT_NEIGHBORS_HW_IN->valid = 1;
  // (optionally wait for valid to be cleared indicating started work)
  // Then wait for valid output
  while(COUNT_NEIGHBORS_HW_OUT->valid==0){}
  // Read output
  return COUNT_NEIGHBORS_HW_OUT->count;
}

Results

Resources:

+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |   295|
|2     |DSP48E1    |     1|
|3     |DSP48E1_1  |     2|
|4     |LUT1       |   117|
|5     |LUT2       |   635|
|6     |LUT3       |   620|
|7     |LUT4       |   777|
|8     |LUT5       |   455|
|9     |LUT6       |  1928|
|10    |MUXF7      |    64|
|11    |RAM128X1D  |   128|
|12    |RAM16X1D   |     6|
|13    |RAM32M     |    31|
|14    |RAMB18E1   |     2|
|15    |RAMB36E1   |     2|
|16    |RAMB36E1_1 |     2|
|17    |RAMB36E1_2 |     2|
|18    |RAMB36E1_3 |     2|
|19    |RAMB36E1_4 |     2|
|20    |RAMB36E1_5 |     2|
|21    |RAMB36E1_6 |     2|
|22    |RAMB36E1_7 |     2|
|23    |RAMB36E1_8 |     2|
|24    |RAMB36E1_9 |     2|
|25    |FDRE       |  1334|
|26    |FDSE       |     5|
+------+-----------+------+

Timing:

Source:           top_inst/risc_v...pc
Destination:      top_inst/reg_file...the_ram
Data Path Delay:  20.103ns  (logic 5.290ns (26.314%)  route 14.813ns (73.686%))
Logic Levels:     25  (CARRY4=7 LUT2=5 LUT3=2 LUT4=1 LUT6=8 MUXF7=1 RAMD64E=1)

The additional memory mapping logic causes a slight slowdown in the CPU write-back path, FMAX ~= 48MHz. Even with the slightly slower FMAX, the count_live_neighbour_cells derived hardware FSM is more efficient than doing the same tasks (ex. frame buffer reads+counting) using CPU instructions, and so rendering is accelerated a bit more, reaching ~1 Game of Life frame per second.

Computing Cell Next State

Similar to counting living neighbor cells, the C function for determining a cell's next state can be used to derive a hardware state machine (it will invoke the same counting neighbors state machine as was used prior). See the snippet below from the Game of Life main.c file:

// Game of Life logic to determine if cell at x,y lives or dies
int32_t cell_next_state(int32_t x, int32_t y)
{
  int32_t neighbour_live_cells = count_live_neighbour_cells(x, y);
  int32_t cell_alive = frame_buf_read(x, y);
  int32_t cell_alive_next;
  if((cell_alive==1) and ((neighbour_live_cells==2) or (neighbour_live_cells==3))){
      cell_alive_next=1;
  }else if((cell_alive==0) and (neighbour_live_cells==3)){
      cell_alive_next=1;
  }else{
      cell_alive_next=0;
  }
  return cell_alive_next;
}

The FSM style wrapper code in cell_next_state_hw.c indicates to the PipelineC tool to derive a top level state machine from a particular C function:

// Hooks to derive hardware FSMs from C code
// Derived fsm from func
#include "cell_next_state_FSM.h"
// Wrap up main FSM as top level
FSM_MAIN_IO_WRAPPER(cell_next_state)

The FSM_MAIN_IO_WRAPPER macro from mem_map.h instantiates a top level single instance of the derived FSM and creates global wires cell_next_state_fsm_in|out that are made available for use by the CPU memory map.

Memory Map

First, types are defined for the function inputs and outputs in cell_next_state_hw.h:

typedef struct cell_next_state_hw_in_t{
  FSM_IN_TYPE_FIELDS_2INPUTS(int32_t, x, int32_t, y)
}cell_next_state_hw_in_t;
typedef struct cell_next_state_hw_out_t{
  FSM_OUT_TYPE_FIELDS // is_alive
}cell_next_state_hw_out_t;

The FSM_IN_TYPE_FIELDS_2INPUTS and FSM_OUT_TYPE_FIELDS helper macros are from mem_map.h and help create valid signals used as part of handshaking with derived FSMs.
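
Based on the hand-written count_neighbors struct types shown earlier, the expansion can be pictured like the following sketch (presumed; the exact output field name is the macro's choice, shown as is_alive per the comment):

// Presumed expansion (compare to the count_neighbors structs above)
typedef struct cell_next_state_hw_in_t{
  int32_t valid;
  // Inputs
  int32_t x;
  int32_t y;
}cell_next_state_hw_in_t;
typedef struct cell_next_state_hw_out_t{
  int32_t valid;
  // Output (the function return value)
  int32_t is_alive;
}cell_next_state_hw_out_t;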

Then the pair of input and output variables have their memory mapped addresses defined in cell_next_state_hw's mem_map.h:

#define CELL_NEXT_STATE_HW_IN_ADDR CELL_NEXT_STATE_MEM_MAP_BASE_ADDR
#define CELL_NEXT_STATE_HW_OUT_ADDR (CELL_NEXT_STATE_HW_IN_ADDR + sizeof(cell_next_state_hw_in_t))
#define CELL_NEXT_STATE_MEM_MAP_NEXT_ADDR (CELL_NEXT_STATE_HW_OUT_ADDR + sizeof(cell_next_state_hw_out_t))
FSM_IO_MEM_MAP_VARS_DECL(
  CELL_NEXT_STATE_HW, 
  cell_next_state_hw_out_t, CELL_NEXT_STATE_HW_OUT_ADDR, 
  cell_next_state_hw_in_t, CELL_NEXT_STATE_HW_IN_ADDR)

The FSM_IO_MEM_MAP_VARS_DECL helper macro is from the CPU's mem_map.h and declares input and output static volatile variables with fixed memory addresses.

In the CPU's mem_map.c the input and output connections to-from the hardware FSM are connected to the memory mapped addresses:

  FSM_IO_REGS_DECL_2INPUTS(
    cell_next_state, cell_next_state_hw_out_t,
    cell_next_state_hw_in_t, x, y)
...
  FSM_IN_REG_STRUCT_MM_ENTRY(CELL_NEXT_STATE_HW_IN_ADDR, cell_next_state_hw_in_t, cell_next_state)
  FSM_OUT_REG_STRUCT_MM_ENTRY(CELL_NEXT_STATE_HW_OUT_ADDR, cell_next_state_hw_out_t, cell_next_state)
...
  FSM_OUT_REG(cell_next_state)

The FSM_IO_REGS_DECL_2INPUTS, FSM_IN|OUT_REG_STRUCT_MM_ENTRY, FSM_OUT_REG helper macros are defined in the CPU's mem_map.h and contain derived FSM specific helper functionality related to IO registers.

The CPU now uses the accelerated version of the next state function as shown in cell_next_state_hw.c:

int32_t cell_next_state(int32_t x, int32_t y){
  FSM_MEM_MAP_VARS_FUNC_BODY_2INPUTS(CELL_NEXT_STATE_HW, x, y)
}

The FSM_MEM_MAP_VARS_FUNC_BODY_2INPUTS macro is defined in the CPU's mem_map.h and inserts the code needed to use memory mapped input and output variables to complete the function call using hardware resources.
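
Expanded, the function body presumably does the same polling shown earlier for count_live_neighbour_cells (a sketch; exact field names are the macros' choice):

// Presumed expansion (compare to the count_neighbors version above)
int32_t cell_next_state(int32_t x, int32_t y){
  // Set all input values
  CELL_NEXT_STATE_HW_IN->x = x;
  CELL_NEXT_STATE_HW_IN->y = y;
  // Set valid bit
  CELL_NEXT_STATE_HW_IN->valid = 1;
  // Then wait for valid output
  while(CELL_NEXT_STATE_HW_OUT->valid==0){}
  // Read output
  return CELL_NEXT_STATE_HW_OUT->is_alive;
}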

Finally, after setting the configuration as needed in hw_config.h and hw.c, the design can be synthesized.

Results

Resources:

+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |   295|
|2     |DSP48E1    |     1|
|3     |DSP48E1_1  |     2|
|4     |LUT1       |   119|
|5     |LUT2       |   629|
|6     |LUT3       |   469|
|7     |LUT4       |   850|
|8     |LUT5       |   547|
|9     |LUT6       |  1917|
|10    |MUXF7      |    48|
|11    |RAM128X1D  |   128|
|12    |RAM16X1D   |     6|
|13    |RAM32M     |    31|
|14    |RAMB18E1   |     2|
|15    |RAMB36E1   |     2|
|16    |RAMB36E1_1 |     2|
|17    |RAMB36E1_2 |     2|
|18    |RAMB36E1_3 |     2|
|19    |RAMB36E1_4 |     2|
|20    |RAMB36E1_5 |     2|
|21    |RAMB36E1_6 |     2|
|22    |RAMB36E1_7 |     2|
|23    |RAMB36E1_8 |     2|
|24    |RAMB36E1_9 |     2|
|25    |FDRE       |  1470|
|26    |FDSE       |     5|
+------+-----------+------+

Timing:

Source:          riscv...pc
Destination:     reg_file...the_ram
Data Path Delay: 18.810ns  (logic 4.825ns (25.651%)  route 13.985ns (74.349%))
Logic Levels:    20  (CARRY4=2 LUT1=1 LUT3=1 LUT4=2 LUT5=1 LUT6=10 MUXF7=2 RAMD64E=1)

The operating frequency was roughly the same as before, FMAX ~= 48MHz. Moving the cell next state computation into a hardware FSM increases the frame rate a little from what was seen prior, but it is still around a second per frame: Game of Life renders at ~1 FPS.

Frame Buffer + Cell Next State + Line Buffers

The Game of Life main.c main function uses a single frame buffer and two line buffers. Looping over each line in the frame, two common operations are repeated:

Compute next state for cell and write to line buffer:

int32_t cell_alive_next = cell_next_state(frame_x, frame_y);
line_buf_write(line_sel, frame_x, cell_alive_next);

Read from line buffer, write data to frame buffer:

int32_t rd_data = line_buf_read(line_sel, frame_x);
frame_buf_write(frame_x, frame_y, rd_data);

Those operations were wrapped into a single helper function called next_state_buf_rw:

next_state_buf_rw_out_t next_state_buf_rw(next_state_buf_rw_in_t inputs)
{
  next_state_buf_rw_out_t outputs;
  if(inputs.op_sel==NEXT_STATE_LINE_WRITE){
    int32_t cell_alive_next = cell_next_state(inputs.frame_x, inputs.frame_y);
    line_buf_write(inputs.line_sel, inputs.frame_x, cell_alive_next);
  }else{ // LINE_READ_FRAME_WRITE
    int32_t rd_data = line_buf_read(inputs.line_sel, inputs.frame_x);
    frame_buf_write(inputs.frame_x, inputs.frame_y, rd_data);
  }
  return outputs;
}

Using design patterns like having a single inputs struct and a single outputs struct, along with helper macros from the CPU's mem_map.h, the steps for moving the function into hardware (as seen in the single file next_state_buf_rw.c) are summarized as follows (for more verbose details see earlier sections above):

  1. Define the input struct type and the output struct type
  2. Use the input and output structs to write the function
  3. Use macros to tag the function as deriving a hardware state machine
  4. Use macros to define memory mapped variables for the input and output struct, and a version of the function that uses those instead of executing on the CPU

Then inside the CPU's mem_map.c mem_map_module() these steps finalize the connection to+from the CPU and hardware module with input+output registers (see the sketch after this list):

  1. FSM_IO_REGS_DECL macro for declaring IO registers, connecting inputs
  2. FSM_IN|OUT_REG_STRUCT_MM_ENTRY macros to connect to the load|store write|read logic
  3. FSM_OUT_REG macro for using output registers
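
For next_state_buf_rw that presumably reduces to entries like the following sketch (the address macro names here are illustrative):

// Sketch following the cell_next_state pattern shown earlier
FSM_IO_REGS_DECL(next_state_buf_rw)
...
FSM_IN_REG_STRUCT_MM_ENTRY(NEXT_STATE_BUF_RW_HW_IN_ADDR, next_state_buf_rw_in_t, next_state_buf_rw)
FSM_OUT_REG_STRUCT_MM_ENTRY(NEXT_STATE_BUF_RW_HW_OUT_ADDR, next_state_buf_rw_out_t, next_state_buf_rw)
...
FSM_OUT_REG(next_state_buf_rw)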

The overall system configuration (what functions to use and accelerate, etc) is configured in hw_config.h.

Results

Resources:

+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |   279|
|2     |DSP48E1    |     1|
|3     |DSP48E1_1  |     1|
|4     |LUT1       |   122|
|5     |LUT2       |   734|
|6     |LUT3       |   447|
|7     |LUT4       |   925|
|8     |LUT5       |   879|
|9     |LUT6       |  1721|
|10    |MUXF7      |    64|
|11    |MUXF8      |    32|
|12    |RAM128X1D  |   128|
|13    |RAM16X1D   |     6|
|14    |RAM32M     |    31|
|15    |RAMB18E1   |     2|
|16    |RAMB36E1   |     1|
|17    |RAMB36E1_1 |     1|
|18    |RAMB36E1_2 |     1|
|19    |RAMB36E1_3 |     1|
|20    |RAMB36E1_4 |     1|
|21    |RAMB36E1_5 |     1|
|22    |RAMB36E1_6 |     1|
|23    |RAMB36E1_7 |     1|
|24    |RAMB36E1_8 |     1|
|25    |RAMB36E1_9 |     1|
|26    |FDRE       |  1638|
|27    |FDSE       |     5|
+------+-----------+------+

Timing:

Source:          next_state_buf_rw...FSM_STATE_reg
Destination:     next_state_buf_rw....FSM_STATE_reg
Data Path Delay: 18.719ns  (logic 6.493ns (34.687%)  route 12.226ns (65.313%))
Logic Levels:    30  (CARRY4=11 LUT1=1 LUT3=2 LUT4=4 LUT5=3 LUT6=9)

Similar to before, the FMAX is around ~48MHz and there appears to be a slight increase in FPS, but still ~1 Game of Life FPS. It is good to note that the critical path is no longer through the single cycle CPU PC->write back, but instead in the derived next_state_buf_rw state machine.

Multiple FSM Threads

The above section discussed implementing a derived finite state machine version of the next_state_buf_rw function ("compute next state for cell and write to line buffer" or "read from line buffer, write data to frame buffer").

In this section that FSM is instantiated multiple times to parallelize the work across the FRAME_WIDTH.

Using helper macros and some specific design patterns makes connecting the accelerator hardware to the CPU easy (for more verbose details see earlier sections above).

Specifically in multi_next_state_buf_rw.c this version ~fakes the interface of a single derived multi_next_state_buf_rw FSM function. However, it is not actually a derived FSM as its implementation is plain hardware description that instantiates multiple copies of the original next_state_buf_rw derived FSM.

multi_next_state_buf_rw_INPUT_t multi_next_state_buf_rw_fsm_in;
multi_next_state_buf_rw_OUTPUT_t multi_next_state_buf_rw_fsm_out;
#pragma MAIN multi_wrapper
void multi_wrapper()
{
  // IO for each FSM instance
  next_state_buf_rw_INPUT_t fsm_in[NUM_THREADS];
  next_state_buf_rw_OUTPUT_t fsm_out[NUM_THREADS];

  // Inputs are only valid if all FSM instances ready to start
  uint1_t all_fsms_input_ready;
  // Value isn't known until 'after', is feedback
  #pragma FEEDBACK all_fsms_input_ready
  
  // Only ready for output once all are done and valid
  uint1_t all_fsms_valid_output;
  // Value isn't known until 'after', is feedback
  #pragma FEEDBACK all_fsms_valid_output

  // The N instances of FSM with input valid ready gated
  uint32_t i;
  for (i = 0; i < NUM_THREADS; i+=1)
  {
    fsm_in[i].input_valid = all_fsms_input_ready & multi_next_state_buf_rw_fsm_in.input_valid;
    fsm_in[i].output_ready = all_fsms_valid_output & multi_next_state_buf_rw_fsm_in.output_ready;
    fsm_in[i].inputs = multi_next_state_buf_rw_fsm_in.inputs.next_state_buf_rw_inputs;
    fsm_in[i].inputs.frame_x = multi_next_state_buf_rw_fsm_in.inputs.next_state_buf_rw_inputs.frame_x + i;
    fsm_out[i] = next_state_buf_rw_FSM(fsm_in[i]);
  }

  // Were all FSMs ready/valid? (feedback)
  all_fsms_input_ready = 1;
  all_fsms_valid_output = 1;
  for (i = 0; i < NUM_THREADS; i+=1)
  {
    all_fsms_input_ready &= fsm_out[i].input_ready;
    all_fsms_valid_output &= fsm_out[i].output_valid;
  }

  // Gated output valid ready
  multi_next_state_buf_rw_fsm_out.output_valid = all_fsms_valid_output;
  multi_next_state_buf_rw_fsm_out.input_ready = all_fsms_input_ready;
}

As before the CPU's mem_map.c mem_map_module() was updated to connect the accelerator hardware IO registers. And finally hw_config.h sets the configuration definitions as needed.

Results

2 Threads

Resources:

+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |   387|
|2     |DSP48E1    |     1|
|3     |DSP48E1_1  |     1|
|4     |LUT1       |   127|
|5     |LUT2       |   867|
|6     |LUT3       |   832|
|7     |LUT4       |  1206|
|8     |LUT5       |  1184|
|9     |LUT6       |  2314|
|10    |MUXF7      |    96|
|11    |MUXF8      |    32|
|12    |RAM128X1D  |   128|
|13    |RAM16X1D   |     6|
|14    |RAM32M     |    31|
|15    |RAMB18E1   |     2|
|16    |RAMB36E1   |     1|
|17    |RAMB36E1_1 |     1|
|18    |RAMB36E1_2 |     1|
|19    |RAMB36E1_3 |     1|
|20    |RAMB36E1_4 |     1|
|21    |RAMB36E1_5 |     1|
|22    |RAMB36E1_6 |     1|
|23    |RAMB36E1_7 |     1|
|24    |RAMB36E1_8 |     1|
|25    |RAMB36E1_9 |     1|
|26    |FDRE       |  2192|
|27    |FDSE       |     5|
+------+-----------+------+

Timing:

Source:           multi_wrapper.../...next_state_buf_rw...0/FSM_STATE_reg                          
Destination:      multi_wrapper.../..next_state_buf_rw...0/FUNC_CALL_RETURN_FSM_STATE
Data Path Delay:  23.128ns  (logic 5.356ns (23.158%)  route 17.772ns (76.842%))
Logic Levels:     30  (CARRY4=4 LUT2=1 LUT3=3 LUT4=6 LUT5=8 LUT6=8)

~FMAX=40MHz, 1.5FPS

4 Threads

Resources:

+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |   600|
|2     |DSP48E1    |     1|
|3     |DSP48E1_1  |     1|
|4     |LUT1       |   142|
|5     |LUT2       |  1396|
|6     |LUT3       |  1315|
|7     |LUT4       |  1599|
|8     |LUT5       |  1990|
|9     |LUT6       |  2906|
|10    |MUXF7      |   130|
|11    |MUXF8      |    32|
|12    |RAM128X1D  |   128|
|13    |RAM16X1D   |     6|
|14    |RAM32M     |    31|
|15    |RAMB18E1   |     2|
|16    |RAMB36E1   |     1|
|17    |RAMB36E1_1 |     1|
|18    |RAMB36E1_2 |     1|
|19    |RAMB36E1_3 |     1|
|20    |RAMB36E1_4 |     1|
|21    |RAMB36E1_5 |     1|
|22    |RAMB36E1_6 |     1|
|23    |RAMB36E1_7 |     1|
|24    |RAMB36E1_8 |     1|
|25    |RAMB36E1_9 |     1|
|26    |FDRE       |  3236|
|27    |FDSE       |     5|
+------+-----------+------+

Timing:

Source:           multi_wrapper.../...next_state_buf_rw_...1/FSM_STATE_reg                 
Destination:      mem_SPLIT2.../mem.../mem_map_module.../multi_next_state_buf_rw_out_reg
Data Path Delay:  26.562ns  (logic 5.878ns (22.129%)  route 20.684ns (77.871%))
Logic Levels:     34  (CARRY4=4 LUT2=1 LUT3=3 LUT4=7 LUT5=8 LUT6=11)

FMAX ~= 35MHz, ~2FPS

8 Threads

Resources:

+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |  1025|
|2     |DSP48E1    |     1|
|3     |DSP48E1_1  |     1|
|4     |LUT1       |   163|
|5     |LUT2       |  2460|
|6     |LUT3       |  2406|
|7     |LUT4       |  2387|
|8     |LUT5       |  3415|
|9     |LUT6       |  4606|
|10    |MUXF7      |   237|
|11    |MUXF8      |    65|
|12    |RAM128X1D  |   128|
|13    |RAM16X1D   |     6|
|14    |RAM32M     |    31|
|15    |RAMB18E1   |     2|
|16    |RAMB36E1   |     1|
|17    |RAMB36E1_1 |     1|
|18    |RAMB36E1_2 |     1|
|19    |RAMB36E1_3 |     1|
|20    |RAMB36E1_4 |     1|
|21    |RAMB36E1_5 |     1|
|22    |RAMB36E1_6 |     1|
|23    |RAMB36E1_7 |     1|
|24    |RAMB36E1_8 |     1|
|25    |RAMB36E1_9 |     1|
|26    |FDRE       |  5320|
|27    |FDSE       |     5|
+------+-----------+------+

Timing:

Source:           multi_wrapper.../next_state_buf_rw...2/FSM_STATE_reg
Destination:      mem_SPLIT2.../mem.../mem_map_module.../multi_next_state_buf_rw_out_reg
Data Path Delay:  29.279ns  (logic 6.681ns (22.818%)  route 22.598ns (77.182%))
Logic Levels:     40  (CARRY4=7 LUT2=1 LUT3=2 LUT4=6 LUT5=8 LUT6=16)

FMAX ~= 32MHz, ~2.8FPS

16 Threads

Resources

+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |  1875|
|2     |DSP48E1    |     1|
|3     |DSP48E1_1  |     1|
|4     |LUT1       |   208|
|5     |LUT2       |  4609|
|6     |LUT3       |  4549|
|7     |LUT4       |  3859|
|8     |LUT5       |  6212|
|9     |LUT6       |  8159|
|10    |MUXF7      |   440|
|11    |MUXF8      |   153|
|12    |RAM128X1D  |   128|
|13    |RAM16X1D   |     6|
|14    |RAM32M     |    31|
|15    |RAMB18E1   |     2|
|16    |RAMB36E1   |     1|
|17    |RAMB36E1_1 |     1|
|18    |RAMB36E1_2 |     1|
|19    |RAMB36E1_3 |     1|
|20    |RAMB36E1_4 |     1|
|21    |RAMB36E1_5 |     1|
|22    |RAMB36E1_6 |     1|
|23    |RAMB36E1_7 |     1|
|24    |RAMB36E1_8 |     1|
|25    |RAMB36E1_9 |     1|
|26    |FDRE       |  9484|
|27    |FDSE       |     5|
+------+-----------+------+

Timing

Source:           multi_wrapper.../next_state_buf_rw...5/FSM_STATE_reg
Destination:      mem_SPLIT2.../mem.../mem_map_module.../multi_next_state_buf_rw_out_reg
Data Path Delay:  31.072ns  (logic 7.223ns (23.246%)  route 23.849ns (76.754%))
Logic Levels:     42  (CARRY4=7 LUT2=1 LUT3=2 LUT4=6 LUT5=9 LUT6=16 MUXF7=1)

FMAX ~30MHz, ~3.5FPS

16 Threads w/ Extra Registers

Noticing that in the above configurations the critical path is to the multi-thread wrapper module output register multi_next_state_buf_rw_out_reg, some extra registers can be added to break the path and increase fmax. Inside the multi_wrapper() function in multi_next_state_buf_rw.c this means registering the inputs+outputs of each copy of the next_state_buf_rw derived FSM (in addition to the registering that's done for the entire multi_wrapper() instance inputs+outputs in the memory map connections):

Resources:

+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |  1875|
|2     |DSP48E1    |     1|
|3     |DSP48E1_1  |     1|
|4     |LUT1       |   226|
|5     |LUT2       |  4739|
|6     |LUT3       |  4569|
|7     |LUT4       |  3490|
|8     |LUT5       |  4455|
|9     |LUT6       |  9674|
|10    |MUXF7      |   440|
|11    |MUXF8      |   153|
|12    |RAM128X1D  |   128|
|13    |RAM16X1D   |     6|
|14    |RAM32M     |    31|
|15    |RAMB18E1   |     2|
|16    |RAMB36E1   |     1|
|17    |RAMB36E1_1 |     1|
|18    |RAMB36E1_2 |     1|
|19    |RAMB36E1_3 |     1|
|20    |RAMB36E1_4 |     1|
|21    |RAMB36E1_5 |     1|
|22    |RAMB36E1_6 |     1|
|23    |RAMB36E1_7 |     1|
|24    |RAMB36E1_8 |     1|
|25    |RAMB36E1_9 |     1|
|26    |FDRE       | 10096|
|27    |FDSE       |     5|
+------+-----------+------+

Timing:

Source:           multi_wrapper.../next_state_buf_rw...4/FSM_STATE_reg
Destination:      multi_wrapper.../next_state_buf_rw...14/cell_next_state.../count_neighbors.../FSM_STATE_reg
Data Path Delay:  27.258ns  (logic 7.736ns (28.381%)  route 19.522ns (71.619%))
Logic Levels:     40  (CARRY4=10 LUT2=2 LUT4=5 LUT5=7 LUT6=15 MUXF7=1)

FMAX ~= 35MHz, ~3.6FPS

16 Threads w/ Extra Clock Cycles

The critical paths have been inside the next_state_buf_rw derived state machines. It is possible to increase the operating frequency of FSMs by placing __clk() extra clock cycles to break up long stretches of combinatorial logic. Extra cycles were added like so:

next_state_buf_rw_out_t next_state_buf_rw(next_state_buf_rw_in_t inputs)
{
  next_state_buf_rw_out_t outputs;
  if(inputs.op_sel==NEXT_STATE_LINE_WRITE){
    __clk();
    int32_t cell_alive_next = cell_next_state(inputs.frame_x, inputs.frame_y);
    line_buf_write(inputs.line_sel, inputs.frame_x, cell_alive_next);  
  }else{ // LINE_READ_FRAME_WRITE
    __clk();
    int32_t rd_data = line_buf_read(inputs.line_sel, inputs.frame_x);
    frame_buf_write(inputs.frame_x, inputs.frame_y, rd_data);
  }
  return outputs;
}

Resources

+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |  1875|
|2     |DSP48E1    |     1|
|3     |DSP48E1_1  |     1|
|4     |LUT1       |   210|
|5     |LUT2       |  4849|
|6     |LUT3       |  4565|
|7     |LUT4       |  4118|
|8     |LUT5       |  4616|
|9     |LUT6       |  9871|
|10    |MUXF7      |   440|
|11    |MUXF8      |   153|
|12    |RAM128X1D  |   128|
|13    |RAM16X1D   |     6|
|14    |RAM32M     |    31|
|15    |RAMB18E1   |     2|
|16    |RAMB36E1   |     1|
|17    |RAMB36E1_1 |     1|
|18    |RAMB36E1_2 |     1|
|19    |RAMB36E1_3 |     1|
|20    |RAMB36E1_4 |     1|
|21    |RAMB36E1_5 |     1|
|22    |RAMB36E1_6 |     1|
|23    |RAMB36E1_7 |     1|
|24    |RAMB36E1_8 |     1|
|25    |RAMB36E1_9 |     1|
|26    |FDRE       | 10106|
|27    |FDSE       |     5|
+------+-----------+------+

Timing:

Source:         multi_wrapper.../fsm_in_reg                         
Destination:     multi_wrapper.../next_state_buf_rw...14/cell_next_state../count_neighbors../FSM_STATE_reg
Data Path Delay: 23.306ns  (logic 6.016ns (25.813%)  route 17.290ns (74.187%))
Logic Levels:    32  (CARRY4=7 LUT2=2 LUT4=3 LUT5=6 LUT6=13 MUXF7=1)

FMAX ~= 40MHz, ~4.4FPS

What to accelerate in hardware next?

The above series of steps describes taking a C code version of the 'Game of Life' executing on a single cycle RISC-V CPU and moving progressively more and more functionality into PipelineC derived hardware finite state machines. This style of hardware is 'sequential', ex. compiled instructions and/or FSM states.

There are of course options to continue optimizing this serialized/sequential design. Moving the entire FRAME_WIDTH inner loops into hardware FSMs should shave a few cycles compared to still having the loop control done in CPU instructions. Changes to the Game of Life algorithm to reduce duplicate reads, and perhaps some 32b CPU word focused memory packing, could make things more efficient. If memory is plentiful, a complete second copy of the frame buffer could be used to parallelize across the entire frame (as opposed to iterating line by line using a pair of line buffers).

However, it is rarely ~optimal hardware design to merely ~copy the state machine a CPU was executing as instructions and implement it directly as a finite state machine. Even trying to scale that idea to use multiple 'threads' of execution eventually hits a familiar obstacle: the memory wall. That is, the limiting factor in the design ends up being the way in which data is read+written from memory. In this case, reading and writing single pixels at a time from anywhere across the entire frame buffer, at the same clock rate as other slow computations, limits the ultimate throughput for computing Game of Life.

First, the frame buffer RAM could be moved to its own faster clock domain, giving the opportunity to increase memory access throughput. Currently reads and writes are not pipelined: for ex. a read is started, and a few cycles later it completes, returning read data. It is instead possible to pipeline operations back to back, with several operations 'in flight' and returning one after the other. There is also no reason to limit memory to reading and writing single pixels at a time from addressable locations; shallower and wider RAMs can increase the throughput at which pixels can be read+written. And finally there is no reason to have a single frame buffer RAM (in FPGAs limited to a single port or two); instead, common techniques like memory banking could be used to increase the effective number of ports. With points like those in mind, there are many hardware architectures that might scale better for the Game of Life computation. Future work might do a more general PipelineC hardware design that is less/not CPU focused, maybe even using autopipelining as well.

RISC-V + PipelineC Integration Summary

This section quickly summarizes what was demonstrated at length in the above sections relating to Game of Life acceleration.

Memory Mapped Registers

Store and load instructions can be turned into register writes and reads by mapping certain memory addresses to registers.

C code for declaring a memory mapped register:

// My Register
#define MY_REG_ADDR 0x10000000
static volatile uint32_t* MY_REG = (uint32_t*)MY_REG_ADDR;

Anywhere in C code the MY_REG pointer can be used to read (load) from or write (store) to the hardware register.

// Like any other pointer, etc
*MY_REG = 1;
uint32_t x = *MY_REG + 1;

In PipelineC code the CPU's mem_map.c mem_map_module() contains the hook between load-store instructions and reading+writing the hardware register. For example, an entry connecting the software MY_REG to a hardware my_reg register mapped to address MY_REG_ADDR:

  static uint32_t my_reg; // A register
...
  if(addr==MY_REG_ADDR){
    o.addr_is_mapped = 1;
    o.rd_data = my_reg;
    if(wr_en){
      my_reg = wr_data;
    }
  }

There are many helper macros available in mem_map.h. The above address checking, read data, and write data enable logic could instead be written as:

WORD_MM_ENTRY(MY_REG_ADDR, my_reg)

Structs can be mapped like:

STRUCT_MM_ENTRY(ADDR, type, reg_name)

Memory Mapped Derived Finite State Machines

In PipelineC it is possible to derive finite state machines from C functions.

Consider a my_fsm function with a signature like so:

my_fsm_out_t my_fsm(my_fsm_in_t inputs);

Using design patterns like having input and output structs named <func_name>_in_t|<func_name>_out_t as above allows some concise helper macros from mem_map.h to be used:

C Code

// Add valid+ready handshaking as needed around `my_fsm_in|out_t`
// The extra signals are used to connect to the valid+ready handshake
// that is built into PipelineC derived FSMs already
FSM_IO_TYPES_WRAPPER(my_fsm)
// Define addresses for FSM inputs and outputs as memory mapped registers
#define MY_FSM_HW_IN_ADDR 0x00000000
#define MY_FSM_HW_OUT_ADDR (MY_FSM_HW_IN_ADDR + sizeof(my_fsm_in_valid_t))
// Declare a version of the my_fsm function to run on the CPU
// that uses variables at memory mapped address to invoke the derived FSM
FSM_MEM_MAP_FUNC_DECL(my_fsm, MY_FSM_HW)
// ^Macro expands to a function doing simple polling like so:
my_fsm_out_t my_fsm(my_fsm_in_t inputs){
  /* Set input data without valid*/
  MY_FSM_HW_IN->data = inputs;
  /* Set valid bit */
  MY_FSM_HW_IN->valid = 1;
  /* (optionally wait for valid to be cleared indicating started work)*/
  /* Then wait for valid output*/
  while(MY_FSM_HW_OUT->valid==0){}
  /* Return output data*/
  return MY_FSM_HW_OUT->data;
}

PipelineC Code

// Hardware specific to and from bytes conversion functions
#include "my_fsm_in_valid_t_bytes_t.h"
#include "my_fsm_out_valid_t_bytes_t.h"
// Tell PipelineC tool to generate FSM hardware my_fsm_FSM module
#include "my_fsm_FSM.h"
// Declare a top level instance of the FSM 
// with global wires for inputs and outputs
FSM_MAIN_IO_WRAPPER(my_fsm)

Connect to the CPU's mem_map.c mem_map_module():

// Input+Output registers 
// to connect global wires for FSM IO
FSM_IO_REGS_DECL(my_fsm)
...
// The address compare memory map entry for IO structs
FSM_IO_REG_STRUCT_MM_ENTRY(MY_FSM_HW, my_fsm)
...
// Register the FSM output wires
FSM_OUT_REG(my_fsm)

Thanks for your time - feel free to reach out with questions :)
