Shared Resource Bus
This page describes 'shared resource buses', which are very similar to system buses like AXI: managers make requests to and receive responses from subordinates using five separate valid+ready handshake channels for simultaneous reads and writes.
These buses are used in a graphics demo to have multiple 'host threads' share frame buffers to draw the Mandelbrot set.
PipelineC's shared_resource_bus.h generic shared bus header is used to create hosts that make read+write requests to, and receive read+write responses from, devices.
For example, if the device is a simple byte-addressed memory-mapped RAM, the requests and responses can be configured like so:
- write_req_data_t: Write request data - ex. RAM needs address to request write
typedef struct write_req_data_t
{
  uint32_t addr;
  /*// AXI:
  // Address
  // "Write this stream to a location in memory"
  id_t awid;
  addr_t awaddr;
  uint8_t awlen; // Number of transfer cycles minus one
  uint3_t awsize; // 2^size = Transfer width in bytes
  uint2_t awburst;*/
}write_req_data_t;
- write_data_word_t: Write data word - ex. RAM writes some data element, some number of bytes, to the addressed location
typedef struct write_data_word_t
{
  uint8_t data;
  /*// AXI:
  // Data stream to be written to memory
  uint8_t wdata[4]; // 4 bytes, 32b
  uint1_t wstrb[4];*/
}write_data_word_t;
- write_resp_data_t: Write response data - ex. RAM write returns a dummy single valid/done/complete bit
typedef struct write_resp_data_t
{
  uint1_t dummy;
  /*// AXI:
  // Write response
  id_t bid;
  uint2_t bresp; // Error code*/
}write_resp_data_t;
- read_req_data_t: Read request data - ex. RAM needs address to request read
typedef struct read_req_data_t
{
  uint32_t addr;
  /*// AXI:
  // Address
  // "Give me a stream from a place in memory"
  id_t arid;
  addr_t araddr;
  uint8_t arlen; // Number of transfer cycles minus one
  uint3_t arsize; // 2^size = Transfer width in bytes
  uint2_t arburst;*/
}read_req_data_t;
- read_data_resp_word_t: Read data and response word - ex. RAM read returns some data element
typedef struct read_data_resp_word_t
{
  uint8_t data;
  /*// AXI:
  // Read response
  id_t rid;
  uint2_t rresp;
  // Data stream from memory
  uint8_t rdata[4]; // 4 bytes, 32b*/
}read_data_resp_word_t;
Shared resource buses use valid+ready handshaking just like AXI. Each of the five channels (write request, write data, write response, read request, and read data) has its own handshaking signals.
Again, like AXI, these buses support pipelining and multiple IDs with several transactions in flight. Multi-cycle bursts of data bounded by packet 'last' flags also exist but are unused/untested in most cases.
WARNING: Bursting behaviors have never been used or tested and are known to not work with many existing components; reach out for more information.
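For reference, a transfer on any one of the five channels completes only in a cycle where that channel's valid and ready are both asserted; a minimal conceptual sketch (signal names are illustrative, not the exact generated ones):
// Valid+ready handshake rule (same as AXI): the sender drives valid and the
// payload, the receiver drives ready, and the payload moves only when both are high
uint1_t channel_transfer_now(uint1_t chan_valid, uint1_t chan_ready)
{
  return chan_valid & chan_ready;
}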
A kind/type of 'shared bus' is declared using the SHARED_BUS_TYPE_DEF macro. Instances of the shared bus are declared using the "shared_resource_bus_decl.h" header-as-macro. The macros declare types, helper functions, and a pair of global wires: one of the global variable wires is used for device-to-host data, while the other wire carries the opposite direction, host to device. PipelineC's #pragma INST_ARRAY shared global variables are used to resolve multiple simultaneous drivers of the wire pairs into shared resource bus arbitration.
SHARED_BUS_TYPE_DEF(
ram_bus, // Bus 'type' name
uint32_t, // Write request type (ex. RAM address)
uint8_t, // Write data type (ex. RAM data)
uint1_t, // Write response type (ex. dummy value for RAM)
uint32_t, // Read request type (ex. RAM address)
uint8_t // Read data type (ex. RAM data)
)
#define SHARED_RESOURCE_BUS_NAME the_bus_name // Instance name
#define SHARED_RESOURCE_BUS_TYPE_NAME ram_bus // Bus 'type' name
#define SHARED_RESOURCE_BUS_WR_REQ_TYPE uint32_t // Write request type (ex. RAM address)
#define SHARED_RESOURCE_BUS_WR_DATA_TYPE uint8_t // Write data type (ex. RAM data)
#define SHARED_RESOURCE_BUS_WR_RESP_TYPE uint1_t // Write response type (ex. dummy value for RAM)
#define SHARED_RESOURCE_BUS_RD_REQ_TYPE uint32_t // Read request type (ex. RAM address)
#define SHARED_RESOURCE_BUS_RD_DATA_TYPE uint8_t // Read data type (ex. RAM data)
#define SHARED_RESOURCE_BUS_HOST_PORTS NUM_HOST_PORTS
#define SHARED_RESOURCE_BUS_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_DEV_PORTS NUM_DEV_PORTS
#define SHARED_RESOURCE_BUS_DEV_CLK_MHZ DEV_CLK_MHZ
#include "shared_resource_bus_decl.h"
The SHARED_BUS_TYPE_DEF macro declares types like <bus_type>_dev_to_host_t and <bus_type>_host_to_dev_t.
Arbitrary devices are connected to the bus via controller modules that ~convert host to/from device signals into device specifics. Again for example, a simple RAM device might have a controller module like:
// Controller Outputs:
typedef struct ram_ctrl_t{
// ex. RAM inputs
uint32_t addr;
uint32_t wr_data;
uint32_t wr_enable;
// Bus signals driven to host
ram_bus_dev_to_host_t to_host;
}ram_ctrl_t;
ram_ctrl_t ram_ctrl(
// Controller Inputs:
// Ex. RAM outputs
uint32_t rd_data,
// Bus signals from the host
ram_bus_host_to_dev_t from_host
);
Inside that ram_ctrl module, RAM-specific signals are connected to the five valid+ready handshakes going to_host (ex. out from the RAM) and from_host (ex. into the RAM).
A full example of a controller can be found in the shared frame buffer example code discussed in later sections.
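For a rough idea of the read path, a sketch is shown below. The channel field names follow the pattern seen in the later host_vga_reader example and are not guaranteed to match the exact structs generated by shared_resource_bus.h; a real controller must also account for the RAM's read latency and drive the write channels.
// Sketch only: read-side connections for a hypothetical zero-latency RAM
ram_ctrl_t ram_ctrl(uint32_t rd_data, ram_bus_host_to_dev_t from_host)
{
  ram_ctrl_t o;
  // Drive the RAM address from the read request channel payload
  o.addr = from_host.read.req.data.user;
  // Always ready to accept a read request in this sketch
  o.to_host.read.req_ready = 1;
  // Return the RAM output data on the read data channel
  o.to_host.read.data.burst.data_resp.user = rd_data;
  o.to_host.read.data.valid = from_host.read.req.valid;
  // Write request/data/response channels omitted from this sketch
  o.wr_data = 0;
  o.wr_enable = 0;
  return o;
}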
In the above sections a controller function, ex. ram_ctrl, describes how a device connects to a shared bus. Using the SHARED_BUS_ARB macro, the below code instantiates the arbitration that connects the multiple hosts and devices together through <instance_name>_from_host and <instance_name>_to_host wires for each device.
MAIN_MHZ(ram_arb_connect, DEV_CLK_MHZ)
void ram_arb_connect()
{
// Arbitrate M hosts to N devs
// Macro declares the_bus_name_from_host and the_bus_name_to_host
SHARED_BUS_ARB(ram_bus, the_bus_name, NUM_DEV_PORTS)
// Connect devs to arbiter ports
uint32_t i;
for (i = 0; i < NUM_DEV_PORTS; i+=1)
{
ram_ctrl_t port_ctrl = ram_ctrl(<RAM outputs>, the_bus_name_from_host[i]);
<RAM inputs> = port_ctrl....;
the_bus_name_to_host[i] = port_ctrl.to_host;
}
}
The "shared_resource_bus_decl.h"
header-as-macro declares derived finite state machine helper functions for reading and writing
the shared resource bus. These functions are to be used from NUM_HOST_PORTS
simultaneous host FSM 'threads'.
Below, for example, shows generated signatures for read
ing and write
ing the example shared bus RAM:
uint8_t the_bus_name_read(uint32_t addr);
uint1_t the_bus_name_write(uint32_t addr, uint8_t data); // Dummy return value
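These helpers can be called from a derived FSM host thread like ordinary blocking functions. A minimal sketch of a read-modify-write loop (a hypothetical host function, not from the example code):
// Each call holds this 'thread' until its response returns on the shared bus
void example_host_thread()
{
  uint32_t addr = 0;
  while(1)
  {
    uint8_t data = the_bus_name_read(addr);
    uint1_t done = the_bus_name_write(addr, data + 1); // Dummy return value
    addr += 1;
  }
}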
Old graphics demo diagram:
The below graphics demo differs from the old one shown in the diagram above (but the old one is also worth reading for more details).
Instead of requiring on-chip block RAM, this new demo can use off-chip DDR memory for large color frame buffers. Additionally, this demo focuses on a more complex rendering computation that can benefit from PipelineC's auto-pipelining.
The graphics_demo.c file is an example exercising a dual frame buffer as a shared bus resource from dual_frame_buffer.c. The demo slowly cycles through R,G,B color ranges, requiring for each pixel: a read from frame buffer RAM, minimal computation to update pixel color, and a write back to frame buffer RAM for display.
The frame buffer is configured to use a Xilinx AXI DDR controller, starting inside ddr_dual_frame_buffer.c. The basic shared resource bus setup for connecting to the Xilinx DDR memory controller AXI bus can be found in axi_xil_mem.c. In that file an instance of an axi_shared_bus_t shared resource bus (defined in axi_shared_bus.h) called axi_xil_mem is declared using the shared_resource_bus_decl.h include-as-macro helper.
In addition to 'user' rendering threads, the frame buffer memory shared resource needs to be reading pixels at a rate that can meet the streaming requirement of the VGA resolution pixel clock timing for connecting a display.
Unlike the old demo, in this demo ddr_dual_frame_buffer.c uses a separate 'read-only priority port' wire, axi_xil_rd_pri_port_mem_host_to_dev_wire, to simply connect a VGA position counter to a dedicated read request side of the shared resource bus. Responses from the bus are the pixels that are written directly into the vga_pmod_async_pixels_fifo.c display stream.
MAIN_MHZ(host_vga_reader, XIL_MEM_MHZ)
void host_vga_reader()
{
static uint1_t frame_buffer_read_port_sel_reg;
// READ REQUEST SIDE
// Increment VGA counters and do read for each position
static vga_pos_t vga_pos;
// Read and increment pos if room in fifos (cant be greedy since will 100% hog priority port)
uint1_t fifo_ready;
#pragma FEEDBACK fifo_ready
// Read from the current read frame buffer addr
uint32_t addr = pos_to_addr(vga_pos.x, vga_pos.y);
axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.araddr = dual_ram_to_addr(frame_buffer_read_port_sel_reg, addr);
axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arlen = 1-1; // size=1 minus 1: 1 transfer cycle (non-burst)
axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arsize = 2; // 2^2=4 bytes per transfer
axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.data.user.arburst = BURST_FIXED; // Not a burst, single fixed address per transfer
axi_xil_rd_pri_port_mem_host_to_dev_wire.read.req.valid = fifo_ready;
uint1_t do_increment = fifo_ready & axi_xil_rd_pri_port_mem_dev_to_host_wire.read.req_ready;
vga_pos = vga_frame_pos_increment(vga_pos, do_increment);
// READ RESPONSE SIDE
// Get read data from the AXI RAM bus
uint8_t data[4];
uint1_t data_valid = 0;
data = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.burst.data_resp.user.rdata;
data_valid = axi_xil_rd_pri_port_mem_dev_to_host_wire.read.data.valid;
// Write pixel data into fifo
pixel_t pixel;
pixel.a = data[0];
pixel.r = data[1];
pixel.g = data[2];
pixel.b = data[3];
pixel_t pixels[1];
pixels[0] = pixel;
fifo_ready = pmod_async_fifo_write_logic(pixels, data_valid);
axi_xil_rd_pri_port_mem_host_to_dev_wire.read.data_ready = fifo_ready;
frame_buffer_read_port_sel_reg = frame_buffer_read_port_sel;
}
In graphics_demo.c the pixel_kernel function implements incrementing RGB channel values as a simple test pattern. The pixels_kernel_seq_range function iterates over a range of the frame area, executing pixel_kernel for each pixel. The frame area is defined by start and end x and y positions.
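A minimal sketch of what such a kernel could look like (the actual pixel_kernel in graphics_demo.c differs, and kernel_args_t contents are not shown here):
// Hypothetical test pattern: nudge each color channel on every pass
pixel_t pixel_kernel(kernel_args_t args, pixel_t pixel, uint16_t x, uint16_t y)
{
  pixel.r += 1;
  pixel.g += 1;
  pixel.b += 1;
  return pixel;
}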
// Single 'thread' state machine running pixel_kernel "sequentially" across an x,y range
void pixels_kernel_seq_range(
kernel_args_t args,
uint16_t x_start, uint16_t x_end,
uint16_t y_start, uint16_t y_end)
{
uint16_t x;
uint16_t y;
for(y=y_start; y<=y_end; y+=TILE_FACTOR)
{
for(x=x_start; x<=x_end; x+=TILE_FACTOR)
{
if(args.do_clear){
pixel_t pixel = {0};
frame_buf_write(x, y, pixel);
}else{
// Read the pixel from the 'read' frame buffer
pixel_t pixel = frame_buf_read(x, y);
pixel = pixel_kernel(args, pixel, x, y);
// Write pixel back to the 'write' frame buffer
frame_buf_write(x, y, pixel);
}
}
}
}
Multiple host threads can be reading and writing the frame buffers, each trying to execute its own sequential run of pixels_kernel_seq_range. This is accomplished by manually instantiating multiple derived FSM thread pixels_kernel_seq_range_FSM modules inside of a function called render_demo_kernel. The NUM_TOTAL_THREADS = (NUM_X_THREADS*NUM_Y_THREADS) copies of pixels_kernel_seq_range all run in parallel, splitting the FRAME_WIDTH across NUM_X_THREADS threads and the FRAME_HEIGHT across NUM_Y_THREADS threads.
// Module that runs pixel_kernel for every pixel
// by instantiating multiple simultaneous 'threads' of pixel_kernel_seq_range
void render_demo_kernel(
kernel_args_t args,
uint16_t x, uint16_t width,
uint16_t y, uint16_t height
){
// Wire up N parallel pixel_kernel_seq_range_FSM instances
uint1_t thread_done[NUM_X_THREADS][NUM_Y_THREADS];
uint32_t i,j;
uint1_t all_threads_done;
while(!all_threads_done)
{
pixels_kernel_seq_range_INPUT_t fsm_in[NUM_X_THREADS][NUM_Y_THREADS];
pixels_kernel_seq_range_OUTPUT_t fsm_out[NUM_X_THREADS][NUM_Y_THREADS];
all_threads_done = 1;
uint16_t thread_x_size = width >> NUM_X_THREADS_LOG2;
uint16_t thread_y_size = height >> NUM_Y_THREADS_LOG2;
for (i = 0; i < NUM_X_THREADS; i+=1)
{
for (j = 0; j < NUM_Y_THREADS; j+=1)
{
if(!thread_done[i][j])
{
fsm_in[i][j].input_valid = 1;
fsm_in[i][j].output_ready = 1;
fsm_in[i][j].args = args;
fsm_in[i][j].x_start = (thread_x_size*i) + x;
fsm_in[i][j].x_end = fsm_in[i][j].x_start + thread_x_size - 1;
fsm_in[i][j].y_start = (thread_y_size*j) + y;
fsm_in[i][j].y_end = fsm_in[i][j].y_start + thread_y_size - 1;
fsm_out[i][j] = pixels_kernel_seq_range_FSM(fsm_in[i][j]);
thread_done[i][j] = fsm_out[i][j].output_valid;
}
all_threads_done &= thread_done[i][j];
}
}
__clk();
}
}
render_demo_kernel can then simply run in a loop, trying for the fastest frames per second possible.
void main()
{
kernel_args_t args;
...
while(1)
{
// Render entire frame
render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
}
}
The actual graphics_demo.c main() does some extra DDR initialization, is slowed down to render the test pattern slowly, and manages toggling the dual frame buffer's 'which is the read buffer' select signal after each render_demo_kernel iteration: frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
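Roughly, the frame loop looks like the sketch below (the real main() also includes the DDR init and render slow-down mentioned above):
while(1)
{
  // Render entire frame using the current read/write buffer assignment
  render_demo_kernel(args, 0, FRAME_WIDTH, 0, FRAME_HEIGHT);
  // Swap which frame buffer is displayed ('read') and which is drawn into ('write')
  frame_buffer_read_port_sel = !frame_buffer_read_port_sel;
}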
The above graphics demo uses an AXI RAM frame buffer as the resource shared on a bus.
Another common use case is having an automatically pipelined function as the shared resource. shared_resource_bus_pipeline.h is a header-as-macro helper for declaring a pipeline instance connected to multiple host state machines via a shared resource bus.
// Example declaration using helper header-as-macro
#define SHARED_RESOURCE_BUS_PIPELINE_NAME name
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE output_t
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC the_func_to_pipeline
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE input_t
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ DEV_CLK_MHZ
#include "shared_resource_bus_pipeline.h"
In the above example a function output_t the_func_to_pipeline(input_t) is made into a pipeline instance used like output_t name(input_t) from NUM_THREADS derived FSM host threads (running at HOST_CLK_MHZ). The pipeline is automatically pipelined to meet the target DEV_CLK_MHZ operating frequency.
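From one of those host threads the pipeline is then invoked like a plain function call; for example (a sketch using the placeholder names above):
// Sketch: one host thread using the pipeline device
output_t compute_one(input_t i)
{
  // Sends i over the shared bus to the pipeline device and waits ('blocks'
  // this thread) until the pipelined result returns
  output_t o = name(i);
  return o;
}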
As a demonstration of shared_resource_bus_pipeline.h the mandelbrot_demo.c file instantiates several pipeline devices for computation inside shared_mandelbrot_dev.c.
For example the device for computing Mandelbrot iterations is declared as such:
// Do N Mandelbrot iterations per call to mandelbrot_iter_func
#define ITER_CHUNK_SIZE 6
#define MAX_ITER 32
typedef struct mandelbrot_iter_t{
complex_t c;
complex_t z;
complex_t z_squared;
uint1_t escaped;
uint32_t n;
}mandelbrot_iter_t;
#define ESCAPE 2.0
mandelbrot_iter_t mandelbrot_iter_func(mandelbrot_iter_t inputs)
{
mandelbrot_iter_t rv = inputs;
uint32_t i;
for(i=0;i<ITER_CHUNK_SIZE;i+=1)
{
// Mimic while loop
if(!rv.escaped & (rv.n < MAX_ITER))
{
// float_lshift is a cheap multiply by 2^n done by adjusting the exponent only
rv.z.im = float_lshift((rv.z.re*rv.z.im), 1) + rv.c.im;
rv.z.re = rv.z_squared.re - rv.z_squared.im + rv.c.re;
rv.z_squared.re = rv.z.re * rv.z.re;
rv.z_squared.im = rv.z.im * rv.z.im;
rv.n = rv.n + 1;
rv.escaped = (rv.z_squared.re+rv.z_squared.im) > (ESCAPE*ESCAPE);
}
}
return rv;
}
#define SHARED_RESOURCE_BUS_PIPELINE_NAME mandelbrot_iter
#define SHARED_RESOURCE_BUS_PIPELINE_OUT_TYPE mandelbrot_iter_t
#define SHARED_RESOURCE_BUS_PIPELINE_FUNC mandelbrot_iter_func
#define SHARED_RESOURCE_BUS_PIPELINE_IN_TYPE mandelbrot_iter_t
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_THREADS NUM_USER_THREADS
#define SHARED_RESOURCE_BUS_PIPELINE_HOST_CLK_MHZ HOST_CLK_MHZ
#define SHARED_RESOURCE_BUS_PIPELINE_DEV_CLK_MHZ MANDELBROT_DEV_CLK_MHZ
#include "shared_resource_bus_pipeline.h"
Note the ITER_CHUNK_SIZE scaling constant: this allows the single pipeline to compute multiple iterations per request (as opposed to using a single-iteration pipeline more times sequentially). In this design, with a medium sized FPGA, the value scales (typically to fill the space not used by the derived FSM threads) to ~1-8 iterations in the pipeline.
Other devices include screen_to_complex, for computing the complex plane position value from a screen position, as well as iter_to_color, which takes an integer number of iterations and returns an RGB color.
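Their host-side signatures are roughly of the form below (inferred from how the devices are used later on this page; see shared_mandelbrot_dev.c for the exact types):
complex_t screen_to_complex(screen_to_complex_in_t inputs); // screen x,y -> complex plane position
pixel_t iter_to_color(uint32_t n); // iteration count -> RGB color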
Using the setup from inside the PipelineC-Graphics repo, the following commands will compile and run the demo with a 480p display and 8x tiling down to 80x60 pixels in the frame buffer.
rm -Rf ./build
../PipelineC/src/pipelinec mandelbrot_pipelinec_app.c --out_dir ./build --comb --sim --verilator --run -1
verilator -Mdir ./obj_dir -Wno-UNOPTFLAT -Wno-WIDTH -Wno-CASEOVERLAP --top-module top -cc ./build/top/top.v -O3 --exe main.cpp -I./build/verilator -CFLAGS -DUSE_VERILATOR -CFLAGS -DFRAME_WIDTH=640 -CFLAGS -DFRAME_HEIGHT=480 -LDFLAGS $(shell sdl2-config --libs)
cp ./main.cpp ./obj_dir
make CXXFLAGS="-DUSE_VERILATOR -I../../PipelineC/ -I../../PipelineC/pipelinec/include -I../build/verilator -I.." -C ./obj_dir -f Vtop.mk
./obj_dir/Vtop
Alternatively, cloning the mandelbrot branch will allow you to run make mandelbrot_verilator instead.
In keeping with the law, it's typical to scale frequency as the easiest first option. This design has two relevant clocks: DEV_CLK, the device clock, and HOST_CLK, the host FSM 'threads' clock.
Currently derived FSMs have many optimizations still to be done, and thus you typically won't be able to scale the thread clock reliably, or very far. In this design HOST_CLK is set to ~40MHz.
The device clock scaling is where PipelineC's automatic pipelining is critical. Essentially the device clock can be set ~arbitrarily high. However, the latency penalty of asynchronous clock domain crossings and the latency of ever more deeply pipelined devices need to be weighed; the best solution is not always to run at the maximum clock rate. In this original single-threaded 'CPU style' design it was typical for the optimal device clock to be equal to the slow host clock, because the latency of extra pipelining plus having the device clock domain run at a higher rate was not an overall performance benefit.
Having a few simple, sequentially iterating single threads is very... CPU like. In that way, it is possible to scale 'by adding more cores/threads', but this comes with heavy resource use, needing to duplicate the entire state machine. In this example a medium size FPGA comfortably fits 4-8 derived state machine 'threads'.
In addition to tuning 'CPU-like' scaling parameters as above, it's possible to instead make ~'algorithmic'/architectural changes to be even more specific to the task, where the source code looks further and further from the original simple Mandelbrot implementation.
As opposed to having pipeline devices that separately compute dedicated things, ex. screen_to_complex for screen position and mandelbrot_iter for the Mandelbrot iterations, it is possible to use fewer resources and, again CPU-like, compute with smaller single-operation floating point units. In shared_mandelbrot_dev_fp_ops.c there is a version of the Mandelbrot demo that has two Mandelbrot devices: a floating point adder and a floating point multiplier. These pipelines are used iteratively for all floating point computations.
The original + and * floating point operators must manually be replaced with function calls (pending work on automatic operator overloading). Functions wrapping the shared resource bus pipelines of the form float fp_add(float x, float y) and float fp_mult(float x, float y) are made available.
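For example, part of the Mandelbrot iteration update rewritten with these wrappers (a sketch; a matching fp_sub wrapper is assumed, as used in the example further below):
// Original operators:
// rv.z.re = rv.z_squared.re - rv.z_squared.im + rv.c.re;
// rv.z_squared.re = rv.z.re * rv.z.re;
// ...become blocking shared-pipeline calls:
rv.z.re = fp_add(fp_sub(rv.z_squared.re, rv.z_squared.im), rv.c.re);
rv.z_squared.re = fp_mult(rv.z.re, rv.z.re);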
Replacing all floating point operators with generic fp_add/mult pipeline devices, instead of using specialized Mandelbrot-specific pipeline devices, will obviously result in a performance decrease. However, this structure opens up new ways of using the pipelines.
To start, automatically pipelining smaller single floating point units is easier than more complex pipelines consisting of multiple operators. So it becomes very easy to always run the floating point units near the maximum possible rate. However, best performance requires saturating those highly pipelined operators with multiple operations in flight at once (ex. multiple add/mults at once). An easy way to do that from a single thread is to find parallelism in the data:
The Mandelbrot iteration consists of four add/sub operations and three multiply operations. However, not all operators depend on each other which allows some to be completed at the same time as others.
The below C code completes two floating point operations at the same time. This is accomplished by starting the two ops, then waiting for both to finish.
// Code written in columns - a single op looks like:
// _start(args);
// out = _finish();
// Mult and sub at same time
/*float re_mult_im = */ fp_mult_start(rv.z.re, rv.z.im); /*float re_minus_im = */ fp_sub_start(rv.z_squared.re, rv.z_squared.im);
float re_mult_im = fp_mult_finish(/*rv.z.re, rv.z.im*/); float re_minus_im = fp_sub_finish(/*rv.z_squared.re, rv.z_squared.im*/);
// Two adds at same time
/*rv.z.im = */ fp_add_start(float_lshift(re_mult_im, 1), rv.c.im); /*rv.z.re = */ fp_add_start(re_minus_im, rv.c.re);
rv.z.im = fp_add_finish(/*float_lshift(re_mult_im, 1), rv.c.im*/); rv.z.re = fp_add_finish(/*re_minus_im, rv.c.re*/);
// Two mult at same time
/*rv.z_squared.re = */ fp_mult_start(rv.z.re, rv.z.re); /*rv.z_squared.im = */ fp_mult_start(rv.z.im, rv.z.im);
rv.z_squared.re = fp_mult_finish(/*rv.z.re, rv.z.re*/); rv.z_squared.im = fp_mult_finish(/*rv.z.im, rv.z.im*/);
// Final single adder alone
/*float re_plus_im = */ fp_add_start(rv.z_squared.re, rv.z_squared.im);
float re_plus_im = fp_add_finish(/*rv.z_squared.re, rv.z_squared.im*/);
As has been noted, derived FSMs have many optimizations still to be done, and so simply adding more 'threads/cores' to the design doesn't scale very far. Additionally, the data parallelism found in the Mandelbrot iteration operations is not as great a level of parallelism as deeply pipelined devices can handle. Ex. you might have 2,4,8 threads/cores of FSM execution and 1-2 operations in flight from each thread - meaning a maximum pipeline depth of ~8*~2=16 stages is the most the pipelining/frequency scaling of the design can benefit from. What is needed is a way to access pixel level parallelism (ex. how the many, many pixels can all be computed independently of each other by threads splitting parts of the screen to render) but without the cost of having many hardware threads.
Instead it's possible to emulate having multiple physical 'cores/threads' of FSM execution by using a single physical thread to time multiplex across multiple ~'instruction streams'/state machines. This is very similar to how single core CPUs run multiple "simultaneous" threads by having processes time multiplexed onto the core by the operating system. In this case, there is no operating system - the scheduling is a simple fixed cycling through discrete states of ~coroutine style functions.
Coroutines require execution to be suspended and then at a later time resumed. In this case, this is accomplished by creating a state machine (in C/PipelineC, one that is itself 'running' on the derived state machine 'core/thread' - yes, it's confusing). This allows a single copy of the hardware (the physical FSM) to 'in software' implement N "simultaneous" (quickly time multiplexed) copies of a small 'coroutine' state machine. In the mandelbrot_demo_w_fake_time_multiplex_threads.c version of the demo, physical derived FSM 'cores/threads' are called 'hardware threads' and the multiple time multiplexed instances of coroutines are called 'software threads'.
The below code from shared_mandelbrot_dev_w_fake_time_multiplex_threads.c describes the high-level mandelbrot_kernel_fsm 'coroutine' state machine. It cycles through computing the screen position, looping for the Mandelbrot iterations, and then computing the final pixel color. It is written using async _start and _finish function calls to the original Mandelbrot specific shared pipeline devices. Each device has a state for starting an operation and a state for finishing it.
typedef enum mandelbrot_kernel_fsm_state_t{
screen_to_complex_START,
screen_to_complex_FINISH,
mandelbrot_iter_START,
mandelbrot_iter_FINISH,
iter_to_color_START,
iter_to_color_FINISH
}mandelbrot_kernel_fsm_state_t;
typedef struct mandelbrot_kernel_state_t{
// FSM state
mandelbrot_kernel_fsm_state_t fsm_state;
// Func inputs
pixel_t pixel;
screen_state_t screen_state;
uint16_t x;
uint16_t y;
// Func local vars
mandelbrot_iter_t iter;
screen_to_complex_in_t stc;
pixel_t p;
// Done signal
uint1_t done;
}mandelbrot_kernel_state_t;
mandelbrot_kernel_state_t mandelbrot_kernel_fsm(mandelbrot_kernel_state_t state)
{
// The state machine starting and finishing async operations
state.done = 0;
if(state.fsm_state==screen_to_complex_START){
// Convert pixel coordinate to complex number
state.stc.state = state.screen_state;
state.stc.x = state.x;
state.stc.y = state.y;
/*state.iter.c = */screen_to_complex_start(state.stc);
state.fsm_state = screen_to_complex_FINISH;
}else if(state.fsm_state==screen_to_complex_FINISH){
state.iter.c = screen_to_complex_finish(/*state.stc*/);
// Do the mandelbrot iters
state.iter.z.re = 0.0;
state.iter.z.im = 0.0;
state.iter.z_squared.re = 0.0;
state.iter.z_squared.im = 0.0;
state.iter.n = 0;
state.iter.escaped = 0;
state.fsm_state = mandelbrot_iter_START;
}else if(state.fsm_state==mandelbrot_iter_START){
if(!state.iter.escaped & (state.iter.n < MAX_ITER)){
/*state.iter = */mandelbrot_iter_start(state.iter);
state.fsm_state = mandelbrot_iter_FINISH;
}else{
state.fsm_state = iter_to_color_START;
}
}else if(state.fsm_state==mandelbrot_iter_FINISH){
state.iter = mandelbrot_iter_finish(/*state.iter*/);
state.fsm_state = mandelbrot_iter_START;
}else if(state.fsm_state==iter_to_color_START){
// The color depends on the number of iterations
/*state.p = */iter_to_color_start(state.iter.n);
state.fsm_state = iter_to_color_FINISH;
}else if(state.fsm_state==iter_to_color_FINISH){
state.p = iter_to_color_finish(/*state.iter.n*/);
state.done = 1;
state.fsm_state = screen_to_complex_START; // Probably not needed
}
return state;
}
Then a single hardware 'thread' of the below C code is used to execute NUM_SW_THREADS time multiplexed "simultaneous" instances of the above mandelbrot_kernel_fsm coroutine 'software thread' state machine:
n_pixels_t n_time_multiplexed_mandelbrot_kernel_fsm(
screen_state_t screen_state,
uint16_t x[NUM_SW_THREADS],
uint16_t y[NUM_SW_THREADS]
){
n_pixels_t rv;
mandelbrot_kernel_state_t kernel_state[NUM_SW_THREADS];
// INIT
uint32_t i;
for(i = 0; i < NUM_SW_THREADS; i+=1)
{
kernel_state[i].fsm_state = screen_to_complex_START; // Probably not needed
kernel_state[i].screen_state = screen_state;
kernel_state[i].x = x[i];
kernel_state[i].y = y[i];
}
// LOOP doing N 'coroutine'/fsms until done
uint1_t thread_done[NUM_SW_THREADS];
uint1_t all_threads_done = 0;
do
{
all_threads_done = 1;
for(i = 0; i < NUM_SW_THREADS; i+=1)
{
// operate on front of shift reg [0] (as opposed to random[i])
if(!thread_done[0]){
kernel_state[0] = mandelbrot_kernel_fsm(kernel_state[0]);
rv.data[0] = kernel_state[0].p;
thread_done[0] = kernel_state[0].done;
all_threads_done &= thread_done[0];
}
// And then shift the reg to prepare next at [0]
ARRAY_1ROT_DOWN(mandelbrot_kernel_state_t, kernel_state, NUM_SW_THREADS)
ARRAY_1ROT_DOWN(pixel_t, rv.data, NUM_SW_THREADS)
ARRAY_1ROT_DOWN(uint1_t, thread_done, NUM_SW_THREADS)
}
}while(!all_threads_done);
return rv;
}
As seen above, the code loops across all NUM_SW_THREADS instances of the coroutine state machine, where each iteration calls mandelbrot_kernel_fsm and checks to see if the state machine has completed yet (and supplied a return value). In this way, cycling one state transition at a time across all of the 'soft' instances makes for an easy way to have NUM_SW_THREADS operations in flight at once to the pipelined devices from just a single physical hardware derived FSM instance thread.
Using these shared resource buses it's possible to picture even more complex architectures of host threads and computation devices.
Generally the functionality in shared_resource_bus.h will continue to be improved and made easier to adapt in more design situations.
Please reach out if interested in giving anything a try or making improvements, happy to help! -Julian