Constructing large, parallel, state machines #2414
-
Hi all. I'm trying to use Clash to construct a large state machine for a project that I'm working on. Clash and VHDL are still new to me and I'm hitting some problems that I can't see a way past, so I'm asking here to see what suggestions anyone might have. I've put together a state machine implementation using the `mealy` function from `Clash.Prelude`, but it doesn't scale (see below). I'm wondering if this is because the algorithm I'm using is at fault, and whether it's even possible for the nice properties I've got to hold for non-trivial problem sizes.

Here's a cut-down version of my code:

```haskell
module Example.CutDown (topEntity) where

import Clash.Prelude

type StateVecLen = 10
type StateVecIndexType = Signed 5

initialState :: Vec StateVecLen Bool
initialState =
     False
  :> False
  :> False
  :> False
  :> False
  :> False
  :> False
  :> False
  :> False
  :> True
  :> Nil

goalStateIndex :: StateVecIndexType
goalStateIndex = 0

getBit :: Vec StateVecLen Bool -> StateVecIndexType -> Bool
getBit currState index = case index of
  0 -> c 7 || c 4 || c 0
  1 -> c 0 || c 0 || c 5 || c 1 || c 1 || c 1
  2 -> c 9 || c 6 || c 2
  3 -> c 2 || c 2 || c 7 || c 3 || c 3 || c 3
  4 -> c 1 || c 8 || c 4
  5 -> c 4 || c 4 || c 9 || c 5 || c 5 || c 5
  6 -> c 3 || c 0 || c 6
  7 -> c 6 || c 6 || c 1 || c 7 || c 7 || c 7
  8 -> c 5 || c 2 || c 8
  9 -> c 8 || c 8 || c 3 || c 9 || c 9 || c 9
  _ -> False
 where
  c :: StateVecIndexType -> Bool
  c i = currState !! i

-- Bool input argument is currently unused.
stateMachine ::
  StateVecIndexType ->
  Vec StateVecLen Bool ->
  Bool ->
  (Vec StateVecLen Bool, Bool)
stateMachine goalBit state _input = (newState, goalReached)
 where
  newState :: Vec StateVecLen Bool
  newState = imap (\i _a -> getBit state (fromIntegral i)) state

  goalReached :: Bool
  goalReached = state !! goalBit

topEntity ::
  Clock System ->
  Reset System ->
  Enable System ->
  Signal System Bool ->
  Signal System Bool
topEntity = exposeClockResetEnable mealyMachine
 where
  mealyMachine ::
    HiddenClockResetEnable dom =>
    Signal dom Bool ->
    Signal dom Bool
  mealyMachine = mealy transferFunction initialState

  transferFunction ::
    Vec StateVecLen Bool ->
    Bool ->
    (Vec StateVecLen Bool, Bool)
  transferFunction = stateMachine goalStateIndex
```

This example code has only 10 boolean values as its state, and it all works fine. By that, I mean that I can run it through Clash, generate VHDL, and have Vivado successfully import and compile that VHDL. However, when I try to use this approach on a real-world problem that uses upwards of 8,000 booleans, it fails badly. Only these changes are needed to grow the example above to the real-world problem size:
(In fact, these steps are the reverse of how I cut down my real-world problem to produce the above example code.)

When I try processing the "full size" implementation, the Vivado tools I'm using (release 2022.1, I believe) really struggle to process the VHDL that Clash generates. The `vivado` compiler cannot compile it at all: it takes 6 hours to exhaust all of the available memory (115 GB). Even importing the files into the project is a struggle, as `srcscanner` falls over in the same way.

I've read around a bit, and I suspect my approach simply requires the FPGA implementation to do too much "routing" to get values into and out of the next-state calculation logic. One suggestion I've seen is to store the boolean values in a block RAM instead of in registers (as I believe my current code does). But my understanding is that block RAM can access only one or two values per cycle, which would force the calculation of the next-state values to be done in series, not in parallel.

To get to my question, then:
Thanks in advance for any insights or pointers that might help me out with this,
Paul.
-
Hello and welcome! Your code looks like it is generated code. And while I can see how it works, I have no idea what it does. For me, that makes giving advice rather difficult; I can't actually think and reason about the problem or suitable tactics for approaching it differently. Basile already makes several good points; I'll try to add to them.

FPGAs have great parallelism, yes. But computations and storage in actual designs still usually have a fair measure of locality. Data usually only travels relatively small distances over the chip, and neighbouring logic elements are better connected than elements far apart. Eventually, state from one side of the chip might reach the other side, but preferably not in a single clock cycle. If you have 8,000 bits of state and the next state of one bit depends on bits all over that current state, that does sound like it will give routing issues. The number of bits that determines the next state of a single bit is also relevant.

However, while FPGAs are great at parallelism, that doesn't mean it's a good solution to do everything in parallel. The clock speed the eventual circuit will run at is limited by the longest combinatorial path: the longest path some signal needs to propagate from the output of the register with the current state to the input of the register with the next state. If you do an awful lot in a single clock cycle, that path might get very long. If you have a circuit that does everything in a single clock cycle but can only run at 10 MHz, do realise that a circuit that does the same computation distributed over 20 clock cycles and achieving a clock speed of 200 MHz is just as quick in the end!

Finally, there's pipelining. Suppose you split your computation into 5 sequential steps, running at 200 MHz. You might say, okay, that means I can process 40 million pieces of data per second:
D1 is a piece of data (a datum if you will), and time flows from top to bottom. Left to right is data processing step. After 5 clock cycles, the data is processed. But if you can write a pipelined solution, it can actually process 200 million pieces of data: the circuit that does processing step 1 can already start on D2 when D1 has travelled to the circuit doing processing step 2. It processes 200 million pieces of data per second, it's just that there's a latency of 5 cycles before the result is available:
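The throughput-versus-latency point can be sketched with a plain-Haskell stream model (this is an illustration, not Clash: a lazy list stands in for a `Signal`, and `0` is an arbitrary register reset value):

```haskell
-- One pipeline stage: a processing step followed by a register,
-- modelled as a one-cycle delay on a stream of values.
stage :: (Int -> Int) -> [Int] -> [Int]
stage f xs = 0 : map f xs

-- Five chained stages: each input's result appears five cycles later,
-- yet a new input is accepted, and a result produced, every cycle.
pipeline :: [Int] -> [Int]
pipeline = stage (+1) . stage (+1) . stage (+1) . stage (+1) . stage (+1)
```

Feeding it `[10,20..]` yields `[0,1,2,3,4,15,25,35,...]`: five cycles of reset-value bubbles while the pipeline fills, and from then on one fully processed result (`+5`) per cycle.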
This is a lot of parallelism in a single 5-step computation. And you can do other computations with the area of the FPGA that you still have left. What it boils down to is: achieving parallelism is not about doing everything in a single combinatorial superfunction, it's about pipelining computations and having many computations running next to each other.

Note that block RAMs are pipelined. While it takes one or two cycles before you get your result, they will produce a result every clock cycle. Why one or two, you ask? The minimum is one, but you can often reach higher clock speeds if you enable the register that is on the output of the block RAM. This adds another cycle of delay, but might improve throughput. If you simply put a register on the output in Clash, the synthesis tool ought to be able to infer that you want that extra register in the block RAM enabled.
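That read behaviour can also be modelled as streams. This is a toy sketch in plain Haskell, not Clash's actual `blockRam` API; the reset values chosen here are arbitrary:

```haskell
-- Toy model of a pipelined block-RAM read port: it accepts a read
-- address every cycle and produces a result every cycle, one cycle
-- later. The first output is an undefined/reset value, modelled here
-- by the first RAM word.
ramRead :: [v] -> [Int] -> [v]
ramRead contents addrs = head contents : map (contents !!) addrs

-- Enabling the optional output register adds one more cycle of latency
-- (often allowing a higher clock) without reducing throughput: results
-- still come out one per cycle.
withOutputReg :: v -> [v] -> [v]
withOutputReg resetVal outs = resetVal : outs
```

For example, reading addresses `[2,0,1]` from contents `[10,20,30]` gives `[10,30,10,20]` (one bubble, then one result per cycle); with the output register enabled, the same reads arrive one cycle later.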
-
Hello!
I think the example is a little bit too abstract for me to understand how the requirements change as the problem scales.
I think the BRAM recommendation is good to deal with a big state. If the limitation in the number of concurrent reads is a problem, consider replicating the state over multiple BRAMs.
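The replication idea can be sketched at the value level. This is a plain-Haskell illustration with made-up names, not a Clash API: each replica holds a full copy of the state and serves one read per cycle, so `k` replicas give `k` independent reads per cycle.

```haskell
type State = [Bool]

-- Read several addresses in the same cycle by dedicating one full
-- state replica (one BRAM) to each read port.
readParallel :: Int -> State -> [Int] -> [Bool]
readParallel replicas state addrs
  | length addrs <= replicas = map (state !!) addrs  -- one read per replica
  | otherwise = error "more simultaneous reads than replicas"
```

The trade-off is BRAM capacity for read bandwidth; writes must be broadcast to every replica to keep the copies consistent.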
In general I would avoid using really big vectors, Clash tends to struggle with big vectors and compilation becomes really slow even before they cause synthesis issues.
If the `getBit` gets really complicated, maybe there is a way to break down your state machine into multiple separate state machines that send each other messages? Or maybe there's a way to do some of the work sequenti…