Constructing large, parallel, state machines #2414
-
Hi all. I'm trying to use Clash to construct a large state machine for a project that I'm working on. Clash and VHDL are still new to me and I'm hitting some problems that I can't see a way past, so I'm asking here to see what suggestions anyone might have. I've put together a state machine implementation using the `mealy` function from `Clash.Prelude`, but it doesn't scale (see below). I'm wondering if this is because the algorithm I'm using is at fault, and whether it's even possible for the nice properties I've got to hold for non-trivial problem sizes.

Here's a cut-down version of my code:

```haskell
module Example.CutDown (topEntity) where

import Clash.Prelude

type StateVecLen = 10
type StateVecIndexType = Signed 5

initialState :: Vec StateVecLen Bool
initialState =
     False
  :> False
  :> False
  :> False
  :> False
  :> False
  :> False
  :> False
  :> False
  :> True
  :> Nil

goalStateIndex :: StateVecIndexType
goalStateIndex = 0

getBit :: Vec StateVecLen Bool -> StateVecIndexType -> Bool
getBit currState index = case index of
  0 -> c 7 || c 4 || c 0
  1 -> c 0 || c 0 || c 5 || c 1 || c 1 || c 1
  2 -> c 9 || c 6 || c 2
  3 -> c 2 || c 2 || c 7 || c 3 || c 3 || c 3
  4 -> c 1 || c 8 || c 4
  5 -> c 4 || c 4 || c 9 || c 5 || c 5 || c 5
  6 -> c 3 || c 0 || c 6
  7 -> c 6 || c 6 || c 1 || c 7 || c 7 || c 7
  8 -> c 5 || c 2 || c 8
  9 -> c 8 || c 8 || c 3 || c 9 || c 9 || c 9
  _ -> False
 where
  c :: StateVecIndexType -> Bool
  c i = currState !! i

-- Bool input argument is currently unused.
stateMachine ::
  StateVecIndexType ->
  Vec StateVecLen Bool ->
  Bool ->
  (Vec StateVecLen Bool, Bool)
stateMachine goalBit state _input = (newState, goalReached)
 where
  newState :: Vec StateVecLen Bool
  newState = imap (\i _a -> getBit state (fromIntegral i)) state

  goalReached :: Bool
  goalReached = state !! goalBit

topEntity ::
  Clock System ->
  Reset System ->
  Enable System ->
  Signal System Bool ->
  Signal System Bool
topEntity = exposeClockResetEnable mealyMachine
 where
  mealyMachine ::
    HiddenClockResetEnable dom =>
    Signal dom Bool ->
    Signal dom Bool
  mealyMachine = mealy transferFunction initialState

  transferFunction ::
    Vec StateVecLen Bool ->
    Bool ->
    (Vec StateVecLen Bool, Bool)
  transferFunction = stateMachine goalStateIndex
```

This example code has only 10 boolean values as its state, and it all works fine. By that, I mean that I can run it through Clash, generate VHDL, and have Vivado successfully import and compile that VHDL. However, when I try to use this approach on a real-world problem that uses upwards of 8,000 booleans, it fails badly. Only these changes are needed to grow the example above to the real-world problem size:
(In fact, these steps are the reverse of how I cut down my real-world problem to produce the above example code.)

When I try processing the "full size" implementation, the Vivado tools I'm using (release 2022.1, I believe) really struggle to process the VHDL that Clash generates. The `vivado` compiler cannot compile it at all: it takes 6 hours to exhaust all of the available memory (115 GB). Even importing the files into the project is a struggle, as `srcscanner` falls over in the same way.

I've read around a bit, and I suspect my approach simply requires the FPGA implementation to do too much "routing" to get values into and out of the next-state calculation logic. One suggestion I've seen is to store the boolean values in a block RAM instead of in registers (as I believe my current code does). But my understanding is that block RAM can access only one or two values per cycle, which would force the calculation of the next-state values to be done in series, not in parallel.

To get to my question, then:
Thanks in advance for any insights or pointers that might help me out with this,
Paul.
-
Hello and welcome! Your code looks like it is generated code. And while I can see how it works, I have no idea what it does. For me, that makes giving advice rather difficult; I can't actually think and reason about the problem or suitable tactics for approaching it differently. Basile already makes several good points; I'll try to add to them.

FPGAs have great parallelism, yes. But computations and storage in actual designs still usually have a fair measure of locality. Data usually only travels relatively small distances over the chip, and neighbouring logic elements are better connected than elements far apart. Eventually, state from one side of the chip might reach the other side, but preferably not in a single clock cycle. If you have 8,000 bits of state and the next state of one bit depends on bits all over that current state, that does sound like it will give routing issues. The number of bits that determines the next state of a single bit is also relevant.

However, while FPGAs are great at parallelism, that doesn't mean it's a good solution to do everything in parallel. The clock speed the eventual circuit will run at is limited by the longest combinatorial path: the longest path some signal needs to propagate from the output of the register with the current state to the input of the register with the next state. If you do an awful lot in a single clock cycle, that path might get very long. If you have a circuit that does everything in a single clock cycle but can only run at 10 MHz, do realise that a circuit that does the same computation distributed over 20 clock cycles and achieving a clock speed of 200 MHz is just as quick in the end!

Finally, there's pipelining. Suppose you split your computation into 5 sequential steps, running at 200 MHz. You might say, okay, that means I can process 40 million pieces of data per second:
D1 is a piece of data (a datum if you will), and time flows from top to bottom. Left to right is data processing step. After 5 clock cycles, the data is processed. But if you can write a pipelined solution, it can actually process 200 million pieces of data: the circuit that does processing step 1 can already start on D2 when D1 has travelled to the circuit doing processing step 2. It processes 200 million pieces of data per second, it's just that there's a latency of 5 cycles before the result is available:
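The throughput-versus-latency point can be sketched with a plain-Haskell stream model (this is an illustration, not Clash: a lazy list stands in for a `Signal`, and `0` is an arbitrary register reset value):

```haskell
-- One pipeline stage: a processing step followed by a register,
-- modelled as a one-cycle delay on a stream of values.
stage :: (Int -> Int) -> [Int] -> [Int]
stage f xs = 0 : map f xs

-- Five chained stages: each input's result appears five cycles later,
-- yet a new input is accepted, and a result produced, every cycle.
pipeline :: [Int] -> [Int]
pipeline = stage (+1) . stage (+1) . stage (+1) . stage (+1) . stage (+1)
```

Feeding it `[10,20..]` yields `[0,1,2,3,4,15,25,35,...]`: five cycles of reset-value bubbles while the pipeline fills, and from then on one fully processed result (`+5`) per cycle.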
This is a lot of parallelism in a single 5-step computation. And you can do other computations with the area of the FPGA that you still have left. What it boils down to is: achieving parallelism is not about doing everything in a single combinatorial superfunction, it's about pipelining computations and having many computations running next to each other.

Note that block RAMs are pipelined. While it takes one or two cycles before you get your result, they will produce a result every clock cycle. Why one or two, you ask? The minimum is one, but you can often reach higher clock speeds if you enable the register that is on the output of the block RAM. This adds another cycle of delay, but might improve throughput. If you simply put a register on the output in Clash, the synthesis tool ought to be able to infer that you want that extra register in the block RAM enabled.
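That read behaviour can also be modelled as streams. This is a toy sketch in plain Haskell, not Clash's actual `blockRam` API; the reset values chosen here are arbitrary:

```haskell
-- Toy model of a pipelined block-RAM read port: it accepts a read
-- address every cycle and produces a result every cycle, one cycle
-- later. The first output is an undefined/reset value, modelled here
-- by the first RAM word.
ramRead :: [v] -> [Int] -> [v]
ramRead contents addrs = head contents : map (contents !!) addrs

-- Enabling the optional output register adds one more cycle of latency
-- (often allowing a higher clock) without reducing throughput: results
-- still come out one per cycle.
withOutputReg :: v -> [v] -> [v]
withOutputReg resetVal outs = resetVal : outs
```

For example, reading addresses `[2,0,1]` from contents `[10,20,30]` gives `[10,30,10,20]` (one bubble, then one result per cycle); with the output register enabled, the same reads arrive one cycle later.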
-
Hello!
I think the example is a little bit too abstract for me to understand how the requirements change as the problem scales.
I think the BRAM recommendation is good to deal with a big state. If the limitation in the number of concurrent reads is a problem, consider replicating the state over multiple BRAMs.
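The replication idea can be sketched at the value level. This is a plain-Haskell illustration with made-up names, not a Clash API: each replica holds a full copy of the state and serves one read per cycle, so `k` replicas give `k` independent reads per cycle.

```haskell
type State = [Bool]

-- Read several addresses in the same cycle by dedicating one full
-- state replica (one BRAM) to each read port.
readParallel :: Int -> State -> [Int] -> [Bool]
readParallel replicas state addrs
  | length addrs <= replicas = map (state !!) addrs  -- one read per replica
  | otherwise = error "more simultaneous reads than replicas"
```

The trade-off is BRAM capacity for read bandwidth; writes must be broadcast to every replica to keep the copies consistent.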
In general I would avoid using really big vectors, Clash tends to struggle with big vectors and compilation becomes really slow even before they cause synthesis issues.
If the `getBit` gets really complicated, maybe there is a way to break down your state machine into multiple separate state machines that send each other messages? Or maybe there's a way to do some of the work sequenti…