Perf improvements #314

GregAC · 2024-11-03T18:55:11Z

Proposed performance improvements (improving SRAM throughput and Ibex single cycle multiplier), rebased on top of the pinmux changes (#309).

Timing is passing and build time looking good.

Two things were required to achieve this:

Increasing the optimization level (Increase optimisation level #302), though this had no clear impact on build times
Adding a register stage to the internal SPI loopback (see commit dd85088). The timing path this cuts is odd, there's a single level of logic between two registers with almost 25ns worth of delay. I suspect it's something to do with the SPI timing constraints causing the tool to want the register to output at a hugely skewed time. We may just be on the edge of getting away with it and these changes have pushed it over the edge so regardless of whether we take these changes we should likely cut the internal loopback path.

I've also added saving the utilization and timing reports from the FPGA CI as artifacts so they can be examined after (in particular so you can look at the timing results for any particular PR).

`ios` isn't very accurate as these are only outputs.

Verilator's lint complains about this being implicit when selecting a value of 8. See future pinmapping commit.

These pins are only used for spi so don't need to be pinmuxed.

This removes blocks that have no options from the register map

Updates the definitions of the number of GPIO, UART and SPI devices that are available in the software definitions file for Sonata to match the changed nubmer of devices in the new pin mapping configuration. This commit aims to keep changes minimal to get tests passing - it does not look at solving any other issues with e.g. outdated definitions for now.

Updates the manual `pinmux_checker` application along with the `pinmux_all_blocks_check` that uses it to match the new pin mapping changes under Pinmux for 1.0. This involves changing the pinmux assignment for a couple of pins to match the changed order in the pinmux config changes. It also involves reducing the number of UARTs that can be used by the the pinmux checker from 1-2 down from 1-4, since the overall number of UARTs has also been reduced by recent harwdare changes, and thus these non-existent UART devices should not be exposed. It finally also changes the number of SPIs exposed to just SPI0 and SPI1 which are now exposed via Pinmux, and makes the necessary changes to pinmux checker to facilitate this.

…ernal flash Updates the pinmux tests for the changes introduced with the new pinmux mapping. This primarily involves just changing the pins / devices used to the devices that are now mapped on those pins, and updating documentation to reflect this. Also resets all pinmux logic after muxing to ensure that if more tests are added afterwards or if the tests are run multiple times, that errors are not introduced by the pinmux tests changing state. Since the Application flash pinmux pins are no longer available through pinmux, this also involves converting the SPI Pinmux test to change to instead test using an extrenal SPI PMOD SF3 conecting to PMOD1. This repurposes the existing test logic used by the pinmux checker to reduce the amonut of additional code.

Adds a basic PMOD SF3 DPI in Verilator, attached to the PMOD1 pins so that the device can be used. This is modelled to be the same as the existing Application Flash on Sonata, but with a different JEDEC ID that it reports. Makes the Spi Flash take a JEDEC ID as a parameter to reduce duplicated logic between these DPIs. Disclaimer: I am not a hardware engineer, the hardware changes may contain problems.

Co-authored-by: Elliot Baptist <elliot.baptist@lowrisc.org>

Timing failures have been observed on this loopback path, it has minimal logic levels but a very long delay, possibly due to I/O timing constraints. Adding the register stage cuts the internal path. With the register stage the internal loopback cannot run at full speed, however as this is for testing purposes only this is acceptable.

Need a minimum of 2 (this is what is used in OpenTitan) to enable back to back requests without stall cycles.

Update code from upstream repository https://github.com/lowrisc/cheriot-ibex.git to revision ea2df9db3bcea776f0dc72d6d89c31c73798ecd4 * Feed RV32M through ibexc_top_tracing/ibexc_top (Greg Chadwick) * Switch to no bitmanip by default (Greg Chadwick) * Feed RV32B through in ibexc_top (Greg Chadwick) Signed-off-by: Greg Chadwick <gac@lowrisc.org>

This is effectively a no-op change. Before the latest Ibex was vendored we had no bitmanip (the RV32BFull parameter was not fully passed through) and RV32M was the fast multiplier. Sadly the single cycle multiplier seems to be increasing timing pressure. It does just meet timing but greatly increases synthesis times. As it's implemented with in-built FPGA DSP blocks it shouldn't be a big issue to use it so something to examine here but for now leave things as they are.

GregAC · 2024-11-04T07:25:54Z

For reference here is the timing summary from the CI run (from the implementation reports artefact)

    WNS(ns)      TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints      WHS(ns)      THS(ns)  THS Failing Endpoints  THS Total Endpoints     WPWS(ns)     TPWS(ns)  TPWS Failing Endpoints  TPWS Total Endpoints  
    -------      -------  ---------------------  -------------------      -------      -------  ---------------------  -------------------     --------     --------  ----------------------  --------------------  
      0.043        0.000                      0                42250        0.066        0.000                      0                42249        1.741        0.000                       0                 15674

Here's the utilization report

+----------------------------+-------+-------+------------+-----------+-------+
|          Site Type         |  Used | Fixed | Prohibited | Available | Util% |
+----------------------------+-------+-------+------------+-----------+-------+
| Slice LUTs                 | 22736 |     0 |          0 |     32600 | 69.74 |
|   LUT as Logic             | 22628 |     0 |          0 |     32600 | 69.41 |
|   LUT as Memory            |   108 |     0 |          0 |      9600 |  1.13 |
|     LUT as Distributed RAM |   108 |     0 |            |           |       |
|     LUT as Shift Register  |     0 |     0 |            |           |       |
| Slice Registers            | 15330 |     0 |          0 |     65200 | 23.51 |
|   Register as Flip Flop    | 15330 |     0 |          0 |     65200 | 23.51 |
|   Register as Latch        |     0 |     0 |          0 |     65200 |  0.00 |
| F7 Muxes                   |   551 |     0 |          0 |     16300 |  3.38 |
| F8 Muxes                   |   142 |     0 |          0 |      8150 |  1.74 |
+----------------------------+-------+-------+------------+-----------+-------+

Build time was 11m 30s which is in line with previous build times in CI

elliotb-lowrisc · 2024-11-04T08:44:11Z

I've also seen an LCD SPI path with 24 ns of delay on a single net. The hold slack on the path was only 13 ns, so I suppose it might be hold-fixing gone wrong.

elliotb-lowrisc · 2024-11-04T11:38:08Z

.github/workflows/ci.yml

@@ -86,6 +86,8 @@ jobs:
    runs-on: [ubuntu-22.04-fpga, sonata]
    env:
      BITSTREAM_PATH: build/lowrisc_sonata_system_0/synth-vivado/lowrisc_sonata_system_0.bit
+      TIMING_RPT: build/lowrisc_sonata_system_0/synth-vivado/lowrisc_sonata_system_0.runs/impl_1/top_sonata_timing_summary_routed.rpt


Is there a way to make this conditional on whether "top_sonata_timing_summary_postroute_physopted.rpt" exists, so we avoid looking at the wrong report when we switch to the new 2024.1-compatible post-route optimisation flow?

I think easier to just remember to fix this up when we alter the tool version. It's reasonable for CI to specifically work with the currently agreed tool versions and need modest updates when we change versions.

elliotb-lowrisc · 2024-11-04T12:07:53Z

rtl/ip/spi/rtl/spi.sv

-  assign spi_cipo = reg2hw.control.int_loopback.q ? spi_copi_o : spi_cipo_i;
+  logic spi_copi_q;
+
+  always_ff @(posedge clk_i) begin


I have noticed the lack of reset but think it is fine. The u_spi_core/copi_shift_q driving it also lacks a reset, so this is just replicating the behaviour one cycle later.

yeah I prefer to leave resets off flops that don't strictly need them. Given the SPI ignores incoming data when it's not been specifically command to clock things in/out then it won't ever actually 'see' the first bit out of reset anyway.

elliotb-lowrisc · 2024-11-04T13:55:10Z

I've run a bitstream build of this PR and a baseline build with the changes it is based on (up to and including the opt=2 change) using Vivado 2021.1 on my local machine. The breaking of the SPI loopback path did seem to improve the overall WNS (-1.451 -> -0.391). Though, for these two builds at least, the TNS went the other way (-8.168 -> -27.834), though the numbers are still relatively small. Build times were also increase substantially (19m -> 32m) though that is muddied by builds going on in parallel. All this is somewhat concerning (for the state of Sonata if not this PR) but not conclusive. More testing to follow

elliotb-lowrisc · 2024-11-04T14:22:21Z

Adding timing constraint updates (#316) and building ono showed a more reasonable build time (15m) and timing passing.

GregAC · 2024-11-04T14:27:56Z

Thanks for checking it out @elliotb-lowrisc. Frustrating it's giving you some failures.

I've certainly see some flakey behaviour from the implementation flow, one version of this (pre pinmux changes) actually gave a noticeable improvement in timing, taking us to 0.5 ns WNS (positive)!

I've rebased this locally now the pinmux changes are in and am doing another run, will see what results!

GregAC · 2024-11-04T14:59:03Z

Well rebasing and running locally gives a terrible timing result (something like -1.5 WNS) and increased build times.

So sadly I think we'll have to reject this for 1.0 :( I will the try the two changes separately and see if they work in isolation.

As @elliotb-lowrisc says I think this is more a reflection of the current state of Sonata rather than the specific things this PR is doing, all too easy to push things the wrong way with what looks like modest design changes.

elliotb-lowrisc · 2024-11-04T15:00:34Z

I think the SPI loopback change is definitely worth a quick PR and merge if possible

GregAC · 2024-11-04T15:00:54Z

I think the SPI loopback change is definitely worth a quick PR and merge if possible

Agreed I'll spin out a separate PR

GregAC · 2024-11-06T09:34:17Z

I have had this just passing timing on latest main but it does look very tight and was starting to push up build times.

So I think we're best off doing this post-1.0. With @elliotb-lowrisc's crossbar change and a couple of simple Ibex timing fixes I'd hope this becomes a lot more comfortable.

HU90m and others added 16 commits November 1, 2024 15:07

pinmux: changed spi signals to match common names

c516525

pinmux: changed pwm signal name

c5d0f42

`ios` isn't very accurate as these are only outputs.

pinmux: explicitly set sizes for combined output selectors

96e6d7c

Verilator's lint complains about this being implicit when selecting a value of 8. See future pinmapping commit.

pinmux: Added remaining devices to the headers and pmod

bec0c19

sonata_system: dedicated spi devices pulled out of pinmux

c2265d4

These pins are only used for spi so don't need to be pinmuxed.

doc: updated pin mappings diagram to reflect spi changes

f9f30c6

pinmux: removed block inputs with no input options.

248a851

This removes blocks that have no options from the register map

Pinmapping uart and spi fixup

c27d4cd

top: added the pmodc gpio block and pins

cb76457

Add PMODC pins to XDC file

e523ec8

Co-authored-by: Elliot Baptist <elliot.baptist@lowrisc.org>

Increase optimisation level

a23c6e5

This was referenced Nov 3, 2024

Single cycle mult #295

Closed

Sram throughput #279

Closed

GregAC added 5 commits November 3, 2024 22:28

[ci] Save bitstream utilization and timing reports as artifacts

c4ff981

[rtl] Increase outstanding requests in SRAM wrapper

f0ff867

Need a minimum of 2 (this is what is used in OpenTitan) to enable back to back requests without stall cycles.

Switch to single cycle multiplier for Ibex

8235bc7

GregAC force-pushed the perf_improvements branch from dc38d53 to 8235bc7 Compare November 3, 2024 22:28

GregAC added the consider-1.0 label Nov 4, 2024

elliotb-lowrisc reviewed Nov 4, 2024

View reviewed changes

GregAC added post-1.0 and removed consider-1.0 labels Nov 6, 2024

marnovandermaas removed the post-1.0 label Nov 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perf improvements #314

Perf improvements #314

GregAC commented Nov 3, 2024

GregAC commented Nov 4, 2024

elliotb-lowrisc commented Nov 4, 2024

elliotb-lowrisc Nov 4, 2024 •

edited

Loading

GregAC Nov 4, 2024

elliotb-lowrisc Nov 4, 2024

GregAC Nov 4, 2024

elliotb-lowrisc commented Nov 4, 2024

elliotb-lowrisc commented Nov 4, 2024

GregAC commented Nov 4, 2024

GregAC commented Nov 4, 2024

elliotb-lowrisc commented Nov 4, 2024

GregAC commented Nov 4, 2024

GregAC commented Nov 6, 2024

Perf improvements #314

Are you sure you want to change the base?

Perf improvements #314

Conversation

GregAC commented Nov 3, 2024

GregAC commented Nov 4, 2024

elliotb-lowrisc commented Nov 4, 2024

elliotb-lowrisc Nov 4, 2024 • edited Loading

Choose a reason for hiding this comment

GregAC Nov 4, 2024

Choose a reason for hiding this comment

elliotb-lowrisc Nov 4, 2024

Choose a reason for hiding this comment

GregAC Nov 4, 2024

Choose a reason for hiding this comment

elliotb-lowrisc commented Nov 4, 2024

elliotb-lowrisc commented Nov 4, 2024

GregAC commented Nov 4, 2024

GregAC commented Nov 4, 2024

elliotb-lowrisc commented Nov 4, 2024

GregAC commented Nov 4, 2024

GregAC commented Nov 6, 2024

elliotb-lowrisc Nov 4, 2024 •

edited

Loading