Feature: Accelerate BMPremote SPI data phase by removing inter-byte gaps #1946

ALTracer · 2024-10-01T16:58:06Z

Detailed description

This is a perf fix to an existing feature.
The existing problem is significant gaps between SPI bytes in both read and write xfers, as driven by bmpflash + blackpill-f411ce (and likely other platforms).
This PR solves it by providing a continuous block xfer primitive (no IRQ and no DMA involved).

Tested to reduce bmpflash read -b int dump times from 35 to 31 seconds for an 8192 KiB w25q64 chip (using 12 MHz). The atomic section blocks interrupts for 170 microseconds (one 256-byte read); without it my patch made the board hang reliably (no read timeouts). I may rewrite this once more using direct register manipulation instead of the libopencm3 SPI API. Short reads, like SFDP, still show the normal gaps between command bytes (I didn't change them) but no gaps in the data page phase.
The acceleration is achieved by keeping a byte (actually an 8/16-bit SPI word) in flight behind the DR shadow register, which is how the peripheral is intended to be used. DMA bindings are harder and may result in channel/stream conflicts.
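
As a rough sanity check on the figures above, the ideal gap-free wire time is simply the bit count divided by the clock rate. The helper below uses a hypothetical name and is not part of the PR; it assumes 8-bit words and ignores chip-select and command overhead.

```c
#include <stdint.h>

/*
 * Ideal (gap-free) time in microseconds to clock `bytes` bytes over SPI
 * at `clock_hz`. Assumes 8 bits per word and no protocol overhead.
 * Hypothetical helper for illustration only.
 */
static uint64_t spi_wire_time_us(const uint64_t bytes, const uint64_t clock_hz)
{
	return (bytes * 8U * 1000000U) / clock_hz;
}
```

Under those assumptions, 256 bytes at 12 MHz come to about 170 microseconds, which matches the atomic section quoted above, and 8192 KiB come to roughly 5.6 seconds. That suggests most of the 31 seconds is spent outside the SPI data phase, though this is an inference from the arithmetic, not a measurement.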

Your checklist for this pull request

I've read the Code of Conduct
I've read the guidelines for contributing to this repository
It builds for hardware native (see Building the firmware)
It builds as BMDA (see Building the BMDA)
I've tested it to the best of my ability
My commit messages provide a useful short description of what the commits do

Closing issues

ALTracer · 2024-10-26T16:20:33Z

Rebased to main.
A one-line change in PR1968 now triggers a warning about reusing the tx buffer as an rx buffer.
I can split them into distinct TX and RX buffers.
The IRQ-masking DR-polling version can deal with one of them possibly being NULL. I'd like to rely on the TX buffer being non-null for simplicity, and containing zero bytes which can be submitted to SPI DR.
Another avenue is actually leveraging a DMA stream on F4 (or channel on F1) where platforms allow, and where it does not conflict with aux USART and SWO USART.
What is the API here -- submitting two unidirectional transfers, or submitting one transfer but specifying the Tx header length (and the rest is Rx)? I'm assuming full-duplex is time-divided into Tx then Rx. Block reads (and block writes) to 25-series flash are already implemented as a byte-wise Tx command then block-wise Rx.
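
To make the two API shapes in the question above concrete, here is a host-side sketch. Every name in it is hypothetical and none of it is taken from the PR; the fake bus only logs what was sent and answers with a counter. It assumes, as the comment does, that a transfer is time-divided into Tx then Rx.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for a SPI bus: logs sent bytes, replies with a counter. */
static uint8_t bus_log[16];
static size_t bus_log_len = 0;
static uint8_t bus_next_reply = 0xa0;

static uint8_t demo_bus_xfer(const uint8_t value)
{
	if (bus_log_len < sizeof(bus_log))
		bus_log[bus_log_len++] = value;
	return bus_next_reply++;
}

/* Shape 1: two unidirectional block calls (hypothetical names). */
static void demo_spi_write_block(const uint8_t *const tx, const size_t count)
{
	for (size_t i = 0; i < count; ++i)
		(void)demo_bus_xfer(tx[i]); /* received bytes are discarded */
}

static void demo_spi_read_block(uint8_t *const rx, const size_t count)
{
	for (size_t i = 0; i < count; ++i)
		rx[i] = demo_bus_xfer(0U); /* zero bytes keep the clock running */
}

/* Shape 2: one call, Tx header first, the remainder is Rx (hypothetical name). */
static void demo_spi_xfer_split(
	const uint8_t *const tx, const size_t tx_len, uint8_t *const rx, const size_t rx_len)
{
	demo_spi_write_block(tx, tx_len);
	demo_spi_read_block(rx, rx_len);
}
```

In this sketch the second shape is just a wrapper over the first, so under the Tx-then-Rx assumption the choice mostly affects call sites, not what goes over the wire.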

dragonmux

Apologies that this fell by the wayside. I've done an initial review, as with the revised platform configurations and such this is worth getting in for v2.0.

dragonmux · 2024-11-20T18:46:39Z

src/include/platform_support.h

@@ -77,6 +77,7 @@ bool platform_spi_deinit(spi_bus_e bus);

 bool platform_spi_chip_select(uint8_t device_select);
 uint8_t platform_spi_xfer(spi_bus_e bus, uint8_t value);
+void platform_spi_xfer_block(spi_bus_e bus, uint8_t *const data, size_t count);


The const here is correct on the buffer in the function definition itself, but should be dropped from this declaration per the clang-tidy lint about useless const.
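
A minimal sketch of the point above, using a hypothetical function name. In C, top-level qualifiers on parameters are not part of the function type, so a declaration without them is compatible with a definition that has them. The lint in question is presumably clang-tidy's readability-avoid-const-params-in-decls, which is an assumption on my part.

```c
#include <stddef.h>
#include <stdint.h>

/* Declaration: no top-level const on the parameters. */
void fill_block(uint8_t *data, size_t count);

/*
 * Definition: top-level const stops the body from reseating `data` or
 * changing `count`. The function type is unchanged, so this still
 * matches the declaration above.
 */
void fill_block(uint8_t *const data, const size_t count)
{
	for (size_t i = 0; i < count; ++i)
		data[i] = (uint8_t)i;
}
```

Callers see no difference either way, which is why the qualifier only earns its place in the definition.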

dragonmux · 2024-11-20T18:51:19Z

src/target/spi.c

@@ -67,9 +67,13 @@ void bmp_spi_read(const spi_bus_e bus, const uint8_t device, const uint16_t comm
 	bmp_spi_setup_xfer(bus, device, command, address);
 	/* Now read back the data that elicited */
 	uint8_t *const data = (uint8_t *const)buffer;
+#if 0


If you're adding this new functionality, please just add it - if you want to preserve working with the old API then introduce a #define in the platform header that can be tested for here to switch to the new implementation. This will then also fix the builds.

Yes, I felt like PLATFORM_HAS_SPI and PLATFORM_HAS_SPI_BLOCKWISE or so would be nice to add. The first macro would guard dummy impls in all but two platforms, the second macro would dispatch to calling a block xfer function instead of slow byte-wise call chains.
But first I wanted to evaluate flash size increase from this feature.
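
A sketch of how such a macro could switch implementations, assuming the PLATFORM_HAS_SPI_BLOCKWISE name floated above. The demo_ functions and the fake flash contents are hypothetical host-side stand-ins, not the real platform API.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed feature macro; a platform header would define it where supported. */
#define PLATFORM_HAS_SPI_BLOCKWISE 1

/* Prototypes as a platform header might declare them (hypothetical names). */
uint8_t demo_spi_xfer(uint8_t value);
void demo_spi_xfer_block(uint8_t *data, size_t count);

/* Stand-in device contents so the sketch runs on a host. */
static const uint8_t fake_flash[4] = {0xde, 0xad, 0xbe, 0xef};
static size_t fake_pos = 0;

uint8_t demo_spi_xfer(const uint8_t value)
{
	(void)value;
	return fake_flash[fake_pos++ % sizeof(fake_flash)];
}

void demo_spi_xfer_block(uint8_t *const data, const size_t count)
{
	for (size_t i = 0; i < count; ++i)
		data[i] = fake_flash[fake_pos++ % sizeof(fake_flash)];
}

/* Data phase of a read: block-wise where available, byte-wise otherwise. */
static void demo_spi_read_data(uint8_t *const data, const size_t length)
{
#if defined(PLATFORM_HAS_SPI_BLOCKWISE)
	demo_spi_xfer_block(data, length);
#else
	for (size_t i = 0; i < length; ++i)
		data[i] = demo_spi_xfer(0U);
#endif
}
```

Platforms that never define the macro keep the byte-wise path and need no stub for the block function.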

dragonmux · 2024-11-20T18:58:26Z

src/platforms/common/blackpill-f4/blackpill-f4.c

+		return;
+	}
+
+#if 0


What are the benefits and drawbacks of these two approaches? Can the simpler, more expressive loop get similar performance if interrupts are suspended with an atomic context?

No, it can't, because it busy-waits for the entire duration of each 8/16-bit SPI word in https://github.com/libopencm3/libopencm3/blob/201f5bcfb3fa70ee34818152463e7139f24db377/lib/stm32/common/spi_common_all.c#L189-L190
But thanks to that it does not submit an extra word in flight to keep data pumping, and hence cannot miss an Rx byte.

ok, fair enough - then drop the simpler loop please as there's no point keeping it in this new code.
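
The ordering behind the "word in flight" approach discussed above can be sketched against a tiny host-side model of the peripheral. This is not the PR's code: the sim_ names, the model, and the inverted-byte reply are all assumptions for illustration. The model has one Tx holding register, one shifter, and one Rx holding register, and it drops a received word if the previous one was not read in time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a SPI peripheral (hypothetical, host-side only). */
typedef struct {
	uint8_t tx_hold, shift, rx_hold;
	bool tx_full, shifting, rx_full;
	unsigned overruns;
} sim_spi_s;

/* Advance the model by one word time. */
static void sim_step(sim_spi_s *const spi)
{
	if (spi->shifting) {
		if (spi->rx_full)
			spi->overruns++; /* previous word unread: new word is lost */
		else {
			spi->rx_hold = (uint8_t)~spi->shift; /* fake device replies with the inverted byte */
			spi->rx_full = true;
		}
		spi->shifting = false;
	}
	if (spi->tx_full) { /* the queued word moves into the shifter */
		spi->shift = spi->tx_hold;
		spi->tx_full = false;
		spi->shifting = true;
	}
}

/* Write the data register; an idle shifter picks the word up immediately. */
static void sim_write(sim_spi_s *const spi, const uint8_t value)
{
	spi->tx_hold = value;
	spi->tx_full = true;
	if (!spi->shifting) {
		spi->shift = value;
		spi->tx_full = false;
		spi->shifting = true;
	}
}

/* Wait for a received word, then read it out. */
static uint8_t sim_read(sim_spi_s *const spi)
{
	while (!spi->rx_full)
		sim_step(spi);
	spi->rx_full = false;
	return spi->rx_hold;
}

/* Keep one word queued behind the word currently being shifted. */
static void xfer_block_pipelined(
	sim_spi_s *const spi, const uint8_t *const tx, uint8_t *const rx, const size_t count)
{
	if (count == 0)
		return;
	sim_write(spi, tx[0]); /* goes straight into the shifter */
	for (size_t i = 1; i < count; ++i) {
		sim_write(spi, tx[i]);      /* queue the next word */
		rx[i - 1] = sim_read(spi);  /* collect the previous one */
	}
	rx[count - 1] = sim_read(spi); /* drain the last word */
}
```

The model shows ordering only, not timing. It also hints at why the real loop masks interrupts: if the CPU is held up for more than one word time between reads, the next received word arrives while the holding register is still full and is lost.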

* Existing implementation has to walk up and down the function stack per byte, which is fine for commands and general poking
* 256-byte long page reads and writes can be accelerated because the length is known ahead of time
* Keep a byte in flight on stm32f1/f4 SPI (this is simpler than IRQ or DMA)

ALTracer force-pushed the feature/spi-perf branch from 4cf226f to a23c853 Compare October 26, 2024 16:15

dragonmux requested changes Nov 20, 2024


dragonmux added this to the v2.0 release milestone Nov 20, 2024

dragonmux added Enhancement General project improvement BMP Firmware Black Magic Probe Firmware (not PC hosted software) labels Nov 20, 2024

ALTracer added 4 commits November 21, 2024 19:00

spi: blackpill-f4: Split out const-qualified tx_buf

26dd5d8

native: Implement spi_xfer_block for faster reads

1430564

stlink, swlink: Add no-op stubs for spi_xfer_block

6a88dbf

ALTracer force-pushed the feature/spi-perf branch from f5f65a1 to 6a88dbf Compare November 21, 2024 16:04

ALTracer added 2 commits November 22, 2024 20:42

fixup! spi: blackpill-f4: Implement spi_xfer_block for faster reads

845ecb8

fixup! spi: blackpill-f4: Split out const-qualified tx_buf

8fb69b8
