Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPI CTRL0 register access unsafe - race condition can cause spurious transmissions #713

Closed
gavanfantom opened this issue Aug 17, 2023 · 6 comments · Fixed by #662
Closed
Labels
bug Something isn't working

Comments

@gavanfantom
Copy link

The SPIn_CTRL0 register contains the '''start''' field, which triggers the start of a transaction. The user guide cautions that this bit should not be set until pending transactions have finished. The SPI driver performs a read/modify/write operation on this register while a transaction is in progress, in order to change different fields.

This read/modify/write causes a 1 bit to be written to the '''start''' field if a transaction was in progress when the read was performed. It's not clear what happens if the 1 bit is written back while the transaction is still in progress, but it appears to be harmless. However, if the transaction finishes after the register is read but before it is written back, this will start a new transaction in the hardware.

It just so happens that on the MAX32670, if the instruction cache is enabled, and a 1 byte read/write transaction is performed, it hits this race condition and causes 2 bytes to be sent. Disabling the instruction cache works around the issue, but the underlying problem is still present. It is not safe to do a read/modify/write on this register when a SPI transaction is in progress, unless the '''start''' field is cleared to zero before the value is written back.

While I've found this with the spi_reva1.c driver, a quick read of the reva2 driver suggests that this issue may affect that driver too.

@Jake-Carter
Copy link
Contributor

Thanks for reporting this.

The most likely culprit is MXC_SPI_RevA1_MasterTransHandler.

This logic seems terrible...
image

The start bit is labeled as R/W1O, so clearing it to 0 as a solution should be safe and have no effect on device operation.

image

image

We're working on improved SPI drivers in #662. These v2 drivers have done away with the poorly placed ctrl0 interactions for SS assertion/deassertion, though we should fix this for v1 too.

@sihyung-maxim FYI v2 might still be susceptible to this here and here, but since we clear the EN bit beforehand it seems a lot less likely.

@Jake-Carter Jake-Carter added the bug Something isn't working label Aug 18, 2023
@Jake-Carter
Copy link
Contributor

@sihyung-maxim I'm linking the PR as a potential candidate to close this. We should test the reported condition with v2:

  • ICC enabled
  • TX length = 1 byte
  • RX length = 1 byte

Bug: TX byte is transmitted twice

@gavanfantom
Copy link
Author

The most likely culprit is MXC_SPI_RevA1_MasterTransHandler.

This logic seems terrible... image

That is exactly the problem, yes.

The start bit is labeled as R/W1O, so clearing it to 0 as a solution should be safe and have no effect on device operation.

That's right. I tested exactly that in those two places, and the fix worked:

    spi->ctrl0 = (spi->ctrl0 & ~MXC_F_SPI_REVA_CTRL0_START) | MXC_F_SPI_REVA_CTRL0_SS_CTRL;

    spi->ctrl0 &= ~(MXC_F_SPI_REVA_CTRL0_START | MXC_F_SPI_REVA_CTRL0_SS_CTRL);

@sihyung-maxim
Copy link
Contributor

@sihyung-maxim I'm linking the PR as a potential candidate to close this. We should test the reported condition with v2:

  • ICC enabled
  • TX length = 1 byte
  • RX length = 1 byte

Bug: TX byte is transmitted twice

This test case passes for v2. TX byte is only transmitted once. However, I will add a check to ensure a transaction is not in progress before setting the start bit.

As a side note: The v1 (non-DMA) implementation works by restarting the transaction every 32 messages. The MXC_SPI_RevA1_MasterTransHandler(...) function can only handle up to 32 messages per function call (the 32 messages limitation being the size of the SPI FIFO). This results in MXC_SPI_RevA1_MasterTransHandler(...) getting called several times for transaction lengths greater than 32 messages which means the SPI_CTRL0.ss_ctrl and SPI_CTRL0.start bits are repeatedly set or cleared every 32 messages.

The v2 does not restart the SPI block over the course of a transaction, so the SPI_CTRL0.ss_ctrl and SPI_CTRL0.start fields are only set/cleared once at the beginning of each transaction.

@khpeterson
Copy link
Contributor

khpeterson commented Aug 22, 2023

Are there other read/modify/writes of spi->ctrl0 that are vulnerable to this problem? I'm thinking of a case where a master DMA transaction follows a master transaction, but maybe clearing START as described above is enough.

khpeterson added a commit to khpeterson/analog-devices-msdk-fork that referenced this issue Aug 22, 2023
@khpeterson
Copy link
Contributor

FWIW, this fix resolves a case of "stuck DMA" that I have been fighting for a while. Occasionally, my app ends up in a state where SPI thinks its done but DMA is waiting to complete:

(gdb) p/x *MXC_SPI1
$35 = {{data32 = 0x0, data16 = {0x0, 0x0}, data8 = {0x0, 0x0, 0x0, 0x0}}, ctrl0 = 0x10003, ctrl1 = 0x1820182, ctrl2 = 0x800, ss_time = 0x10101, clk_cfg = 0x202, rsv_0x18 = 0x0, dma = 0x80408041, int_fl = 0x8807, int_en = 0x0, wake_fl = 0x0, wake_en = 0x0, stat = 0x0}
(gdb) p/x *MXC_DMA1
$36 = {cn = 0x3, intr = 0x0, rsv_0x8_0xff = {0x0 <repeats 62 times>}, ch = {{cfg = 0x80040210, st = 0x0, src = 0x20003886, dst = 0x14, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x80400011, st = 0x1, src = 0x20003724, dst = 0x20003a04, cnt = 0x4, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}, {cfg = 0x0, st = 0x0, src = 0x0, dst = 0x0, cnt = 0x0, src_rld = 0x0, dst_rld = 0x0, cnt_rld = 0x0}}}

Without the fix this my app gets into this state around 4 times in 8 hours of continuous operation; with it I have been able to run for nearly 24 hours with a single occurrence.

Thank you @gavanfantom!

Jake-Carter added a commit that referenced this issue Sep 21, 2023
- Clear START bit to 0 before writing back to register
- #713
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants