intel-adsp/ace: fix firmware panic on MTL #75108

lyakh · 2024-06-27T13:42:03Z

The power_down() function attempts to lock the hpsram_mask on-stack variable in data cache, which causes an exception. Moving it to .bss by making it static fixes it.

The power_down() function attempts to lock the hpsram_mask on-stack variable in data cache, which causes an exception. Moving it to .bss by making it static fixes it. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>

marc-hb · 2024-07-01T18:36:58Z

Hopefully fixes or at least avoids:

[BUG] DSP panic with Zephyr on Intel MTL, regression 27th June thesofproject/sof#9268

andyross · 2024-07-01T20:15:51Z

FWIW as far as poorly-understood fixes go, @kv2019i 's fix in #75285 seems to my gut to be closer to a root cause. Just making it static is likely just working by moving it around randomly, where "pad it to a cache line" seems like something the hardware would actually care about.

kv2019i · 2024-07-02T09:30:30Z

@dcpleung @nashif Either this or alternative #75285 would be good to get to 3.7.0. The latter might be less intrusive, but I'll give reviewers chance to have a look.

lyakh · 2024-07-02T11:02:48Z

FWIW as far as poorly-understood fixes go, @kv2019i 's fix in #75285 seems to my gut to be closer to a root cause. Just making it static is likely just working by moving it around randomly, where "pad it to a cache line" seems like something the hardware would actually care about.

@andyross don't you think that the problem is also actually related to that memory being on stack? It seems possible to me that the problem arises because that location ends up sharing cache lines with the new stack frame? So by moving it out of stack we remove that possibility completely?

kv2019i

Based on today's findings, problem is with dcache-locking a variable in middle of a stack frame, so this fix as good as #75285 . Given this has the approvals, why not just go with this as the fix.

andyross · 2024-07-02T14:42:18Z

don't you think that the problem is also actually related to that memory being on stack?

The hardware doesn't know what the "stack" is though. Xtensa doesn't have any (documented) special handling of a stack pointer, or internal stack engine, etc... the way big core architectures do. The A1 register is only SP by convention, ENTRY instructions work symmetrically with any register, window overflow and alloca exceptions do the memory accesses themselves with regular instruction addressing, etc...

I mean, absent getting Cadence people to explain the fault, I guess we'll never know. But my bet is on alignment and dcache line overlap as being closer to the real truth.

lyakh · 2024-07-02T15:11:38Z

The hardware doesn't know what the "stack" is though. Xtensa doesn't have any (documented) special handling of a stack pointer, or internal stack engine, etc... the way big core architectures do. The A1 register is only SP by convention, ENTRY instructions work symmetrically with any register, window overflow and alloca exceptions do the memory accesses themselves with regular instruction addressing, etc...

@andyross that's exactly what I meant - memory, pointed by the register, used in entry - that's "stack" in the sense that it's accessed by the DSP internally.

andyross · 2024-07-02T15:28:13Z

memory, pointed by the register, used in entry - that's "stack" in the sense that it's accessed by the DSP internally.

ENTRY doesn't touch memory though, it just adds an offset to the register based on some window bits in PS left by the CALLx instruction. It a weird instruction, but 100% internal to register state as far as the CPU is concerned. All the spilling and filling of registers from the stack is done in pure software, using regular[1] instructions.

Now, again, this is all just meaningless pontification without input from hardware people in Cadence. So no strong argument. Just about feelings.

[1] There actually are special L/S32E instructions for the loads/stores in the window exceptions, but they don't change the access at all, they use a different source for the MMU protection ring so that the access is done with the privilege of the trapping code and not the handler.

kv2019i · 2024-07-03T09:55:48Z

@andyross @lyakh I think this is specific to DPFL. If I replace just this one DPFL with either DPFR or DPFW (prefetch without lock, either for read or write), the test case works. And as it happens, this power_down() is only place where we use DPFL.

andyross · 2024-07-03T15:17:19Z

I think this is specific to DPFL

That's consistent with the theory that the CPU is hitting its "too many pinned cache lines" exception, FWIW. The regular prefetch instructions don't have that restriction and won't trigger that exception (because their semantics are just hints: they don't want to change memory semantics, they just want to make sure the line is in cache where possible).

andyross · 2024-07-03T15:19:26Z

Heh, given that there are only three cache ways: has anyone tried to white box this? Maybe it's as simple as the CPU designer had an off-by-one in the Verilog and the CPU traps after the first collision instead of the second? Wouldn't be hard to manually lock a line at &my_char_array[0], and &my_char_array[0x4000], and &my_char_array[0x8000]. Per docs the third is the one that's supposed to fail, but...

lyakh mentioned this pull request Jun 27, 2024

[BUG] DSP panic with Zephyr on Intel MTL, regression 27th June thesofproject/sof#9268

Closed

lyakh force-pushed the power-down branch from 0e8352f to 8381afa Compare July 1, 2024 13:06

lyakh changed the title ~~[DNM][TEST] fix firmware panic on MTL~~ intel-adsp/ace: fix firmware panic on MTL Jul 1, 2024

lyakh marked this pull request as ready for review July 1, 2024 13:07

nashif added this to the v3.7.0 milestone Jul 1, 2024

zephyrbot added the platform: Intel ADSP Intel Audio platforms label Jul 1, 2024

zephyrbot requested review from andyross, ceolin, dcpleung, jxstelter, kv2019i, lgirdwood, marc-hb, marcinszkudlinski, nashif, softwarecki and tmleman July 1, 2024 13:07

zephyrbot assigned dcpleung Jul 1, 2024

dcpleung approved these changes Jul 1, 2024

View reviewed changes

ceolin approved these changes Jul 1, 2024

View reviewed changes

marc-hb added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug labels Jul 1, 2024

kv2019i approved these changes Jul 2, 2024

View reviewed changes

aescolar merged commit 32d05d3 into zephyrproject-rtos:main Jul 3, 2024
29 checks passed

lyakh deleted the power-down branch July 3, 2024 13:44

lyakh mentioned this pull request Jul 3, 2024

[DNM][TEST] work around an exception during power down thesofproject/sof#9270

Closed

kv2019i mentioned this pull request Jul 3, 2024

intel_adsp/ace: power: pad the hpsram_mask passed to power_down #75285

Closed

tmleman mentioned this pull request Jul 23, 2024

intel_adsp/ace: power: pad the hpsram_mask passed to power_down #76046

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

intel-adsp/ace: fix firmware panic on MTL #75108

intel-adsp/ace: fix firmware panic on MTL #75108

lyakh commented Jun 27, 2024

marc-hb commented Jul 1, 2024 •

edited

Loading

andyross commented Jul 1, 2024

kv2019i commented Jul 2, 2024

lyakh commented Jul 2, 2024

kv2019i left a comment

andyross commented Jul 2, 2024

lyakh commented Jul 2, 2024

andyross commented Jul 2, 2024

kv2019i commented Jul 3, 2024

andyross commented Jul 3, 2024

andyross commented Jul 3, 2024

intel-adsp/ace: fix firmware panic on MTL #75108

intel-adsp/ace: fix firmware panic on MTL #75108

Conversation

lyakh commented Jun 27, 2024

marc-hb commented Jul 1, 2024 • edited Loading

andyross commented Jul 1, 2024

kv2019i commented Jul 2, 2024

lyakh commented Jul 2, 2024

kv2019i left a comment

Choose a reason for hiding this comment

andyross commented Jul 2, 2024

lyakh commented Jul 2, 2024

andyross commented Jul 2, 2024

kv2019i commented Jul 3, 2024

andyross commented Jul 3, 2024

andyross commented Jul 3, 2024

marc-hb commented Jul 1, 2024 •

edited

Loading