[BUG] DSP panic with Zephyr on Intel MTL, regression 27th June #9268

Closed
kv2019i opened this issue Jun 27, 2024 · 33 comments
Assignees
Labels
bug Something isn't working as expected MTL Applies to Meteor Lake platform P1 Blocker bugs or important features urgent Zephyr Issues only observed with Zephyr integrated

Comments

kv2019i (Collaborator) commented Jun 27, 2024

Describe the bug
A DSP panic started showing up in CI runs on 27th June. No individual PR merged recently shows this in test results, so the suspicion is a combination of merged PRs. Current suspects:

Both passed testing independently, but together there seem to be DSP panics.

To Reproduce
https://sof-ci.01.org/sofpr/PR9267/build6051/devicetest/index.html

Reproduction Rate
Very high (most PR CI runs on 27th June)

Expected behavior
No DSP panics

Impact
Blocking CI

Environment
https://sof-ci.01.org/sofpr/PR9265/build6047/devicetest/index.html
https://sof-ci.01.org/sofpr/PR9267/build6051/devicetest/index.html

kv2019i added the bug, P1, Zephyr and MTL labels Jun 27, 2024
lyakh (Collaborator) commented Jun 27, 2024

git bisect reliably pointed to an SOF commit that obviously cannot be the real reason: the commit only adds 2 new functions without calling them and without touching a single line of runnable code, so it purely modifies the build layout.
The crash happens at https://github.com/zephyrproject-rtos/zephyr/blob/b82b5b0734c30490368e21627423f10036487343/soc/intel/intel_adsp/ace/power_down.S#L64
That location is trying to lock into the cache a variable on the caller's stack https://github.com/zephyrproject-rtos/zephyr/blob/b82b5b0734c30490368e21627423f10036487343/soc/intel/intel_adsp/ace/power.c#L342 which all looks valid. However, it causes an exception. As a workaround, moving that variable to .bss by making it static seems to eliminate the problem.
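
For illustration only, the workaround amounts to something like the sketch below; the function names and the power_down() signature here are assumptions made for the example, not the actual Zephyr code:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stub standing in for the real power_down() entry point,
 * which locks the cache line holding *hpsram_mask with DPFL. */
static void power_down_stub(bool disable_lpsram, uint32_t *hpsram_mask,
			    bool response_to_ipc)
{
	(void)disable_lpsram;
	(void)hpsram_mask;
	(void)response_to_ipc;
}

static void pm_state_set_sketch(uint32_t mask)
{
	/* Workaround: a static lives in .bss rather than in the windowed
	 * stack frame, so the DPFL in power_down() no longer targets a
	 * line that window-exception handling may also be spilling to. */
	static uint32_t hpsram_mask;

	hpsram_mask = mask;
	power_down_stub(true, &hpsram_mask, true);
}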

lyakh (Collaborator) commented Jun 27, 2024

@ceolin any ideas?

lyakh (Collaborator) commented Jun 27, 2024

teburd (Contributor) commented Jun 27, 2024

What does the assembly dump look like for pm_state_set and power_down?

andyross (Contributor) commented:

One thing to clean up in the power_down assembly is window exceptions. You're called in a context where register windowing is enabled, which means that any access to registers other than A0-A3 can in principle trap. And the first use of A11+ (the last window, after which no window exceptions will occur) doesn't happen until after you've locked three data and four instruction lines into the cache. I don't see why that's illegal, but if I wanted to place bets on "how to exercise weird core behavior", this would be on the list. Strongly suggest "pre-spilling" all registers by e.g. putting an "and a15, a15, a15" (to force window exceptions) right after the ENTRY instruction.

Also, I note there are two "MOVI" pseudo instructions after you start locking cache lines which are going to end up pointing into an arbitrary literals location. You need an explicit L32R to be sure it lands in valid memory, not a compiler-generated MOVI.

andyross (Contributor) commented:

Also, I note there are two "MOVI" pseudo instructions after you start locking cache lines

Oooh, and there are lots more in asm_memory_management.h in contexts where you've already started shutting down memory! I'm going to place my chips on this as the culprit. I give it 60%+ odds.

Someone needs to comb through these files and make sure there's no "MOVI" usage (which again to be clear: is only a CPU instruction for small values, for big ones the compiler gets fancy and emits a .literals record for the linker to find and place). I mean, maybe I'll lose the bet and these will turn out to all be valid/non-loading variants. But still I think style would demand this be cleaned up.

lyakh (Collaborator) commented Jun 28, 2024

Also, I note there are two "MOVI" pseudo instructions after you start locking cache lines

Oooh, and there are lots more in asm_memory_management.h in contexts where you've already started shutting down memory! I'm going to place my chips on this as the culprit. I give it 60%+ odds.

Someone needs to comb through these files and make sure there's no "MOVI" usage (which again to be clear: is only a CPU instruction for small values, for big ones the compiler gets fancy and emits a .literals record for the linker to find and place). I mean, maybe I'll lose the bet and these will turn out to all be valid/non-loading variants. But still I think style would demand this be cleaned up.

@andyross ouch, yes, that sounds like a problem. Thanks for finding it! Now somebody just needs to fix it...

lyakh (Collaborator) commented Jun 28, 2024

Also, I note there are two "MOVI" pseudo instructions after you start locking cache lines which are going to end up pointing into an arbitrary literals location. You need an explicit L32R to be sure it lands in valid memory, not a compiler-generated MOVI.

@andyross I looked again at these. And TBH I don't see a problem with those specific ones. Here are the lines we're talking about:
https://github.com/zephyrproject-rtos/zephyr/blob/cfbe2adabc511663776642616cdc75510db882d3/soc/intel/intel_adsp/ace/power_down.S#L53-L61
Firstly, they just lock some cache lines, all the memory is still powered on. The problem would occur if we tried to access RAM after it's powered off, correct? So, those locations before powering down RAM should be safe? Same about these two lines https://github.com/zephyrproject-rtos/zephyr/blob/cfbe2adabc511663776642616cdc75510db882d3/soc/intel/intel_adsp/ace/power_down.S#L67-L68

Oooh, and there are lots more in asm_memory_management.h in contexts where you've already started shutting down memory! I'm going to place my chips on this as the culprit. I give it 60%+ odds.

These ones - yes, agree, look dangerous...

lyakh (Collaborator) commented Jun 28, 2024

These ones - yes, agree, look dangerous...

@andyross @teburd that was a nice theory, but it doesn't seem to help: zephyrproject-rtos/zephyr#75174 doesn't fix the problem. What exactly did we bet, Andy? ;-)

andyross (Contributor) commented Jun 28, 2024

Bah. I was sure I had it. Is the panic you're seeing on the DPFL instruction itself? Reading the ISA ref, that's only supposed to happen if:

  1. The MPU or MMU doesn't provide access to the virtual address. Obviously not relevant on MTL which has neither.
  2. The cache in question (dcache or icache) does not have two[1] free lines/ways available at the addressed index.

Possibility 2 seems not... entirely impossible? This gets down to details about the hardware cache layout, which aren't completely clear from core-isa.h. But my read is that MTL has a 48k dcache laid out in three ways with a 16k stride. So if you have two cache lines pinned in the dcache at the same address modulo 16k, you can't add a third. I see the code here adding two essentially unrestricted dcache addresses (the literals and the stack). Is it possible there's another somewhere else, maybe leftover from the ROM? If so, then bad luck with memory layout would (I think) be able to make this happen.

[1] Presumably "two" so that there's always one available to populate via normal memory operation
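
To make the index arithmetic concrete, here is a rough standalone illustration; the 16 KiB way stride and the example addresses are assumptions based on the description above, not verified MTL values:

#include <stdint.h>
#include <stdio.h>

#define DCACHE_WAY_STRIDE 0x4000u   /* 16 KiB per way, 48 KiB / 3 ways (assumed) */

/* The set index is determined by the address bits below the way stride,
 * i.e. the low 14 bits here. */
static unsigned int dcache_index(uintptr_t addr)
{
	return (unsigned int)(addr & (DCACHE_WAY_STRIDE - 1));
}

int main(void)
{
	/* Hypothetical addresses of the locked literals line and the
	 * on-stack variable targeted by DPFL. */
	uintptr_t literals  = 0xa0104f40;
	uintptr_t stack_var = 0xa0138f40;

	/* Same low 14 bits -> same index.  With a third line already pinned
	 * there (e.g. left behind by the ROM), a 3-way cache has no free way
	 * left at that index and DPFL raises an exception. */
	printf("literals index 0x%04x, stack index 0x%04x -> %s\n",
	       dcache_index(literals), dcache_index(stack_var),
	       dcache_index(literals) == dcache_index(stack_var)
		       ? "collision" : "no collision");
	return 0;
}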

andyross (Contributor) commented:

(Deleted a comment again to avoid confusion: thought I had it, but missed a spot where it's loading the mask values.)

FWIW, regarding the earlier point: it wouldn't be too hard in the failing configuration to dump the actual addresses and see if the low 14 bits of the mask and literals regions match (there are two lines in literals).

And if that does turn out to be the problem, you could resolve it by reserving space in that literals area and copying the mask words there. That way they live sequentially in memory and can't collide on cache index.

marc-hb (Collaborator) commented Jun 29, 2024

IPC4 Daily tests are mostly red right now because of this, which means other, unrelated regressions WILL sneak in unnoticed.

tl;dr: IPC4 CI is dead right now.

ceolin (Contributor) commented Jul 1, 2024

(Deleted a comment again to avoid confusion: thought I had it, but missed a spot where it's loading the mask values.)

FWIW, regarding the earlier point: it wouldn't be too hard in the failing configuration to dump the actual addresses and see if the low 14 bits of the mask and literals regions match (there are two lines in literals).

And if that does turn out to be the problem, you could resolve it by reserving space in that literals area and copying the mask words there. That way they live sequentially in memory and can't collide on cache index.

If I had to bet, I would put my money here. I was looking at the documentation and stumbled on the same implementation notes that you mentioned. That would explain why moving that variable off the stack solves the problem.

lyakh (Collaborator) commented Jul 1, 2024

If I had to bet, I would put my money here. I was looking at the documentation and stumbled on the same implementation notes that you mentioned. That would explain why moving that variable off the stack solves the problem.

@ceolin @andyross this bug is making me rich. Looks like this idea isn't correct either. We've tried various ways to unlock or even to free all cache lines - no success. Maybe we do have to use zephyrproject-rtos/zephyr#75108 as long as we don't have a better solution

lyakh (Collaborator) commented Jul 1, 2024

There's also an inconsistency in the Zephyr code at https://github.com/zephyrproject-rtos/zephyr/blob/2c34da96f0e3ba07764db3ac7def9b400bbd1729/soc/intel/intel_adsp/ace/power_down.S#L92-L104 - that code expects pu32_hpsram_mask to point to an array of MAX_MEMORY_SEGMENTS masks, whereas when locking at https://github.com/zephyrproject-rtos/zephyr/blob/2c34da96f0e3ba07764db3ac7def9b400bbd1729/soc/intel/intel_adsp/ace/power_down.S#L63-L64 the code only locks one cache line. And it isn't just that it assumes the array will fit in one cache line - it isn't even guaranteed to be cache-line aligned, so an array of even just 2 elements could cross 2 cache lines. And the caller at https://github.com/zephyrproject-rtos/zephyr/blob/2c34da96f0e3ba07764db3ac7def9b400bbd1729/soc/intel/intel_adsp/ace/power.c#L342-L352 doesn't bother with an array at all, it just uses a single 32-bit value for a mask.
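
For illustration, a declaration matching what the assembly expects - MAX_MEMORY_SEGMENTS entries, cache-line aligned and padded to whole lines - could look like the sketch below; the macro values and attribute spelling are assumptions, not the actual headers:

#include <stdint.h>

#define MAX_MEMORY_SEGMENTS	1	/* assumed per-platform value */
#define DCACHE_LINE_SIZE	64	/* assumed XCHAL_DCACHE_LINESIZE */
#define MASKS_PER_LINE		(DCACHE_LINE_SIZE / sizeof(uint32_t))

/* One mask per memory segment, rounded up to whole cache lines and aligned
 * to a line boundary.  If this ever spans more than one line, the locking
 * code in power_down.S would also have to DPFL each of those lines. */
static uint32_t hpsram_mask[((MAX_MEMORY_SEGMENTS + MASKS_PER_LINE - 1) /
			     MASKS_PER_LINE) * MASKS_PER_LINE]
	__attribute__((aligned(DCACHE_LINE_SIZE)));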

kv2019i added a commit to kv2019i/zephyr that referenced this issue Jul 1, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by padding the hpsram_mask to a full cacheline.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
kv2019i (Collaborator, Author) commented Jul 1, 2024

One more idea. The exception seems to be right after the first reference to the windowed registers set up by call8. This is when the window overflow would be handled. The registers are stored to the stack frame, so these writes may go to the same cache line we are trying to lock with dpfl (on the caller's stack) -- or there is some other interaction between window overflow and dpfl. An ugly change following this hypothesis seems to be holding up -> zephyrproject-rtos/zephyr#75285 -- let's see full results.
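
A rough sketch of what such a change could look like on the caller's side (the names, sizes and the commented-out call are assumptions; the actual change is the one in the linked Zephyr PR):

#include <stdint.h>
#include <string.h>

#define DCACHE_LINE_SIZE 64	/* assumed XCHAL_DCACHE_LINESIZE */

void pm_state_set_padded_sketch(uint32_t mask)
{
	/* Give the mask a full cache line of its own: aligned to a line
	 * boundary and padded to the line size, so the register spills
	 * performed by window-overflow handling land on neighbouring
	 * lines, never on the line power_down() locks with dpfl. */
	uint32_t hpsram_mask[DCACHE_LINE_SIZE / sizeof(uint32_t)]
		__attribute__((aligned(DCACHE_LINE_SIZE)));

	memset(hpsram_mask, 0, sizeof(hpsram_mask));
	hpsram_mask[0] = mask;

	/* power_down(true, hpsram_mask, true);  -- hypothetical call */
	(void)hpsram_mask;
}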

kv2019i closed this as completed Jul 4, 2024
eddy1021 pushed a commit to eddy1021/sof that referenced this issue Jul 15, 2024
Update Zephyr baseline to 650227d8c47f

Change affecting SOF build targets:

32d05d360b93 intel_adsp/ace: power: fix firmware panic on MTL
a3835041bd36 intel_adsp/ace: power: Use MMU reinit API on core context restore
a983a5e399fd dts: xtensa: intel: Remove non-existent power domains from ACE30 PTL DTS
a2eada74c663 dts: xtensa: intel: Remove ALH nodes from ACE 3.0 PTL DTS
442e697a8ff7 dts: xtensa: intel: Reorder power domains by bit position in ACE30
d1b5d7092e5a intel_adsp: ace30: Correct power control register bitfield definitions
31c96cf3957b xtensa: check stack boundaries during backtrace
5b84bb4f4a55 xtensa: check stack frame pointer before dumping registers
cb9f8b1019f1 xtensa: separate FATAL EXCEPTION printout into two
e9c23274afa2 Revert "soc: intel_adsp: only implement FW_STATUS boot protocol for cavs"
1198c7ec295b Drivers: DAI: Intel: Move ACE DMIC start reset clear to earlier
78920e839e71 Drivers: DAI: Intel: Reduce traces dai_dmic_start()
9db580357bc6 Drivers: DAI: Intel: Remove trace from dai_dmic_update_bits()
f91700e62968 linker: nxp: adsp: add orphan linker section

Link: thesofproject#9268
Link: thesofproject#9243
Link: thesofproject#9205
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
tmleman pushed a commit to tmleman/zephyr that referenced this issue Jul 18, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
marc-hb (Collaborator) commented Jul 19, 2024

so even if complex, it seems to have been fairly low-maintenance to keep using this approach.

Did you say low maintenance?

Sorry if I'm jumping to conclusions; I couldn't resist :-)

tmleman pushed a commit to tmleman/zephyr that referenced this issue Jul 26, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
(cherry picked from commit b767597)
tmleman pushed a commit to tmleman/zephyr that referenced this issue Jul 29, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
tmleman pushed a commit to tmleman/zephyr that referenced this issue Aug 5, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
tmleman pushed a commit to tmleman/zephyr that referenced this issue Aug 5, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
carlescufi pushed a commit to zephyrproject-rtos/zephyr that referenced this issue Aug 8, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
coreboot-org-bot pushed a commit to coreboot/zephyr-cros that referenced this issue Aug 9, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

(cherry picked from commit 2fcdbba)

Original-Link: thesofproject/sof#9268
Original-Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
GitOrigin-RevId: 2fcdbba
Cr-Build-Id: 8740088176811114337
Cr-Build-Url: https://cr-buildbucket.appspot.com/build/8740088176811114337
Copybot-Job-Name: zephyr-main-copybot-downstream
Change-Id: Ia4459bc8d6bbea78f2d1e4a4601d0396b2f3b7ef
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/5776799
Tested-by: ChromeOS Prod (Robot) <chromeos-ci-prod@chromeos-bot.iam.gserviceaccount.com>
Commit-Queue: Jeremy Bettis <jbettis@chromium.org>
Tested-by: Jeremy Bettis <jbettis@chromium.org>
Reviewed-by: Jeremy Bettis <jbettis@chromium.org>
lyakh added a commit to lyakh/sof that referenced this issue Sep 5, 2024
thesofproject#9268 seems to be back,
Zephyr PR 78057 is a new attempt to fix it.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
lyakh (Collaborator) commented Sep 5, 2024

lyakh reopened this Sep 5, 2024
lyakh (Collaborator) commented Sep 5, 2024

and a new attempt to fix it: zephyrproject-rtos/zephyr#78057

lyakh added a commit to lyakh/sof that referenced this issue Sep 5, 2024
thesofproject#9268 seems to be back,
Zephyr PR 78057 is a new attempt to fix it.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
andyross (Contributor) commented Sep 5, 2024

Posted this in the PR by accident, copying here:

FWIW: I still bet that if you whiteboxed a rig to enumerate all the dcache indexes and check how many of them can be pinned, we'd discover that something in the boot ROM or elsewhere has left a line pinned accidentally, preventing more pinning at that index, and we're just hitting that by bad luck in the linker.

Would be non-trivial assembly to write, and tedious to debug as the only feedback is a panic, but not "hard" hopefully.

lyakh added a commit to lyakh/sof that referenced this issue Sep 5, 2024
thesofproject#9268 seems to be back,
Zephyr PR 78057 is a new attempt to fix it.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
lyakh (Collaborator) commented Sep 6, 2024

@andyross I think I tested that back when debugging this last time, by doing one or both of the following:

  1. trying to lock (and unlock) cache lines at all offsets to cover the entire associativity set - all succeeded
  2. trying to speculatively unlock the entire data cache by using DIU:
diff --git a/soc/intel/intel_adsp/ace/power.c b/soc/intel/intel_adsp/ace/power.c
index 20efc8e6f97..61830897397 100644
--- a/soc/intel/intel_adsp/ace/power.c
+++ b/soc/intel/intel_adsp/ace/power.c
@@ -46,6 +46,9 @@ __imr void power_init(void)
 	cache_data_flush_range((__sparse_force void *)
 			sys_cache_cached_ptr_get(&adsp_pending_buffer),
 			sizeof(adsp_pending_buffer));
+	/* unlock entire data cache */
+	for (unsigned int i = 0; i < 16 * 1024; i += 64)
+	__asm__ __volatile__("diu %0, 0" : : "r"(0xa0000000 + i));
 #endif /* CONFIG_SOC_INTEL_ACE15_MTPM */
 }

Neither had helped

andyross (Contributor) commented Sep 7, 2024

@lyakh my read is there are three ways, so you need to prove you can lock two lines at each index to saturate. IIRC there are two separate dcache regions in the code in question, right? So if those overlap in index and happen to collide with something already in the cache, then we'd be in a situation where we're trying to lock the same index in all three ways, which is illegal.

Maybe. :)

lyakh (Collaborator) commented Sep 9, 2024

@andyross sure, but as I said - I also tried unlocking all cache lines and it didn't help either

wszypelt commented:

@lyakh is everything working now? Can we close this issue?

marc-hb (Collaborator) commented Sep 20, 2024

@wszypelt the last words are "... and it didn't help either"

?

lyakh (Collaborator) commented Sep 23, 2024

@wszypelt the last words are "... and it didn't help either"

?

@marc-hb that was a reply to a specific question, not a statement about the state of this bug

lyakh (Collaborator) commented Sep 23, 2024

@lyakh is everything working now? Can we close this issue?

@wszypelt yes, so far zephyrproject-rtos/zephyr#78283 has fixed it

lyakh closed this as completed Sep 23, 2024

7 participants