[BUG] DSP panic with Zephyr on Intel MTL, regression 27th June #9268

Closed
kv2019i opened this issue Jun 27, 2024 · 33 comments
Assignees
Labels
bug Something isn't working as expected MTL Applies to Meteor Lake platform P1 Blocker bugs or important features urgent Zephyr Issues only observed with Zephyr integrated

Comments

kv2019i (Collaborator) commented Jun 27, 2024

Describe the bug
A DSP panic started showing up in CI runs on 27th June. No individual PR merged recently shows this in test results, so the suspicion is a combination of merged PRs. Current suspects:

Both passed testing independently, but together there seem to be DSP panics.

To Reproduce
https://sof-ci.01.org/sofpr/PR9267/build6051/devicetest/index.html

Reproduction Rate
Very high (most PR CI runs on 27th June)

Expected behavior
No DSP panics

Impact
Blocking CI

Environment
https://sof-ci.01.org/sofpr/PR9265/build6047/devicetest/index.html
https://sof-ci.01.org/sofpr/PR9267/build6051/devicetest/index.html

kv2019i added the bug, P1, Zephyr and MTL labels Jun 27, 2024
lyakh (Collaborator) commented Jun 27, 2024

git bisect reliably pointed to an SOF commit that obviously cannot be the real reason: the commit only adds 2 new functions without calling them and without touching a single line of runnable code, so it purely modifies the build layout.
The crash happens at https://github.com/zephyrproject-rtos/zephyr/blob/b82b5b0734c30490368e21627423f10036487343/soc/intel/intel_adsp/ace/power_down.S#L64
That location is trying to lock into the cache a variable on the caller's stack https://github.com/zephyrproject-rtos/zephyr/blob/b82b5b0734c30490368e21627423f10036487343/soc/intel/intel_adsp/ace/power.c#L342 which all looks valid. However, it causes an exception. As a workaround, moving that variable to .bss by making it static seems to eliminate the problem.
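
For illustration only, the workaround amounts to something like the sketch below; the function names and the power_down() signature here are assumptions made for the example, not the actual Zephyr code:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stub standing in for the real power_down() entry point,
 * which locks the cache line holding *hpsram_mask with DPFL. */
static void power_down_stub(bool disable_lpsram, uint32_t *hpsram_mask,
			    bool response_to_ipc)
{
	(void)disable_lpsram;
	(void)hpsram_mask;
	(void)response_to_ipc;
}

static void pm_state_set_sketch(uint32_t mask)
{
	/* Workaround: a static lives in .bss rather than in the windowed
	 * stack frame, so the DPFL in power_down() no longer targets a
	 * line that window-exception handling may also be spilling to. */
	static uint32_t hpsram_mask;

	hpsram_mask = mask;
	power_down_stub(true, &hpsram_mask, true);
}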

lyakh (Collaborator) commented Jun 27, 2024

@ceolin any ideas?

lyakh (Collaborator) commented Jun 27, 2024

teburd (Contributor) commented Jun 27, 2024

What does the assembly dump look like for pm_state_set and power_down?

andyross (Contributor) commented:

One thing to clean up in the power_down assembly is window exceptions. You're called in a context where register windowing is enabled, which means that any access to registers other than A0-A3 can in principle trap. And the first use of A11+ (the last window, after which no window exceptions will occur) doesn't happen until after you've locked three data and four instruction lines into the cache. I don't see why that's illegal, but if I wanted to place bets on "how to exercise weird core behavior", this would be on the list. Strongly suggest "pre-spilling" all registers by e.g. putting an "and a15, a15, a15" (to force window exceptions) right after the ENTRY instruction.

Also, I note there are two "MOVI" pseudo instructions after you start locking cache lines which are going to end up pointing into an arbitrary literals location. You need an explicit L32R to be sure it lands in valid memory, not a compiler-generated MOVI.

andyross (Contributor) commented:

Also, I note there are two "MOVI" pseudo instructions after you start locking cache lines

Oooh, and there are lots more in asm_memory_management.h in contexts where you've already started shutting down memory! I'm going to place my chips on this as the culprit. I give it 60%+ odds.

Someone needs to comb through these files and make sure there's no "MOVI" usage (which again to be clear: is only a CPU instruction for small values, for big ones the compiler gets fancy and emits a .literals record for the linker to find and place). I mean, maybe I'll lose the bet and these will turn out to all be valid/non-loading variants. But still I think style would demand this be cleaned up.

lyakh (Collaborator) commented Jun 28, 2024

Also, I note there are two "MOVI" pseudo instructions after you start locking cache lines

Oooh, and there are lots more in asm_memory_management.h in contexts where you've already started shutting down memory! I'm going to place my chips on this as the culprit. I give it 60%+ odds.

Someone needs to comb through these files and make sure there's no "MOVI" usage (which again to be clear: is only a CPU instruction for small values, for big ones the compiler gets fancy and emits a .literals record for the linker to find and place). I mean, maybe I'll lose the bet and these will turn out to all be valid/non-loading variants. But still I think style would demand this be cleaned up.

@andyross ouch, yes, that sounds like a problem. Thanks for finding it! Now somebody just needs to fix it...

lyakh (Collaborator) commented Jun 28, 2024

Also, I note there are two "MOVI" pseudo instructions after you start locking cache lines which are going to end up pointing into an arbitrary literals location. You need an explicit L32R to be sure it lands in valid memory, not a compiler-generated MOVI.

@andyross I looked again at these. And TBH I don't see a problem with those specific ones. Here are the lines we're talking about:
https://github.com/zephyrproject-rtos/zephyr/blob/cfbe2adabc511663776642616cdc75510db882d3/soc/intel/intel_adsp/ace/power_down.S#L53-L61
Firstly, they just lock some cache lines, all the memory is still powered on. The problem would occur if we tried to access RAM after it's powered off, correct? So, those locations before powering down RAM should be safe? Same about these two lines https://github.com/zephyrproject-rtos/zephyr/blob/cfbe2adabc511663776642616cdc75510db882d3/soc/intel/intel_adsp/ace/power_down.S#L67-L68

Oooh, and there are lots more in asm_memory_management.h in contexts where you've already started shutting down memory! I'm going to place my chips on this as the culprit. I give it 60%+ odds.

These ones - yes, agree, look dangerous...

lyakh (Collaborator) commented Jun 28, 2024

These ones - yes, agree, look dangerous...

@andyross @teburd that was a nice theory, but it doesn't seem to help: zephyrproject-rtos/zephyr#75174 doesn't fix the problem. What exactly did we bet, Andy? ;-)

andyross (Contributor) commented Jun 28, 2024

Bah. I was sure I had it. Is the panic you're seeing on the DPFL instruction itself? Reading the ISA ref, that's only supposed to happen if:

  1. The MPU or MMU doesn't provide access to the virtual address. Obviously not relevant on MTL which has neither.
  2. The cache in question (dcache or icache) does not have two[1] free lines/ways available at the addressed index.

Possibility 2 seems not... entirely impossible? This gets down to details about the hardware cache layout, which aren't completely clear from core-isa.h. But my read is that MTL has a 48k dcache laid out in three ways with a 16k stride. So if you have two cache lines pinned in the dcache at the same address modulo 16k, you can't add a third. I see the code here adding two essentially unrestricted dcache addresses (the literals and the stack). Is it possible there's another somewhere else, maybe leftover from the ROM? If so, then bad luck with memory layout would (I think) be able to make this happen.

[1] Presumably "two" so that there's always one available to populate via normal memory operation
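
To make the index arithmetic concrete, here is a rough standalone illustration; the 16 KiB way stride and the example addresses are assumptions based on the description above, not verified MTL values:

#include <stdint.h>
#include <stdio.h>

#define DCACHE_WAY_STRIDE 0x4000u   /* 16 KiB per way, 48 KiB / 3 ways (assumed) */

/* The set index is determined by the address bits below the way stride,
 * i.e. the low 14 bits here. */
static unsigned int dcache_index(uintptr_t addr)
{
	return (unsigned int)(addr & (DCACHE_WAY_STRIDE - 1));
}

int main(void)
{
	/* Hypothetical addresses of the locked literals line and the
	 * on-stack variable targeted by DPFL. */
	uintptr_t literals  = 0xa0104f40;
	uintptr_t stack_var = 0xa0138f40;

	/* Same low 14 bits -> same index.  With a third line already pinned
	 * there (e.g. left behind by the ROM), a 3-way cache has no free way
	 * left at that index and DPFL raises an exception. */
	printf("literals index 0x%04x, stack index 0x%04x -> %s\n",
	       dcache_index(literals), dcache_index(stack_var),
	       dcache_index(literals) == dcache_index(stack_var)
		       ? "collision" : "no collision");
	return 0;
}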

andyross (Contributor) commented:

(Deleted a comment again to avoid confusion: thought I had it, but missed a spot where it's loading the mask values.)

FWIW, regarding the earlier point: it wouldn't be too hard in the failing configuration to dump the actual addresses and see if the low 14 bits of the mask and literals regions match (there are two lines in literals).

And if that does turn out to be the problem, you could resolve it by reserving space in that literals area and copying the mask words there. That way they live sequentially in memory and can't collide on cache index.

marc-hb (Collaborator) commented Jun 29, 2024

IPC4 Daily tests are mostly red right now because of this, which means other, unrelated regressions WILL sneak in unnoticed.

tl;dr: IPC4 CI is dead right now.

ceolin (Contributor) commented Jul 1, 2024

(Deleted a comment again to avoid confusion: thought I had it, but missed a spot where it's loading the mask values.)

FWIW, regarding the earlier point: it wouldn't be too hard in the failing configuration to dump the actual addresses and see if the low 14 bits of the mask and literals regions match (there are two lines in literals).

And if that does turn out to be the problem, you could resolve it by reserving space in that literals area and copying the mask words there. That way they live sequentially in memory and can't collide on cache index.

If I had to bet, I would put my money here. I was looking at the documentation and stumbled on the same implementation notes that you mentioned. That would explain why moving that variable off the stack solves the problem.

lyakh (Collaborator) commented Jul 1, 2024

If I had to bet, I would put my money here. I was looking at the documentation and stumbled on the same implementation notes that you mentioned. That would explain why moving that variable off the stack solves the problem.

@ceolin @andyross this bug is making me rich. Looks like this idea isn't correct either. We've tried various ways to unlock or even to free all cache lines - no success. Maybe we do have to use zephyrproject-rtos/zephyr#75108 as long as we don't have a better solution

lyakh (Collaborator) commented Jul 1, 2024

There's also an inconsistency in the Zephyr code at https://github.com/zephyrproject-rtos/zephyr/blob/2c34da96f0e3ba07764db3ac7def9b400bbd1729/soc/intel/intel_adsp/ace/power_down.S#L92-L104 - that code expects pu32_hpsram_mask to point to an array of MAX_MEMORY_SEGMENTS masks, whereas when locking at https://github.com/zephyrproject-rtos/zephyr/blob/2c34da96f0e3ba07764db3ac7def9b400bbd1729/soc/intel/intel_adsp/ace/power_down.S#L63-L64 the code only locks one cache line. And it isn't just that it assumes the array will fit in one cache line - it isn't even guaranteed to be cache-line aligned, so an array of even just 2 elements could cross 2 cache lines. And the caller at https://github.com/zephyrproject-rtos/zephyr/blob/2c34da96f0e3ba07764db3ac7def9b400bbd1729/soc/intel/intel_adsp/ace/power.c#L342-L352 doesn't bother with an array at all, it just uses a single 32-bit value for a mask.
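
For illustration, a declaration matching what the assembly expects - MAX_MEMORY_SEGMENTS entries, cache-line aligned and padded to whole lines - could look like the sketch below; the macro values and attribute spelling are assumptions, not the actual headers:

#include <stdint.h>

#define MAX_MEMORY_SEGMENTS	1	/* assumed per-platform value */
#define DCACHE_LINE_SIZE	64	/* assumed XCHAL_DCACHE_LINESIZE */
#define MASKS_PER_LINE		(DCACHE_LINE_SIZE / sizeof(uint32_t))

/* One mask per memory segment, rounded up to whole cache lines and aligned
 * to a line boundary.  If this ever spans more than one line, the locking
 * code in power_down.S would also have to DPFL each of those lines. */
static uint32_t hpsram_mask[((MAX_MEMORY_SEGMENTS + MASKS_PER_LINE - 1) /
			     MASKS_PER_LINE) * MASKS_PER_LINE]
	__attribute__((aligned(DCACHE_LINE_SIZE)));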

kv2019i added a commit to kv2019i/zephyr that referenced this issue Jul 1, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by padding the hpsram_mask to a full cacheline.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
kv2019i (Collaborator, Author) commented Jul 1, 2024

One more idea. The exception seems to be right after the first reference to the windowed registers set up by call8. This is when the window overflow would be handled. The registers are stored to the stack frame, so these writes may go to the same cache line we are trying to lock with dpfl (on the caller's stack) -- or there is some other interaction between window overflow and dpfl. An ugly change following this hypothesis seems to be holding up -> zephyrproject-rtos/zephyr#75285 -- let's see full results.
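
A rough sketch of what such a change could look like on the caller's side (the names, sizes and the commented-out call are assumptions; the actual change is the one in the linked Zephyr PR):

#include <stdint.h>
#include <string.h>

#define DCACHE_LINE_SIZE 64	/* assumed XCHAL_DCACHE_LINESIZE */

void pm_state_set_padded_sketch(uint32_t mask)
{
	/* Give the mask a full cache line of its own: aligned to a line
	 * boundary and padded to the line size, so the register spills
	 * performed by window-overflow handling land on neighbouring
	 * lines, never on the line power_down() locks with dpfl. */
	uint32_t hpsram_mask[DCACHE_LINE_SIZE / sizeof(uint32_t)]
		__attribute__((aligned(DCACHE_LINE_SIZE)));

	memset(hpsram_mask, 0, sizeof(hpsram_mask));
	hpsram_mask[0] = mask;

	/* power_down(true, hpsram_mask, true);  -- hypothetical call */
	(void)hpsram_mask;
}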

kv2019i closed this as completed Jul 4, 2024
eddy1021 pushed a commit to eddy1021/sof that referenced this issue Jul 15, 2024
Update Zephyr baseline to 650227d8c47f

Change affecting SOF build targets:

32d05d360b93 intel_adsp/ace: power: fix firmware panic on MTL
a3835041bd36 intel_adsp/ace: power: Use MMU reinit API on core context restore
a983a5e399fd dts: xtensa: intel: Remove non-existent power domains from ACE30 PTL DTS
a2eada74c663 dts: xtensa: intel: Remove ALH nodes from ACE 3.0 PTL DTS
442e697a8ff7 dts: xtensa: intel: Reorder power domains by bit position in ACE30
d1b5d7092e5a intel_adsp: ace30: Correct power control register bitfield definitions
31c96cf3957b xtensa: check stack boundaries during backtrace
5b84bb4f4a55 xtensa: check stack frame pointer before dumping registers
cb9f8b1019f1 xtensa: separate FATAL EXCEPTION printout into two
e9c23274afa2 Revert "soc: intel_adsp: only implement FW_STATUS boot protocol for cavs"
1198c7ec295b Drivers: DAI: Intel: Move ACE DMIC start reset clear to earlier
78920e839e71 Drivers: DAI: Intel: Reduce traces dai_dmic_start()
9db580357bc6 Drivers: DAI: Intel: Remove trace from dai_dmic_update_bits()
f91700e62968 linker: nxp: adsp: add orphan linker section

Link: thesofproject#9268
Link: thesofproject#9243
Link: thesofproject#9205
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
tmleman pushed a commit to tmleman/zephyr that referenced this issue Jul 18, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
marc-hb (Collaborator) commented Jul 19, 2024

so even if complex, it seems to have been fairly low-maintenance to keep using this approach.

Did you say low maintenance?

Sorry if I'm jumping to conclusions; I couldn't resist :-)

tmleman pushed a commit to tmleman/zephyr that referenced this issue Jul 26, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
(cherry picked from commit b767597)
tmleman pushed a commit to tmleman/zephyr that referenced this issue Jul 29, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
tmleman pushed a commit to tmleman/zephyr that referenced this issue Aug 5, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
tmleman pushed a commit to tmleman/zephyr that referenced this issue Aug 5, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
carlescufi pushed a commit to zephyrproject-rtos/zephyr that referenced this issue Aug 8, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

Link: thesofproject/sof#9268
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
coreboot-org-bot pushed a commit to coreboot/zephyr-cros that referenced this issue Aug 9, 2024
The power_down() function will lock dcache for the hpsram_mask
array. On some platforms, the dcache lock will fail if the array
is on cache line that can be used for window register context
saves.

Work around this by aligning and padding the hpsram_mask to cacheline
size.

(cherry picked from commit 2fcdbba)

Original-Link: thesofproject/sof#9268
Original-Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
GitOrigin-RevId: 2fcdbba
Cr-Build-Id: 8740088176811114337
Cr-Build-Url: https://cr-buildbucket.appspot.com/build/8740088176811114337
Copybot-Job-Name: zephyr-main-copybot-downstream
Change-Id: Ia4459bc8d6bbea78f2d1e4a4601d0396b2f3b7ef
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/5776799
Tested-by: ChromeOS Prod (Robot) <chromeos-ci-prod@chromeos-bot.iam.gserviceaccount.com>
Commit-Queue: Jeremy Bettis <jbettis@chromium.org>
Tested-by: Jeremy Bettis <jbettis@chromium.org>
Reviewed-by: Jeremy Bettis <jbettis@chromium.org>
lyakh added a commit to lyakh/sof that referenced this issue Sep 5, 2024
thesofproject#9268 seems to be back,
Zephyr PR 78057 is a new attempt to fix it.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
lyakh (Collaborator) commented Sep 5, 2024

lyakh reopened this Sep 5, 2024
lyakh (Collaborator) commented Sep 5, 2024

and a new attempt to fix it: zephyrproject-rtos/zephyr#78057

lyakh added a commit to lyakh/sof that referenced this issue Sep 5, 2024
thesofproject#9268 seems to be back,
Zephyr PR 78057 is a new attempt to fix it.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
andyross (Contributor) commented Sep 5, 2024

Posted this in the PR by accident, copying here:

FWIW: I still bet that if you whiteboxed a rig to enumerate all the dcache indexes and check how many of them can be pinned, we'd discover that something in the boot ROM or elsewhere has left a line pinned accidentally, preventing more pinning at that index, and we're just hitting that by bad luck in the linker.

Would be non-trivial assembly to write, and tedious to debug as the only feedback is a panic, but not "hard" hopefully.

lyakh added a commit to lyakh/sof that referenced this issue Sep 5, 2024
thesofproject#9268 seems to be back,
Zephyr PR 78057 is a new attempt to fix it.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
lyakh (Collaborator) commented Sep 6, 2024

@andyross I think I tested that back when debugging this last time, by doing one or both of the following:

  1. trying to lock (and unlock) cache lines at all offsets to cover the entire associativity set - all succeeded
  2. trying to speculatively unlock the entire data cache by using DIU:
diff --git a/soc/intel/intel_adsp/ace/power.c b/soc/intel/intel_adsp/ace/power.c
index 20efc8e6f97..61830897397 100644
--- a/soc/intel/intel_adsp/ace/power.c
+++ b/soc/intel/intel_adsp/ace/power.c
@@ -46,6 +46,9 @@ __imr void power_init(void)
 	cache_data_flush_range((__sparse_force void *)
 			sys_cache_cached_ptr_get(&adsp_pending_buffer),
 			sizeof(adsp_pending_buffer));
+	/* unlock entire data cache */
+	for (unsigned int i = 0; i < 16 * 1024; i += 64)
+	__asm__ __volatile__("diu %0, 0" : : "r"(0xa0000000 + i));
 #endif /* CONFIG_SOC_INTEL_ACE15_MTPM */
 }

Neither had helped

andyross (Contributor) commented Sep 7, 2024

@lyakh my read is there are three ways, so you need to prove you can lock two lines at each index to saturate. IIRC there are two separate dcache regions in the code in question, right? So if those overlap in index and happen to collide with something already in the cache, then we'd be in a situation where we're trying to lock the same index in all three ways, which is illegal.

Maybe. :)

lyakh (Collaborator) commented Sep 9, 2024

@andyross sure, but as I said - I also tried unlocking all cache lines and it didn't help either

wszypelt commented:

@lyakh is everything working now? Can we close this issue?

marc-hb (Collaborator) commented Sep 20, 2024

@wszypelt the last words are "... and it didn't help either"

?

lyakh (Collaborator) commented Sep 23, 2024

@wszypelt the last words are "... and it didn't help either"

?

@marc-hb that was a reply to a specific question, not a statement about the state of this bug

lyakh (Collaborator) commented Sep 23, 2024

@lyakh is everything working now? Can we close this issue?

@wszypelt yes, so far zephyrproject-rtos/zephyr#78283 has fixed it

lyakh closed this as completed Sep 23, 2024

7 participants