Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intel-adsp/ace: fix firmware panic on MTL #75108

Merged
merged 1 commit into from
Jul 3, 2024

Conversation

lyakh
Copy link
Collaborator

@lyakh lyakh commented Jun 27, 2024

The power_down() function attempts to lock the hpsram_mask on-stack variable in data cache, which causes an exception. Moving it to .bss by making it static fixes it.

The power_down() function attempts to lock the hpsram_mask on-stack
variable in data cache, which causes an exception. Moving it to .bss
by making it static fixes it.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@lyakh lyakh changed the title [DNM][TEST] fix firmware panic on MTL intel-adsp/ace: fix firmware panic on MTL Jul 1, 2024
@lyakh lyakh marked this pull request as ready for review July 1, 2024 13:07
@nashif nashif added this to the v3.7.0 milestone Jul 1, 2024
@zephyrbot zephyrbot added the platform: Intel ADSP Intel Audio platforms label Jul 1, 2024
@marc-hb marc-hb added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug labels Jul 1, 2024
@marc-hb
Copy link
Collaborator

marc-hb commented Jul 1, 2024

@andyross
Copy link
Contributor

andyross commented Jul 1, 2024

FWIW as far as poorly-understood fixes go, @kv2019i 's fix in #75285 seems to my gut to be closer to a root cause. Just making it static is likely just working by moving it around randomly, where "pad it to a cache line" seems like something the hardware would actually care about.

@kv2019i
Copy link
Collaborator

kv2019i commented Jul 2, 2024

@dcpleung @nashif Either this or alternative #75285 would be good to get to 3.7.0. The latter might be less intrusive, but I'll give reviewers chance to have a look.

@lyakh
Copy link
Collaborator Author

lyakh commented Jul 2, 2024

FWIW as far as poorly-understood fixes go, @kv2019i 's fix in #75285 seems to my gut to be closer to a root cause. Just making it static is likely just working by moving it around randomly, where "pad it to a cache line" seems like something the hardware would actually care about.

@andyross don't you think that the problem is also actually related to that memory being on stack? It seems possible to me that the problem arises because that location ends up sharing cache lines with the new stack frame? So by moving it out of stack we remove that possibility completely?

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on today's findings, problem is with dcache-locking a variable in middle of a stack frame, so this fix as good as #75285 . Given this has the approvals, why not just go with this as the fix.

@andyross
Copy link
Contributor

andyross commented Jul 2, 2024

don't you think that the problem is also actually related to that memory being on stack?

The hardware doesn't know what the "stack" is though. Xtensa doesn't have any (documented) special handling of a stack pointer, or internal stack engine, etc... the way big core architectures do. The A1 register is only SP by convention, ENTRY instructions work symmetrically with any register, window overflow and alloca exceptions do the memory accesses themselves with regular instruction addressing, etc...

I mean, absent getting Cadence people to explain the fault, I guess we'll never know. But my bet is on alignment and dcache line overlap as being closer to the real truth.

@lyakh
Copy link
Collaborator Author

lyakh commented Jul 2, 2024

The hardware doesn't know what the "stack" is though. Xtensa doesn't have any (documented) special handling of a stack pointer, or internal stack engine, etc... the way big core architectures do. The A1 register is only SP by convention, ENTRY instructions work symmetrically with any register, window overflow and alloca exceptions do the memory accesses themselves with regular instruction addressing, etc...

@andyross that's exactly what I meant - memory, pointed by the register, used in entry - that's "stack" in the sense that it's accessed by the DSP internally.

@andyross
Copy link
Contributor

andyross commented Jul 2, 2024

memory, pointed by the register, used in entry - that's "stack" in the sense that it's accessed by the DSP internally.

ENTRY doesn't touch memory though, it just adds an offset to the register based on some window bits in PS left by the CALLx instruction. It a weird instruction, but 100% internal to register state as far as the CPU is concerned. All the spilling and filling of registers from the stack is done in pure software, using regular[1] instructions.

Now, again, this is all just meaningless pontification without input from hardware people in Cadence. So no strong argument. Just about feelings.

[1] There actually are special L/S32E instructions for the loads/stores in the window exceptions, but they don't change the access at all, they use a different source for the MMU protection ring so that the access is done with the privilege of the trapping code and not the handler.

@kv2019i
Copy link
Collaborator

kv2019i commented Jul 3, 2024

@andyross @lyakh I think this is specific to DPFL. If I replace just this one DPFL with either DPFR or DPFW (prefetch without lock, either for read or write), the test case works. And as it happens, this power_down() is only place where we use DPFL.

@aescolar aescolar merged commit 32d05d3 into zephyrproject-rtos:main Jul 3, 2024
29 checks passed
@lyakh lyakh deleted the power-down branch July 3, 2024 13:44
@andyross
Copy link
Contributor

andyross commented Jul 3, 2024

I think this is specific to DPFL

That's consistent with the theory that the CPU is hitting its "too many pinned cache lines" exception, FWIW. The regular prefetch instructions don't have that restriction and won't trigger that exception (because their semantics are just hints: they don't want to change memory semantics, they just want to make sure the line is in cache where possible).

@andyross
Copy link
Contributor

andyross commented Jul 3, 2024

Heh, given that there are only three cache ways: has anyone tried to white box this? Maybe it's as simple as the CPU designer had an off-by-one in the Verilog and the CPU traps after the first collision instead of the second? Wouldn't be hard to manually lock a line at &my_char_array[0], and &my_char_array[0x4000], and &my_char_array[0x8000]. Per docs the third is the one that's supposed to fail, but...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug platform: Intel ADSP Intel Audio platforms priority: high High impact/importance bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants