Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardcoded and unchecked IMR address map for Intel ADSP can cause memory corruption #76247

Open
marc-hb opened this issue Jul 23, 2024 · 4 comments
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug platform: Intel ADSP Intel Audio platforms priority: low Low impact/importance bug

Comments

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 23, 2024

Describe the bug

This is a follow-up to thesofproject/sof#9308 where the audio DSP firmware on MTL stopped booting across the board after a random combination of Pull Requests that all passed separately. After a long and tedious investigation by @tmleman and @lyakh , it was found that the IMR memory had been corrupted for a long time - we just never noticed it before.

The IMR (Isolated Memory Region) is an area of the main/host memory that stays up when the audio DSP (including its local memory) is off. It speeds up restarting the audio firmware. Starting with the "ACE" generation it is the standard way to restart (there is a sof_debug=0x80 debug bit/option to boot from scratch instead). Among others, the IMR contains a complete image of the firmware code which happened to be partially overwritten.

The reason it was partially overwritten is because the adsp_memory.h file (see 6069f946be1bd502) maps that IMR region with hardcoded constants that became disconnected with reality.

The urgent and temporary fix submitted in #76196 re-hardcodes "better" values but that's obviously not sustainable. This GitHub issue is to discuss and track the longer term solution(s).

To Reproduce

git clone git@github.com:thesofproject/sof.git
git checkout d629e521020c
west update
sof/scripts/xtensa-build-zephyr.py mtl

Inspect the code size and notice that it is to big for the IMR area meant for it.

Originally posted by @tmleman in thesofproject/sof#9308 (comment)

@lyakh thank you for the (temporary) fix. I had a problem with decoding all these macros and I wasn't sure if it would be enough to shift the IMR stack.

To better outline what the problem was, here is a small map of the IMR along with the addresses from which the FW was copied:

L3_MEM_BASE_ADDR             = 0xa1000000
IMR_BOOT_LDR_MANIFEST_BASE   = 0xa1042000
IMR_BOOT_LDR_TEXT_ENTRY_BASE = 0xa1048000
IMR_BOOT_LDR_LIT_BASE        = 0xa1048180
IMR_BOOT_LDR_TEXT_BASE       = 0xa10481c0
IMR_BOOT_LDR_DATA_BASE       = 0xa1049000 # The beginning of the memory copied to the hpsram.
                               0xa1049000 -> 0xa0030000
IMR_BOOT_LDR_BSS_BASE        = 0xa1110000
IMR_BOOT_LDR_STACK_BASE	     = 0xa1120000
IMR_L3_HEAP_BASE             = 0xa1121000
The last copied address      = 0xa1133000 -> 0xa011a000

The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively. CI caught the problem only in builds with assertions enabled, but also in a normal build, the heap initialization was overwriting our FW, it just didn't cause crashes.

Expected behavior

At build time, a linker or elfutils script must 1. either compute some optimal allocation, or 2. at the very least check that areas are big enough for their intended purposes and fail when not big enough.

If possible, additional runtime checks/protections cannot hurt.

Impact

As usual with memory corruption: critical.

Failure to boot, security risks, etc.

@marc-hb marc-hb added bug The issue is a bug, or the PR is fixing a bug platform: Intel ADSP Intel Audio platforms labels Jul 23, 2024
@andyross
Copy link
Contributor

andyross commented Jul 24, 2024

@ceolin @kv2019i will want to be in on this too I suspect.

FWIW an awful lot of that stuff doesn't need to be managed automatically manually at all. The .text/.literal/.data offsets all come out cleanly from the linker, the stack is just a memory array, we automatically compute general heap boundaries with symbols marking the extents. Things like the "boot loader stack" should really be unified with the existing core 0 interrupt stack for efficiency. The memory windows can be sized as general buffers with an __aligned tag, etc... There's a ton of manual management here that is just historical and needless.

And the remaining areas that can't be automatic will end up corresponding to "addresses the CSME, boot ROM or host loader code knows about external to the firmware". And those probably belong in device tree or an evolved loader protocol.

@marc-hb

This comment was marked as off-topic.

@jhedberg jhedberg added the priority: low Low impact/importance bug label Jul 30, 2024
marc-hb added a commit to marc-hb/sof that referenced this issue Aug 5, 2024
This reverts commit 8847de0.

This should now work thanks to the "better" IMR addresses in
zephyrproject-rtos/zephyr#76196

Note the longer term issue is still open:
zephyrproject-rtos/zephyr#76247

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
kv2019i pushed a commit to thesofproject/sof that referenced this issue Aug 6, 2024
This reverts commit 8847de0.

This should now work thanks to the "better" IMR addresses in
zephyrproject-rtos/zephyr#76196

Note the longer term issue is still open:
zephyrproject-rtos/zephyr#76247

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@dcpleung dcpleung assigned lyakh and unassigned dcpleung Sep 3, 2024
@dcpleung
Copy link
Member

dcpleung commented Sep 3, 2024

Assigning this to @lyakh for now, as I believe this would require some firmware loader works too for the addresses.

Copy link

github-actions bot commented Nov 3, 2024

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Nov 3, 2024
@marc-hb marc-hb removed the Stale label Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug platform: Intel ADSP Intel Audio platforms priority: low Low impact/importance bug
Projects
None yet
Development

No branches or pull requests

5 participants