[ARL-S] Audio firmware download failure after S3/S4 #5135

Open
syedk008 opened this issue Aug 6, 2024 · 21 comments
Assignees: kv2019i
Labels: ARL (Applies to Intel Arrow Lake platform), bug (Something isn't working), P1 (Blocker bugs or important features)

Comments

@syedk008

syedk008 commented Aug 6, 2024

Describe the bug
Audio resume failed with fw load error

sof-audio-pci-intel-mtl 0000:80:1f.3: Code loader DMA did not complete
sof-audio-pci-intel-mtl 0000:80:1f.3: ------------[ DSP dump start ]------------
sof-audio-pci-intel-mtl 0000:80:1f.3: Firmware download failed
sof-audio-pci-intel-mtl 0000:80:1f.3: fw_state: SOF_FW_BOOT_READY_OK (6)
sof-audio-pci-intel-mtl 0000:80:1f.3: 0x50000005: module: ROM_EXT, state: FW_ENTERED, running
sof-audio-pci-intel-mtl 0000:80:1f.3: Firmware state: 0x5, status/error code: 0x0
sof-audio-pci-intel-mtl 0000:80:1f.3: Core dump is not available due to invalid separator 0xc0de
sof-audio-pci-intel-mtl 0000:80:1f.3: ------------[ DSP dump end ]------------
sof-audio-pci-intel-mtl 0000:80:1f.3: Failed to start DSP
sof-audio-pci-intel-mtl 0000:80:1f.3: error: failed to boot DSP firmware after resume -110

To Reproduce
Update the BIOS settings for S3 and S4:
• Go to Intel Advanced Menu -> ACPI Settings -> Wakeup system from S5 via RTC -> Enabled
• Go to Intel Advanced Menu -> ACPI Settings -> S0 Idle Low Power Idle Capability -> Disabled

Run suspend/resume multiple times with the command below:
sleep 10 && rtcwake -m mem -s 15
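
A minimal sketch of how this manual loop could be automated, assuming Python 3.7+ on the test system and root privileges. It wraps the same `sleep 10` and `rtcwake -m mem -s 15` sequence and scans `dmesg` for the error strings from the log above. The function names, the default iteration count and the stop-on-first-failure policy are illustrative assumptions, not part of the original report.

```python
import subprocess
import time

# Error strings copied from the failure log in this report.
FAILURE_MARKERS = (
    "Code loader DMA did not complete",
    "Firmware download failed",
    "failed to boot DSP firmware after resume",
)


def find_failures(dmesg_text):
    """Return the log lines that contain any of the known failure markers."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


def read_dmesg():
    """Return the current kernel log as text (usually needs root)."""
    result = subprocess.run(
        ["dmesg"], check=True, capture_output=True, text=True
    )
    return result.stdout


def suspend_resume_loop(iterations=200):
    """Repeat the suspend/resume cycle.

    Returns the 1-based iteration at which a new failure line showed up,
    or None if every iteration passed.
    """
    # Failure lines that were already in the log before the loop started.
    seen = len(find_failures(read_dmesg()))
    for i in range(1, iterations + 1):
        time.sleep(10)
        subprocess.run(["rtcwake", "-m", "mem", "-s", "15"], check=True)
        failures = find_failures(read_dmesg())
        if len(failures) > seen:
            print(f"iteration {i}: firmware load failure after resume")
            print("\n".join(failures[seen:]))
            return i
    return None
```

Calling `suspend_resume_loop(500)` as root would run several hundred cycles, which seems appropriate for a failure rate in the region of 5%. The comparison against the earlier count assumes the kernel ring buffer does not wrap during the run.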

Reproduction Rate
5%

Impact
High impact

Environment
Kernel: sof_dev (commit: 7df4fc1)
SOF: v2.10
topology: sof-hda-generic-4ch.tplg
Platform: Ubuntu 24.04

@syedk008
Author

syedk008 commented Aug 6, 2024

dmesg.txt

dmesg file with sof dynamic debug enabled. Search for keyword "Code loader".

@lgirdwood
Member

@ssavati any chance you can write a small script that loops over sleep 10 && rtcwake -m mem -s 15 on MTL RVP for several hundred iterations?
@kv2019i @ujfalusi @plbossart IIUC, the FW is still running (or memories are not cleared) when we try to reload the code. Could this mean we have not put the DSP into D3?

@kv2019i
Collaborator

kv2019i commented Aug 6, 2024

In the above log, FW_READY is received, so FW has been loaded successfully:

[ 202.961859] snd_sof:snd_sof_run_firmware: sof-audio-pci-intel-mtl 0000:80:1f.3: booting DSP firmware
...
[ 203.265654] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:80:1f.3: ipc rx : 0x1b080000|0x0: GLB_NOTIFICATION|FW_READY

So it seems the DMA transfer was successful, but the host missed the transfer completion interrupt and raised the error (even though the FW was loaded and booted).
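
The check described above can be sketched as a small log classifier. It assumes SOF dynamic debug is enabled, so that the `booting DSP firmware` and `GLB_NOTIFICATION|FW_READY` lines are present in dmesg; the function name is hypothetical and the match strings are copied from the logs quoted in this issue.

```python
def fw_ready_seen_despite_dma_error(dmesg_text):
    """Return True if the most recent boot attempt logged both FW_READY
    and the code loader DMA error."""
    lines = dmesg_text.splitlines()

    # Find the start of the most recent boot attempt.
    boot_idx = None
    for i, line in enumerate(lines):
        if "booting DSP firmware" in line:
            boot_idx = i
    if boot_idx is None:
        return False

    attempt = lines[boot_idx:]
    fw_ready = any("GLB_NOTIFICATION|FW_READY" in l for l in attempt)
    dma_error = any("Code loader DMA did not complete" in l for l in attempt)
    return fw_ready and dma_error
```

A `True` result matches the situation described in this comment: the firmware reported ready even though the host reported the DMA as incomplete.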

@ssavati

ssavati commented Aug 6, 2024

@ssavati any chance you can write a small script that loops over sleep 10 && rtcwake -m mem -s 15 on MTL RVP for several hundred iterations.

Sure, I will try on the MTLP HDA config first.

@syedk008 We don't have an ARL-S setup. Is it possible to get one board for debug?

@kv2019i
Collaborator

kv2019i commented Aug 6, 2024

@syedk008 can you try #5136

@plbossart
Member

@kv2019i there's still something very strange if the DMA is programmed to generate an IOC interrupt at the end of the transfer and we don't receive it.

For IPC3 this was root-caused to something odd in SOF 2.0, which was fixed in 2.2

To me this still points to something not quite correct on the firmware, or ROM side.

@ujfalusi
Collaborator

ujfalusi commented Aug 6, 2024

@kv2019i there's still something very strange if the DMA is programmed to generate an IOC interrupt at the end of the transfer and we don't receive it.

For IPC3 this was root-caused to something odd in SOF 2.0, which was fixed in 2.2

To me this still points to something not quite correct on the firmware, or ROM side.

@plbossart, this is the firmware booting, at this stage IPC version does not matter. We can load anything at this stage (another thing is that it is going to be rejected as not valid). I cannot believe that ROM booting can depend on the IPC protocol used by the transferred binary, which is not even started at this point.

@plbossart
Member

you missed the point @ujfalusi

The code loader sets up the DMA with the IOC bit set. If we don't get an interrupt, then something is wrong in the firmware or ROM handling.

We previously disabled the IPC3 because the problems we found were related to old firmware. I agree this has nothing to do with the IPC proper, but is related to the firmware infrastructure.

@ujfalusi
Collaborator

ujfalusi commented Aug 6, 2024

@plbossart, the initial firmware loading has nothing to do with the payload itself. The DMA will load the amount of data and that's it. The firmware has nothing to do with this, it is ROM and second stage, soft ROM code.
Can you point me to the firmware fix for IPC3 you mentioned?

@plbossart
Member

if the HDaudio DMA is programmed with the IOC bit set, then do we agree the IOC interrupt SHALL be generated?

We've seen in some cases of IPC3 firmware that it was not, see #5072

I don't really care if we remove the wait for this interrupt, but the fact that different machines have different unexplained behaviors is concerning. What exactly makes ARL-S different from all our CI devices?

@ssavati

ssavati commented Aug 6, 2024

@lgirdwood @syedk008 I have tried to reproduce the issue on MTL HDA.

I applied the BIOS settings below:
• Go to Intel Advanced Menu -> ACPI Settings -> Wakeup system from S5 via RTC -> Enabled
• Go to Intel Advanced Menu -> ACPI Settings -> S0 Idle Low Power Idle Capability -> Disabled

With the above settings the system enters “PM: suspend entry (deep)” but does not resume, and the device has to be restarted.

On our devices I set "S0 Idle Low Power Idle Capability -> Enabled". With this the system enters “PM: suspend entry (s2idle)” and resumes. I ran “sleep 10 && rtcwake -m mem -s 15” in a loop and it completed 200 iterations without any issue.

Tested with the config below:
Linux Branch: topic/sof-dev
Linux Commit: 7df4fc116381
SOF Branch: v2.10
SOF Commit: b15f1f1a3238
All our systems are on Ubuntu 22.04

@ujfalusi
Collaborator

ujfalusi commented Aug 6, 2024

if the HDaudio DMA is programmed with the IOC bit set, then do we agree the IOC interrupt SHALL be generated?

Yes, we agree on this.

We've seen in some cases of IPC3 firmware that it was not, see #5072

That is exactly the same issue, I agree again.

I don't really care if we remove the wait for this interrupt, but the fact that different machines have different unexplained behaviors is concerning. What exactly makes ARL-S different from all our CI devices?

That I cannot explain, but the fact that it can also fail on some TGL devices is curious.
We don't have problems when not waiting for the IOC in the IPC3 case, but we want to wait for it if the payload is IPC4? Does this make sense?
The 'lost' IOC has nothing to do with the IPC version, do you agree? If so, then why would we use a different mode to press the power button?

@plbossart
Member

"The 'lost' IOC has nothing to do with the IPC version, do you agree?"

We had evidence that some versions of SOF 2.2 firmware didn't work and some did. We ended up disabling the wait for all IPC3 devices to avoid having to special-case which versions didn't work. The blanket "all IPC3 devices" was a simplification, not a statement that IPC was involved.

@kv2019i
Collaborator

kv2019i commented Aug 6, 2024

Let's see the results and see whether #5136 helps with this issue. It's clear IOC complete should work, but it's not so clear whether this wait is something we need to have in the FW load sequence to begin with.

@syedk008
Author

syedk008 commented Aug 6, 2024

@lgirdwood @syedk008 I have tried to reproduce the issue on MTL HDA.

I applied the BIOS settings below:
• Go to Intel Advanced Menu -> ACPI Settings -> Wakeup system from S5 via RTC -> Enabled
• Go to Intel Advanced Menu -> ACPI Settings -> S0 Idle Low Power Idle Capability -> Disabled

With the above settings the system enters “PM: suspend entry (deep)” but does not resume, and the device has to be restarted.

On our devices I set "S0 Idle Low Power Idle Capability -> Enabled". With this the system enters “PM: suspend entry (s2idle)” and resumes. I ran “sleep 10 && rtcwake -m mem -s 15” in a loop and it completed 200 iterations without any issue.

Tested with the config below:
Linux Branch: topic/sof-dev
Linux Commit: 7df4fc116381
SOF Branch: v2.10
SOF Commit: b15f1f1a3238
All our systems are on Ubuntu 22.04

@ssavati I see the same behavior with sof-dev config file, please use the attached config file, this is being used in our BKC.
arl_defconfig.txt

@syedk008
Author

syedk008 commented Aug 6, 2024

@syedk008 can you try #5136

Thanks. With this patch, I could not reproduce the issue. We will test more with this and let you know.

@plbossart
Member

@kv2019i I thought it was an interesting data point to see when the transfer is complete v. when we get the first response from firmware.

We can of course remove the wait_for_completion(), it's not strictly required, but that would be an acknowledgement that we have no idea how the code download works and what makes it fail.

@ujfalusi
Collaborator

ujfalusi commented Aug 7, 2024

@kv2019i I thought it was an interesting data point to see when the transfer is complete v. when we get the first response from firmware.

Yes, it can be interesting, true.

Note: The boot flow charts I have seen never include IOC waiting; it is always load the FW and wait for FW_READY.

We can of course remove the wait_for_completion(), it's not strictly required, but that would be an acknowledgement that we have no idea how the code download works and what makes it fail.

We could do something like this:

  1. start the DMA transfer
    In the IOC irq handler, set a flag that we received it for the code loader
  2. wait for FW_READY
  3. if FW_READY did not come, print a different error depending on IOC reception

This could be racy, but if we hard wait for the IOC we might get the FW_READY and things will be amazingly confused.
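
The three steps above can be modelled with a small user-space simulation. This is only a toy sketch under stated assumptions: it is not the kernel driver code, the class and method names are hypothetical, and `threading.Event` stands in for the IOC flag and the FW_READY notification.

```python
import threading


class CodeLoaderSim:
    """Toy model of the proposed boot flow (not the kernel implementation)."""

    def __init__(self):
        # Set by the simulated IOC irq handler when the DMA completes.
        self.ioc_received = threading.Event()
        # Set when the simulated firmware sends FW_READY.
        self.fw_ready = threading.Event()

    def ioc_irq_handler(self):
        # Step 1, sub-step: only record that the IOC interrupt was seen.
        self.ioc_received.set()

    def fw_ready_handler(self):
        self.fw_ready.set()

    def boot(self, start_dma, timeout_s):
        # Step 1: start the DMA transfer. The callback stands in for the
        # hardware and may call either handler, both, or neither.
        start_dma(self)
        # Step 2: wait for FW_READY only; there is no hard wait on the IOC.
        if self.fw_ready.wait(timeout_s):
            return "booted"
        # Step 3: FW_READY did not come, so choose the error from the flag.
        if self.ioc_received.is_set():
            return "boot failed: IOC received but no FW_READY"
        return "boot failed: no IOC and no FW_READY"
```

For instance, `CodeLoaderSim().boot(lambda s: s.fw_ready_handler(), 0.1)` models the case reported in this issue (FW_READY arrives, IOC is lost) and still counts as a successful boot, while a callback that triggers neither handler yields the "no IOC" error text.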

I would remove the IOC wait as a fix and iterate on it, probably with the example, to get a few more data points on real boot failures. It is an interesting detail whether the DMA sent the IOC interrupt when the FW did not boot.

Btw, the IOC is purely an HDA DMA affair; it has nothing to do with the FW or the type of data.

@ujfalusi
Collaborator

ujfalusi commented Aug 7, 2024

@ssavati, have you tried to remove the audio drivers and then do the deep suspend on MTL? Does that work?
I don't think MTL supports deep sleep; it has been deprecated on recent Intel platforms for some time.

@kv2019i
Collaborator

kv2019i commented Aug 7, 2024

@syedk008 Could you test with this alternative PR that adds more debug, #5142 (expected to fail, but with more debug output)? If the results are as expected, I propose we proceed with #5136 as the fix, and potentially follow up with a PR like #5141 to keep some of the debug capabilities even if the IOC wait is removed.

kv2019i added the bug, P1 and ARL labels Aug 7, 2024
kv2019i self-assigned this Aug 7, 2024
kv2019i added a commit to kv2019i/linux that referenced this issue Aug 7, 2024
Commit 9ee3f0d (" ASOC: SOF: Intel: hda-loader: only wait for
HDaudio IOC for IPC4 devices") removed DMA wait for IPC3 case.
Proceed and remove the wait for IPC4 devices as well.

There is no dependency to IPC version in the load logic and
checking the firmware status is a sufficient check in case of
errors.

The removed code also had a bug in that -ETIMEDOUT is returned
without stopping the DMA transfer.

Link: thesofproject#5135
Fixes: d5263db ("ASoC: SOF: Intel: don't ignore IOC interrupts for non-audio transfers")
Suggested-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
kv2019i added a commit to kv2019i/linux that referenced this issue Aug 8, 2024
Commit 9ee3f0d ("ASOC: SOF: Intel: hda-loader: only wait for
HDaudio IOC for IPC4 devices") removed DMA wait for IPC3 case.
Proceed and remove the wait for IPC4 devices as well.

There is no dependency to IPC version in the load logic and
checking the firmware status is a sufficient check in case of
errors.

The removed code also had a bug in that -ETIMEDOUT is returned
without stopping the DMA transfer.

Link: thesofproject#5135
Fixes: 9ee3f0d ("ASOC: SOF: Intel: hda-loader: only wait for HDaudio IOC for IPC4 devices")
Suggested-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
ujfalusi pushed a commit that referenced this issue Aug 16, 2024
bardliao pushed a commit to bardliao/linux that referenced this issue Aug 22, 2024
Commit 9ee3f0d ("ASOC: SOF: Intel: hda-loader: only wait for
HDaudio IOC for IPC4 devices") removed DMA wait for IPC3 case.
Proceed and remove the wait for IPC4 devices as well.

There is no dependency to IPC version in the load logic and
checking the firmware status is a sufficient check in case of
errors.

The removed code also had a bug in that -ETIMEDOUT is returned
without stopping the DMA transfer.

Link: thesofproject#5135
Fixes: 9ee3f0d ("ASOC: SOF: Intel: hda-loader: only wait for HDaudio IOC for IPC4 devices")
Suggested-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
bardliao pushed a commit to bardliao/linux that referenced this issue Aug 22, 2024
bardliao pushed a commit to bardliao/linux that referenced this issue Aug 30, 2024
bardliao pushed a commit to bardliao/linux that referenced this issue Sep 10, 2024
bardliao pushed a commit to bardliao/linux that referenced this issue Sep 10, 2024
bardliao pushed a commit to bardliao/linux that referenced this issue Sep 12, 2024
@ujfalusi
Collaborator

Is this still a valid issue? #5136 should have fixed this.
