Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LNL] SoundWire: pm_runtime resume timed out - interrupts lost #4943

Closed
marc-hb opened this issue Apr 22, 2024 · 56 comments
Closed

[LNL] SoundWire: pm_runtime resume timed out - interrupts lost #4943

marc-hb opened this issue Apr 22, 2024 · 56 comments
Labels
LNL Applies to Lunar Lake platform P3 Low-impact bugs or features SDW Applies to SoundWire bus for codec connection

Comments

@marc-hb
Copy link
Collaborator

marc-hb commented Apr 22, 2024

Spotted by chance on jf-lnlm-rvp-sdw-1 while testing something totally unrelated:

https://sof-ci.01.org/softestpr/PR1178/build324/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-sof-logger (internal run ID 40150)

2024-04-22 22:40:16 UTC [ERROR] Caught kernel log error
===========================>>
[  173.964280] kernel: rt715-sdca sdw:0:1:025d:0714:01: rt715_dev_resume: Enumeration not complete, timed out
[  173.979644] kernel: rt715-sdca sdw:0:1:025d:0714:01: intel_resume_child_device: pm_runtime_resume failed: -110
[  173.979746] kernel: soundwire sdw-master-0-1: sdw_intel_exit: intel_resume_child_device failed: -110

(Spotted while testing unrelated thesofproject/sof-test#1178)

@marc-hb
Copy link
Collaborator Author

marc-hb commented Apr 22, 2024

Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-2: No more devices to enumerate
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x21
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling enumeration completion for Slave 1
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:2:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:3:025d:1316:01: rt1316_io_init hw_init complete
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_io_init hw_init complete
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling initialization completion for Slave 1
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
Apr 22 22:40:01 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
Apr 22 22:40:01 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
Apr 22 22:40:01 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
Apr 22 22:40:04 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-3: clock stop prepare done slave:15
Apr 22 22:40:04 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-2: clock stop prepare done slave:15
Apr 22 22:40:04 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-0: clock stop prepare done slave:15
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: rt715-sdca sdw:0:1:025d:0714:01: rt715_dev_resume: Enumeration not complete, timed out
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: rt715-sdca sdw:0:1:025d:0714:01: intel_resume_child_device: pm_runtime_resume failed: -110
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire sdw-master-0-1: sdw_intel_exit: intel_resume_child_device failed: -110
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.2: clock source 0 LVSCTL 0x0
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: initializing enumeration and init completion for Slave 1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.2: mclk 19200000 max 4800000 row 50 col 4
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x2
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-2: Slave attached, programming device number
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-2: SDW Slave Addr: 31025d131601
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-2: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x1316, unique_id 0x1, version 0x3
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-2: Slave already registered, reusing dev_num:1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-2: No more devices to enumerate
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x21
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling enumeration completion for Slave 1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:2:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_io_init hw_init complete
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling initialization completion for Slave 1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.3: clock source 0 LVSCTL 0x0
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: initializing enumeration and init completion for Slave 1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.3: mclk 19200000 max 4800000 row 50 col 4
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x2
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-3: Slave attached, programming device number
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave Addr: 30025d131601
A

@plbossart
Copy link
Member

plbossart commented Apr 22, 2024

Looks like a hardware/board issue to me

The last log for the rt714 is this https://sof-ci.ostc.intel.com/#/result/planresultdetail/40150?model=LNLM_SDW_AIOC&testcase=verify-kernel-boot-log

[    5.655780] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x20000001

This means the device was assigned an address and was no longer seen as device 0. But then nothing happened after that.

@bardliao what do you think?

@bardliao
Copy link
Collaborator

@plbossart Device number has been assigned. And there is no trace in rt715_sdca_io_init(). I think it is probably why we don't see the next action in the log.

[    5.654922] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[    5.654946] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-1: Slave attached, programming device number
...
[    5.655694] kernel: soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling enumeration completion for Slave 6

To me, it is more like a codec driver error. See change below. We need to wait for initialization_complete instead of enumeration_complete when the codec get resume.

diff --git a/sound/soc/codecs/rt715-sdca-sdw.c b/sound/soc/codecs/rt715-sdca-sdw.c
index d3fb02e8f31a..c8dabb9b16b5 100644
--- a/sound/soc/codecs/rt715-sdca-sdw.c
+++ b/sound/soc/codecs/rt715-sdca-sdw.c
@@ -234,10 +234,10 @@ static int __maybe_unused rt715_dev_resume(struct device *dev)
        if (!slave->unattach_request)
                goto regmap_sync;

-       time = wait_for_completion_timeout(&slave->enumeration_complete,
+       time = wait_for_completion_timeout(&slave->initialization_complete,
                                           msecs_to_jiffies(RT715_PROBE_TIMEOUT));
        if (!time) {
-               dev_err(&slave->dev, "%s: Enumeration not complete, timed out\n", __func__);
+               dev_err(&slave->dev, "%s: Initialization not complete, timed out\n", __func__);
                sdw_show_ping_status(slave->bus, true);

                return -ETIMEDOUT;

@plbossart
Copy link
Member

I saw that confusion @bardliao but if the enumeration didn't complete by the deadline, then the initialization would not complete either?

@marc-hb
Copy link
Collaborator Author

marc-hb commented Apr 23, 2024

Seen again today on a different device, this time it's ba-lnlm-rvp-sdw-01

https://sof-ci.01.org/softestpr/PR1075/build329/devicetest/index.html (internal run 40213)

@bardliao
Copy link
Collaborator

I saw that confusion @bardliao but if the enumeration didn't complete by the deadline, then the initialization would not complete either?

Fair point. Indeed, I can't find soundwire sdw-master-0-1: Slave attached, programming device number from #4943 (comment). And it happened again with different DUTs/codecs. #4943 (comment). Can it be AC timing issue again?

@plbossart
Copy link
Member

@bardliao possibly an AC timing problem indeed. With PR #4915, we modified the default for DOAIS from 0x1 to 0x3 (hardware default). It's worth checking if reverting this change 'fixes' the issue.

plbossart added a commit to plbossart/sound that referenced this issue Apr 24, 2024
go back to the hard-coded 0x1 which was changed in "soundwire: intel_ace2x: cleanup DOAIS/DODS settings"

Link: thesofproject#4943
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
@plbossart plbossart added SDW Applies to SoundWire bus for codec connection LNL Applies to Lunar Lake platform labels Apr 25, 2024
@plbossart
Copy link
Member

reverting the DOAIS setting does not trigger any obvious change, so we'll have to track where and when this issue re-occurs, if at all.

@plbossart
Copy link
Member

plbossart commented May 6, 2024

similar problem reported on LNL RVP:

https://sof-ci.01.org/linuxpr/PR4857/build2721/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-kmod-load-unload

[ 1964.576041] kernel: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_dev_resume: Initialization not complete, timed out
[ 1964.591372] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
[ 1964.591437] kernel: rt711-sdca sdw:0:0:025d:0711:01: intel_resume_child_device: pm_runtime_resume failed: -110
[ 1964.591584] kernel: soundwire sdw-master-0-0: sdw_intel_exit: intel_resume_child_device failed: -110

It looks like the RT711 is attached on link0 but somehow we never deal with it. That's starting to look like an interrupt that's not detected.

It was the same pattern in the initial bug report above but this time for link1.

Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: rt715-sdca sdw:0:1:025d:0714:01: rt715_dev_resume: Enumeration not complete, timed out
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: rt715-sdca sdw:0:1:025d:0714:01: intel_resume_child_device: pm_runtime_resume failed: -110
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire sdw-master-0-1: sdw_intel_exit: intel_resume_child_device failed: -110

plbossart added a commit to plbossart/sound that referenced this issue May 6, 2024
Log what the interrupt enable state is.

we should merge this but there's no real merit in upstreaming this.

Link: thesofproject#4943
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
plbossart added a commit to plbossart/sound that referenced this issue May 6, 2024
Log what the interrupt enable state is.

we should merge this but there's no real merit in upstreaming this.

Link: thesofproject#4943
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
@plbossart
Copy link
Member

@bardliao FYI. My two hypotheses are
a) this comes from a bad setup with delayed interrupt enable
b) there's an edge interrupt from two separate links at the same time and we miss one of the two.

@plbossart
Copy link
Member

The mechanism for interrupts is that we have ONE interrupt, and then we deal with all the links with this loop

	list_for_each_entry(link, &ctx->link_list, list)
		sdw_cdns_irq(irq, link->cdns);

If we have an interrupt on link2, we could very well miss an interrupt on link0 if it happens AFTER we deal with link0 in the list walk. If we look at the reports so far we seem to miss the interrupt for link0 and link1, in all cases we don't seem to have problems dealing with the amplifiers on link2-3.

@bardliao
Copy link
Collaborator

bardliao commented May 7, 2024

The mechanism for interrupts is that we have ONE interrupt, and then we deal with all the links with this loop

	list_for_each_entry(link, &ctx->link_list, list)
		sdw_cdns_irq(irq, link->cdns);

If we have an interrupt on link2, we could very well miss an interrupt on link0 if it happens AFTER we deal with link0 in the list walk. If we look at the reports so far we seem to miss the interrupt for link0 and link1, in all cases we don't seem to have problems dealing with the amplifiers on link2-3.

Good finding. And it explains why the issue only seen with AIOC. However, I am curious why we only see the issue on LNL?

@plbossart
Copy link
Member

@bardliao I am not sure this theory is correct, if it is that would mean the solution we implemented years ago was never robust... For LNL we start the links earlier, and IIRC we don't power-up all the links at once as we do for earlier generations. That could explain differences in timing during the enumeration stage.

@bardliao
Copy link
Collaborator

bardliao commented May 7, 2024

@bardliao I am not sure this theory is correct, if it is that would mean the solution we implemented years ago was never robust... For LNL we start the links earlier, and IIRC we don't power-up all the links at once as we do for earlier generations. That could explain differences in timing during the enumeration stage.

Hope we can find something with #4976. But if it is the case, how should we fix it?

@plbossart
Copy link
Member

If I am correct, we can loop multiple times to reduce the possibility of such issues, but it would not be a 100% robust fix.
If we still miss an interrupt, we could add a recovery procedure when the resume timeout happens, so that we force the sdw_interrupt thread to be revisited again. That would be really ugly.

@charleskeepax
Copy link

Question 1 is probably, is the interrupt definitely edge triggered? And secondly, does it have to be? I would assume it is here and there is no option to switch it to level triggered, which is generally much easier to get right. The normal handling for overlapping edge triggered IRQs works roughly like this:

  1. IRQ comes in, the IRQ is left enabled, the state is set to handling, and the handler is started.
  2. If another IRQ comes in, we see the state is already handling so the IRQ is masked and the state is set to pending.
  3. When the first IRQ handler completes, we see the state is pending so the state is set back to handling, the IRQ is unmasked and the handler is re-entered.

Assuming one has the IRQ stuff all setup correctly this should be handled by the IRQ core.

@plbossart
Copy link
Member

@charleskeepax we have ONE interrupt, but it's a combined status for IPC, streams, SoundWire and wakes.
The solution we used so far was the same as Windows and it's been quite reliable since CML. What is possible is that with the change to use the HDaudio multi-link stuff, we handle things differently and the combination of events across links is possibly questionable. It's not really that the interrupt generation changed.

@charleskeepax
Copy link

I have had very occasional enumeration failures with cs42l43 on the TGL UpX, that has been on my list to investigate for a quite a while. This is probably a good candidate for a cause. If I follow the code correctly in sound/soc/sof/intel/hda.c, it is masking the IRQ in the handler, and unmasking it at the bottom of the thread. That is a pretty questionable thing to be doing with an edge triggered IRQ, will depend on the exact details of the hardware but generally that would be broken as you lose any IRQs that happen during the time the handler is running. Also keep in mind this could be a result of there being more IRQs. The less IRQs there are the less likely you are to lose some, so maybe one of those other changes increased the IRQ volume, or just changed the time such that two IRQs were more likely to happen close to each other.

@charleskeepax
Copy link

But first step would really be to check the hardware behaviour, we still haven't had any luck getting docs for the Cadence IP, although I assume this bit falls outside of that anyway.

@plbossart
Copy link
Member

It's not a Cadence problem.

@plbossart
Copy link
Member

we did change the timing on LNL, now the links are started serially whereas in previous solutions we have to power them up at the same time. I think the serial part creates a case where a link has an alert while handling another, it would be less likely in the simultaneous case since all devices would be handled at the same time.

@marc-hb
Copy link
Collaborator Author

marc-hb commented May 7, 2024

  1. 1.1 IRQ comes in, the IRQ is left enabled, 1.2 the state is set to handling, ...
  2. If another IRQ comes in, we see the state is already handling so the IRQ is masked and the state is set to pending.

What happens if 2) comes between 1.1 and 1.2?

  1. ... so 3.2 the state is set back to handling, 3.3 the IRQ is unmasked and...

Looks like there is another short window between 3.2 and 3.3 for another race here...

The normal handling for overlapping edge triggered IRQs works roughly like this
...
Assuming one has the IRQ stuff all setup correctly this should be handled by the IRQ core.

The generic issue should indeed be a "solved problem": do you have a pointer to a good and not "rough" reference documentation? Unless edge-triggered interrupts are broken by design :-D

@charleskeepax
Copy link

  1. 1.1 IRQ comes in, the IRQ is left enabled, 1.2 the state is set to handling, ...
  2. If another IRQ comes in, we see the state is already handling so the IRQ is masked and the state is set to pending.

What happens if 2) comes between 1.1 and 1.2?

I believe the state variable would need to be either handled as an atomic or be guarded with a spin lock.
I think the IRQ code uses a spin lock if memory serves but will try to find time to have a look tomorrow.

  1. ... so 3.2 the state is set back to handling, 3.3 the IRQ is unmasked and...

Looks like there is another short window between 3.2 and 3.3 for another race here...

Not sure this one actually counts since until the IRQ is unmasked you know no new handlers will kick off but depends a bit on how you order the code. Probably safest to wrap all access to the state in a spin lock.

The normal handling for overlapping edge triggered IRQs works roughly like this
...
Assuming one has the IRQ stuff all setup correctly this should be handled by the IRQ core.

The generic issue should indeed be a "solved problem": do you have a pointer to a good and not "rough" reference documentation? Unless edge-triggered interrupts are broken by design :-D

They are not broken by design, but they sure are much harder to get right than level triggered IRQs. Partly why it's a good idea to outsource the handling to the IRQ subsystem. I think I had a couple emails from Thomas Gleixner saved somewhere I will try to dig out.

@marc-hb
Copy link
Collaborator Author

marc-hb commented May 30, 2024

Last observation from me 8 days ago:
#4943 (comment)

@fredoh9
Copy link
Collaborator

fredoh9 commented May 31, 2024

Last seen is 41430 (generic_test_new number) "Start Time: 2024-05-17 08:27:34 UTC"

40617/LNLM_SDW_AIOC/check-kmod-load-unload/dmesg.html:[ 1925.680979] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
40686/LNLM_SDW_AIOC/check-kmod-load-unload/dmesg.html:[ 1964.591372] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
40703/LNLM_SDW_AIOC/check-sof-logger/dmesg.html:[  247.913812] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
40703/LNLM_SDW_AIOC/multiple-pipeline-capture/dmesg.html:[ 2120.781727] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
40889/LNLM_SDW_AIOC/check-kmod-load-unload/dmesg.html:[ 2005.563315] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
40948/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2021.424492] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
40988/LNLM_SDW_AIOC/check-capture-50rounds/dmesg.html:[ 1681.461390] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1
41016/LNLM_SDW_AIOC/check-alsabat-headset-capture-599/dmesg.html:[ 2167.503016] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41089/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[  365.663480] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41118/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2118.263892] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41136/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2018.861151] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
41224/LNLM_SDW_AIOC/check-signal-stop-start-capture-10/dmesg.html:[  690.746202] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
41253/LNLM_SDW_AIOC/check-runtime-pm-status/dmesg.html:[ 1709.624393] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41269/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2036.784993] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1
41264/LNLM_SDW_AIOC/check-runtime-pm-status/dmesg.html:[ 1730.617024] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41299/LNLM_SDW_AIOC/check-sof-logger/dmesg.html:[  250.471737] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
41311/LNLM_SDW_AIOC/check-sof-logger/dmesg.html:[  249.447548] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
41346/LNLM_SDW_AIOC/check-capture-all-formats/dmesg.html:[  290.918537] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
41437/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2050.687592] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
41405/LNLM_SDW_AIOC/check-runtime-pm-status/dmesg.html:[ 1628.698374] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
41430/LNLM_SDW_AIOC/multiple-pipeline-all/dmesg.html:[ 1840.828359] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1

@plbossart
Copy link
Member

seen today on ba-lnlm-rvp-sdw-03 boot -28 2379f50512f242698128c0505c5f9e90 Thu 2024-06-06 16:00:15 UTC—Thu 2024-06-06 16:04:19 UTC

dmesg.log

@ssavati
Copy link

ssavati commented Jun 13, 2024

@marc-hb @plbossart
This issue constantly reproducing for below test case

TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg MODEL=LNLM_SDW_AIOC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-kmod-load-unload.sh -l 50

I have observed issue mostly occurring after 21th iteration (3 times observed)
load_unload_50times_log1.txt

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 13, 2024

Keep in mind that kmod-load-unload is part of check-sof-logger

@bardliao
Copy link
Collaborator

I am thinking will on board rt711 + external rt1316/rt714 cause any issue? Can we test on board rt711 only or use external rt711?

@plbossart
Copy link
Member

@bardliao is certainly correct that we should keep one unmodified RVP with just the on-board RT711 codec, to help triangulate issues with board reworks and integration of external AIOC.

@fredoh9 do you know if the 'nocodec' boards used on LNL can be retargeted for SoundWire tests?

That said, the problem did occur with link0 and RT711, so I am not sure if this is only a board/hardware problem.
Since this happens only on resume, there is a non-zero likelihood that we're missing something that's not restored properly.

@plbossart
Copy link
Member

seen again in daily tests planresultdetail/42570?model=LNLM_SDW_AIOC&testcase=check-runtime-pm-status

@plbossart
Copy link
Member

I ran 500 iterations of playback/pm_runtime suspend without seeing this enumeration problem with #5065

I did see a bunch of xruns reported in issue #5066, but that's a separate problem.

@bardliao @lgirdwood we are entering the second order twilight zone...

@plbossart
Copy link
Member

I managed to reproduce the issue with a non-stop test:

while true; do 
    TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg MODEL=LNLM_SDW_AIOC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/multiple-pipeline.sh -f p 2; 
done

The first error came after 3800s, over an hour. I am not sure how we are going to make progress on this one.
The link2 was powered-up 95 times before the failure....

log.txt.gz

@bardliao @lgirdwood this is going to be fun!

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 18, 2024

To stop the while loop immediately after the first failure:

while true; do
   VAR1=VAL1 test.sh || break
done

Or:

while VAR1=VAL1 test.sh; do
   :
done

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 18, 2024

multiple-pipeline.sh -h has two parameters that could speed up reproduction:

    -w parameter |  --wait parameter
	duration of one (sub)test iteration
	Default Value: 5

    -l parameter |  --loop parameter
	loop count
	Default Value: 1

@plbossart
Copy link
Member

It took 649 iterations to reproduce the problem!

2024-06-17 20:13:08 UTC [INFO] Test Result: PASS!
iteration 648 done
2024-06-17 20:13:13 UTC [INFO] ktime=12651 sof-test PID=218993: starting
2024-06-17 20:13:18 UTC [INFO] Starting /usr/local/bin/mtrace-reader.py >& /home/ubuntu/sof-test/logs/multiple-pipeline/2024-06-17-20:13:13-5762/mtrace.txt &
2024-06-17 20:13:18 UTC [INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg to run the test case
2024-06-17 20:13:18 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-17 20:13:18 UTC [INFO] Run command to get pipeline parameters
2024-06-17 20:13:18 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg -f 'type:playback & ~pcm:Amplifier Reference & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-17 20:13:18 UTC [INFO] ===== Testing: (Loop: 1/1) =====
2024-06-17 20:13:18 UTC [INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg to run the test case
2024-06-17 20:13:18 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-17 20:13:18 UTC [INFO] Run command to get pipeline parameters
2024-06-17 20:13:18 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg -f 'type:playback  & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-17 20:13:18 UTC [INFO] Testing: Speaker [hw:0,2]
2024-06-17 20:13:18 UTC [COMMAND] aplay   -D hw:0,2 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-17 20:13:18 UTC [INFO] Testing: Jack Out [hw:0,0]
2024-06-17 20:13:18 UTC [COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-17 20:13:18 UTC [INFO] Testing: Deepbuffer Jack Out [hw:0,31]
2024-06-17 20:13:18 UTC [INFO] wait 0.5s for aplay_opts()
2024-06-17 20:13:18 UTC [COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-17 20:13:18 UTC [INFO] checking pipeline status
2024-06-17 20:13:18 UTC [INFO] Letting playback/capture run for 5s
2024-06-17 20:13:23 UTC [INFO] checking pipeline status again
2024-06-17 20:13:23 UTC [ERROR] Running process count is 2, but 3 is expected
219161 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
219170 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-17 20:13:24 UTC [INFO] Starting func_exit_handler(1)
2024-06-17 20:13:24 UTC [ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2024-06-17 20:13:24 UTC [ERROR]  func_error_exit()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh
2024-06-17 20:13:24 UTC [ERROR]  ps_checks()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:148
2024-06-17 20:13:24 UTC [ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:202
2024-06-17 20:13:26 UTC [INFO] pkill -TERM -f mtrace-reader.py
2024-06-17 20:13:26 UTC [INFO] nlines=129 /home/ubuntu/sof-test/logs/multiple-pipeline/2024-06-17-20:13:13-5762/mtrace.txt
2024-06-17 20:13:28 UTC [INFO] ktime=12666 sof-test PID=218993: ending
2024-06-17 20:13:28 UTC [INFO] Test Result: FAIL!
[12656.153835] soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.2: clock source 0 LVSCTL 0x0
[12656.153842] soundwire_intel:intel_link_power_up: soundwire_intel soundwire_intel.link.2: first link up, programming SYNCPRD
[12656.153972] soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.1: clock source 0 LVSCTL 0x0
[12656.153986] soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: initializing enumeration and init completion for Slave 1
[12656.154006] soundwire_bus:sdw_modify_slave_status: rt715-sdca sdw:0:1:025d:0714:01: initializing enumeration and init completion for Slave 7
[12656.154072] soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.0: clock source 0 LVSCTL 0x0
[12656.154128] soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: initializing enumeration and init completion for Slave 6
[12656.154172] soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.3: clock source 0 LVSCTL 0x0
[12656.154228] soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.1: mclk 19200000 max 4800000 row 50 col 4
[12656.154272] soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.0: mclk 19200000 max 4800000 row 50 col 4
[12656.154306] soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: initializing enumeration and init completion for Slave 1
[12656.154316] soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.3: mclk 19200000 max 4800000 row 50 col 4
[12656.154355] soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.2: mclk 19200000 max 4800000 row 50 col 4
[12656.155635] cdns_update_slave_status_work: 2 callbacks suppressed
[12656.155639] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2
[12656.155681] soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-0: Slave attached, programming device number
[12656.155700] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[12656.155737] soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-1: Slave attached, programming device number
[12656.155806] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x2
[12656.155951] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-0: SDW Slave Addr: 30025d071101
[12656.155955] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-0: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x0711, unique_id 0x0, version 0x3
[12656.155959] soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-0: Slave already registered, reusing dev_num:6
[12656.155968] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-1: SDW Slave Addr: 30025d071401
[12656.155971] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-1: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x0714, unique_id 0x0, version 0x3
[12656.155974] soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-1: Slave already registered, reusing dev_num:7
[12656.155995] soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-3: Slave attached, programming device number
[12656.156306] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave Addr: 30025d131601
[12656.156309] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x1316, unique_id 0x0, version 0x3
[12656.156312] soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-3: Slave already registered, reusing dev_num:1
[12656.156324] soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.0: Msg ignored for Slave 0
[12656.156327] soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-0: No more devices to enumerate
[12656.156336] soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[12656.156399] soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-1: No more devices to enumerate
[12656.156474] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000001
[12656.156510] soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling enumeration completion for Slave 6
[12656.156525] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x20000001
[12656.156558] soundwire_bus:sdw_modify_slave_status: rt715-sdca sdw:0:1:025d:0714:01: signaling enumeration completion for Slave 7
[12656.156632] soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.3: Msg ignored for Slave 0
[12656.156634] soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-3: No more devices to enumerate
[12656.156755] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x21
[12656.156836] soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling enumeration completion for Slave 1
[12656.156844] soundwire_bus:sdw_slave_set_frequency: rt711-sdca sdw:0:0:025d:0711:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[12656.157120] soundwire_bus:sdw_slave_set_frequency: rt715-sdca sdw:0:1:025d:0714:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[12656.157311] soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:3:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[12656.157748] soundwire_bus:sdw_handle_slave_status: rt715-sdca sdw:0:1:025d:0714:01: signaling initialization completion for Slave 7
[12656.177031] snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:3:025d:1316:01: rt1316_io_init hw_init complete
[12656.177038] soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
[12656.358042] snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
[12656.367475] snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
[12656.367487] snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
[12656.367490] soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
[12656.371690] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
[12656.380101] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[12656.381118] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
[12656.412867] snd_soc_rt711_sdca:rt711_sdca_headset_detect: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_headset_detect, detected_mode=0x0
[12656.412876] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, jack_type=0x0
[12656.412879] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, btn_type=0x0
[12656.412882] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, scp_sdca_stat1=0x1, scp_sdca_stat2=0x0
[12656.882461] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
[12656.883993] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[12656.885729] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
[12656.919436] snd_soc_rt711_sdca:rt711_sdca_headset_detect: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_headset_detect, detected_mode=0x5
[12656.919446] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, jack_type=0x3
[12656.919451] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, btn_type=0x0
[12656.919454] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, scp_sdca_stat1=0x1, scp_sdca_stat2=0x0
[12656.920011] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[12656.954731] snd_soc_rt711_sdca:rt711_sdca_headset_detect: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_headset_detect, detected_mode=0x5
[12656.954741] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, jack_type=0x3
[12656.954744] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, btn_type=0x0
[12656.954746] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, scp_sdca_stat1=0x1, scp_sdca_stat2=0x0
[12659.847445] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-1: clock stop prepare done slave:15
[12660.131857] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-0: clock stop prepare done slave:15
[12661.260889] rt1316-sdca sdw:0:2:025d:1316:01: rt1316_dev_resume: Initialization not complete, timed out
[12661.276146] soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
[12661.276171] rt1316-sdca sdw:0:2:025d:1316:01: ASoC: error at snd_soc_pcm_component_pm_runtime_get on sdw:0:2:025d:1316:01: -110
[12661.276279]  SDW2-Playback: ASoC: error at __soc_pcm_open on SDW2-Playback: -110
[12661.276309]  Speaker: ASoC: error at dpcm_be_dai_startup on Speaker: -110
[12661.276335]  Speaker: ASoC: error at dpcm_fe_dai_startup on Speaker: -110

so far it's always link2 that shows the problem, not sure why. Maybe it's because it's the first link that gets used with this script.

@plbossart
Copy link
Member

My theory is wrong, there are examples of failures where the first link to be used is NOT the one that fails, example copied from https://sof-ci.01.org/linuxpr/PR4857/build2721/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-kmod-load-unload shows link1/rt715 is started first but link0/rt711 fails to resume.

[ 1959.256937] kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.1: clock source 0 LVSCTL 0x0
[ 1959.256949] kernel: soundwire_intel:intel_link_power_up: soundwire_intel soundwire_intel.link.1: first link up, programming SYNCPRD
[ 1959.257044] kernel: soundwire_bus:sdw_modify_slave_status: rt715-sdca sdw:0:1:025d:0714:01: initializing enumeration and init completion for Slave 6
[ 1959.257065] kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.1: mclk 19200000 max 4800000 row 50 col 4
[ 1959.257217] kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.2: clock source 0 LVSCTL 0x0
[ 1959.257284] kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: initializing enumeration and init completion for Slave 1
[ 1959.257306] kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.2: mclk 19200000 max 4800000 row 50 col 4
[ 1959.257444] kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.3: clock source 0 LVSCTL 0x0
[ 1959.257658] kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: initializing enumeration and init completion for Slave 1
[ 1959.257675] kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.3: mclk 19200000 max 4800000 row 50 col 4
[ 1959.257765] kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.0: clock source 0 LVSCTL 0x0
[ 1959.258299] kernel: soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: initializing enumeration and init completion for Slave 7
[ 1959.258329] kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.0: mclk 19200000 max 4800000 row 50 col 4
[ 1959.258940] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[ 1959.259073] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-1: Slave attached, programming device number
[ 1959.259135] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x2
[ 1959.259178] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-2: Slave attached, programming device number
[ 1959.259201] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x1
[ 1959.259309] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-1: SDW Slave Addr: 30025d071401
[ 1959.259313] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-1: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x0714, unique_id 0x0, version 0x3
[ 1959.259318] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-1: Slave already registered, reusing dev_num:6
[ 1959.259439] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-2: SDW Slave Addr: 31025d131601
[ 1959.259442] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-2: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x1316, unique_id 0x1, version 0x3
[ 1959.259446] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-2: Slave already registered, reusing dev_num:1
[ 1959.259668] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x2
[ 1959.259711] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-3: Slave attached, programming device number
[ 1959.259729] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 1959.259734] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-1: No more devices to enumerate
[ 1959.259980] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2000001
[ 1959.260013] kernel: soundwire_bus:sdw_modify_slave_status: rt715-sdca sdw:0:1:025d:0714:01: signaling enumeration completion for Slave 6
[ 1959.260074] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave Addr: 30025d131601
[ 1959.260085] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x1316, unique_id 0x0, version 0x3
[ 1959.260100] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-3: Slave already registered, reusing dev_num:1
[ 1959.260267] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
[ 1959.260278] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-2: No more devices to enumerate
[ 1959.260360] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x21
[ 1959.260393] kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling enumeration completion for Slave 1
[ 1959.260479] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.3: Msg ignored for Slave 0
[ 1959.260485] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-3: No more devices to enumerate
[ 1959.260673] kernel: soundwire_bus:sdw_slave_set_frequency: rt715-sdca sdw:0:1:025d:0714:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[ 1959.260697] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x21
[ 1959.260796] kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling enumeration completion for Slave 1
[ 1959.260805] kernel: soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:2:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[ 1959.261408] kernel: soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:3:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[ 1959.261440] kernel: soundwire_bus:sdw_handle_slave_status: rt715-sdca sdw:0:1:025d:0714:01: signaling initialization completion for Slave 6
[ 1959.274495] kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_io_init hw_init complete
[ 1959.274499] kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling initialization completion for Slave 1
[ 1959.275046] kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:3:025d:1316:01: rt1316_io_init hw_init complete
[ 1959.275048] kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
[ 1959.278233] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 1959.278536] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
[ 1959.278555] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 1959.278559] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 1959.438214] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[ 1959.438494] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 1963.028494] kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-1: clock stop prepare done slave:15
[ 1963.028662] kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-3: clock stop prepare done slave:15
[ 1963.028754] kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-2: clock stop prepare done slave:15
[ 1964.576041] kernel: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_dev_resume: Initialization not complete, timed out
[ 1964.591372] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
[ 1964.591437] kernel: rt711-sdca sdw:0:0:025d:0711:01: intel_resume_child_device: pm_runtime_resume failed: -110
[ 1964.591584] kernel: soundwire sdw-master-0-0: sdw_intel_exit: intel_resume_child_device failed: -110

@fredoh9
Copy link
Collaborator

fredoh9 commented Jun 18, 2024

@fredoh9 do you know if the 'nocodec' boards used on LNL can be retargeted for SoundWire tests?

I can change one nocodec RVP to AIOC model today

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 20, 2024

Another one today. The logs look a little bit different

https://sof-ci.01.org/sofpr/PR9185/build5843/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=test-speaker

@plbossart
Copy link
Member

Another one today. The logs look a little bit different

https://sof-ci.01.org/sofpr/PR9185/build5843/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=test-speaker

It's really the same issue.

@ssavati
Copy link

ssavati commented Jul 16, 2024

@plbossart @kv2019i Can we close this issue? I'm not observing this issue. if anyone observing sporadically let me know we can keep this issue with reduce priority and remove “production” label

@plbossart
Copy link
Member

we need a stress test with PR #5106. If we see any occurrences of 'Peripheral 0' in the dmesg log, then it's a fail.

@ssavati
Copy link

ssavati commented Jul 16, 2024

Ok I 'll schedule stress test with PR #5106

@marc-hb marc-hb added the P3 Low-impact bugs or features label Jul 16, 2024
@ssavati
Copy link

ssavati commented Jul 17, 2024

we need a stress test with PR #5106. If we see any occurrences of 'Peripheral 0' in the dmesg log, then it's a fail.

@plbossart here is the stress report planresultdetail/43888. For failure I did not see 'Peripheral 0' in dmesg. For suspend/resume dmesg logs are missing so run it manual for 100 iteration for 3 different tests. Tests are able to complete 50+ iteration before it fails due to known S/R issue but dmesg does not show up 'Peripheral 0' when it fails.

@plbossart
Copy link
Member

50 iterations is hardly a stress test, given the previous observation of about 0.6% likelihood of getting the enumeration issue.

we don't need to test the system suspend-resume, which will have other issues. It's enough to go to D3/pm_runtime suspend.
I used the following script:

set -e

cnt=0

while true; do
	TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg MODEL=LNLM_SDW_AIOC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/multiple-pipeline.sh -f p 2; 
	sleep 5;
	cnt=$((cnt + 1));
	echo "iteration $cnt done";
done

and let it rip for over 2000 iterations.

@ssavati
Copy link

ssavati commented Jul 18, 2024

I ran the script for 15+ hours, 2557 iterations completed and still running(Peripheral 0 did not show up) for multiple pipeline playback scenarios. I will try capture stress scenario once then we can conclude.

@ssavati
Copy link

ssavati commented Jul 19, 2024

For multiple pipeline capture 2014 iteration completed and still running, issue not reproduced

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
LNL Applies to Lunar Lake platform P3 Low-impact bugs or features SDW Applies to SoundWire bus for codec connection
Projects
None yet
Development

No branches or pull requests

7 participants