
schedule: zephyr_ll: Fix schedule bug for multi instances #8653

Closed

Conversation

@TangleZ commented Dec 20, 2023

When two instances run at the same time the sound has noise, for example playback and record on the i.MX8QM or i.MX8ULP platform.

The reason is that one DMA interrupt will process two tasks, because the scheduler has no mechanism to make sure only the task that needs processing is handled.

Fix this issue by adding a check for whether the task is in the pending state.
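For illustration, a minimal standalone sketch of the failure mode and the intended fix (this is not SOF code; all names such as ll_task, dma_chan and task_is_pending are hypothetical): a scheduler that runs every low-latency task on any DMA interrupt mixes the two streams, while a per-task pending check keyed to the interrupting channel runs only the task that belongs to that interrupt.

#include <stdbool.h>
#include <stdio.h>

struct dma_chan {
	bool irq_fired;                  /* set by the (simulated) DMA ISR */
};

struct ll_task {
	const char *name;
	struct dma_chan *chan;           /* the channel driving this stream */
	void (*run)(struct ll_task *task);
};

static void copy_period(struct ll_task *task)
{
	printf("%s: copy one period\n", task->name);
}

/* The idea behind this PR: only run a task whose own channel raised the
 * interrupt that woke the scheduler. */
static bool task_is_pending(const struct ll_task *task)
{
	return task->chan->irq_fired;
}

static void ll_scheduler_run(struct ll_task **tasks, int n, bool check_pending)
{
	for (int i = 0; i < n; i++) {
		if (check_pending && !task_is_pending(tasks[i]))
			continue;        /* skip tasks woken by another DMA IRQ */
		tasks[i]->run(tasks[i]);
	}
}

int main(void)
{
	struct dma_chan edma0 = { .irq_fired = true };   /* playback IRQ fired */
	struct dma_chan edma1 = { .irq_fired = false };  /* record IRQ did not */
	struct ll_task playback = { "playback", &edma0, copy_period };
	struct ll_task record = { "record", &edma1, copy_period };
	struct ll_task *tasks[] = { &playback, &record };

	puts("without a pending check, one IRQ runs both tasks:");
	ll_scheduler_run(tasks, 2, false);
	puts("with a pending check, only the playback task runs:");
	ll_scheduler_run(tasks, 2, true);
	return 0;
}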

@iuliana-prodan (Contributor)

@TangleZ you hear noise when playing & recording in parallel?
But why does this happen, what changed lately?
What introduced this bug?

When the dma_domain for Zephyr was introduced, I believe @LaurentiuM1234 tested this scenario (aplay || arecord).
Can you confirm @LaurentiuM1234?
Any idea why this happens now? Why, so far, we didn't need is_pending for Zephyr?
I know for XTOS we have it.

@LaurentiuM1234 (Contributor) commented Dec 20, 2023

When the dma_domain for Zephyr was introduced, I believe @LaurentiuM1234 tested this scenario (aplay || arecord).
Can you confirm @LaurentiuM1234?
Any idea why this happens now? Why, so far, we didn't need is_pending for Zephyr?
I know for XTOS we have it.

I guess this was bound to happen at some point. I did indeed test the case in which we have multiple pipeline tasks and it didn't happen. I'm assuming this didn't happen because the buffers were keeping pipeline_copy() from overwriting data from the DMA buffer on DAI's side (i.e: the buffers are full of data so you can't copy anymore from sources to sinks). There's also the timing that may affect this or not (i.e: the DMA interrupts for the pipeline tasks are very close to each other). But yeah this is a known flaw of zephyr_ll and zephyr_dma_domain.

Anyways, I don't believe this solution is the way to go. It doesn't seem to take into account that you can also have non-registrable pipeline tasks which are not bound to a chan_data structure. As such, you'd end up with the non-registrable pipeline tasks not being executed at all.

Also, why are you using the notifier API? It doesn't seem necessary?

@lgirdwood (Member) left a comment

I've not seen this issue on Intel platforms with 8 streams running at the same time with timer-based scheduling, so I suspect it relates to DMA-based scheduling.
However, both streams with DMA scheduling:

  1. need to be independent, as they will have different time domains.
  2. can't be connected, due to 1.
  3. may need different priorities to ensure no xruns.

It looks like here you are using the same DMA IRQ to schedule both pipelines? If so, maybe it would be better to use the timer?

@TangleZ (Author) commented Dec 20, 2023

@lgirdwood thanks for your comments. I'm not very clear on what you mean in points 1 and 2, but for 3 the answer should be no. In my understanding, every instance should have the same priority when running.
And it is not accurate to say we are "using the same DMA IRQ to schedule both pipelines": depending on the DMA type, the instances may share the same DMA IRQ (SDMA) or have different DMA IRQs (EDMA).

What I want to do is to make sure we only process the task that is triggered by the right DMA IRQ.

@TangleZ (Author) commented Dec 20, 2023

@LaurentiuM1234 let's use an example to clarify the issue: there are two instances on the QM platform, where the first is playback (EDMA 0) and the second is record (EDMA 1). When interrupt 0 causes the scheduler to run its task, there is no code to guarantee that only this task is processed; the task belonging to the record is processed as well. And the same situation also happens for interrupt 1.

"It doesn't seem to take into account that you can also have non-registrable pipeline tasks which are not bound to a chan_data structure."
Actually, the tasks which are not bound to a chan_data do not implement the pending function, so they are not affected.

"Also, why are you using the notifier API? It doesn't seem necessary?"
I'm not sure the notifier API is a good fit; we can discuss how to improve this.

@lgirdwood (Member)

@lgirdwood thanks for your comments. I'm not very clear on what you mean in points 1 and 2, but for 3 the answer should be no. In my understanding, every instance should have the same priority when running. And it is not accurate to say we are "using the same DMA IRQ to schedule both pipelines": depending on the DMA type, the instances may share the same DMA IRQ (SDMA) or have different DMA IRQs (EDMA).

What I want to do is to make sure we only process the task that is triggered by the right DMA IRQ.

ok, got you - but the DMA scheduler domain logic should have a callback for each DMA IRQ source to allow correct IRQ -> pipeline mapping?

@TangleZ (Author) commented Dec 21, 2023

@lgirdwood thanks for your comments. I'm not very clear on what you mean in points 1 and 2, but for 3 the answer should be no. In my understanding, every instance should have the same priority when running. And it is not accurate to say we are "using the same DMA IRQ to schedule both pipelines": depending on the DMA type, the instances may share the same DMA IRQ (SDMA) or have different DMA IRQs (EDMA).
What I want to do is to make sure we only process the task that is triggered by the right DMA IRQ.

ok, got you - but the DMA scheduler domain logic should have a callback for each DMA IRQ source to allow correct IRQ -> pipeline mapping?

Yes, you are right. The problem is as I replied to LaurentiuM1234: there is no code to guarantee that only the right task is processed. As for why no issues are found on the Intel platform, I guess the reason is the same as on our i.MX8MP platform, which uses SDMA and happens to handle the tasks that have no data to copy gracefully.

src/schedule/zephyr_dma_domain.c (outdated review thread, resolved)
}
/* update task state */
task->state = state;
Collaborator

There are a few changes in this commit, including this one, that modify behaviour of the (pipeline) task scheduler which is very central to SOF. If anything this will need a thorough review and extensive testing.

TangleZ (Author)

Yes, this change needs a lot of testing.
About the failures, I can't get any info about what makes the CI fail; can you please clarify it for me?

@@ -222,7 +222,8 @@ static inline bool domain_is_pending(struct ll_schedule_domain *domain,
 {
 	bool ret;
 
-	assert(domain->ops->domain_is_pending);
+	if (!domain->ops->domain_is_pending)
+		return true;
Collaborator

you add a .domain_is_pending method to the DMA scheduling domain, so presumably this change isn't needed?

Collaborator

@lyakh I think zephyr_domain.c doesn't implement this, so this is needed not to break non-DMA LL cases.
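To spell out the fallback being discussed, here is a minimal standalone sketch (stand-in types, not the real SOF structures, and the extra parameters of domain_is_pending are omitted): a domain that does not provide the new hook, such as the timer domain in zephyr_domain.c, is simply treated as always pending, so its tasks keep running every tick.

#include <stdbool.h>

struct ll_schedule_domain_ops {
	/* optional: only the DMA scheduling domain implements this in the PR */
	bool (*domain_is_pending)(void *domain);
};

struct ll_schedule_domain {
	const struct ll_schedule_domain_ops *ops;
};

static inline bool domain_is_pending(struct ll_schedule_domain *domain)
{
	/* no hook -> behave as before the PR: every task is pending */
	if (!domain->ops->domain_is_pending)
		return true;

	return domain->ops->domain_is_pending(domain);
}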

@kv2019i (Collaborator) left a comment

Hmm, I guess this is OK. The diff in zephyr_ll.c looks a bit scary in the GitHub view, but as far as I can tell, this should be a no-op for timer-driven use cases.

I'm not 100% sure how suitable the is_pending() interface is for implementing this. I think this was originally in the domain interface to cater for one-shot and delayed tasks. Now, with the move to Zephyr, the recommendation is to use native Zephyr interfaces for delayed processing, and what remains in the LL scheduler is really about running the LL tasks every timer tick; the domain is really only used to create and manage the timer thread.

So in that context, checking is_pending() is a bit of a waste of DSP cycles for the timer-driven case.

For the DMA-driven case, in theory it should be enough to run all the tasks once per LL tick (on whichever DMA IRQ fires first). But alas, with the current architecture for DMA-driven scheduling, it's not so easy to elect the "main IRQ". This would have to be the IRQ with the highest rate, and if a task is removed, a new "main IRQ" would have to be elected.

So with that, maybe this is the best approach in the end.

src/schedule/zephyr_ll.c (outdated review thread, resolved)
@TangleZ force-pushed the sche_fix branch 2 times, most recently from 962b4cf to 2c470f3 on December 22, 2023 08:18
@kv2019i (Collaborator) left a comment

Thanks, the code looks clean to me now. But tests in CI still fail, and it seems to be linked to one particular user of the LL scheduler, chain-dma. See my comment inline.

@@ -237,6 +244,8 @@ static void zephyr_ll_run(void *data)
 			break;
 		}
 	}
+	/* update task state */
+	task->state = state;
Collaborator

I'm wondering if it's this addition on L247-248 that is causing some unexpected failures. It seems to affect test configurations that use the chain-dma feature (which basically bypasses creation of full audio pipelines and uses a simple copy LL task to push data through two interfaces). I can see DSP panics in configurations where chain-dma is enabled:
https://sof-ci.01.org/sofpr/PR8653/build1416/devicetest/index.html

TangleZ (Author)

According to the CI page, I can't see any info about the failures on my side:
TEST #{{ internalResultID }}
CI build start time:
On-device test with:
Kernel Build Info
SOF Build Info
PR ID: {{ planResult.linuxPrID }}
Linux Branch: {{ planResult.linuxBranch }}
Commit: Merge {{ shortenCommit(planResult.linuxParents.head) }} into {{ shortenCommit(planResult.linuxParents.target) }} ⇒ {{ shortenCommit(planResult.linuxSource.linuxCommit) }}
Kconfig Branch: {{ shortenCommit(planResult.kconfigBranch) }}
Kconfig Commit: {{ shortenCommit(planResult.linuxSource.kconfigCommit) }}
PR ID: {{ planResult.sofPrID }}
SOF Branch: {{ planResult.sofBranch }}
Commit: Merge {{ shortenCommit(planResult.sofParents.head) }} into {{ shortenCommit(planResult.sofParents.target) }} ⇒ {{ shortenCommit(planResult.sofSource.sofCommit) }}
Zephyr Commit: {{ shortenCommit(planResult.sofSource.zephyrCommit) }}
Copy Link

Can you please help me understand where the DSP panics show up, and I will try to fix them.

Member

@TangleZ can you refresh and enable scripts in your browser for this site, it should then show you logs.

Collaborator

@TangleZ Sorry for the late response, I was out for Xmas. Here's an example log showing a panic:
https://sof-ci.01.org/sofpr/PR8653/build1416/devicetest/index.html?model=TGLU_UP_HDA-ipc4&testcase=multiple-pause-resume-50

This is running "~/sof-test/test-case/multiple-pause-resume.sh -r 50" on an Intel TGL-based laptop. The panic hits when the test uses the HDMI/DP PCMs that use chain-dma.

@dbaluta (Collaborator) commented Dec 22, 2023

@TangleZ can you also please test mixer scenarios on 8qxp or 8qm? I'm heading out for the Christmas holidays but will follow this PR.

@TangleZ (Author) commented Dec 23, 2023

@TangleZ can you also please test mixer scenarios on 8qxp or 8qm? I'm heading out for christmas holidays but will follow this PR.

Sure. Can you please send me the test command, or point me to where I can find it?

@TangleZ (Author) commented Jan 3, 2024

@dbaluta I fixed the issue with the mixer by checking whether the task is registrable in the pending function.
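For context, a minimal sketch of the registrable guard described here, using stand-in fields rather than the real SOF structures (the registrable and DMA-interrupt fields are modeled, not copied from the patch): non-registrable pipeline tasks, such as the host pipelines feeding a mixer, are not bound to their own DMA channel, so the pending check must always let them run together with the registrable task.

#include <stdbool.h>

struct pipeline_task {
	bool registrable;       /* task that owns the DMA-driven scheduling source */
	bool dma_irq_fired;     /* stand-in for "this task's DMA channel interrupted" */
};

/* Sketch of the pending decision after the mixer fix: only registrable
 * tasks are filtered by their own DMA interrupt; non-registrable ones
 * always run alongside the registrable task they feed. */
static bool pipeline_task_is_pending(const struct pipeline_task *pipe_task)
{
	if (!pipe_task->registrable)
		return true;

	return pipe_task->dma_irq_fired;
}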


/* Move the task to a temporary list */
list_item_del(list);
list_item_append(list, &task_head);

if (task->state != SOF_TASK_STATE_RUNNING)
Member

Ditto, re the inline comments - the scheduler can be complex, so let's add comments to help.

TangleZ (Author)

Done.

TangleZ (Author)

Also, I can see the failures in CI, and they are related to the Intel tests. But I can't run the tests on my side because we have no Intel board.
Can you please help? I will also look at the patch to see if there is anything I can change to fix the CI failures.

Collaborator

Ack, @TangleZ I'll try to debug this a bit early next week, agreed this is difficult to debug without a board available.

TangleZ (Author)

Thanks for your help!

When two instances run at the same time the sound has noise, for
example: playback and record on the i.MX8QM or i.MX8ULP platform.

The reason is that one DMA interrupt will process two tasks because
the scheduler has no mechanism to make sure only the task that needs
processing is handled.

Fix this issue by adding a check for whether the task is in the
pending state.

Signed-off-by: Zhang Peng <peng.zhang_8@nxp.com>

pipe_task = pipeline_task_get(task);

if (!pipe_task->registrable)
Contributor

What if you have 2 registrable pipeline tasks and 2 non-registrable? If I recall correctly this case is supported by the mixer topologies we have (registrable 1 is playback while registrable 2 is capture). With the current approach we're still going to end up executing the non-registrable tasks (i.e: those that transfer data into the mixer) every time the registrable tasks are scheduled. Also, I'm not sure messing around with the state of the tasks inside the domain is an OK idea as they're used by the scheduler.

Also, is this patch still needed? If I recall correctly you already submitted a workaround for this issue. If not then I would suggest waiting for the transition to the native interface + timer domain. For now, it seems to work fine on i.MX93 and I'm hoping it will also work for our other platforms.

TangleZ (Author)

We did have a workaround for this issue, but it does not fix the root cause. As I said before, the root cause is a scheduling bug.
If you think we don't need this patch, we'd better switch to the native interface + timer domain in the next release.

@dbaluta (Collaborator) commented Jan 10, 2024

We have a workaround for this. We will have a proper fix in the next release. The ideal solution is to align with Intel on this and use the timer domain.

@dbaluta closed this Jan 10, 2024