
schedule: zephyr_ll: Fix schedule bug for multi instances #8653

Closed

Conversation

@TangleZ commented Dec 20, 2023

When two instances run at the same time the sound has noise, for example playback and record on the i.MX8QM or i.MX8ULP platform.

The reason is that one DMA interrupt will process two tasks, because the scheduler has no mechanism to make sure only the task that needs processing is handled.

Fix this issue by adding a check for whether the task is in the pending state.
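For illustration, a minimal standalone sketch of the failure mode and the intended fix (this is not SOF code; all names such as ll_task, dma_chan and task_is_pending are hypothetical): a scheduler that runs every low-latency task on any DMA interrupt mixes the two streams, while a per-task pending check keyed to the interrupting channel runs only the task that belongs to that interrupt.

#include <stdbool.h>
#include <stdio.h>

struct dma_chan {
	bool irq_fired;                  /* set by the (simulated) DMA ISR */
};

struct ll_task {
	const char *name;
	struct dma_chan *chan;           /* the channel driving this stream */
	void (*run)(struct ll_task *task);
};

static void copy_period(struct ll_task *task)
{
	printf("%s: copy one period\n", task->name);
}

/* The idea behind this PR: only run a task whose own channel raised the
 * interrupt that woke the scheduler. */
static bool task_is_pending(const struct ll_task *task)
{
	return task->chan->irq_fired;
}

static void ll_scheduler_run(struct ll_task **tasks, int n, bool check_pending)
{
	for (int i = 0; i < n; i++) {
		if (check_pending && !task_is_pending(tasks[i]))
			continue;        /* skip tasks woken by another DMA IRQ */
		tasks[i]->run(tasks[i]);
	}
}

int main(void)
{
	struct dma_chan edma0 = { .irq_fired = true };   /* playback IRQ fired */
	struct dma_chan edma1 = { .irq_fired = false };  /* record IRQ did not */
	struct ll_task playback = { "playback", &edma0, copy_period };
	struct ll_task record = { "record", &edma1, copy_period };
	struct ll_task *tasks[] = { &playback, &record };

	puts("without a pending check, one IRQ runs both tasks:");
	ll_scheduler_run(tasks, 2, false);
	puts("with a pending check, only the playback task runs:");
	ll_scheduler_run(tasks, 2, true);
	return 0;
}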

@iuliana-prodan (Contributor)

@TangleZ you hear noise when playing & recording in parallel?
But why does this happen, what changed lately?
What introduced this bug?

When the dma_domain for Zephyr was introduced, I believe @LaurentiuM1234 tested this scenario (aplay || arecord).
Can you confirm @LaurentiuM1234?
Any idea why this happens now? Why, so far, we didn't need is_pending for Zephyr?
I know for XTOS we have it.

@LaurentiuM1234 (Contributor) commented Dec 20, 2023

When the dma_domain for Zephyr was introduced, I believe @LaurentiuM1234 tested this scenario (aplay || arecord).
Can you confirm @LaurentiuM1234?
Any idea why this happens now? Why, so far, we didn't need is_pending for Zephyr?
I know for XTOS we have it.

I guess this was bound to happen at some point. I did indeed test the case in which we have multiple pipeline tasks and it didn't happen. I'm assuming this didn't happen because the buffers were keeping pipeline_copy() from overwriting data from the DMA buffer on DAI's side (i.e: the buffers are full of data so you can't copy anymore from sources to sinks). There's also the timing that may affect this or not (i.e: the DMA interrupts for the pipeline tasks are very close to each other). But yeah this is a known flaw of zephyr_ll and zephyr_dma_domain.

Anyways, I don't believe this solution is the way to go. It doesn't seem to take into account that you can also have non-registrable pipeline tasks which are not bound to a chan_data structure. As such, you'd end up with the non-registrable pipeline tasks not being executed at all.

Also, why are you using the notifier API? It doesn't seem necessary?

@lgirdwood (Member) left a comment

I've not seen this issue on Intel platforms with 8 streams running at the same time with timer-based scheduling, so I suspect it relates to DMA-based scheduling.
However, both streams with DMA scheduling:

  1. need to be independent, as they will have different time domains.
  2. can't be connected, due to 1.
  3. may need different priorities to ensure no xruns.

It looks like here you are using the same DMA IRQ to schedule both pipelines? If so, maybe it would be better to use the timer?

@TangleZ (Author) commented Dec 20, 2023

@lgirdwood thanks for your comments. I'm not very clear on what you mean in points 1 and 2, but for 3 the answer should be no. In my understanding, every instance should have the same priority when running.
And it is not accurate to say we are "using the same DMA IRQ to schedule both pipelines": depending on the DMA type, the instances may share the same DMA IRQ (SDMA) or have different DMA IRQs (EDMA).

What I want to do is to make sure we only process the task that is triggered by the right DMA IRQ.

@TangleZ (Author) commented Dec 20, 2023

@LaurentiuM1234 let's use an example to clarify the issue: there are two instances on the QM platform, where the first is playback (EDMA 0) and the second is record (EDMA 1). When interrupt 0 causes the scheduler to run its task, there is no code to guarantee that only this task is processed; the task belonging to the record is processed as well. And the same situation also happens for interrupt 1.

"It doesn't seem to take into account that you can also have non-registrable pipeline tasks which are not bound to a chan_data structure."
Actually, the tasks which are not bound to a chan_data do not implement the pending function, so they are not affected.

"Also, why are you using the notifier API? It doesn't seem necessary?"
I'm not sure the notifier API is a good fit; we can discuss how to improve this.

@lgirdwood (Member)

@lgirdwood thanks for your comments. I'm not very clear on what you mean in points 1 and 2, but for 3 the answer should be no. In my understanding, every instance should have the same priority when running. And it is not accurate to say we are "using the same DMA IRQ to schedule both pipelines": depending on the DMA type, the instances may share the same DMA IRQ (SDMA) or have different DMA IRQs (EDMA).

What I want to do is to make sure we only process the task that is triggered by the right DMA IRQ.

ok, got you - but the DMA scheduler domain logic should have a callback for each DMA IRQ source to allow correct IRQ -> pipeline mapping?

@TangleZ (Author) commented Dec 21, 2023

@lgirdwood thanks for your comments. I'm not very clear on what you mean in points 1 and 2, but for 3 the answer should be no. In my understanding, every instance should have the same priority when running. And it is not accurate to say we are "using the same DMA IRQ to schedule both pipelines": depending on the DMA type, the instances may share the same DMA IRQ (SDMA) or have different DMA IRQs (EDMA).
What I want to do is to make sure we only process the task that is triggered by the right DMA IRQ.

ok, got you - but the DMA scheduler domain logic should have a callback for each DMA IRQ source to allow correct IRQ -> pipeline mapping?

Yes, you are right. The problem is as I replied to LaurentiuM1234: there is no code to guarantee that only the right task is processed. As for why no issues are found on the Intel platform, I guess the reason is the same as on our i.MX8MP platform, which uses SDMA and happens to handle the tasks that have no data to copy gracefully.

src/schedule/zephyr_dma_domain.c (outdated review thread, resolved)
}
/* update task state */
task->state = state;
Collaborator

There are a few changes in this commit, including this one, that modify behaviour of the (pipeline) task scheduler which is very central to SOF. If anything this will need a thorough review and extensive testing.

TangleZ (Author)

Yes, this change needs a lot of testing.
About the failures, I can't get any info about what makes the CI fail; can you please clarify it for me?

@@ -222,7 +222,8 @@ static inline bool domain_is_pending(struct ll_schedule_domain *domain,
 {
 	bool ret;
 
-	assert(domain->ops->domain_is_pending);
+	if (!domain->ops->domain_is_pending)
+		return true;
Collaborator

you add a .domain_is_pending method to the DMA scheduling domain, so presumably this change isn't needed?

Collaborator

@lyakh I think zephyr_domain.c doesn't implement this, so this is needed not to break non-DMA LL cases.
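To spell out the fallback being discussed, here is a minimal standalone sketch (stand-in types, not the real SOF structures, and the extra parameters of domain_is_pending are omitted): a domain that does not provide the new hook, such as the timer domain in zephyr_domain.c, is simply treated as always pending, so its tasks keep running every tick.

#include <stdbool.h>

struct ll_schedule_domain_ops {
	/* optional: only the DMA scheduling domain implements this in the PR */
	bool (*domain_is_pending)(void *domain);
};

struct ll_schedule_domain {
	const struct ll_schedule_domain_ops *ops;
};

static inline bool domain_is_pending(struct ll_schedule_domain *domain)
{
	/* no hook -> behave as before the PR: every task is pending */
	if (!domain->ops->domain_is_pending)
		return true;

	return domain->ops->domain_is_pending(domain);
}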

@kv2019i (Collaborator) left a comment

Hmm, I guess this is OK. The diff in zephyr_ll.c looks a bit scary in the GitHub view, but as far as I can tell, this should be a no-op for timer-driven use cases.

I'm not 100% sure how suitable the is_pending() interface is for implementing this. I think this was originally in the domain interface to cater for one-shot and delayed tasks. Now, with the move to Zephyr, the recommendation is to use native Zephyr interfaces for delayed processing, and what remains in the LL scheduler is really about running the LL tasks every timer tick; the domain is really only used to create and manage the timer thread.

So in that context, checking is_pending() is a bit of a waste of DSP cycles for the timer-driven case.

For the DMA-driven case, in theory it should be enough to run all the tasks once per LL tick (on whichever DMA IRQ fires first). But alas, with the current architecture for DMA-driven scheduling, it's not so easy to elect the "main IRQ". This would have to be the IRQ with the highest rate, and if a task is removed, a new "main IRQ" would have to be elected.

So with that, maybe this is the best approach in the end.

src/schedule/zephyr_ll.c (outdated review thread, resolved)
@TangleZ force-pushed the sche_fix branch 2 times, most recently from 962b4cf to 2c470f3 on December 22, 2023 08:18
@kv2019i (Collaborator) left a comment

Thanks, the code looks clean to me now. But tests in CI still fail, and it seems to be linked to one particular user of the LL scheduler, chain-dma. See my comment inline.

@@ -237,6 +244,8 @@ static void zephyr_ll_run(void *data)
 			break;
 		}
 	}
+	/* update task state */
+	task->state = state;
Collaborator

I'm wondering if it's this addition on L247-248 that is causing some unexpected failures. It seems to affect test configurations that use the chain-dma feature (which basically bypasses creation of full audio pipelines and uses a simple copy LL task to push data through two interfaces). I can see DSP panics in configurations where chain-dma is enabled:
https://sof-ci.01.org/sofpr/PR8653/build1416/devicetest/index.html

TangleZ (Author)

According to the CI page, I can't see any info about the failures on my side:
TEST #{{ internalResultID }}
CI build start time:
On-device test with:
Kernel Build Info
SOF Build Info
PR ID: {{ planResult.linuxPrID }}
Linux Branch: {{ planResult.linuxBranch }}
Commit: Merge {{ shortenCommit(planResult.linuxParents.head) }} into {{ shortenCommit(planResult.linuxParents.target) }} ⇒ {{ shortenCommit(planResult.linuxSource.linuxCommit) }}
Kconfig Branch: {{ shortenCommit(planResult.kconfigBranch) }}
Kconfig Commit: {{ shortenCommit(planResult.linuxSource.kconfigCommit) }}
PR ID: {{ planResult.sofPrID }}
SOF Branch: {{ planResult.sofBranch }}
Commit: Merge {{ shortenCommit(planResult.sofParents.head) }} into {{ shortenCommit(planResult.sofParents.target) }} ⇒ {{ shortenCommit(planResult.sofSource.sofCommit) }}
Zephyr Commit: {{ shortenCommit(planResult.sofSource.zephyrCommit) }}
Copy Link

Can you please help me understand where the DSP panics show up, and I will try to fix them.

Member

@TangleZ can you refresh and enable scripts in your browser for this site, it should then show you logs.

Collaborator

@TangleZ Sorry for the late response, I was out for Xmas. Here's an example log showing a panic:
https://sof-ci.01.org/sofpr/PR8653/build1416/devicetest/index.html?model=TGLU_UP_HDA-ipc4&testcase=multiple-pause-resume-50

This is running "~/sof-test/test-case/multiple-pause-resume.sh -r 50" on an Intel TGL-based laptop. The panic hits when the test uses the HDMI/DP PCMs that use chain-dma.

@dbaluta (Collaborator) commented Dec 22, 2023

@TangleZ can you also please test mixer scenarios on 8qxp or 8qm? I'm heading out for the Christmas holidays but will follow this PR.

@TangleZ (Author) commented Dec 23, 2023

@TangleZ can you also please test mixer scenarios on 8qxp or 8qm? I'm heading out for christmas holidays but will follow this PR.

Sure. Can you please send me the test command, or point me to where I can find it?

@TangleZ (Author) commented Jan 3, 2024

@dbaluta I fixed the issue with the mixer by checking whether the task is registrable in the pending function.
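For context, a minimal sketch of the registrable guard described here, using stand-in fields rather than the real SOF structures (the registrable and DMA-interrupt fields are modeled, not copied from the patch): non-registrable pipeline tasks, such as the host pipelines feeding a mixer, are not bound to their own DMA channel, so the pending check must always let them run together with the registrable task.

#include <stdbool.h>

struct pipeline_task {
	bool registrable;       /* task that owns the DMA-driven scheduling source */
	bool dma_irq_fired;     /* stand-in for "this task's DMA channel interrupted" */
};

/* Sketch of the pending decision after the mixer fix: only registrable
 * tasks are filtered by their own DMA interrupt; non-registrable ones
 * always run alongside the registrable task they feed. */
static bool pipeline_task_is_pending(const struct pipeline_task *pipe_task)
{
	if (!pipe_task->registrable)
		return true;

	return pipe_task->dma_irq_fired;
}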


/* Move the task to a temporary list */
list_item_del(list);
list_item_append(list, &task_head);

if (task->state != SOF_TASK_STATE_RUNNING)
Member

Ditto, re the inline comments - the scheduler can be complex, so let's add comments to help.

TangleZ (Author)

Done.

TangleZ (Author)

Also, I can see the failures in CI, and they are related to the Intel tests. But I can't run the tests on my side because we have no Intel board.
Can you please help? I will also look at the patch to see if there is anything I can change to fix the CI failures.

Collaborator

Ack, @TangleZ I'll try to debug this a bit early next week, agreed this is difficult to debug without a board available.

TangleZ (Author)

Thanks for your help!

When two instances run at the same time the sound has noise, for
example: playback and record on the i.MX8QM or i.MX8ULP platform.

The reason is that one DMA interrupt will process two tasks because
the scheduler has no mechanism to make sure only the task that needs
processing is handled.

Fix this issue by adding a check for whether the task is in the
pending state.

Signed-off-by: Zhang Peng <peng.zhang_8@nxp.com>

pipe_task = pipeline_task_get(task);

if (!pipe_task->registrable)
Contributor

What if you have 2 registrable pipeline tasks and 2 non-registrable? If I recall correctly this case is supported by the mixer topologies we have (registrable 1 is playback while registrable 2 is capture). With the current approach we're still going to end up executing the non-registrable tasks (i.e: those that transfer data into the mixer) every time the registrable tasks are scheduled. Also, I'm not sure messing around with the state of the tasks inside the domain is an OK idea as they're used by the scheduler.

Also, is this patch still needed? If I recall correctly you already submitted a workaround for this issue. If not then I would suggest waiting for the transition to the native interface + timer domain. For now, it seems to work fine on i.MX93 and I'm hoping it will also work for our other platforms.

TangleZ (Author)

We did have a workaround for this issue, but it does not fix the root cause. As I said before, the root cause is a scheduling bug.
If you think we don't need this patch, we'd better switch to the native interface + timer domain in the next release.

@dbaluta (Collaborator) commented Jan 10, 2024

We have a workaround for this. We will have a proper fix in the next release. The ideal solution is to align with Intel on this and use the timer domain.

@dbaluta closed this Jan 10, 2024