schedule: zephyr_ll: Fix schedule bug for multi instances #8653
Conversation
Force-pushed from 3d4136e to f140a8d (Compare)
@TangleZ you hear noise when playing & recording in parallel? When the dma_domain for Zephyr was introduced, I believe @LaurentiuM1234 tested this scenario (aplay || arecord).
I guess this was bound to happen at some point. I did indeed test the case in which we have multiple pipeline tasks and it didn't happen. I'm assuming this didn't happen because the buffers were keeping …

Anyways, I don't believe this solution is the way to go. It doesn't seem to take into account that you can also have non-registrable pipeline tasks, which are not bound to a chan_data structure.

Also, why are you using the notifier API? It doesn't seem necessary?
I've not seen these issues on Intel platforms with 8 streams running at the same time with timer-based scheduling, so I suspect it relates to DMA-based scheduling.
However, both streams with DMA scheduling
- need to be independent, as they will have different time domains.
- can't be connected, due to 1.
- may need different priorities to ensure no xruns.
It looks like here you are using the same DMA IRQ to schedule both pipelines? If so, maybe better to use the timer?
@lgirdwood thanks for your comments. I'm not quite clear on what you mean by 1 and 2, but for 3 the answer should be no. In my understanding, every instance should have the same priority when running. What I want to do is to make sure we only process the task that is triggered by the right DMA IRQ.
@LaurentiuM1234 let's use an example to clarify the issue: there are two instances, one is playback (EDMA 0) and the second (EDMA 1) is record, on the QM platform. When interrupt 0 causes the scheduler to run its tasks, there is no code to guarantee that only the playback task is processed; the task belonging to the record stream is processed as well. The same thing happens for interrupt 1.

"It doesn't seem to take into account that you can also have non-registrable pipeline tasks which are not bound to a chan_data structure."
"Also, why are you using the notifier API? It doesn't seem necessary?"
ok, got you - but the DMA scheduler domain logic should have a callback for each DMA IRQ source to allow correct IRQ -> pipeline mapping?
Yes, you are right. The problem is as I replied to LaurentiuM1234: "There is no code to guarantee only the right task is processed". As for no issues being found on the Intel platform, I guess the reason is the same as on our i.MX8MP platform, which uses SDMA and handles tasks that have no data to copy gracefully.
src/schedule/zephyr_ll.c
Outdated
	}

	/* update task state */
	task->state = state;
There are a few changes in this commit, including this one, that modify behaviour of the (pipeline) task scheduler which is very central to SOF. If anything this will need a thorough review and extensive testing.
and multiple failures in https://sof-ci.01.org/sofpr/PR8653/build1332/devicetest/index.html
Yes, this change needs a lot of testing.
About the failures, I can't get any info about what makes the CI fail; can you please clarify it for me?
@@ -222,7 +222,8 @@ static inline bool domain_is_pending(struct ll_schedule_domain *domain,
 {
 	bool ret;

-	assert(domain->ops->domain_is_pending);
+	if (!domain->ops->domain_is_pending)
+		return true;
you add a .domain_is_pending method to the DMA scheduling domain, so presumably this change isn't needed?
@lyakh I think zephyr_domain.c doesn't implement this, so this is needed not to break non-DMA LL cases.
Hmm, I guess this is ok. The diff in zephyr_ll.c looks a bit scary in the GitHub view, but as far as I can tell, this should be a no-op for timer-driven use-cases.
Not 100% sure how suitable the is_pending() interface is to implement this. I think originally this was in the domain interface to cater for one-shot and delayed tasks. Now with the move to Zephyr, the recommendation is to use native Zephyr interfaces to do delayed processing, and what remains in the LL scheduler is really about running the LL tasks every timer tick; the domain is really only used to create and manage the timer thread.
So in that context, checking is_pending() is a bit of a waste of DSP cycles for the timer-driven case.
For the DMA-driven case, in theory it should be enough to run all the tasks once per LL tick (on whatever is the first DMA IRQ that fires). But alas, with the current architecture for DMA-driven scheduling, it's not so easy to elect the "main IRQ". This would have to be the IRQ with the highest rate, and if a task is removed, a new "main IRQ" would have to be elected.
So with that, maybe this is the best approach in the end.
Force-pushed from 962b4cf to 2c470f3 (Compare)
Thanks, code looks clean to me now. But tests in CI still fail, and it would seem to be linked to one particular user of the LL scheduler, the chain-dma. See my comment inline.
@@ -237,6 +244,8 @@ static void zephyr_ll_run(void *data)
 			break;
 		}
 	}
+	/* update task state */
+	task->state = state;
I'm wondering if it's this addition on L247-248 that is causing some unexpected failures. It seems to affect test configurations that use the chain-dma feature (which basically bypasses creation of full audio pipelines and uses a simple copy LL task to push data through two interfaces). I can see DSP panics in configurations where chain-dma is enabled:
https://sof-ci.01.org/sofpr/PR8653/build1416/devicetest/index.html
Looking at the CI, I can't see any info about the failures on my side:
Can you please help me understand where the DSP panics show up, and I will try to fix them.
@TangleZ can you refresh and enable scripts in your browser for this site; it should then show you the logs.
@TangleZ Sorry for the late response, I was out for Xmas. Here's an example log showing a panic:
https://sof-ci.01.org/sofpr/PR8653/build1416/devicetest/index.html?model=TGLU_UP_HDA-ipc4&testcase=multiple-pause-resume-50
This is running "~/sof-test/test-case/multiple-pause-resume.sh -r 50" on an Intel TGL based laptop. The panic hits when the test uses the HDMI/DP PCMs that use chain-dma.
@TangleZ can you also please test mixer scenarios on 8qxp or 8qm? I'm heading out for Christmas holidays but will follow this PR.
Sure. Can you please send me the test cmd or tell me where I can find it?
@dbaluta I fixed the issue with the mixer by checking if the task is registrable in the pending function.
	/* Move the task to a temporary list */
	list_item_del(list);
	list_item_append(list, &task_head);

	if (task->state != SOF_TASK_STATE_RUNNING)
ditto, re inline comments - the scheduler can be complex, so let's add comments to help.
Done.
Also, I can see the failures in CI, and they are related to the Intel tests. But I can't run the tests on my side because we have no Intel board.
Can you please help? I will also look at the patch to see if there is anything I can change to fix the CI failures.
Ack, @TangleZ I'll try to debug this a bit early next week, agreed this is difficult to debug without a board available.
Thanks for your help!
When running two instances at the same time the sound has noise, for example playback and record on the i.MX8QM or i.MX8ULP platform. The reason is that one DMA interrupt will process two tasks, because the scheduler has no mechanism to ensure only the task that needs processing is handled. Fix this issue by adding a check that the task is in the pending state. Signed-off-by: Zhang Peng <peng.zhang_8@nxp.com>
	pipe_task = pipeline_task_get(task);

	if (!pipe_task->registrable)
What if you have 2 registrable pipeline tasks and 2 non-registrable? If I recall correctly this case is supported by the mixer topologies we have (registrable 1 is playback while registrable 2 is capture). With the current approach we're still going to end up executing the non-registrable tasks (i.e., those that transfer data into the mixer) every time the registrable tasks are scheduled. Also, I'm not sure messing around with the state of the tasks inside the domain is an OK idea, as they're used by the scheduler.
Also, is this patch still needed? If I recall correctly you already submitted a workaround for this issue. If not then I would suggest waiting for the transition to the native interface + timer domain. For now, it seems to work fine on i.MX93 and I'm hoping it will also work for our other platforms.
We did have a workaround for this issue, but it does not fix the root cause. As I said before, the root cause is a scheduler bug.
If you think this patch is not needed, we'd better switch to the native interface + timer domain in the next release.
We have a workaround for this. We will have a proper fix with the next release. The ideal solution is to align with Intel on this and use the timer domain.
When running two instances at the same time the sound has noise, for example playback and record on the i.MX8QM or i.MX8ULP platform.
The reason is that one DMA interrupt will process two tasks, because the scheduler has no mechanism to ensure only the task that needs processing is handled.
Fix this issue by adding a check that the task is in the pending state.