-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kpb: Fix draining task and LL conflict #8936
Conversation
63998ff
to
fdf4dab
Compare
* Hence k_sched_lock() is used temporary to block LL from preempting EDF task thread. | ||
*/ | ||
k_sched_lock(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't this break the concept of LL? The draining part could take quite a while for a large buffer, and even with this cycle of lock/unlock the entire premise of LL executing within 1ms is broken.
That's a priority inversion in my book where a low-priority task prevents a high-priority/low-latency one from being executed...
Why can't we fix the broken state of the sink instead ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On each loop iteration draining task basically just calls comp_copy() on component connected to KPB sink. So each loop iteration should take much less then 1 ms.
If timer interrupt increased semaphore on which LL thread waits, immediately at k_yield() scheduler will switch to executing LL thread since LL thread has higher priority then EDF task thread. Or am I overlooking something?
Looks like original draining task code was written in pre-Zephyr era. And probably the code works or worked without Zephyr just fine. However, the code seems were never properly ported to Zephyr. Original draining task implementation has multiple problems:
- there is no sleep/yield in loop iteration and so task blocks IPC thread for a long time (several seconds in our test) and waists power (CPU cannot reach idle between loop iterations).
- LL can preempt draining task and call comp_copy() for same component or access/modify same buffer/stream.
This fix is an attempt to fix CI failure, not to do proper reimplementation of draining task. Without the fix we have random failures on KPB test. We have to remove test temporary just not to block all the work on SOF.
The test fails with draining task runs forever because KPB sink audio_stream has broken state: avail == free == 0. Such state normally cannot happen as those values are always calculated as avail = size - free and free = size - avail. avail and free simultaneously equal to 0 can only happen when LL preempts draining task in unlucky moment.
The fix passes the test. It has no intent to re-implement draining task properly: I'm not sure if KPB is used in any production case other then demo. The intent is to fix CI and return back KPB test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like you said no one is using KPB in any firmware based on Zephyr (post 2.2).
I am not getting the idea of 'fixing CI' which consists in re-testing a broken solution based on a workaround impacting other usages.
Is KPB is so badly broken, it needs to be disabled and redone, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's value in keeping kpb tests. This exercizes a certain type of flow (running an EDF task) in CI that we don't have many others in CI and if we remove the test, this code will bitrot very quickly. The workaround here is local to kpb, so I see no harm outside kpb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kv2019i "The workaround here is local to kpb" -> not really, it would impact any other compound use case with e.g. speaker protection running.
If KPB is a test component, them let's mark it as non-default and opt-in for CI purposes. This shouldn't go in any releases IMHO.
KPB draining task is executed by low priority preemptible EDF thread. The task accesses KPB sink buffer and calls comp_copy() for component connected to sink. If LL thread preempts draining task that could result in broken state of sink buffer or component connected to sink. This fix prevents LL from preempting draining task in bad moment. Signed-off-by: Serhiy Katsyuba <serhiy.katsyuba@intel.com>
fdf4dab
to
2b74d2c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marking this PR as 'Request Changes' so that my comment on the conceptual issue is addressed. Is this really what we want to do?
This looks like a slippery slope towards an 'almost low latency' and 'your mileage may vary' real-time scheduling.
@lgirdwood @kv2019i FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand @plbossart your concern, but I also see value in keeping this in CI. See my rationale inline. We do need to do something as this is currently slowing down e.g. backporting critical fixes for SOF2.9.
* Hence k_sched_lock() is used temporary to block LL from preempting EDF task thread. | ||
*/ | ||
k_sched_lock(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's value in keeping kpb tests. This exercizes a certain type of flow (running an EDF task) in CI that we don't have many others in CI and if we remove the test, this code will bitrot very quickly. The workaround here is local to kpb, so I see no harm outside kpb.
CI discussion to be handled by firmware folks
@kv2019i the KPB tests should remain in CI as it is one of the features that is required for Windows. @serhiy-katsyuba-intel , @plbossart this change should be treated as workaround, because by design KPB should be interrupted by LL. My understanding is that for the sake of CI as a short term solution we will prevent LL from preempting draining task, but the long term goal should be to fix KPB and draining implementation (adapt to DP). |
Filed #8977 to follow-up, merging this now. |
KPB draining task is executed by low priority preemptible EDF thread. The task accesses KPB sink buffer and calls comp_copy() for component connected to sink. If LL thread preempts draining task that could result in broken state of sink buffer or component connected to sink. This fix prevents LL from preempting draining task in bad moment.
This fixes CI failures on test_003_01_kd_topology_host_gw[4ch-24in32bit-].