
DP: provide data to next LL module no earlier than DP deadline #8511

Merged - 1 commit merged into thesofproject:main on Dec 19, 2023

Conversation

@marcinszkudlinski (Contributor):

Let's assume a DP module with a 10ms period (a.k.a. a deadline).
It starts and finishes earlier, i.e. in 2ms, providing 10ms of data.
LL starts consuming the data in 1ms chunks and will drain the
10ms buffer in 10ms, expecting a new portion of data in the 11th ms.

BUT - the DP module's deadline is still 10ms,
regardless of whether it finished earlier, and it is completely fine
if processing in the next cycle takes the full 10ms - as long as it
fits into the deadline.

This may lead to underruns:

LL1 (1ms) ---> DP (10ms) ---> LL2 (1ms)

ticks 0..9  - LL1 is producing 1ms data portions,
              DP is waiting, LL2 is waiting
tick 10     - DP has enough data to run, it starts processing
tick 12     - DP finishes earlier, LL2 starts consuming,
              LL1 is producing data
ticks 13-19 - LL1 is producing data,
              LL2 is consuming data (both in 1ms chunks)
tick 20     - DP starts processing a new portion of 10ms data,
              having 10ms to finish
              !!!! but LL2 has already consumed 8ms !!!!
tick 22     - LL2 is consuming the last 1ms data chunk
tick 23     - DP is still processing, LL2 has no data to process
              !!! UNDERRUN !!!!
tick 29     - DP finishes properly within its deadline

Solution: even if DP finishes before its deadline, the data must be held until the deadline, so LL2 may start processing no earlier than tick 20.
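A minimal sketch of the gating mechanism (my reading of the approach; ll_cycles_to_deadline and deadline_ll_cycles come from this PR's diff, while dp_pdata, dp_task_trigger() and dp_release_output_to_ll() are illustrative names only):

struct dp_pdata {
	unsigned int deadline_ll_cycles;    /* DP deadline expressed in LL ticks, e.g. 10 */
	unsigned int ll_cycles_to_deadline; /* LL ticks remaining until the deadline */
};

/* illustrative stub: make the DP output readable by the next LL module */
static void dp_release_output_to_ll(struct dp_pdata *pdata)
{
	(void)pdata;
}

/* called when the DP task is triggered: restart the countdown */
static void dp_task_trigger(struct dp_pdata *pdata)
{
	pdata->ll_cycles_to_deadline = pdata->deadline_ll_cycles;
}

/* called on every LL tick: expose the produced data to the following
 * LL module only once the deadline has elapsed, even if the DP module
 * finished processing much earlier */
static void dp_scheduler_ll_tick(struct dp_pdata *pdata)
{
	if (pdata->ll_cycles_to_deadline > 0)
		pdata->ll_cycles_to_deadline--;

	if (pdata->ll_cycles_to_deadline == 0)
		dp_release_output_to_ll(pdata);
}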

@marcinszkudlinski (Contributor Author):

comment from prev PR:

@lgirdwood: thinking aloud - would this be better as an int, i.e. a counter that could be set to any delay value needed and decremented on each LL tick?

Such a counter is in the DP scheduler context. I did not put it here because this delayed startup matters at the 1st run only, while the counter is valid at all times.

* ticks 13-19 LL1 is producing data, LL2 is consuming data (both in 1ms chunks)
* tick 20 - DP starts processing a new portion of 10ms data, having 10ms to finish
* !!!! but LL2 has already consumed 8ms !!!!
* tick 22 - LL2 is consuming the last 1ms data chunk
Contributor:

Per your assumption, DP needs 2ms to generate 10ms of data. So the second time, DP starts processing at tick 20 and should finish at tick 22 - and at tick 22, LL2 can still get 1ms of data.

Your scenario is maybe based on the assumption that DP needs 3ms to finish the 10ms of processing.

@marcinszkudlinski (Contributor Author) commented Nov 23, 2023:

Again, why do you think that DP will finish in 2ms in the second cycle? And the 3rd? And later? What if a second pipeline starts with 2 more DPs with 3ms deadlines (having shorter deadlines and therefore being scheduled before the 10ms one)?

And - why do you think that every process takes the same number of CPU cycles?

The only guaranteed time is the deadline. The infrastructure must ensure that the data flow is not disturbed as long as the module finishes within its deadline.

@marcinszkudlinski (Contributor Author) commented Nov 23, 2023:

I.e., say we have 2 DPs - one with a 10ms period and 5ms processing time, the second with a 5ms period and 2ms processing time:

LL --> DP1 (10ms / 5ms) ---> LL ---> DP2 (5ms / 2ms) --> LL

1st cycle - DP2 is not yet processing, so DP1 gets full CPU power:
| DP1 1,2,3,4,5 | finishes in 5ms

later - CPU time is sliced between DP1 and DP2

|DP1 1,2,3|   |DP2 1, 2|  |DP1 4,5| (idle for 1ms) |DP2 1, 2|  |DP1 1,2,3|   |DP2 1, 2|  |DP1 4,5|
                                 ^
                           finish in 7ms

The CPU is loaded at 90% (5ms/10ms = 50% for DP1 plus 2ms/5ms = 40% for DP2); DP1 finishes in 7ms (deadline met) even though it was actively processing for only 5ms, and DP2 finishes in 2ms (deadline met).
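(Conceptually, the DP scheduler always runs the ready task with the earliest deadline - a generic EDF selection sketch, with names of my own invention rather than SOF's actual scheduler code:)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dp_task_slot {
	bool ready;              /* a full input chunk is available */
	uint64_t deadline_tick;  /* absolute deadline, in LL ticks */
};

/* pick the ready task with the earliest deadline; DP2 (5ms deadline)
 * is therefore scheduled ahead of DP1 (10ms) whenever both are ready */
static struct dp_task_slot *edf_pick(struct dp_task_slot *tasks, int n)
{
	struct dp_task_slot *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (!tasks[i].ready)
			continue;
		if (!best || tasks[i].deadline_tick < best->deadline_tick)
			best = &tasks[i];
	}
	return best;
}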

Contributor:

I understand the underrun case; I was just checking the example - the later example shows exactly the real scenario for a DSP workload.

So, the questions still are:

  1. Is the DP processing period configurable? I mean, can it also be configured as 2ms?
  2. If not, then there would be significant delays. Take DTS as an example: if DTS requires a delay < 10ms, with the above case we have a 20ms+ delay - how do we handle this situation?

@marcinszkudlinski (Contributor Author):

It is configurable by the module itself: it can set any period it wants during the prepare method, or the period will be calculated based on OBS and the data rate.

@marcinszkudlinski (Contributor Author):

Rebased to newest head - retriggering stalled CI.


/* trigger the task */
curr_task->state = SOF_TASK_STATE_RUNNING;
k_sem_give(&pdata->sem);
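/* restart the countdown of LL ticks until this task's deadline */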
pdata->ll_cycles_to_deadline = pdata->deadline_ll_cycles;
Collaborator:

I think this should go before the semaphore to avoid a race

@marcinszkudlinski (Contributor Author):

It does not matter: the semaphore give releases a module thread, which has no access to the pdata context, and the context itself is protected by scheduler_dp_lock()/scheduler_dp_unlock().

Anyway, I see CI is still stalled so I have to retrigger anyway - I'll change it, since it looks suspicious at first glance.
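(The reordered version would presumably be just:)

/* trigger the task */
curr_task->state = SOF_TASK_STATE_RUNNING;
pdata->ll_cycles_to_deadline = pdata->deadline_ll_cycles;
k_sem_give(&pdata->sem);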

@kv2019i (Collaborator) commented Nov 24, 2023:

Note: we can ignore the "sof-ci/jenkins/pr-fw-build" failures. We have a new test job, "sof-ci/jenkins/pr-build", that covers both the FW and the tools in one job (and @keqiaozhang, we need to make the new one "Required").

@marcinszkudlinski marcinszkudlinski changed the title DP: provide data to next LL module no earlier than DP deadline [DNM] DP: provide data to next LL module no earlier than DP deadline Nov 24, 2023
@marcinszkudlinski (Contributor Author) commented Nov 24, 2023:

Please DNM - some internal full-range tests failed because of this patch; I must double-check the root cause.

@lgirdwood (Member):

Please DNM - some internal full-range tests failed because of this patch; I must double-check the root cause.

BTW, a west update fixed a lot of CI results today, so please make sure you rebase before retesting. Thanks!

@lgirdwood lgirdwood added this to the v2.9 milestone Dec 11, 2023
@marcinszkudlinski marcinszkudlinski changed the title [DNM] DP: provide data to next LL module no earlier than DP deadline DP: provide data to next LL module no earlier than DP deadline Dec 15, 2023
@marcinszkudlinski (Contributor Author):

Rebased to newest main. Removing DNM, please proceed.

@btian1 (Contributor) left a comment:

If LL2 starts from 20ms, what if at that time DP is already ready to output the second 10ms of data to the output buffer - will DP be put on hold until LL2 has consumed the first 10ms?

@marcinszkudlinski (Contributor Author):

If LL2 starts from 20ms, what if at that time DP is already ready to output the second 10ms of data to the output buffer - will DP be put on hold until LL2 has consumed the first 10ms?

The buffer at the DP output is set to 2*OBS and DP always starts on its 10ms period, so even if DP finishes in 0.000001ms there will still be enough space to store the processed data - because the following LL will have drained the previous 10ms of data by then.
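(A sketch of that sizing argument; the helper name is illustrative only:)

#include <stddef.h>

/* obs is the output buffer size for one DP period (e.g. 10ms of audio).
 * One OBS-sized chunk may still be held until the deadline and drained
 * by the next LL module while DP is already writing the following
 * chunk - so two chunks must fit simultaneously. */
static size_t dp_output_buffer_size(size_t obs)
{
	return 2 * obs;
}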

As to latency:
DP must produce data within the deadline; the deadline is the worst case for DP.
If it is certain that DP processing will always finish faster, there's no problem with setting a shorter deadline - even 1ms for 10ms of processing.
The default deadline is the processing period, i.e. the longest possible value. The deadline cannot, under any circumstances, be longer than the processing period, because the following LL module wouldn't get data on time.

Example - let's say we have a DP module with 10ms chunks, 8ms processing time, and a 10ms deadline.

  1. use LL with 10ms period

DMIC DMA buffer (10ms) - processing may start when the data are fully loaded, so 10ms latency here

  • process of 10ms data LL1
  • process of 10ms data LL2
  • process of 10ms data LL3
  • process of 10ms data LL4
  • process of 10ms data LL5 (in total 10ms of LL processing - including the 8ms of DP)
    HOST DMA buffer (10ms) - transfer may start when the data are loaded, immediately when LL finishes.

Total latency -> 20ms, no matter whether processing takes 8ms or 2ms - always 2 * the LL period.

  2. LL 1ms + DP with 10ms data chunks and a 10ms deadline

DMIC DMA (1ms) - processing may start when the data are loaded, so 1ms latency here

  • process of 1ms data LL1
  • process of 1ms data LL2 (1ms latency for processing)
    accumulate 10ms of data for DP (processing may start when buffer is loaded - 10 ms latency)
  • process of 10ms data in DP - takes 8ms, but with the 10ms declared deadline it counts as 10ms latency (22ms in total so far)
    accumulate 10ms of data for DP (no latency here - just data storage till deadline)
  • process of 1ms data LL4
  • process of 1ms data LL5 (1ms latency for processing)
    HOST DMA (1ms) transfer may start when the data are loaded, immediately when LL finishes.

Total latency is 23ms - only 3ms more than in the 10ms LL case.

It may be shorter if you are certain that DP will finish sooner in every single cycle - e.g. in 4.9ms:

  3. LL 1ms + DP with 10ms data chunks and a 5ms deadline

DMIC DMA (1ms) - processing may start when the data are loaded, so 1ms latency here

  • process of 1ms data LL1
  • process of 1ms data LL2 (1ms latency for processing)
    accumulate 10ms of data for DP (processing may start when buffer is loaded - 10 ms latency)
  • process of 10ms data in DP - takes 4.9ms, with a declared deadline of 5ms, so latency is 5ms
    accumulate 10ms of data for DP (no latency here)
  • process of 1ms data LL4
  • process of 1ms data LL5 (1ms latency for processing)
    HOST DMA (1ms) transfer may start when the data are loaded, immediately when LL finishes.

Total latency is 18ms - shorter than with the 10ms LL!

DP in fact introduces only 3ms of additional delay, because data must wait for the next LL cycle - but that's it. You can actually get shorter latency using DP.
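(The budget in the two DP cases above can be summed up in a rough model - my own reading of the numbers, not a formula from the code:)

/* Approximate pipeline latency in ms for LL(1ms) -> DP -> LL(1ms):
 * 1ms DMIC DMA + 1ms LL-in + chunk accumulation + DP deadline + 1ms LL-out.
 * (10, 10) -> 23ms (case 2), (10, 5) -> 18ms (case 3). */
static int pipeline_latency_ms(int dp_chunk_ms, int dp_deadline_ms)
{
	return 1 + 1 + dp_chunk_ms + dp_deadline_ms + 1;
}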

@btian1 (Contributor) commented Dec 19, 2023:

"buffer at DP output is set to 2*OBS, DP always starts in 10ms period, so even if DP finishes in 0.000001ms - there will still be enough space for store processed data - because following LL will drain 10ms data till then"

Thanks. For this case, how do we handle a linear buffer for LL2? I added some comments in another PR. The options are:

  1. use a wrapped buffer as the DP output buffer.
  2. always move data to the head once 1ms is consumed by LL2.

Will you take 2?

@marcinszkudlinski (Contributor Author) commented Dec 19, 2023:

@btian1

use wrapped buffer in DP output buffer.

It is a wrapped/circular buffer.

always move data to head once 1ms consumed by LL2.

What good would that do?

  • DP won't be able to use the additional space,
  • LL2 won't be able to rely on linearity - as it won't know whether it's being fed by DP (linear in your case) or by LL (currently circular).

And it would, no doubt, cost additional copying every LL cycle.
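(For reference, a generic ring-buffer read with wrap handling - an illustrative sketch, not SOF's actual comp_buffer/audio_stream API - showing the split copy a consumer performs instead of moving data to the head:)

#include <stddef.h>
#include <string.h>

struct ring_buf {
	char *data;
	size_t size;  /* e.g. 2 * OBS */
	size_t rpos;  /* current read position */
};

/* read n bytes; the region may wrap around the end of the buffer,
 * so the copy is split in two - no data is ever moved to the head */
static void ring_read(struct ring_buf *rb, char *dst, size_t n)
{
	size_t contiguous = rb->size - rb->rpos;

	if (n <= contiguous) {
		memcpy(dst, rb->data + rb->rpos, n);
	} else {
		memcpy(dst, rb->data + rb->rpos, contiguous);
		memcpy(dst + contiguous, rb->data, n - contiguous);
	}
	rb->rpos = (rb->rpos + n) % rb->size;
}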

@lgirdwood lgirdwood merged commit 3d4883a into thesofproject:main Dec 19, 2023
43 of 44 checks passed
@marcinszkudlinski marcinszkudlinski deleted the dp-fix2 branch June 21, 2024 07:29