Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: refactor agent to be more useful #1855

Merged
merged 7 commits into from
Oct 24, 2019

Conversation

tlauda
Copy link
Contributor

@tlauda tlauda commented Sep 23, 2019

Refactors agent's functionality to really verify DSP
responsiveness. Instead of updating last_idle time
on passive level before entering waiti, let's change it
to last_check and update it on actual scheduler task
execution. Task is executed above passive level, so it
should preempt every other long running processing.
Since we don't have any additional precautions to guarantee
that low latency tick will happen exactly at the scheduled time,
let's define warning and panic thresholds to control stability
of the system.

Signed-off-by: Tomasz Lauda tomasz.lauda@linux.intel.com

src/lib/agent.c Outdated Show resolved Hide resolved
@tlauda tlauda force-pushed the topic/agent-change-check branch from 2cb6c03 to c82408b Compare September 23, 2019 09:18
src/lib/agent.c Outdated
/* warning timeout */
if (delta > sa->warn_timeout)
trace_sa_error("validate(), ll drift detected, delta = "
"%u", delta);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about logging a time of the delay in microseconds instead of bare delta? I am afraid during debugging delta won't tell us much and we will end up calculating it manually anyway.

@tlauda
Copy link
Contributor Author

tlauda commented Sep 24, 2019

SOFCI TEST

@tlauda tlauda force-pushed the topic/agent-change-check branch 2 times, most recently from 5c427a8 to a702441 Compare September 24, 2019 08:44
@tlauda
Copy link
Contributor Author

tlauda commented Sep 24, 2019

SOFCI TEST

@tlauda tlauda force-pushed the topic/agent-change-check branch 3 times, most recently from 7614dd8 to 1ee5b4a Compare September 25, 2019 10:00
@tlauda
Copy link
Contributor Author

tlauda commented Sep 25, 2019

SOFCI TEST

1 similar comment
@tlauda
Copy link
Contributor Author

tlauda commented Sep 25, 2019

SOFCI TEST

@tlauda
Copy link
Contributor Author

tlauda commented Sep 25, 2019

@xiulipan This change causes agent panic on cAVS platforms with QEMU, however works perfectly fine on real HW. Can you look into this?

@xiulipan
Copy link
Contributor

@tlauda Sure will check the issue with QEMU to see if it QEMU issue or FW issue.

@tlauda tlauda force-pushed the topic/agent-change-check branch from 1ee5b4a to b47cb70 Compare September 26, 2019 09:15
@xiulipan
Copy link
Contributor

@tlauda Checked the code with QEMU, it seems QEMU can not work was the 1ms agent scheduler.
Do we really need to make an agent to update itself every 1ms?

I am still working on the QEMU code to make it work as fast as HW but it seems very difficult to achieve to goal. @lgirdwood any idea about how to make the QEMU work as the requirement of this high rate timer setting.

@tlauda
Copy link
Contributor Author

tlauda commented Sep 27, 2019

@xiulipan It's intended to detect drift in our system tick. I've already added huge panic threshold until some additional tweaks won't be implemented, which will stabilize it more. Why it isn't failing on other platforms? The requirement is the same.

@mmaka1
Copy link

mmaka1 commented Sep 27, 2019

@xiulipan @tlauda @lgirdwood Perhaps this is the place where things might be configured more flexibly per specific platform/environment requirements:

ticks = clock_ms_to_ticks(PLATFORM_DEFAULT_CLOCK, 1) * timeout / 1000;

As Tomek explained, the intention was to add a heartbeat to something we may call a system tick. Typically it is set to 1ms but may be different per platform. Therefore I'd rephrase the formula to use CONFIG_ kind of timeout configured to longer periods in other environments if they are unable to keep that pace.

@tlauda tlauda force-pushed the topic/agent-change-check branch from b47cb70 to 19cc72b Compare September 27, 2019 09:07
@tlauda tlauda requested a review from jajanusz as a code owner September 27, 2019 09:07
@xiulipan
Copy link
Contributor

@tlauda @macchian
The QEMU is used to run a very short time and sometimes the agent is not even take effect(BYT and HSW).

I tried to refine the cavs timer implementation in QEME to make sure timer ticks can work as we expected but it seems not that easy. The older agent have a through-hold of 750ms but now we have 1.5ms.
And I am not sure if QEMU can support run that fast to make the heartbeat work. I will try to have a sync with you next monday to talk about my finding

@tlauda tlauda force-pushed the topic/agent-change-check branch from 51597b0 to 37282e8 Compare October 9, 2019 10:21
@tlauda
Copy link
Contributor Author

tlauda commented Oct 11, 2019

@xiulipan When will you have time to build separate config for QEMU? It's the only thing blocking this PR.

@xiulipan
Copy link
Contributor

@tlauda for that I will need to refine the whole logic for both Travis and Jenkins CI. It would not be that easy. Best ETA is next week.

@tlauda
Copy link
Contributor Author

tlauda commented Oct 12, 2019

@xiulipan Not a problem. Go for it.

@xiulipan
Copy link
Contributor

@tlauda Can you check https://sof-ci.01.org/sofpr/PR1855/build3509/boottest/dump-skl.txt it seems 5ms will still have random issue. Will try with 10ms or 100ms.

@tlauda
Copy link
Contributor Author

tlauda commented Oct 17, 2019

@xiulipan 10 ms helped?

@tlauda tlauda force-pushed the topic/agent-change-check branch from 37282e8 to 89fb047 Compare October 17, 2019 11:47
@tlauda
Copy link
Contributor Author

tlauda commented Oct 22, 2019

@xiulipan Any updates?

@xiulipan
Copy link
Contributor

@tlauda could you rebase, I will retry the test about 10 more times to avoid regression in CI.

@tlauda tlauda force-pushed the topic/agent-change-check branch from 89fb047 to 3d8e030 Compare October 23, 2019 08:18
@tlauda
Copy link
Contributor Author

tlauda commented Oct 23, 2019

@xiulipan Done.

@xiulipan
Copy link
Contributor

@tlauda Let merge this. I have already make the QEMU test not blocking with 10ms agent time.

I also run manually test with 100 times. No dsp panic happen. But be aware that the qemu test may be failed by the agent. Please check the QEMU test result avoid false alarm blocking test.

Copy link
Contributor

@xiulipan xiulipan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The agent works much better than old one. Nice patch.

@tlauda tlauda force-pushed the topic/agent-change-check branch from 3d8e030 to 6f851c5 Compare October 23, 2019 10:23
@tlauda
Copy link
Contributor Author

tlauda commented Oct 23, 2019

@xiulipan I'm not sure if I understand correctly. So from now on the Travis QEMU will always fail?

@jajanusz
Copy link
Contributor

@xiulipan Where did you adjust new config for qemu? In scripts I see that you still build firmware with just one flag changed for qemu CONFIG_BUILD_VM_ROM, I guess there should be also some custom CONFIG_SYSTICK_PERIOD ?

@xiulipan
Copy link
Contributor

@tlauda Forget about the travis, will send the patch to for Travis.
But you may need to re-base after the scripts merged to make sure it will take effect on this PR.

@xiulipan
Copy link
Contributor

@jajanusz I did some find and replace in the source code to change the CONFIG_SYSTICK_PERIOD default value.

@xiulipan
Copy link
Contributor

@tlauda Travis update is here: #1996

local test with PR1855 and the travis update is here https://travis-ci.org/xiulipan/sof/builds/602186422
All platform passed the QEMU test.

Please make sure we get the right order to merge these PRs

Adds additional CONFIG_SYSTICK_PERIOD to Kconfig,
which is used to drive timer based low latency scheduler
and also will be used as a timeout check value for system agent.

Signed-off-by: Tomasz Lauda <tomasz.lauda@linux.intel.com>
Removes PLATFORM_LL_DEFAULT_TIMEOUT definition. It has been
replaced by CONFIG_SYSTICK_PERIOD.

Signed-off-by: Tomasz Lauda <tomasz.lauda@linux.intel.com>
Signed-off-by: Tomasz Lauda <tomasz.lauda@linux.intel.com>
Refactors agent's functionality to really verify DSP
responsiveness. Instead of updating last_idle time
on passive level before entering waiti, let's change it
to last_check and update it on actual scheduler task
execution. Task is executed above passive level, so it
should preempt every other long running processing.
Since we don't have any additional precautions to guarantee
that low latency tick will happen exactly at the scheduled time,
let's define warning and panic thresholds to control stability
of the system.

Signed-off-by: Tomasz Lauda <tomasz.lauda@linux.intel.com>
Removes disable and enable functions, because they
are no longer needed. Agent will now update its lat_check
time above passive level, so every long running task will
be preempted.

Signed-off-by: Tomasz Lauda <tomasz.lauda@linux.intel.com>
Removes PLATFORM_IDLE_TIME definition, because it's
no longer used.

Signed-off-by: Tomasz Lauda <tomasz.lauda@linux.intel.com>
Changes scheduler_init_edf and task_main_init function
definitions as sof structure is no longer needed. It was
only used to pass agent object to main idle task.

Signed-off-by: Tomasz Lauda <tomasz.lauda@linux.intel.com>
@tlauda tlauda force-pushed the topic/agent-change-check branch from 6f851c5 to 368a739 Compare October 24, 2019 08:19
@tlauda tlauda merged commit f724ef1 into thesofproject:master Oct 24, 2019
@tlauda tlauda deleted the topic/agent-change-check branch November 22, 2019 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants