Fixed determine/process reboot-cause service dependency #88
Conversation
Signed-off-by: anamehra <anamehra@cisco.com>
After=rc-local.service database.service
Requires=rc-local.service
After=rc-local.service
ConditionPathExists=!/tmp/determine-reboot-cause
@anamehra now that you have removed the dependency on database.service, as mentioned in the PR description point 3, why would systemd restart this service if it's already activated?
The process won't restart, but if someone restarts it manually or by mistake, the script creates an unwanted reboot entry because it does not handle the restart case. I think keeping this conditional check is safe and prevents any false restart. No harm in keeping it.
@anamehra no, that is not a valid scenario to take care of. Someone running the service manually is outside the scope; the unit file is for orchestration by systemd.
The script does not handle the restart scenario. It's a gap even if the scenario is not valid, and this check fills that gap. It is better to cover it than to leave the gap open. Do you see any issue with this? If so, I can remove it. Please let me know.
@anamehra if someone could restart the service inadvertently (or otherwise), then someone could just as easily delete the ConditionPathExists file from /tmp/, or any other file for that matter.
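For reference, the guard pattern under discussion looks roughly like this (a minimal sketch; the ExecStart path and the step that creates the flag file are assumptions for illustration, not taken from this PR):

[Unit]
Description=Determine the previous reboot cause
# Skip the unit if the flag file already exists, i.e. the cause was already
# determined during this boot; a manual restart then becomes a no-op.
ConditionPathExists=!/tmp/determine-reboot-cause

[Service]
Type=oneshot
# Assumed for illustration: the script creates /tmp/determine-reboot-cause
# as its "already ran" flag file.
ExecStart=/usr/local/bin/determine-reboot-cause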
After=rc-local.service database.service
Requires=rc-local.service
After=rc-local.service
ConditionPathExists=!/tmp/determine-reboot-cause

[Service]
Type=oneshot
RemainAfterExit=yes
@anamehra this is a one-shot service and should not be running anymore; suggest removing RemainAfterExit.
RemainAfterExit only keeps the unit in the active state even after the script exits. I will remove it.
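For context, a oneshot [Service] sketch showing the effect of the directive (the ExecStart path is an assumption for illustration):

[Service]
Type=oneshot
# RemainAfterExit=yes would keep the unit reported as "active (exited)" after
# the script finishes; without it the unit becomes "inactive (dead)".
ExecStart=/usr/local/bin/process-reboot-cause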
[Timer]
OnUnitActiveSec=0 sec
@anamehra does this mean this timer will keep firing every 0 seconds once this unit becomes active?
When the process-reboot-cause service runs and becomes active, systemd does not trigger the timer any more. This is how it's used for other services as well, such as lldp@.timer.
Here is a snapshot when the service is active:
root@sonic:/home/cisco# systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
n/a n/a Mon 2023-11-27 20:38:33 UTC 11ms ago snmp.timer snmp.service
Mon 2023-11-27 20:40:00 UTC 1min 26s left Mon 2023-11-27 20:30:01 UTC 8min ago logrotate.timer logrotate.service
Mon 2023-11-27 20:40:00 UTC 1min 26s left Mon 2023-11-27 20:30:01 UTC 8min ago sysstat-collect.timer sysstat-collect.service
Mon 2023-11-27 22:36:29 UTC 1h 57min left Sun 2023-11-26 22:36:29 UTC 22h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Tue 2023-11-28 00:07:00 UTC 3h 28min left Mon 2023-11-27 00:07:00 UTC 20h ago sysstat-summary.timer sysstat-summary.service
Tue 2023-11-28 06:24:11 UTC 9h left Mon 2023-11-27 06:19:57 UTC 14h ago apt-daily-upgrade.timer apt-daily-upgrade.service
Tue 2023-11-28 13:10:39 UTC 16h left Mon 2023-11-27 20:01:01 UTC 37min ago apt-daily.timer apt-daily.service
Sun 2023-12-03 03:10:27 UTC 5 days left Sun 2023-11-26 03:10:43 UTC 1 day 17h ago e2scrub_all.timer e2scrub_all.service
Mon 2023-12-04 00:00:00 UTC 6 days left Mon 2023-11-27 00:00:00 UTC 20h ago fstrim.timer fstrim.service
n/a n/a Mon 2023-11-27 20:00:48 UTC 37min ago hostcfgd.timer hostcfgd.service
n/a n/a Mon 2023-11-27 20:00:48 UTC 37min ago lldp.timer lldp.service
n/a n/a Mon 2023-11-27 20:00:48 UTC 37min ago lldp@0.timer lldp@0.service
n/a n/a Mon 2023-11-27 20:00:48 UTC 37min ago lldp@1.timer lldp@1.service
n/a n/a Mon 2023-11-27 20:00:48 UTC 37min ago mgmt-framework.timer mgmt-framework.service
n/a n/a Mon 2023-11-27 20:00:48 UTC 37min ago pmon.timer pmon.service
n/a n/a Mon 2023-11-27 20:00:48 UTC 37min ago process-reboot-cause.timer process-reboot-cause.service
n/a n/a Wed 2023-11-22 22:22:47 UTC 4 days ago rasdaemon.timer rasdaemon.service
n/a n/a Mon 2023-11-27 20:00:48 UTC 37min ago tacacs-config.timer tacacs-config.service
n/a n/a Mon 2023-11-27 20:00:48 UTC 37min ago telemetry.timer telemetry.service
19 timers listed.
Pass --all to see loaded but inactive timers, too.
root@sonic:/home/cisco# systemctl status process-reboot-cause.service
● process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; enabled; vendor preset: enabled)
Active: active (exited) since Mon 2023-11-27 20:00:50 UTC; 37min ago
TriggeredBy: ● process-reboot-cause.timer
Main PID: 3946159 (code=exited, status=0/SUCCESS)
In case the database service has failed, causing a process-reboot-cause dependency failure, systemd uses the timer to trigger every second until the service has started. I have not seen any adverse effect of that on system CPU usage, and it should not happen in general.
root@sonic:/home/cisco# systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
n/a n/a Mon 2023-11-27 20:00:08 UTC 122ms ago hostcfgd.timer hostcfgd.service
n/a n/a Mon 2023-11-27 20:00:09 UTC 78ms ago lldp.timer lldp.service
n/a n/a Mon 2023-11-27 20:00:08 UTC 274ms ago lldp@0.timer lldp@0.service
n/a n/a Mon 2023-11-27 20:00:08 UTC 159ms ago lldp@1.timer lldp@1.service
n/a n/a Mon 2023-11-27 20:00:08 UTC 337ms ago mgmt-framework.timer mgmt-framework.service
n/a n/a Mon 2023-11-27 20:00:08 UTC 216ms ago pmon.timer pmon.service
n/a n/a Mon 2023-11-27 20:00:09 UTC 2ms ago process-reboot-cause.timer process-reboot-cause.service
n/a n/a Mon 2023-11-27 20:00:09 UTC 60ms ago snmp.timer snmp.service
n/a n/a Mon 2023-11-27 20:00:08 UTC 356ms ago tacacs-config.timer tacacs-config.service
n/a n/a Mon 2023-11-27 20:00:08 UTC 237ms ago telemetry.timer telemetry.service
Mon 2023-11-27 20:10:00 UTC 9min left Mon 2023-11-27 20:00:00 UTC 9s ago logrotate.timer logrotate.service
Mon 2023-11-27 20:10:00 UTC 9min left Mon 2023-11-27 20:00:00 UTC 9s ago sysstat-collect.timer sysstat-collect.service
Mon 2023-11-27 22:36:29 UTC 2h 36min left Sun 2023-11-26 22:36:29 UTC 21h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Tue 2023-11-28 00:07:00 UTC 4h 6min left Mon 2023-11-27 00:07:00 UTC 19h ago sysstat-summary.timer sysstat-summary.service
Tue 2023-11-28 04:38:47 UTC 8h left Mon 2023-11-27 11:08:02 UTC 8h ago apt-daily.timer apt-daily.service
Tue 2023-11-28 06:19:57 UTC 10h left Mon 2023-11-27 06:19:57 UTC 13h ago apt-daily-upgrade.timer apt-daily-upgrade.service
Sun 2023-12-03 03:10:54 UTC 5 days left Sun 2023-11-26 03:10:43 UTC 1 day 16h ago e2scrub_all.timer e2scrub_all.service
Mon 2023-12-04 00:00:00 UTC 6 days left Mon 2023-11-27 00:00:00 UTC 20h ago fstrim.timer fstrim.service
n/a n/a Wed 2023-11-22 22:22:47 UTC 4 days ago rasdaemon.timer rasdaemon.service
19 timers listed.
Pass --all to see loaded but inactive timers, too.
root@sonic:/home/cisco# systemctl ^C
root@sonic:/home/cisco# top
top - 20:00:39 up 4 days, 21:39, 1 user, load average: 0.15, 0.41, 0.74
Tasks: 198 total, 1 running, 193 sleeping, 0 stopped, 4 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31322.7 total, 27975.5 free, 808.9 used, 2538.3 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30010.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
488 root 20 0 1862304 61600 30644 S 0.3 0.2 58:24.11 containerd
1527 root 20 0 320792 98444 15416 S 0.3 0.3 224:45.22 appDemo
3945764 root 20 0 10912 3852 3288 R 0.3 0.0 0:00.02 top
1 root 20 0 168152 13888 9036 S 0.0 0.0 10:29.59 systemd
@anamehra what is your understanding of "OnUnitActiveSec"? As I understand it, it will keep firing the timer every OnUnitActiveSec seconds after the unit has become active. Agree?
It triggers relative to when the unit was last activated, but only if the unit goes down. If the unit is still active, it does not retrigger. Generally, this is used to run a service periodically where the service goes down after doing its job. In our case that is not the situation: the service remains active as required, so there is no new trigger unless it goes down.
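A minimal timer sketch matching the behaviour described in this thread (the OnBootSec value reflects the 1m30s mentioned later in the conversation; treat the exact values as illustrative):

[Timer]
# First trigger 1 min 30 s after boot.
OnBootSec=1min 30 sec
# Relative to the last activation of the triggered unit; per the discussion
# above, while the service stays active (exited) no new trigger fires, and
# after a dependency failure the timer keeps firing until the service starts.
OnUnitActiveSec=0 sec

[Install]
WantedBy=timers.target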
@anamehra we don't need this timer workaround/hack on latest master running the Debian Bookworm release. Add Upholds= to database.service like below:
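The snippet itself is not reproduced in this excerpt; a sketch of what such an addition could look like (the unit names come from this PR, the exact placement is an assumption):

[Unit]
# While database.service is up, systemd keeps the listed unit started and
# starts it again if it is found stopped or failed.
Upholds=process-reboot-cause.service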
I tested:
- Normal boot, where all is good
- The failure case, where process-reboot-cause remains in dependency failure and a restart of the database service restarts this service
- During normal operation, if the database service is restarted then process-reboot-cause is also restarted
I will check this on master. Do you know if 202305 has Bookworm, or if there is a plan to get it?
We will need this fix in 202305 and 202205.
@@ -1,7 +1,9 @@
[Unit]
Description=Delays process-reboot-cause until network is stably connected
PartOf=process-reboot-cause.service
@anamehra how are you ensuring that if the process-reboot-cause service is restarted, the database service will be ready and only then will process-reboot-cause run? Please note that during boot-up this is ensured by the OnBootSec time, after which the database service will be available.
The .service unit file has a dependency on the database service. After the timer triggers at 1m30s post start, the process starts only once the database dependency has been met.
@anamehra 1m30s is only for the boot-up scenario. What about when the database service gets restarted on a stable system? Previously, it was ensured that the reboot-cause service would run only 1m30s after boot-up. Once the system is up and running, the timer gets retriggered by OnUnitActiveSec=0 sec, correct?
If the database service gets restarted after 1m30s has elapsed, the process-reboot-cause service will be restarted at the same time. The 'Requires' plus 'After' directives in the unit file take care of the service going down and restarting, while OnUnitActiveSec=0 sec will trigger the restart once the dependency is met after the service has gone into a dependency-failed state.
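For reference, the dependency stanza being described amounts to roughly the following in the .service unit (only the directives named in this thread are shown):

[Unit]
# Per this thread: stopping/restarting database.service propagates to this
# unit, and it will not start until database.service has started.
Requires=database.service
After=database.service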
Signed-off-by: anamehra <anamehra@cisco.com>
@anamehra can you check these services? I don't see a need for the timer. Remove the timer unit and use these service units: determine-reboot-cause.service, process-reboot-cause.service
@anamehra starting with systemd version >= 249, systemd has fixed the previous limitation of not restarting a service left in the dependency-failed state once the dependency is later satisfied. Please use Upholds=, which fixes this limitation (already available in the systemd version in Bookworm, currently on the master branch).
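A quick way to check whether a given image's systemd is new enough for Upholds= (plain systemctl usage, nothing specific to this PR):

# Upholds= needs systemd >= 249; Bookworm ships a newer version than that.
systemctl --version | head -n 1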
I will make changes for a different solution based on the Bookworm enhancement, as suggested by @prgeor. That feature won't be available on 202305 and 202205, so I raised the following PR for those branches.
@anamehra please raise a separate PR for the 202205 branch, as mentioned by @yxieca.
@anamehra can you share where you are with the "different solution based on bookworm enhancement"?
Hi @gechiang, I am seeing issues with the build and trying to get it fixed. Let me know if anything like this is known:
Signed-off-by: anamehra anamehra@cisco.com
Why I did it
Fixes sonic-net/sonic-buildimage#16990
MSFT ADO: 25892864
The determine-reboot-cause and process-reboot-cause services do not start if the database service fails on its first start attempt. Even if the database service succeeds on the next attempt, these reboot-cause services do not start.
The process-reboot-cause service does not restart if the docker or database service restarts, which leads to an empty reboot-cause history.
deploy-mg from sonic-mgmt also triggers a docker service restart. The restart of the docker service causes the issue stated in point 2 above. The docker restart also triggers determine-reboot-cause to restart, which creates an additional reboot-cause file in the history and modifies the last reboot cause.
How I did it
This PR fixes these issues by:
Removing the determine-reboot-cause service's dependency on the database service. The service script has no dependency on the database service; it should be independent and run once during boot.
The service expects any platform service that publishes the hardware reboot cause to finish before determine-reboot-cause starts. Such a platform service may set a 'Before=determine-reboot-cause.service' dependency in its unit file (see the sketch after this list).
Removing the process-reboot-cause service's 'Requires' dependency on determine-reboot-cause. This hard dependency is not required; moreover, it makes determine-reboot-cause restart whenever process-reboot-cause restarts, which is not wanted since determine-reboot-cause is a oneshot service meant to run once during boot.
Fixing the timer unit for process-reboot-cause so that it can be restarted once the dependency condition is met after the service has entered a dependency-failed state.
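As referenced above, a platform service that publishes the hardware reboot cause could order itself like this (the unit name and ExecStart path here are hypothetical examples):

# platform-reboot-cause.service (hypothetical example)
[Unit]
Description=Publish the hardware reboot cause for this platform
# For a oneshot unit, this makes it complete before determine-reboot-cause starts.
Before=determine-reboot-cause.service

[Service]
Type=oneshot
# Hypothetical path for illustration.
ExecStart=/usr/local/bin/platform-publish-reboot-cause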
How to verify it
On a pizza box:
On a chassis:
Restart the database service and verify that determine-reboot-cause is not restarted while the process-reboot-cause service is restarted. Verify that 'show reboot-cause' and 'show reboot-cause history' show correct data and that no new entry is created for the restart (example commands below).
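A possible sequence for the database-restart check (standard systemctl and SONiC CLI commands; the database@<N>.service form for multi-ASIC is an assumption):

sudo systemctl restart database.service        # or database@<N>.service per ASIC
systemctl status determine-reboot-cause        # should not show a fresh start
systemctl status process-reboot-cause          # should show it was re-run
show reboot-cause                              # cause should be unchanged
show reboot-cause history                      # no new entry from the restart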