
Fixed determine/process reboot-cause service dependency #88

Closed
wants to merge 2 commits

Conversation

@anamehra (Contributor):

Signed-off-by: anamehra anamehra@cisco.com

Why I did it
Fixes sonic-net/sonic-buildimage#16990

MSFT ADO: 25892864

  1. The determine-reboot-cause and process-reboot-cause services do not start if the database service fails its first start attempt. Even if the database service succeeds on a later attempt, these reboot-cause services do not start.

  2. The process-reboot-cause service does not restart if the docker or database service restarts, which leads to an empty reboot-cause history.

  3. deploy-mg from sonic-mgmt also restarts the docker service, which causes the issue described in item 2 above. The docker restart also retriggers determine-reboot-cause, which creates an additional reboot-cause file in history and modifies the last reboot cause.

How I did it
This PR fixes these issues by

  1. Removing the determine-reboot-cause service's dependency on the database service. The service script has no dependency on the database service; it should be independent and run once during boot.
    The service expects any platform service that publishes the hardware reboot cause to finish before determine-reboot-cause starts. Such a platform service may declare a "Before=determine-reboot-cause" ordering in its unit file.

  2. Removing the process-reboot-cause service's 'Requires' dependency on determine-reboot-cause. This hard dependency is not needed; moreover, it makes determine-reboot-cause restart whenever process-reboot-cause restarts, which is undesirable because determine-reboot-cause is a oneshot service meant to run only during boot.

  3. Fixing the timer unit for process-reboot-cause so the service can be restarted once its dependency condition is met after it has entered a dependency-failed state.
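A sketch of the timer-unit change in item 3, assembled from the directives discussed in this conversation; values are illustrative and the exact merged unit may differ:

```ini
# process-reboot-cause.timer -- illustrative sketch, not the exact merged unit
[Unit]
Description=Delays process-reboot-cause until network is stably connected
PartOf=process-reboot-cause.service

[Timer]
# First trigger 1min 30s after boot, giving the database service time to come up.
OnBootSec=1min 30s
# Retrigger whenever the service is not active, so a unit that entered the
# dependency-failed state is started again once its dependencies
# (e.g. database.service) are finally met.
OnUnitActiveSec=0 sec

[Install]
WantedBy=timers.target
```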

How to verify it
On a pizza box:

  1. Install the image and check the reboot-cause history.
  2. Restart the database service and verify that determine-reboot-cause is not restarted while the process-reboot-cause service is restarted. Verify that 'show reboot-cause' and 'show reboot-cause history' show correct data and that no new entry is created for the restart.

On Chassis:

  1. Install the image and check the reboot-cause history. Restart the database service and verify that determine-reboot-cause is not restarted while the process-reboot-cause service is restarted. Verify that 'show reboot-cause' and 'show reboot-cause history' show correct data and that no new entry is created for the restart.
  2. Reboot the LC. On the Supervisor, stop the database-chassis service.
  3. Let the database service on the LC fail the first time. determine-reboot-cause should succeed, and process-reboot-cause should fail to start due to a dependency failure.
  4. Start the database-chassis service on the Supervisor. The database service on the LC should now start successfully.
  5. Verify that process-reboot-cause also starts.
  6. Verify the 'show reboot-cause history' and 'show reboot-cause' output.

Signed-off-by: anamehra <anamehra@cisco.com>
@anamehra (Contributor Author):

@prgeor , @abdosi , @gechiang , please review. Thanks

@prgeor requested a review from saiarcot895 on November 23, 2023.
After=rc-local.service database.service
Requires=rc-local.service
After=rc-local.service
ConditionPathExists=!/tmp/determine-reboot-cause
Contributor:

@anamehra now that you have removed the dependency on database.service, as mentioned in point 3 of the PR description, why would systemd restart this service if it's already activated?

Contributor Author:

The process won't restart on its own, but if someone restarts it manually or by mistake, the script creates an unwanted reboot-cause entry because it does not handle the restart case. I think keeping this conditional check is safe and prevents any false restart; there is no harm in keeping it.

Contributor:

@anamehra no, that is not a valid scenario to handle. Someone running the service manually is outside the scope; the unit file is for orchestration by systemd.

Contributor Author:

The script does not handle the restart scenario. It's a gap even if the scenario is not valid, and the check here fills that gap. It is better to cover the scenario than to leave the gap open. Do you see any issue with this? If so, I can remove it. Please let me know.

Contributor:

@anamehra if someone could restart the service inadvertently (or otherwise), then someone could just as well delete the ConditionPathExists file from the /tmp/ location, or any other file for that matter.

After=rc-local.service database.service
Requires=rc-local.service
After=rc-local.service
ConditionPathExists=!/tmp/determine-reboot-cause

[Service]
Type=oneshot
RemainAfterExit=yes
Contributor:

@anamehra this is a oneshot service and should not keep running; I suggest removing RemainAfterExit.

Contributor Author:

RemainAfterExit only keeps the unit state 'active' after the script exits. I will remove it.


[Timer]
OnUnitActiveSec=0 sec
Contributor:

@anamehra does this mean the timer will keep firing every 0 seconds once this unit becomes active?

Contributor Author:

When the process-reboot-cause service runs and becomes active, systemd does not trigger the timer again. This is how it is used for other services as well, such as lldp@.timer.

Here is a snapshot when the service is active:

root@sonic:/home/cisco# systemctl list-timers 
NEXT                        LEFT          LAST                        PASSED        UNIT                         ACTIVATES
n/a                         n/a           Mon 2023-11-27 20:38:33 UTC 11ms ago      snmp.timer                   snmp.service
Mon 2023-11-27 20:40:00 UTC 1min 26s left Mon 2023-11-27 20:30:01 UTC 8min ago      logrotate.timer              logrotate.service
Mon 2023-11-27 20:40:00 UTC 1min 26s left Mon 2023-11-27 20:30:01 UTC 8min ago      sysstat-collect.timer        sysstat-collect.service
Mon 2023-11-27 22:36:29 UTC 1h 57min left Sun 2023-11-26 22:36:29 UTC 22h ago       systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Tue 2023-11-28 00:07:00 UTC 3h 28min left Mon 2023-11-27 00:07:00 UTC 20h ago       sysstat-summary.timer        sysstat-summary.service
Tue 2023-11-28 06:24:11 UTC 9h left       Mon 2023-11-27 06:19:57 UTC 14h ago       apt-daily-upgrade.timer      apt-daily-upgrade.service
Tue 2023-11-28 13:10:39 UTC 16h left      Mon 2023-11-27 20:01:01 UTC 37min ago     apt-daily.timer              apt-daily.service
Sun 2023-12-03 03:10:27 UTC 5 days left   Sun 2023-11-26 03:10:43 UTC 1 day 17h ago e2scrub_all.timer            e2scrub_all.service
Mon 2023-12-04 00:00:00 UTC 6 days left   Mon 2023-11-27 00:00:00 UTC 20h ago       fstrim.timer                 fstrim.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     hostcfgd.timer               hostcfgd.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     lldp.timer                   lldp.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     lldp@0.timer                 lldp@0.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     lldp@1.timer                 lldp@1.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     mgmt-framework.timer         mgmt-framework.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     pmon.timer                   pmon.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     process-reboot-cause.timer   process-reboot-cause.service
n/a                         n/a           Wed 2023-11-22 22:22:47 UTC 4 days ago    rasdaemon.timer              rasdaemon.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     tacacs-config.timer          tacacs-config.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     telemetry.timer              telemetry.service

19 timers listed.
Pass --all to see loaded but inactive timers, too.
root@sonic:/home/cisco# systemctl status process-reboot-cause.service 
● process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2023-11-27 20:00:50 UTC; 37min ago
TriggeredBy: ● process-reboot-cause.timer
   Main PID: 3946159 (code=exited, status=0/SUCCESS)

In case the database service has failed, causing a process-reboot-cause dependency failure, systemd uses the timer to trigger every second until the service has started. I have not seen any adverse effect of that on system CPU usage, and it should not happen in general.

root@sonic:/home/cisco# systemctl list-timers                                                                                                  
NEXT                        LEFT          LAST                        PASSED        UNIT                         ACTIVATES              
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 122ms ago     hostcfgd.timer               hostcfgd.service      
n/a                         n/a           Mon 2023-11-27 20:00:09 UTC 78ms ago      lldp.timer                   lldp.service             
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 274ms ago     lldp@0.timer                 lldp@0.service                
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 159ms ago     lldp@1.timer                 lldp@1.service                
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 337ms ago     mgmt-framework.timer         mgmt-framework.service        
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 216ms ago     pmon.timer                   pmon.service                
n/a                         n/a           Mon 2023-11-27 20:00:09 UTC 2ms ago       process-reboot-cause.timer   process-reboot-cause.service
n/a                         n/a           Mon 2023-11-27 20:00:09 UTC 60ms ago      snmp.timer                   snmp.service             
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 356ms ago     tacacs-config.timer          tacacs-config.service         
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 237ms ago     telemetry.timer              telemetry.service      
Mon 2023-11-27 20:10:00 UTC 9min left     Mon 2023-11-27 20:00:00 UTC 9s ago        logrotate.timer              logrotate.service           
Mon 2023-11-27 20:10:00 UTC 9min left     Mon 2023-11-27 20:00:00 UTC 9s ago        sysstat-collect.timer        sysstat-collect.service  
Mon 2023-11-27 22:36:29 UTC 2h 36min left Sun 2023-11-26 22:36:29 UTC 21h ago       systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Tue 2023-11-28 00:07:00 UTC 4h 6min left  Mon 2023-11-27 00:07:00 UTC 19h ago       sysstat-summary.timer        sysstat-summary.service       
Tue 2023-11-28 04:38:47 UTC 8h left       Mon 2023-11-27 11:08:02 UTC 8h ago        apt-daily.timer              apt-daily.service      
Tue 2023-11-28 06:19:57 UTC 10h left      Mon 2023-11-27 06:19:57 UTC 13h ago       apt-daily-upgrade.timer      apt-daily-upgrade.service   
Sun 2023-12-03 03:10:54 UTC 5 days left   Sun 2023-11-26 03:10:43 UTC 1 day 16h ago e2scrub_all.timer            e2scrub_all.service           
Mon 2023-12-04 00:00:00 UTC 6 days left   Mon 2023-11-27 00:00:00 UTC 20h ago       fstrim.timer                 fstrim.service                
n/a                         n/a           Wed 2023-11-22 22:22:47 UTC 4 days ago    rasdaemon.timer              rasdaemon.service      
                                                                                                                                          
19 timers listed.                                                                                                                            
Pass --all to see loaded but inactive timers, too.                                                                                      
root@sonic:/home/cisco# systemctl ^C                                                                                                           
root@sonic:/home/cisco# top                                                                                                             
top - 20:00:39 up 4 days, 21:39,  1 user,  load average: 0.15, 0.41, 0.74
Tasks: 198 total,   1 running, 193 sleeping,   0 stopped,   4 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31322.7 total,  27975.5 free,    808.9 used,   2538.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  30010.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                  
    488 root      20   0 1862304  61600  30644 S   0.3   0.2  58:24.11 containerd
   1527 root      20   0  320792  98444  15416 S   0.3   0.3 224:45.22 appDemo
3945764 root      20   0   10912   3852   3288 R   0.3   0.0   0:00.02 top                                                                                                                                                                                      
      1 root      20   0  168152  13888   9036 S   0.0   0.0  10:29.59 systemd

Contributor:

@anamehra what is your understanding of "OnUnitActiveSec"? As I understand it, it will keep firing the timer every OnUnitActiveSec seconds after the unit has become active. Agree?

Contributor Author:

It triggers relative to when the unit was last activated, but only if the unit has gone down. If the unit is still active, it does not retrigger. Generally this is used to run a service periodically, where the service exits after doing its job. In our case that is not the situation; the service remains active as required, so there is no new trigger unless it goes down.

Contributor:

@anamehra we don't need this timer workaround/hack on the latest master running the Debian Bookworm release.

(screenshot not preserved)

Added Upholds= to database.service like below:
(screenshot not preserved)

I tested,

  1. Normal boot, where all is good.
  2. Failure case, where process-reboot-cause remains in a dependency-failed state and a restart of the database service restarts this service.
  3. During normal operation, if the database service is restarted, process-reboot-cause is also restarted.
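The screenshots above are not preserved, so as an assumed sketch (the actual tested change may differ), the Upholds= approach could be a fragment like this in database.service:

```ini
# database.service fragment -- assumed sketch; the screenshot with the
# actual change is not preserved on this page.
[Unit]
# Upholds= (systemd >= 249) keeps the listed unit started whenever this
# unit is active, retrying continuously until the start succeeds.
Upholds=process-reboot-cause.service
```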

Contributor Author:

I will check this on master. Do you know whether 202305 has Bookworm, or plans to pick it up?
We will need this fix in 202305 and 202205.

@@ -1,7 +1,9 @@
[Unit]
Description=Delays process-reboot-cause until network is stably connected
PartOf=process-reboot-cause.service
Contributor:

@anamehra how are you ensuring that, if the process-reboot-cause service is restarted, the database service will be ready and only then process-reboot-cause will run? Please note that during boot-up this is ensured by the OnBootSec time, after which the database service will be available.

Contributor Author:

The .service unit file has a dependency on the database service. After the timer triggers 1m30s after start, the process starts only once the database dependency has been met.

Contributor:

@anamehra the 1m30s applies only to the boot-up scenario. What about when the database service gets restarted on a stable system? Previously it was ensured that the reboot-cause service would run only after 1m30s of boot-up. Once the system is up and running, the timer gets retriggered by OnUnitActiveSec=0 sec, correct?

@anamehra (Contributor Author), Nov 28, 2023:

If the database service gets restarted and 1m30s has elapsed, the process-reboot-cause service is restarted at the same time. The 'Requires' plus 'After' directives in the unit file handle the service going down and restarting, while OnUnitActiveSec=0 sec triggers the restart once the dependency is met after the service has entered a dependency-failed state.
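A hedged sketch of the Requires=/After= interplay described above (fragment only; assuming the hard dependency on determine-reboot-cause has been relaxed per item 2 of the PR description):

```ini
# process-reboot-cause.service fragment -- illustrative only
[Unit]
# Requires= propagates a database.service stop/restart to this service;
# After= orders (re)starts so the database is up before this service runs.
Requires=database.service
After=database.service determine-reboot-cause.service
```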

Signed-off-by: anamehra <anamehra@cisco.com>
@prgeor (Contributor) commented Nov 29, 2023:

@anamehra can you check these service units? I don't see a need for the timer. Remove the timer unit and use these service units:

determine-reboot-cause.service

(screenshot not preserved)

process-reboot-cause.service

(screenshot not preserved)

@anamehra (Contributor Author):

> @anamehra can you check these service units? I don't see a need for the timer. Remove the timer unit and use these service units.

Hi @prgeor , I tested it, but this does not help with the dependency-failure condition. systemd does not trigger the start of a service that failed due to a dependency failure once the dependency is later met.

root@sonic:/home/cisco# cat /lib/systemd/system/process-reboot-cause.service
[Unit]
Description=Retrieve the reboot cause from the history files and save them to StateDB
PartOf=database.service
Requires=database.service determine-reboot-cause.service
After=database.service determine-reboot-cause.service

[Service]
Type=simple
ExecStartPre=/bin/sleep 90
RemainAfterExit=yes
ExecStart=/usr/local/bin/process-reboot-cause
root@sonic:/home/cisco# systemctl status process-reboot-cause.service 
● process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: inactive (dead)

Nov 29 20:46:16 sonic systemd[1]: Dependency failed for Retrieve the reboot cause from the history files and save them to StateDB.
Nov 29 20:46:16 sonic systemd[1]: process-reboot-cause.service: Job process-reboot-cause.service/start failed with result 'dependency'.
root@sonic:/home/cisco# systemctl status database
● database.service - Database container
     Loaded: loaded (/lib/systemd/system/database.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Wed 2023-11-29 20:46:44 UTC; 1min 52s ago
    Process: 5740 ExecStartPre=/usr/local/bin/database.sh start (code=exited, status=0/SUCCESS)
   Main PID: 7141 (database.sh)
      Tasks: 13 (limit: 37545)
     Memory: 22.4M
     CGroup: /system.slice/database.service
             ├─7141 /bin/bash /usr/local/bin/database.sh wait
             ├─7147 /bin/bash /usr/bin/database.sh wait
             └─7153 docker wait database

Nov 29 20:46:17 sonic sudo[5751]: pam_unix(sudo:session): session closed for user root
Nov 29 20:46:17 sonic database.sh[5745]: Starting existing database container
Nov 29 20:46:17 sonic database.sh[5824]: database
Nov 29 20:46:44 sonic database.sh[7133]: True

Contributor:

@anamehra starting with systemd version >= 249, systemd has fixed the previous limitation of not restarting a service in the dependency-failed state later, when the dependency is satisfied. Please use Upholds=, which fixes this limitation (already available in the systemd version in Bookworm, currently in the master branch).

@anamehra (Contributor Author) commented Dec 5, 2023:

I will make changes to implement a different solution based on the Bookworm enhancement, as suggested by @prgeor .

That feature won't be available on 202305 and 202205, so I raised the following PR for those branches:
@mlok-nokia , @kenneth-arista , @abdosi please review it here: sonic-net/sonic-buildimage#17406

@prgeor (Contributor) commented Dec 6, 2023:

> I will make changes to get a different solution based on bookworm enhancement as suggested by @prgeor .
>
> That feature won't be available on 202305 and 202205, so raised the following PR for those branches. @mlok-nokia , @kenneth-arista , @abdosi please review it here sonic-net/sonic-buildimage#17406

@anamehra please raise a separate PR for the 202205 branch, as mentioned by @yxieca

@gechiang:

> I will make changes to get a different solution based on bookworm enhancement as suggested by @prgeor .
> That feature won't be available on 202305 and 202205, so raised the following PR for those branches. @mlok-nokia , @kenneth-arista , @abdosi please review it here sonic-net/sonic-buildimage#17406
>
> @anamehra please raise a separate PR for 202205 branch as mentioned by @yxieca

@anamehra can you share where you are with "different solution based on bookworm enhancement"?

@anamehra (Contributor Author):

> I will make changes to get a different solution based on bookworm enhancement as suggested by @prgeor .
> That feature won't be available on 202305 and 202205, so raised the following PR for those branches. @mlok-nokia , @kenneth-arista , @abdosi please review it here sonic-net/sonic-buildimage#17406
>
> @anamehra please raise a separate PR for 202205 branch as mentioned by @yxieca
>
> @anamehra can you share where you are with "different solution based on bookworm enhancement"?

Hi @gechiang , I am seeing issues with the build and am trying to get them fixed. Let me know if anything like this is known:
Could not connect to debian-archive.trafficmanager.net:80 (40.64.88.253), connection timed out
E: Failed to fetch http://deb.debian.org/debian/pool/main/m/m2crypto/python3-m2crypto_0.38.0-4%2bb1_amd64.deb Could not connect to debian-archive.trafficmanager.net:80 (40.64.88.253), connection timed out
