
Fixed determine/process reboot-cause service dependency #88

Closed
wants to merge 2 commits

Conversation

@anamehra (Contributor):

Signed-off-by: anamehra anamehra@cisco.com

Why I did it
Fixes sonic-net/sonic-buildimage#16990

MSFT ADO: 25892864

  1. The determine-reboot-cause and process-reboot-cause services do not start if the database service fails its first start attempt. Even if the database service succeeds on a later attempt, these reboot-cause services do not start.

  2. The process-reboot-cause service does not restart if the docker or database service restarts, which leads to an empty reboot-cause history.

  3. deploy-mg from sonic-mgmt also restarts the docker service, which causes the issue described in item 2 above. The docker restart also retriggers determine-reboot-cause, which creates an additional reboot-cause file in history and modifies the last reboot cause.

How I did it
This PR fixes these issues by

  1. Removing the determine-reboot-cause service's dependency on the database service. The service script has no dependency on the database service; it should be independent and run once during boot.
    The service expects any platform service that publishes the hardware reboot cause to finish before determine-reboot-cause starts. Such a platform service may declare a "Before=determine-reboot-cause" ordering in its unit file.

  2. Removing the process-reboot-cause service's 'Requires' dependency on determine-reboot-cause. This hard dependency is not needed; moreover, it makes determine-reboot-cause restart whenever process-reboot-cause restarts, which is undesirable because determine-reboot-cause is a oneshot service meant to run only during boot.

  3. Fixing the timer unit for process-reboot-cause so the service can be restarted once its dependency condition is met after it has entered a dependency-failed state.
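A sketch of the timer-unit change in item 3, assembled from the directives discussed in this conversation; values are illustrative and the exact merged unit may differ:

```ini
# process-reboot-cause.timer -- illustrative sketch, not the exact merged unit
[Unit]
Description=Delays process-reboot-cause until network is stably connected
PartOf=process-reboot-cause.service

[Timer]
# First trigger 1min 30s after boot, giving the database service time to come up.
OnBootSec=1min 30s
# Retrigger whenever the service is not active, so a unit that entered the
# dependency-failed state is started again once its dependencies
# (e.g. database.service) are finally met.
OnUnitActiveSec=0 sec

[Install]
WantedBy=timers.target
```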

How to verify it
On a pizza box:

  1. Install the image and check the reboot-cause history.
  2. Restart the database service and verify that determine-reboot-cause is not restarted while the process-reboot-cause service is restarted. Verify that 'show reboot-cause' and 'show reboot-cause history' show correct data and that no new entry is created for the restart.

On Chassis:

  1. Install the image and check the reboot-cause history. Restart the database service and verify that determine-reboot-cause is not restarted while the process-reboot-cause service is restarted. Verify that 'show reboot-cause' and 'show reboot-cause history' show correct data and that no new entry is created for the restart.
  2. Reboot the LC. On the Supervisor, stop the database-chassis service.
  3. Let the database service on the LC fail the first time. determine-reboot-cause should succeed, and process-reboot-cause should fail to start due to a dependency failure.
  4. Start the database-chassis service on the Supervisor. The database service on the LC should now start successfully.
  5. Verify that process-reboot-cause also starts.
  6. Verify the 'show reboot-cause history' and 'show reboot-cause' output.

Signed-off-by: anamehra <anamehra@cisco.com>
@anamehra (Contributor Author):

@prgeor , @abdosi , @gechiang , please review. Thanks

@prgeor requested a review from saiarcot895 on November 23, 2023.
After=rc-local.service database.service
Requires=rc-local.service
After=rc-local.service
ConditionPathExists=!/tmp/determine-reboot-cause
Contributor:

@anamehra now that you have removed the dependency on database.service, as mentioned in point 3 of the PR description, why would systemd restart this service if it's already activated?

Contributor Author:

The process won't restart on its own, but if someone restarts it manually or by mistake, the script creates an unwanted reboot-cause entry because it does not handle the restart case. I think keeping this conditional check is safe and prevents any false restart; there is no harm in keeping it.

Contributor:

@anamehra no, that is not a valid scenario to handle. Someone running the service manually is outside the scope; the unit file is for orchestration by systemd.

Contributor Author:

The script does not handle the restart scenario. It's a gap even if the scenario is not valid, and the check here fills that gap. It is better to cover the scenario than to leave the gap open. Do you see any issue with this? If so, I can remove it. Please let me know.

Contributor:

@anamehra if someone could restart the service inadvertently (or otherwise), then someone could just as well delete the ConditionPathExists file from the /tmp/ location, or any other file for that matter.

After=rc-local.service database.service
Requires=rc-local.service
After=rc-local.service
ConditionPathExists=!/tmp/determine-reboot-cause

[Service]
Type=oneshot
RemainAfterExit=yes
Contributor:

@anamehra this is a oneshot service and should not keep running; I suggest removing RemainAfterExit.

Contributor Author:

RemainAfterExit only keeps the unit state 'active' after the script exits. I will remove it.


[Timer]
OnUnitActiveSec=0 sec
Contributor:

@anamehra does this mean the timer will keep firing every 0 seconds once this unit becomes active?

Contributor Author:

When the process-reboot-cause service runs and becomes active, systemd does not trigger the timer again. This is how it is used for other services as well, such as lldp@.timer.

Here is a snapshot when the service is active:

root@sonic:/home/cisco# systemctl list-timers 
NEXT                        LEFT          LAST                        PASSED        UNIT                         ACTIVATES
n/a                         n/a           Mon 2023-11-27 20:38:33 UTC 11ms ago      snmp.timer                   snmp.service
Mon 2023-11-27 20:40:00 UTC 1min 26s left Mon 2023-11-27 20:30:01 UTC 8min ago      logrotate.timer              logrotate.service
Mon 2023-11-27 20:40:00 UTC 1min 26s left Mon 2023-11-27 20:30:01 UTC 8min ago      sysstat-collect.timer        sysstat-collect.service
Mon 2023-11-27 22:36:29 UTC 1h 57min left Sun 2023-11-26 22:36:29 UTC 22h ago       systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Tue 2023-11-28 00:07:00 UTC 3h 28min left Mon 2023-11-27 00:07:00 UTC 20h ago       sysstat-summary.timer        sysstat-summary.service
Tue 2023-11-28 06:24:11 UTC 9h left       Mon 2023-11-27 06:19:57 UTC 14h ago       apt-daily-upgrade.timer      apt-daily-upgrade.service
Tue 2023-11-28 13:10:39 UTC 16h left      Mon 2023-11-27 20:01:01 UTC 37min ago     apt-daily.timer              apt-daily.service
Sun 2023-12-03 03:10:27 UTC 5 days left   Sun 2023-11-26 03:10:43 UTC 1 day 17h ago e2scrub_all.timer            e2scrub_all.service
Mon 2023-12-04 00:00:00 UTC 6 days left   Mon 2023-11-27 00:00:00 UTC 20h ago       fstrim.timer                 fstrim.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     hostcfgd.timer               hostcfgd.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     lldp.timer                   lldp.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     lldp@0.timer                 lldp@0.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     lldp@1.timer                 lldp@1.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     mgmt-framework.timer         mgmt-framework.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     pmon.timer                   pmon.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     process-reboot-cause.timer   process-reboot-cause.service
n/a                         n/a           Wed 2023-11-22 22:22:47 UTC 4 days ago    rasdaemon.timer              rasdaemon.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     tacacs-config.timer          tacacs-config.service
n/a                         n/a           Mon 2023-11-27 20:00:48 UTC 37min ago     telemetry.timer              telemetry.service

19 timers listed.
Pass --all to see loaded but inactive timers, too.
root@sonic:/home/cisco# systemctl status process-reboot-cause.service 
● process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2023-11-27 20:00:50 UTC; 37min ago
TriggeredBy: ● process-reboot-cause.timer
   Main PID: 3946159 (code=exited, status=0/SUCCESS)

In case the database service has failed, causing a process-reboot-cause dependency failure, systemd uses the timer to trigger every second until the service has started. I have not seen any adverse effect of that on system CPU usage, and it should not happen in general.

root@sonic:/home/cisco# systemctl list-timers                                                                                                  
NEXT                        LEFT          LAST                        PASSED        UNIT                         ACTIVATES              
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 122ms ago     hostcfgd.timer               hostcfgd.service      
n/a                         n/a           Mon 2023-11-27 20:00:09 UTC 78ms ago      lldp.timer                   lldp.service             
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 274ms ago     lldp@0.timer                 lldp@0.service                
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 159ms ago     lldp@1.timer                 lldp@1.service                
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 337ms ago     mgmt-framework.timer         mgmt-framework.service        
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 216ms ago     pmon.timer                   pmon.service                
n/a                         n/a           Mon 2023-11-27 20:00:09 UTC 2ms ago       process-reboot-cause.timer   process-reboot-cause.service
n/a                         n/a           Mon 2023-11-27 20:00:09 UTC 60ms ago      snmp.timer                   snmp.service             
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 356ms ago     tacacs-config.timer          tacacs-config.service         
n/a                         n/a           Mon 2023-11-27 20:00:08 UTC 237ms ago     telemetry.timer              telemetry.service      
Mon 2023-11-27 20:10:00 UTC 9min left     Mon 2023-11-27 20:00:00 UTC 9s ago        logrotate.timer              logrotate.service           
Mon 2023-11-27 20:10:00 UTC 9min left     Mon 2023-11-27 20:00:00 UTC 9s ago        sysstat-collect.timer        sysstat-collect.service  
Mon 2023-11-27 22:36:29 UTC 2h 36min left Sun 2023-11-26 22:36:29 UTC 21h ago       systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Tue 2023-11-28 00:07:00 UTC 4h 6min left  Mon 2023-11-27 00:07:00 UTC 19h ago       sysstat-summary.timer        sysstat-summary.service       
Tue 2023-11-28 04:38:47 UTC 8h left       Mon 2023-11-27 11:08:02 UTC 8h ago        apt-daily.timer              apt-daily.service      
Tue 2023-11-28 06:19:57 UTC 10h left      Mon 2023-11-27 06:19:57 UTC 13h ago       apt-daily-upgrade.timer      apt-daily-upgrade.service   
Sun 2023-12-03 03:10:54 UTC 5 days left   Sun 2023-11-26 03:10:43 UTC 1 day 16h ago e2scrub_all.timer            e2scrub_all.service           
Mon 2023-12-04 00:00:00 UTC 6 days left   Mon 2023-11-27 00:00:00 UTC 20h ago       fstrim.timer                 fstrim.service                
n/a                         n/a           Wed 2023-11-22 22:22:47 UTC 4 days ago    rasdaemon.timer              rasdaemon.service      
                                                                                                                                          
19 timers listed.                                                                                                                            
Pass --all to see loaded but inactive timers, too.                                                                                      
root@sonic:/home/cisco# systemctl ^C                                                                                                           
root@sonic:/home/cisco# top                                                                                                             
top - 20:00:39 up 4 days, 21:39,  1 user,  load average: 0.15, 0.41, 0.74
Tasks: 198 total,   1 running, 193 sleeping,   0 stopped,   4 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31322.7 total,  27975.5 free,    808.9 used,   2538.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  30010.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                  
    488 root      20   0 1862304  61600  30644 S   0.3   0.2  58:24.11 containerd
   1527 root      20   0  320792  98444  15416 S   0.3   0.3 224:45.22 appDemo
3945764 root      20   0   10912   3852   3288 R   0.3   0.0   0:00.02 top                                                                                                                                                                                      
      1 root      20   0  168152  13888   9036 S   0.0   0.0  10:29.59 systemd

Contributor:

@anamehra what is your understanding of "OnUnitActiveSec"? As I understand it, it will keep firing the timer every OnUnitActiveSec seconds after the unit has become active. Agree?

Contributor Author:

It triggers relative to when the unit was last activated, but only if the unit has gone down. If the unit is still active, it does not retrigger. Generally this is used to run a service periodically, where the service exits after doing its job. In our case that is not the situation; the service remains active as required, so there is no new trigger unless it goes down.

Contributor:

@anamehra we don't need this timer workaround/hack on the latest master running the Debian Bookworm release.

(screenshot not preserved)

Added Upholds= to database.service like below:
(screenshot not preserved)

I tested,

  1. Normal boot, where all is good.
  2. Failure case, where process-reboot-cause remains in a dependency-failed state and a restart of the database service restarts this service.
  3. During normal operation, if the database service is restarted, process-reboot-cause is also restarted.
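The screenshots above are not preserved, so as an assumed sketch (the actual tested change may differ), the Upholds= approach could be a fragment like this in database.service:

```ini
# database.service fragment -- assumed sketch; the screenshot with the
# actual change is not preserved on this page.
[Unit]
# Upholds= (systemd >= 249) keeps the listed unit started whenever this
# unit is active, retrying continuously until the start succeeds.
Upholds=process-reboot-cause.service
```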

Contributor Author:

I will check this on master. Do you know whether 202305 has Bookworm, or plans to pick it up?
We will need this fix in 202305 and 202205.

@@ -1,7 +1,9 @@
[Unit]
Description=Delays process-reboot-cause until network is stably connected
PartOf=process-reboot-cause.service
Contributor:

@anamehra how are you ensuring that, if the process-reboot-cause service is restarted, the database service will be ready and only then process-reboot-cause will run? Please note that during boot-up this is ensured by the OnBootSec time, after which the database service will be available.

Contributor Author:

The .service unit file has a dependency on the database service. After the timer triggers 1m30s after start, the process starts only once the database dependency has been met.

Contributor:

@anamehra the 1m30s applies only to the boot-up scenario. What about when the database service gets restarted on a stable system? Previously it was ensured that the reboot-cause service would run only after 1m30s of boot-up. Once the system is up and running, the timer gets retriggered by OnUnitActiveSec=0 sec, correct?

@anamehra (Contributor Author), Nov 28, 2023:

If the database service gets restarted and 1m30s has elapsed, the process-reboot-cause service is restarted at the same time. The 'Requires' plus 'After' directives in the unit file handle the service going down and restarting, while OnUnitActiveSec=0 sec triggers the restart once the dependency is met after the service has entered a dependency-failed state.
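A hedged sketch of the Requires=/After= interplay described above (fragment only; assuming the hard dependency on determine-reboot-cause has been relaxed per item 2 of the PR description):

```ini
# process-reboot-cause.service fragment -- illustrative only
[Unit]
# Requires= propagates a database.service stop/restart to this service;
# After= orders (re)starts so the database is up before this service runs.
Requires=database.service
After=database.service determine-reboot-cause.service
```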

Signed-off-by: anamehra <anamehra@cisco.com>
@prgeor (Contributor) commented Nov 29, 2023:

@anamehra can you check these service units? I don't see a need for the timer. Remove the timer unit and use these service units:

determine-reboot-cause.service

(screenshot not preserved)

process-reboot-cause.service

(screenshot not preserved)

@anamehra (Contributor Author):

> @anamehra can you check these service units? I don't see a need for the timer. Remove the timer unit and use these service units.

Hi @prgeor , I tested it, but this does not help with the dependency-failure condition. systemd does not trigger the start of a service that failed due to a dependency failure once the dependency is later met.

root@sonic:/home/cisco# cat /lib/systemd/system/process-reboot-cause.service
[Unit]
Description=Retrieve the reboot cause from the history files and save them to StateDB
PartOf=database.service
Requires=database.service determine-reboot-cause.service
After=database.service determine-reboot-cause.service

[Service]
Type=simple
ExecStartPre=/bin/sleep 90
RemainAfterExit=yes
ExecStart=/usr/local/bin/process-reboot-cause
root@sonic:/home/cisco# systemctl status process-reboot-cause.service 
● process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: inactive (dead)

Nov 29 20:46:16 sonic systemd[1]: Dependency failed for Retrieve the reboot cause from the history files and save them to StateDB.
Nov 29 20:46:16 sonic systemd[1]: process-reboot-cause.service: Job process-reboot-cause.service/start failed with result 'dependency'.
root@sonic:/home/cisco# systemctl status database
● database.service - Database container
     Loaded: loaded (/lib/systemd/system/database.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Wed 2023-11-29 20:46:44 UTC; 1min 52s ago
    Process: 5740 ExecStartPre=/usr/local/bin/database.sh start (code=exited, status=0/SUCCESS)
   Main PID: 7141 (database.sh)
      Tasks: 13 (limit: 37545)
     Memory: 22.4M
     CGroup: /system.slice/database.service
             ├─7141 /bin/bash /usr/local/bin/database.sh wait
             ├─7147 /bin/bash /usr/bin/database.sh wait
             └─7153 docker wait database

Nov 29 20:46:17 sonic sudo[5751]: pam_unix(sudo:session): session closed for user root
Nov 29 20:46:17 sonic database.sh[5745]: Starting existing database container
Nov 29 20:46:17 sonic database.sh[5824]: database
Nov 29 20:46:44 sonic database.sh[7133]: True

Contributor:

@anamehra starting with systemd version >= 249, systemd has fixed the previous limitation of not restarting a service in the dependency-failed state later, when the dependency is satisfied. Please use Upholds=, which fixes this limitation (already available in the systemd version in Bookworm, currently in the master branch).

@anamehra (Contributor Author) commented Dec 5, 2023:

I will make changes to implement a different solution based on the Bookworm enhancement, as suggested by @prgeor .

That feature won't be available on 202305 and 202205, so I raised the following PR for those branches:
@mlok-nokia , @kenneth-arista , @abdosi please review it here: sonic-net/sonic-buildimage#17406

@prgeor (Contributor) commented Dec 6, 2023:

> I will make changes to get a different solution based on bookworm enhancement as suggested by @prgeor .
>
> That feature won't be available on 202305 and 202205, so raised the following PR for those branches. @mlok-nokia , @kenneth-arista , @abdosi please review it here sonic-net/sonic-buildimage#17406

@anamehra please raise a separate PR for the 202205 branch, as mentioned by @yxieca

@gechiang:

> I will make changes to get a different solution based on bookworm enhancement as suggested by @prgeor .
> That feature won't be available on 202305 and 202205, so raised the following PR for those branches. @mlok-nokia , @kenneth-arista , @abdosi please review it here sonic-net/sonic-buildimage#17406
>
> @anamehra please raise a separate PR for 202205 branch as mentioned by @yxieca

@anamehra can you share where you are with "different solution based on bookworm enhancement"?

@anamehra (Contributor Author):

> I will make changes to get a different solution based on bookworm enhancement as suggested by @prgeor .
> That feature won't be available on 202305 and 202205, so raised the following PR for those branches. @mlok-nokia , @kenneth-arista , @abdosi please review it here sonic-net/sonic-buildimage#17406
>
> @anamehra please raise a separate PR for 202205 branch as mentioned by @yxieca
>
> @anamehra can you share where you are with "different solution based on bookworm enhancement"?

Hi @gechiang , I am seeing issues with the build and am trying to get them fixed. Let me know if anything like this is known:
Could not connect to debian-archive.trafficmanager.net:80 (40.64.88.253), connection timed out
E: Failed to fetch http://deb.debian.org/debian/pool/main/m/m2crypto/python3-m2crypto_0.38.0-4%2bb1_amd64.deb Could not connect to debian-archive.trafficmanager.net:80 (40.64.88.253), connection timed out
