Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[System ready] Extend sysmonitor functionality to wait for host daemons #18817

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

fastiuk
Copy link
Contributor

@fastiuk fastiuk commented Apr 28, 2024

Why I did it

To extend sysmonitor to report services states according to HLD: sonic-net/SONiC#1363

How I did it

  • Add timeout monitor task to sysmonitor to declare system ready state forcibly when timout happened.
  • sysmonitor now has a state machine, so the status now doesn't get back to "Not OK" after it was declared as "OK"
  • A new "irrel_for_sysready" configuration was added to "FEATURE" table to handle services for which sysmonitor shouldn't wait for.
  • Add "wait_services" field to platform health config file to explicitly specify services sysmonitor has to wait for.

How to verify it

  • Start NOS on the box
  • Make sure it defines system ready status
  • Make sure sysreadyshow shows all services as OK

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

Device metadata config

A picture of a cute animal (it is my cat Finn)

PXL_20230413_140941610 PORTRAIT

@liushilongbuaa
Copy link
Contributor

/azpw ms_conflict

@liushilongbuaa
Copy link
Contributor

@fastiuk , PR need to rebase master branch.

files/build_templates/docker_image_ctl.j2 Outdated Show resolved Hide resolved
files/build_templates/docker_image_ctl.j2 Outdated Show resolved Hide resolved
@@ -108,6 +108,37 @@ function preStartAction()
fi
{%- elif docker_container_name == "snmp" %}
$SONIC_DB_CLI STATE_DB HSET 'DEVICE_METADATA|localhost' chassis_serial_number $(decode-syseeprom -s)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you enabling this feature by default for snmp? Shouldn't it be based on configuration?

@dgsudharsan It is disabled for snmp by default: see files/build_templates/init_cfg.json.j2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is different. In the init_cfg.j2 you are considering snmp as irrelevent for system ready but using system ready as a criteria to spawn snmp. Is this correct?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you consider handling this logic through featured? Today we have a similar flow in featured where certain daemons are spawned after portinitdone. In this case the critera can be systemd. It can be much easy to extend this to multiple daemons

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is different. In the init_cfg.j2 you are considering snmp as irrelevent for system ready but using system ready as a criteria to spawn snmp. Is this correct?

Yes, it is indeed correct. Irrelevant for system ready means that system ready feature won't just collect readiness status of snmp systemd service. But SNMP must be started after system is ready. The logic actually does't break each other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you consider handling this logic through featured? Today we have a similar flow in featured where certain daemons are spawned after portinitdone. In this case the critera can be systemd. It can be much easy to extend this to multiple daemons

I looked at the code in featured and I don't want to break anything there by adding wait for system ready. Most of the delayed services there depend on portinit done, and I don't want to add another wait point there as it will just make the logic there more complex and less clear.

files/build_templates/init_cfg.json.j2 Show resolved Hide resolved
@@ -44,6 +44,8 @@ function startplatform() {
/usr/bin/mlnx-fw-upgrade.sh -v
if [[ "$?" -ne "${EXIT_SUCCESS}" ]]; then
debug "Failed to upgrade fw. " "$?" "Restart syncd"
sonic-db-cli STATE_DB HSET "FEATURE|$DEV_SRV" fail_reason \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we considering just the asic firmware update as sysready indication for syncd. Shouldn't it be the create switch success?

@dgsudharsan not only, we have PortInitDone event in swss and SFP ready in pmon.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is syncd up_status is just set base on startplatform completion. Is it possible that in the actual syncd flow, switch_create might fail and we will still see the up_status as true. Can we include the switch create in the definition of syncd up_status?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can happen I think. Set system ready condition you see here is just an internal NV requirement, but at the same time it is very generic, so won't conflict with any other platform. I prefer to leave it as it is.

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
@fastiuk fastiuk requested a review from prgeor May 20, 2024 20:38
@fastiuk fastiuk closed this Aug 15, 2024
@fastiuk fastiuk reopened this Aug 15, 2024
@fastiuk fastiuk closed this Nov 11, 2024
@fastiuk fastiuk reopened this Nov 11, 2024
@@ -82,11 +82,17 @@
"auto_restart": "{{autorestart}}",
"support_syslog_rate_limit" : "true",
{# Set check_up_status to true here when app readiness will be marked in state db #}
{%- if feature in ["swss", "syncd", "pmon"] %}
"check_up_status" : "true",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not change image config generation behavior. We will rely on external service to generate golden config, and ask sonic to config reload -f golden_config.json to consume them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wen587 to review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants