Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan #1709

Merged
merged 30 commits into from
Dec 5, 2024

Conversation

rameshraghupathy
Copy link
Contributor

Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan

Copy link
Contributor

@prgeor prgeor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy Can you add a section for DPU dark mode support. In this case,
NPU's PMON should honor the user configuration to power OFF the DPU via platform API.

Copy link
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor query, otherwise LGTM.

doc/smart-switch/pmon/smartswitch-pmon.md Outdated Show resolved Hide resolved
vvolam
vvolam previously approved these changes Aug 21, 2024
Copy link
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@rameshraghupathy
Copy link
Contributor Author

@rameshraghupathy Please update section 3.5 how the console utility be implemented.

Done
@prgeor

Copy link
Contributor

@prgeor prgeor Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy In section 3.2 can you specify if the thermal management is in NPU or DPU?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Updated. It runs on the NPU.

#### REBOOT_CAUSE DB schema
```
Key: "REBOOT_CAUSE|2023_06_18_14_56_12"
* Each DPU will update its reboot cause history in the Switch ChassisStateDB upon boot up.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy How? Which daemon/service?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dpu_db_util/system_health service will update the ChassisStateDB table.

Comment on lines 688 to 689
* Though how DPU pmon updates this is vendor dependent, it is recommended to use the sonic telemetry agent to align with the existing SONiC implementation.
* The DPUs will limit the number of history entries to a maximum of ten.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy Why DPU pmon updates needs to be vendor dependent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee that the SONiC running on the DPUs will necessarily be running Telemetry.

vvolam
vvolam previously approved these changes Oct 22, 2024
Copy link
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small spelling mistake. Otherwise looks good to me.

@@ -60,7 +61,7 @@ The picture below highlights the PMON vertical and its association with other lo
* The SmartSwitch host PMON should be able to Startup, Shutdown, Restart, and Soft Reboot the entire system or the individual DPUs. The DPU_MODULE will behave like the LINE_CARD_MODULE of a modular chassis with respect to these functions.

### SmartSwitch Power up/down sequence:
* When the smartswitch device is booted, the host will boot first and leave the DPUs either up or down depending on the configuration. The DPUs will be up by default.
* When the smartswitch device is booted, the host will boot first and leave the DPUs down by defualt.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: default spelling mistake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it.

removed the ID cloum in "show system-health dpu DPU0", changed the DPU
admin_state default behavior, added the dpu_state transition update.
#### Schema for REBOOT_CAUSE - switch stateDB
#### Reboot Cause
1. The smartswitch needs to know the reboot cause for DPUs. Please refer to the CLI section for the various "show reboot-cause" options and their effects.
* Each DPU will update its reboot cause history in the Switch ChassisStateDB upon boot up and also persist this on the host. The recent reboot-cause is derived from that list of reboot-causes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy it seems some agent in the DPU is responsible for pushing the reboot cause...so this will not work if the midplane is down for whatever reason.

Copy link
Contributor Author

@rameshraghupathy rameshraghupathy Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy it seems some agent in the DPU is responsible for pushing the reboot cause...so this will not work if the midplane is down for whatever reason.

@prgeor 1. The switch hardware is capable of getting the DPU reboot-cause on the NPU side and the required software work is in progress to provide it when requested with "get_reboot_cause()" even if the DPU is not reachable though the midplane. 2. We will have support to indicate a DPU-reboot every time a DPU reboots. The existing workflow will remain the same but how the platform code fetches the reboot-cause will be as explained above when the implementation is ready. Updated this section.

* For persistent storage of the DPU reboot-cause and reboot-caue-history files use the existing host storage path and mechanism.

#### Schema for REBOOT_CAUSE - switch stateDB
#### Reboot Cause
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy need separate section

  1. DPU reboot cause
  2. NPU reboot cause

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Added separate sections for DPU reboot cause and NPU reboot cause

* The switch boots up. Determines the NPU reboot cause.
* Processes the previously stored NPU and DPU reboot-cause files and history files and updates the NPU reboot-cause into the StateDB and the DPU reboot-cause into the ChassisStateDB.
* The above process is a one-shot event on boot up.
* The module_db_update function in the NPU-PMON chassisd is an existing function constantly updating the operational status of the DPUs. This function looks for DPU operational status change events and when the DPUs come out of "offline" state, issues "get_reboot_cause" API to the platform.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy can you state explicityl that out of offline state must mean DPU rebooted just to make sure DPU offline -> online does not happen without reboot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor Done

vvolam
vvolam previously approved these changes Nov 11, 2024
Copy link
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

multiple places to be clear. Called out the output of the new CLI
extensions on the DPUs.
#### 2.1.1 DPUs in dark mode
* A smartswitch when configured to boot up with all the DPUs in it are powered down upon boot up is referred as DPUs in dark mode.
* In the dark mode the platform.json file shown in section "3.1.3" will not have the dictionary for the DPUS.
* The term dark mode is overloaded in some cases where the platform.json may have the dictionary but the config_db.json will have the admin_state of all DPU modules as "down".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the case when the platform.json has DPU information, but config DB doesn't have the DPU admin state configuration. The DPUs should be in downstate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleksandrivantsiv Done. This case is already shown in the example as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#closed

@@ -219,7 +250,7 @@ SmartSwitch PMON block diagram

### 3.1. Platform monitoring and management
* SmartSwitch design Extends the existing chassis_base class and module_base class as described below.
* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU and MODULE_TYPE_SWITCH to support SmartSwitch
* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU to support SmartSwitch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is captured in Smart Switch reboot HLD. Why do we need to duplicate the information here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleksandrivantsiv Removed it. This was captured here way before that doc was generated. Now that the "Smart Switch reboot HLD" captures this, cleaned it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#closed

* The switch boots up. Determines the NPU reboot cause.
* Processes the previously stored NPU and DPU reboot-cause files and history files.
* A maximum of ten reboot-cause history entries per dpu will be persisted just like the npu.
* Updates the NPU reboot-cause into the StateDB and the DPU reboot-cause into the ChassisStateDB.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to differentiate between NPU and DPU configuration and store one in StateDB and anther in the ChassisStateDB? Won't it be easier to store everything in StateBD?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleksandrivantsiv I raised the same question in one of the previous meetings and back then the consensus was to keep them in their respective DB. I'm ok either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#closed


```
2. Though the get_oper_status(self) can get the operational status of the DPU Modules, the current implementation only has limited capabilities.
#### 2. DPU State
Though the get_oper_status(self) can get the operational status of the DPU modules, the current implementation only has limited capabilities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which implementation has limitations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleksandrivantsiv The existing PMON can only indicate if a module is offline/online/faulty. It does not have the granularity to narrow it down to control-plane or data-plane issue.

Copy link
Contributor

@oleksandrivantsiv oleksandrivantsiv Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation is not limited. It does exactly what is expected—provides a DPU operation state. We have a different API to query the software state. I suggest to remove the description of the limitation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#closed

* Can only state MODULE_STATUS_FAULT and can't show exactly where in the state progression the DPU failed. This is critical in fault isolation, DPU switchover decision, resiliency and recovery
* Though this is platform implementation specific, in a multi vendor use case, there has to be a consistent way of storing and accessing the information.
* Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB.
* Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB using the push model specified in [section: 3.2.4 of SONiC Chassis Platform Management & Monitoring HLD](https://github.com/sonic-net/SONiC/blob/master/doc/pmon/pmon-chassis-design.md)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed that the dpu_control_plane_state and dpu_data_plane_state will be pushed by the DPU. But the dpu_midplane_link_state should be monitored and set by the NPU. Please update this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#closed

* Thermal manager reads all thermal sensor data, run thermal policy and take policy action Ex. Set fan speed, set alarm, set syslog, set LEDs
* Platform collects fan related data such as presence, failure and then applies fan algorithm to set the new fan speed
* The north bound CLI/Utils/App use DB data to ”show environment”, ”show platform temp” show platform fan”
* The DPUs will update the ChassisStateDB "TEMPERATURE_INFO" tables through redis client call which in turn will be pushed into the switch StateDB.
* The existing "TEMPERATURE_INFO" schema will be used to store the values and is shown below for convenience.
* For phase:1 implementation the sensor values collected by DPU will not be pushed to the chassisStateDB.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Won't be pushed by whom and on which platform?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleksandrivantsiv cleaned it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#closed

#### 3.4.1 Reboot Cause CLIs
* There are two existing CLIs "show reboot-cause" and "show reboot-cause history"
* These two CLIs are extended to "show reboot-cause all" and "show reboot-cause history \<option\>", where the "option" could be DPUx, all or SWITCH
* When each DPU turns online the NPU chassisd will fetch the reboot-cause using the "get_reboot_cause()" API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to duplicate here information from #### Reboot workflow section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleksandrivantsiv There is some overlap mainly to give a quick summary for the convenience of people just reading only CLI section. Cleaned it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#closed

Copy link
Contributor

@oleksandrivantsiv oleksandrivantsiv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* NPU-DPU (GNOI) soft reboot workflow will be captured in another document.
* The GNOI server runs on the DPU even after the DPU is pre-shutdown and listens until the graceful shutdown finishes.
* The host sends a GNOI signal to shutdown the DPU. The DPU does a graceful-shutdown if not already done and sends an ack back to the host.
* Upon receiving the ack or on a timeout the host may trigger the switch PMON vendor API to shutdown the DPU.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Here vendor API is to send PCI detachment but not shutdown

Copy link
Contributor

@vvolam vvolam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small nit, looks good to me otherwise.

@prgeor prgeor merged commit 3b098b7 into sonic-net:master Dec 5, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants