Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan #1709

rameshraghupathy · 2024-06-06T23:40:47Z

Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan


doc/smart-switch/pmon/smartswitch-pmon.md

prgeor

@rameshraghupathy Can you add a section for DPU dark mode support? In this case,
the NPU's PMON should honor the user configuration to power off the DPU via the platform API.

vvolam

Minor query, otherwise LGTM.

doc/smart-switch/pmon/smartswitch-pmon.md


vvolam

LGTM.

doc/smart-switch/pmon/smartswitch-pmon.md

rameshraghupathy · 2024-10-03T00:48:23Z

@rameshraghupathy Please update section 3.5 to describe how the console utility will be implemented.

@prgeor Done

prgeor · 2024-10-03T03:03:41Z

doc/smart-switch/pmon/smartswitch-pmon.md

@rameshraghupathy In section 3.2 can you specify if the thermal management is in NPU or DPU?

@prgeor Updated. It runs on the NPU.

prgeor · 2024-10-07T05:04:05Z

doc/smart-switch/pmon/smartswitch-pmon.md

-#### REBOOT_CAUSE DB schema
-```
-Key: "REBOOT_CAUSE|2023_06_18_14_56_12"
+* Each DPU will update its reboot cause history in the Switch ChassisStateDB upon boot up.


@rameshraghupathy How? Which daemon/service?

The dpu_db_util/system_health service will update the ChassisStateDB table.

prgeor · 2024-10-07T05:04:42Z

doc/smart-switch/pmon/smartswitch-pmon.md

+* Though how DPU pmon updates this is vendor dependent, it is recommended to use the sonic telemetry agent to align with the existing SONiC implementation.
+* The DPUs will limit the number of history entries to a maximum of ten.


@rameshraghupathy Why do the DPU pmon updates need to be vendor dependent?

There is no guarantee that the SONiC running on the DPUs will necessarily be running Telemetry.
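The ten-entry cap quoted in the diff above can be sketched as a bounded buffer. This is a minimal illustration only: `RebootCauseHistory` and its methods are hypothetical names, and the in-memory deque stands in for whatever persistence and transport the vendor actually uses.

```python
from collections import deque

MAX_HISTORY = 10  # cap taken from the diff text; everything else here is assumed


class RebootCauseHistory:
    """Hypothetical per-DPU reboot-cause history that keeps only the newest entries."""

    def __init__(self, max_entries=MAX_HISTORY):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._entries = deque(maxlen=max_entries)

    def add(self, timestamp, cause):
        """Record one reboot cause; timestamp format is left to the caller."""
        self._entries.append((timestamp, cause))

    def entries(self):
        """Return the retained entries, newest first."""
        return list(reversed(self._entries))

    def latest(self):
        """Return the most recent entry, or None if nothing was recorded."""
        return self._entries[-1] if self._entries else None
```

The most recent reboot cause is then simply the newest retained entry, which matches the idea of deriving it from the history list.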

vvolam

A small spelling mistake. Otherwise looks good to me.

vvolam · 2024-10-22T17:59:44Z

doc/smart-switch/pmon/smartswitch-pmon.md

@@ -60,7 +61,7 @@ The picture below highlights the PMON vertical and its association with other lo
 * The SmartSwitch host PMON should be able to Startup, Shutdown, Restart, and Soft Reboot the entire system or the individual DPUs. The DPU_MODULE will behave like the LINE_CARD_MODULE of a modular chassis with respect to these functions.

 ### SmartSwitch Power up/down sequence:
-* When the smartswitch device is booted, the host will boot first and leave the DPUs either up or down depending on the configuration. The DPUs will be up by default.
+* When the smartswitch device is booted, the host will boot first and leave the DPUs down by defualt.


nit: default spelling mistake.


rameshraghupathy · 2024-11-05T15:52:09Z

Related PRs:

prgeor · 2024-11-05T20:18:16Z

doc/smart-switch/pmon/smartswitch-pmon.md

-#### Schema for REBOOT_CAUSE - switch stateDB
+#### Reboot Cause
+1. The smartswitch needs to know the reboot cause for DPUs. Please refer to the CLI section for the various "show reboot-cause" options and their effects.
+    * Each DPU will update its reboot cause history in the Switch ChassisStateDB upon boot up and also persist this on the host. The recent reboot-cause is derived from that list of reboot-causes.


@rameshraghupathy it seems some agent in the DPU is responsible for pushing the reboot cause...so this will not work if the midplane is down for whatever reason.

@prgeor 1. The switch hardware is capable of getting the DPU reboot-cause on the NPU side, and the required software work is in progress to provide it when requested with "get_reboot_cause()" even if the DPU is not reachable through the midplane. 2. We will have support to indicate a DPU reboot every time a DPU reboots. The existing workflow will remain the same, but how the platform code fetches the reboot-cause will be as explained above once the implementation is ready. Updated this section.

prgeor · 2024-11-08T21:52:39Z

doc/smart-switch/pmon/smartswitch-pmon.md

-* For persistent storage of the DPU reboot-cause and reboot-caue-history files use the existing host storage path and mechanism.
-
-#### Schema for REBOOT_CAUSE - switch stateDB
+#### Reboot Cause


@rameshraghupathy This needs separate sections:

* DPU reboot cause
* NPU reboot cause

@prgeor Done

@prgeor Added separate sections for DPU reboot cause and NPU reboot cause

prgeor · 2024-11-08T22:00:42Z

doc/smart-switch/pmon/smartswitch-pmon.md

+* The switch boots up. Determines the NPU reboot cause. 
+* Processes the previously stored NPU and DPU reboot-cause files and history files and updates the NPU reboot-cause into the StateDB and the DPU reboot-cause into the ChassisStateDB.
+* The above process is a one-shot event on boot up.
+* The module_db_update function in the NPU-PMON chassisd is an existing function constantly updating the operational status of the DPUs.  This function looks for DPU operational status change events and when the DPUs come out of "offline" state, issues "get_reboot_cause" API to the platform.


@rameshraghupathy Can you state explicitly that coming out of the offline state must mean the DPU rebooted, just to make sure a DPU offline -> online transition does not happen without a reboot?

@prgeor Done
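The transition check discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the chassisd code: `DpuStatusWatcher`, the status strings, and the injected `get_reboot_cause` callable are all hypothetical, and a DPU that has never been seen is assumed to start offline.

```python
OFFLINE = "Offline"  # placeholder status strings, not taken from the platform API
ONLINE = "Online"


class DpuStatusWatcher:
    """Hypothetical tracker that fetches a reboot cause when a DPU leaves the offline state."""

    def __init__(self, get_reboot_cause):
        # get_reboot_cause is an injected callable: dpu_name -> reboot cause.
        self._get_reboot_cause = get_reboot_cause
        self._last_status = {}
        self.reboot_causes = {}

    def update(self, dpu_name, oper_status):
        """Record the new status; return True if this update was treated as a reboot."""
        # Assumption: an unseen DPU is considered offline until reported otherwise.
        previous = self._last_status.get(dpu_name, OFFLINE)
        self._last_status[dpu_name] = oper_status
        if previous == OFFLINE and oper_status != OFFLINE:
            # Leaving the offline state is interpreted as "the DPU rebooted".
            self.reboot_causes[dpu_name] = self._get_reboot_cause(dpu_name)
            return True
        return False
```

Repeated reports of the same online status do not trigger another fetch, so the reboot cause is requested once per offline-to-online transition.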

vvolam

Looks good to me.


oleksandrivantsiv · 2024-11-14T00:20:06Z

doc/smart-switch/pmon/smartswitch-pmon.md

+#### 2.1.1 DPUs in dark mode
+* A smartswitch when configured to boot up with all the DPUs in it are powered down upon boot up is referred as DPUs in dark mode.
+* In the dark mode the platform.json file shown in section "3.1.3" will not have the dictionary for the DPUS.
+* The term dark mode is overloaded in some cases where the platform.json may have the dictionary but the config_db.json will have the admin_state of all DPU modules as "down".


Please add the case where platform.json has the DPU information but the config DB doesn't have the DPU admin state configuration. The DPUs should be in the down state.

@oleksandrivantsiv Done. This case is already shown in the example as well.
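The cases raised in this thread can be summarized in a small sketch. The function names and the plain list/dict inputs are hypothetical stand-ins for the DPU dictionary in platform.json and the admin_state entries in the config DB; the only behavior assumed is what the thread states, namely that a DPU without an explicit "up" stays down.

```python
def resolve_dpu_admin_state(platform_dpus, config_admin_states):
    """Map each DPU listed for the platform to 'up' or 'down'.

    platform_dpus: iterable of DPU names (stand-in for the platform.json dictionary).
    config_admin_states: dict of DPU name -> admin_state (stand-in for the config DB).
    A DPU with no admin_state configured is treated as 'down'.
    """
    return {
        name: "up" if config_admin_states.get(name) == "up" else "down"
        for name in platform_dpus
    }


def is_dark_mode(platform_dpus, config_admin_states):
    """True when no DPU would be powered up, including when no DPUs are listed at all."""
    states = resolve_dpu_admin_state(platform_dpus, config_admin_states)
    return all(state == "down" for state in states.values())
```

For example, a platform that lists DPU0 and DPU1 with only DPU0 configured "up" resolves to DPU0 up and DPU1 down, and is therefore not in dark mode.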

oleksandrivantsiv · 2024-11-14T00:26:42Z

doc/smart-switch/pmon/smartswitch-pmon.md

@@ -219,7 +250,7 @@ SmartSwitch PMON block diagram

 ### 3.1. Platform monitoring and management
 * SmartSwitch design Extends the existing chassis_base class and module_base class as described below.
-* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU and MODULE_TYPE_SWITCH to support SmartSwitch
+* Extend MODULE_TYPE in ModuleBase class with MODULE_TYPE_DPU to support SmartSwitch


This is captured in Smart Switch reboot HLD. Why do we need to duplicate the information here?

@oleksandrivantsiv Removed it. This was captured here way before that doc was generated. Now that the "Smart Switch reboot HLD" captures this, cleaned it.

oleksandrivantsiv · 2024-11-14T01:20:13Z

doc/smart-switch/pmon/smartswitch-pmon.md

+* The switch boots up. Determines the NPU reboot cause. 
+* Processes the previously stored NPU and DPU reboot-cause files and history files.
+* A maximum of ten reboot-cause history entries per dpu will be persisted just like the npu.
+* Updates the NPU reboot-cause into the StateDB and the DPU reboot-cause into the ChassisStateDB.


Do we need to differentiate between NPU and DPU configuration and store one in StateDB and another in ChassisStateDB? Won't it be easier to store everything in StateDB?

@oleksandrivantsiv I raised the same question in one of the previous meetings and back then the consensus was to keep them in their respective DB. I'm ok either way.

oleksandrivantsiv · 2024-11-14T01:21:27Z

doc/smart-switch/pmon/smartswitch-pmon.md


 ```
-2. Though the get_oper_status(self) can get the operational status of the DPU Modules, the current implementation only has limited capabilities.
+#### 2. DPU State
+Though the get_oper_status(self) can get the operational status of the DPU modules, the current implementation only has limited capabilities.


Which implementation has limitations?

@oleksandrivantsiv The existing PMON can only indicate if a module is offline/online/faulty. It does not have the granularity to narrow it down to control-plane or data-plane issue.

The current implementation is not limited. It does exactly what is expected: it provides the DPU operational state. We have a different API to query the software state. I suggest removing the description of the limitation.

@oleksandrivantsiv Done

oleksandrivantsiv · 2024-11-14T01:23:18Z

doc/smart-switch/pmon/smartswitch-pmon.md

    * Can only state MODULE_STATUS_FAULT and can't show exactly where in the state progression the DPU failed. This is critical in fault isolation, DPU switchover decision, resiliency and recovery
    * Though this is platform implementation specific, in a multi vendor use case, there has to be a consistent way of storing and accessing the information.
-    * Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB.
+    * Store the state progression (dpu_midplane_link_state, dpu_control_plane_state, dpu_data_plane_state) on the host ChassisStateDB using the push model specified in [section: 3.2.4 of SONiC Chassis Platform Management & Monitoring HLD](https://github.com/sonic-net/SONiC/blob/master/doc/pmon/pmon-chassis-design.md)


We agreed that the dpu_control_plane_state and dpu_data_plane_state will be pushed by the DPU. But the dpu_midplane_link_state should be monitored and set by the NPU. Please update this.

@oleksandrivantsiv Updated
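The ownership split agreed in this thread can be expressed as a small sketch. The field names come from the diff above; `apply_state_update`, the "npu"/"dpu" writer labels, and the dict standing in for the ChassisStateDB entry are hypothetical and only illustrate who is expected to set which field.

```python
# Field names are from the diff; the ownership sets mirror the agreement in the thread.
NPU_OWNED_FIELDS = {"dpu_midplane_link_state"}
DPU_OWNED_FIELDS = {"dpu_control_plane_state", "dpu_data_plane_state"}


def apply_state_update(table, writer, field, value):
    """Write one DPU state field into `table` if `writer` owns that field.

    table: dict standing in for one DPU's state entry.
    writer: "npu" or "dpu" (hypothetical labels for the updating side).
    """
    if field in NPU_OWNED_FIELDS:
        owner = "npu"
    elif field in DPU_OWNED_FIELDS:
        owner = "dpu"
    else:
        raise KeyError("unknown state field: %s" % field)
    if writer != owner:
        raise PermissionError("%s is set by the %s, not the %s" % (field, owner, writer))
    table[field] = value
```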

oleksandrivantsiv · 2024-11-14T01:25:57Z

doc/smart-switch/pmon/smartswitch-pmon.md

 * Thermal manager reads all thermal sensor data, run thermal policy and take policy action Ex. Set fan speed, set alarm, set syslog, set LEDs 
 * Platform collects fan related data such as presence, failure and then applies fan algorithm to set the new fan speed
 * The north bound CLI/Utils/App use DB data to ”show environment”, ”show platform temp” show platform fan”
 * The DPUs will update the ChassisStateDB "TEMPERATURE_INFO" tables through redis client call which in turn will be pushed into the switch StateDB.
 * The existing "TEMPERATURE_INFO" schema will be used to store the values and is shown below for convenience.
+* For phase:1 implementation the sensor values collected by DPU will not be pushed to the chassisStateDB.


Why? Won't be pushed by whom and on which platform?

@oleksandrivantsiv cleaned it.

oleksandrivantsiv · 2024-11-14T01:28:21Z

doc/smart-switch/pmon/smartswitch-pmon.md

+#### 3.4.1 Reboot Cause CLIs
+* There are two existing CLIs "show reboot-cause" and "show reboot-cause history"
+* These two CLIs are extended to "show reboot-cause all" and "show reboot-cause history \<option\>", where the "option" could be DPUx, all or SWITCH
+* When each DPU turns online the NPU chassisd will fetch the reboot-cause using the "get_reboot_cause()" API.


Why do we need to duplicate here information from #### Reboot workflow section?

@oleksandrivantsiv There is some overlap mainly to give a quick summary for the convenience of people just reading only CLI section. Cleaned it.

oleksandrivantsiv

LGTM

vvolam · 2024-11-26T02:15:23Z

doc/smart-switch/pmon/smartswitch-pmon.md

-* NPU-DPU (GNOI) soft reboot workflow will be captured in another document.
+* The GNOI server runs on the DPU even after the DPU is pre-shutdown and listens until the graceful shutdown finishes.
+* The host sends a GNOI signal to shutdown the DPU. The DPU does a graceful-shutdown if not already done and sends an ack back to the host.
+* Upon receiving the ack or on a timeout the host may trigger the switch PMON vendor API to shutdown the DPU.


NIT: Here the vendor API is used to send the PCI detachment, not to shut down the DPU.
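The "ack or timeout" step in the diff above can be sketched as below. This is a minimal illustration, not the GNOI workflow itself: `request_dpu_shutdown` and its three arguments are hypothetical, and the final step is passed in as a callable because, per the nit above, what the vendor API does at that point may be PCI detachment rather than shutdown.

```python
import threading


def request_dpu_shutdown(send_signal, ack_event, vendor_action, timeout_s):
    """Send the shutdown request, wait for an ack or a timeout, then run the vendor step.

    send_signal: callable that sends the request to the DPU (hypothetical).
    ack_event: threading.Event set by whatever receives the DPU's ack.
    vendor_action: callable for the platform-specific follow-up step.
    Returns True if the ack arrived before the timeout, False otherwise.
    """
    send_signal()
    # Event.wait returns True if the event was set, False if the timeout expired.
    acked = ack_event.wait(timeout_s)
    # The follow-up runs in both cases: on ack, or once the timeout has passed.
    vendor_action()
    return acked
```

Returning whether the ack arrived lets the caller log or report a timed-out graceful shutdown separately from a clean one.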

vvolam

A small nit, looks good to me otherwise.

rameshraghupathy added 4 commits June 6, 2024 13:29

Added schema for reboot-cause and health-info chassisStateDB

bd36cff

Fixed formatting

98e787b

Fixed formatting

b8b02c1

Added TestPlan link

7918edf

rameshraghupathy mentioned this pull request Jun 6, 2024

Platform APIs for SmartSwitch sonic-net/sonic-platform-common#454

Merged

Did some cleanup

1ae2222

dgsudharsan requested a review from oleksandrivantsiv June 11, 2024 21:13

rameshraghupathy added 2 commits June 12, 2024 06:58

Updated reboot-cause schema, system-health-info schema and removed the module_type_switch based on the implementation PR review

a14fd88

updated the system-health info schema

f325667

oleksandrivantsiv reviewed Jun 14, 2024

oleksandrivantsiv reviewed Jun 14, 2024

oleksandrivantsiv reviewed Jun 14, 2024

oleksandrivantsiv reviewed Jun 14, 2024

rameshraghupathy added 6 commits July 8, 2024 11:02

Removed the get_health_info API and addressed a couple of minor comments

0fdd737

Fixed reboot-cause reboot-cause history and the all option outputs

aa10015

Updated rev 0.4

b7b341a

cleaned NPU to DPU data port mapping

65f9f33

dpu_id now ranges from 0

064ce29

Changed the test-plan link

7ae06d6

prgeor reviewed Aug 6, 2024

rameshraghupathy added 2 commits August 5, 2024 17:45

Added a section for DPU dark mode

82f01e1

Did a minor cleanup in 2.1.1

0a47f51

vvolam reviewed Aug 8, 2024

rameshraghupathy added 3 commits August 8, 2024 19:09

replaced platform.json with hwsku.json file to represent the NPU-DPU data port association

7d5e14e

platform.json is the right file for NPU-DPU port association

962a5e1

Removed get_module_dpu_data_port API. Showed how the role in interfaces config is used

5926a44

vvolam previously approved these changes Aug 21, 2024

prgeor reviewed Aug 23, 2024

prgeor reviewed Aug 23, 2024

prgeor reviewed Aug 23, 2024

Addressed review comments

e734c2f

prgeor reviewed Oct 3, 2024

Addressed a couple of review comments

fa6af05

prgeor reviewed Oct 7, 2024

Addressed the default dpu mode all over

c04c606

vvolam previously approved these changes Oct 22, 2024

vvolam mentioned this pull request Oct 28, 2024

Add DPU dark-mode definition for SmartSwitch PMON #1818

Closed

Included dpu-reboot-seq diagram, updated the spec for reboot-sequence, removed the ID column in "show system-health dpu DPU0", changed the DPU admin_state default behavior, added the dpu_state transition update.

4b61f2a

rameshraghupathy dismissed vvolam’s stale review via 4b61f2a November 5, 2024 15:44

prgeor reviewed Nov 5, 2024

updated reboot-cause section

2bcff7e

prgeor reviewed Nov 8, 2024

Addressed review comments

4f6e157

vvolam previously approved these changes Nov 11, 2024

Updated the reboot-cause cli section 3.4.1

8de1958

rameshraghupathy dismissed vvolam’s stale review via 8de1958 November 11, 2024 22:04

Updated the reboot-cause CLIs section and added the max history count in multiple places to be clear. Called out the output of the new CLI extensions on the DPUs.

bb53182

oleksandrivantsiv suggested changes Nov 14, 2024

rameshraghupathy added 2 commits November 14, 2024 08:56

Addressed a few review comments

b0f1ca2

Addressed a review comment

9acaa87

oleksandrivantsiv reviewed Nov 20, 2024

oleksandrivantsiv approved these changes Nov 20, 2024

vvolam reviewed Nov 26, 2024

vvolam approved these changes Nov 26, 2024

prgeor approved these changes Dec 5, 2024

prgeor merged commit 3b098b7 into sonic-net:master Dec 5, 2024
1 check passed
