[swss] Chassis db clean up optimization and bug fixes #16454
This commit includes the following changes:
- Fix for the regression failure caused by an error in finding CHASSIS_APP_DB on pizzabox (#PR 16451).
- After attempting to delete the system neighbor entries from the chassis db, and before starting to clear the system interface entries, wait for some time only if some system neighbors were deleted. If no system neighbor entries were deleted for the asic coming up, there is no need to wait.
- Similar changes for system lag delete. Before deleting the system lag, wait for some time only if some system lag members were deleted. If no system lag members were deleted, there is no need to wait.
- Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While the asic is coming up, when system neigh entries are deleted from the chassis app db (as part of chassis db clean up), there are no orchs/processes running to process the delete messages from chassis redis. Because of this, stale system neigh entries are present in the local STATE_DB. The stale entries result in the creation of orphan (no corresponding data path/asic db entry) kernel neigh entries when nbrmgr processes the STATE_DB:SYSTEM_NEIGH_TABLE entries (after the swss service has come up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from the local STATE_DB when the service comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
Change is good.
@arlakshm @judyjoseph @rlhui @gechiang This PR addresses the issue of a pizzabox trying to PING CHASSIS_APP_DB. It also contains the optimization (no delay if there is no neighbor entry or no LAG_MEMBER_GROUP).
as comments
sleep 30
if [[ $num_neigh > 0 ]]; then
why do we need this change? is there any issue seen without this change?
No issues seen. This is an optimization change. In the scenarios explained by @judyjoseph and @gechiang (for PR #16213), there may be situations where there are no entries to clean up (for example, when the asic is restarted for the second time or later). If no entries are cleaned up, the delay is unnecessary. So we introduce the delay conditionally, only if some entries were deleted.
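The conditional delay described above could be sketched roughly as follows. This is only an illustration, not the code from swss.sh: the helper name `wait_if_deleted`, its arguments, and the printed strings are assumptions made for the example. It uses the numeric test `-gt`, since `>` inside `[[ ]]` compares strings rather than integers.

```shell
# Hypothetical helper (not part of swss.sh): wait only when the preceding
# clean-up step actually deleted something.
#   $1 = number of entries deleted (defaults to 0)
#   $2 = delay in seconds (defaults to 30, matching the unconditional sleep)
wait_if_deleted() {
    wid_deleted="${1:-0}"
    wid_delay="${2:-30}"
    if [ "$wid_deleted" -gt 0 ]; then
        # Something was deleted, so give the delete messages time to be processed.
        sleep "$wid_delay"
        echo "waited ${wid_delay}s"
    else
        # Nothing was deleted, so the delay would only slow down service start.
        echo "skipped"
    fi
}

# Usage: a short 1-second delay is used here so the sketch finishes quickly.
wait_if_deleted 0 1
wait_if_deleted 5 1
```

On a second or later asic restart, where nothing is left to clean up, a helper like this returns immediately instead of sleeping.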
@@ -275,7 +291,7 @@ start() {
 $SONIC_DB_CLI GB_ASIC_DB FLUSHDB
 $SONIC_DB_CLI GB_COUNTERS_DB FLUSHDB
 $SONIC_DB_CLI RESTAPI_DB FLUSHDB
-clean_up_tables STATE_DB "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'LAG_TABLE*', 'LAG_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*', 'FDB_TABLE*', 'FG_ROUTE_TABLE*', 'BUFFER_POOL*', 'BUFFER_PROFILE*', 'MUX_CABLE_TABLE*', 'ADVERTISE_NETWORK_TABLE*', 'VXLAN_TUNNEL_TABLE*', 'VNET_ROUTE*', 'MACSEC_PORT_TABLE*', 'MACSEC_INGRESS_SA_TABLE*', 'MACSEC_EGRESS_SA_TABLE*', 'MACSEC_INGRESS_SC_TABLE*', 'MACSEC_EGRESS_SC_TABLE*', 'VRF_OBJECT_TABLE*', 'VNET_MONITOR_TABLE*', 'BFD_SESSION_TABLE*'"
+clean_up_tables STATE_DB "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'LAG_TABLE*', 'LAG_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*', 'FDB_TABLE*', 'FG_ROUTE_TABLE*', 'BUFFER_POOL*', 'BUFFER_PROFILE*', 'MUX_CABLE_TABLE*', 'ADVERTISE_NETWORK_TABLE*', 'VXLAN_TUNNEL_TABLE*', 'VNET_ROUTE*', 'MACSEC_PORT_TABLE*', 'MACSEC_INGRESS_SA_TABLE*', 'MACSEC_EGRESS_SA_TABLE*', 'MACSEC_INGRESS_SC_TABLE*', 'MACSEC_EGRESS_SC_TABLE*', 'VRF_OBJECT_TABLE*', 'VNET_MONITOR_TABLE*', 'BFD_SESSION_TABLE*','SYSTEM_NEIGH_TABLE*'"
how will removing from state_db trigger removing of the entry from the kernel?
We are not removing the entries from the kernel; we avoid creating them. When nbrmgr comes up, it subscribes to the SYSTEM_NEIGH_TABLE in the STATE_DB. Entries already present in the table are delivered to it as SET commands. As part of SET command processing for entries from this table, we program kernel neighbors. By removing all the stale entries from this table, we keep nbrmgr from receiving those SET commands and hence from programming the kernel entries.
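A toy model of this behaviour, under loud assumptions: the key names, the `state_db_keys` variable, and both helper functions below are made up for illustration. Nothing here talks to redis or nbrmgr. It only shows that entries still present in a table at subscribe time each turn into one SET, and that flushing the table first leaves nothing to replay.

```shell
# Toy stand-in for the local STATE_DB: one key per line (key names are invented).
state_db_keys="SYSTEM_NEIGH_TABLE|Inband0|10.0.0.1
SYSTEM_NEIGH_TABLE|Inband0|10.0.0.2
PORT_TABLE|Ethernet0"

# Count the SET notifications a newly started subscriber would see for the
# table whose name is given in $1: one per key already present in that table.
count_initial_sets() {
    cis_count=0
    while IFS= read -r cis_key; do
        case "$cis_key" in
            "$1"*) cis_count=$((cis_count + 1)) ;;
        esac
    done <<EOF
$state_db_keys
EOF
    echo "$cis_count"
}

# Drop every key belonging to the table named in $1, keeping all other keys.
flush_table() {
    ft_kept=""
    while IFS= read -r ft_key; do
        case "$ft_key" in
            "$1"*) ;;   # key belongs to the flushed table: drop it
            "") ;;      # ignore empty lines
            *) ft_kept="${ft_kept}${ft_key}
" ;;
        esac
    done <<EOF
$state_db_keys
EOF
    state_db_keys="$ft_kept"
}

echo "SETs replayed before flush: $(count_initial_sets SYSTEM_NEIGH_TABLE)"
flush_table SYSTEM_NEIGH_TABLE
echo "SETs replayed after flush: $(count_initial_sets SYSTEM_NEIGH_TABLE)"
```

In this model the unrelated PORT_TABLE key survives the flush, so only the stale neighbor entries stop being replayed.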
Thanks @vganesan-nokia for the explanation. I get this part. This change helps. I was thinking of a scenario where, after an swss restart, a neighbor is no longer learnt on the local linecard. In that case the kernel entry will not be removed, is that correct? This change may not help in that case?
The STATE_DB SYSTEM_NEIGH_TABLE is only for remote neighbor entries.
Thanks
return " 0 $lc $asic
return nlm" 0 $lc $asic`

debug "Chassis db clean up for ${SERVICE}$DEV. Number of SYSTEM_LAG_MEMBER_TABLE entries deleted: $num_lag_mem"
add debug statements for the deletion of other entries?
Yes. I'll add debug for other table entries.
Fixed. Debug logs added for deletion of other tables also.
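For readers following the thread, per-table debug lines of this kind might look like the sketch below. The message format is taken from the debug statement quoted in the review. The `log_deleted` wrapper, the stand-in `debug` function (which only prints here), and the example counts are assumptions, not the code that was merged.

```shell
# Stand-in for the script's debug helper; in this sketch it only prints.
debug() { echo "$1"; }

# Illustrative values for the variables used in the quoted debug message.
SERVICE="swss"
DEV="0"

# Hypothetical wrapper: log how many entries of one table were deleted.
#   $1 = table name, $2 = number of entries deleted (defaults to 0)
log_deleted() {
    debug "Chassis db clean up for ${SERVICE}${DEV}. Number of $1 entries deleted: ${2:-0}"
}

# Example counts are invented for the demonstration.
log_deleted SYSTEM_LAG_MEMBER_TABLE 4
log_deleted SYSTEM_LAG_TABLE 2
log_deleted SYSTEM_INTERFACE 0
```

Logging a count of zero as well makes it visible in the logs when a clean-up step found nothing to delete and the wait was skipped.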
Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE) Signed-off-by: vedganes <veda.ganesan@nokia.com>
Microsoft ADO: 25128972 Thanks!
@vganesan-nokia PR conflicts with 202211 branch
@vganesan-nokia PR conflicts with 202205 branch
Hi @vganesan-nokia, could you please create a manual PR for above conflict branches?
Yes. I'll create PRs for both branches with conflicts fixed.
* [swss] Chassis db clean up optimization and bug fixes

  This commit includes the following changes:
  - Fix for the regression failure caused by an error in finding CHASSIS_APP_DB on pizzabox (#PR 16451).
  - After attempting to delete the system neighbor entries from the chassis db, and before starting to clear the system interface entries, wait for some time only if some system neighbors were deleted. If no system neighbor entries were deleted for the asic coming up, there is no need to wait.
  - Similar changes for system lag delete. Before deleting the system lag, wait for some time only if some system lag members were deleted. If no system lag members were deleted, there is no need to wait.
  - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While the asic is coming up, when system neigh entries are deleted from the chassis app db (as part of chassis db clean up), there are no orchs/processes running to process the delete messages from chassis redis. Because of this, stale system neigh entries are present in the local STATE_DB. The stale entries result in the creation of orphan (no corresponding data path/asic db entry) kernel neigh entries when nbrmgr processes the STATE_DB:SYSTEM_NEIGH_TABLE entries (after the swss service has come up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from the local STATE_DB when the service comes up.

  Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

  Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

  Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------
Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41f)
@deepak-singhal0408: PRs have been raised with conflicts fixed for cherry-picking this PR to 202205 and 202211 branches.
@StormLiangMS Can you please help with cherry-pick request for 202305?
@vganesan-nokia I believe this may have conflicts for 202305. Can you please create a separate PR?
We need to add PR #16213 in 202305 before we add this PR
@vganesan-nokia PR conflicts with 202305 branch
OK. I'll create a separate PR with conflicts resolved.
Why I did it
The PR is for the following:
Work item tracking
N/A
How I did it
After attempting to delete the system neighbor entries from the chassis db, and before starting to clear the system interface entries, wait for some time only if some system neighbors were deleted. If no system neighbor entries were deleted for the asic coming up, there is no need to wait.
The stale entries result in the creation of orphan (no corresponding data path/asic db entry) kernel neigh entries when nbrmgr processes the STATE_DB:SYSTEM_NEIGH_TABLE entries (after the swss service has come up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from the local STATE_DB when the service comes up.
How to verify it
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
master
Description for the changelog
Changes done in swss.sh service start script in function clean_up_chassis_db_tables().
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)