Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[swss] Chassis db clean up optimization and bug fixes #16454

Merged
merged 2 commits into from
Sep 11, 2023

Conversation

vganesan-nokia
Copy link
Contributor

@vganesan-nokia vganesan-nokia commented Sep 6, 2023

Why I did it

The PR is for the following:

Work item tracking

N/A

How I did it

  • Fix for regression failure due to error in finding CHASSIS_APP_DB in pizzabox (#PR 16451)
    • Chassis clean up code is re-arranged to avoid accessing CHASSIS_APP_DB for non voq switches.
  • After attempting to delete the system neighbor entries from chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted. If there are no system neighbors entries deleted for the asic coming up, no need to wait.
  • Similar changes for system lag delete. Before deleting the system lag, wait for some time only if some system lag memebers were deleted. If there are no system lag members deleted no need to wait.
  • For the above two, the redis eval script to delete entries from chassis app db is made to return the number of entries deleted. Delay is introduced only if the number of entries deleted is greater than 0.
  • Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic is coming up, when system neigh entries are deleted from chassis ap db (as part of chassis db clean up), there is no orchs/process running to process the delete messages from chassis redis. Because of this, stale system neigh entries are present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

How to verify it

  • Tested during config reload and chassis reboot.
  • When asic without system neigbor entries and system lag member entries are restarted it was observed that delay was not introduced. This was indirectly verified by checking the log for the number of entries deleted.
  • No stale kernel neighbors were found after swss came up.
  • Tested in pizzabox for config reload. The error "Invalid database name input : 'CHASSIS_APP_DB'" was not seen.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

master

Description for the changelog

Changes done in swss.sh service start script in function clean_up_chassis_db_tables().

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
@mlok-nokia
Copy link
Contributor

mlok-nokia commented Sep 6, 2023

Change is good.

@mlok-nokia
Copy link
Contributor

mlok-nokia commented Sep 6, 2023

@arlakshm @judyjoseph @rlhui @gechiang This PR address the Pizzabox try to PING CHASSIS_APP_DB issue. It also contains the optimization (NO delay if there is Neighbor entry or no LAG_MEMBER_GROUP).

Copy link
Contributor

@arlakshm arlakshm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as comments


sleep 30
if [[ $num_neigh > 0 ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this change? is there any issue seen without this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues seen. This is an optimization change. For scenarios explained by @judyjoseph and @gechiang (for PR #16213) there may be situations when there will not be any entries to be cleaned up (for example when the asic is restarted second time or after). If there are no entries cleaned up the the delay is unnecessary. So we introduce delay conditionally only if there were some entries deleted.

@@ -275,7 +291,7 @@ start() {
$SONIC_DB_CLI GB_ASIC_DB FLUSHDB
$SONIC_DB_CLI GB_COUNTERS_DB FLUSHDB
$SONIC_DB_CLI RESTAPI_DB FLUSHDB
clean_up_tables STATE_DB "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'LAG_TABLE*', 'LAG_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*', 'FDB_TABLE*', 'FG_ROUTE_TABLE*', 'BUFFER_POOL*', 'BUFFER_PROFILE*', 'MUX_CABLE_TABLE*', 'ADVERTISE_NETWORK_TABLE*', 'VXLAN_TUNNEL_TABLE*', 'VNET_ROUTE*', 'MACSEC_PORT_TABLE*', 'MACSEC_INGRESS_SA_TABLE*', 'MACSEC_EGRESS_SA_TABLE*', 'MACSEC_INGRESS_SC_TABLE*', 'MACSEC_EGRESS_SC_TABLE*', 'VRF_OBJECT_TABLE*', 'VNET_MONITOR_TABLE*', 'BFD_SESSION_TABLE*'"
clean_up_tables STATE_DB "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'LAG_TABLE*', 'LAG_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*', 'FDB_TABLE*', 'FG_ROUTE_TABLE*', 'BUFFER_POOL*', 'BUFFER_PROFILE*', 'MUX_CABLE_TABLE*', 'ADVERTISE_NETWORK_TABLE*', 'VXLAN_TUNNEL_TABLE*', 'VNET_ROUTE*', 'MACSEC_PORT_TABLE*', 'MACSEC_INGRESS_SA_TABLE*', 'MACSEC_EGRESS_SA_TABLE*', 'MACSEC_INGRESS_SC_TABLE*', 'MACSEC_EGRESS_SC_TABLE*', 'VRF_OBJECT_TABLE*', 'VNET_MONITOR_TABLE*', 'BFD_SESSION_TABLE*','SYSTEM_NEIGH_TABLE*'"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will removing from state_db trigger removing of the entry from the kernel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not removing the entries from kernel but avoid creating entries. When there are entries in the SYSTEM_NEIGH_TABLE in the STATE_DB, when nbrmgr comes up, it subscribes to this table. The existing entries in the table are subscribed as SET commands. As part of SET command processing for entries from this table in STATE_DB, we program kernel neighbors. By removing all the stale entries from this table we avoid nbrmgr getting the SET commands and hence the programming of the kernel entries.

Copy link
Contributor

@arlakshm arlakshm Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vganesan-nokia for the explanation. I get this part. This change helps. I was thinking of a scenario where, after swss restart if a neighbor is not learnt anymore on the local linecard, then the kernel entry will not be removed, is that correct? this change may not help in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The STATE_DB SYSTEM_NEIGH_TABLE is only for remote neighbor entries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

return " 0 $lc $asic
return nlm" 0 $lc $asic`

debug "Chassis db clean up for ${SERVICE}$DEV. Number of SYSTEM_LAG_MEMBER_TABLE entries deleted: $num_lag_mem"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add debug statements for the deletion of other entries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I'll add debug for other table entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Debug logs added for deletion of other tables also.

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>
@yxieca yxieca merged commit b13b41f into sonic-net:master Sep 11, 2023
@deepak-singhal0408
Copy link
Contributor

Microsoft ADO: 25128972
@yxieca @StormLiangMS
Appreciate approval for the requested branches...

Thanks!

@mssonicbld
Copy link
Collaborator

@vganesan-nokia PR conflicts with 202211 branch

@mssonicbld
Copy link
Collaborator

@vganesan-nokia PR conflicts with 202205 branch

@deepak-singhal0408
Copy link
Contributor

Hi @vganesan-nokia , could you please create a manual PR for above conflict branches?

@vganesan-nokia
Copy link
Contributor Author

Hi @vganesan-nokia , could you please create a manual PR for above conflict branches?

Yes. I'll create PRs for both branches with conflicts fixed.

vganesan-nokia added a commit to vganesan-nokia/sonic-buildimage that referenced this pull request Sep 13, 2023
* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41f)
vganesan-nokia added a commit to vganesan-nokia/sonic-buildimage that referenced this pull request Sep 13, 2023
* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41f)
@vganesan-nokia
Copy link
Contributor Author

@deepak-singhal0408: PRs have been raised with conflicts fixed for cherry-picking this PR to 202205 and 202211 branches.
The PRs are: For 202211: #16540 and for 202205: #16541

yxieca pushed a commit that referenced this pull request Sep 14, 2023
* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41f)
yxieca pushed a commit that referenced this pull request Sep 14, 2023
* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41f)
@deepak-singhal0408
Copy link
Contributor

@yxieca @lguohan could we please cherry pick above PR for 202305 branch as well? Thanks

@dgsudharsan
Copy link
Collaborator

@StormLiangMS Can you please help with cherry-pick request for 202305?

@dgsudharsan
Copy link
Collaborator

@vganesan-nokia I believe this may have conflicts for 202305. Can you please create a separate PR?

sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
@vganesan-nokia
Copy link
Contributor Author

vganesan-nokia commented Sep 20, 2023

@vganesan-nokia I believe this may have conflicts for 202305. Can you please create a separate PR?

We need to add PR #16213 in 202305 before we add this PR

@mssonicbld
Copy link
Collaborator

@vganesan-nokia PR conflicts with 202305 branch

@vganesan-nokia
Copy link
Contributor Author

@vganesan-nokia PR conflicts with 202305 branch

o.k. I'll create a separte PR with conflicts resolved.

vganesan-nokia added a commit to vganesan-nokia/sonic-buildimage that referenced this pull request Sep 21, 2023
* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41f)
@vganesan-nokia
Copy link
Contributor Author

vganesan-nokia commented Sep 21, 2023

Raised PR #16644 to cherry-pick PR #16454 with conflicts resolved.

StormLiangMS pushed a commit that referenced this pull request Sep 22, 2023
* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants