[swss] Chassis db clean up optimization and bug fixes #16454
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -124,38 +124,47 @@ function clean_up_tables() | |
# SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET are adjusted appropriately | ||
function clean_up_chassis_db_tables() | ||
{ | ||
if [[ !($($SONIC_DB_CLI CHASSIS_APP_DB PING | grep -c True) -gt 0) ]]; then | ||
return | ||
fi | ||
|
||
lc=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'hostname'` | ||
asic=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'asic_name'` | ||
switch_type=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'switch_type'` | ||
|
||
# Run clean up only in swss running for voq switches | ||
if is_chassis_supervisor || [[ $switch_type != 'voq' ]]; then | ||
return | ||
fi | ||
|
||
if [[ !($($SONIC_DB_CLI CHASSIS_APP_DB PING | grep -c True) -gt 0) ]]; then | ||
return | ||
fi | ||
|
||
lc=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'hostname'` | ||
asic=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'asic_name'` | ||
|
||
# First, delete SYSTEM_NEIGH entries | ||
$SONIC_DB_CLI CHASSIS_APP_DB EVAL " | ||
num_neigh=`$SONIC_DB_CLI CHASSIS_APP_DB EVAL " | ||
local nn = 0 | ||
local host = string.gsub(ARGV[1], '%-', '%%-') | ||
local dev = ARGV[2] | ||
local ps = 'SYSTEM_NEIGH*|' .. host .. '|' .. dev | ||
local keylist = redis.call('KEYS', 'SYSTEM_NEIGH*') | ||
for j,key in ipairs(keylist) do | ||
if string.match(key, ps) ~= nil then | ||
redis.call('DEL', key) | ||
nn = nn + 1 | ||
end | ||
end | ||
return " 0 $lc $asic | ||
return nn" 0 $lc $asic` | ||
|
||
debug "Chassis db clean up for ${SERVICE}$DEV. Number of SYSTEM_NEIGH entries deleted: $num_neigh" | ||
|
||
# Wait for some time before deleting system interface so that the system interface's "object in use" | ||
# is cleared in both orchagent and in syncd. Without this delay, the orchagent clears the refcount | ||
# but the syncd (meta) still has non-zero refcount. Because of this, orchagent gets "object still in use" | ||
# error and aborts. | ||
# This delay is needed only if some system neighbors were deleted. | ||
|
||
sleep 30 | ||
if [[ $num_neigh > 0 ]]; then | ||
sleep 30 | ||
fi | ||
|
||
# Next, delete SYSTEM_INTERFACE entries | ||
$SONIC_DB_CLI CHASSIS_APP_DB EVAL " | ||
|
@@ -171,22 +180,29 @@ function clean_up_chassis_db_tables() | |
return " 0 $lc $asic | ||
|
||
# Next, delete SYSTEM_LAG_MEMBER_TABLE entries | ||
$SONIC_DB_CLI CHASSIS_APP_DB EVAL " | ||
num_lag_mem=`$SONIC_DB_CLI CHASSIS_APP_DB EVAL " | ||
local nlm = 0 | ||
local host = string.gsub(ARGV[1], '%-', '%%-') | ||
local dev = ARGV[2] | ||
local ps = 'SYSTEM_LAG_MEMBER_TABLE*|' .. host .. '|' .. dev | ||
local keylist = redis.call('KEYS', 'SYSTEM_LAG_MEMBER_TABLE*') | ||
for j,key in ipairs(keylist) do | ||
if string.match(key, ps) ~= nil then | ||
redis.call('DEL', key) | ||
nlm = nlm + 1 | ||
end | ||
end | ||
return " 0 $lc $asic | ||
return nlm" 0 $lc $asic` | ||
|
||
debug "Chassis db clean up for ${SERVICE}$DEV. Number of SYSTEM_LAG_MEMBER_TABLE entries deleted: $num_lag_mem" | ||
Add debug statements for the deletion of other entries?

Yes. I'll add debug for other table entries.

Fixed. Debug logs added for deletion of other tables also.
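The count-and-log pattern discussed in this thread could be factored into one helper so that every table deletion reports its count in the same way. The following is only a sketch under stated assumptions: `delete_chassis_keys` and its third Lua argument are hypothetical and not part of this PR, and the Lua body is the script from the diff with the table prefix passed in as `ARGV[3]`.

```shell
#!/bin/bash
# Assumption: SONIC_DB_CLI is set by the calling script, as in the diff.
SONIC_DB_CLI=${SONIC_DB_CLI:-sonic-db-cli}

# Hypothetical helper (not part of the PR): delete the CHASSIS_APP_DB keys of
# one table that belong to this linecard and asic, and print how many keys
# were deleted.
#   $1 = table prefix, $2 = linecard hostname, $3 = asic name
delete_chassis_keys() {
    local prefix=$1 host=$2 dev=$3
    $SONIC_DB_CLI CHASSIS_APP_DB EVAL "
    local n = 0
    -- Escape '-' in the hostname because it is a magic character in Lua patterns
    local host = string.gsub(ARGV[1], '%-', '%%-')
    local dev = ARGV[2]
    local prefix = ARGV[3]
    local ps = prefix .. '*|' .. host .. '|' .. dev
    local keylist = redis.call('KEYS', prefix .. '*')
    for _, key in ipairs(keylist) do
        if string.match(key, ps) ~= nil then
            redis.call('DEL', key)
            n = n + 1
        end
    end
    return n" 0 "$host" "$dev" "$prefix"
}

# Possible usage inside clean_up_chassis_db_tables:
#   num_lag_mem=$(delete_chassis_keys SYSTEM_LAG_MEMBER_TABLE "$lc" "$asic")
#   debug "Number of SYSTEM_LAG_MEMBER_TABLE entries deleted: $num_lag_mem"
```

Returning the count from the script keeps the deletion and the count in a single Redis call, which is what lets the later delays be made conditional.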
||
|
||
# Wait for some time before deleting system lag so that all the members of the | ||
# system lag will be cleared. | ||
# This delay is needed only if some system lag members were deleted | ||
|
||
sleep 15 | ||
if [[ $num_lag_mem > 0 ]]; then | ||
sleep 15 | ||
fi | ||
|
||
# Finally, delete SYSTEM_LAG_TABLE entries and deallot LAG IDs | ||
$SONIC_DB_CLI CHASSIS_APP_DB EVAL " | ||
|
@@ -275,7 +291,7 @@ start() { | |
$SONIC_DB_CLI GB_ASIC_DB FLUSHDB | ||
$SONIC_DB_CLI GB_COUNTERS_DB FLUSHDB | ||
$SONIC_DB_CLI RESTAPI_DB FLUSHDB | ||
clean_up_tables STATE_DB "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'LAG_TABLE*', 'LAG_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*', 'FDB_TABLE*', 'FG_ROUTE_TABLE*', 'BUFFER_POOL*', 'BUFFER_PROFILE*', 'MUX_CABLE_TABLE*', 'ADVERTISE_NETWORK_TABLE*', 'VXLAN_TUNNEL_TABLE*', 'VNET_ROUTE*', 'MACSEC_PORT_TABLE*', 'MACSEC_INGRESS_SA_TABLE*', 'MACSEC_EGRESS_SA_TABLE*', 'MACSEC_INGRESS_SC_TABLE*', 'MACSEC_EGRESS_SC_TABLE*', 'VRF_OBJECT_TABLE*', 'VNET_MONITOR_TABLE*', 'BFD_SESSION_TABLE*'" | ||
clean_up_tables STATE_DB "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'LAG_TABLE*', 'LAG_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*', 'FDB_TABLE*', 'FG_ROUTE_TABLE*', 'BUFFER_POOL*', 'BUFFER_PROFILE*', 'MUX_CABLE_TABLE*', 'ADVERTISE_NETWORK_TABLE*', 'VXLAN_TUNNEL_TABLE*', 'VNET_ROUTE*', 'MACSEC_PORT_TABLE*', 'MACSEC_INGRESS_SA_TABLE*', 'MACSEC_EGRESS_SA_TABLE*', 'MACSEC_INGRESS_SC_TABLE*', 'MACSEC_EGRESS_SC_TABLE*', 'VRF_OBJECT_TABLE*', 'VNET_MONITOR_TABLE*', 'BFD_SESSION_TABLE*','SYSTEM_NEIGH_TABLE*'" | ||
How will removing the entries from STATE_DB trigger removal of the entries from the kernel?

We are not removing entries from the kernel; we avoid creating them. When nbrmgr comes up, it subscribes to the SYSTEM_NEIGH_TABLE in STATE_DB, and the entries already in the table are delivered to it as SET commands. As part of SET command processing for entries from this table, kernel neighbors are programmed. By removing all the stale entries from this table, we prevent nbrmgr from receiving those SET commands and hence from programming the kernel entries.

Thanks @vganesan-nokia for the explanation. I get this part, and this change helps. I was thinking of a scenario where, after an swss restart, a neighbor is no longer learnt on the local linecard. In that case the kernel entry will not be removed, is that correct? This change may not help in that case?

The STATE_DB SYSTEM_NEIGH_TABLE is only for remote neighbor entries.

Thanks
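To check the effect described in this thread, one could count the SYSTEM_NEIGH_TABLE keys left in STATE_DB before nbrmgr starts. This is a minimal sketch, not part of the PR: `count_stale_system_neigh` is a hypothetical helper, and it assumes `sonic-db-cli` forwards the Redis `KEYS` command and prints one key per line.

```shell
#!/bin/bash
# Assumption: SONIC_DB_CLI is set by the calling script, as in the diff.
SONIC_DB_CLI=${SONIC_DB_CLI:-sonic-db-cli}

# Hypothetical helper (not part of the PR): print the number of STATE_DB keys
# whose name contains SYSTEM_NEIGH_TABLE.
count_stale_system_neigh() {
    # grep -c prints 0 but exits non-zero when nothing matches, so its exit
    # status is ignored here.
    $SONIC_DB_CLI STATE_DB KEYS 'SYSTEM_NEIGH_TABLE*' | grep -c 'SYSTEM_NEIGH_TABLE' || true
}

# Possible usage after clean_up_tables has run:
#   debug "Remaining SYSTEM_NEIGH_TABLE entries in STATE_DB: $(count_stale_system_neigh)"
```

A count of 0 after the cleanup would mean nbrmgr has no stale entries to replay as SET commands when it subscribes.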
||
$SONIC_DB_CLI APPL_STATE_DB FLUSHDB | ||
clean_up_chassis_db_tables | ||
rm -rf /tmp/cache | ||
|
Why do we need this change? Is there any issue seen without this change?

No issues seen. This is an optimization change. In the scenarios explained by @judyjoseph and @gechiang (for PR #16213), there may be situations where there are no entries to clean up (for example, when the asic is restarted for the second or a later time). If no entries are cleaned up, the delay is unnecessary. So we introduce the delay conditionally, only if some entries were deleted.
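One detail of the conditional delay worth noting: inside `[[ ]]`, the `>` operator compares strings lexicographically, not numerically. Compared against a literal `0`, that still gives the intended result for the non-negative counts the script returns, but an arithmetic test states the intent more directly and copes with an empty value. A minimal sketch follows, where `should_delay` is a hypothetical helper and not part of the PR.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): succeed only when the given count
# is a positive integer.
should_delay() {
    local count=${1:-0}
    # Treat anything that is not a plain non-negative integer as zero,
    # e.g. an empty string when the database call produced no output.
    [[ $count =~ ^[0-9]+$ ]] || count=0
    # Arithmetic comparison; 10# forces base 10 so leading zeros are not
    # read as octal.
    (( 10#$count > 0 ))
}

# Possible usage, mirroring the conditional delay in the diff:
#   if should_delay "$num_neigh"; then
#       sleep 30
#   fi
```

The difference from string comparison shows up with multi-digit values: `[[ 10 > 9 ]]` is false because "1" sorts before "9", whereas `(( 10 > 9 ))` is true.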