Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[swss] Chassis db clean up optimization and bug fixes #16454

Merged
merged 2 commits into from
Sep 11, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 40 additions & 16 deletions files/scripts/swss.sh
Original file line number Diff line number Diff line change
Expand Up @@ -124,72 +124,93 @@ function clean_up_tables()
# SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET are adjusted appropriately
function clean_up_chassis_db_tables()
{
if [[ !($($SONIC_DB_CLI CHASSIS_APP_DB PING | grep -c True) -gt 0) ]]; then
return
fi

lc=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'hostname'`
asic=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'asic_name'`
switch_type=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'switch_type'`

# Run clean up only in swss running for voq switches
if is_chassis_supervisor || [[ $switch_type != 'voq' ]]; then
return
fi

if [[ !($($SONIC_DB_CLI CHASSIS_APP_DB PING | grep -c True) -gt 0) ]]; then
return
fi

lc=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'hostname'`
asic=`$SONIC_DB_CLI CONFIG_DB hget 'DEVICE_METADATA|localhost' 'asic_name'`

# First, delete SYSTEM_NEIGH entries
$SONIC_DB_CLI CHASSIS_APP_DB EVAL "
num_neigh=`$SONIC_DB_CLI CHASSIS_APP_DB EVAL "
local nn = 0
local host = string.gsub(ARGV[1], '%-', '%%-')
local dev = ARGV[2]
local ps = 'SYSTEM_NEIGH*|' .. host .. '|' .. dev
local keylist = redis.call('KEYS', 'SYSTEM_NEIGH*')
for j,key in ipairs(keylist) do
if string.match(key, ps) ~= nil then
redis.call('DEL', key)
nn = nn + 1
end
end
return " 0 $lc $asic
return nn" 0 $lc $asic`

debug "Chassis db clean up for ${SERVICE}$DEV. Number of SYSTEM_NEIGH entries deleted: $num_neigh"

# Wait for some time before deleting system interface so that the system interface's "object in use"
# is cleared in both orchangent and in syncd. Without this delay, the orchagent clears the refcount
# but the syncd (meta) still has no-zero refcount. Because of this, orchagent gets "object still in use"
# error and aborts.
# This delay is needed only if some system neighbors were deleted.

sleep 30
if [[ $num_neigh > 0 ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this change? is there any issue seen without this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues seen. This is an optimization change. For scenarios explained by @judyjoseph and @gechiang (for PR #16213) there may be situations when there will not be any entries to be cleaned up (for example when the asic is restarted second time or after). If there are no entries cleaned up the the delay is unnecessary. So we introduce delay conditionally only if there were some entries deleted.

sleep 30
fi

# Next, delete SYSTEM_INTERFACE entries
$SONIC_DB_CLI CHASSIS_APP_DB EVAL "
num_sys_intf=`$SONIC_DB_CLI CHASSIS_APP_DB EVAL "
local nsi = 0
local host = string.gsub(ARGV[1], '%-', '%%-')
local dev = ARGV[2]
local ps = 'SYSTEM_INTERFACE*|' .. host .. '|' .. dev
local keylist = redis.call('KEYS', 'SYSTEM_INTERFACE*')
for j,key in ipairs(keylist) do
if string.match(key, ps) ~= nil then
redis.call('DEL', key)
nsi = nsi + 1
end
end
return " 0 $lc $asic
return nsi" 0 $lc $asic`

debug "Chassis db clean up for ${SERVICE}$DEV. Number of SYSTEM_INTERFACE entries deleted: $num_sys_intf"

# Next, delete SYSTEM_LAG_MEMBER_TABLE entries
$SONIC_DB_CLI CHASSIS_APP_DB EVAL "
num_lag_mem=`$SONIC_DB_CLI CHASSIS_APP_DB EVAL "
local nlm = 0
local host = string.gsub(ARGV[1], '%-', '%%-')
local dev = ARGV[2]
local ps = 'SYSTEM_LAG_MEMBER_TABLE*|' .. host .. '|' .. dev
local keylist = redis.call('KEYS', 'SYSTEM_LAG_MEMBER_TABLE*')
for j,key in ipairs(keylist) do
if string.match(key, ps) ~= nil then
redis.call('DEL', key)
nlm = nlm + 1
end
end
return " 0 $lc $asic
return nlm" 0 $lc $asic`

debug "Chassis db clean up for ${SERVICE}$DEV. Number of SYSTEM_LAG_MEMBER_TABLE entries deleted: $num_lag_mem"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add debug statements for the deletion of other entries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I'll add debug for other table entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Debug logs added for deletion of other tables also.


# Wait for some time before deleting system lag so that the all the memebers of the
# system lag will be cleared.
# This delay is needed only if some system lag members were deleted

sleep 15
if [[ $num_lag_mem > 0 ]]; then
sleep 15
fi

# Finally, delete SYSTEM_LAG_TABLE entries and deallot LAG IDs
$SONIC_DB_CLI CHASSIS_APP_DB EVAL "
num_sys_lag=`$SONIC_DB_CLI CHASSIS_APP_DB EVAL "
local nsl = 0
local host = string.gsub(ARGV[1], '%-', '%%-')
local dev = ARGV[2]
local ps = 'SYSTEM_LAG_TABLE*|' .. '(' .. host .. '|' .. dev ..'.*' .. ')'
Expand All @@ -201,9 +222,12 @@ function clean_up_chassis_db_tables()
local lagid = redis.call('HGET', 'SYSTEM_LAG_ID_TABLE', lagname)
redis.call('SREM', 'SYSTEM_LAG_ID_SET', lagid)
redis.call('HDEL', 'SYSTEM_LAG_ID_TABLE', lagname)
nsl = nsl + 1
end
end
return " 0 $lc $asic
return nsl" 0 $lc $asic`

debug "Chassis db clean up for ${SERVICE}$DEV. Number of SYSTEM_LAG_TABLE entries deleted: $num_sys_lag"

}

Expand Down Expand Up @@ -275,7 +299,7 @@ start() {
$SONIC_DB_CLI GB_ASIC_DB FLUSHDB
$SONIC_DB_CLI GB_COUNTERS_DB FLUSHDB
$SONIC_DB_CLI RESTAPI_DB FLUSHDB
clean_up_tables STATE_DB "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'LAG_TABLE*', 'LAG_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*', 'FDB_TABLE*', 'FG_ROUTE_TABLE*', 'BUFFER_POOL*', 'BUFFER_PROFILE*', 'MUX_CABLE_TABLE*', 'ADVERTISE_NETWORK_TABLE*', 'VXLAN_TUNNEL_TABLE*', 'VNET_ROUTE*', 'MACSEC_PORT_TABLE*', 'MACSEC_INGRESS_SA_TABLE*', 'MACSEC_EGRESS_SA_TABLE*', 'MACSEC_INGRESS_SC_TABLE*', 'MACSEC_EGRESS_SC_TABLE*', 'VRF_OBJECT_TABLE*', 'VNET_MONITOR_TABLE*', 'BFD_SESSION_TABLE*'"
clean_up_tables STATE_DB "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'LAG_TABLE*', 'LAG_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*', 'FDB_TABLE*', 'FG_ROUTE_TABLE*', 'BUFFER_POOL*', 'BUFFER_PROFILE*', 'MUX_CABLE_TABLE*', 'ADVERTISE_NETWORK_TABLE*', 'VXLAN_TUNNEL_TABLE*', 'VNET_ROUTE*', 'MACSEC_PORT_TABLE*', 'MACSEC_INGRESS_SA_TABLE*', 'MACSEC_EGRESS_SA_TABLE*', 'MACSEC_INGRESS_SC_TABLE*', 'MACSEC_EGRESS_SC_TABLE*', 'VRF_OBJECT_TABLE*', 'VNET_MONITOR_TABLE*', 'BFD_SESSION_TABLE*','SYSTEM_NEIGH_TABLE*'"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will removing from state_db trigger removing of the entry from the kernel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not removing the entries from kernel but avoid creating entries. When there are entries in the SYSTEM_NEIGH_TABLE in the STATE_DB, when nbrmgr comes up, it subscribes to this table. The existing entries in the table are subscribed as SET commands. As part of SET command processing for entries from this table in STATE_DB, we program kernel neighbors. By removing all the stale entries from this table we avoid nbrmgr getting the SET commands and hence the programming of the kernel entries.

Copy link
Contributor

@arlakshm arlakshm Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vganesan-nokia for the explanation. I get this part. This change helps. I was thinking of a scenario where, after swss restart if a neighbor is not learnt anymore on the local linecard, then the kernel entry will not be removed, is that correct? this change may not help in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The STATE_DB SYSTEM_NEIGH_TABLE is only for remote neighbor entries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

$SONIC_DB_CLI APPL_STATE_DB FLUSHDB
clean_up_chassis_db_tables
rm -rf /tmp/cache
Expand Down