[202205] [TACACS] Improve per-command authorization performance by read passwd entry with getpwent #16658

Closed

@liuh-80 liuh-80 commented Sep 22, 2023

Improve per-command authorization performance by reading passwd entries with getpwent.
This is a manual cherry-pick of PR #16460.

Why I did it

Currently, per-command authorization calls the getpwnam API to check whether the user is a remote user, which triggers tacplus-nss to authenticate with the TACACS server.
This is unnecessary, because the user's information is already added to the local passwd file when the user logs in.
The getpwent API can read directly from the passwd file, which improves per-command authorization performance.
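
To illustrate the idea (not the actual change, which lives in the C plugin and uses getpwent), here is a minimal Python sketch that looks up a user by scanning passwd-format lines locally, with no network round trip. The function name `find_local_user` and the sample entries are hypothetical.

```python
from typing import Iterable, Optional


def find_local_user(passwd_lines: Iterable[str], username: str) -> Optional[dict]:
    """Scan passwd(5)-format lines for `username` without any remote lookup."""
    for line in passwd_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 7:
            continue  # skip malformed entries
        if fields[0] == username:
            return {
                "name": fields[0],
                "uid": int(fields[2]),
                "gid": int(fields[3]),
                "gecos": fields[4],
                "home": fields[5],
                "shell": fields[6],
            }
    return None


# Hypothetical sample data; a real caller would read the local passwd file.
SAMPLE_PASSWD = """\
root:x:0:0:root:/root:/bin/bash
remote_user:x:1001:1001:remote_user:/home/remote_user:/bin/bash
"""

entry = find_local_user(SAMPLE_PASSWD.splitlines(), "remote_user")
print(entry["uid"] if entry else "not found")
```

The point of the sketch is only that a sequential scan of local entries answers the question without contacting the TACACS server.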

Work item tracking
  • Microsoft ADO: 25104723

How I did it

Improve per-command authorization performance by reading passwd entries with getpwent.

How to verify it

Pass all UT.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

  • master-16460.356317-6c3424111

Description for the changelog

Improve per-command authorization performance by reading passwd entries with getpwent.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

mssonicbld and others added 30 commits May 15, 2023 10:02
…t#14993) (sonic-net#15050)

Fix sonic-net#14974
Refs: docker/docker-py#3116

Co-authored-by: Konstantin Vasin <126960927+k-v1@users.noreply.github.com>
Why I did it
Support marvell/marvell-arm64 build

Work item tracking
Microsoft ADO (number only): 19995559
How I did it
… debs (sonic-net#12823)

Ubuntu 22.04 uses Zstandard compression for dpkg by default.
Debian doesn't support it yet:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=892664

Fix sonic-net#12822

Signed-off-by: Cédric Ollivier <cedric.ollivier@orange.com>
…rd match more than hundred files. (sonic-net#14787)

Fix per-command authorization failure when a command with a wildcard matches more than a hundred files.


#### Why I did it
When a user enables TACACS per-command authorization and runs a command with a wildcard, and the wildcard matches hundreds of files, per-command authorization fails with the following message:
  *** authorize failed by TACACS+ with given arguments, not executing

The root cause is that bash expands the wildcard and replaces the wildcard arguments with the matched files. When there are too many files, the TACACS plugin generates a large authorization request, which is rejected by the server side.

##### Work item tracking
- Microsoft ADO **(number only)**: 18074861

#### How I did it
Fix the bash patch file to use the original user input as the authorization parameters.
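
A rough Python sketch of why this helps, assuming a hypothetical directory listing: after wildcard expansion the argument list grows with the number of matched files, while the original user input stays the same size. The helper names `expand_args` and `authorization_args` are illustrative only; the real fix is in the bash patch.

```python
import fnmatch
import shlex
from typing import List


def expand_args(command: str, filenames: List[str]) -> List[str]:
    """Mimic shell globbing: replace each wildcard word with its matches."""
    expanded: List[str] = []
    for word in shlex.split(command):
        matches: List[str] = []
        if any(ch in word for ch in "*?["):
            matches = sorted(fnmatch.filter(filenames, word))
        # An unmatched pattern is passed through unchanged.
        expanded.extend(matches or [word])
    return expanded


def authorization_args(command: str) -> List[str]:
    """Build the request from what the user typed, not the expanded words."""
    return shlex.split(command)


# Hypothetical directory content with several hundred matching files.
files = [f"file{i}.log" for i in range(500)]
print(len(expand_args("ls *.log", files)))   # grows with the number of files
print(authorization_args("ls *.log"))        # stays the size of the input
```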

#### How to verify it
Pass all UT.
Create a new UT to validate that the TACACS authorization request uses the original command arguments.
UT PR: sonic-net/sonic-mgmt#8115

#### Which release branch to backport (provide reason below if selected)

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [X] 202205
- [X] 202211

#### Tested branch (Please provide the tested image version)

- [x] 202205.258490-412b83d0f
- [x] 202211.71966120-1b971c54b5


#### Description for the changelog
Fix per-command authorization failure when a command with a wildcard matches more than a hundred files.
…rser (sonic-net#14703)

* Support ACL interface type BmcData in minigraph parser

* Support ACL interface type BmcData in minigraph parser

* add unittest

* Add a global dict for storing the definition of custom acl tables
1. SONIC 20220531.25 OC Failure: Everflow testcases failing due to SAI orchagent crash
2. SONIC 20220531.25 OC Failure: ACL IPv6 testcases.
3. TPID support

Signed-off-by: rajkumar38 <rpennadamram@marvell.com>
…-7150IXRE platform (sonic-net#14548)

Why I did it

Update sonic-platform submodule for Nokia-7250IXRE Platform. This requires NDK 22.9.8 or later.

How I did it
Update submodule sonic-platform for Nokia-7250IXRE platform.
c9f316e Disparate process and thread-safe protection for MDIPC transport, and refactored presence logic to better align with SfpStateUpdateTask operation
a3486cc Added _get_module_bulk_info() and cache the info for 5 seconds to optimize the chassisd update.
4b2e729 Fixed the nokia_cmd show qfpga help display
7b87049 Fixed the nokia_cmd show midplane helper dispaly.
83eabea Add "nokia_cmd set ndk-monitor-action" and "nokia_cmd set ndk-log-level" commands
8aad7de Add nokia_cmd show ndk-version
d2c55e3 Modify the psu.py and module.py to optimize the psud running time


Signed-off-by: mlok <marty.lok@nokia.com>
…ot script (sonic-net#14568)

Why I did it

When the chassis is rebooted by issuing "sudo reboot" on the Supervisor card, the internal midplane communication interface xe0 should be shut down to avoid a double reboot of the linecard.
Added a udev link rule to disable autoneg on the AMD xgbe ports xe0 and xe1, keeping the setting in sync with the peer Broadcom Greyhound ports.

How I did it

Modify the Nokia-7250IXRE specific reboot script on the Supervisor card to shut down the internal interface xe0. Also move the linecard reboot code to the top of the script to make sure the notification has been sent to the linecard before the xe0 interface is shut down.
Introduced a new rule, 80-net-by-driver.link, to disable autoneg on the AMD side. This change requires the latest NDK, which contains the change to set autoneg on the xe0 and xe1 ports on the Greyhound.

Signed-off-by: mlok <marty.lok@nokia.com>
…atically (sonic-net#14983)

src/sonic-utilities

* 62683097 - (HEAD -> 202205, origin/202205) [config]: Dynamically start and stop ndppd (sonic-net#2814) (5 days ago) [Lawrence Lee]
* 8adaa020 - Added platform plugin support in load_minigraph (sonic-net#2808) (sonic-net#2831) (8 days ago) [anamehra]
…e latest HEAD automatically (sonic-net#15016)

src/wpasupplicant/sonic-wpa-supplicant

* a24412c25 - (HEAD -> 202205, origin/master, origin/HEAD, origin/202211, origin/202205, master) [mka]: Fix unexpected cleanup (sonic-net#73) (8 days ago) [Ze Gan]
* 26d1da0bc - [mka]: Fix re-establishment by reset MI (sonic-net#72) (8 days ago) [Ze Gan]
* f07e0a097 - [azp]: Update build pipeline to build for Bullseye (sonic-net#70) (4 weeks ago) [Ze Gan]
*   2c69e2cda - Use github code scanning instead of LGTM (sonic-net#69) (6 months ago) [Liu Shilong]
|\  
| * 23abb04e5 - fix (6 months ago) [shilongliu]
| * f34d68fe6 - libdbus-1-dev (6 months ago) [shilongliu]
| * dc2dd881e - add dbus (6 months ago) [shilongliu]
| * 5de037661 - use swsscommon packages (6 months ago) [shilongliu]
| * 32c5a2729 - Use github code scanning instead of LGTM (6 months ago) [shilongliu]
|/  
* aa731b96f - [azp]: Install libyang in azure pipeline (sonic-net#68) (8 months ago) [Hua Liu]
* 71b635d74 - Revert "[Azp]: Upgrade Azp to bullseye (sonic-net#49)" (sonic-net#66) (9 months ago) [Ze Gan]
* 7aa4e6fa4 - Adding Microsoft SECURITY.MD (sonic-net#58) (9 months ago) [microsoft-github-policy-service[bot]]
… automatically (sonic-net#15033)

src/sonic-platform-common

* c7ce1a5 - (HEAD -> 202205, origin/202205) Prevent VDM dictionary related KeyError when a transceiver module is pulled while a bulk get method is interrogating said module (sonic-net#360) (5 days ago) [snider-nokia]
…lly (sonic-net#15034)

src/sonic-swss

* eb79dae - (HEAD -> 202205, origin/202205) [orchagent]: Handle additional SAI error conditions gracefully (sonic-net#2755) (5 days ago) [prabhataravind]
* d3c3a7d - [mux]: Implement rollback for failed mux switchovers (sonic-net#2714) (sonic-net#2761) (5 days ago) [Lawrence Lee]
…tty not running to avoid false alert. (sonic-net#14402) (sonic-net#15032)

[S6100] Improve S6100 serial-getty monitor, wait and re-check when getty not running to avoid false alert. 

#### Why I did it
On S6100, systemd sometimes fails to auto-restart the serial-getty service, so there is a monit unit that checks the serial-getty service status and restarts it.

However, this monit unit reports false alerts, because in most cases when serial-getty is not running, systemd restarts it successfully.

To avoid the false alerts, improve the monitor to wait and re-check.

Steps to reproduce this issue:
1. Log in to the device via console and keep the connection open.
2. Log in to the device via SSH and check the serial-getty@ttyS1.service service; it is running.
3. Run 'monit reload' from the SSH connection.
4. Check syslog 1 minute later; there will be a false alert: 'serial-getty' process is not running

#### How I did it
Add a check-getty.sh script that re-checks later when the getty service is not running.
Update the monit unit to check the serial-getty service status with this script to avoid false alerts.
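
The wait-and-re-check idea can be sketched in Python as follows. This is not the content of check-getty.sh; the function name, the default wait time, and the injected `is_running` probe are assumptions made for illustration.

```python
import time
from typing import Callable


def check_with_recheck(is_running: Callable[[], bool],
                       wait_seconds: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Return 0 if the service is up now or after one delayed re-check, else 1."""
    if is_running():
        return 0
    # Give the service manager a chance to restart the service by itself.
    sleep(wait_seconds)
    return 0 if is_running() else 1


# Simulated probe: down on the first check, restarted by the second one.
states = iter([False, True])
exit_code = check_with_recheck(lambda: next(states), sleep=lambda _s: None)
print(exit_code)
```

A monitor that only alerts on a non-zero result from such a check will not fire when the service recovers on its own within the wait window.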

#### How to verify it
Pass all UT.
Manually checked that the fixed code works correctly:


```
admin@***:~$ sudo systemctl stop  serial-getty@ttyS1.service
admin@***:~$ sudo /usr/local/bin/check-getty.sh 
admin@***:~$ echo $?
1
admin@***:~$ sudo systemctl status serial-getty@ttyS1.service
● serial-getty@ttyS1.service - Serial Getty on ttyS1
     Loaded: loaded (/lib/systemd/system/serial-getty@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Tue 2023-03-28 07:15:21 UTC; 1min 13s ago

admin@***:~$ sudo /usr/local/bin/check-getty.sh 
admin@***:~$ echo $?
0
admin@***:~$ sudo systemctl status serial-getty@ttyS1.service
● serial-getty@ttyS1.service - Serial Getty on ttyS1
     Loaded: loaded (/lib/systemd/system/serial-getty@.service; enabled-runtime; vendor preset: enabled)
```

syslog:
```
Mar 28 07:10:37.597458 *** INFO systemd[1]: serial-getty@ttyS1.service: Succeeded.
Mar 28 07:12:43.010550 *** ERR monit[593]: 'serial-getty' status failed (1) -- no output
Mar 28 07:12:43.010744 *** INFO monit[593]: 'serial-getty' trying to restart
Mar 28 07:12:43.010846 *** INFO monit[593]: 'serial-getty' stop: '/bin/systemctl stop serial-getty@ttyS1.service'
Mar 28 07:12:43.132172 *** INFO monit[593]: 'serial-getty' start: '/bin/systemctl start serial-getty@ttyS1.service'
Mar 28 07:13:43.286276 *** INFO monit[593]: 'serial-getty' status succeeded (0) -- no output
```

#### Description for the changelog
[S6100] Improve S6100 serial-getty monitor.

…net#14834) (sonic-net#15097)

This PR handles overriding the minigraph config with the golden_config_db.json file if it is present in the backup location.
…onic-net#14736) (sonic-net#15079)

- Why I did it
There are chassis-packet and single-ASIC platforms that support this 400G to 100G/40G speed change via config.
Enable this feature for all platforms that can support it. Keeping it enabled for all does not affect the platforms
that do not support this feature yet.

Work item tracking
Microsoft ADO (number only):
17952356
- How I did it
Removed switch_type and role type check.

- How to verify it
Loaded router with default 400G config. Loaded minigraph to convert 400G to 100G speed.

Signed-off-by: anamehra <anamehra@cisco.com>
Why I did it
ptf_nn_agent failed to start in dnx rpc syncd because module afpacket was not installed.
Please see issue sonic-net/sonic-mgmt#7822

How I did it
Add downloading ptf afpacket module in docker file.

How to verify it
Verified that ptf_nn_agent was started successfully in dnx rpc syncd with the change.
…show command output (sonic-net#13940)

This PR adds the following:

Add a new option "--profile" to the show macsec command to show all profiles on the device.
Update the current show macsec command to show the profile in each interface's output. This tells which macsec profile the interface is attached to.
…ob (sonic-net#14702)

Why I did it
A systemd stop event on a service with timers can sometimes delete the state_db entry for the corresponding service.

Note: This won't be observed on the latest master label since the dependency on timer was removed with the recent config reload enhancement. However, it is better to have the fix since there might be some systemd services added to system health daemon in the future which may contain timers

root@qa-eth-vt01-4-3700c:/home/admin# systemctl stop snmp
root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status 
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
<Truncated>
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
telemetry               OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     OK                OK                  -
<Truncated>
Expected

Should see a Down entry for SNMP instead of the entry being deleted from the STATE_DB

root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status 
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
<Truncated>
snmp                    Down              Down                Inactive
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
telemetry               OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     OK                OK                  -
<Truncated>
How I did it
This happens because the timer is usually PartOf the service, so a stop on the service is propagated to the timer. Fixed the logic to handle this.

Apr 18 02:06:47.711252 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.service from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.711347 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.service ] 
Apr 18 02:06:47.722363 r-lionfish-16 INFO healthd: snmp.service service state changed to [inactive/dead]

Apr 18 02:06:47.723230 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.timer from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.723328 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.timer ] 

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
What I did:
In FRR, the command update source <interface-name> is not at the address-family level. Because of this,
the internal peer route-map for IPv6 was getting applied to the IPv4 address family. As a result,
TSA over iBGP for IPv6 was not getting applied.

How I verify:

Manually verified that TSA over both IPv4 and IPv6 works after the fix.
Updated UT for this.

Added sonic-mgmt test gap: sonic-net/sonic-mgmt#8170

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
…onic-net#15140)

* [yang] Extend device_metadata yang model with rack_mgmt_map

* Update doc and sample
Why I did it
This PR is to backport PR sonic-net#11117 into 202205 branch.
This PR is to define Yang model for SYSTEM_DEFAULTS table.
The table was introduced in PR sonic-net/SONiC#982
The table will be like

"SYSTEM_DEFAULTS": {
        "tunnel_qos_remap": {
            "status": "enabled"
        }
}
Work item tracking
Microsoft ADO (https://msazure.visualstudio.com/One/_workitems/edit/23037078)
How I did it
Add a new YANG file, sonic-system-defaults.yang.

How to verify it
Verified by UT
… automatically (sonic-net#15133)

src/sonic-platform-common

* ff72811 - (HEAD -> 202205, origin/202205) Fix issue '<' not supported between instances of 'NoneType' and 'int' (sonic-net#371) (5 hours ago) [Junchao-Mellanox]
* f2a419d - Render Media lane and Media assignment options info from Application Code (sonic-net#368) (8 hours ago) [rajann]
* d8bad10 - Retrieve FW version using CDB command for CMIS transceivers + handle single bank FW versioning (sonic-net#372) (8 hours ago) [mihirpat1]
…matically (sonic-net#15134)

src/sonic-py-swsssdk

* cd4cb1e - (HEAD -> 202205, origin/202205) Loosen the redis version requirement (sonic-net#138) (5 hours ago) [xumia]
…atically (sonic-net#15135)

src/sonic-utilities

* ff26d900 - (HEAD -> 202205, origin/202205) Fix the invalid variable issue when set-fips in uboot (sonic-net#2834) (5 hours ago) [xumia]
* d98033fa - correctly parsing complete ipv6 vnet info (sonic-net#2827) (5 hours ago) [Keith Lu]
* 14b21508 - [GCU]Fix rdma check failure (sonic-net#2824) (5 hours ago) [jingwenxie]
* e2f0f2c8 - LAG keepalive script to reduce lacp session wait during warm-reboot (sonic-net#2806) (5 hours ago) [Vaibhav Hemant Dixit]
* 30533711 - Update TRANSCEIVER_INFO table after CDB FW upgrade (sonic-net#2837) (8 hours ago) [mihirpat1]
…t#15139)

Why I did it
Fix pipeline issue which breaks 202205 official build.

Work item tracking
Microsoft ADO (number only): 23115245
How I did it
qemuOrCrossBuild was not cherry picked into 202205 branch.
Remove this feature.
mssonicbld and others added 28 commits September 5, 2023 21:47
How I did it
Fix the regex for L4 port range in openconfig_acl.py.

How to verify it
Built the image and installed it on an Arista-720DT DUT, then tried the repro steps in sonic-net#16189 and confirmed the ACL rule was set up correctly.

Co-authored-by: Zhijian Li <zhijianli@microsoft.com>
…D automatically (sonic-net#16390)

src/sonic-platform-daemons

* 0258ecf - (HEAD -> 202205, origin/202205) [pmon][chassis][voq] Chassis DB cleanup when module is down (sonic-net#394) (9 hours ago) [vganesan-nokia]
…essage (sonic-net#16367) (sonic-net#16440)

* Fix the Loopback0 IPv6 address of LCs in a chassis not being reachable from peer devices
* Assign a higher metric value to the IPv6 default route learnt via RA message so that the BGP-learnt default route has higher priority.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Co-authored-by: abdosi <58047199+abdosi@users.noreply.github.com>
… shutdown (sonic-net#13611) (sonic-net#16449)

- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. That commit raises an exception in SfpStateUpdateTask to make the shutdown flow fast. This does not work on the Nvidia platform, because the Nvidia platform passes the timeout parameter of get_change_event to select, and the Linux select function cannot be interrupted by a Python exception. There was no such issue on the Nvidia platform before that commit. However, to comply with the commit and keep the shutdown flow fast, we decided to change the Nvidia platform API implementation.

To fix issue sonic-net#13591.

- How I did it
The select call in get_change_event should use a timeout of no more than 1 second.
Outside the select call, add a while loop to make sure the timeout parameter of get_change_event works as expected.
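
The pattern described above can be sketched as follows: a long wait is split into slices of at most 1 second so the loop can notice a stop request between slices. This is an illustrative sketch, not the Nvidia platform code; `wait_for_event`, `poll`, and `should_stop` are hypothetical names, and `poll` stands in for the bounded select call.

```python
import time
from typing import Callable, Optional

MAX_SLICE_SECONDS = 1.0  # upper bound for a single blocking wait


def wait_for_event(poll: Callable[[float], Optional[str]],
                   timeout: float,
                   should_stop: Callable[[], bool],
                   clock: Callable[[], float] = time.monotonic) -> Optional[str]:
    """Wait up to `timeout` seconds in slices so shutdown is noticed quickly."""
    deadline = clock() + timeout
    while not should_stop():
        remaining = deadline - clock()
        if remaining <= 0:
            return None  # overall timeout expired
        event = poll(min(remaining, MAX_SLICE_SECONDS))
        if event is not None:
            return event
    return None  # stop requested between slices


# Simulated clock and poll function, so the example runs instantly.
now = [0.0]
slices = []


def fake_poll(slice_timeout: float) -> Optional[str]:
    slices.append(slice_timeout)
    now[0] += slice_timeout
    return None


result = wait_for_event(fake_poll, 2.5, lambda: False, clock=lambda: now[0])
print(result, slices)
```

With this structure the caller still sees the full requested timeout, while no single blocking call lasts longer than one slice.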

- How to verify it
Manual test

Co-authored-by: Junchao-Mellanox <57339448+Junchao-Mellanox@users.noreply.github.com>
) (sonic-net#16457)

On S6100 we are seeing almost 100K interrupts per second on Intel's i801 SMBus controller, which affects system performance.

We now disable the i801 driver interrupt and enable polling instead.

Microsoft ADO (number only): 24910530

How I did it
Disable the interrupt by passing the interrupt disable feature argument to the i2c-i801 driver.

How to verify it
This fix is NOT applicable to ARM-based platforms. It applies only to Intel-based platforms:

- On SN2700 its already disabled in Mellanox hw-mgmt
- Celestica DX010 and E1031
- Dell S6100 verified the interrupts are no longer incrementing.
- Arista 7260CX3

Signed-off-by: Prince George <prgeor@microsoft.com>
Co-authored-by: Prince George <45705344+prgeor@users.noreply.github.com>
…nic-net#16271) (sonic-net#16467)

- Why I did it
Revise label name and fix typo in sensor.conf of 4600C

- How I did it
Revise label name and fix typo in sensor.conf of 4600C

- How to verify it
Manual test
sonic-mgmt test_sensors.py

Co-authored-by: Junchao-Mellanox <57339448+Junchao-Mellanox@users.noreply.github.com>
…d BMCDATAV6 (sonic-net#16249) (sonic-net#16473)

Why I did it
According to ACL-Table-Type-HLD, the value type of MATCHES, ACTIONS and BIND_POINTS should be list instead of string. Opening this PR to update the definition of BMCDATA and BMCDATAV6.

How I did it
Update the definition of BMCDATA and BMCDATAV6 in minigraph-parser.

How to verify it
Verified by UT and build SONiC image.

Co-authored-by: Zhijian Li <zhijianli@microsoft.com>
…) (sonic-net#16472)

How I did it
Update Yang definition of ACL_TABLE_TYPE.
Update existing testcase.
Add new testcase to cover lowercase key scenario.

How to verify it
Verified by building sonic_yang_models-1.0-py3-none-any.whl. While building the target package, unit tests were run and passed.

Co-authored-by: Zhijian Li <zhijianli@microsoft.com>
…net#16469)

* [YANG][vlan-sub-interface] Add `vlan` field



* Fix typo



* Fix UT



---------

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Co-authored-by: Longxiang Lyu <35479537+lolyu@users.noreply.github.com>
sonic-net#16474)

src/linkmgrd

* 4bf3ebb - (HEAD -> 202205, origin/202205) [active-standby] Fix extra toggle observed in `config reload` (sonic-net#216) (53 minutes ago) [Longxiang Lyu]
…atically (sonic-net#16476)

src/sonic-utilities

* 03292ffe - (HEAD -> 202205, origin/202205) Fix show acl table for masic (sonic-net#2937) (6 minutes ago) [Arvindsrinivasan Lakshmi Narasimhan]
* 627a2f59 - [Techsupport] Update the message seen during the lock acquisition failure (sonic-net#2897) (55 minutes ago) [Vivek]
…et#16188) (sonic-net#16470)

Bmc is a valid neighbor type in minigraph, however it was missing from the YANG model definition. Usually, the Bmc type device can be neighbor of BmcMgmtToRRouter. This PR is to introduce this type.

Co-authored-by: Yaqiang Zhu <zyq1512099831@gmail.com>
… automatically (sonic-net#16475)

src/sonic-platform-common

* 6a38e71 - (HEAD -> 202205, origin/202205) Default implementation of under/over speed checks (sonic-net#382) (10 minutes ago) [spilkey-cisco]
* 9f2f61d - Convert the tx/rx power unit to the dBm unit (sonic-net#377) (11 minutes ago) [ChiouRung Haung]
…atically (sonic-net#16481)

src/sonic-utilities

* 787b4a32 - (HEAD -> 202205, origin/202205) Remove SFP index usage in generating list of SFP hw error (sonic-net#2961) (6 hours ago) [Prince George]
… to 100G and set speed setting before lane reconfiguration (sonic-net#16452)

* [minigraph] remove number of lanes check for changing speed from 400G to 100G and set speed setting before lane reconfiguration   (sonic-net#15721)

On an 8111 800G interface split into 2x400G (each with 4 lanes), changing the interface speed from 400G to 100G fails during deploy mg. In minigraph.xml the interface speed configuration is good, but the right value is not generated in config_db.json.

To support this SKU, the speed transition should support both 4 lanes and 8 lanes in port_config.ini.

Why I did it

Before this change, for a 400G to 100G transition, in all cases except when there are 8 lanes, we would continue and the line
ports.setdefault(port_name, {})['speed'] = port_speed_png[port_name]
would not be executed. Hence the default speed was never set and config_db was not updated
when the speed transitions from 400G to 100G or 40G but the number of lanes is not 8.

For those cases to pass where the number of lanes is not specifically 8, we need this change.

Work item tracking
24242657

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>

* fix UT

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>

---------

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
…lly (sonic-net#16455)

src/sonic-swss

* 33d81e7f - (HEAD -> 202205, origin/202205) Support type7 encoded CAK key for macsec in config_db (sonic-net#2892) (2 days ago) [judyjoseph]
…onic-net#16311)

chassis-packet: Update arp_update script for FAILED and STALE check (sonic-net#16311)

1. Fixing an issue with FAILED entry resolution retry.
Neighbor entries in arp table may sometimes enter a FAILED state when the far end is down and reports the state as follows:
2603:10e2:400:3::1 dev PortChannel19 router FAILED
While the arp_update script handles the entries for FAILED in the following format, the above was not handled due to the token location (extra router keyword at index 4):
2603:10e2:400:3::1 dev PortChannel19 FAILED

The former format may appear if an arp resolution is tried on a link that is known but the far end goes down, e.g., pinging a STALE entry while the far end is down.
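
To show why a fixed token index misses the first format, here is a small Python sketch that treats the last token of each line as the state, so an extra keyword such as router does not matter. The arp_update script itself is a shell script; this function and its name are illustrative assumptions, and the STALE sample line is made up.

```python
from typing import List, Tuple


def failed_neighbors(ip_neigh_output: str) -> List[Tuple[str, str]]:
    """Return (address, device) pairs whose neighbor state is FAILED."""
    failed = []
    for line in ip_neigh_output.splitlines():
        tokens = line.split()
        # Expected shape: <address> dev <device> [flags ...] <STATE>
        if len(tokens) >= 4 and tokens[1] == "dev" and tokens[-1] == "FAILED":
            failed.append((tokens[0], tokens[2]))
    return failed


SAMPLE = """\
2603:10e2:400:3::1 dev PortChannel19 router FAILED
2603:10e2:400:3::2 dev PortChannel19 FAILED
2603:10e2:400:3::3 dev PortChannel20 lladdr 00:11:22:33:44:55 STALE
"""

print(failed_neighbors(SAMPLE))
```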

2. Refreshing STALE entries to make sure the far end is reachable.
STALE entries for some backend ports may appear in chassis-packet when no traffic is received for a while on the port. When the far end goes down, it is expected for BFD to stop sending packets on the session for which the far end is not reachable. But as the entry is known as stale, on the Cisco chassis, BFD keeps sending packets. Refreshing the stale entry will keep active links as reachable in the neighbor table, while the entries for a far end that is down will enter a failed state. FAILED state entries will be retried and will become reachable when the far end comes back up.
…onic-net#16456)

Why I did it
A Zebra core is sometimes seen during config reload. A series of route-map deletions followed by re-adds triggers the hash table to realloc and grow to a larger size; subsequent route-map operations then run against a corrupted hash table.

The issue is seen when BFD is enabled on the Static Route table, where the static route-map is created and deleted based on BFD session state. However, the issue itself is very generic from the FRR perspective.

This issue has detailed core info: sonic-net/sonic-frr#37. This PR fixes the issue.
Fixes sonic-net/sonic-frr#37

Work item tracking
Microsoft ADO (17952227):

How I did it
This fix is already in Master frr/8.2.5. Porting this fix to 202205 branch to address this Zebra core.
sonic-net/sonic-frr@5f503e5

Solution:
The whole purpose of delaying the deletion and storing the route-map is to allow the using protocol to process the route-map at a later time while still retaining the route-map name (for more efficient reprocessing). The problem exists because we are keeping multiple copies of deletion events that are indistinguishable from each other, causing hash havoc.

How to verify it
Verified running sonic-mgmt test, doing multiple config reloads.
sonic-net#16527)

Signed-off-by: anamehra <anamehra@cisco.com>

Added a check for DEVICE_METADATA before accessing the data. This prevents the j2 failure when the variable is not available.
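
The same guard, expressed as a small Python sketch rather than the j2 template that was actually changed: check that the table exists before reading a field from it. The helper name, the "localhost" key, and the sample field are assumptions for illustration.

```python
from typing import Any, Dict, Optional


def get_metadata_field(config: Dict[str, Any], field: str,
                       default: Optional[str] = None) -> Optional[str]:
    """Read a DEVICE_METADATA field, tolerating a missing table."""
    metadata = config.get("DEVICE_METADATA")
    if not metadata:
        return default  # table absent: fall back instead of failing
    return metadata.get("localhost", {}).get(field, default)


# Hypothetical inputs: one config with the table, one without it.
with_table = {"DEVICE_METADATA": {"localhost": {"type": "LeafRouter"}}}
without_table: Dict[str, Any] = {}

print(get_metadata_field(with_table, "type"))
print(get_metadata_field(without_table, "type", default="unknown"))
```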
… (sonic-net#16541)

* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting to clear the system interface entries,
    wait for some time only if some system neighbors were deleted.
    If no system neighbor entries were deleted for the asic coming up,
    there is no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag members were
    deleted. If no system lag members were deleted, there is no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While the asic
    is coming up, when system neigh entries are deleted from chassis app
    db (as part of chassis db clean up), there is no orch/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh entries are present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss service came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when the service comes up.
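
The "wait only if something was deleted" optimization can be sketched in Python as below. The real change is in the swss startup script; the function name, the in-memory table, and the settle time are hypothetical.

```python
import time
from typing import Callable, Dict, Iterable


def delete_then_maybe_wait(keys: Iterable[str],
                           delete: Callable[[str], None],
                           settle_seconds: float = 1.0,
                           sleep: Callable[[float], None] = time.sleep) -> int:
    """Delete the given keys and pause only when at least one was deleted."""
    deleted = 0
    for key in keys:
        delete(key)
        deleted += 1
    if deleted:
        # Let consumers process the deletions before the next clean-up step.
        sleep(settle_seconds)
    return deleted


# In-memory stand-in for a database table, with made-up keys.
table: Dict[str, str] = {"SYSTEM_NEIGH|a": "1", "SYSTEM_NEIGH|b": "2"}
waits = []

count = delete_then_maybe_wait(list(table), table.pop, sleep=waits.append)
print(count, waits)

count = delete_then_maybe_wait(list(table), table.pop, sleep=waits.append)
print(count, waits)
```

The second call finds nothing to delete and therefore adds no delay, which is the case the commit message describes for an asic with no stale entries.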

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41f)
Why I did it
A sonic-mgmt test failure is seen for the update_firmware component API.

Microsoft ADO: 25208748

How I did it
Edited API 2.0 to fix this issue.

How to verify it
Run sonic-mgmt test after the fix and verify it passes.
…D automatically

src/sonic-platform-daemons

* 198f300 - (HEAD -> 202205, origin/202205) [pmon]chassisd crash fix (sonic-net#396)
#### Why I did it

To enable qos config for a certain backend deployment mode, for resource-type "Compute-AI".
This deployment has the following requirement:

- Config below enabled if DEVICE_TYPE as one of backend_device_types
- Config below enabled if ResourceType is 'Compute-AI'
- 2 lossless TCs' (2, 3)
- 2 lossy TCs' (0,1)
- DSCP to TC map uses 4 DSCP code points and maps to the TCs' as follows:
   "DSCP_TO_TC_MAP": {
        "AZURE": {
             "48" : "0",
            "46" : "1",
            "3"  : "3",
            "4"  : "4"
        }
    }

- WRED profile has green {min/max/mark%} as {2M/10M/5%}

This required template change <as in the PR> in addition to the vendor qos.json.j2 file (not included here).

### How I did it

#### How to verify it
- with the above change and the vendor config change, generated the qos.json file and verified that the objective stated in "Why I did it" was met

- verified no error

### Description for the changelog
Update qos_config.j2 for Compute-AI deployment on one of the backend device type roles
@liuh-80 liuh-80 closed this Sep 22, 2023
liuh-80 commented Sep 22, 2023

Closing because the wrong target branch was selected.
