Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

image_config: copp: Enable rate limiting for bgp, lacp, dhcp, lldp, macsec and udld #14859

Merged
merged 8 commits into from
Oct 25, 2023

Conversation

prabhataravind
Copy link
Contributor

@prabhataravind prabhataravind commented Apr 26, 2023

Why I did it

It was observed that a flood of DHCP packets without rate-limiting can cause BGP flaps or lacp keepalive losses.
This change attempts to prevent or reduce such BGP flaps by enabling appropriate rate-limiting in SONiC for all traffic types.

Work item tracking
  • Microsoft ADO 17964421:

How I did it

Set a reasonable CIR/CBS value of 300 for queue4_group3 (dhcp, lldp, macsec) and 6000 for queue4_group1.
The value 300 was arrived at after testing with dhcp flooding using ptf (using multiple threads). Throttling at this rate was necessary to ensure that dhcp flooding does not cause BGP flaps.

How to verify it

  1. Verified with this script running from ptf, that BGP flaps don't happen when CBS/CIR is set at 300 for queue4_group3.

     import threading
     from scapy.all import *
     
     def send_dhcp_discover(intf):
         dhcp_discover = Ether(dst='ff:ff:ff:ff:ff:ff',src=RandMAC()) \
                             /IP(src='1.1.1.1',dst='255.255.255.255') \
                             /UDP(sport=68,dport=67) \
                             /DHCP(options=[('message-type','discover'),('end')])
         sendp(dhcp_discover,count=100000,iface=intf)
     
     
     if __name__ == "__main__":
         t1 = threading.Thread(target=send_dhcp_discover, args=("eth1",))
         t2 = threading.Thread(target=send_dhcp_discover, args=("eth2",))
         t1.start()
         t2.start()
         t1.join()
         t2.join()
    
  2. Verified on Arista-7260CX3-D108C8 running 202012 that the copp rule for queue4_group1 and queue4_group3 do NOT affect BGP packets. To verify this using PTF, the copp rules were modified to set the "CBS" and "CIR" for queue4_group1 and queue4_group3 at 600pps and 50k packets each of "BGP open" and "DHCP Discover" were simultaneously sent from the same PTF port to the DUT. It was verified using "show c cpu" that packets are hitting the cpu queue at 1200 pps (double the configured CIR/CBS for these packet types). This helped conclude that throttling rate is per trap (or packet type) and not per queue.

  3. Verified with updated sonic-mgmt tests ([tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199) on broadcom and mellanox platforms that these traffic types are rate-limited.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Tested branch (Please provide the tested image version)

  • 20220531.33
  • 20201231.107

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@prabhataravind prabhataravind force-pushed the dhcp_copp branch 2 times, most recently from aed466d to 2e94e63 Compare May 3, 2023 05:44
…acsec, udld

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
@prabhataravind prabhataravind changed the title image_config: copp: Enable rate limiting for DHCP traffic image_config: copp: Enable rate limiting for bgp, lacp, dhcp, lldp, macsec and udld May 5, 2023
@prabhataravind
Copy link
Contributor Author

Related sonic-mgmt change: sonic-net/sonic-mgmt#8199

@prabhataravind prabhataravind marked this pull request as ready for review June 8, 2023 01:16
@prabhataravind prabhataravind requested a review from lguohan as a code owner June 8, 2023 01:16
@lguohan
Copy link
Collaborator

lguohan commented Jul 17, 2023

this mark pick for 202012 branch, then it should provide testing results on 202012 branch since it is the oldest branch.

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 14859 in repo sonic-net/sonic-buildimage

@prabhataravind
Copy link
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

prabhataravind and others added 2 commits August 29, 2023 10:36
    * Based on internal tests, BGP flaps during DHCP flooding can only be
      prevented if CIR/CBS for queue4_group3 is set to 300.This change should
      not cause any unwanted collaterals as all the protocols this applies to
      are fairly slow.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
prsunny
prsunny previously approved these changes Sep 18, 2023
@prsunny prsunny requested a review from dgsudharsan September 18, 2023 22:29
@lguohan
Copy link
Collaborator

lguohan commented Sep 20, 2023

macsec packets should be treated as bgp/lacp, it is fundamental to link stability.

@prabhataravind
Copy link
Contributor Author

macsec packets should be treated as bgp/lacp, it is fundamental to link stability.

okay, I will change macsec to the same queue as bgp/lacp and test

prabhataravind and others added 3 commits October 4, 2023 16:37
 * Macsec is fundamental to link stability and should have the same copp policy
   as bgp/lacp

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
@prabhataravind
Copy link
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@prabhataravind
Copy link
Contributor Author

@prsunny, could you please help merge this change?

@lguohan lguohan merged commit 7e49530 into sonic-net:master Oct 25, 2023
16 checks passed
@StormLiangMS
Copy link
Contributor

Hi @prabhataravind did you test with 202305 image? Could you provide test results for 202305?

@prabhataravind
Copy link
Contributor Author

Hi @prabhataravind did you test with 202305 image? Could you provide test results for 202305?

Hi @StormLiangMS, all copp tests pass on 202305 with the sonic-mgmt changes in sonic-net/sonic-mgmt#8199

admin@str2-7260cx3-acs-9:~$ show ver

SONiC Software Version: SONiC.20230531.07

======================================================================================================================= warnings summary =======================================================================================================================
../../../../usr/local/lib/python3.8/dist-packages/_yaml/init.py:18
/usr/local/lib/python3.8/dist-packages/_yaml/init.py:18: DeprecationWarning: The _yaml extension module is now located at yaml._yaml and its location is subject to change. To use the LibYAML-based parser and emitter, import from yaml: from yaml import CLoader as Loader, CDumper as Dumper.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
------------------------------------------------------------------------------------------ generated xml file: /var/src/sonic-mgmt-int/tests/logs/copp/test_copp.xml -------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------- live log sessionfinish --------------------------------------------------------------------------------------------------------------------
02:43:10 init.pytest_terminal_summary L0064 INFO | Can not get Allure report URL. Please check logs
========================================================================================================== 14 passed, 1 warning in 2538.87s (0:42:18) ==========================================================================================================
INFO:root:Can not get Allure report URL. Please check logs
paravind@paravind-sonic-mgmt:/var/src/sonic-mgmt-int/tests$

@prsunny
Copy link
Contributor

prsunny commented Nov 6, 2023

@StormLiangMS , can we cherry-pick to next 202305 image

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Nov 7, 2023
…ld (sonic-net#14859)

Why I did it
It was observed that a flood of DHCP packets without rate-limiting can cause BGP flaps or lacp keepalive losses.
This change attempts to prevent or reduce such BGP flaps by enabling appropriate rate-limiting in SONiC for all traffic types.

Work item tracking
Microsoft ADO 17964421:

How I did it
Set a reasonable CIR/CBS value of 300 for queue4_group3 (dhcp, lldp, macsec) and 6000 for queue4_group1.
The value 300 was arrived at after testing with dhcp flooding using ptf (using multiple threads). Throttling at this rate was necessary to ensure that dhcp flooding does not cause BGP flaps.

How to verify it
Verified with this script running from ptf, that BGP flaps don't happen when CBS/CIR is set at 300 for queue4_group3.

 import threading
 from scapy.all import *
 
 def send_dhcp_discover(intf):
     dhcp_discover = Ether(dst='ff:ff:ff:ff:ff:ff',src=RandMAC()) \
                         /IP(src='1.1.1.1',dst='255.255.255.255') \
                         /UDP(sport=68,dport=67) \
                         /DHCP(options=[('message-type','discover'),('end')])
     sendp(dhcp_discover,count=100000,iface=intf)
 
 
 if __name__ == "__main__":
     t1 = threading.Thread(target=send_dhcp_discover, args=("eth1",))
     t2 = threading.Thread(target=send_dhcp_discover, args=("eth2",))
     t1.start()
     t2.start()
     t1.join()
     t2.join()

Verified on Arista-7260CX3-D108C8 running 202012 that the copp rule for queue4_group1 and queue4_group3 do NOT affect BGP packets. To verify this using PTF, the copp rules were modified to set the "CBS" and "CIR" for queue4_group1 and queue4_group3 at 600pps and 50k packets each of "BGP open" and "DHCP Discover" were simultaneously sent from the same PTF port to the DUT. It was verified using "show c cpu" that packets are hitting the cpu queue at 1200 pps (double the configured CIR/CBS for these packet types). This helped conclude that throttling rate is per trap (or packet type) and not per queue.

Verified with updated sonic-mgmt tests ([tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199) on broadcom and mellanox platforms that these traffic types are rate-limited.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202305: #17111

lguohan pushed a commit that referenced this pull request Nov 22, 2023
Change DHCP rate limit in SONiC copp configuration to 100 PPS as this is
necessary to ensure that DHCP flood does not cause LACP/BGP flaps in all
scenarios

This is an extension to the change in image_config: copp: Enable rate limiting 
for bgp, lacp, dhcp, lldp, macsec and udld #14859 and sonic-mgmt change in 
[tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199

Why I did it
300 PPS is not sufficient to prevent LACP/BGP flaps in all cases. 100 PPS seems to
provide better resiliency against DHCP traffic flood to CPU.

Microsoft ADO 25776614:

Send DHCP broadcast packets to DUT and verify that they are trapped to CPU at 100 PPS.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Nov 22, 2023
Change DHCP rate limit in SONiC copp configuration to 100 PPS as this is
necessary to ensure that DHCP flood does not cause LACP/BGP flaps in all
scenarios

This is an extension to the change in image_config: copp: Enable rate limiting 
for bgp, lacp, dhcp, lldp, macsec and udld sonic-net#14859 and sonic-mgmt change in 
[tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199

Why I did it
300 PPS is not sufficient to prevent LACP/BGP flaps in all cases. 100 PPS seems to
provide better resiliency against DHCP traffic flood to CPU.

Microsoft ADO 25776614:

Send DHCP broadcast packets to DUT and verify that they are trapped to CPU at 100 PPS.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
mssonicbld pushed a commit that referenced this pull request Nov 23, 2023
Change DHCP rate limit in SONiC copp configuration to 100 PPS as this is
necessary to ensure that DHCP flood does not cause LACP/BGP flaps in all
scenarios

This is an extension to the change in image_config: copp: Enable rate limiting 
for bgp, lacp, dhcp, lldp, macsec and udld #14859 and sonic-mgmt change in 
[tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199

Why I did it
300 PPS is not sufficient to prevent LACP/BGP flaps in all cases. 100 PPS seems to
provide better resiliency against DHCP traffic flood to CPU.

Microsoft ADO 25776614:

Send DHCP broadcast packets to DUT and verify that they are trapped to CPU at 100 PPS.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
yxieca pushed a commit that referenced this pull request Dec 4, 2023
Change DHCP rate limit in SONiC copp configuration to 100 PPS as this is
necessary to ensure that DHCP flood does not cause LACP/BGP flaps in all
scenarios

This is an extension to the change in image_config: copp: Enable rate limiting 
for bgp, lacp, dhcp, lldp, macsec and udld #14859 and sonic-mgmt change in 
[tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199

Why I did it
300 PPS is not sufficient to prevent LACP/BGP flaps in all cases. 100 PPS seems to
provide better resiliency against DHCP traffic flood to CPU.

Microsoft ADO 25776614:

Send DHCP broadcast packets to DUT and verify that they are trapped to CPU at 100 PPS.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
@slutati1536
Copy link
Contributor

@dgsudharsan
@prsunny
@lguohan
was this commit cherrypicked to 202205?

@dgsudharsan
Copy link
Collaborator

@slutati1536 I don't think this was cherry-picked to 202205

@prsunny
Copy link
Contributor

prsunny commented Feb 6, 2024

@dgsudharsan @prsunny @lguohan was this commit cherrypicked to 202205?

No, it was not. Only backported to 202305+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants