[orchagent/syncd] Adding Loopback0 IP causes orchagent/syncd crash/dump on Accton-AS9716-32D Tomahawk 3 #20837
Comments
Yea. Confirmed. After applying this patch:
--- a/orchagent/tunneldecaporch.cpp
+++ b/orchagent/tunneldecaporch.cpp
@@ -968,6 +968,11 @@ bool TunnelDecapOrch::addDecapTunnelTermEntry(
     sai_status_t status = sai_tunnel_api->create_tunnel_term_table_entry(&tunnel_term_table_entry_id, gSwitchId, (uint32_t)tunnel_table_entry_attrs.size(), tunnel_table_entry_attrs.data());
     if (status != SAI_STATUS_SUCCESS)
     {
+        if (status == SAI_STATUS_NOT_SUPPORTED)
+        {
+            SWSS_LOG_WARN("Creating SAI_API_TUNNEL returned SAI_STATUS_NOT_SUPPORTED for tunnel %s IP %s.", tunnel_name.c_str(), dst_ip.to_string().c_str());
+            return false;
+        }
         SWSS_LOG_ERROR("Failed to create tunnel decap term entry %s.", dst_ip.to_string().c_str());
         task_process_status handle_status = handleSaiCreateStatus(SAI_API_TUNNEL, status);
         if (handle_status != task_success)
Then the logs look a lot happier:
I'll go file a PR there.
…unnels

The Tomahawk 3 chipset on an Accton AS9716-32D does not do SAI_TUNNEL_TERM_TABLE_ENTRY_TYPE_P2MP tunnels and returns SAI_STATUS_NOT_SUPPORTED.

Since sonic-net#3117, such a tunnel is created unconditionally whenever an IP address is added to Loopback0. The SAI_STATUS_NOT_SUPPORTED is handled as an unrecoverable error. That takes the orchagent down and syncd along with it.

    root@spine2:0:~# dmidecode 2>/dev/null | grep -A1 'Manufacturer:' | head -n2
    Manufacturer: Accton
    Product Name: AS9716-32D

    root@spine2:0:~# printf '%s\n' 'bsv' 'show unit' 'ver' | xargs -d\\n -n1 bcmcmd -t 1 | grep '^[A-Z]'
    BRCM SAI ver: [11.2.13.1], OCP SAI ver: [1.14.0], SDK ver: [sdk-6.5.30-SP4]
    Unit 0 chip BCM56980_B0 (current)
    Broadcom Command Monitor: Copyright (c) 1998-2024 Broadcom
    Release: sdk-6.5.30-SP4 built 20241016 (Wed Oct 16 13:33:02 2024)
    From root@7ec56acb7b44:/__w/1/s/output/x86-xgsall-deb/xgs-sdk-src/hsdk-all-6.5.30
    Platform: X86
    OS: Unix (Posix)

Logs output:

    NOTICE swss#orchagent: :- addDecapTunnel: Create overlay loopback router interface oid:60000000005b2
    NOTICE swss#orchagent: :- doDecapTunnelTask: Tunnel IPINIP_TUNNEL added to ASIC_DB.
    ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_NOT_SUPPORTED
    ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_VR_ID: oid:0x3000000000021a
    ...
    ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
    ERR swss#orchagent: :- addDecapTunnelTermEntry: Failed to create tunnel decap term entry 10.1.0.1/32.
    ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_TUNNEL, status: SAI_STATUS_NOT_SUPPORTED
    NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
    NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump

This changeset adds a check for SAI_STATUS_NOT_SUPPORTED, turning the error into a warning and going back to behaviour before subnet_decap was added:

    ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
    WARNING swss#orchagent: :- addDecapTunnelTermEntry: Creating SAI_API_TUNNEL returned SAI_STATUS_NOT_SUPPORTED for tunnel IPINIP_TUNNEL IP 10.1.0.1/32.
    ERR swss#orchagent: :- doDecapTunnelTermTask: IPINIP_TUNNEL:10.1.0.1: failed to add tunnel decap term to ASIC_DB.

Resolves: sonic-net/sonic-buildimage#20837
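To make the control flow of that change easier to follow outside of orchagent, here is a minimal standalone sketch. It is not the sonic-swss code: SaiStatus, CreateOutcome, classifyCreateStatus and addTermEntry are hypothetical stand-ins invented for illustration, and the real patch works with sai_status_t, SWSS_LOG_WARN and handleSaiCreateStatus as shown in the diff. The only point it tries to show is that "not supported" gets a soft-failure path while every other failure still goes to the fatal path.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-ins (assumptions, not the real SAI/swss types).
enum class SaiStatus { Success, NotSupported, Failure };
enum class CreateOutcome { Created, SkippedUnsupported, Fatal };

// Classify the status of a create call:
// success, a soft "not supported" failure, or a fatal failure.
CreateOutcome classifyCreateStatus(SaiStatus status)
{
    if (status == SaiStatus::Success)
    {
        return CreateOutcome::Created;
    }
    if (status == SaiStatus::NotSupported)
    {
        // The hardware cannot do this; warn and keep the process alive.
        return CreateOutcome::SkippedUnsupported;
    }
    // Everything else stays on the existing fatal path.
    return CreateOutcome::Fatal;
}

// Returns true only if the term entry was created. Sets `fatal` when the
// caller should take the "exit the agent" path instead of continuing.
bool addTermEntry(SaiStatus status, const std::string &tunnel,
                  const std::string &ip, bool &fatal)
{
    fatal = false;
    switch (classifyCreateStatus(status))
    {
    case CreateOutcome::Created:
        return true;
    case CreateOutcome::SkippedUnsupported:
        std::cerr << "WARNING: create not supported for tunnel " << tunnel
                  << " IP " << ip << "\n";
        return false;
    case CreateOutcome::Fatal:
        std::cerr << "ERR: failed to create tunnel decap term entry " << ip
                  << "\n";
        fatal = true;
        return false;
    }
    return false;
}
```

Calling addTermEntry(SaiStatus::NotSupported, "IPINIP_TUNNEL", "10.1.0.1/32", fatal) in this sketch prints a warning, returns false and leaves fatal unset, which corresponds to the WARNING line followed by the non-fatal "failed to add tunnel decap term" error in the patched logs above.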
@wdoekes looks like you already fixed the issue
It ain't fixed until it's merged, yes? =) Can you get my PRs looked at, @prgeor ? ❤️ If changes are needed, I'll gladly oblige. If tests are needed, I'm going to need some hand holding / examples.
@wdoekes I see you mentioned something about TD3; that's not related to your PR #3377, you said, right? What was your resolution for TD3 on master? You said you reverted to an older libsai; did you revert this commit? ffb9bc0 I've been having a heck of a time on TD3 finding a release that actually works properly on Dell S5248F (with VXLAN support) and have started trying to pull patches into older branches to get it working.
@bradh352: For TD3 on master, I needed two reverts: Specifically the latest two on this chain of commits:
I cannot tell whether Vxlan support works with libsaibcm back to 10.1.42. I did not get around to trying.
Another thing I tried was running master and then replacing libsaibcm.deb inside the docker-swss with a newer version. That also appeared to work. Fetching a newer .deb is a bit tricky though. I repacked one from a broadcom or edgecore distro (don't remember off-hand) and simply dpkg -i'd it inside the running docker container. Things didn't break immediately, but .. I wouldn't trust it to work long term.
In my case, the master build with libsaibcm 11.2.13 doesn't work, but the 202405 build (also 11.2.13) does. So trying branch 202405 would be my best bet, unless you know that Vxlan is unavailable there.
@wdoekes thanks for that. I know 202405 vxlan definitely doesn't work, even with the required patches merged. It seems to at first, but there are major issues.
For the TD3, did you check syslog for "objectTypeQuery returned NULL": #20725 ? (If I'm not mistaken both my th3 and td3 suffer from that.) Not that it helps if you do, but then you'll know where it is tracked. And there is activity going on.
Ah thanks. No, on master I just gave up early since nothing works. I figured someone probably had that covered; especially with 202411 branching soon, I was hopeful it would be fixed. At this point I'm trying to track down where the regressions occurred with vxlan, as I know a lot of people are running 202211 with vxlan on TD3.
Description
After adding an IP to Loopback0, using
config interface ip add Loopback0 10.1.0.1/32
and a cold reboot, swss/syncd fail on the Accton-AS9716-32D Broadcom Tomahawk 3.
Before rebooting, adding the IP works fine:
After reboot, we see this:
doDecapTunnelTask: Tunnel IPINIP_TUNNEL added to ASIC_DB
processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TYPE: SAI_TUNNEL_TERM_TABLE_ENTRY_TYPE_P2MP
create: create status: SAI_STATUS_NOT_SUPPORTED
This eventually results in:
Both syncd and orchagent (swss) stopping.
Steps to reproduce the issue:
master or 202405 branch (BRCM SAI ver: [11.2.13.1], OCP SAI ver: [1.14.0], SDK ver: [sdk-6.5.30-SP4], RedisRemoteSaiInterface: sairedis git revision 17e893c5, SAI git revision: e0e9787, sonic-swss e99f5a914 (probably))
# config interface ip add Loopback0 10.1.0.1/32
Describe the results you received:
Crashing orchagent/syncd.
Describe the results you expected:
No crashing.
Output of show version:
Build: https://github.com/ossobv/sonic-buildimage/tree/osso202405-20241113-0-49b1c0f39-dbg
Based on 202405 branch.
Additional information you deem important (e.g. issue happens only occasionally):
I think the culprit might be: sonic-net/sonic-swss@353ab92
The same image doesn't work on my Accton-AS7326-56X Trident 3 because of a different issue; but with a master build with libsai reverted to 10.1.42.0 it does work, adding the Loopback0 IP as appropriate, with the following logs: