Should we consider more proactive steps to adjust routes used for off-mesh routing? #10236

ndyck14 · 2024-05-14T18:41:04Z

For the topology above, we observed a case with a Bell Giga Hub (a prominent ISP-supplied router in Canada) where local network connectivity on Wi-Fi was disrupted for some but not all Border Routers. This prevented normal end-to-end communication from occurring. For at least 1 client investigated (a Mac), pings could be sent out to the mesh; however, the return path selected by the device in question (a REED) went back through 2 of the affected BRs, breaking connectivity.

In the packet trace (network key 3f7bb0362b5faa37b4915bcf5e2c02df), we can see the return path flipping between 0x8000 and 0x4800, both of which were affected routers (0x9c00 was unaffected, but outside of the listening range of the sniffer).

At least 1 of the affected border routers returned ICMPv6 Destination Unreachable responses with code 3 (Address unreachable). We assume the other was likely doing the same, but because it was out of reach of the sniffer, we did not capture its responses.

37211	May 14, 2024 10:17:39.065583000 EDT	1545.346422	::894e:a52c:6858:56b2	0xe400	fda3:6cc8:e61c:475c:ce2:deaf:fd0c:389e	0x8000	ICMPv6	65	Echo (ping) reply id=0x8aae, seq=83, hop limit=64	-58 dB
37215	May 14, 2024 10:17:39.144008000 EDT	1545.424847	::d967:9594:37d7:31bd	0x8000	::894e:a52c:6858:56b2	0xe400	ICMPv6	109	Destination Unreachable (Address unreachable)	-34 dB
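For anyone wanting to spot these responses programmatically, here is a minimal sketch of the check. The type and code values (type 1, code 3) come from RFC 4443; the helper name `IsAddressUnreachable` and the constants are hypothetical and are not OpenThread APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ICMPv6 values from RFC 4443.
constexpr uint8_t kIcmp6TypeDstUnreach      = 1; // Destination Unreachable
constexpr uint8_t kIcmp6CodeAddrUnreachable = 3; // Address unreachable

// Type (1 byte), code (1 byte), checksum (2 bytes), then 4 unused bytes.
constexpr size_t kIcmp6DstUnreachHeaderSize = 8;

// Hypothetical helper: `aIcmp6` points at the first byte of the ICMPv6 header.
// Returns true only for a Destination Unreachable message with code 3.
bool IsAddressUnreachable(const uint8_t *aIcmp6, size_t aLength)
{
    assert(aIcmp6 != nullptr);

    return (aLength >= kIcmp6DstUnreachHeaderSize) && (aIcmp6[0] == kIcmp6TypeDstUnreach) &&
           (aIcmp6[1] == kIcmp6CodeAddrUnreachable);
}
```

An Echo Reply (type 129) or a truncated buffer is rejected, so only the error seen in frame 37215 would match.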

It seems that, between routes of equal preference, selection is based entirely on mesh path cost and does not consider that other problems may prevent a response from reaching its destination.

openthread/src/core/thread/network_data_leader.cpp

Lines 302 to 347 in f12785d

    
    template <typename EntryType> int Leader::CompareRouteEntries(const EntryType &aFirst, const EntryType &aSecond) const
    {
        // `EntryType` can be `HasRouteEntry` or `BorderRouterEntry`.

        return CompareRouteEntries(aFirst.GetPreference(), aFirst.GetRloc(), aSecond.GetPreference(), aSecond.GetRloc());
    }

    int Leader::CompareRouteEntries(int8_t   aFirstPreference,
                                    uint16_t aFirstRloc,
                                    int8_t   aSecondPreference,
                                    uint16_t aSecondRloc) const
    {
        // Performs three-way comparison between two BR entries.

        int result;

        // Prefer the entry with higher preference.

        result = ThreeWayCompare(aFirstPreference, aSecondPreference);
        VerifyOrExit(result == 0);

    #if OPENTHREAD_MTD
        // On MTD, prefer the BR that is this device itself. This handles
        // the uncommon case where an MTD itself may be acting as BR.

        result = ThreeWayCompare((aFirstRloc == Get<Mle::Mle>().GetRloc16()), (aSecondRloc == Get<Mle::Mle>().GetRloc16()));
    #endif

    #if OPENTHREAD_FTD
        // If all the same, prefer the one with lower mesh path cost.
        // Lower cost is preferred so we pass the second entry's cost as
        // the first argument in the call to `ThreeWayCompare()`, i.e.,
        // if the second entry's cost is larger, we return 1 indicating
        // that the first entry is preferred over the second one.

        result = ThreeWayCompare(Get<RouterTable>().GetPathCost(aSecondRloc), Get<RouterTable>().GetPathCost(aFirstRloc));
        VerifyOrExit(result == 0);

        // If all the same, prefer the BR acting as a router over an
        // end device.

        result = ThreeWayCompare(Mle::IsActiveRouter(aFirstRloc), Mle::IsActiveRouter(aSecondRloc));
    #endif

    exit:
        return result;
    }
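To make the consequence concrete, here is a minimal standalone model of the FTD ordering in the snippet above. `BrCandidate`, `ThreeWay`, `CompareCandidates`, and `SelectBest` are hypothetical names for illustration, not OpenThread types or functions, and the `offMeshWorks` flag is ground truth that the comparison never reads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a border router entry; not an OpenThread type.
struct BrCandidate
{
    uint16_t rloc16;
    int8_t   preference;   // route preference (higher wins)
    uint8_t  pathCost;     // mesh path cost to the BR (lower wins)
    bool     isRouter;     // true if the BR is an active router
    bool     offMeshWorks; // ground truth that the comparison never consults
};

// Three-way compare: >0 if aFirst > aSecond, <0 if smaller, 0 if equal.
template <typename T> int ThreeWay(T aFirst, T aSecond)
{
    return (aFirst == aSecond) ? 0 : ((aFirst > aSecond) ? 1 : -1);
}

// Mirrors the FTD ordering: preference, then lower path cost, then router role.
// Returns >0 if aFirst is preferred, <0 if aSecond is preferred.
int CompareCandidates(const BrCandidate &aFirst, const BrCandidate &aSecond)
{
    int result = ThreeWay(aFirst.preference, aSecond.preference);

    if (result == 0)
    {
        // Arguments swapped on purpose: the lower cost must compare as "greater".
        result = ThreeWay(aSecond.pathCost, aFirst.pathCost);
    }

    if (result == 0)
    {
        result = ThreeWay(aFirst.isRouter, aSecond.isRouter);
    }

    return result;
}

// Returns the preferred candidate from a non-empty list.
const BrCandidate &SelectBest(const BrCandidate *aList, size_t aCount)
{
    assert(aList != nullptr && aCount > 0);

    const BrCandidate *best = &aList[0];

    for (size_t i = 1; i < aCount; i++)
    {
        if (CompareCandidates(aList[i], *best) > 0)
        {
            best = &aList[i];
        }
    }

    return *best;
}
```

With three equal-preference candidates where only the highest-cost one can deliver off-mesh, `SelectBest` still returns the lowest-cost candidate, which is the behaviour seen in the trace. The path costs used to exercise it are made up; the real ones were not captured.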

Should we consider approaches to address this? In this case, we were able to recover the affected links by power cycling or by toggling Wi-Fi (4 separate affected devices from different manufacturers), so the root cause is almost certainly a misbehaving Wi-Fi AP. But the problem remains: users will not intuitively know to power cycle their Border Routers.
faisal nest hub unreachable mac ota.pcapng.zip

jwhui · 2024-05-14T19:05:39Z

Maintainer

@ndyck14, thanks for raising this issue.

Ideally, the border router itself should detect whether it is having issues providing connectivity to infrastructure and, if so, stop offering its border router services. Other mechanisms place increased complexity on the devices themselves. For example, a router could cache the fact that packets to a given off-mesh destination are producing ICMPv6 errors and try routing subsequent packets through a different border router, but that adds significant state complexity to Thread routers.
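As a rough illustration of the state such a cache would add, here is a minimal sketch. Everything in it is hypothetical: `FailedRouteCache`, `SelectBorderRouter`, the table size, the 60-second hold time, and the `kNoBorderRouter` placeholder are assumptions for this example, and none of them exist in OpenThread or in the Thread specification.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Everything below is hypothetical; none of these names exist in OpenThread.
using Ip6Address = std::array<uint8_t, 16>;

constexpr uint16_t kNoBorderRouter = 0xfffe; // placeholder "none" value for this sketch
constexpr size_t   kMaxEntries     = 8;      // assumed table size
constexpr uint64_t kHoldTimeMs     = 60000;  // assumed time to avoid a failed BR

class FailedRouteCache
{
public:
    // Record that a packet to aDestination sent via aBrRloc16 drew an ICMPv6 error.
    void MarkFailed(const Ip6Address &aDestination, uint16_t aBrRloc16, uint64_t aNowMs)
    {
        Entry &entry = mEntries[mNext];

        mNext              = (mNext + 1) % kMaxEntries; // round-robin: oldest slot is reused first
        entry.mDestination = aDestination;
        entry.mBrRloc16    = aBrRloc16;
        entry.mExpiryMs    = aNowMs + kHoldTimeMs;
        entry.mInUse       = true;
    }

    // True while an unexpired failure is recorded for this (destination, BR) pair.
    bool IsFailed(const Ip6Address &aDestination, uint16_t aBrRloc16, uint64_t aNowMs) const
    {
        for (const Entry &entry : mEntries)
        {
            if (entry.mInUse && (entry.mBrRloc16 == aBrRloc16) && (entry.mDestination == aDestination) &&
                (aNowMs < entry.mExpiryMs))
            {
                return true;
            }
        }

        return false;
    }

private:
    struct Entry
    {
        Ip6Address mDestination{};
        uint16_t   mBrRloc16 = 0;
        uint64_t   mExpiryMs = 0;
        bool       mInUse    = false;
    };

    std::array<Entry, kMaxEntries> mEntries{};
    size_t                         mNext = 0;
};

// aSorted lists candidate BRs best-first (e.g. by preference, then path cost).
uint16_t SelectBorderRouter(const uint16_t         *aSorted,
                            size_t                  aCount,
                            const Ip6Address       &aDestination,
                            const FailedRouteCache &aCache,
                            uint64_t                aNowMs)
{
    assert(aSorted != nullptr || aCount == 0);

    for (size_t i = 0; i < aCount; i++)
    {
        if (!aCache.IsFailed(aDestination, aSorted[i], aNowMs))
        {
            return aSorted[i];
        }
    }

    // Every candidate failed recently: fall back to the normal best choice.
    return (aCount > 0) ? aSorted[0] : kNoBorderRouter;
}
```

Even this toy version needs per-destination entries, expiry handling, and an eviction policy on every router, which is the complexity cost being weighed here.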

It appears you are proposing to investigate changes to the existing Thread protocol specification. If so, I would suggest raising this issue within the Thread Group.

ndyck14 · 2024-05-15T13:18:47Z

Author

No, I don't think spec work would necessarily be required. This could just be an optimization in the OT implementation.

For detection at the affected BR, this case is probably not possible to detect. The affected BR had working upstream connectivity, and multicast was functioning. It seemed the AP was only blocking LAN unicast messages to particular nodes for some reason.

I don't think routers can make any decisions, as once the packet originates from its sender, its 6LoWPAN destination (BR) is concrete, regardless of a (possibly) winding path.

The only place to optimize is the originating device (whether a router or an ED). The idea is that the device observes traffic from a particular address arriving via a particular BR (in this case 0x9c00). Some mechanism then lets route selection bias toward that implied working route, even if it is not the lowest path cost.
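A minimal sketch of what that biasing could look like, assuming the originator can learn which BR injected an inbound packet. All names here (`InboundHintTable`, `ChooseBorderRouter`, `Candidate`, `kNoBorderRouter`, the table size) are hypothetical and do not exist in OpenThread. The hint is only honoured among candidates of equal preference, so it replaces the path cost tie-breaker rather than the preference ordering.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Everything below is hypothetical; none of these names exist in OpenThread.
using Ip6Address = std::array<uint8_t, 16>;

constexpr uint16_t kNoBorderRouter = 0xfffe; // placeholder "none" value for this sketch
constexpr size_t   kMaxHints       = 4;      // assumed table size

struct Candidate
{
    uint16_t mRloc16;
    int8_t   mPreference; // higher wins
    uint8_t  mPathCost;   // lower wins
};

class InboundHintTable
{
public:
    // Remember that traffic from off-mesh peer aPeer arrived via BR aViaRloc16.
    void Observe(const Ip6Address &aPeer, uint16_t aViaRloc16)
    {
        Entry *slot = nullptr;

        for (Entry &entry : mEntries)
        {
            if (entry.mInUse && (entry.mPeer == aPeer))
            {
                slot = &entry; // refresh the existing hint for this peer
                break;
            }
        }

        if (slot == nullptr)
        {
            slot  = &mEntries[mNext];
            mNext = (mNext + 1) % kMaxHints; // round-robin reuse of slots
        }

        slot->mPeer      = aPeer;
        slot->mViaRloc16 = aViaRloc16;
        slot->mInUse     = true;
    }

    bool Lookup(const Ip6Address &aPeer, uint16_t &aViaRloc16) const
    {
        for (const Entry &entry : mEntries)
        {
            if (entry.mInUse && (entry.mPeer == aPeer))
            {
                aViaRloc16 = entry.mViaRloc16;
                return true;
            }
        }

        return false;
    }

private:
    struct Entry
    {
        Ip6Address mPeer{};
        uint16_t   mViaRloc16 = 0;
        bool       mInUse     = false;
    };

    std::array<Entry, kMaxHints> mEntries{};
    size_t                       mNext = 0;
};

uint16_t ChooseBorderRouter(const Candidate        *aCandidates,
                            size_t                  aCount,
                            const Ip6Address       &aPeer,
                            const InboundHintTable &aHints)
{
    assert(aCandidates != nullptr || aCount == 0);

    if (aCount == 0)
    {
        return kNoBorderRouter;
    }

    // Default choice: highest preference, then lowest path cost.
    const Candidate *best = &aCandidates[0];

    for (size_t i = 1; i < aCount; i++)
    {
        const Candidate &c = aCandidates[i];

        if ((c.mPreference > best->mPreference) ||
            ((c.mPreference == best->mPreference) && (c.mPathCost < best->mPathCost)))
        {
            best = &c;
        }
    }

    // Bias toward the BR the peer's traffic arrived through, if it ties on preference.
    uint16_t hinted = kNoBorderRouter;

    if (aHints.Lookup(aPeer, hinted))
    {
        for (size_t i = 0; i < aCount; i++)
        {
            if ((aCandidates[i].mRloc16 == hinted) && (aCandidates[i].mPreference == best->mPreference))
            {
                return hinted;
            }
        }
    }

    return best->mRloc16;
}
```

In the scenario above, the REED would call `Observe()` with the Mac's address and 0x9c00 when the echo request arrives, and the reply would then go back via 0x9c00 instead of 0x8000 or 0x4800.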

Now that I'm writing this, it strikes me that spec may require the route to be selected as lowest path cost? I know that is at least true for the sake of on-mesh routing, but as mentioned, this is a selection between different off-mesh routes, so not sure if that is included (e.g. is path cost tie-breaker for route priority baked into spec? or just implementation decision).

jwhui May 15, 2024
Maintainer

> I don't think routers can make any decisions, as once the packet originates from its sender, its 6LoWPAN destination (BR) is concrete, regardless of a (possibly) winding path.

For MTDs, routers are responsible for selecting the mesh destination and performing the RLOC16 address resolution.

> Now that I'm writing this, it strikes me that spec may require the route to be selected as lowest path cost? I know that is at least true for the sake of on-mesh routing, but as mentioned, this is a selection between different off-mesh routes, so not sure if that is included (e.g. is path cost tie-breaker for route priority baked into spec? or just implementation decision).

Thread 1.3.0 Section 5.10.2 specifies how Thread devices forward unicast packets to off-mesh destinations. The set of rules is quite a bit more complex than just selecting the lowest cost route. But it does not involve an end device attempting to choose a different border router when it receives ICMPv6 error messages.

ndyck14 May 21, 2024
Author

Interesting. OK, so yes, I suppose spec is the right next spot to discuss further.
