-
@ndyck14, thanks for raising this issue. Ideally, the border router itself should detect whether it is having issues providing connectivity to infrastructure and, if so, stop offering its border router services. Other mechanisms place increased complexity on the devices themselves. For example, a router could cache the fact that reaching a given off-mesh destination results in ICMPv6 errors and try routing subsequent packets through a different border router, but that adds significant state complexity to Thread routers. It appears you are proposing to investigate changes to the existing Thread protocol specification. If so, I would suggest raising this issue within the Thread Group.
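To make the state cost of that caching idea concrete, here is a minimal sketch of a per-destination negative cache. Everything in it is hypothetical and is not OpenThread code: the names `UnreachableCache`, `BorderRouterCandidate`, `SelectBorderRouter`, and `kNoBorderRouter` are made up, a `uint32_t` stands in for the off-mesh IPv6 destination, and time is passed in as plain seconds. It also assumes the sender can tell which border router produced the ICMPv6 error.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; none of these are OpenThread APIs.
constexpr uint16_t kNoBorderRouter = 0xfffe; // sentinel: no candidate available

struct BorderRouterCandidate
{
    uint16_t rloc16;   // RLOC16 of the border router
    uint8_t  pathCost; // mesh path cost to reach it (lower is better)
};

// Remembers (destination, border router) pairs that recently produced an
// ICMPv6 Destination Unreachable error, for a fixed hold time.
class UnreachableCache
{
public:
    explicit UnreachableCache(uint32_t holdSeconds)
        : mHoldSeconds(holdSeconds)
    {
    }

    void RecordFailure(uint32_t destId, uint16_t rloc16, uint32_t now)
    {
        mExpiry[Key(destId, rloc16)] = now + mHoldSeconds;
    }

    bool IsSuppressed(uint32_t destId, uint16_t rloc16, uint32_t now) const
    {
        auto it = mExpiry.find(Key(destId, rloc16));
        return it != mExpiry.end() && now < it->second;
    }

private:
    using Key = std::pair<uint32_t, uint16_t>; // (destination stand-in, RLOC16)

    uint32_t                mHoldSeconds;
    std::map<Key, uint32_t> mExpiry; // this map is the extra per-destination state
};

// Picks the lowest-cost border router that is not suppressed for destId.
// If every candidate is suppressed, falls back to plain lowest cost.
uint16_t SelectBorderRouter(const std::vector<BorderRouterCandidate> &candidates,
                            const UnreachableCache &cache, uint32_t destId, uint32_t now)
{
    const BorderRouterCandidate *best     = nullptr;
    const BorderRouterCandidate *fallback = nullptr;

    for (const BorderRouterCandidate &c : candidates)
    {
        if (fallback == nullptr || c.pathCost < fallback->pathCost)
        {
            fallback = &c;
        }
        if (cache.IsSuppressed(destId, c.rloc16, now))
        {
            continue;
        }
        if (best == nullptr || c.pathCost < best->pathCost)
        {
            best = &c;
        }
    }

    if (best == nullptr)
    {
        best = fallback;
    }
    return (best != nullptr) ? best->rloc16 : kNoBorderRouter;
}
```

The map grows with the number of distinct off-mesh destinations a device talks to, which is the state complexity mentioned above. A real design would also need a size bound and an eviction policy.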
-
No, I don't think spec work would necessarily be required. This could just be an optimization in the OT implementation. Detection at the affected BR is probably not possible in this case: the affected BR had working upstream connectivity, and multicast was functioning. It seemed the AP was only blocking LAN unicast messages to particular nodes for some reason. I don't think routers can make any decisions, because once the packet leaves its sender, its 6LoWPAN destination (the BR) is fixed, regardless of a (possibly) winding path. The only optimization can be in the originating device (whether router or ED). The idea is that the device observes traffic from a particular address being routed in through a particular BR (in this case 0x9c00). Some mechanism then allows route selection to bias toward that implied working route, even if it isn't the lowest path cost. Now that I'm writing this, it strikes me that the spec may require the route with the lowest path cost to be selected? I know that is at least true for on-mesh routing, but as mentioned, this is a selection between different off-mesh routes, so I'm not sure whether that is included (e.g. is the path cost tie-breaker for route priority baked into the spec, or is it just an implementation decision?).
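As a rough sketch of what I mean, assuming the originating device can learn which BR an inbound packet entered the mesh through (how it learns that is left out here), the bias could look something like this. All names (`IngressHints`, `OffMeshRoute`, `SelectEgress`, `kNoRoute`) are hypothetical and not taken from OpenThread, and a `uint32_t` stands in for the off-mesh peer address.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only; none of these are OpenThread APIs.
constexpr uint16_t kNoRoute = 0xfffe; // sentinel: no off-mesh route available

struct OffMeshRoute
{
    uint16_t rloc16;   // RLOC16 of the BR advertising the route
    uint8_t  pathCost; // mesh path cost to that BR (lower is better)
};

// Remembers, per off-mesh peer, the BR its traffic was last seen arriving through.
class IngressHints
{
public:
    void NoteInbound(uint32_t peerId, uint16_t ingressRloc16)
    {
        mLastIngress[peerId] = ingressRloc16;
    }

    bool Lookup(uint32_t peerId, uint16_t &rloc16) const
    {
        auto it = mLastIngress.find(peerId);
        if (it == mLastIngress.end())
        {
            return false;
        }
        rloc16 = it->second;
        return true;
    }

private:
    std::unordered_map<uint32_t, uint16_t> mLastIngress;
};

// Prefers the BR the peer's traffic arrived through, as long as that BR still
// advertises a route; otherwise falls back to the lowest path cost.
uint16_t SelectEgress(const std::vector<OffMeshRoute> &routes, const IngressHints &hints,
                      uint32_t peerId)
{
    uint16_t            hinted  = kNoRoute;
    bool                hasHint = hints.Lookup(peerId, hinted);
    const OffMeshRoute *lowest  = nullptr;

    for (const OffMeshRoute &r : routes)
    {
        if (hasHint && r.rloc16 == hinted)
        {
            return r.rloc16; // return path mirrors the observed inbound path
        }
        if (lowest == nullptr || r.pathCost < lowest->pathCost)
        {
            lowest = &r;
        }
    }
    return (lowest != nullptr) ? lowest->rloc16 : kNoRoute;
}
```

With routes through 0x8000, 0x4800, and 0x9c00, a single `NoteInbound(peer, 0x9c00)` would steer replies to that peer back through 0x9c00 even if it has the highest path cost, while traffic to other peers keeps using the lowest-cost BR.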
-
For the topology above, we observed a case with a Bell Giga Hub (a prominent ISP-supplied router in Canada) where local network connectivity on Wi-Fi was disrupted for some but not all Border Routers. This prevented normal end-to-end communication from occurring. For at least one client investigated (a Mac), pings could be sent out to the mesh, but the return path selected by the device in question (a REED) went back through two of the affected BRs, breaking connectivity.
In the packet trace (network key 3f7bb0362b5faa37b4915bcf5e2c02df), we can see the return path flipping between 0x8000 and 0x4800, both of which were affected routers (0x9c00 was unaffected, but outside of the listening range of the sniffer). At least one of the affected border routers provided ICMPv6 responses with code 3 (Address Unreachable). We assume the other was likely doing the same, but because it was out of reach of the sniffer, we did not capture its responses.
It seems that route selection is entirely based on path cost on mesh and does not consider the possibility there may be other problems preventing a response from reaching back to its destination.
openthread/src/core/thread/network_data_leader.cpp
Lines 302 to 347 in f12785d
Should we consider approaches to address this? In this case, we were able to recover the affected links by power cycling or by toggling Wi-Fi (4 separate affected devices from different manufacturers), so the root cause is almost certainly a misbehaving Wi-Fi AP. But the problem remains: users will not intuitively know to power cycle their Border Routers.
faisal nest hub unreachable mac ota.pcapng.zip