Routing tables filled with garbage after some time, maybe hardware-related? #81
That garbage in the web GUI and the BLE app is due to the fact that I have not updated either of those to interpret the latest changes to the routing table. The routing protocol (LoRaLayer2) now uses four-byte addresses; as you can see, the ones reported in the web GUI (and probably the BLE client) are six bytes long. So it's not really garbage, but rather misinterpreted information. I haven't had time or energy to fix the web GUI and I don't know how to fix the BLE client.

Also, both use a "hack" to get the routing table in the first place. They just watch for routing table packets, "intercept" them, then interpret them. Instead, they should actively request the routing table from the LoRaClient. I started writing this feature into recent commits and it is what is supposed to be used by the Console, but it is not totally finished (or thought out), as evidenced by #77.
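To illustrate the six-byte misinterpretation described above, here is a minimal sketch. The helper name and layout are assumptions for illustration, not the actual LL2 or client API; the point is that a client still reading six bytes per entry pulls two bytes of the following field into each address.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format a raw LL2 address as a lowercase hex string.
// LoRaLayer2 now uses 4-byte addresses; a client that still reads 6 bytes
// per entry will absorb 2 bytes of the next field into the "address",
// which is how the apparently garbage entries arise.
std::string formatAddress(const uint8_t *addr, size_t len) {
    std::string out;
    char buf[3];
    for (size_t i = 0; i < len; i++) {
        snprintf(buf, sizeof(buf), "%02x", addr[i]);
        out += buf;
    }
    return out;
}
```

Reading the same buffer with `len = 4` vs. `len = 6` shows exactly the kind of "abfa0cfc" vs. "abfa0cfc01b6" mismatch reported later in this thread.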
The problem with garbage in the routing table, and a reboot after the /lora command, is also present in the rc.2 version...

/lora
abort() was called at PC 0x400f8bdf on core 1

Is "enhancement" really the right tag for this?
To be clear, the initial issue is related to the web app and the BLE app, which are really separate from the firmware. The bug you are reporting @BishopPeter1 looks to be of a different nature, probably more closely related to #77 and #88. That being said, thanks for reporting this, and I think you are correct that this should be labeled a bug and not an enhancement, regardless of which part of the code it is talking about.

I think I noticed this problem in the simulator, so I will try to reproduce it there. Not sure why the routes are getting all screwy. But I'm thinking the stack smashing is being caused because there is no check on the size of the routing table printout before sending it to the console (it has to fit in a datagram to be sent to the console). Somewhere between LoRaClient.cpp and the getRoutingTable() function in LoRaLayer2 there needs to be a check that the output routing table does not exceed 239 bytes (one byte per char in this case). Alternatively, we could remove the limit on datagram size in the Layer3 code and only impose the 239-byte datagram size limit in the LL2 code, which is where it actually matters. I actually like the second option, but it sounds slightly more difficult and could be confusing.
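The first option mentioned above could look something like the sketch below: split the routing-table printout into chunks that each fit in one datagram before handing them to the console. The 239-byte limit comes from the discussion; the function name and surrounding API are assumptions, not the real LoRaClient code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch: a datagram payload is limited to 239 bytes, so a long routing
// table printout must be broken into pieces no larger than that before
// each piece is sent to the console. Hypothetical helper, not LL2 API.
static const size_t DATAGRAM_MAX = 239;

std::vector<std::string> chunkForDatagrams(const std::string &table) {
    std::vector<std::string> chunks;
    for (size_t pos = 0; pos < table.size(); pos += DATAGRAM_MAX) {
        chunks.push_back(table.substr(pos, DATAGRAM_MAX));
    }
    return chunks;
}
```

With this kind of check in place, a routing table that grows past one datagram would no longer smash the stack; it would simply arrive at the console in several pieces.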
As seen in sudomesh/LoRaLayer2#17, something in the routing table is "byte-rotated", as shown below. Could this be caused by a wrong definition of the MAC address length in the routing table after it was shortened from the initial version?
Good catch. I haven't looked closely at the problem yet. It is possible that I missed something in that transition, but this problem seems more recent than that change. Based on some quick investigation using the simulator, I think this is most likely a problem of memory not being cleared properly. In the simulator I'm not able to reproduce the problem of more routes appearing; however, it does look like my routing packets contain more data than they should, except the extra data is all zeros (probably because the stack/heap are arranged differently). More investigation is needed to figure out why/where that memory is being overrun.
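The "memory not being cleared" hypothesis can be made concrete with a small sketch. The struct layout here is hypothetical, not the real LL2 packet format; the point is that a reused buffer that isn't zeroed carries stale bytes from a previous, longer packet after the valid entries, which matches the "extra data" seen in the simulator.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical routing-packet buffer, not the actual LL2 struct.
struct RoutePacket {
    uint8_t data[256];
    size_t  len;
};

// Clearing the buffer before filling it guarantees that anything past the
// valid entries is zero rather than leftovers from the previous packet.
// (Sending only `len` bytes, not sizeof(data), is the other half of the fix.)
void buildPacket(RoutePacket &pkt, const uint8_t *entries, size_t n) {
    memset(pkt.data, 0, sizeof(pkt.data));  // wipe stale bytes first
    memcpy(pkt.data, entries, n);           // then copy the real entries
    pkt.len = n;
}
```

On real hardware, where the stack/heap contents differ from the simulator, those stale bytes would look like random route entries instead of zeros.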
Fyi, @ those interested: I've rewritten the packet success and metric calculations so they actually make sense. I don't think this is the immediate cause of this issue, but it is something I noticed happening and is a start towards making sense of these garbage-filled routing tables. I've still been unable to reproduce this problem with even two real T-Beam boards. @BishopPeter1 what exactly was your setup? Was it just two nodes or was it three nodes? Were there any hops involved? Did you have a routing interval set, or were you doing the "manual"/reactive routing where messages must be sent to build the routing tables?
Note sudomesh/LoRaLayer2#17 (comment): I'm no longer seeing this problem in the LL2 sender/receiver example code, so I'm hoping some of my changes (maybe the packetSuccess/metric rewrite) solved the problem. I'll keep this issue open since I still haven't fixed the original bug with the web app and BLE app routing table printouts, and also because I'm not super confident I've solved the problem, since I can't explain why it's not happening anymore.
The rotated fake addresses seem to be gone. Partially correct addresses and completely random addresses are still displayed after a while. Currently I have 6 physical nodes running, and everything was fine until I connected the TTGO v1 (abfa0cfc). Its address was partially cut:
After that, things shortly went downhill. For example, the routing table was returned twice:
The number of peers on the display seems to have peaked at 41:
To me it looks like the bad routes are no longer broadcast by the v2 and T-Beam, but if a malformed packet is received for any reason, the error is propagated throughout the whole network. Some sanity checks may be needed on the receiving side: check that the value is a hex number, or even add some form of checksum to protect against unintentional RX errors. As for the original UI-related issue, the routing table is not currently displayed in the web UI at all. Maybe we should split this into several separate issues, but at this point I'm not sure how many problems are even present or which part of the codebase they're related to.
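The receive-side sanity checks suggested above could look roughly like this. The entry layout (4-byte address, hop count, metric) and the 32-hop ceiling are illustrative assumptions based on this thread, not the actual LL2 wire format; a real checksum would likely be a CRC rather than a simple XOR.

```cpp
#include <cstdint>
#include <cstddef>

// Simple XOR checksum over a buffer; a stand-in for whatever integrity
// check LL2 might adopt (a CRC would catch more RX corruption).
uint8_t xorChecksum(const uint8_t *buf, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) sum ^= buf[i];
    return sum;
}

// Reject a received route entry before it ever reaches the routing table.
// Assumed layout: bytes 0-3 address, byte 4 hop count, byte 5 metric.
bool entryLooksSane(const uint8_t *entry, size_t len, uint8_t expected) {
    if (len < 6) return false;        // too short to be a full entry
    if (entry[4] > 32) return false;  // absurd hop count (e.g. the 91 seen here)
    return xorChecksum(entry, len) == expected;
}
```

Dropping a corrupted entry at the receiver would keep one garbled packet from propagating bad routes across the whole mesh, which is the failure mode described above.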
My setup is: a small script on my computer is called every hour, telnets to the v2.1 TTGO, and writes something funny to the console. If I disable route learning (by setting the routing interval to 0), then all the problems disappear... but the two LoRa v1.0 boards never appear in the routing table, since they say nothing - they just hang around, switched on...
Looks like the problem is something to do with the TTGO v1, as we both have them?
Ok. I'm not sure I have more than 4 working boards at the moment, and I definitely don't have a TTGO V1 board. I'll try to reproduce in the simulator by creating a network with > 4 nodes (and maybe include some hops also). If the simulator works for a very long time (overnight?) without causing this issue, then I would feel fairly confident that this is somehow a hardware/LoRa-related problem. Note that @robgil in sudomesh/LoRaLayer2#17 was seeing this problem with Heltec V2 boards. I know the TTGO V1 is a particularly bad board design, and I'm not aware of quality issues with the Heltec V2, though @robgil claims to not be seeing the issue since the latest changes to LL2. I agree with @deafboy on splitting this issue. Let's keep this one as the "Routing tables filled with garbage after some time, maybe hardware-related?" issue, since the discussion has turned more in that direction. I'll create a new issue for the original problem of making the routing tables appear correctly (or appear at all?) in the web app and BLE app.
@paidforby I'm still seeing the issues after fixing the packet size. Still testing though and will report back. |
@robgil did you update to the latest commit of LL2 (or at least sudomesh/LoRaLayer2@e173bf4 )? Is your setup just two nodes (one sender and one receiver)? Or are you only seeing this with more than two? |
After leaving the sim running with seven nodes for more than 12 hours, I didn't observe any issues with the routing table when leaving them in proactive routing mode (see below). I did observe some oddities related to the updating of metrics (specifically metrics for nodes 2 or more hops away); I think I just need to review the logic used for creating those values. I also noticed that a stack smashing failure is generated upon printing the routing table once it becomes larger; this is because I didn't add any check to break the routing table printout into smaller chunks that fit inside individual datagrams. Both of these issues should be easy to fix, but I don't think they are directly related to the issue being observed on physical devices.
@paidforby running only 2 nodes. So far so good after removing the extra packet length from the datagram. Seeing only one neighbor in the route table now.
I'll run for a while longer to see if anything pops up. I bet the overflow from the packet is somehow polluting the routes, but this is just conjecture at this point. Latest testing code is here. This adds the non-blocking pseudo-delay() approach and fixes the packet length. EDIT: Just compiled/built with the latest commit and so far so good as well.
I'm using 1.0.0-rc.1 on a TTGO v1 and a TTGO v2. They can communicate with each other smoothly. However, the routing table on both nodes starts to fill with garbage after a while.
This is what the web UI and BLE client show on the v2:
abfa0cfc01b6 is my v1 node, connected directly, yet it shows a distance of 91 hops.
When I try to get more info via telnet the board resets: