WiFi Mesh unstable when parent offline (IDFGH-13875) #14720

Closed
3 tasks done
michaelsimp opened this issue Oct 14, 2024 · 73 comments
Labels: Resolution: NA (Issue resolution is unavailable), Status: In Progress (Work is in progress), Type: Bug (bugs in IDF)

Comments

@michaelsimp

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.3.0

Espressif SoC revision.

Chip is ESP32-S3 (QFN56) (revision v0.2)

Operating System used.

Windows

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

PowerShell

Development Kit.

ESP32-S3-WROOM-1

Power Supply used.

USB

What is the expected behavior?

I expect the ESP32 to continue to run the application without crashing when the WIFI Mesh parent disappears.
If the MESH_ROOT was powered off, I expect a MESH_NODE to assume the role of MESH_ROOT
If the WIFI Router is powered off, when I restore it, I expect the mesh network to establish itself

What is the actual behavior?

  1. Sometimes these tests work perfectly. The Mesh network goes down and the nodes start scanning. If I restore the WiFi router, the Mesh network is reestablished
  2. Sometimes the Mesh network goes down, and can't recover. It doesn't crash but it doesn't scan properly and reestablish the Mesh network.
  3. Very regularly, if I power off the WiFi router, the MESH_ROOT intermittently crashes, OR if I power off the MESH_ROOT, a MESH_NODE intermittently crashes.

Steps to reproduce.

  1. Power on system comprising 2 x ESP32-S3 dev boards and a Wifi router
  2. Connect a serial terminal (I am using PUTTY) to each serial port for monitoring
  3. Let the Mesh network get established and verify MESH_ROOT and MESH_NODE connected.
  4. Power off the WiFi Router
  5. In the example (logs below) the MESH_ROOT crashed

Debug Logs.

I (00:58:03.336) aWifiMesh: <MESH_EVENT_MESH_STARTED>ID:77:77:77:77:77:76
I (136546) mesh: <MESH_NWK_LOOK_FOR_NETWORK>need_scan:0x3, need_scan_router:0x0, look_for_nwk_count:1
I (00:58:03.336) aWifiMesh: This node MAC:48:ca:43:9b:53:d8
I (00:58:03.354) aWifiMesh: WiFi Mesh started successfully, heap:141084, root not fixed
WIN> I (140766) mesh: [S6]VONETS, 00:17:13:20:bd:74, channel:8, rssi:-12
I (140776) mesh: find router:[ssid_len:6]VONETS, rssi:-12, 00:17:13:20:bd:74(encrypted), new channel:8, old channel:0
I (140786) mesh: [FIND][ch:0]AP:11, otherID:0, MAP:1, idle:1, candidate:0, root:0[00:17:13:20:bd:74]router found
I (140796) mesh: [FIND:1]find a network, channel:8, cfg<channel:0, router:VONETS, 00:00:00:00:00:00>

I (00:58:07.590) aWifiMesh: <MESH_EVENT_FIND_NETWORK>new channel:8, router BSSID:00:00:00:00:00:00
W (140796) wifi:<MESH AP>adjust channel:1, secondary channel offset:1(40U)
W (140816) wifi:<MESH AP>adjust channel:8, secondary channel offset:1(40U)
I (141126) mesh: [SCAN][ch:8]AP:1, other(ID:0, RD:0), MAP:0, idle:0, candidate:1, root:0, topMAP:0[c:0,i:0][00:17:13:20:bd:74]router found<>
I (141126) mesh: 1330[SCAN]init rc[48:ca:43:9b:53:d9,-9], mine:0, voter:0
I (141136) mesh: 1368, vote myself, router rssi:-9 > voted rc_rssi:-120
I (141146) mesh: [SCAN:1/10]rc[128][48:ca:43:9b:53:d9,-9], self[48:ca:43:9b:53:d8,-9,reason:0,votes:1,idle][mine:1,voter:1(1.00)percent:1.00][128,1,48:ca:43:9b:53:d9]

I (141456) mesh: [SCAN][ch:8]AP:2, other(ID:0, RD:0), MAP:1, idle:1, candidate:1, root:0, topMAP:0[c:0,i:1][00:17:13:20:bd:74]router found<>
I (141466) mesh: [SCAN:2/10]rc[128][48:ca:43:9b:53:d9,-8], self[48:ca:43:9b:53:d8,-8,reason:0,votes:1,idle][mine:1,voter:2(0.50)percent:1.00][128,1,48:ca:43:9b:53:d9]

I (141776) mesh: [SCAN][ch:8]AP:2, other(ID:0, RD:0), MAP:1, idle:0, candidate:1, root:1, topMAP:0[c:0,i:0][00:17:13:20:bd:74]router found<>
I (141776) mesh: 7391[selection]try rssi_threshold:-78, backoff times:0, max:5<-78,-82,-85>
I (141796) mesh: [DONE]connect to parent:ESPM_3372B8, channel:8, rssi:-15, 30:30:f9:33:72:b9[layer:1, assoc:0], my_vote_num:0/voter_num:0, rc[00:00:00:00:00:00/-8/0]
I (141806) mesh: set router bssid:00:17:13:20:bd:74
I (142596) mesh: <MESH_NWK_MIE_CHANGE><><><><ROOT ADDR><><><>
I (142596) mesh: <MESH_NWK_ROOT_ADDR>from assoc, layer:2, root_addr:30:30:f9:33:72:b9, root_cap:1
I (142616) mesh: <MESH_NWK_ROOT_ADDR>idle, layer:2, root_addr:30:30:f9:33:72:b9, conflict_roots.num:0<>
I (00:58:09.409) aWifiMesh: <MESH_EVENT_ROOT_ADDRESS>root address:30:30:f9:33:72:b9
I (142616) mesh: [scan]new scanning time:600ms, beacon interval:300ms
I (142636) mesh: 2012<arm>parent monitor, my layer:2(cap:6)(node), interval:7286ms, retries:1<normal connected>
I (00:58:09.436) aWifiMesh: <MESH_EVENT_PARENT_CONNECTED>layer:1-->2, parent:30:30:f9:33:72:b9<layer2>, ID:77:77:77:77:77:76
I (00:58:09.451) mesh_netif: It was a wifi station removing stuff
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4212753c  PS      : 0x00060030  A0      : 0x82127613  A1      : 0x3fcc1660
A2      : 0xffffffff  A3      : 0x00000000  A4      : 0xff000000  A5      : 0x00000001
A6      : 0x3fcc0a64  A7      : 0xff000000  A8      : 0x3c1505e4  A9      : 0x00000000
A10     : 0x3fcc0a64  A11     : 0x00000000  A12     : 0x00000101  A13     : 0x3c1505e4
A14     : 0x00000007  A15     : 0x3fcd8024  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0xff00000c  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0xffffffff


Backtrace: 0x42127539:0x3fcc1660 0x42127610:0x3fcc16b0 0x4037e0aa:0x3fcc16d0
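
Both register dumps in this report show the same pattern, which a short off-target helper can make explicit. The sketch below is plain C, not ESP-IDF code; the struct, the function name, and the choice of A4 as the base register are illustrative assumptions. It checks that EXCCAUSE is 0x1c (LoadProhibited, as the panic line states) and reports how far EXCVADDR lies past the base register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* EXCCAUSE value printed alongside "LoadProhibited" in the dumps. */
#define EXCCAUSE_LOAD_PROHIBITED 0x1cu

/* Hypothetical container for the three registers of interest. */
typedef struct {
    uint32_t exccause;
    uint32_t excvaddr;
    uint32_t base;      /* assumption: A4 held the pointer being dereferenced */
} crash_regs_t;

/*
 * Returns true and stores (EXCVADDR - base) in *offset when the fault is a
 * LoadProhibited whose address lies at most max_offset bytes past base.
 */
bool fault_offset(const crash_regs_t *regs, uint32_t max_offset, uint32_t *offset)
{
    assert(regs != NULL && offset != NULL);
    if (regs->exccause != EXCCAUSE_LOAD_PROHIBITED) {
        return false;   /* some other exception type */
    }
    if (regs->excvaddr < regs->base) {
        return false;   /* fault address is below the assumed base */
    }
    uint32_t delta = regs->excvaddr - regs->base;
    if (delta > max_offset) {
        return false;   /* too far away to be a plausible field offset */
    }
    *offset = delta;
    return true;
}
```

Fed with this dump (A4 = 0xff000000, EXCVADDR = 0xff00000c) and the second dump further down in this report (A4 = 0x00000278, EXCVADDR = 0x00000284), the offset is 0xc in both cases. That is consistent with, though not proof of, the same code reading a field 12 bytes into a structure through an invalid pointer.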

More Information.

My application integrates a number of IDF example programs including ip_internal_network
I went back to the example project ip_internal_network and built it unmodified, and can reproduce the same problems quite readily.

Also, for when the ESP32 nodes don't completely crash, I would like to know how to restart the Mesh network in software.
I have tried stopping the Mesh network with:
ESP_ERROR_CHECK(esp_mesh_stop());
ESP_ERROR_CHECK(esp_mesh_deinit());
ESP_ERROR_CHECK(mesh_netifs_destroy()); // I have tried with and without this line. Without it, the logs continually report:
I (135746) mesh: mesh is not started
E (00:58:02.547) mesh_netif: Received with err code 16388 ESP_ERR_MESH_NOT_START

I then try to restart the Mesh network with:
/* mesh initialization */
ESP_ERROR_CHECK(esp_mesh_init());
ESP_ERROR_CHECK(esp_mesh_set_max_layer(CONFIG_MESH_MAX_LAYER));
ESP_ERROR_CHECK(esp_mesh_set_vote_percentage(1));
ESP_ERROR_CHECK(esp_mesh_set_ap_assoc_expire(10));
/* set blocking time of esp_mesh_send() to 30s, to prevent the esp_mesh_send() from blocking permanently for some reason */
ESP_ERROR_CHECK(esp_mesh_send_block_time(5000)); // was 30 seconds
mesh_cfg_t cfg = MESH_INIT_CONFIG_DEFAULT();
cfg.crypto_funcs = NULL;
/* mesh ID */
memcpy((uint8_t *) &cfg.mesh_id, MESH_ID, MAC_SIZE);
/* router */
cfg.channel = CONFIG_MESH_CHANNEL;

cfg.router.ssid_len = strlen(meshProvisionData.ssid);
memcpy((uint8_t *) &cfg.router.ssid, meshProvisionData.ssid, cfg.router.ssid_len);
memcpy((uint8_t *) &cfg.router.password, meshProvisionData.password, strlen(meshProvisionData.password));

ESP_ERROR_CHECK(esp_mesh_set_ap_authmode((wifi_auth_mode_t) CONFIG_MESH_AP_AUTHMODE));
cfg.mesh_ap.max_connection = CONFIG_MESH_AP_CONNECTIONS;
cfg.mesh_ap.nonmesh_max_connection = CONFIG_MESH_NON_MESH_AP_CONNECTIONS;
memcpy((uint8_t *) &cfg.mesh_ap.password, CONFIG_MESH_AP_PASSWD, strlen(CONFIG_MESH_AP_PASSWD));
ESP_ERROR_CHECK(esp_mesh_set_config(&cfg));
ESP_ERROR_CHECK(esp_mesh_start());

Doing the above when the system is running normally often causes the ESP32s to crash with various errors,
e.g. after start, the MESH_NODE does a scan and then crashes with Guru Meditation Error: Core 0 panic'ed
I (00:22:02.752) aWifiMesh: <MESH_EVENT_FIND_NETWORK>new channel:8, router BSSID:00:00:00:00:00:00
W (1323864) wifi:adjust channel:1, secondary channel offset:1(40U)
W (1323874) wifi:adjust channel:8, secondary channel offset:1(40U)
I (1324184) mesh: [SCAN][ch:8]AP:2, other(ID:0, RD:0), MAP:1, idle:0, candidate:1, root:1, topMAP:0[c:0,i:0][00:17:13:20:bd:74]router found<>
I (1324184) mesh: 7391[selection]try rssi_threshold:-78, backoff times:0, max:5<-78,-82,-85>
I (1324204) mesh: [DONE]connect to parent:ESPM_3372B8, channel:8, rssi:-14, 30:30:f9:33:72:b9[layer:1, assoc:0], my_vote_num:0/voter_num:0, rc[00:00:00:00:00:00/-120/0]
I (1324214) mesh: set router bssid:00:17:13:20:bd:74
I (1324834) mesh: <MESH_NWK_MIE_CHANGE><><><><><><>
I (1324834) mesh: <MESH_NWK_ROOT_ADDR>from assoc, layer:2, root_addr:30:30:f9:33:72:b9, root_cap:1
I (1324844) mesh: <MESH_NWK_ROOT_ADDR>idle, layer:2, root_addr:30:30:f9:33:72:b9, conflict_roots.num:0<>
I (1324854) mesh: [scan]new scanning time:600ms, beacon interval:300ms
I (00:22:03.744) aWifiMesh: <MESH_EVENT_ROOT_ADDRESS>root address:30:30:f9:33:72:b9
I (1324854) mesh: 2012parent monitor, my layer:2(cap:6)(node), interval:4526ms, retries:1
I (00:22:03.771) aWifiMesh: <MESH_EVENT_PARENT_CONNECTED>layer:2-->2, parent:30:30:f9:33:72:b9, ID:77:77:77:77:77:76
I (00:22:03.785) mesh_netif: It was a wifi station removing stuff
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.

Core 0 register dump:
PC : 0x4212753c PS : 0x00060830 A0 : 0x82127613 A1 : 0x3fcc15d0
A2 : 0xffffffff A3 : 0x00000000 A4 : 0x00000278 A5 : 0x00000001
A6 : 0x3fcc09d0 A7 : 0x00000278 A8 : 0x3c1505e4 A9 : 0x3fcd778c
A10 : 0x3fcc09d0 A11 : 0x00000000 A12 : 0x00000101 A13 : 0x3c1505e4
A14 : 0x00000007 A15 : 0x3fcaa7f4 SAR : 0x00000004 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000284 LBEG : 0x40056f5c LEND : 0x40056f72 LCOUNT : 0xffffffff

Backtrace: 0x42127539:0x3fcc15d0 0x42127610:0x3fcc1620 0x4037e0aa:0x3fcc1640

I sometimes get MTX task stack overflows too when I try this, same as #13882

@michaelsimp michaelsimp added the Type: Bug bugs in IDF label Oct 14, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Oct 14, 2024
@github-actions github-actions bot changed the title WiFi Mesh unstable when parent offline WiFi Mesh unstable when parent offline (IDFGH-13875) Oct 14, 2024
@zhangyanjiaoesp
Collaborator

@michaelsimp
I tested with the ip_internal_network example but could not reproduce your problem. Can you provide the .elf file from the build that crashes, or the decoded core dump?

@michaelsimp
Author

michaelsimp commented Oct 24, 2024 via email

@michaelsimp
Author

michaelsimp commented Oct 24, 2024

Hi
I did a quick set of tests today.

I created the ip_internal_network project from examples and configured as follows:
Set IDF version to 5.3.0
Set target device to ESP32-S3 with jtag integrated debugger
Set partition table Factory partition to 0x400000
Set device flash size to 16MB - matches my ESP32-S3
Set the Router SSID to "VONETS" and Password to "pass9999"
Set Panic Handler to "Print registers and halt"

Clean and Build project.
Load into 2 ESP32-S3 with serial terminals connected for monitoring.
One becomes MESH_ROOT and the other connects as MESH_NODE
Took turns at powering off the MESH_ROOT and watching the other become MESH_ROOT and then powering it back on and it connects as a MESH_NODE. This seemed to work ok today.

But what I did find easy to reproduce was:
Power off MESH_ROOT and power back on BEFORE other MESH_NODE becomes MESH_ROOT.
The original MESH_ROOT I power cycled, becomes MESH_ROOT again, but the MESH_NODE remains disconnected.

See attached files:
MESH_ROOT powered off at line 171
MESH_NODE loses connection around line 33 and never recovers

Also see the .elf and .bin files in the attachment. I don't know what or where the "core dump decode file" is, but these tests don't show a CPU crash. I can't run under the debugger because of the power-cycle tests.

Please see attachment MESH-Testing.zip two comments down

@michaelsimp
Author

I did some more tests which can easily cause CPU crashes.

Power on both nodes, one becomes MESH_ROOT and one MESH_NODE.
Power off WIFI router.
MESH_ROOT crashes. See file attached MESH_ROOT "Router power off.txt" (crash at end)

Second test. Power up only one node, becomes MESH_ROOT
Power off Router
This time MESH_ROOT does not crash
Power on Router
MESH_ROOT does not crash
Power on a second node which connects to the first MESH_ROOT - see line 492
MESH_ROOT crashes.
See file attached "MESH_ROOT crash on MESH_NODE connect.txt"
MESH_ Testing 2.zip

@michaelsimp
Author

MESH-Testing.zip
This is the attachment for the first tests, two entries up. It did not upload properly before.

@brianignacio5
Contributor

Hi @michaelsimp

The esp-idf vscode extension allows you to save settings in multiple places: User (global settings for vscode), Workspace, and Workspace folder (your project's .vscode/settings.json). The ESP-IDF: Show Examples command shows you the esp-idf path used in the current vscode window. You can change where settings are saved with the ESP-IDF: Select where to save configuration settings command. It sounds confusing, but it allows you to use multiple projects, each with a different esp-idf version (even at the same time, using a vscode workspace). More information here

It seems the example you are trying to use has some components with version-specific behavior. So building a v5.2.2 example using esp-idf v5.3 might produce some compilation problems. How about creating an example using esp-idf v5.3?

  1. Open a vscode window.
  2. Select esp-idf v5.3 from status bar (recommended) or the ESP-IDF: Configure ESP-IDF extension.
  3. Run the ESP-IDF: Doctor command. Check that esp-idf is indeed using v5.3
  4. Run the ESP-IDF: Show examples. The esp-idf path shown should be v5.3 now
  5. Create your project from esp-idf example and try to build.

We will try to update the Show examples command to show all available esp-idf versions from esp-idf vscode extension to make this easier.

@zhangyanjiaoesp
Collaborator

@michaelsimp The backtrace of the crash issue is here:

xtensa-esp32s3-elf-addr2line -piaf 0x4208c922:0x3fca7e60 0x4201c951:0x3fca7ee0 0x4201f96f:0x3fca7f20 0x4202345e:0x3fca7f50 0x420167bd:0x3fca7f70 -e ip_internal_network.elf

0x4208c922: parse_msg at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/apps/dhcpserver/dhcpserver.c:993
 (inlined by) handle_dhcp at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/apps/dhcpserver/dhcpserver.c:1190
 (inlined by) handle_dhcp at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/apps/dhcpserver/dhcpserver.c:1106
0x4201c951: udp_input at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/core/udp.c:404
0x4201f96f: ip4_input at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:746
0x4202345e: ethernet_input at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/netif/ethernet.c:186
0x420167bd: tcpip_thread_handle_msg at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/api/tcpip.c:174
 (inlined by) tcpip_thread at C:\Users\micha\Documents\Cybertek\Software\ip_internal_network\build/C:/Users/micha/esp/v5.3/esp-idf/components/lwip/lwip/src/api/tcpip.c:148

I think this issue is caused by a mismatch between your IDF version and the version the example was created from. Please update your version according to Brian's suggestion and test it again.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Oct 29, 2024
@michaelsimp
Author

I installed IDF version 5.3.1
I had a problem at step 2 of your instructions two entries up, which states: "Select esp-idf v5.3 from status bar (recommended) or the ESP-IDF: Configure ESP-IDF extension."
When I try, it reports, "Open a folder first."
So I opened a folder of a project I made in version IDF 3.0. Then on the status bar I could select version 5.3.1.
When I run the ESP-IDF: Doctor command, I get the following errors:

  • Extension configuration report has been copied to the clipboard with errors.
  • Cannot open file ../report.txt. Detail: Files above 50MB cannot be synchronized with extensions.
    I checked my report.txt and found it was over 181MB
    I tried continuing:
    When I select Show examples it only shows 5.3.1 which is good.
    I created ip_internal_network, but when completed the status bar reports ESP-IDF v5.2.2 again.
    So I tried deleting the large report.txt and trying again.
    Same problem, it created a report.txt of 181MB again

@brianignacio5
Contributor

Delete this file:

%USERPROFILE%\.vscode\extensions\espressif.esp-idf-extension-VERSION\esp_idf_vsc_ext.log

and try to run the ESP-IDF: Doctor command again. It seems your extension log has been logging a lot and has grown past the vscode size limit.

About the ESP-IDF v5.2.2 again, it is because the newly created project does not set settings when created. You can select the v5.3.0 from status bar again.

Again sorry for this issue, will work to make it easier to use in the next release of esp-idf extension.

@michaelsimp
Author

I need to make some real progress on this so I have completely uninstalled esp-idf and manually deleted all the ESP and espressif folders including all 3 IDF versions.
I have reinstalled ESP-IDF and only IDF version 5.3.1 to remove all doubt.
I will rebuild and test and report
Thanks

@michaelsimp
Author

Hi again
It's hard to tell, but it seems as if it might be a little more robust, especially with the router power off and on test.
Attachments.zip
But it still crashes, see files attached including my .elf

"Fail 1.txt" is taken from the Mesh_Root
Line 1458 Mesh_Node disconnected
Line 1495 Mesh_Node reconnects
Line 1496 crash

"Fails 2.txt" is taken from the Mesh_Node
Line 1789 Mesh_Root disconnected
Line 1822 Reconnect
Line 1832 Crash divide by zero

@zhangyanjiaoesp
Collaborator

It's weird, I have tested it multiple times as you said (the following two cases) and it can connect normally without any crashing issues.

Power on both nodes, one becomes MESH_ROOT and one MESH_NODE. Power off WIFI router. MESH_ROOT crashes. See file attached MESH_ROOT "Router power off.txt" (crash at end)

Second test. Power up only one node, becomes MESH_ROOT Power off Router This time MESH_ROOT does not crash Power on Router MESH_ROOT does not crash Power on a second node which connects to the first MESH_ROOT - see line 492 MESH_ROOT crashes.

I'm using the Github IDF, and I will try to test with the vscode extension

@michaelsimp
Author

michaelsimp commented Oct 30, 2024 via email

@zhangyanjiaoesp
Collaborator

@michaelsimp

Are you using the completely unmodified example code during your testing?

@michaelsimp
Author

michaelsimp commented Oct 30, 2024

Yes. I did not change anything except:

  • set target device to ESP32-S3 - Internal JTAG debug - What device are you testing with? Could this be a factor?
  • change partition table, factory partition size to 0x400000
  • Using vscode GUI menuconfig:
    • change device flash size to 16MB
    • set WiFi router SSID to VONETS and password to "pass9999"

See my source files attached where you can see most are untouched with the original install date 30/10/24 02:13pm NZ time.
Source.zip

@michaelsimp
Author

FYI I use vscode for coding and building and sometimes JTAG debugging
Most of the time in testing I am monitoring the serial COM ports using PuTTY terminals.

In addition to the crashing, sometimes a MESH_NODE will not reconnect to the MESH network. As a workaround for this, I want to stop and restart the WiFi mesh network. No matter what I try, my nodes intermittently crash on restart. Is this perhaps related?

I am hoping you could please answer a few questions to help me.

What are the recommended steps to stop and then restart the WiFi Mesh network? I currently have :

    ESP_ERROR_CHECK(esp_mesh_stop());
    ESP_ERROR_CHECK(esp_mesh_deinit());

but this causes a lot of error logging which stops if I add ...
ESP_ERROR_CHECK(mesh_netifs_destroy());

My restart is as follows:

void wifiMeshStart() {
    ESP_LOGW(TAG, "Wifi Mesh switch on");
    /*  mesh initialization */
    ESP_ERROR_CHECK(esp_mesh_init());
    ESP_ERROR_CHECK(esp_mesh_set_max_layer(CONFIG_MESH_MAX_LAYER));
    ESP_ERROR_CHECK(esp_mesh_set_vote_percentage(1));
    ESP_ERROR_CHECK(esp_mesh_set_ap_assoc_expire(10));
    /* set blocking time of esp_mesh_send() to 30s, to prevent the esp_mesh_send() from blocking permanently for some reason */
    ESP_ERROR_CHECK(esp_mesh_send_block_time(30000));
    mesh_cfg_t cfg = MESH_INIT_CONFIG_DEFAULT();
#if !MESH_IE_ENCRYPTED
    cfg.crypto_funcs = NULL;
#endif
    /* mesh ID */
    memcpy((uint8_t *) &cfg.mesh_id, MESH_ID, MAC_SIZE);
    /* router */
    cfg.channel = CONFIG_MESH_CHANNEL;

    cfg.router.ssid_len = strlen(meshProvisionData.ssid);
    memcpy((uint8_t *) &cfg.router.ssid, meshProvisionData.ssid, cfg.router.ssid_len);
    memcpy((uint8_t *) &cfg.router.password, meshProvisionData.password, strlen(meshProvisionData.password));
    
    ESP_ERROR_CHECK(esp_mesh_set_ap_authmode((wifi_auth_mode_t) CONFIG_MESH_AP_AUTHMODE));
    cfg.mesh_ap.max_connection = CONFIG_MESH_AP_CONNECTIONS;
    cfg.mesh_ap.nonmesh_max_connection = CONFIG_MESH_NON_MESH_AP_CONNECTIONS;
    memcpy((uint8_t *) &cfg.mesh_ap.password, CONFIG_MESH_AP_PASSWD, strlen(CONFIG_MESH_AP_PASSWD));
    ESP_ERROR_CHECK(esp_mesh_set_config(&cfg));
    /* mesh start */
    ESP_ERROR_CHECK(esp_mesh_start());
    ESP_LOGI(TAG, "WiFi Mesh started successfully");
}
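
One detail in the routine above: the router SSID and password are copied with memcpy() using strlen() of the source as the length, with no check against the size of the destination field. A minimal guard could look like the sketch below. It is plain C and runs off-target; copy_field() is a hypothetical helper, not an ESP-IDF function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copies the NUL-terminated string src into the
 * fixed-size byte field dst. Refuses (returns false, dst untouched) when the
 * string does not fit. On success the unused tail of dst is zero-filled and
 * the copied length is stored in *out_len if out_len is not NULL.
 */
bool copy_field(uint8_t *dst, size_t dst_size, const char *src, size_t *out_len)
{
    assert(dst != NULL && src != NULL);
    size_t len = strlen(src);
    if (len > dst_size) {
        return false;               /* would overflow the destination field */
    }
    memset(dst, 0, dst_size);       /* clear stale bytes from an earlier config */
    memcpy(dst, src, len);
    if (out_len != NULL) {
        *out_len = len;
    }
    return true;
}
```

In wifiMeshStart() this might be used as `copy_field(cfg.router.ssid, sizeof(cfg.router.ssid), meshProvisionData.ssid, &len)` before setting cfg.router.ssid_len, returning an error instead of starting the mesh when a provisioned string does not fit.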

I notice with my custom application:
I have 1 MESH_ROOT and 5 MESH_NODEs spread around the office. The MESH_NODEs do not seem to choose their parents based on RSSI as documented.

The IDF documentation states "To prevent nodes from forming a weak upstream connection, ESP-WIFI-MESH implements an RSSI threshold mechanism for beacon frames." Is this configurable and if so where? I can't find it in the API or in MenuConfig. What is the default RSSI threshold value?

The IDF documentation states in Preferred Parent Node "The preferred parent node is determined based on the following criteria: Which layer the parent node candidate is situated on. The number of downstream connections (child nodes) the parent node candidate currently has".
Does this mean RSSI is not part of the parent selection process?

Is it recommended to use self-organized networking, or for serious applications should I manually build the mesh network? I will only have a max of 10 mesh nodes altogether but they do a reasonable amount of MQTT5 communications to the cloud.

@zhangyanjiaoesp
Collaborator

@michaelsimp The following are the answers for your questions:

  1. What are the recommended steps to stop and then restart the WiFi Mesh network? I currently have :

        ESP_ERROR_CHECK(esp_mesh_stop());
        ESP_ERROR_CHECK(esp_mesh_deinit());
    

    Calling esp_mesh_stop() is enough.

  2. but this causes a lot of error logging which stops if I add ...
    ESP_ERROR_CHECK(mesh_netifs_destroy());
    Where did you add the mesh_netifs_destroy() call? What does the error log look like? Can you provide an example?

  3. Where did you call the wifiMeshStart() function?

  4. I have 1 MESH_ROOT and 5 MESH_NODEs spread around the office. Mesh_Node seem to connect to parents not based on the RSSI as documented.

    RSSI is not the only criterion for selecting the parent node, the layer and connections also need to be considered.

  5. The IDF documentation states "To prevent nodes from forming a weak upstream connection, ESP-WIFI-MESH implements an RSSI threshold mechanism for beacon frames." Is this configurable and if so where? I can't find it in the API or in MenuConfig. What is the default RSSI threshold value?

    You can call this API:

    esp_err_t esp_mesh_set_rssi_threshold(const mesh_rssi_threshold_t *threshold);

  6. The IDF documentation states in Preferred Parent Node "The preferred parent node is determined based on the following criteria: Which layer the parent node candidate is situated on. The number of downstream connections (child nodes) the parent node candidate currently has". Does this mean RSSI is not part of the parent selection process?

    Same as the fourth point: parent selection needs to consider RSSI, layer, and connections. The doc needs to be updated.

  7. Is it recommended to use self-organized networking, or for serious applications should I manually build the mesh network? I will only have a max of 10 mesh nodes altogether but they do a reasonable amount of MQTT5 communications to the cloud.

    You can use self-organized network.
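
Answers 4 and 6 name three inputs to parent selection (RSSI, layer, connections) without giving their order or weight. The plain C sketch below is only a model for reasoning about the behaviour reported in this thread, not the algorithm inside the mesh library. The type, the function names, and the ordering (RSSI as a pass/fail threshold, then lowest layer, then fewest children, then strongest RSSI) are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one parent candidate seen in a scan. */
typedef struct {
    int layer;      /* layer the candidate sits on (root = 1) */
    int children;   /* child nodes it already serves */
    int rssi;       /* RSSI of its beacon in dBm */
} parent_candidate_t;

/* Assumed ranking: lower layer, then fewer children, then stronger RSSI. */
bool candidate_is_better(const parent_candidate_t *a, const parent_candidate_t *b)
{
    if (a->layer != b->layer) {
        return a->layer < b->layer;
    }
    if (a->children != b->children) {
        return a->children < b->children;
    }
    return a->rssi > b->rssi;
}

/*
 * Returns the index of the preferred candidate, or -1 when none has an RSSI
 * of at least min_rssi and fewer than max_children children.
 */
int pick_parent(const parent_candidate_t *cands, size_t count,
                int min_rssi, int max_children)
{
    assert(cands != NULL || count == 0);
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (cands[i].rssi < min_rssi || cands[i].children >= max_children) {
            continue;   /* below the threshold or already full */
        }
        if (best < 0 || candidate_is_better(&cands[i], &cands[best])) {
            best = (int)i;
        }
    }
    return best;
}
```

Under this model, a root heard at -72 dBm beats a layer-2 node heard at -35 dBm as long as the threshold is -78 (the rssi_threshold value printed in the debug log), while raising the threshold to -60 makes the nearby layer-2 node the only eligible parent. That would match the distant connections described in the report, if the real algorithm behaves similarly.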

@zhangyanjiaoesp
Collaborator

@michaelsimp I can reproduce the crash using the vscode, I will check the difference between VSCode and standard IDF

@michaelsimp
Author

Hi Zhangyanjiaoesp
This is excellent news for me. Hopefully it is just something simple that you will find soon, so you can offer me a fix.
Thank you

@michaelsimp
Author

Hi Zhangyanjiaoesp

Thanks so much for taking the time to answer all my questions.

Please note THESE tests are with MY application (NOT with example program ip_internal_network) running on a network of 6 nodes - all ESP32-S3. My application has a CLI console integrated so I can trigger actions and see the responses on the COM port.

Q1 I will go back to just calling esp_mesh_stop() and see what happens.
To restart, should I just be able to call ESP_ERROR_CHECK(esp_mesh_start());?

Q2 Triggered from the CLI Console I was calling:

    ESP_ERROR_CHECK(esp_mesh_stop());
    ESP_ERROR_CHECK(esp_mesh_deinit());
    ESP_ERROR_CHECK(mesh_netifs_destroy());

Q3 wifiMeshStart() is also called from my CLI console

The CLI Console is started from my Mainline as is my WiFi Mesh application (built on top of the ip_internal_network source).
Triggering the Mesh Stop and Start would be called from the CLI Console thread. I am assuming this is ok and does not need any mutex protection. Please advise how I should call it if this is a problem.
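
Whether the mesh API needs external locking is not settled in this thread. Independently of that, a small state guard can at least keep a CLI command from calling stop twice or calling start while the mesh is already running. The sketch below is plain C that runs off-target; mesh_ctl_t, mesh_ctl_start(), mesh_ctl_stop(), and the demo hook are hypothetical names. On the device the hooks would point at esp_mesh_start and esp_mesh_stop, and the two entry points would still have to be serialized (for example by running them from a single task), which the sketch does not do.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as an `esp_err_t fn(void)` API call; 0 means success. */
typedef int (*mesh_op_fn)(void);

typedef enum { MESH_CTL_STOPPED, MESH_CTL_STARTED } mesh_ctl_state_t;

/* Hypothetical guard object: remembers the state and the two operations. */
typedef struct {
    mesh_ctl_state_t state;
    mesh_op_fn start;
    mesh_op_fn stop;
} mesh_ctl_t;

/* Stops the mesh only if it is running; a repeated stop is a no-op. */
int mesh_ctl_stop(mesh_ctl_t *ctl)
{
    assert(ctl != NULL && ctl->stop != NULL);
    if (ctl->state != MESH_CTL_STARTED) {
        return 0;
    }
    int err = ctl->stop();
    if (err == 0) {
        ctl->state = MESH_CTL_STOPPED;
    }
    return err;
}

/* Starts the mesh only if it is stopped; a repeated start is a no-op. */
int mesh_ctl_start(mesh_ctl_t *ctl)
{
    assert(ctl != NULL && ctl->start != NULL);
    if (ctl->state != MESH_CTL_STOPPED) {
        return 0;
    }
    int err = ctl->start();
    if (err == 0) {
        ctl->state = MESH_CTL_STARTED;
    }
    return err;
}

/* Stand-in hook so the sketch runs without the SDK; counts its calls. */
int demo_hook_calls = 0;
int demo_hook_ok(void)
{
    demo_hook_calls++;
    return 0;
}
```

With both hooks set to demo_hook_ok, two consecutive mesh_ctl_stop() calls invoke the hook once, and the following mesh_ctl_start() invokes it a second time.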

Q4 I understand this

Q5 Thanks

Q6 Thanks for clarifying this, but I am not finding this to be the case. I distributed some MESH_Nodes across the office with the aim of creating a multi-hop network between the far extremes. But it does not form as expected, or at all well for healthy RSSI. I have nodes which are close to my Mesh_Root or 2nd layer Mesh_Nodes which are not at parent capacity numbers. When my other Mesh_Nodes do connect to these parents they provide an RSSI on the child to the parent of -35dBm. But they most frequently want to connect to nodes a much longer distance away, getting an RSSI of < -70dBm.

I read somewhere that the default RSSI threshold is -120dBm, but I am finding that nodes with RSSIs < -70dBm often lose their MQTT connection to the broker. I have an office environment and have located the nodes approx 10 to 20 meters apart with a max of one wall between them, but they are not all in line of sight. I very much doubt I could even get a connection at an RSSI below -100dBm. I suspect a signal-to-noise ratio issue, so I scanned the office for WiFi channel usage and selected channel 1 on my WiFi router, as nothing else uses this channel and no other channels overlap it. I know this is not an easy question to answer with precision and I appreciate there are many influences, but realistically, what is the ballpark minimum RSSI at which I can expect a node to work reliably, and over what sort of distance?

Q7 Because of my Q6 response above, I have started evaluating the example project "manual_networking" to make my MESH_NODES manually scan and select MESH_NODES with the healthiest RSSI. It sounds like you are saying the IDF framework should already be doing this?
So I am now wondering if the vscode crashing issue is also causing this to not work properly and your fix might fix both.
Should I put manual scanning and parent selection changes to one side and wait for the outcome of the vscode crashing?
I guess I would prefer to use the self-organized network as much as possible, if it works as you describe.
Please advise / confirm manual scanning and parent selection should not be required and I might need you to look at the node parent selection for healthy RSSI next.

Best regards

@zhangyanjiaoesp
Collaborator

@michaelsimp

  1. According to the backtrace of the crash issue, it seems related to DHCP, it won't affect the mesh networking.
  2. I'm sorry I didn't quite understand your question regarding the selection of the parent node. Can you draw a picture to explain it? For example, where are nodes A, B, and C located? What level? How many child nodes are connected below? What is the RSSI of A, B, and C scanning each other? Do you expect A to connect to B but actually connect A to C?

@michaelsimp
Author

See attached:
MeshMaps.zip

"Target Network .png" shows the walls as black lines and nodes as circles. The blue lines are approx what I was expecting to see.

It forms very randomly but with bad choices like the file "Actual example Network.png" with links and dBm in red.

My project is configured for up to 50 nodes and 3 children per node as I wanted to force some layers.

@michaelsimp
Author

"According to the backtrace of the crash issue, it seems related to DHCP, it won't affect the mesh networking."

That may be the case with the crash issue you found, but the root cause of the vscode IDF environment may cause more than one issue. I guess you will know better when you get to the bottom of the vscode IDF environment issue

Do the diagrams I sent help you understand my issue better now?

@zhangyanjiaoesp
Collaborator

Yes, I now understand your question. Once the ROOT node is formed, the chances for other idle nodes to connect to the root node are the same; as long as they can scan the root node within the RSSI threshold range, they can connect and become second-layer nodes. Therefore, it is reasonable for C and D to connect to A and become second-layer nodes. What is the RSSI that B, E, and F receive from A? Since each node can connect to 3 child nodes, if they are within the RSSI threshold range and root A is not yet fully connected, at least one of B, E, or F should be able to connect to A.

I think you can call esp_mesh_set_rssi_threshold() to limit the RSSI threshold for optional parents, which would allow nearby nodes to connect as much as possible. However, nodes D and E are too far from node A. If you set the same RSSI threshold for all nodes, the connection results might still not meet your expectations, unless you configure different RSSI thresholds for each node. Alternatively, you could only call the esp_mesh_set_parent() function to specify the parent of each node.

[Image: Screenshot 2024-10-31 18 04 10]

@michaelsimp
Author

Hi
I thought it best to go back to the Espressif example project "manual_networking". These tests have been built with the 2 earlier updates to the IDF library. This project does not use "mesh_netif.c".

I made no changes to the source.
I selected device ESP32-S3.
Using Menu Config I:

  • changed the Flash memory size to 16MB
  • set Example Configuration device type to "MESH_SET_NODE" as this matches my objectives with the network automatically assigning a MESH_ROOT.
  • entered my Router SSID and password

I did a clean build and tested. See attachments for log files, line numbers and summary at top of each file.
Manual test 8Nov.zip

Test one, power 2 nodes up at the same time.
Log file "manual test 1": Node crashes, reboots and then becomes root.
Log file "manual test 1B": Node won't become root even though the other node is delayed by a crash and restart. It displays disconnect messages for a period of time until the other node becomes MESH_ROOT, and then it connects to it.

Test two, power up 2 nodes; one becomes MESH_ROOT and the other connects to it.
Restart the MESH_ROOT node for a couple of cycles until both nodes get into a state where neither becomes MESH_ROOT and so no connections are made.
Log file "manual test 2A": Crashes once and then restarts. Does not become MESH_ROOT and does not connect, as there is no MESH_ROOT.
Log file "manual test 2B": Starts but does not become MESH_ROOT and does not connect, as there is no MESH_ROOT.

I stopped the log files after 72 seconds so I wouldn't lose the start. But I left the 2 terminals running while I wrote this up and they have now been going for 1416228ms or 1416 seconds or over 23 minutes

This pretty much mirrors what I am seeing with my application.
If I keep resetting nodes, they do start up with one becoming MESH_ROOT and the other connecting, but it is VERY easy to get it to fault as described above. Please try it for yourself.

My objectives are to have a self organized network WITHOUT fixing my MESH_ROOT. But if a MESH_NODE connects to a parent with an RSSI < -60dBm, I want them to scan and find eligible parents with better RSSI and switch to them using esp_mesh_set_parent(). The reason is I have found the mesh network comms to be flaky if my RSSI figures get less than -65dBm. The stronger the better.

Most of the time (and for the testing I am doing and reporting to you) I am only using MQTT comms for nodes to an external broker.

But I have also developed an OTA solution, using HTTPS to update my MESH_ROOT and then having my MESH_ROOT push OTA updates to my MESH_NODEs using esp_mesh_send() and recv_cb. I have developed a protocol that broadcasts packets of 1024 bytes at a time, and the MESH_NODEs respond after a small delay × their node number to avoid collisions. I have Ack messages and a retry mechanism for dropped packets, and this works well if RSSI > -65dBm. BUT below -65dBm the dropped packets become too much. I drop these nodes from the OTA process and complete the broadcast to the rest. I then follow up with P2P messages to the nodes I dropped, which works better over lower-RSSI connections. None of the OTA stuff is active in my application during the tests I have been running; this is just for background information.
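
The staggered acknowledgement scheme described above can be sketched roughly as follows. This is only an illustration in plain C: the names `ack_delay_ms`, `ota_delivery_for` and `ACK_SLOT_MS`, and the 20 ms slot length, are assumptions and not taken from the actual application. Only the -65dBm cut-off and the 50-node limit come from this thread, and treating exactly -65dBm as a weak link is a choice made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES          50    /* the project is configured for up to 50 nodes */
#define ACK_SLOT_MS        20    /* assumed per-node slot length, not from the application */
#define BROADCAST_MIN_RSSI (-65) /* dBm; at or below this, dropped packets became too frequent */

typedef enum { OTA_BROADCAST, OTA_P2P_FOLLOW_UP } ota_delivery_t;

/* Each node waits (slot length x its node number) before acknowledging a
 * broadcast packet, so acknowledgements from different nodes do not collide. */
uint32_t ack_delay_ms(uint16_t node_number)
{
    assert(node_number <= MAX_NODES);
    return (uint32_t)ACK_SLOT_MS * node_number;
}

/* Nodes on weak links are dropped from the broadcast and are updated
 * afterwards with point-to-point messages. */
ota_delivery_t ota_delivery_for(int rssi_dbm)
{
    return (rssi_dbm > BROADCAST_MIN_RSSI) ? OTA_BROADCAST : OTA_P2P_FOLLOW_UP;
}
```

With these assumed values, node 3 would acknowledge 60 ms after a broadcast packet, and a node at -80dBm would be left for the P2P follow-up.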

@zhangyanjiaoesp
Collaborator

@michaelsimp

  1. The crash issue is caused by the error check you added when calling the esp_mesh_scan_get_ap_ie_len() function. After the AP is retrieved, it returns -1.
    image

  2. Since you have set this node as "MESH_SET_NODE" in menuconfig, it will only look for a mesh AP when searching for a parent. A mesh AP's mesh IE length is non-zero, whereas the parent of the root can only be a router, and a router does not contain the mesh IE. Refer to the code in the example:
    image

    Log file "manual test 1" Node crashes, reboots and then becomes root

    So, your description is incorrect. In "manual test 1", the device (30:30:f9:33:72:b8) did not become the root; instead, after rebooting, it became a third-layer node. The log is: I (7672) mesh_main: <MESH_EVENT_PARENT_CONNECTED>layer:0-->3, parent:48:ca:43:9b:5d:21, ID:77:77:77:77:77:76

  3. In "manual test 1B", the node (48:ca:43:9b:53:d8, which we call B) has designated 30:30:f9:33:72:b8 (which we call A) as its parent, so it will only connect to this parent and will not attempt to connect to any other node. When node A crashes and reboots, node B fails to find the AP and returns a 201 error. It will only reconnect successfully once the parent has fully recovered.
    W (6278) mesh_main: <PARENT>ESPM_3372B8, layer:2/23, assoc:0/6, 1, 30:30:f9:33:72:b9, channel:1, rssi:-19

image

  4. In "manual test 2A", node A initially selected ESPM_9B54C0 as its parent: I (27638) mesh_main: <MESH_EVENT_PARENT_CONNECTED>layer:0-->2, parent:48:ca:43:9b:54:c1<layer2>, ID:77:77:77:77:77:76.
    After rebooting, it chose ESPM_9B53D8 as its parent instead: W (6258) mesh_main: <PARENT>ESPM_9B53D8, layer:3/22, assoc:0/6, 0, 48:ca:43:9b:53:d9, channel:1, rssi:-32. The connection failure shows that the parent has changed to IDLE.
I (9888) wifi:state: init -> auth (0xb0)
I (9898) wifi:state: auth -> init (0x65c0)
I (9898) wifi:new:<1,0>, old:<1,1>, ap:<1,1>, sta:<1,1>, prof:1, snd_ch_cfg:0x0
I (9898) mesh: [wifi]disconnected reason:101(idle), continuous:1/max:12, non-root, vote(,stopped)<><>
I (9898) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason:101
  5. In "manual test 2B", node B selects node A as its parent.

@michaelsimp
Author

Hi. Thanks for your response.

My apologies, I thought I was working on the original "manual_networking" project, but this was my test version, which I did modify.

I thought MESH_SET_NODE meant I could set the MESH_NODE's parent (to a closer node with better RSSI), but I think you are saying it is designed to set the MESH_TYPE as MESH_NODE only, which I guess explains why I don't later get a MESH_ROOT. Please confirm.

So can you give me a lead on how to achieve my objective as documented in my last post? I am looking for a hybrid between the example "ip_internal_networking", where everything is automatic, and the example "manual_networking", which is designed to manually set a fixed MESH_ROOT and MESH_NODES.

I am guessing my problem lies in my use of esp_mesh_set_parent(), where this example always passes in a type of "my_type", so a MESH_NODE will be set to type MESH_NODE. Does this then stop the node scanning to become MESH_ROOT?

I am guessing that somehow I need to restart the scanning for the router with the highest RSSI for the voting process, but I don't know how to tell when I have lost the MESH_ROOT in the network, or how to trigger the rescan and vote to enable a MESH_NODE or MESH_IDLE to become MESH_ROOT again. This all happens automatically on start up before my use of esp_mesh_set_parent().

I really appreciate your help to better understand this.

@zhangyanjiaoesp
Collaborator

  1. In the "manual_networking" example, "MESH_SET_NODE" means the node's type will be MESH_NODE; it can't be ROOT.
  2. To achieve your goal, I suggest initially using a self-organized network. Once the network is formed, if a node detects that its parent’s RSSI < -60dBm, it can initiate a scanning process to reselect a new parent. When the root node disappears, a MESH_EVENT_NETWORK_STATE event will be triggered. The nodes can then read this event and call esp_mesh_set_self_organized(true, true) to vote for a new root.
  3. Actually, in a self-organized network, after the initial network formation, non-root nodes will periodically scan the surrounding nodes and select a better parent, dynamically adjusting the network, as described in the document I mentioned previously. I believe this dynamic adjustment aligns with your requirements.
    Have you observed this behavior? The related logs are as follows:
I (23076) mesh: 5170<active>parent layer:1(node), channel:5, rssi:-21, assoc:1(cnx rssi threshold:-120)my_assoc:0
I (24576) mesh: 5969<scan>parent layer:1, rssi:-20, assoc:1(cnx rssi threshold:-120)
I (24576) mesh: [SCAN][ch:5]AP:1, other(ID:0, RD:0), MAP:1, idle:0, candidate:1, root:1, topMAP:0[c:2,i:2][ac:84:c6:7e:31:2b]<weak>
I (24586) mesh: 7391[weak]try rssi_threshold:-120, backoff times:0, max:5<-78,-82,-85>
I (24596) mesh: 716[monitor]no change, parent:4c:11:ae:eb:83:45, rssi:-20
I (24596) mesh: 2012<arm>parent monitor, my layer:2(cap:6)(node), interval:5231ms, retries:3<>

If there is no better parent, the node will remain with its current parent:

I (24596) mesh: 716[monitor]no change, parent:4c:11:ae:eb:83:45, rssi:-20

If a better parent is found, it will switch to that parent:

mesh: 710[monitor]new parent:30:ae:a4:80:1e:ed<layer:2, rssi:-33, assoc:0>current parent:30:ae:a4:00:48:59<layer:3, rssi:-31, assoc:4>

@michaelsimp
Author

Thank you for this information.
Is the MESH_EVENT_NETWORK_STATE event only triggered by a MESH_ROOT node disappearing, or is there additional testing I need to do before calling esp_mesh_set_self_organized(true, true) to vote for a new root? It's just that the name of this event, "MESH_EVENT_NETWORK_STATE", sounds pretty broad and not at all like "MESH_ROOT disappeared".

I had read in the docs about non-root nodes in a self-organized network periodically scanning the surrounding nodes and selecting a better parent, but I have had multiple cases where a node at one end of the network (location-wise) has connected to a node at the opposite end, resulting in RSSI figures < -70dBm when there are much closer nodes in between, and they have NOT automatically switched. How often should it scan and switch?

When I added my manual scan and switching code (with the fixed MESH_ROOT) it formed a much better network and the performance was very noticeably better.

I will try this over the weekend and report back.

@michaelsimp
Author

michaelsimp commented Nov 10, 2024

Hi

To recap,

I have created a task which runs every 60 seconds to check whether the node's RSSI < -60dBm and, if so, calls wifiMeshScan() to scan for a better parent. For now I have commented this out and instead call wifiMeshScan() (see below) manually from my CLI console.
This task also monitors a selfOrganizeReactivateTimer which calls esp_mesh_set_self_organized(true, false) after the scan and switch parent is completed.

static uint8_t meshTaskPrescale = MESH_SCAN_INTERVAL / 2; // rescan in 10 seconds
void vWifiMeshTaskCode(void* args) {
    ESP_LOGI(TAG, "vWifiMeshTaskCode started");
    ESP_ERROR_CHECK(taskRegisterToWatch(xWifiMeshHandle, "appWiFiMesh")); // register task for watchdog
    while (true) {
        sleepMs(1 * MILLISECOND_TICKS_PER_SECOND);
        taskPatWdog(xWifiMeshHandle); // pat watchdog
#ifdef ENABLE_PARENT_SEARCH
        // if (meshTaskPrescale++ > MESH_SCAN_INTERVAL) {
        //     meshTaskPrescale = 0;
        //     if ((!esp_mesh_is_root()) && (currentRSSI < RSSI_RESCAN_THRESHOLD)) 
        //         wifiMeshScan();
        // }

        if (selfOrganizeReactivateTimer > 0) {
            selfOrganizeReactivateTimer--;
            if (selfOrganizeReactivateTimer == 0) {
                ESP_LOGI(TAG, "Reactivate self organization");
                ESP_ERROR_CHECK(esp_mesh_set_self_organized(true, false));
            }
        }
#endif
    }
}

Here is my wifiMeshScan():

void wifiMeshScan() { // for closer parent
    if (esp_mesh_is_root()) {
        ESP_LOGE(TAG, "Error: This node is MESH_ROOT");
        return;
    }
#ifndef ENABLE_PARENT_SEARCH
    ESP_LOGE(TAG, "WifiMeshScan for closer parent disabled");
    return;
#endif // ENABLE_PARENT_SEARCH

    if (esp_mesh_get_type() != MESH_IDLE) {
        esp_err_t err;
        err = esp_wifi_sta_get_rssi(&currentRSSI);
        if (err == ESP_OK) {
            ESP_LOGI(TAG, "WifiMeshScan for closer parent. Current parent RSSI: %d", currentRSSI);
        } else {
            ESP_LOGE(TAG, "WifiMeshScan for closer parent. Get RSSI error: %s", esp_err_to_name(err));
        }
    } else {
        currentRSSI = -100; // if not parent, default low RSSI        
    }
    selfOrganizeReactivateTimer = SELF_ORGANIZE_REACTIVATE_LONG_TIME; // enable long reenable timer as backup in case scan does not complete
    ESP_ERROR_CHECK(esp_mesh_set_self_organized(false, false));
    esp_wifi_scan_stop();
    wifi_scan_config_t scan_config = { 0 }; // mesh softAP is hidden
    scan_config.show_hidden = 1;
    scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
    esp_wifi_scan_start(&scan_config, 0);
}

and in my mesh_event_handler I have added:

    case MESH_EVENT_SCAN_DONE: {
        mesh_event_scan_done_t *scan_done = (mesh_event_scan_done_t *)event_data;
        ESP_LOGI(TAG, "<MESH_EVENT_SCAN_DONE>number:%d", scan_done->number);
#ifdef ENABLE_PARENT_SEARCH
        findClosestParent(scan_done->number);  // now look for a closer parent in the results
#endif // ENABLE_PARENT_SEARCH
    }
    break;

and my findClosestParent() (copied and modified slightly from the manual_networking example, to scan all eligible nodes and take the one with the best RSSI):

void findClosestParent(int num) { // after a WiFi scan
    ESP_LOGW(TAG, "findClosestParent  Current RSSI: %d", currentRSSI);
    int i;
    int ie_len = 0;
    mesh_assoc_t assoc;
    mesh_assoc_t parent_assoc = { .layer = CONFIG_MESH_MAX_LAYER, .rssi = -120 };
    wifi_ap_record_t record;
    wifi_ap_record_t parent_record = { 0, };
    parent_record.rssi = currentRSSI; // has to be better than current RSSI to change parent
    bool parent_found = false;
    mesh_type_t my_type = MESH_IDLE;
    int my_layer = -1;
    wifi_config_t parent = { 0, };
    wifi_scan_config_t scan_config = { 0 };

    for (i = 0; i < num; i++) { // iterate through scan records looking for eligible closer parent node
        ESP_ERROR_CHECK(esp_mesh_scan_get_ap_ie_len(&ie_len));
        ESP_ERROR_CHECK(esp_mesh_scan_get_ap_record(&record, &assoc));
        ESP_LOGD(TAG, "ie_len: %d  sizeof(assoc): %d", ie_len, sizeof(assoc));
        if (ie_len == sizeof(assoc)) {
            ESP_LOGI(TAG,
                     "<MESH>[%d]%s, layer:%d/%d, assoc:%d/%d, %d, "MACSTR", channel:%u, rssi:%d, ID<"MACSTR"><%s>",
                     i, record.ssid, assoc.layer, assoc.layer_cap, assoc.assoc, assoc.assoc_cap, assoc.layer2_cap, MAC2STR(record.bssid),
                     record.primary, record.rssi, MAC2STR(assoc.mesh_id), assoc.encrypted ? "IE Encrypted" : "IE Unencrypted");

            // ESP_LOGI(MESH_TAG, "Type: %d  layer_cap %d:  assoc %d  assoc_cap: %d  rssi: %d", assoc.mesh_type, assoc.layer_cap, assoc.assoc, assoc.assoc_cap, record.rssi);
            if (assoc.mesh_type != MESH_IDLE && assoc.layer_cap && assoc.assoc < assoc.assoc_cap) { //  && record.rssi > -70) MBS removed 
                // ESP_LOGI(MESH_TAG, "assoc.layer: %d  parent_assoc.layer %d:  assoc.layer2_cap %d  parent_assoc.layer2_cap: %d", assoc.layer, parent_assoc.layer, assoc.layer2_cap, parent_assoc.layer2_cap);
                if (assoc.layer < parent_assoc.layer || assoc.layer2_cap < parent_assoc.layer2_cap) {
                    if (record.rssi > parent_record.rssi) { // closer parent found
                        if (memcmp(&parent_assoc, &assoc, sizeof(assoc)) != 0) { // dont switch to same parent
                            ESP_LOGW(TAG, "Closer Parent found: %s  RSSI: %d", record.ssid, record.rssi);
                            parent_found = true;
                            memcpy(&parent_record, &record, sizeof(record));
                            memcpy(&parent_assoc, &assoc, sizeof(assoc));
                            if (parent_assoc.layer_cap != 1) {
                                my_type = MESH_NODE;
                            } else {
                                my_type = MESH_LEAF;
                            }
                            my_layer = parent_assoc.layer + 1;
                            // break; // MSB removed, keep searching for the closest parent
                        }
                    }
                }
            }
        } else {
            ESP_LOGD(TAG, "[%d]%s, "MACSTR", channel:%u, rssi:%d", i, record.ssid, MAC2STR(record.bssid), record.primary, record.rssi);
        }
    }

    esp_mesh_flush_scan_result();
    if (parent_found) { // parent: Both channel and SSID of the parent are mandatory
        parent.sta.channel = parent_record.primary;
        memcpy(&parent.sta.ssid, &parent_record.ssid, sizeof(parent_record.ssid));
        parent.sta.bssid_set = 1;
        memcpy(&parent.sta.bssid, parent_record.bssid, 6);
        if ((my_type == MESH_NODE) || (my_type == MESH_LEAF) || (my_type == MESH_IDLE)) {
            ESP_ERROR_CHECK(esp_mesh_set_ap_authmode(parent_record.authmode));
            if (parent_record.authmode != WIFI_AUTH_OPEN) {
                memcpy(&parent.sta.password, CONFIG_MESH_AP_PASSWD, strlen(CONFIG_MESH_AP_PASSWD));
            }
            ESP_LOGW(TAG,
                     "<PARENT>%s, layer:%d/%d, assoc:%d/%d, %d, "MACSTR", channel:%u, rssi:%d",
                     parent_record.ssid, parent_assoc.layer,
                     parent_assoc.layer_cap, parent_assoc.assoc,
                     parent_assoc.assoc_cap, parent_assoc.layer2_cap,
                     MAC2STR(parent_record.bssid), parent_record.primary,
                     parent_record.rssi);
            ESP_ERROR_CHECK(esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer));
            selfOrganizeReactivateTimer = SELF_ORGANIZE_REACTIVATE_TIME; // start self organize reactivation timer
        }
    } else {
        ESP_LOGE(TAG, "No eligible closer Parent found");
        if (currentRSSI == NO_RSSI) { // scan again if no connection yet
            esp_wifi_scan_stop();
            scan_config.show_hidden = 1;
            scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
            esp_wifi_scan_start(&scan_config, 0);
        }
    }
}

My problem was that if I ran the above code and then later restarted my MESH_ROOT, I didn't get a new MESH_ROOT and the mesh network couldn't re-establish itself.
Note this is not a problem if the code above does not run.

Based on your last post advice I have added the following to mesh_event_handler()

    case MESH_EVENT_NETWORK_STATE: {
        mesh_event_network_state_t *network_state = (mesh_event_network_state_t *)event_data;
        if (network_state->is_rootless) {
            ESP_LOGW(TAG, "\n<MESH_EVENT_NETWORK_STATE> MESH_ROOT lost\n");
            esp_mesh_set_self_organized(true, true); // vote a new root            
        } else {
            ESP_LOGW(TAG, "\n<MESH_EVENT_NETWORK_STATE> MESH_ROOT established\n");
        }
    }
    break;

Now when I restart the MESH_ROOT it works perfectly SOME of the time. The event MESH_EVENT_NETWORK_STATE fires, and the voting process starts to find a MESH_ROOT. Everything reestablishes and starts working again.

But more than half of the time, the network gets in a mess. The MESH_EVENT_NETWORK_STATE event does NOT fire to restart the MESH_ROOT polling.

I end up with no MESH_ROOT and MESH_NODES on deep layers all point directly or indirectly to a MESH_IDLE node. See my logs attached.
10Nov.zip

  • rootrestart.txt starts at the reboot of my MESH_ROOT. At the end my console command "mesh" shows this node 48:ca:43:9b:53:d8 as a MESH_NODE layer 4 to parent 48:ca:43:9b:5c:69
  • node1.txt shows 48:ca:43:9b:5c:68 as a MESH_NODE layer 3 to parent 30:30:f9:33:49:bd
  • node2.txt shows 30:30:f9:33:49:bc as a MESH_IDLE
  • node 3.txt shows 30:30:f9:33:72:b8 as a MESH_NODE layer 4 to parent 48:ca:43:9b:5c:69

None of these log files show the MESH_EVENT_NETWORK_STATE firing and calling esp_mesh_set_self_organized(true, true).
So as a test I added a CLI command to call esp_mesh_set_self_organized(true, true) and tried to use it to rectify the problem above. It does work some of the time creating a MESH_ROOT and bringing some nodes back, but not all nodes. Some stay stuck on MESH_IDLE.

In summary, when I scan for a better parent:

    esp_mesh_set_self_organized(false, false);
    esp_wifi_scan_stop();
    wifi_scan_config_t scan_config = { 0 }; // mesh softAP is hidden
    scan_config.show_hidden = 1;
    scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
    esp_wifi_scan_start(&scan_config, 0);

When I find a better parent:

    esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer)
    ....
    esp_mesh_set_self_organized(true, false);

When root disappears:

    case MESH_EVENT_NETWORK_STATE: // and if network_state->is_rootless
        esp_mesh_set_self_organized(true, true);

The problem:
MESH_EVENT_NETWORK_STATE regularly does not fire, so esp_mesh_set_self_organized(true, true) is never called.

@zhangyanjiaoesp
Collaborator

Thank you for this information. Is the MESH_EVENT_NETWORK_STATE event only triggered by a MESH_ROOT node disappearing, or is there additional testing I need to do before calling esp_mesh_set_self_organized(true, true) to vote for a new root? It's just that the name of this event, "MESH_EVENT_NETWORK_STATE", sounds pretty broad and not at all like "MESH_ROOT disappeared".

I had read in the docs about non-root nodes in a self-organized network periodically scanning the surrounding nodes and selecting a better parent, but I have had multiple cases where a node at one end of the network (location-wise) has connected to a node at the opposite end, resulting in RSSI figures < -70dBm when there are much closer nodes in between, and they have NOT automatically switched. How often should it scan and switch?

When I added my manual scan and switching code (with the fixed MESH_ROOT) it formed a much better network and the performance was very noticeably better.

I will try this over the weekend and report back.

  1. The MESH_EVENT_NETWORK_STATE event is triggered by Layer 2 nodes and propagates progressively through the network to the last node. The event only records a "rootless" state underneath.
    image

  2. The scanning period is not fixed, but is at least 4.5 seconds.

@zhangyanjiaoesp
Collaborator

I end up with no MESH_ROOT and MESH_NODES on deep layers all point directly or indirectly to a MESH_IDLE node. See my logs attached.
10Nov.zip

Could you provide a more complete log? I need to understand the network state before the root reboot. Specifically, how Node1, Node2, and Node3 were connected, and what the behavior was after the reboot.

@michaelsimp
Author

Hi
See logs attached, from boot till end.
11Nov.zip

I have all 4 nodes in my office for convenience.
I have disabled the periodic scan for closer wifi and invoke this from my CLI console - see the command "wifi scan". It looks like I only sent this on nodes 2 and 3 this time, but the results are pretty typical of when it goes wrong:

  • Event <MESH_EVENT_NETWORK_STATE> did not fire so I didn't call esp_mesh_set_self_organized(true, true);
  • ended up with broken mesh

Ignore the watchdog in MQTT. I think the library blocks a bit when WiFi is not up.

  • test2 root.txt: MAC: 30:30:f9:33:72:b8 MESH_ROOT powered off at the end of the file and left off
  • test2 node1.txt: MAC: 48:ca:43:9b:53:d8 After MESH_ROOT powered off, this MESH_NODE becomes MESH_IDLE
  • test2 node2.txt: MAC: 48:ca:43:9b:5c:68 After MESH_ROOT powered off, this layer 2 MESH_NODE becomes a layer 3 MESH_NODE with parent 48:ca:43:9b:53:d9
  • test2 node3.txt: MAC: 30:30:f9:33:49:bc After MESH_ROOT powered off, this MESH_NODE becomes a layer 4 MESH_NODE with parent 48:ca:43:9b:5c:69

@michaelsimp
Author

Hi Zhangyanjiaoesp
Just checking on the status of this to make sure you have everything you need and that you will get back to me with some ideas soon. Thanks

@zhangyanjiaoesp
Collaborator

@michaelsimp

  1. When would you invoke the scan via the CLI console? Is it at any given moment?
  2. Can you disable the MQTT task watchdog? I'm not sure if it will interfere with the entire process.
  3. Why are there so many reboot operations in the root log?

@zhangyanjiaoesp
Collaborator

@michaelsimp
Is this the expected behavior?

  1. Initially, the network is self-organized, forming the initial network.
  2. Then, non-root nodes disable self-organization and periodically scan for nearby APs, selecting a better parent.
  3. When the root node disappears, the other nodes re-enable self-organization, reselect the root, and restore the network.
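
One way to picture these three steps is as a small state machine. The sketch below is plain C with no ESP-IDF calls; the state and event names are hypothetical, and the comments only indicate where a real application would call esp_mesh_set_self_organized(). Ignoring scan requests while the network is rootless is a choice made for the sketch, not something stated in this thread.

```c
#include <assert.h>

typedef enum {
    NET_SELF_ORGANIZED,  /* step 1: the mesh stack forms the network itself */
    NET_MANUAL_PARENT,   /* step 2: self-organization disabled, app scans and picks the parent */
    NET_RECOVERING       /* step 3: root lost, self-organization re-enabled to vote a new root */
} net_state_t;

typedef enum {
    EV_SCAN_REQUESTED,   /* periodic timer or CLI asks for a better parent */
    EV_ROOTLESS,         /* MESH_EVENT_NETWORK_STATE with is_rootless set */
    EV_ROOT_ESTABLISHED  /* MESH_EVENT_NETWORK_STATE with is_rootless clear */
} net_event_t;

/* Returns the state the node should be in after handling one event. */
net_state_t net_next_state(net_state_t state, net_event_t ev)
{
    switch (ev) {
    case EV_SCAN_REQUESTED:
        /* real code: esp_mesh_set_self_organized(false, false), then start the scan.
         * While recovering, the request is ignored so the root vote is not disturbed. */
        return (state == NET_RECOVERING) ? NET_RECOVERING : NET_MANUAL_PARENT;
    case EV_ROOTLESS:
        /* real code: esp_mesh_set_self_organized(true, true) */
        return NET_RECOVERING;
    case EV_ROOT_ESTABLISHED:
        /* only a recovering node returns to normal self-organized operation */
        return (state == NET_RECOVERING) ? NET_SELF_ORGANIZED : state;
    }
    assert(0 && "unknown event");
    return state;
}
```

Keeping the transitions in one pure function makes it easy to check on a host machine that a rootless event always leads back to voting, whatever the node was doing before.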

@michaelsimp
Author

michaelsimp commented Nov 12, 2024

Answering your questions:
When would you invoke the scan via the CLI console? Is it at any given moment?
YES but I am running this when the network is created and settled.

Can you disable the MQTT task watchdog? I'm not sure if it will interfere with the entire process.
OK

Why are there so many reboot operations in the root log?
Sorry I should have deleted up until the last one. I did a few tests, some successful before the final one I sent you. Reference the last test run from line 1672 onwards which matches the node traces I sent.

Is this the expected behavior?
Initially, the network is self-organized, forming the initial network.
Yes

Then, non-root nodes disable self-organization and periodically scan for nearby APs, selecting a better parent.
Yes

When the root node disappears, the other nodes re-enable self-organization, reselect the root, and restore the network.
Yes. This part is the problem, as OFTEN MESH_EVENT_NETWORK_STATE does not fire and call esp_mesh_set_self_organized(true, true). If I call esp_mesh_set_self_organized(true, true) manually from my CLI, it doesn't fix the broken mesh.

If it helps, I could make a cut down project and send it to you to test and play with. It doesn't need any special hardware, I am just using ESP32-S3-WROOM-1 dev boards with 16MB Flash

@zhangyanjiaoesp
Collaborator

If it helps, I could make a cut down project and send it to you to test and play with. It doesn't need any special hardware, I am just using ESP32-S3-WROOM-1 dev boards with 16MB Flash

Sure, it would be better if you could provide a demo. That way, I can test and debug it locally.

@michaelsimp
Author

Hi

See attached wincut folder structure of my project minus the build directory.

wincut.zip

I have cut out the application but left all of the platform stuff. It's still quite big, but you will probably only be interested in the source at:
..\wincut\main\platform\espidf\wifi where wifiMesh.cpp is developed out of the example project "ip_internal_network"

Typing "help" will bring up a menu.

Your targets probably won't be WiFi provisioned yet, but if they are you can run "wifi unprovision" to clear it out. On restart it will display the provisioning screen on the console with the QR code for the ESP provisioning app.

After provisioning (all your nodes), they will start up normally and form a self organized mesh.

Type "mesh" to display the node's WiFi mesh attributes, including the mesh routing table, as follows:

W (00:14:26.671) aWifiMesh: MESH_ROOT Layer: 1 MAC: 30:30:f9:33:49:bc IP: 192.168.1.13
I (00:14:26.674) aWifiMesh: Router VONETS BSSID (MAC): 00:17:13:20:bd:74
I (00:14:26.686) aWifiMesh: Vote: 100%
I (00:14:26.687) aWifiMesh: Network is: self organized
I (00:14:26.688) aWifiMesh: Network Root is: not fixed
I (00:14:26.700) aWifiMesh: Number of nodes: 4
I (00:14:26.701) aWifiMesh: Max layers: 6
I (00:14:26.713) aWifiMesh: Mesh Parent MAC: 00:17:13:20:bd:74 RSSI: -45 Channel: 3
I (00:14:26.715) aWifiMesh: Mesh routing Table records: 4 Record size: 6 Table size: 24
I (00:14:26.728) aWifiMesh: MAC: 30:30:f9:33:49:bc
I (00:14:26.729) aWifiMesh: MAC: 30:30:f9:33:72:b8
I (00:14:26.742) aWifiMesh: MAC: 48:ca:43:9b:5c:68
I (00:14:26.744) aWifiMesh: MAC: 48:ca:43:9b:53:d8

The MQTT app will try to connect to a Mosquitto broker on 192.168.1.2. If you want to disable this, the MQTT code is in the file:
..\wincut\main\host\mqtt5app.cpp. Comment out the last line of code which starts my MQTT task.

Type "wifi scan" on a NON MESH_ROOT to get it to search for a better parent and switch. It won't run on a MESH_ROOT.

Reboot the MESH_ROOT to reproduce the broken Mesh issue.

Type "wifi root" to call esp_mesh_set_self_organized(true, true)

@michaelsimp
Author

Have you managed to test my application yet and reproduce the problem?

@zhangyanjiaoesp
Collaborator

Still in testing. I didn't use your demo as it involves too many components. Instead, I made some modifications based on the ip_internal_network example.

@michaelsimp
Author

Ok, can you reproduce the problem?

@espressif-bot added Status: In Progress and removed Status: Done labels Nov 13, 2024
@zhangyanjiaoesp
Collaborator

@michaelsimp
Here are the changes I made in the ip_internal_network example; you can take a look.
0001-add-user-scan-code.zip

There are the following points to note in your changes:

  1. When determining whether a scanned AP is the current parent, comparing the BSSID is sufficient, so I use memcmp(parent_record.bssid, record.bssid, 6) != 0 in the findClosestParent() function; memcmp(&parent_assoc, &assoc, sizeof(assoc)) != 0 should not be used.
    I encountered an issue during testing when using memcmp(&parent_assoc, &assoc, sizeof(assoc)) != 0: the currently connected parent was detected in the scan, but with a lower RSSI than when the node had connected to it, and the node still reselected that parent even though this was unnecessary. It is normal for the RSSI values obtained when scanning an AP at different times to vary.
    Screenshot from 2024-11-13 10-07-26
    Screenshot from 2024-11-13 10-09-05

  2. Why did you add a scan operation in the MESH_EVENT_PARENT_DISCONNECTED event? If the scan operation must be invoked, it is essential to call esp_mesh_set_self_organized(false, false); before the scan API. This ensures that mesh self-organization is disabled before the application layer calls the scan.
    Screenshot from 2024-11-13 17-15-49

    The same applies here: esp_mesh_set_self_organized(false, false); also needs to be called.
    image

  3. I think the invocation of the esp_mesh_set_self_organized(true, false) is unnecessary.
    Screenshot from 2024-11-13 17-26-22
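
To illustrate point 1, here is a minimal sketch in plain C of selecting a parent by BSSID rather than by comparing whole association structures. The `candidate_t` type, the function names and the 5 dB switching margin are assumptions made for this sketch (the margin is not part of the patch; it only reflects the remark that RSSI varies between scans). The layer and capacity checks from findClosestParent() are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BSSID_LEN             6
#define RSSI_SWITCH_MARGIN_DB 5  /* assumed hysteresis, not taken from the patch */

/* Hypothetical stand-in for the fields of wifi_ap_record_t used here. */
typedef struct {
    uint8_t bssid[BSSID_LEN];
    int8_t  rssi;
} candidate_t;

/* Two scan records describe the same parent when their BSSIDs match;
 * RSSI and association counts can differ from one scan to the next. */
bool same_parent(const uint8_t *a, const uint8_t *b)
{
    return memcmp(a, b, BSSID_LEN) == 0;
}

/* Returns the index of the strongest candidate worth switching to, or -1
 * if no candidate beats the current parent by more than the margin. */
int pick_better_parent(const candidate_t *scan, int count,
                       const uint8_t *current_bssid, int8_t current_rssi)
{
    assert(current_bssid != NULL);
    assert(scan != NULL || count == 0);

    int best = -1;
    int best_rssi = current_rssi + RSSI_SWITCH_MARGIN_DB;
    for (int i = 0; i < count; i++) {
        if (same_parent(scan[i].bssid, current_bssid)) {
            continue;  /* never "switch" to the parent we already have */
        }
        if (scan[i].rssi > best_rssi) {
            best = i;
            best_rssi = scan[i].rssi;
        }
    }
    return best;
}
```

Because the current parent is skipped by BSSID, a rescan that happens to report a different RSSI or association count for it can no longer trigger a pointless reselection.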

@michaelsimp
Author

Hi Zhangyanjiaoesp

Thanks for the feedback and software

I backed up the original files, applied the patch and then compared the files.

  • esp_task.h had a change in #define ESP_TASKD_EVENT_STACK only. I have applied this but haven't seen stack related crashes.
  • internal_communication\main\mesh_main.c had many changes to compare. Is this just adding the parent switch code, or are there other changes here which I need to apply to my application in addition to the 3 notes you gave me?

I will apply the 3 changes and retest and report, but I would like to know if you found and fixed anything or if your note 2 is my problem.
Thanks
Michael

@michaelsimp
Author

michaelsimp commented Nov 13, 2024

Hi Zhangyanjiaoesp

Note 1: This makes sense, thanks. Implemented.

Note 2:
My thinking was if I previously called ESP_ERROR_CHECK(esp_mesh_set_self_organized(false, false)); to scan for a closer parent, I don't have a self organized network, so I need to scan for a parent when the node gets disconnected. Is this not correct?

The second block you pointed out was taken from example project manual_networking inside function mesh_scan_done_handler(). In the else of parent_found (so !parent_found) it calls esp_wifi_scan_stop() and esp_wifi_scan_start() without esp_mesh_set_self_organized(false, false);
I have added esp_mesh_set_self_organized(false, false); before the esp_wifi_scan_stop() as per other instances of this.

Note 3: My thinking was to return the network to self organized again after the scan for better Parent. I will remove it soon (don't want to make too many changes at once) if it is not causing a problem for now. Please confirm.

I am not sure what the overall status of this is from your side, please advise:

  • Could you reproduce the broken mesh problem?
  • Did you find any problems in the library? I saw only a minor change to stack size in esp_task.h.
  • Were there problems in my original mesh_main.c from internal_networking other than your test code?
  • Is there a specific problem with my code in addition to not calling esp_mesh_set_self_organized(false, false); before a mesh scan?

I did some tests. It is much better at establishing a MESH_ROOT when the MESH_ROOT reboots or is powered off. But I still have a few problems:
14Nov.zip

Test1 root.txt: My Mesh_ROOT crashed line 395 Guru Meditation Error

Test 2 MESH_NODE.txt: See notes at top of file. After Wifi Scan for a new parent, Became a MESH_IDLE with a child node

Why does a node become a MESH_IDLE when there are a MESH_ROOT and MESH_NODEs very close (around -35 dBm) which are well below capacity? How do I fix them? Another wifi scan request does not fix it, as you can see in the log.

Why does a MESH_NODE connect to a MESH_IDLE when there are MESH_ROOT and MESH_NODEs available with strong RSSIs? Surely this should not be allowed. This MESH_NODE was otherwise healthy and so when I rebooted its parent which was MESH_IDLE, it connected to another MESH_NODE successfully.

I started looking through your many changes to the internal_networking mesh_main file and found some changes which I think are important, e.g. in event <MESH_EVENT_PARENT_DISCONNECTED>

image

Would it be possible for you to highlight any other important changes I need to merge into my application

P.S. I also had a node assert and reboot on esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer)
It looks like ESP_ERROR_CHECK is not a good strategy. I don't know which argument could be wrong?
Any advice?

ESP_ERROR_CHECK failed: esp_err_t 0x4008 (ESP_ERR_MESH_ARGUMENT) at 0x4200ff9d
file: "./main/platform/espidf/wifi/wifiMesh.cpp" line 342
func: void findClosestParent(int)
expression: esp_mesh_set_parent(&parent, (mesh_addr_t *)&parent_assoc.mesh_id, my_type, my_layer)

abort() was called at PC 0x4037e45f on core 0

Backtrace: 0x40375f06:0x3fcd2db0 0x4037e469:0x3fcd2dd0 0x40386f5d:0x3fcd2df0 0x4037e45f:0x3fcd2e60 0x4200ff9d:0x3fcd2e90 0x420105ea:0x3fcd3150 0x42129bd2:0x3fcd31f0 0x4212a251:0x3fcd3230 0x4212a368:0x3fcd3280 0x4037f062:0x3fcd32a0


ELF file SHA256: ea203e8cd

@michaelsimp
Author

Also the change above adding to event <MESH_EVENT_PARENT_DISCONNECTED>

        if (!esp_mesh_get_self_organized()) {
            printf(">>>%d, set true, true\n",__LINE__);
            esp_mesh_set_self_organized(true, true); // vote a new root 
        }

...undoes my switch to a close parent after manual scan.

@michaelsimp
Author

michaelsimp commented Nov 14, 2024

After more testing and switching back modifications, I have the following summary:

I think you can ignore the earlier crash reports I posted today. I was working on something else and enabled PSRAM, and I think things got a little kludgy and slow. OTA HTTPS started failing part way through. I have disabled PSRAM again and everything seems reliable again. I was not using it yet and had it configured like this, so I didn't think it would have any effect.
image
Anyway will put PSRAM on hold until later in another ticket after more tests on my side.

In event <MESH_EVENT_PARENT_DISCONNECTED> if I have this code, the network establishes itself again reliably and everything looks quite stable. BUT: It immediately undoes my switch to a close parent after manual scan.

    case MESH_EVENT_PARENT_DISCONNECTED: {
        mesh_event_disconnected_t *disconnected = (mesh_event_disconnected_t *)event_data;
        ESP_LOGI(TAG, "<MESH_EVENT_PARENT_DISCONNECTED>reason:%d", disconnected->reason);
        mesh_layer = esp_mesh_get_layer();
        mesh_netifs_stop();
        wifiConnected = false;
        ESP_LOGW(TAG, "WiFi Disconnected");
        currentRSSI = NO_RSSI;

        printf(">>>last layer = %d, layer = %d\n", last_layer, mesh_layer);

        if (!esp_mesh_get_self_organized()) {
            printf(">>>%d, set true, true\n",__LINE__);
            esp_mesh_set_self_organized(true, true); // vote a new root 
        }
    }
    break;

Whereas if I have this code, the MESH_NODE parent switch works nicely, but the mesh breaks after the MESH_ROOT reboots or powers off.

    case MESH_EVENT_PARENT_DISCONNECTED: {
        mesh_event_disconnected_t *disconnected = (mesh_event_disconnected_t *)event_data;
        ESP_LOGI(TAG, "<MESH_EVENT_PARENT_DISCONNECTED>reason:%d", disconnected->reason);
        mesh_layer = esp_mesh_get_layer();
        mesh_netifs_stop();
        wifiConnected = false;
        ESP_LOGW(TAG, "WiFi Disconnected");
        currentRSSI = NO_RSSI;

        printf(">>>last layer = %d, layer = %d\n", last_layer, mesh_layer);

        if (disconnected->reason == WIFI_REASON_ASSOC_TOOMANY) {
            esp_mesh_set_self_organized(false, false);
            esp_wifi_scan_stop();
            scan_config.show_hidden = 1;
            scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
            esp_wifi_scan_start(&scan_config, 0);
        }
    }
    break;
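The trade-off between the two handlers above can be pictured as a small decision function. This is a hypothetical host-side sketch only, not tested on hardware: the flag `manual_switch_pending` (set by the application just before it calls esp_mesh_set_parent()) and the enum names are assumptions introduced for illustration, and it assumes a deliberate parent switch also raises MESH_EVENT_PARENT_DISCONNECTED.

```c
#include <stdbool.h>

/* Hypothetical actions; each comment names the real ESP-IDF call it would map to. */
typedef enum {
    ACTION_KEEP_MANUAL,   /* leave esp_mesh_set_self_organized(false, false) in effect */
    ACTION_MANUAL_SCAN,   /* esp_mesh_set_self_organized(false, false), then esp_wifi_scan_start() */
    ACTION_SELF_ORGANIZE  /* esp_mesh_set_self_organized(true, true) so a new root can be voted */
} disconnect_action_t;

/*
 * manual_switch_pending: assumed application flag, true while the node is
 *                        deliberately moving to a closer parent.
 * parent_full:           true when the disconnect reason was
 *                        WIFI_REASON_ASSOC_TOOMANY.
 */
static disconnect_action_t choose_disconnect_action(bool manual_switch_pending,
                                                    bool parent_full)
{
    if (manual_switch_pending) {
        /* The disconnect was caused by our own switch: do not undo it. */
        return ACTION_KEEP_MANUAL;
    }
    if (parent_full) {
        /* The parent rejected us for capacity: look for another one manually. */
        return ACTION_MANUAL_SCAN;
    }
    /* Unexpected loss (e.g. root rebooted): let the mesh re-form itself. */
    return ACTION_SELF_ORGANIZE;
}
```

The handler would call this once per MESH_EVENT_PARENT_DISCONNECTED and clear the flag after the new parent connects. Whether this actually keeps the manual parent choice while still recovering from a lost root would need to be confirmed on real devices.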

@zhangyanjiaoesp
Collaborator

@michaelsimp
The following answers your question:

  1. The patch I provided is the calling method I believe to be correct. I was unable to reproduce your issue locally, as the devices were relatively close to each other during my testing. I cannot deploy my test setup in the same way you have. However, I did test the scenario of whether other nodes can correctly elect a new root after a root power off, and the result was successful.

The second block you pointed out was taken from example project manual_networking inside function mesh_scan_done_handler(). In the else of parent_found (so !parent_found) it calls esp_wifi_scan_stop() and esp_wifi_scan_start() without esp_mesh_set_self_organized(false, false);
I have added esp_mesh_set_self_organized(false, false); before the esp_wifi_scan_stop() as per other instances of this.

In the manual_networking example, esp_mesh_set_self_organized(false, false) is called only once throughout the entire process, which ensures that the network remains non-self-organized throughout. However, in your project, esp_mesh_set_self_organized(false, false), esp_mesh_set_self_organized(true, false), and esp_mesh_set_self_organized(true, true) are all called. To ensure that the self-organized network is disabled during the user's scan, it is necessary to call esp_mesh_set_self_organized(false, false). See the doc
image

It looks like ESP_ERROR_CHECK is not a good strategy. I don't know which argument could be wrong?
Any advice?

Unless the return value check is critical, please avoid calling ESP_ERROR_CHECK, as it may cause the program to abort. This is something we discussed previously. Regarding the esp_mesh_set_parent() issue, could you provide how you have set the parameters? I can help check why the parameter error is occurring.
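As a minimal sketch of checking a return value without aborting, the helper below logs the failure and lets the caller continue. It is written to compile off-target, so `esp_err_t`, `ESP_OK` and `ESP_ERR_MESH_ARGUMENT` are local stand-ins (the 0x4008 value is the one printed in the log above), and `call_succeeded` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins so the sketch compiles on a host; on target these come from
 * esp_err.h and esp_mesh.h. */
typedef int esp_err_t;
#define ESP_OK 0
#define ESP_ERR_MESH_ARGUMENT 0x4008 /* value as printed in the log above */

/* Hypothetical helper: report the failure and return false instead of
 * calling abort() the way ESP_ERROR_CHECK does. */
static bool call_succeeded(esp_err_t err, const char *what)
{
    if (err == ESP_OK) {
        return true;
    }
    fprintf(stderr, "%s failed: 0x%x, skipping this attempt\n",
            what, (unsigned)err);
    return false;
}
```

On target, the result of esp_mesh_set_parent() could be passed to such a helper so that a rejected argument leaves the node on its current parent until the next scan, and esp_err_to_name() could be used to print a readable error name.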

  1. The crash issue is caused by the error check you added when calling the esp_mesh_scan_get_ap_ie_len() function. After the AP is retrieved, it returns -1.
    image
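A host-side sketch of walking the scan results without aborting when the length query fails. It assumes the -1 mentioned above is ESP_FAIL; `fake_get_ap_ie_len` is a made-up stand-in for esp_mesh_scan_get_ap_ie_len() so the loop can run off-target, and `count_usable_records` is a hypothetical name.

```c
/* Stand-ins so the sketch compiles on a host; on target these come from esp_err.h. */
typedef int esp_err_t;
#define ESP_OK 0
#define ESP_FAIL (-1)

/* Fake standing in for esp_mesh_scan_get_ap_ie_len(): succeeds while
 * records remain, then returns ESP_FAIL. */
static int fake_remaining;

static esp_err_t fake_get_ap_ie_len(int *len)
{
    if (fake_remaining <= 0) {
        return ESP_FAIL;
    }
    fake_remaining--;
    *len = 24; /* arbitrary length, for illustration only */
    return ESP_OK;
}

/* Count the records that could be read, stopping quietly at the first failure. */
static int count_usable_records(int ap_count)
{
    int usable = 0;
    for (int i = 0; i < ap_count; i++) {
        int ie_len = 0;
        if (fake_get_ap_ie_len(&ie_len) != ESP_OK) {
            break; /* nothing more to read: stop the loop, do not abort */
        }
        usable++;
    }
    return usable;
}
```

In the real scan-done handler the loop body would go on to read the record and compare RSSI; the point here is only that a failed length query ends the loop instead of rebooting the node.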

In event <MESH_EVENT_PARENT_DISCONNECTED> if I have this code, the network establishes itself again reliably and everything looks quite stable. BUT: It immediately undoes my switch to a close parent after manual scan.

Yes, I’ve noticed that as well. I think I need to find a better solution.

Note 3: My thinking was to return the network to self organized again after the scan for better Parent. I will remove it soon (don't want to make too many changes at once) if it is not causing a problem for now. Please confirm.

I still think this part of the code is unnecessary. Returning to the self-organized network after finding a better parent doesn’t seem meaningful, especially since you need to periodically scan for nearby APs, and disable the self-organized network during scanning. As I understand it, you only need the self-organized network when selecting the root; at other times, you prefer to actively choose a better parent, correct?

  1. I will check the log you sent and get back to you.

@zhangyanjiaoesp
Collaborator

zhangyanjiaoesp commented Nov 14, 2024

@michaelsimp

I did some tests. It is much better at establishing a MESH_ROOT when the MESH_ROOT reboots or is powered off. But I still have a few problems: 14Nov.zip

Test1 root.txt: My Mesh_ROOT crashed line 395 Guru Meditation Error

I need the ELF file from the build that crashed in order to check the backtrace. By the way, had you applied all of the following changes when testing?

wifi_lib_s3_1104.zip
0001-fix-dhcp-add-debug-log-for-dhcp-server.zip

Test 2 MESH_NODE.txt: See notes at top of file. After Wifi Scan for a new parent, Became a MESH_IDLE with a child node

The device is idle because it has not yet connected to the network. Its child should return a parent idle error.
image

Screenshot from 2024-11-14 16-12-55
