esp32 network code runs in different task #2912

mikee47 · 2024-11-19T14:11:03Z

I'm getting unpredictable hangs and crashes with an esp32-s2 project which communicates with an SPI touch display.
The details aren't relevant as this issue applies to any resource which isn't thread-aware, so that's pretty much everything in the framework.

The problem is because Sming code is split across two tasks, not just one.
Referring to PR #2371, all code runs in the Sming task except for stuff coming in over the network, which runs in the tiT (tcpip) task.
The FreeRTOS scheduler runs both these tasks in a round-robin fashion.
If an SPI request is executing inside the tiT task (as a result of, say, an incoming websocket command request) it can get pre-empted by the higher-priority Sming task, resulting in a crash/hang.

The whole point of Sming is that it avoids these multitasking issues, ~~but as previously discussed the IDF black-boxes the WiFi stack which itself depends on a FreeRTOS environment.~~ Fortunately the lwip/tcpip stuff is accessible for ~~hacking~~ inspection and revision.

Possible solutions include:

Could we implement a context switch to the Sming task at a suitable point in the tiT task? Not sure.
Use a sempahore/mutex which would be used in the network stack somewhere and in the main Sming task loop.
Find a way to get Sming to run in the tiT task rather than creating its own.
Modify the tcpip stuff so it's polled from the main Sming loop rather than using its own task. This would be my preferred solution as it's how the other architectures roll.

I modified the debug output to include the current task handle. In combination with TaskStat this shows the issue.

@@ -14,11 +14,17 @@
  *
  ****/
 
 #pragma once
 
-#include "FakePgmSpace.h"
+#include <FakePgmSpace.h>
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wunused-parameter"
+#include <freertos/FreeRTOS.h>
+#include <freertos/task.h>
+#pragma GCC diagnostic pop
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
@@ -74,12 +80,12 @@ extern uint32_t system_get_time();
 		m_printf(fmtbuf, __LINE__, ##__VA_ARGS__);                                                                     \
 	}))
 #else
 #define debug_e(fmt, ...)                                                                                              \
 	(__extension__({                                                                                                   \
-		PSTR_ARRAY(fmtbuf, "%u " fmt "\r\n");                                                                          \
-		m_printf(fmtbuf, system_get_time(), ##__VA_ARGS__);                                                            \
+		PSTR_ARRAY(fmtbuf, "%u %x " fmt "\r\n");                                                                          \
+		m_printf(fmtbuf, system_get_time(), xTaskGetCurrentTaskHandle(), ##__VA_ARGS__);                                                            \
 	}))
 #endif

The text was updated successfully, but these errors were encountered:

mikee47 · 2024-11-20T13:04:14Z

Option (4) is a lot of work and would involve removing the entire esp_netif stack. That would be a great option to have in the future.

After digging about in the code I can't think of a way to implement options (1) or (2).

That leaves option (3). I tried getting rid of the Sming task and hooking into the tcpip_thread function which runs the tiT task. It seems to work except we never get a network address. Feels like there's a deadlock somewhere so I've left that - it's a bit hacky.

However, lwip conveniently provides the tcpip_callback function which can be used to run code in the tcpip thread context. That thread has its own messaging system (for dealing with input from the wifi task, for example) so we can just use that directly for our System::queueCallback calls. Simple change and it works fine.

This means all Sming code (in a non-interrupt context) runs in the tiT task, and the Sming task remains only to service the IDF task queue and watchdog timer. Stack sizes will require adjusting: tiT is currently 8192, Sming is 16384.

My current fix still has messages posted to the Sming event queue, which are then dispatched into the tcpip event queue. This two-step mechanism obviously makes it less efficient but it's necessary as other IDF components such as WiFi require the queue, and the messages must be serialised correctly along with everything else.

Non-network applications (DISABLE_NETWORK) are not affected and still require the Sming task with custom event loop.

mikee47 added the Bug label Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

esp32 network code runs in different task #2912

esp32 network code runs in different task #2912

mikee47 commented Nov 19, 2024 •

edited

Loading

mikee47 commented Nov 20, 2024

esp32 network code runs in different task #2912

esp32 network code runs in different task #2912

Comments

mikee47 commented Nov 19, 2024 • edited Loading

mikee47 commented Nov 20, 2024

mikee47 commented Nov 19, 2024 •

edited

Loading