
Simplify NetworkChannel interface #150

Merged: 22 commits merged into main from new-network-channel-interface on Dec 10, 2024

Conversation

@LasseRosenow (Collaborator) commented Dec 4, 2024

This pull request removes the try_connect function from the NetworkChannel interface.
The runtime now only tells the NetworkChannel implementation, such as CoapUdpIpChannel, to open or close a connection. The channel then takes care of connection establishment internally.

The runtime can query the NetworkChannel state using get_connection_state.
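
For illustration, a minimal sketch of what the simplified interface amounts to is given below. All member names, the state enum, and return types here are assumptions drawn from this description, not the repository's actual definitions.

// Sketch of the simplified NetworkChannel interface described above (illustrative only).
typedef enum {
  NETWORK_CHANNEL_STATE_CLOSED,
  NETWORK_CHANNEL_STATE_CONNECTION_IN_PROGRESS,
  NETWORK_CHANNEL_STATE_CONNECTED,
} NetworkChannelState;

typedef struct NetworkChannel NetworkChannel;

struct NetworkChannel {
  // The runtime only asks the channel to open or close a connection; the
  // implementation (e.g. CoapUdpIpChannel) handles connection establishment
  // and reconnection internally.
  int (*open_connection)(NetworkChannel *self);
  void (*close_connection)(NetworkChannel *self);

  // The runtime queries the current state instead of driving try_connect.
  NetworkChannelState (*get_connection_state)(NetworkChannel *self);
};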

@LasseRosenow LasseRosenow changed the title Migrate to simplified network channel interface Simplify NetworkChannel interface Dec 4, 2024
github-actions bot (Contributor) commented Dec 4, 2024

Benchmark results after merging this PR:


Performance:

PingPongUc:
Best Time: 141.067 msec
Worst Time: 196.152 msec
Median Time: 159.576 msec

PingPongC:
Best Time: 169.724 msec
Worst Time: 176.871 msec
Median Time: 170.131 msec

ReactionLatencyUc:
Best latency: 23676 nsec
Median latency: 59825 nsec
Worst latency: 758672 nsec

ReactionLatencyC:
Best latency: 20470 nsec
Median latency: 59803 nsec
Worst latency: 180593 nsec

Memory usage:

PingPongUc:
text data bss dec hex filename
39444 752 8496 48692 be34 bin/PingPongUc

PingPongC:
text data bss dec hex filename
46044 872 360 47276 b8ac bin/PingPongC

ReactionLatencyUc:
text data bss dec hex filename
29689 736 2112 32537 7f19 bin/ReactionLatencyUc

ReactionLatencyC:
text data bss dec hex filename
41666 840 360 42866 a772 bin/ReactionLatencyC

@erlingrj erlingrj self-requested a review December 5, 2024 17:46
@erlingrj (Collaborator) left a comment


This is looking very good; I like the new organization a lot. There is an important question about possible race conditions (current and future) with respect to updating the channel state.

src/platform/posix/tcp_ip_channel.c (outdated, resolved)
/**
* @brief Main loop of the TcpIpChannel.
*/
static void *_TcpIpChannel_worker_thread(void *untyped_self) {
Collaborator

I like this new organization of the code. Great!


if (bytes_send < 0) {
-   LF_ERR(NET, "write failed errno=%d", errno);
+   LF_ERR(NET, "TcpIpChannel: Write failed errno=%d", errno);
    switch (errno) {
    case ETIMEDOUT:
    case ENOTCONN:
Collaborator

Is it necessary to call _TcpIpChannel_update_state here? We might get some very subtle races here: when the worker thread also detects that the socket is closed, it will initiate a reconnect, right? Then we don't want to overwrite the new state if we also fail to send some data.

If we are guaranteed that the worker thread will not block forever on receive, and that it will detect anything that is detected here, then I propose that we only read the channel state from the runtime side, and only write to it from the worker thread. What do you think?

Collaborator Author

Yes, there might be a race if the worker thread has already started a reconnection. We could guard against this, though.
I think it is better to update the state straight away if we see that something is wrong.

A very robust way could be to mutex the whole worker-thread while loop and also the whole send_blocking function (and of course the get_state function too). That way the two should not be able to interfere with each other.
We would then have to change back to making the TCP receive non-blocking, though.
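
Roughly, the coarse-grained locking idea sketched as code: one mutex serializing the worker loop body, send_blocking, and the state getter. The names below are illustrative and are not the code from the Fix race conditions commit.

#include <pthread.h>

// Illustrative sketch of the proposal above; state_mutex is assumed to be
// initialized elsewhere with pthread_mutex_init.
typedef struct {
  pthread_mutex_t state_mutex;
  int state; // e.g. connected / lost connection
} ChannelSketch;

static void worker_thread_iteration(ChannelSketch *self) {
  pthread_mutex_lock(&self->state_mutex);
  // ...non-blocking receive and any state updates happen here...
  pthread_mutex_unlock(&self->state_mutex);
}

static int send_blocking(ChannelSketch *self, const void *msg) {
  (void)msg;
  pthread_mutex_lock(&self->state_mutex);
  // ...write to the socket; update self->state on failure...
  pthread_mutex_unlock(&self->state_mutex);
  return 0;
}

static int get_state(ChannelSketch *self) {
  pthread_mutex_lock(&self->state_mutex);
  int current = self->state;
  pthread_mutex_unlock(&self->state_mutex);
  return current;
}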

@LasseRosenow (Collaborator, Author) Dec 9, 2024

I have now implemented a possible solution in the Fix race conditions commit

@LasseRosenow (Collaborator, Author) commented Dec 9, 2024

@erlingrj I am thinking about one thing: it seems to me that the runtime really only cares about whether the channel is connected or not. So maybe it is reasonable to replace the get_connection_state function with a simple boolean function: is_connected?

Or do you think there could be reasons why we need more information in the future?

@LasseRosenow LasseRosenow force-pushed the new-network-channel-interface branch from 582fea4 to f310d69 Compare December 9, 2024 17:41
github-actions bot (Contributor) commented Dec 9, 2024

Memory usage after merging this PR will be:


action_empty_test_c

from to increase (%)
text 59655 59466 -0.32
data 744 744 0.00
bss 10112 10112 0.00
total 70511 70322 -0.27

action_microstep_test_c

from to increase (%)
text 60486 60297 -0.31
data 752 752 0.00
bss 10112 10112 0.00
total 71350 71161 -0.26

action_overwrite_test_c

from to increase (%)
text 60323 60134 -0.31
data 744 744 0.00
bss 10112 10112 0.00
total 71179 70990 -0.27

action_test_c

from to increase (%)
text 60259 60070 -0.31
data 752 752 0.00
bss 10112 10112 0.00
total 71123 70934 -0.27

deadline_test_c

from to increase (%)
text 56102 55913 -0.34
data 760 760 0.00
bss 10784 10784 0.00
total 67646 67457 -0.28

delayed_conn_test_c

from to increase (%)
text 61516 61327 -0.31
data 744 744 0.00
bss 10112 10112 0.00
total 72372 72183 -0.26

event_payload_pool_test_c

from to increase (%)
text 18330 18330 0.00
data 624 624 0.00
bss 320 320 0.00
total 19274 19274 0.00

event_queue_test_c

from to increase (%)
text 27597 27597 0.00
data 736 736 0.00
bss 480 480 0.00
total 28813 28813 0.00

nanopb_test_c

from to increase (%)
text 42888 42888 0.00
data 904 904 0.00
bss 320 320 0.00
total 44112 44112 0.00

port_test_c

from to increase (%)
text 61464 61275 -0.31
data 744 744 0.00
bss 10112 10112 0.00
total 72320 72131 -0.26

reaction_queue_test_c

from to increase (%)
text 27319 27319 0.00
data 736 736 0.00
bss 480 480 0.00
total 28535 28535 0.00

request_shutdown_test_c

from to increase (%)
text 60458 60269 -0.31
data 744 744 0.00
bss 10112 10112 0.00
total 71314 71125 -0.27

startup_test_c

from to increase (%)
text 55801 55612 -0.34
data 752 752 0.00
bss 10784 10784 0.00
total 67337 67148 -0.28

tcp_channel_test_c

from to increase (%)
text 91462 92381 1.00
data 1224 1224 0.00
bss 21344 21376 0.15
total 114030 114981 0.83

timer_test_c

from to increase (%)
text 55692 55503 -0.34
data 744 744 0.00
bss 10784 10784 0.00
total 67220 67031 -0.28

@LasseRosenow (Collaborator, Author)

I have now also checked that the POSIX federated example works :) So I mark this PR as ready 🥳

@LasseRosenow LasseRosenow marked this pull request as ready for review December 9, 2024 17:59
@LasseRosenow LasseRosenow requested a review from erlingrj December 9, 2024 17:59
@tanneberger (Member) left a comment


LGTM

So the CoAP API is:

  • /connect (telling the other federate that I am ready to receive messages)
  • /message (message passing)
  • /disconnect (graceful disconnect)

@LasseRosenow (Collaborator, Author)

LGTM

So the CoAP API is:

  • /connect (telling the other federate that I am ready to receive messages)
  • /message (message passing)
  • /disconnect (graceful disconnect)

Yes, but I don't have strong opinions on that design; it was just the most obvious way to do it for me.
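
For concreteness, the resource layout could be imagined roughly as below. The handler names and signatures are hypothetical and are not the actual CoapUdpIpChannel registration code.

#include <stddef.h>

// Hypothetical sketch of the three CoAP resources listed above.
typedef void (*coap_handler_t)(void *channel, const void *payload, size_t len);

static void handle_connect(void *ch, const void *p, size_t n) { (void)ch; (void)p; (void)n; }
static void handle_message(void *ch, const void *p, size_t n) { (void)ch; (void)p; (void)n; }
static void handle_disconnect(void *ch, const void *p, size_t n) { (void)ch; (void)p; (void)n; }

static const struct {
  const char *path;
  coap_handler_t handler;
} coap_resources[] = {
    {"/connect", handle_connect},       // peer announces it is ready to receive
    {"/message", handle_message},       // message passing
    {"/disconnect", handle_disconnect}, // graceful disconnect
};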

@erlingrj (Collaborator) left a comment


Thanks for addressing my comments, Lasse, and for continuing to work on the PR. I am still hoping we can find a better solution for TcpIpChannel. I think you have solved the race conditions, but we sacrifice too much. Ideally we would avoid busy-polling the receive socket while still allowing data to be sent and received concurrently.

We could, for instance, do a select on the receive socket with a timeout of 1 s. If we time out, we grab a mutex and check whether a flag was set from send_blocking signalling a problem transmitting data, and then possibly instigate a reconnection.
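
A minimal sketch of that idea, assuming a made-up send_failed flag and mutex rather than the PR's actual fields:

#include <pthread.h>
#include <stdbool.h>
#include <sys/select.h>

static bool send_failed_flag; // set by send_blocking on a failed write
static pthread_mutex_t flag_mutex = PTHREAD_MUTEX_INITIALIZER;

static void worker_wait_for_data(int socket_fd) {
  fd_set readfds;
  FD_ZERO(&readfds);
  FD_SET(socket_fd, &readfds);

  struct timeval timeout = {.tv_sec = 1, .tv_usec = 0};
  int ret = select(socket_fd + 1, &readfds, NULL, NULL, &timeout);

  if (ret == 0) { // timed out: no incoming data
    pthread_mutex_lock(&flag_mutex);
    bool must_reconnect = send_failed_flag;
    send_failed_flag = false;
    pthread_mutex_unlock(&flag_mutex);
    if (must_reconnect) {
      // Instigate a reconnection from the worker thread.
    }
  } else if (ret > 0 && FD_ISSET(socket_fd, &readfds)) {
    // Data available: receive it without blocking indefinitely.
  }
}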

src/platform/posix/tcp_ip_channel.c (outdated, resolved)
self->receive_callback(self->federated_connection, &self->output);
} else {
LF_ERR(NET, "TcpIpChannel: Error receiving message %d", ret);
pthread_mutex_lock(&self->state_mutex);
Collaborator

This means that we cannot send concurrently with receiving? It seems a bit too constraining?

}
pthread_mutex_unlock(&self->state_mutex);
_env->platform->wait_for(_env->platform, TCP_IP_CHANNEL_WORKER_THREAD_MAIN_LOOP_SLEEP);
Collaborator

It is a little unfortunate that we have to set a sleep here. The duration of the sleep adds to the worst-case latency between two federates. It also means that we are always polling the receive socket and can never really go to low-power sleep.

LF_ERR(NET, "TcpIpChannel: Could not encode protobuf");
return LF_ERR;
}
pthread_mutex_lock(&self->state_mutex);
Collaborator

If the worker thread is constantly polling the socket, then I think it will detect the closed socket very quickly anyway, so we would not need to handle this from the send_blocking side and we wouldn't need to lock any mutex here.

However, I am unsure whether we want to have the worker thread busy-poll the receive socket. Is it possible for the worker thread to block on both the receiving socket AND a condition variable? Then we could signal that condition variable from send_blocking if we fail to send. That would wake up the worker thread and let it handle updating the state, etc.

@LasseRosenow (Collaborator, Author)

Thanks for the feedback! All good points; I will try to find a better solution tomorrow.

@erlingrj (Collaborator) commented Dec 9, 2024

@erlingrj I am thinking about one thing: it seems to me that the runtime really only cares about whether the channel is connected or not. So maybe it is reasonable to replace the get_connection_state function with a simple boolean function: is_connected?

Or do you think there could be reasons why we need more information in the future?

Let's do this. I think it is better to keep the API as minimal as possible and potentially expand it later.

@LasseRosenow LasseRosenow force-pushed the new-network-channel-interface branch from 0ae92fe to a5bb763 Compare December 9, 2024 22:40
@LasseRosenow (Collaborator, Author)

So I hope I have addressed the parallel send and receive issue now.
I am now using a pipe between send_blocking and the worker thread.
The worker thread uses select to listen on both the channel socket and the pipe.
I also fixed some issues where reset_socket caused the other federate to close the channel, which in my opinion should only happen from within the runtime, because otherwise it is impossible for the TCP/IP channel to recover.
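
A self-contained sketch of that pipe-based wake-up; all names are illustrative, not the actual tcp_ip_channel.c code.

#include <sys/select.h>
#include <unistd.h>

// The worker thread selects on both the channel socket and the pipe's read
// end, while send_blocking writes a byte into the pipe to report a failed send.
static int send_failed_pipe[2]; // [0] = read end (worker), [1] = write end (sender)

static int init_send_failed_pipe(void) { return pipe(send_failed_pipe); }

static void signal_send_failed(void) {
  char byte = 1;
  (void)write(send_failed_pipe[1], &byte, 1);
}

static void worker_wait_for_data_or_failure(int socket_fd) {
  fd_set readfds;
  FD_ZERO(&readfds);
  FD_SET(socket_fd, &readfds);
  FD_SET(send_failed_pipe[0], &readfds);

  int max_fd = (socket_fd > send_failed_pipe[0]) ? socket_fd : send_failed_pipe[0];
  if (select(max_fd + 1, &readfds, NULL, NULL, NULL) > 0) {
    if (FD_ISSET(send_failed_pipe[0], &readfds)) {
      char byte;
      (void)read(send_failed_pipe[0], &byte, 1); // drain, then start a reconnect
    }
    if (FD_ISSET(socket_fd, &readfds)) {
      // Incoming data: hand off to the receive path.
    }
  }
}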

@LasseRosenow (Collaborator, Author)

I have now changed it to use an eventfd_t instead of a pipe. This way it should also be compatible with Zephyr OS.
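
As a drop-in replacement for the pipe in the sketch above, the eventfd variant could look roughly like this. The field name mirrors send_failed_event_fds from the reviewed snippet further down; the helper functions are assumptions.

#include <sys/eventfd.h>

// A single eventfd that select() can wait on alongside the socket; available
// on Linux and Zephyr.
static int send_failed_event_fds;

static int init_send_failed_event(void) {
  send_failed_event_fds = eventfd(0, 0);
  return (send_failed_event_fds < 0) ? -1 : 0;
}

static void signal_send_failed_event(void) {
  eventfd_write(send_failed_event_fds, 1); // wakes the worker thread's select()
}

static void clear_send_failed_event(void) {
  eventfd_t value;
  eventfd_read(send_failed_event_fds, &value); // resets the counter to zero
}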

@LasseRosenow LasseRosenow force-pushed the new-network-channel-interface branch from ef78b17 to a85afb0 Compare December 10, 2024 17:38
@erlingrj (Collaborator) left a comment


This looks very good to me. Just minor nitpicks. OK to merge after addressing them!

src/platform/posix/tcp_ip_channel.c (resolved)
src/platform/posix/tcp_ip_channel.c (resolved)
max_fd = (socket > self->send_failed_event_fds) ? socket : self->send_failed_event_fds;

// Wait for data or cancel if send_failed externally
select(max_fd + 1, &readfds, NULL, NULL, NULL);
Collaborator

Should we check the return value of select?

Collaborator Author

I added an ERR log if it returns negative now. :)

if (_CoapUdpIpChannel_send_coap_message_with_payload(self, &self->remote, "/message", _client_send_blocking_callback,
message)) {
// Wait until the response handler confirms the ack or times out
while (!self->send_ack_received) {
Collaborator

I guess this is the only way for us to guarantee in-order reliable delivery, right?

Collaborator Author

Actually, this was implemented to achieve a blocking send. Just like with TCP, we want the send to block, so we wait for the message to be acked.

But you are correct that this also gives us something of an in-order guarantee; I am just not sure whether it covers all cases. We can think about this more later, I think.

Collaborator

Yes, this is a sane starting point; we can think about edge cases, intermittent connections, etc. later.

@LasseRosenow (Collaborator, Author)

The review should be addressed now :)

@erlingrj erlingrj merged commit 7a9fc7f into main Dec 10, 2024
7 of 8 checks passed
@erlingrj erlingrj deleted the new-network-channel-interface branch December 10, 2024 20:06
Contributor

Coverage after merging new-network-channel-interface into main will be

71.36%

Coverage Report
File    Stmts    Branches    Funcs    Lines    Uncovered Lines
src
   action.c81.90%69.23%100%85.33%120–121, 24, 43–46, 49, 51, 53, 56–58, 63–64, 73–74, 85–86
   builtin_triggers.c90.91%70%100%96.77%14, 18, 40, 43
   connection.c78.52%51.16%100%88.66%10, 104, 11, 110, 123–124, 136–137, 14, 14, 143, 145, 16–17, 21–22, 22, 22–23, 25, 27–28, 33, 48, 48, 48–49, 55, 60–62, 97
   environment.c78.35%55.56%84.62%83.33%12–13, 18, 20–21, 31, 35–36, 42–43, 51–52, 55–56, 60–61, 93–95
   event.c95.35%92.86%100%96.15%14–15
   federated.c5.61%2.97%7.69%6.76%101, 104–105, 105, 105–106, 108–109, 11, 111, 115–116, 118–120, 123, 125–129, 13, 13, 13, 130, 132–134, 137–139, 139, 139, 14, 140, 140, 140–142, 144, 147–148, 15, 150–154, 156–159, 16, 160–161, 163, 163, 163–166, 168, 168, 168–169, 17, 17, 17, 170, 170, 170–171, 175–176, 176, 176, 179–180, 184–186, 188, 188, 188, 190–194, 197, 197, 197–199, 20, 200, 203–204, 204, 204–205, 207–208, 21, 211–212, 217–218, 218, 218–219, 22, 22, 22, 221, 223, 223, 223–226, 226, 226, 226, 226–229, 23, 230–238, 24, 24, 24, 242, 245, 245, 245–247, 25, 251, 254–255, 255, 255, 255–259, 26, 260–263, 265, 27, 27, 27, 271–273, 28, 28, 28, 28, 28, 285–286, 289, 29, 290–292, 294, 294, 294–295, 299–300, 300, 300, 302, 304–305, 305, 305–306, 306, 306–307, 307, 307–308, 308, 308–309, 309, 309, 31, 310, 310, 310, 312, 312, 312–313, 313, 313–314, 314, 314–315, 315, 315, 317, 34, 34, 34, 34, 34–35, 38, 42–43, 45–48, 50, 50, 50–51, 51, 51, 53, 53, 53–55, 55, 55–57, 61–62, 66–67, 69–72, 74, 76, 76, 76–77, 77, 77–78, 78, 78–79, 79, 79, 82–83, 85–88, 9, 90, 90, 90–93, 95, 97, 97, 97–98
   logging.c87.50%80%100%88.64%24, 37–39, 46, 59–60
   network_channel.c57.69%50%100%58.82%36, 36, 36, 36, 41–44, 49–50, 53
   port.c78.08%45.83%100%93.33%10, 10, 10, 16, 20, 25, 25–27, 27, 27–28, 39, 39, 39–40
   queues.c89.94%80.36%100%94.06%108, 113, 119, 21–23, 47–48, 60–61, 84–88, 91–92
   reaction.c70.34%54.55%100%78.26%15, 17, 21–22, 28–31, 31, 31–32, 42, 45, 47, 52–53, 53, 53–55, 55, 55–56, 73, 89–91, 91, 91–94, 94, 94–95
   reactor.c69.33%51.52%100%82.28%10, 101–102, 14–19, 22, 28, 30, 32–37, 37, 37–38, 38, 38, 43, 55, 58–59, 59, 59–60, 60, 60–61, 63, 77–78, 81–82, 82, 82–83, 83, 83–84, 86, 91
   serialization.c50%50%50%50%16–17, 26–27, 33–35, 38–40
   tag.c40.19%31.48%60%47.92%14, 14–15, 17, 17–18, 23–24, 24, 24, 24, 24–25, 27, 27, 27, 27, 27–28, 30, 30, 30–31, 33–34, 34, 34–35, 37, 37, 37, 37, 37–38, 40, 40, 40, 40, 40–41, 43, 53–54, 63, 63–64, 83–85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85–87, 89
   timer.c95%66.67%100%100%14, 25
   trigger.c100%100%100%100%
   util.c0%0%0%0%10, 3–4, 4, 4–5, 5, 5–6, 8–9
src/platform/posix
   posix.c52.73%30%66.67%56%100, 100, 100–102, 106, 16, 18, 20–21, 34–36, 38–40, 48–49, 54–59, 59, 59–62, 62, 62–64, 67, 73–74, 78, 81, 92–94, 94, 94–96, 98–99
   tcp_ip_channel.c66.54%51.61%100%76.16%101, 103, 106–107, 117–118, 118, 118–119, 124–125, 125, 125–127, 131–132, 132, 132–134, 148, 154, 154, 154, 154, 154–155, 155, 155–156, 158, 158, 158–159, 168, 175–176, 180, 184, 184, 184–186, 186, 186–187, 189, 189, 189–191, 195, 195, 195–196, 216, 229–230, 230, 230–231, 237, 242–243, 243, 243–244, 244, 244–245, 247–248, 248, 248–249, 249, 249, 251–254, 264, 264–265, 265, 265–266, 290–291, 291, 291–293, 293, 293–294, 294, 294–295, 304, 304, 309, 311, 313, 315, 327–328, 344–346, 346,

Labels: None yet
Projects: None yet
3 participants