
Simplify NetworkChannel interface #150

Merged: 22 commits merged into main from new-network-channel-interface on Dec 10, 2024

Conversation

@LasseRosenow (Collaborator) commented Dec 4, 2024

This pull request removes the try_connect function from the NetworkChannel interface.
The runtime now only tells the NetworkChannel implementation, such as CoapUdpIpChannel, to open or close a connection. The channel then takes care of connection establishment internally.

The runtime can query the NetworkChannel state using get_connection_state.
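
For illustration, a minimal sketch of what the simplified interface amounts to is given below. All member names, the state enum, and return types here are assumptions drawn from this description, not the repository's actual definitions.

// Sketch of the simplified NetworkChannel interface described above (illustrative only).
typedef enum {
  NETWORK_CHANNEL_STATE_CLOSED,
  NETWORK_CHANNEL_STATE_CONNECTION_IN_PROGRESS,
  NETWORK_CHANNEL_STATE_CONNECTED,
} NetworkChannelState;

typedef struct NetworkChannel NetworkChannel;

struct NetworkChannel {
  // The runtime only asks the channel to open or close a connection; the
  // implementation (e.g. CoapUdpIpChannel) handles connection establishment
  // and reconnection internally.
  int (*open_connection)(NetworkChannel *self);
  void (*close_connection)(NetworkChannel *self);

  // The runtime queries the current state instead of driving try_connect.
  NetworkChannelState (*get_connection_state)(NetworkChannel *self);
};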

@LasseRosenow LasseRosenow changed the title Migrate to simplified network channel interface Simplify NetworkChannel interface Dec 4, 2024
github-actions bot (Contributor) commented Dec 4, 2024

Benchmark results after merging this PR:


Performance:

PingPongUc:
Best Time: 141.067 msec
Worst Time: 196.152 msec
Median Time: 159.576 msec

PingPongC:
Best Time: 169.724 msec
Worst Time: 176.871 msec
Median Time: 170.131 msec

ReactionLatencyUc:
Best latency: 23676 nsec
Median latency: 59825 nsec
Worst latency: 758672 nsec

ReactionLatencyC:
Best latency: 20470 nsec
Median latency: 59803 nsec
Worst latency: 180593 nsec

Memory usage:

PingPongUc:
text data bss dec hex filename
39444 752 8496 48692 be34 bin/PingPongUc

PingPongC:
text data bss dec hex filename
46044 872 360 47276 b8ac bin/PingPongC

ReactionLatencyUc:
text data bss dec hex filename
29689 736 2112 32537 7f19 bin/ReactionLatencyUc

ReactionLatencyC:
text data bss dec hex filename
41666 840 360 42866 a772 bin/ReactionLatencyC

@erlingrj erlingrj self-requested a review December 5, 2024 17:46
@erlingrj (Collaborator) left a comment


This is looking very good; I like the new organization a lot. There is an important question about possible race conditions (current and future) with respect to updating the channel state.

src/platform/posix/tcp_ip_channel.c (outdated, resolved)
/**
* @brief Main loop of the TcpIpChannel.
*/
static void *_TcpIpChannel_worker_thread(void *untyped_self) {
Collaborator

I like this new organization of the code. Great!


if (bytes_send < 0) {
-   LF_ERR(NET, "write failed errno=%d", errno);
+   LF_ERR(NET, "TcpIpChannel: Write failed errno=%d", errno);
    switch (errno) {
    case ETIMEDOUT:
    case ENOTCONN:
Collaborator

Is it necessary to call _TcpIpChannel_update_state here? We might get some very subtle races here: when the worker thread also detects that the socket is closed, it will initiate a reconnect, right? Then we don't want to overwrite the new state if we also fail to send some data.

If we are guaranteed that the worker thread will not block forever on receive, and that it will detect anything that is detected here, then I propose that we only read the channel state from the runtime side, and only write to it from the worker thread. What do you think?

Collaborator Author

Yes, there might be a race if the worker thread has already started a reconnection. We could guard against this, though.
I think it is better to update the state straight away if we see that something is wrong.

A very robust way could be to mutex the whole worker-thread while loop and also the whole send_blocking function (and of course the get_state function too). That way the two should not be able to interfere with each other.
We would then have to change back to making the TCP receive non-blocking, though.
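
Roughly, the coarse-grained locking idea sketched as code: one mutex serializing the worker loop body, send_blocking, and the state getter. The names below are illustrative and are not the code from the Fix race conditions commit.

#include <pthread.h>

// Illustrative sketch of the proposal above; state_mutex is assumed to be
// initialized elsewhere with pthread_mutex_init.
typedef struct {
  pthread_mutex_t state_mutex;
  int state; // e.g. connected / lost connection
} ChannelSketch;

static void worker_thread_iteration(ChannelSketch *self) {
  pthread_mutex_lock(&self->state_mutex);
  // ...non-blocking receive and any state updates happen here...
  pthread_mutex_unlock(&self->state_mutex);
}

static int send_blocking(ChannelSketch *self, const void *msg) {
  (void)msg;
  pthread_mutex_lock(&self->state_mutex);
  // ...write to the socket; update self->state on failure...
  pthread_mutex_unlock(&self->state_mutex);
  return 0;
}

static int get_state(ChannelSketch *self) {
  pthread_mutex_lock(&self->state_mutex);
  int current = self->state;
  pthread_mutex_unlock(&self->state_mutex);
  return current;
}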

@LasseRosenow (Collaborator, Author) Dec 9, 2024

I have now implemented a possible solution in the Fix race conditions commit

@LasseRosenow (Collaborator, Author) commented Dec 9, 2024

@erlingrj I am thinking about one thing: it seems to me that the runtime really only cares about whether the channel is connected or not. So maybe it is reasonable to replace the get_connection_state function with a simple boolean function: is_connected?

Or do you think there could be reasons why we need more information in the future?

@LasseRosenow LasseRosenow force-pushed the new-network-channel-interface branch from 582fea4 to f310d69 Compare December 9, 2024 17:41
github-actions bot (Contributor) commented Dec 9, 2024

Memory usage after merging this PR will be:


action_empty_test_c

from to increase (%)
text 59655 59466 -0.32
data 744 744 0.00
bss 10112 10112 0.00
total 70511 70322 -0.27

action_microstep_test_c

from to increase (%)
text 60486 60297 -0.31
data 752 752 0.00
bss 10112 10112 0.00
total 71350 71161 -0.26

action_overwrite_test_c

from to increase (%)
text 60323 60134 -0.31
data 744 744 0.00
bss 10112 10112 0.00
total 71179 70990 -0.27

action_test_c

from to increase (%)
text 60259 60070 -0.31
data 752 752 0.00
bss 10112 10112 0.00
total 71123 70934 -0.27

deadline_test_c

from to increase (%)
text 56102 55913 -0.34
data 760 760 0.00
bss 10784 10784 0.00
total 67646 67457 -0.28

delayed_conn_test_c

from to increase (%)
text 61516 61327 -0.31
data 744 744 0.00
bss 10112 10112 0.00
total 72372 72183 -0.26

event_payload_pool_test_c

from to increase (%)
text 18330 18330 0.00
data 624 624 0.00
bss 320 320 0.00
total 19274 19274 0.00

event_queue_test_c

from to increase (%)
text 27597 27597 0.00
data 736 736 0.00
bss 480 480 0.00
total 28813 28813 0.00

nanopb_test_c

from to increase (%)
text 42888 42888 0.00
data 904 904 0.00
bss 320 320 0.00
total 44112 44112 0.00

port_test_c

from to increase (%)
text 61464 61275 -0.31
data 744 744 0.00
bss 10112 10112 0.00
total 72320 72131 -0.26

reaction_queue_test_c

from to increase (%)
text 27319 27319 0.00
data 736 736 0.00
bss 480 480 0.00
total 28535 28535 0.00

request_shutdown_test_c

from to increase (%)
text 60458 60269 -0.31
data 744 744 0.00
bss 10112 10112 0.00
total 71314 71125 -0.27

startup_test_c

from to increase (%)
text 55801 55612 -0.34
data 752 752 0.00
bss 10784 10784 0.00
total 67337 67148 -0.28

tcp_channel_test_c

from to increase (%)
text 91462 92381 1.00
data 1224 1224 0.00
bss 21344 21376 0.15
total 114030 114981 0.83

timer_test_c

from to increase (%)
text 55692 55503 -0.34
data 744 744 0.00
bss 10784 10784 0.00
total 67220 67031 -0.28

@LasseRosenow (Collaborator, Author)

I have now also checked that the POSIX federated example works :) So I mark this PR as ready 🥳

@LasseRosenow LasseRosenow marked this pull request as ready for review December 9, 2024 17:59
@LasseRosenow LasseRosenow requested a review from erlingrj December 9, 2024 17:59
@tanneberger (Member) left a comment


LGTM

So the CoAP API is:

  • /connect (telling the other federate that I am ready to receive messages)
  • /message (message passing)
  • /disconnect (graceful disconnect)

@LasseRosenow (Collaborator, Author)

LGTM

So the CoAP API is:

  • /connect (telling the other federate that I am ready to receive messages)
  • /message (message passing)
  • /disconnect (graceful disconnect)

Yes, but I don't have strong opinions on that design; it was just the most obvious way to do it for me.
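
For concreteness, the resource layout could be imagined roughly as below. The handler names and signatures are hypothetical and are not the actual CoapUdpIpChannel registration code.

#include <stddef.h>

// Hypothetical sketch of the three CoAP resources listed above.
typedef void (*coap_handler_t)(void *channel, const void *payload, size_t len);

static void handle_connect(void *ch, const void *p, size_t n) { (void)ch; (void)p; (void)n; }
static void handle_message(void *ch, const void *p, size_t n) { (void)ch; (void)p; (void)n; }
static void handle_disconnect(void *ch, const void *p, size_t n) { (void)ch; (void)p; (void)n; }

static const struct {
  const char *path;
  coap_handler_t handler;
} coap_resources[] = {
    {"/connect", handle_connect},       // peer announces it is ready to receive
    {"/message", handle_message},       // message passing
    {"/disconnect", handle_disconnect}, // graceful disconnect
};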

@erlingrj (Collaborator) left a comment


Thanks for addressing my comments, Lasse, and for continuing to work on the PR. I am still hoping we can find a better solution for TcpIpChannel. I think you have solved the race conditions, but we sacrifice too much. Ideally we would avoid busy-polling the receive socket while still allowing data to be sent and received concurrently.

We could, for instance, do a select on the receive socket with a timeout of 1 s. If we time out, we grab a mutex and check whether a flag was set from send_blocking signalling a problem transmitting data, and then possibly instigate a reconnection.
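
A minimal sketch of that idea, assuming a made-up send_failed flag and mutex rather than the PR's actual fields:

#include <pthread.h>
#include <stdbool.h>
#include <sys/select.h>

static bool send_failed_flag; // set by send_blocking on a failed write
static pthread_mutex_t flag_mutex = PTHREAD_MUTEX_INITIALIZER;

static void worker_wait_for_data(int socket_fd) {
  fd_set readfds;
  FD_ZERO(&readfds);
  FD_SET(socket_fd, &readfds);

  struct timeval timeout = {.tv_sec = 1, .tv_usec = 0};
  int ret = select(socket_fd + 1, &readfds, NULL, NULL, &timeout);

  if (ret == 0) { // timed out: no incoming data
    pthread_mutex_lock(&flag_mutex);
    bool must_reconnect = send_failed_flag;
    send_failed_flag = false;
    pthread_mutex_unlock(&flag_mutex);
    if (must_reconnect) {
      // Instigate a reconnection from the worker thread.
    }
  } else if (ret > 0 && FD_ISSET(socket_fd, &readfds)) {
    // Data available: receive it without blocking indefinitely.
  }
}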

src/platform/posix/tcp_ip_channel.c (outdated, resolved)
self->receive_callback(self->federated_connection, &self->output);
} else {
LF_ERR(NET, "TcpIpChannel: Error receiving message %d", ret);
pthread_mutex_lock(&self->state_mutex);
Collaborator

This means that we cannot send concurrently with receiving? It seems a bit too constraining?

}
pthread_mutex_unlock(&self->state_mutex);
_env->platform->wait_for(_env->platform, TCP_IP_CHANNEL_WORKER_THREAD_MAIN_LOOP_SLEEP);
Collaborator

It is a little unfortunate that we have to set a sleep here. The duration of the sleep adds to the worst-case latency between two federates. It also means that we are always polling the receive socket and can never really go to low-power sleep.

LF_ERR(NET, "TcpIpChannel: Could not encode protobuf");
return LF_ERR;
}
pthread_mutex_lock(&self->state_mutex);
Collaborator

If the worker thread is constantly polling the socket, then I think it will detect the closed socket very quickly anyway, so we would not need to handle this from the send_blocking side and we wouldn't need to lock any mutex here.

However, I am unsure whether we want to have the worker thread busy-poll the receive socket. Is it possible for the worker thread to block on both the receiving socket AND a condition variable? Then we could signal that condition variable from send_blocking if we fail to send. That would wake up the worker thread and let it handle updating the state, etc.

@LasseRosenow (Collaborator, Author)

Thanks for the feedback! All good points; I will try to find a better solution tomorrow.

@erlingrj (Collaborator) commented Dec 9, 2024

@erlingrj I am thinking about one thing: it seems to me that the runtime really only cares about whether the channel is connected or not. So maybe it is reasonable to replace the get_connection_state function with a simple boolean function: is_connected?

Or do you think there could be reasons why we need more information in the future?

Let's do this. I think it is better to keep the API as minimal as possible and potentially expand it later.

@LasseRosenow LasseRosenow force-pushed the new-network-channel-interface branch from 0ae92fe to a5bb763 Compare December 9, 2024 22:40
@LasseRosenow (Collaborator, Author)

So I hope I have addressed the parallel send and receive issue now.
I am now using a pipe between send_blocking and the worker thread.
The worker thread uses select to listen on both the channel socket and the pipe.
I also fixed some issues where reset_socket caused the other federate to close the channel, which in my opinion should only happen from within the runtime, because otherwise it is impossible for the TCP/IP channel to recover.
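
A self-contained sketch of that pipe-based wake-up; all names are illustrative, not the actual tcp_ip_channel.c code.

#include <sys/select.h>
#include <unistd.h>

// The worker thread selects on both the channel socket and the pipe's read
// end, while send_blocking writes a byte into the pipe to report a failed send.
static int send_failed_pipe[2]; // [0] = read end (worker), [1] = write end (sender)

static int init_send_failed_pipe(void) { return pipe(send_failed_pipe); }

static void signal_send_failed(void) {
  char byte = 1;
  (void)write(send_failed_pipe[1], &byte, 1);
}

static void worker_wait_for_data_or_failure(int socket_fd) {
  fd_set readfds;
  FD_ZERO(&readfds);
  FD_SET(socket_fd, &readfds);
  FD_SET(send_failed_pipe[0], &readfds);

  int max_fd = (socket_fd > send_failed_pipe[0]) ? socket_fd : send_failed_pipe[0];
  if (select(max_fd + 1, &readfds, NULL, NULL, NULL) > 0) {
    if (FD_ISSET(send_failed_pipe[0], &readfds)) {
      char byte;
      (void)read(send_failed_pipe[0], &byte, 1); // drain, then start a reconnect
    }
    if (FD_ISSET(socket_fd, &readfds)) {
      // Incoming data: hand off to the receive path.
    }
  }
}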

@LasseRosenow (Collaborator, Author)

I have now changed it to use an eventfd_t instead of a pipe. This way it should also be compatible with Zephyr OS.
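
As a drop-in replacement for the pipe in the sketch above, the eventfd variant could look roughly like this. The field name mirrors send_failed_event_fds from the reviewed snippet further down; the helper functions are assumptions.

#include <sys/eventfd.h>

// A single eventfd that select() can wait on alongside the socket; available
// on Linux and Zephyr.
static int send_failed_event_fds;

static int init_send_failed_event(void) {
  send_failed_event_fds = eventfd(0, 0);
  return (send_failed_event_fds < 0) ? -1 : 0;
}

static void signal_send_failed_event(void) {
  eventfd_write(send_failed_event_fds, 1); // wakes the worker thread's select()
}

static void clear_send_failed_event(void) {
  eventfd_t value;
  eventfd_read(send_failed_event_fds, &value); // resets the counter to zero
}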

@LasseRosenow LasseRosenow force-pushed the new-network-channel-interface branch from ef78b17 to a85afb0 Compare December 10, 2024 17:38
@erlingrj (Collaborator) left a comment


This looks very good to me. Just minor nitpicks. OK to merge after addressing them!

src/platform/posix/tcp_ip_channel.c (resolved)
src/platform/posix/tcp_ip_channel.c (resolved)
max_fd = (socket > self->send_failed_event_fds) ? socket : self->send_failed_event_fds;

// Wait for data or cancel if send_failed externally
select(max_fd + 1, &readfds, NULL, NULL, NULL);
Collaborator

Should we check the return value of select?

Collaborator Author

I added an ERR log if it returns negative now. :)

if (_CoapUdpIpChannel_send_coap_message_with_payload(self, &self->remote, "/message", _client_send_blocking_callback,
message)) {
// Wait until the response handler confirms the ack or times out
while (!self->send_ack_received) {
Collaborator

I guess this is the only way for us to guarantee in-order reliable delivery, right?

Collaborator Author

Actually, this was implemented to achieve a blocking send. Just like with TCP, we want the send to block, so we wait for the message to be acked.

But you are correct that this also gives us something of an in-order guarantee; I am just not sure whether it covers all cases. We can think about this more later, I think.

Collaborator

Yes, this is a sane starting point; we can think about edge cases, intermittent connections, etc. later.

@LasseRosenow (Collaborator, Author)

The review should be addressed now :)

@erlingrj erlingrj merged commit 7a9fc7f into main Dec 10, 2024
7 of 8 checks passed
@erlingrj erlingrj deleted the new-network-channel-interface branch December 10, 2024 20:06
Contributor

Coverage after merging new-network-channel-interface into main will be

71.36%

Coverage Report
File    Stmts    Branches    Funcs    Lines    Uncovered Lines
src
   action.c81.90%69.23%100%85.33%120–121, 24, 43–46, 49, 51, 53, 56–58, 63–64, 73–74, 85–86
   builtin_triggers.c90.91%70%100%96.77%14, 18, 40, 43
   connection.c78.52%51.16%100%88.66%10, 104, 11, 110, 123–124, 136–137, 14, 14, 143, 145, 16–17, 21–22, 22, 22–23, 25, 27–28, 33, 48, 48, 48–49, 55, 60–62, 97
   environment.c78.35%55.56%84.62%83.33%12–13, 18, 20–21, 31, 35–36, 42–43, 51–52, 55–56, 60–61, 93–95
   event.c95.35%92.86%100%96.15%14–15
   federated.c5.61%2.97%7.69%6.76%101, 104–105, 105, 105–106, 108–109, 11, 111, 115–116, 118–120, 123, 125–129, 13, 13, 13, 130, 132–134, 137–139, 139, 139, 14, 140, 140, 140–142, 144, 147–148, 15, 150–154, 156–159, 16, 160–161, 163, 163, 163–166, 168, 168, 168–169, 17, 17, 17, 170, 170, 170–171, 175–176, 176, 176, 179–180, 184–186, 188, 188, 188, 190–194, 197, 197, 197–199, 20, 200, 203–204, 204, 204–205, 207–208, 21, 211–212, 217–218, 218, 218–219, 22, 22, 22, 221, 223, 223, 223–226, 226, 226, 226, 226–229, 23, 230–238, 24, 24, 24, 242, 245, 245, 245–247, 25, 251, 254–255, 255, 255, 255–259, 26, 260–263, 265, 27, 27, 27, 271–273, 28, 28, 28, 28, 28, 285–286, 289, 29, 290–292, 294, 294, 294–295, 299–300, 300, 300, 302, 304–305, 305, 305–306, 306, 306–307, 307, 307–308, 308, 308–309, 309, 309, 31, 310, 310, 310, 312, 312, 312–313, 313, 313–314, 314, 314–315, 315, 315, 317, 34, 34, 34, 34, 34–35, 38, 42–43, 45–48, 50, 50, 50–51, 51, 51, 53, 53, 53–55, 55, 55–57, 61–62, 66–67, 69–72, 74, 76, 76, 76–77, 77, 77–78, 78, 78–79, 79, 79, 82–83, 85–88, 9, 90, 90, 90–93, 95, 97, 97, 97–98
   logging.c87.50%80%100%88.64%24, 37–39, 46, 59–60
   network_channel.c57.69%50%100%58.82%36, 36, 36, 36, 41–44, 49–50, 53
   port.c78.08%45.83%100%93.33%10, 10, 10, 16, 20, 25, 25–27, 27, 27–28, 39, 39, 39–40
   queues.c89.94%80.36%100%94.06%108, 113, 119, 21–23, 47–48, 60–61, 84–88, 91–92
   reaction.c70.34%54.55%100%78.26%15, 17, 21–22, 28–31, 31, 31–32, 42, 45, 47, 52–53, 53, 53–55, 55, 55–56, 73, 89–91, 91, 91–94, 94, 94–95
   reactor.c69.33%51.52%100%82.28%10, 101–102, 14–19, 22, 28, 30, 32–37, 37, 37–38, 38, 38, 43, 55, 58–59, 59, 59–60, 60, 60–61, 63, 77–78, 81–82, 82, 82–83, 83, 83–84, 86, 91
   serialization.c50%50%50%50%16–17, 26–27, 33–35, 38–40
   tag.c40.19%31.48%60%47.92%14, 14–15, 17, 17–18, 23–24, 24, 24, 24, 24–25, 27, 27, 27, 27, 27–28, 30, 30, 30–31, 33–34, 34, 34–35, 37, 37, 37, 37, 37–38, 40, 40, 40, 40, 40–41, 43, 53–54, 63, 63–64, 83–85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85–87, 89
   timer.c95%66.67%100%100%14, 25
   trigger.c100%100%100%100%
   util.c0%0%0%0%10, 3–4, 4, 4–5, 5, 5–6, 8–9
src/platform/posix
   posix.c52.73%30%66.67%56%100, 100, 100–102, 106, 16, 18, 20–21, 34–36, 38–40, 48–49, 54–59, 59, 59–62, 62, 62–64, 67, 73–74, 78, 81, 92–94, 94, 94–96, 98–99
   tcp_ip_channel.c66.54%51.61%100%76.16%101, 103, 106–107, 117–118, 118, 118–119, 124–125, 125, 125–127, 131–132, 132, 132–134, 148, 154, 154, 154, 154, 154–155, 155, 155–156, 158, 158, 158–159, 168, 175–176, 180, 184, 184, 184–186, 186, 186–187, 189, 189, 189–191, 195, 195, 195–196, 216, 229–230, 230, 230–231, 237, 242–243, 243, 243–244, 244, 244–245, 247–248, 248, 248–249, 249, 249, 251–254, 264, 264–265, 265, 265–266, 290–291, 291, 291–293, 293, 293–294, 294, 294–295, 304, 304, 309, 311, 313, 315, 327–328, 344–346, 346,

Labels: None yet
Projects: None yet
3 participants