Simplify NetworkChannel interface #150
Conversation
Benchmark results after merging this PR: performance and memory usage for PingPongUc, PingPongC, ReactionLatencyUc, and ReactionLatencyC.
This is looking very good, I like the new organization a lot. There is an important question about possible race conditions (current and future) with respect to updating the channel state.
/**
 * @brief Main loop of the TcpIpChannel.
 */
static void *_TcpIpChannel_worker_thread(void *untyped_self) {
I like this new organization of the code. Great!
src/platform/posix/tcp_ip_channel.c
Outdated
if (bytes_send < 0) {
-  LF_ERR(NET, "write failed errno=%d", errno);
+  LF_ERR(NET, "TcpIpChannel: Write failed errno=%d", errno);
  switch (errno) {
  case ETIMEDOUT:
  case ENOTCONN:
Is it necessary to call _TcpIpChannel_update_state here? We might get some very subtle races: when the worker thread also detects that the socket is closed, it will initiate a reconnect, right? Then we don't want to overwrite the new state if we also fail to send some data.
If we are guaranteed that the worker thread will not block forever on receive, and that it will detect anything that is detected here, then I propose that we only read the channel state from the runtime side and only write to it from the worker thread. What do you think?
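For illustration, a minimal sketch of that single-writer pattern (the enum values and struct/field names here are assumptions for the sketch, not the actual reactor-uc types):

#include <pthread.h>

// Hypothetical channel state; the real NetworkChannelState enum may differ.
typedef enum { CHANNEL_CONNECTED, CHANNEL_LOST_CONNECTION } ChannelState;

typedef struct {
  ChannelState state;
  pthread_mutex_t state_mutex;
} ChannelSketch;

// Only the worker thread writes the state...
static void worker_update_state(ChannelSketch *self, ChannelState new_state) {
  pthread_mutex_lock(&self->state_mutex);
  self->state = new_state;
  pthread_mutex_unlock(&self->state_mutex);
}

// ...while the runtime side only reads it, so there is a single writer.
static ChannelState runtime_get_state(ChannelSketch *self) {
  pthread_mutex_lock(&self->state_mutex);
  ChannelState s = self->state;
  pthread_mutex_unlock(&self->state_mutex);
  return s;
}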
Yes, there might be a race if the worker thread has already started a reconnection. We could guard this, though.
I think it is better to update the state straight away if we see something is wrong.
A very robust way could be to just mutex the whole worker-thread while loop and then also mutex the whole send_blocking function, and of course also the get_state function. That way the two should not be able to interfere with each other.
We would then have to change back to making the TCP receive non-blocking, though.
I have now implemented a possible solution in the Fix race conditions commit.
@erlingrj I am thinking about one thing: it seems to me that the runtime really only cares about one thing now: is the channel connected or not. So maybe it is reasonable to replace that. Or do you think there could be reasons why we need more information in the future?
Force-pushed from 582fea4 to f310d69
Memory usage after merging this PR will be: memory report covering action_empty_test_c, action_microstep_test_c, action_overwrite_test_c, action_test_c, deadline_test_c, delayed_conn_test_c, event_payload_pool_test_c, event_queue_test_c, nanopb_test_c, port_test_c, reaction_queue_test_c, request_shutdown_test_c, startup_test_c, tcp_channel_test_c, and timer_test_c.
I have now also checked that the POSIX federated example works :) So I mark this PR as ready 🥳
LGTM
So the CoAP API is:
- /connect (telling the other federate that I am ready to receive messages)
- /message (message passing)
- /disconnect (graceful disconnect)
Yes, but I don't have strong opinions on that design; that was just the most obvious way to do it for me.
Thanks for addressing my comments, Lasse, and for continuing to work on the PR. I am still hoping we can find a better solution for TcpIpChannel. I think you have solved the race conditions, but I think we sacrifice too much. Ideally we can avoid busy-polling the receive socket and also allow receiving and sending data concurrently.
We could for instance do a select on the receive socket with a timeout of 1 s. If we time out, we grab a mutex and check whether a flag was set from send_blocking signalling a problem while transmitting data, and then possibly instigate a reconnection?
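Roughly what that suggestion could look like (a sketch only; the send_failed flag and the reconnect step are assumed names, not code from this PR):

#include <pthread.h>
#include <stdbool.h>
#include <sys/select.h>

// One worker-thread iteration: block on the socket for up to 1 s, then check
// whether send_blocking flagged a transmission problem.
static void worker_iteration_sketch(int socket_fd, pthread_mutex_t *mutex, bool *send_failed) {
  fd_set readfds;
  FD_ZERO(&readfds);
  FD_SET(socket_fd, &readfds);

  struct timeval timeout = {.tv_sec = 1, .tv_usec = 0};
  int ret = select(socket_fd + 1, &readfds, NULL, NULL, &timeout);

  if (ret > 0) {
    // ... data is ready, receive it ...
  } else if (ret == 0) {
    // Timed out: check the flag set by send_blocking.
    pthread_mutex_lock(mutex);
    bool failed = *send_failed;
    *send_failed = false;
    pthread_mutex_unlock(mutex);
    if (failed) {
      // ... instigate a reconnection ...
    }
  }
}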
src/platform/posix/tcp_ip_channel.c
Outdated
  self->receive_callback(self->federated_connection, &self->output);
} else {
  LF_ERR(NET, "TcpIpChannel: Error receiving message %d", ret);
  pthread_mutex_lock(&self->state_mutex);
This means that we cannot send concurrently with receiving? It seems a bit too constraining?
src/platform/posix/tcp_ip_channel.c
Outdated
  }
  pthread_mutex_unlock(&self->state_mutex);
  _env->platform->wait_for(_env->platform, TCP_IP_CHANNEL_WORKER_THREAD_MAIN_LOOP_SLEEP);
It is a little unfortunate that we have to set a sleep here. The duration of the sleep adds to the worst-case latency between two federates. It also means that we are always polling the receive socket and never really can go into a low-power sleep.
src/platform/posix/tcp_ip_channel.c
Outdated
LF_ERR(NET, "TcpIpChannel: Could not encode protobuf"); | ||
return LF_ERR; | ||
} | ||
pthread_mutex_lock(&self->state_mutex); |
If the worker thread is constantly polling the socket, then I think it will detect the closed socket very quickly anyway, so we wouldn't need to handle this from the send_blocking side and we wouldn't need to lock any mutex here.
However, I am unsure whether we want the worker thread to busy-poll the receive socket. Is it possible for the worker thread to block on both the receiving socket AND a condition variable? Then we could signal that condition variable from send_blocking if we fail to send. That would wake up the worker thread and let it handle the updating of state etc.
Thanks for the feedback! All good points. I will try to find a better solution tomorrow.
Let's do this. I think it is better to keep the API as minimal as possible and then potentially expand it later.
Force-pushed from 0ae92fe to a5bb763
So I hope I have addressed the parallel send and receive issue now.
I have now changed it to use an eventfd_t instead of a pipe. This way it should also be compatible with Zephyr OS.
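For reference, a sketch of that eventfd-based wakeup (the helper names are assumptions loosely based on the diff excerpts below, not the exact PR code):

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/select.h>

// Created once, e.g. in the channel constructor:
//   int send_failed_event_fd = eventfd(0, 0);

// Called from send_blocking when a write fails, to wake the worker thread.
static void signal_send_failed(int send_failed_event_fd) {
  eventfd_write(send_failed_event_fd, 1);
}

// Worker thread: block on both the socket and the eventfd at the same time.
static void wait_for_data_or_send_failure(int socket_fd, int send_failed_event_fd) {
  fd_set readfds;
  FD_ZERO(&readfds);
  FD_SET(socket_fd, &readfds);
  FD_SET(send_failed_event_fd, &readfds);

  int max_fd = (socket_fd > send_failed_event_fd) ? socket_fd : send_failed_event_fd;
  if (select(max_fd + 1, &readfds, NULL, NULL, NULL) < 0) {
    return; // select failed; the caller should log and decide what to do
  }
  if (FD_ISSET(send_failed_event_fd, &readfds)) {
    eventfd_t value;
    eventfd_read(send_failed_event_fd, &value); // clear the event
    // ... a send failed: update the state / initiate a reconnect ...
  }
  if (FD_ISSET(socket_fd, &readfds)) {
    // ... data is available: receive it ...
  }
}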
Co-authored-by: erling <erling.jellum@gmail.com>
Force-pushed from ef78b17 to a85afb0
This looks very good to me. Just minor nitpicks. OK to merge after addressing them!
src/platform/posix/tcp_ip_channel.c
Outdated
max_fd = (socket > self->send_failed_event_fds) ? socket : self->send_failed_event_fds;

// Wait for data or cancel if send_failed externally
select(max_fd + 1, &readfds, NULL, NULL, NULL);
Should we check the return value of select?
I have now added an ERR log if it returns negative. :)
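Presumably along these lines (a guess at the shape, reusing the LF_ERR style and the select call from the diff excerpt above):

if (select(max_fd + 1, &readfds, NULL, NULL, NULL) < 0) {
  LF_ERR(NET, "TcpIpChannel: select failed errno=%d", errno);
}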
if (_CoapUdpIpChannel_send_coap_message_with_payload(self, &self->remote, "/message", _client_send_blocking_callback,
                                                     message)) {
  // Wait until the response handler confirms the ack or times out
  while (!self->send_ack_received) {
I guess this is the only way for us to guarantee in-order reliable delivery, right?
Actually this was implemented to achieve a blocking send. Just like with TCP, we want the send to block, so we wait for the message to be acked.
But you are correct, this somewhat also gives us an in-order guarantee; I am not sure it covers all cases, though. We can think about this more later, I think.
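A sketch of that blocking-send idea (the polling interval and timeout are made up for illustration; the real code presumably uses the platform's wait primitive rather than usleep):

#include <stdbool.h>
#include <unistd.h>

// Block until the CoAP response handler flips send_ack_received, or give up
// after max_wait_ms. The flag name follows the diff excerpt above.
static bool wait_for_ack_sketch(volatile bool *send_ack_received, int max_wait_ms) {
  int waited_ms = 0;
  while (!*send_ack_received && waited_ms < max_wait_ms) {
    usleep(10 * 1000); // poll every 10 ms; a condition variable would avoid the polling
    waited_ms += 10;
  }
  return *send_ack_received;
}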
Yes, this is a sane starting point, and we can think about edge cases, intermittent connections, etc. later.
Co-authored-by: erling <erling.jellum@gmail.com>
The review should be addressed now :)
This pull request removes the try_connect function from the NetworkChannel.
The runtime will now only tell the NetworkChannel implementation, such as CoapUdpIpChannel, to open or close a connection. The channel then internally takes care of connection establishment.
The runtime can get the NetworkChannel state using get_connection_state.
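A rough sketch of the resulting interface shape (apart from try_connect being removed and get_connection_state being kept, the function names and enum values here are assumptions; the real NetworkChannel struct in the codebase may differ):

// Sketch of the simplified NetworkChannel interface described above.
typedef enum {
  NETWORK_CHANNEL_STATE_CONNECTED,
  NETWORK_CHANNEL_STATE_DISCONNECTED,
} NetworkChannelState;

typedef struct NetworkChannel NetworkChannel;
struct NetworkChannel {
  // Ask the channel to establish a connection; it handles (re)connection
  // internally, replacing the old runtime-driven try_connect.
  void (*open_connection)(NetworkChannel *self);
  // Tear the connection down.
  void (*close_connection)(NetworkChannel *self);
  // The runtime only reads the state to learn whether the channel is connected.
  NetworkChannelState (*get_connection_state)(NetworkChannel *self);
};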