broker: improve LOST error message
Problem: many "transitioning to LOST due to EHOSTUNREACH error on send"
messages were logged during shutdown of a large instance.

This is still not well understood but we can perhaps get a little more
information for next time.

Add the previous state and the time the peer has spent in that
state (in whole seconds).

Hopefully this will help with #5881 if it occurs again.
garlick committed Apr 12, 2024
1 parent d4fa3dc commit 61bb419
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion src/broker/overlay.c
@@ -777,9 +777,12 @@ static int overlay_sendmsg_child (struct overlay *ov, const flux_msg_t *msg)
             && (child = child_lookup_online (ov, uuid))) {
             flux_log (ov->h,
                       LOG_ERR,
-                      "%s (rank %d) transitioning to LOST due to %s",
+                      "%s (rank %d) transitioning to %s->LOST"
+                      " after %ds due to %s",
                       flux_get_hostbyrank (ov->h, child->rank),
                       (int)child->rank,
+                      subtree_status_str (child->status),
+                      (int)(monotime_since (child->status_timestamp) / 1000.0),
                       "EHOSTUNREACH error on send");
             overlay_child_status_update (ov, child, SUBTREE_STATUS_LOST);
         }
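
For illustration only, the new message format can be reproduced outside the broker with a small standalone sketch. Everything prefixed `demo_` is hypothetical and not part of flux-core: the enum and its strings stand in for whatever `subtree_status_str()` returns, and the elapsed time is assumed to be in milliseconds, as the division by 1000.0 in the diff suggests. The cast to `int` truncates, so the logged duration is in whole seconds.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the broker's subtree status values. */
enum demo_status {
    DEMO_STATUS_FULL,
    DEMO_STATUS_PARTIAL,
    DEMO_STATUS_DEGRADED,
};

/* Hypothetical stand-in for subtree_status_str(); the strings are assumed. */
static const char *demo_status_str (enum demo_status status)
{
    switch (status) {
        case DEMO_STATUS_FULL:
            return "full";
        case DEMO_STATUS_PARTIAL:
            return "partial";
        case DEMO_STATUS_DEGRADED:
            return "degraded";
    }
    return "unknown";
}

/* Build the log line using the same format string as the commit.
 * elapsed_ms is the time spent in the previous state, assumed to be in
 * milliseconds; the (int) cast truncates it to whole seconds.
 * Returns the snprintf() result.
 */
static int demo_format_lost (char *buf,
                             size_t size,
                             const char *host,
                             int rank,
                             enum demo_status prev,
                             double elapsed_ms,
                             const char *reason)
{
    return snprintf (buf,
                     size,
                     "%s (rank %d) transitioning to %s->LOST"
                     " after %ds due to %s",
                     host,
                     rank,
                     demo_status_str (prev),
                     (int)(elapsed_ms / 1000.0),
                     reason);
}
```

With a made-up host `node42` at rank 42 that spent 12750 ms in the `full` state, the sketch yields `node42 (rank 42) transitioning to full->LOST after 12s due to EHOSTUNREACH error on send`.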