Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race conditions in poll and condvar #62691

Merged
merged 2 commits into from
Sep 18, 2023

Conversation

ambroise-arm
Copy link
Contributor

In the poll code, signal_poll_event is called by 2 functions: z_impl_k_poll_signal_raise and z_handle_obj_poll_events. The first one correctly takes the poll lock. Fix the second one not taking the lock.

In the sockets code, fix the mutex of the condition variable getting released before signaling for this same condition variable.

Both of those created race conditions which especially appeared on SMP-enabled configurations. The symptom was similar for both of them: a thread pending, waiting on an event, but never getting scheduled again even though the event happened.

The signal_poll_event function was previously called without the poll
lock held. This created a race condition between a thread calling k_poll
to wait for an event and another thread signalling for this same event.
This resulted in the waiting thread to stay pending and the handle to it
getting removed from the notifyq, meaning it couldn't get woken up
again.

Signed-off-by: Ambroise Vincent <ambroise.vincent@arm.com>
Releasing the lock before notifying condvar led to a race condition
between a thread calling k_condvar_wait to wait for a condition variable
and another thread signalling for this same condition variable. This
resulted in the waiting thread to stay pending and the handle to it
getting removed from the notifyq, meaning it couldn't get woken up
again.

Signed-off-by: Ambroise Vincent <ambroise.vincent@arm.com>
Copy link
Member

@povergoing povergoing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just wondering why the naming of z_handle_obj_poll_events looks so different from others in kernel/poll.c.

@nashif
Copy link
Member

nashif commented Sep 16, 2023

@peter-mitsis @andyross please review.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question. If you're already done the analysis and know it's safe, feel free to treat your "yes" as a +1 from me.

@@ -470,11 +471,14 @@ static int signal_poll_event(struct k_poll_event *event, uint32_t state)
void z_handle_obj_poll_events(sys_dlist_t *events, uint32_t state)
{
struct k_poll_event *poll_event;
k_spinlock_key_t key = k_spin_lock(&lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely looks like a lock was missing here, but I do worry about the possibility of a deadlock underneath signal_poll_event, which calls into a bunch of subsystems. Are we 100% sure those can't turn around and call back into something that wants the poll lock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. But given that z_impl_k_poll_signal_raise already calls signal_poll_event with the poll lock taken, I suppose such a case would have already shown up.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sold.

@nashif nashif merged commit bb450eb into zephyrproject-rtos:main Sep 18, 2023
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants