Possible deadlock issue in logrus, or deadlock between the data writer and the DataIterator #280
shuhaowu added a commit that referenced this issue on Apr 30, 2021: "Slower data writer might reduce the race condition in logrus. See #280."
While testing, I've observed an intermittent failure in the Ruby test InterruptResumeTest#test_interrupt_resume_idempotence_with_writes_to_source. When debugging it, I can see that all of the Ghostferry goroutines are stuck on logrus mutex methods [1]. Meanwhile, the Ruby process is always trying to modify a row via the data writer. This modification query always appears to be stuck waiting for a lock, presumably because the data iterator just went over that row and issued a FOR UPDATE against it (BEFORE_ROW_COPY and AFTER_ROW_COPY in the integrationferry are sent while the FOR UPDATE is held, which makes the situation worse).

I'm not quite sure why this is the case. The problem could lie in an upstream logrus issue (sirupsen/logrus#1201), or it could be a race in the DataIterator, which holds the lock while trying to send BEFORE_ROW_COPY and AFTER_ROW_COPY to the integration test server (the reader should check integrationferry.go, and look at how the listeners on the data iterator and cursor interact with the FOR UPDATE lock). I'm not sure which is more likely; if it were purely a logrus problem, we would presumably see the same deadlock in production.
I've found that reducing the frequency of the data writer makes the deadlock disappear. Notably, even a very small sleep (5ms) solves the issue, and it only reduces the total number of modification queries from about 30 to 22 over the duration the data writer is active. I'm not sure why this makes such a big difference, either.
A failure grinding script is included below at [2].
[1] Ghostferry per-goroutine stack trace:
[2] Failure grinding script
Before running this, you should comment out the sleep in test/helpers/data_writer_helper.rb, after the call to write_data.