fix(mine_loop): Segregate rayon threadpool for guessing #315
Conversation
Don't use all threads available to rayon, as other parts of the code might need some; instead, leave two threads free.
Drop the two-thread margin; it is not necessary.
If the CLI argument is not set, the number of threads used will default to whatever `rayon::current_num_threads()` returns.
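As a concrete illustration, here is a minimal Rust sketch of that fallback; the `cli_threads` parameter is a hypothetical stand-in for the parsed CLI argument, not the project's actual field name.

```rust
// Hypothetical sketch: resolve the number of guesser threads.
// `cli_threads` stands in for the CLI argument; when it is unset, fall back
// to whatever rayon's current (global) pool reports.
fn guesser_thread_count(cli_threads: Option<usize>) -> usize {
    cli_threads.unwrap_or_else(rayon::current_num_threads)
}

fn main() {
    // With no CLI value, this prints the global rayon pool's thread count.
    println!("guesser threads: {}", guesser_thread_count(None));
}
```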
LGTM
Thoughts:
In particular, my understanding of how it has been working is that there is a single global rayon threadpool, so any ParallelIterator workers would execute on that global pool. The mining-loop ParallelIterator was already running in a spawn_blocking, so it should be on a dedicated tokio blocking thread. If there's any interaction with tokio workers, it's likely due to the call to triton_vm::verify() executing a ParallelIterator directly inside a regular tokio task.
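For reference, here is a minimal sketch (not the project's actual code) of the two call patterns described above: CPU-bound rayon work wrapped in `spawn_blocking` for the guesser, and rayon work invoked directly from a regular tokio task, the way a `triton_vm::verify()` call would run a `par_iter`. Both closures draw on the same single global rayon pool.

```rust
use rayon::prelude::*;

// Sketch of the two call sites discussed above; the heavy closures stand in
// for guessing and for triton_vm::verify(), which both fan out onto rayon.
// Assumes tokio with the "macros" and "rt-multi-thread" features.

#[tokio::main]
async fn main() {
    // Mining loop: ParallelIterator work wrapped in spawn_blocking, so the
    // rayon fan-out happens from a dedicated tokio blocking thread.
    let guesser = tokio::task::spawn_blocking(|| {
        (0u64..1_000_000).into_par_iter().map(|i| i.wrapping_mul(i)).sum::<u64>()
    });

    // Peer loop: rayon work invoked directly inside a regular tokio task,
    // analogous to a verify() call executing a par_iter on a tokio worker.
    let verifier = tokio::spawn(async {
        (0u64..1_000_000).into_par_iter().map(|i| i.wrapping_add(1)).sum::<u64>()
    });

    // Both closures above execute on the same global rayon thread pool.
    let (g, v) = (guesser.await.unwrap(), verifier.await.unwrap());
    println!("{g} {v}");
}
```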
If I'm reading that quote correctly and in context, it is talking about a situation where install() is called by a thread inside an existing rayon threadpool. That situation doesn't apply here, afaict. Also, calling yield_now() in a loop should result in a hot loop with full CPU utilization. As I understand it, we were seeing the opposite, with top reporting idle CPU. But again, afaik there has only been a single global rayon threadpool.
And back to a fundamental question: why does having two separate rayon threadpools fix the hang issue?
I would be OK with merging this now if there is urgency. I still hope to gain clarity on the root cause, though.
I do, and I'm happy to elaborate but I'm afraid I would end up using many of the same words that weren't compelling the first go around. So I decided instead to write a minimum working demonstration exhibiting the behavior and the fix. Here it is: https://github.com/aszepieniec/tokio-rayon-interference This demo reliably reproduces the same behavior on different machines. I added an elaborate note to the repository's README.md file because I figured I might as well help out others who stumble into this bug. You should probably read that before you continue; otherwise I'm afraid my references might not make any sense.
It's saying the current thread tries to stay busy while the operation completes in the other pool.
The idle CPU can be explained as a consequence of using
If my understanding is correct, they (the verify tasks) will be sequenced by the global thread-pool and, as a consequence, will stall until it's their turn. And I struggle to imagine a better solution: even if you could find a way to make them share resources simultaneously, the amount of work does not change, so you would have to wait the same total time but without getting the results of tasks 1 through N-1 early. Mind you,
This is a valid concern. If you were to ask me for a litmus test to distinguish band-aid-and-morphine from surgical cures you might get a checklist like this:
Yes, we did. At first I thought it solved the issue -- running log output, responsive dashboard, etc. At some point I recall noticing that the number of peer connections reported by the dashboard was very large, 80 or so. (More nodes than I think ever connected at the same time.) I didn't think it was related at the time. Looking back, I think this mechanic was at play:
In the end, we rolled back that change. My minimal working demo shows that it indeed does nothing to fix the issue.
I'm not sure it is possible to make the issue go away without either a) making at least one of the two tasks (guessing and verifying) sequential, or b) segregating the rayon threadpools. I think that is because rayon thread-pools are inherently synchronous and can only ever have one client. If true, then yes, it would imply that there is a lot of context switching going on when two distinct thread-pools are taking turns.
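To make the segregation idea concrete, here is a minimal sketch of building a dedicated rayon pool with `ThreadPoolBuilder` and routing the guessing work through `install`, so the global pool stays free for verification. The thread count of 4 and the closure bodies are illustrative assumptions, not the PR's actual code.

```rust
use rayon::prelude::*;

// Sketch of the segregation idea: give guessing its own rayon pool so the
// global pool remains available for verification. Names are illustrative.
fn main() {
    let guesser_pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4) // or a value taken from the CLI argument
        .build()
        .expect("failed to build dedicated guesser thread pool");

    // Work submitted through `install` runs on the dedicated pool only.
    let best = guesser_pool.install(|| {
        (0u64..1_000_000).into_par_iter().map(|nonce| nonce ^ 0xdead_beef).min()
    });

    // Meanwhile, par_iter calls made elsewhere (e.g. inside verification)
    // still land on the untouched global pool.
    println!("{best:?}");
}
```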
Thanks for taking the time to write up the demo, both in terms of code and the readme. You mention noise in the output due to buffering. It might reduce the noise if you lock stdout before writing.
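A minimal sketch of that suggestion, assuming locking `stdout` is what was meant: take the lock once and write through it so prints from concurrent threads don't interleave.

```rust
use std::io::Write;

// Sketch of the suggestion above: hold the stdout lock while writing so
// output from other threads cannot be interleaved mid-line.
fn main() {
    let stdout = std::io::stdout();
    let mut out = stdout.lock();
    for i in 0..3 {
        writeln!(out, "progress update {i}").expect("write to stdout failed");
    }
}
```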
Fixes #282.

The issue was caused by interference between rayon parallelism in disparate tokio tasks. On the one hand, the guesser spawned the maximum number of rayon threads using `repeat`. On the other hand, the peer loop task spawned rayon threads in the course of `triton_vm::verify` using `par_iter`. But since the rayon threadpool was already occupied, and since tokio tasks do not know how to coordinate about rayon's thread pool, the second task ended up stalling.

This PR fixes the problem by managing the rayon parallelism for guessing in a segregated rayon threadpool. It has already been experimentally verified to solve the problem. The documentation for `ThreadPool` explains why: