
fix(mine_loop): Segregate rayon threadpool for guessing #315

Merged: 4 commits merged into master on Jan 13, 2025

Conversation

aszepieniec
Contributor

Fixes #282.

The issue was caused by interference between rayon parallelism in disparate tokio tasks. On the one hand, the guesser spawned the maximum number of rayon threads using repeat. On the other hand, the peer loop task spawned rayon threads in the course of triton_vm::verify using par_iter. But since the rayon threadpool was already occupied, and since tokio tasks have no way of coordinating their use of rayon's thread pool, the second task ended up stalling.

This PR fixes the problem by managing the rayon parallelism for guessing in a segregated rayon threadpool. It has already been experimentally verified to solve the problem. The documentation for ThreadPool::install explains why:

If the current thread is part of a different thread pool, it will try to keep busy while the op completes in its target pool, similar to calling ThreadPool::yield_now() in a loop. Therefore, it may potentially schedule other tasks to run on the current thread in the meantime.
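
As a rough illustration of the approach (a minimal sketch, not the actual neptune-core code; the function name and the guessing closure are hypothetical), the guessing work is routed through a dedicated rayon pool via `ThreadPool::install`, leaving the global pool free for `triton_vm::verify`:

```rust
use rayon::prelude::*;

/// Hypothetical guessing routine: runs the parallel guess loop on a
/// dedicated rayon pool instead of the global one, so that other users
/// of the global pool (e.g. verification) are not starved.
fn guess_on_dedicated_pool(num_threads: usize) -> Option<u64> {
    // Build a pool that is segregated from rayon's global thread pool.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .build()
        .expect("failed to build guessing thread pool");

    // All rayon parallelism inside this closure runs on `pool`,
    // not on the global pool.
    pool.install(|| {
        (0u64..1_000_000)
            .into_par_iter()
            .find_any(|&nonce| nonce % 999_983 == 0) // stand-in for a hash check
    })
}
```

Per the discussion below, the mining loop already runs inside tokio's spawn_blocking; the change is only which rayon pool the parallel guessing executes on.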

Sword-Smith and others added 4 commits January 8, 2025 15:08
Don't use all threads available to rayon, as other parts of the code
might need some; instead, leave two threads free.
Drop the -2 threads' margin. Not necessary.
If the CLI argument is not set, the number of threads used will
default to whatever `rayon::current_num_threads()` returns.
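
A minimal sketch of that default (the function and argument names are hypothetical):

```rust
/// Resolve the number of guesser threads: use the CLI value if given,
/// otherwise fall back to rayon's default degree of parallelism.
fn guesser_thread_count(cli_guesser_threads: Option<usize>) -> usize {
    cli_guesser_threads.unwrap_or_else(rayon::current_num_threads)
}
```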
@aszepieniec aszepieniec requested review from Sword-Smith and dan-da and removed request for Sword-Smith January 8, 2025 14:45
Member

@Sword-Smith Sword-Smith left a comment


LGTM

@dan-da
Collaborator

dan-da commented Jan 10, 2025

Thoughts:

  1. I like this solution better than the other commit I saw that spawned a lot of tokio tasks instead of using rayon in the mining loop. I had some concerns about that one.

  2. I like the addition of the cli arg to control # of guessing threads.

  3. I don't feel like I have a good grasp of what the root cause was yet. @aszepieniec if you feel that you do, could you perhaps reword the explanation to break it down for me, step by step?

In particular, my understanding of how it has been working is that there is a single global rayon threadpool, so any ParallelIterator workers would execute on that global threadpool. The mining-loop ParallelIterator was already running in a spawn_blocking, so it should be in a dedicated tokio thread. If there's any interaction with tokio workers, it's likely due to the call to triton_vm::verify() executing a ParallelIterator directly inside a regular tokio task.
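
To make that arrangement concrete, here is a rough sketch of the pre-PR situation as described above (hypothetical names; verify stands in for triton_vm::verify): both parallel iterators share the single global rayon pool, one called from inside spawn_blocking and one called directly from a regular tokio task.

```rust
use rayon::prelude::*;

// Stand-in for triton_vm::verify: its par_iter runs on the *global* rayon pool.
fn verify(claims: &[u64]) -> bool {
    claims.par_iter().all(|&c| c % 2 == 0)
}

#[tokio::main]
async fn main() {
    // Mining loop: heavy rayon work wrapped in spawn_blocking, so it does not
    // occupy a tokio worker thread -- but it still saturates the global rayon pool.
    let guesser = tokio::task::spawn_blocking(|| (0u64..10_000_000).into_par_iter().max());

    // Peer loop: calls verify() directly from a regular tokio task, so its
    // par_iter queues work on the same global rayon pool.
    let verifier = tokio::spawn(async { verify(&[2, 4, 6, 8]) });

    let _ = tokio::join!(guesser, verifier);
}
```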

If the current thread is part of a different thread pool, it will try to keep busy while the op completes in its target pool, similar to calling ThreadPool::yield_now() in a loop. Therefore, it may potentially schedule other tasks to run on the current thread in the meantime.

If I'm reading that quote correctly and in context it is talking about a situation where install() is called by a thread inside an existing rayon threadpool. That situation doesn't apply here afaict. Also calling yield_now() in a loop should result in a hot loop with full CPU utilization. As I understand it, we were seeing the opposite with top reporting Idle CPU. But again, afaik there has only been a single rayon global threadpool.

  4. Let's take a high-level view of the situation with the change in this PR. iiuc, there are now two rayon threadpools: a) the global threadpool which is used by ParallelIterator in triton_vm::verify(), and b) a threadpool that is local to mine_loop. For some reason (unclear to me) this arrangement seems to fix the particular hang issue we've been seeing. But there is still the global threadpool that verify() is running in, and verify() can still be called multiple times concurrently. Are we certain that's fine for any N of concurrency?

And back to a fundamental question: why does having two separate rayon threadpools fix the hang issue?

  5. The proof is in the pudding. Clearly this is fixing the particular hang issue we've observed, so it's an improvement. I just hope a) we can collectively come to a clear understanding of the root cause, and b) that with this change we haven't just placed a band-aid, only for it to pop up in another scenario.

  6. I forget. Did we ever try putting the verify() calls inside spawn_blocking(), and if so what was the result? I think we were fearful that would just hide the issue if it occurred again. But if there is an interaction happening that stalls tokio worker threads, perhaps that would be the most correct solution... and a single global rayon threadpool would then work without hanging?

  7. I'm not clear on what costs/benefits there are to having multiple rayon threadpools. It would increase CPU context switching some, maybe not a lot. A clear benefit seems to be that we can independently control the number of threads in each threadpool, so it mostly seems worth keeping regardless.


I would be Ok with merging this now if there is urgency. I still hope to gain clarity on root cause though.

@aszepieniec
Contributor Author

I don't feel like I have a good grasp of what the root cause was yet. @aszepieniec if you feel that you do, could you perhaps reword the explanation to break it down for me, step by step?

I do, and I'm happy to elaborate but I'm afraid I would end up using many of the same words that weren't compelling the first go around. So I decided instead to write a minimum working demonstration exhibiting the behavior and the fix. Here it is: https://github.com/aszepieniec/tokio-rayon-interference

This demo reliably reproduces the same behavior on different machines.

I added an elaborate note to the repository's README.md file because I figured I might as well help out others who stumble into this bug. You should probably read that before you continue; otherwise I'm afraid my references might not make any sense.

If I'm reading that quote correctly and in context it is talking about a situation where install() is called by a thread inside an existing rayon threadpool. That situation doesn't apply here afaict.

It's definitely confusing. Here is my reading, with annotations:

If the current thread [i.e., the one calling pool.install] is part of a different thread pool [different from pool], it [the current thread] will try to keep busy while the op [guessing] completes in its target pool [pool]

It's saying the current thread tries to stay busy while the pool.install job is running. The current thread, in turn, can yield to other tokio tasks -- such as the one which does the verification.

As I understand it, we were seeing the opposite with top reporting Idle CPU.

The idle CPU can be explained as a consequence of using --sleepy-guessing, which sleeps for 100 ms in between every guess. I do not recall running htop when this flag was dropped.

But there is still the global threadpool that verify() is running in, and verify() can still be called multiple times concurrently. Are we certain that's fine for any N of concurrency?

If my understanding is correct, they (the verify tasks) will be sequenced by the global thread-pool, and as a consequence they will stall until it's their turn. And I struggle to imagine a better solution. Even if you could find a way to make them share resources simultaneously, the amount of work does not change and so you would have to wait the same time but without getting the results of tasks 1 -- (N-1) early.

Mind you, triton_vm::verify takes around 50 ms, and only some steps invoke parallelism. Furthermore, we are not sure parallelism actually speeds things up because there is the overhead of dividing and distributing work across cores.

I just hope [...] that with this change we haven't just placed a band-aid, only for it to pop up in another scenario.

This is a valid concern. If you were to ask me for a litmus test to distinguish band-aid-and-morphine from surgical cures you might get a checklist like this:

  • Can you explain the root cause in simple terms? I guess I failed in the first try but the elaborate README.md seeks to remedy that.
  • Does the hypothesized root cause explain all observations? I believe the answer is yes.
  • Can you reliably reproduce the issue? I believe I have.
  • Does the fix reliably make the issue go away? Yes.

I forget. Did we ever try putting the verify() calls inside spawn_blocking(), and if so what was the result? I think we were fearful that would just hide the issue if it occurred again. But if there is an interaction happening that stalls tokio worker threads, perhaps that would be the most correct solution... and a single global rayon threadpool would then work without hanging?

Yes, we did. At first I thought it solved the issue -- running log output, responsive dashboard, etc. At some point I recall noticing that the number of peer connections reported by the dashboard was very large, 80 or so. (More nodes than I think ever connected at the same time.) I didn't think it was related at the time. Looking back, I think this mechanic was at play:

  • my peer thread stalls
  • connection closed by peer, but my peer thread hasn't processed it yet
  • some time later, same node connects again but with new IP address
  • same node listed twice.

In the end, we rolled back the use of spawn_blocking because it was possible that it did not solve anything and instead merely made the problem seem to disappear -- like accumulating crashed tasks without crashing the main task.

My minimal working demo shows that indeed it does nothing to fix the issue.

I'm not clear on what costs/benefits there are to having multiple rayon threadpools. It would increase CPU context switching some, maybe not a lot. A clear benefit seems to be that we can independently control the number of threads in each threadpool, so it mostly seems worth keeping regardless.

I'm not sure it is possible to make the issue go away without either a) making at least one of the two tasks (guessing and verifying) sequential, or b) segregating rayon thread-pools. I think that is because rayon thread-pools are inherently synchronous and can only ever have one client. If true, then yes, it would imply that there is a lot of context switching going on when two distinct thread-pools are taking turns.

@jan-ferdinand
Member

Thanks for taking the time to write up the demo, both in terms of code and the readme.

You mention noise in the output due to buffering. It might reduce the noise if you lock stdout and flush it immediately. Don't forget to drop the write guard afterward. That said, I haven't run the demo, and it might not be a problem in practice.
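
For reference, a minimal sketch of that suggestion (not tested against the demo):

```rust
use std::io::Write;

fn log_line(msg: &str) {
    // Lock stdout once, write, and flush immediately so lines emitted from
    // different threads are less likely to interleave mid-line.
    let stdout = std::io::stdout();
    let mut handle = stdout.lock();
    writeln!(handle, "{msg}").unwrap();
    handle.flush().unwrap();
    // The write guard (`handle`) is dropped here, releasing the lock on stdout.
}
```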

@jan-ferdinand
Member

Here are my results from running the demo in all 12 configurations. Summarizing the picture, on my machine, the verifier appears to be stalled in configurations par_guess_global + par_verify_spawn and par_guess_global + par_verify_direct. No other configuration results in a stalling verifier.

[Screenshot: table of demo results for all 12 configurations]

@aszepieniec aszepieniec merged commit a1baa42 into master Jan 13, 2025
8 checks passed
Linked issue closed by this pull request: Node Hangs