
Node: Processor multithreading (simple) #3166

Closed · wants to merge 1 commit

Conversation
Conversation

bruce-riley
Contributor

No description provided.

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 3 times, most recently from 3bb80c8 to b1acab8 on July 12, 2023 16:00
@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from 85a818f to b734f68 on July 17, 2023 19:03
@bruce-riley
Contributor Author

bruce-riley commented Jul 18, 2023

The following graphs were generated by listening to mainnet gossip and mainnet pythnet, using 24 cores.

These metrics were generated using PR #3210.

The charts on the left measure the delay from when P2P puts the observation on the channel until the processor pulls it off. The charts on the right measure the time from when P2P enqueues the observation until the processor finishes processing it.

The X-axis is the delay in microseconds, and the Y-axis is the percentage of observations. Remember that Prometheus histograms are cumulative.
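
For context, here is a minimal sketch of how a channel-delay histogram like the left-hand charts could be instrumented. The metric name, bucket boundaries, and the timestamped wrapper type are illustrative assumptions, not the actual code in this PR or in #3210:

```go
package processor

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// channelDelay measures how long an observation sat on the channel before a
// worker picked it up. Buckets are in microseconds to match the charts above.
var channelDelay = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "processor_observation_channel_delay_us",
	Help:    "Delay between P2P enqueuing an observation and the processor dequeuing it (microseconds)",
	Buckets: []float64{50, 100, 250, 500, 750, 1000, 5000, 10000},
})

// timestampedObservation is a hypothetical wrapper recording when P2P
// enqueued the observation.
type timestampedObservation struct {
	enqueuedAt time.Time
	payload    []byte
}

// handleObservations pulls observations off the channel and records the delay.
func handleObservations(obsC <-chan *timestampedObservation) {
	for obs := range obsC {
		channelDelay.Observe(float64(time.Since(obs.enqueuedAt).Microseconds()))
		// ... process the observation ...
		_ = obs.payload
	}
}
```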

The three pairs of charts are:

  1. Multithreading with a factor of 2.0 (48 workers); see the worker-count sketch after this list
  2. The multithreading code with a factor of 0.0, meaning a single worker (new code, but single threaded)
  3. The old single threaded code
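
A minimal sketch, using a hypothetical workerCount helper, of how such a factor could translate into a number of workers: a factor of 2.0 on the 24-core machine above gives 48 workers, while a factor of 0.0 falls back to a single worker.

```go
package processor

import "runtime"

// workerCount converts a configurable factor into a number of worker
// goroutines. A factor of 2.0 on a 24-core machine gives 48 workers; a
// factor of 0.0 (or anything that rounds down to zero) falls back to a
// single worker.
func workerCount(factor float64) int {
	n := int(factor * float64(runtime.NumCPU()))
	if n < 1 {
		return 1
	}
	return n
}
```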

[Screenshot from 2023-07-18 13-25-18: the three pairs of charts]

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 3 times, most recently from 6cea7bb to 4131970 on July 27, 2023 15:04
@bruce-riley
Contributor Author

bruce-riley commented Jul 28, 2023

Here are some charts comparing the original single threaded processor with three versions of multithreading. The data was generated by listening to mainnet pythnet and mainnet gossip.

It is interesting that the three versions of multithreading look very similar. However, even the simple multithreading handled almost all observations in under 750 microseconds, whereas the existing single threaded approach took between five and ten milliseconds to reach that point.

So it seems like multithreading is helpful, but even the simple approach may be "good enough".

[Image: comparison charts described above]

@bruce-riley
Contributor Author

bruce-riley commented Jul 31, 2023

Some further data. I ran the benchmarks for the various configurations. For each config, I ran it three times with the total time for each run posted below.

  • Existing single threaded: 10.989, 10.964, 11.688 seconds
  • Simple multithreading: 11.099, 11.509, 11.801 seconds
  • Multithreading with dispatcher: 11.361, 11.860, 11.722 seconds
  • Direct dispatcher: 11.0557, 9.677, 10.060 seconds

So it seems like multithreading doesn't have much impact on the benchmarks.

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from 1406630 to f3a7b24 on August 1, 2023 14:44
@bruce-riley
Contributor Author

Although this PR does not have as big an impact as we had hoped, it is an improvement, and it positions us to make additional improvements going forward. Therefore I am marking it ready for review.

@bruce-riley bruce-riley requested a review from tbjump August 1, 2023 19:58
@bruce-riley bruce-riley marked this pull request as ready for review August 1, 2023 19:58
Review threads (resolved): node/cmd/guardiand/node.go, node/pkg/processor/broadcast.go (×2), node/pkg/processor/processor.go (×2)
@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from 116dca1 to a9b1e20 on August 3, 2023 20:45
@bruce-riley bruce-riley requested a review from tbjump August 3, 2023 20:56
Review threads (resolved): node/pkg/processor/cleanup.go (×2), node/pkg/processor/processor.go (×4), node/pkg/node/node_test.go
@tbjump
Contributor

tbjump commented Aug 7, 2023

Discussion of whether or not to catch panics:

We discussed this a bit in the code comments, but I'm starting this higher-level comment so the discussion doesn't get buried there.

First of all, errors should be caught and handled appropriately; nobody would argue otherwise.

When it comes to panics, I think we have to differentiate between:

  1. Panics that are expected to occur, e.g. when calling a library that is known to panic on certain inputs. These panics should be recovered as close to the panic as possible; at that point it is the same as handling an error (see the recovery sketch after this list).
  2. Panics that are unexpected and therefore don't have any recover code around them. Since the developer did not expect the panic, chances are that returning at this point may leak resources and/or leave the program in an inconsistent state: maybe a transaction stopped half-way, a lock wasn't released, a context wasn't canceled, or a socket wasn't closed. In these cases, a balance needs to be struck between correctness and availability. For correctness, it is safer to just crash the whole program and restart everything cleanly. But that jeopardizes availability, since the crash makes concurrent processes fail and the restart takes some time.
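
For case 1, a minimal sketch of recovering an expected panic close to the call site; the library call and types here are stand-ins, not code from this repository:

```go
package processor

import "fmt"

// Message is a stand-in for whatever the hypothetical library returns.
type Message struct{ Payload []byte }

// parseMayPanic is a stand-in for a library function that is known to panic
// on certain malformed inputs.
func parseMayPanic(input []byte) *Message {
	if len(input) == 0 {
		panic("empty input")
	}
	return &Message{Payload: input}
}

// parseUntrusted recovers the expected panic as close to the call as
// possible, turning it into an ordinary error for the caller to handle.
func parseUntrusted(input []byte) (msg *Message, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("parse panicked: %v", r)
		}
	}()
	return parseMayPanic(input), nil
}
```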

In the case of watchers, it was agreed (#2187) that panics should be caught generically and that watchers should always just be restarted. Because watchers are fairly self-contained, restarting them isn't very dangerous, and because they're the most panic-prone components, doing so improves reliability a lot.

But when it comes to the processor, I don't think that there should be attempts to recover from unexpected panics because the processor has more potential to end up in an inconsistent state.

Here is what I think this could look like: #3267. Workers don't return until their context is canceled, errors are simply logged, and if a worker hits an error it cannot handle, it should panic instead. What do you think?
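
A minimal sketch of that worker shape (names and types are illustrative, not the code in #3267): the worker runs until its context is canceled, logs recoverable errors, and panics only on errors it cannot handle.

```go
package processor

import (
	"context"
	"fmt"

	"go.uber.org/zap"
)

// observation is a placeholder for whatever the processor pulls off the channel.
type observation struct{ data []byte }

// worker loops until its context is canceled. Recoverable errors are logged
// and the loop continues; unrecoverable errors panic, which takes the whole
// guardian down so it can restart cleanly.
func worker(ctx context.Context, logger *zap.Logger, obsC <-chan *observation) {
	for {
		select {
		case <-ctx.Done():
			return
		case obs := <-obsC:
			if err := handleObservation(obs); err != nil {
				logger.Error("failed to handle observation", zap.Error(err))
			}
		}
	}
}

// handleObservation is a stand-in for the real processing logic.
func handleObservation(obs *observation) error {
	if obs == nil {
		// An impossible condition indicates a bug; crash rather than limp along.
		panic("nil observation")
	}
	if len(obs.data) == 0 {
		return fmt.Errorf("empty observation")
	}
	return nil
}
```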

Copy link
Contributor

tbjump commented Aug 7, 2023

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@tbjump tbjump force-pushed the node_processor_multithreading branch from 981da16 to 9d3eb96 on August 8, 2023 16:46
@bruce-riley
Contributor Author

It is desirable that the multithreaded processor handle errors and panics the same way as the current processor, meaning that panics kill the guardian and errors restart the entire processor. It is also desirable to use the RunWithScissors pattern, since that has been adopted by the team as the way to do error handling. Rather than create a one-off approach for the processor, I created a new StartRunnable function in the RunWithScissors module that lets you specify whether or not to catch errors, and the processor now uses StartRunnable. After this PR is merged, I plan to change all references to RunWithScissors to the appropriate StartRunnable call, after careful consultation with the various code owners.
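
For readers unfamiliar with the pattern, here is a minimal sketch of the general shape such a helper could take. This is an illustration of the idea only, not the actual StartRunnable or RunWithScissors signatures in the repository:

```go
package common

import (
	"context"
	"fmt"
)

// RunnableFn mirrors the kind of function the guardian runs as a goroutine.
type RunnableFn func(ctx context.Context) error

// StartRunnableSketch launches fn in its own goroutine. If catchPanics is
// true, a panic is recovered and forwarded to errC like an ordinary error;
// otherwise the panic propagates and kills the process. Returned errors are
// always forwarded to errC so the supervisor can restart the component.
// (Hypothetical helper; the real StartRunnable may differ.)
func StartRunnableSketch(ctx context.Context, errC chan<- error, catchPanics bool, name string, fn RunnableFn) {
	go func() {
		if catchPanics {
			defer func() {
				if r := recover(); r != nil {
					errC <- fmt.Errorf("%s: panic: %v", name, r)
				}
			}()
		}
		if err := fn(ctx); err != nil {
			errC <- fmt.Errorf("%s: %w", name, err)
		}
	}()
}
```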

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from 82ff5d3 to a06d2af on August 15, 2023 18:59
panoel previously approved these changes Aug 15, 2023
@tbjump
Contributor

tbjump commented Aug 25, 2023

Hey, what's the state of this PR? It has conflicts.

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from c23c6c3 to 996b2a0 on August 25, 2023 20:28
Change-Id: I72e56c106c8a275c54af6cb073aa16a5c7d75fbe
@bruce-riley bruce-riley marked this pull request as draft August 29, 2023 16:30