
Node: Processor multithreading (simple) #3166

Closed · wants to merge 1 commit

Conversation
Conversation

bruce-riley
Contributor

No description provided.

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 3 times, most recently from 3bb80c8 to b1acab8 on July 12, 2023 16:00
@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from 85a818f to b734f68 on July 17, 2023 19:03
@bruce-riley
Contributor Author

bruce-riley commented Jul 18, 2023

The following graphs were generated by listening to mainnet gossip and mainnet pythnet, using 24 cores.

These metrics were generated using PR #3210.

The charts on the left measure the delay from when P2P puts the observation on the channel until the processor pulls it off. The charts on the right measure the time from when P2P enqueues the observation until the processor finishes processing it.

The X-axis is the delay in microseconds, and the Y-axis is the percentage of observations. Remember that Prometheus histograms are cumulative.
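
For context, here is a minimal sketch of how a channel-delay histogram like the left-hand charts could be instrumented. The metric name, bucket boundaries, and the timestamped wrapper type are illustrative assumptions, not the actual code in this PR or in #3210:

```go
package processor

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// channelDelay measures how long an observation sat on the channel before a
// worker picked it up. Buckets are in microseconds to match the charts above.
var channelDelay = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "processor_observation_channel_delay_us",
	Help:    "Delay between P2P enqueuing an observation and the processor dequeuing it (microseconds)",
	Buckets: []float64{50, 100, 250, 500, 750, 1000, 5000, 10000},
})

// timestampedObservation is a hypothetical wrapper recording when P2P
// enqueued the observation.
type timestampedObservation struct {
	enqueuedAt time.Time
	payload    []byte
}

// handleObservations pulls observations off the channel and records the delay.
func handleObservations(obsC <-chan *timestampedObservation) {
	for obs := range obsC {
		channelDelay.Observe(float64(time.Since(obs.enqueuedAt).Microseconds()))
		// ... process the observation ...
		_ = obs.payload
	}
}
```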

The three pairs of charts are:

  1. Multithreading with a factor of 2.0 (48 workers); see the worker-count sketch after this list
  2. The multithreading code with a factor of 0.0, meaning a single worker (new code, but single threaded)
  3. The old single threaded code
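
A minimal sketch, using a hypothetical workerCount helper, of how such a factor could translate into a number of workers: a factor of 2.0 on the 24-core machine above gives 48 workers, while a factor of 0.0 falls back to a single worker.

```go
package processor

import "runtime"

// workerCount converts a configurable factor into a number of worker
// goroutines. A factor of 2.0 on a 24-core machine gives 48 workers; a
// factor of 0.0 (or anything that rounds down to zero) falls back to a
// single worker.
func workerCount(factor float64) int {
	n := int(factor * float64(runtime.NumCPU()))
	if n < 1 {
		return 1
	}
	return n
}
```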

[Screenshot from 2023-07-18 13-25-18: the three pairs of charts]

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 3 times, most recently from 6cea7bb to 4131970 on July 27, 2023 15:04
@bruce-riley
Contributor Author

bruce-riley commented Jul 28, 2023

Here are some charts comparing the original single threaded processor with three versions of multithreading. The data was generated by listening to mainnet pythnet and mainnet gossip.

It is interesting that the three versions of multithreading look very similar. However, even the simple multithreading handled almost all observations in under 750 microseconds, whereas the existing single threaded approach took between five and ten milliseconds to reach that point.

So it seems like multithreading is helpful, but even the simple approach may be "good enough".

[Image: comparison charts described above]

@bruce-riley
Contributor Author

bruce-riley commented Jul 31, 2023

Some further data. I ran the benchmarks for the various configurations. For each config, I ran it three times with the total time for each run posted below.

  • Existing single threaded: 10.989, 10.964, 11.688 seconds
  • Simple multithreading: 11.099, 11.509, 11.801 seconds
  • Multithreading with dispatcher: 11.361, 11.860, 11.722 seconds
  • Direct dispatcher: 11.0557, 9.677, 10.060 seconds

So it seems like multithreading doesn't have much impact on the benchmarks.

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from 1406630 to f3a7b24 on August 1, 2023 14:44
@bruce-riley
Contributor Author

Although this PR does not have as big an impact as we had hoped, it is an improvement, and it positions us to make additional improvements going forward. Therefore I am marking it ready for review.

@bruce-riley bruce-riley requested a review from tbjump August 1, 2023 19:58
@bruce-riley bruce-riley marked this pull request as ready for review August 1, 2023 19:58
Review threads (resolved): node/cmd/guardiand/node.go, node/pkg/processor/broadcast.go (×2), node/pkg/processor/processor.go (×2)
@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from 116dca1 to a9b1e20 on August 3, 2023 20:45
@bruce-riley bruce-riley requested a review from tbjump August 3, 2023 20:56
Review threads (resolved): node/pkg/processor/cleanup.go (×2), node/pkg/processor/processor.go (×4), node/pkg/node/node_test.go
@tbjump
Contributor

tbjump commented Aug 7, 2023

Discussion of whether or not to catch panics:

We discussed this a bit in the code comments, but I'm starting this higher-level comment so the discussion doesn't get buried there.

First of all, errors should be caught and handled appropriately; nobody would argue otherwise.

When it comes to panics, I think we have to differentiate between:

  1. Panics that are expected to occur, e.g. when calling a library that is known to panic on certain inputs. These panics should be recovered as close to the panic as possible; at that point it is the same as handling an error (see the recovery sketch after this list).
  2. Panics that are unexpected and therefore don't have any recover code around them. Since the developer did not expect the panic, chances are that returning at this point may leak resources and/or leave the program in an inconsistent state: maybe a transaction stopped half-way, a lock wasn't released, a context wasn't canceled, or a socket wasn't closed. In these cases, a balance needs to be struck between correctness and availability. For correctness, it is safer to just crash the whole program and restart everything cleanly. But that jeopardizes availability, since the crash makes concurrent processes fail and the restart takes some time.
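
For case 1, a minimal sketch of recovering an expected panic close to the call site; the library call and types here are stand-ins, not code from this repository:

```go
package processor

import "fmt"

// Message is a stand-in for whatever the hypothetical library returns.
type Message struct{ Payload []byte }

// parseMayPanic is a stand-in for a library function that is known to panic
// on certain malformed inputs.
func parseMayPanic(input []byte) *Message {
	if len(input) == 0 {
		panic("empty input")
	}
	return &Message{Payload: input}
}

// parseUntrusted recovers the expected panic as close to the call as
// possible, turning it into an ordinary error for the caller to handle.
func parseUntrusted(input []byte) (msg *Message, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("parse panicked: %v", r)
		}
	}()
	return parseMayPanic(input), nil
}
```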

In the case of watchers, it was agreed (#2187) that panics should be caught generically and that watchers should always just be restarted. Because watchers are fairly self-contained, restarting them isn't very dangerous, and because they're the most panic-prone components, doing so improves reliability a lot.

But when it comes to the processor, I don't think that there should be attempts to recover from unexpected panics because the processor has more potential to end up in an inconsistent state.

Here is what I think this could look like: #3267. Workers don't return until their context is canceled, errors are simply logged, and if a worker hits an error it cannot handle, it should panic instead. What do you think?
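
A minimal sketch of that worker shape (names and types are illustrative, not the code in #3267): the worker runs until its context is canceled, logs recoverable errors, and panics only on errors it cannot handle.

```go
package processor

import (
	"context"
	"fmt"

	"go.uber.org/zap"
)

// observation is a placeholder for whatever the processor pulls off the channel.
type observation struct{ data []byte }

// worker loops until its context is canceled. Recoverable errors are logged
// and the loop continues; unrecoverable errors panic, which takes the whole
// guardian down so it can restart cleanly.
func worker(ctx context.Context, logger *zap.Logger, obsC <-chan *observation) {
	for {
		select {
		case <-ctx.Done():
			return
		case obs := <-obsC:
			if err := handleObservation(obs); err != nil {
				logger.Error("failed to handle observation", zap.Error(err))
			}
		}
	}
}

// handleObservation is a stand-in for the real processing logic.
func handleObservation(obs *observation) error {
	if obs == nil {
		// An impossible condition indicates a bug; crash rather than limp along.
		panic("nil observation")
	}
	if len(obs.data) == 0 {
		return fmt.Errorf("empty observation")
	}
	return nil
}
```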

Copy link
Contributor

tbjump commented Aug 7, 2023

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@tbjump tbjump force-pushed the node_processor_multithreading branch from 981da16 to 9d3eb96 on August 8, 2023 16:46
@bruce-riley
Contributor Author

It is desirable that the multithreaded processor handle errors and panics the same way as the current processor, meaning that panics kill the guardian and errors restart the entire processor. It is also desirable to use the RunWithScissors pattern, since that has been adopted by the team as the way to do error handling. Rather than create a one-off approach for the processor, I created a new StartRunnable function in the RunWithScissors module that lets you specify whether or not to catch errors, and the processor now uses StartRunnable. After this PR is merged, I plan to change all references to RunWithScissors to the appropriate StartRunnable call, after careful consultation with the various code owners.
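
For readers unfamiliar with the pattern, here is a minimal sketch of the general shape such a helper could take. This is an illustration of the idea only, not the actual StartRunnable or RunWithScissors signatures in the repository:

```go
package common

import (
	"context"
	"fmt"
)

// RunnableFn mirrors the kind of function the guardian runs as a goroutine.
type RunnableFn func(ctx context.Context) error

// StartRunnableSketch launches fn in its own goroutine. If catchPanics is
// true, a panic is recovered and forwarded to errC like an ordinary error;
// otherwise the panic propagates and kills the process. Returned errors are
// always forwarded to errC so the supervisor can restart the component.
// (Hypothetical helper; the real StartRunnable may differ.)
func StartRunnableSketch(ctx context.Context, errC chan<- error, catchPanics bool, name string, fn RunnableFn) {
	go func() {
		if catchPanics {
			defer func() {
				if r := recover(); r != nil {
					errC <- fmt.Errorf("%s: panic: %v", name, r)
				}
			}()
		}
		if err := fn(ctx); err != nil {
			errC <- fmt.Errorf("%s: %w", name, err)
		}
	}()
}
```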

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from 82ff5d3 to a06d2af on August 15, 2023 18:59
panoel previously approved these changes Aug 15, 2023
@tbjump
Contributor

tbjump commented Aug 25, 2023

Hey, what's the state of this PR? It has conflicts.

@bruce-riley bruce-riley force-pushed the node_processor_multithreading branch 2 times, most recently from c23c6c3 to 996b2a0 on August 25, 2023 20:28
Change-Id: I72e56c106c8a275c54af6cb073aa16a5c7d75fbe
@bruce-riley bruce-riley marked this pull request as draft August 29, 2023 16:30