
[refactor] improve synchronizer and bulk block processing #59

Merged
merged 3 commits into main from akichidis/synchronizer-improvements
Dec 20, 2023

Conversation

akichidis (Collaborator)

This PR:

  • makes the synchronizer continuously send blocks to peers, rather than attempting a send only when a new block is produced.
  • samples peers based on latency so that lower-latency peers are prioritised (there may be some exceptions to this); see the sketch after this list.
  • makes the send operations on the block disseminator non-blocking, so we avoid clogging the network loops.
  • sends multiple blocks at once, processes them in bulk, and pre-checks whether they have already been processed, so we avoid any penalties from verification etc.
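A rough sketch of the latency-based sampling idea, purely for illustration: `PeerInfo`, `sample_peers`, and the concrete cut-off value are hypothetical stand-ins, not the types used in this PR (the real code keeps per-peer latency in watch channels and filters on `CUT_OFF_MILLIS`, as shown further down in the diff).

```rust
/// Hypothetical peer record; the real code tracks per-peer latency via watch channels.
struct PeerInfo {
    index: usize,
    latency_ms: u128,
}

/// Illustrative cut-off, mirroring the CUT_OFF_MILLIS filter quoted later in this thread.
const CUT_OFF_MILLIS: u128 = 1_000;

/// Pick up to `count` peers, skipping `except` and anything above the latency cut-off,
/// preferring the lowest-latency peers among the rest.
fn sample_peers(peers: &[PeerInfo], except: &[usize], count: usize) -> Vec<usize> {
    let mut eligible: Vec<&PeerInfo> = peers
        .iter()
        .filter(|p| !except.contains(&p.index) && p.latency_ms <= CUT_OFF_MILLIS)
        .collect();
    eligible.sort_by_key(|p| p.latency_ms);
    eligible.iter().take(count).map(|p| p.index).collect()
}

fn main() {
    let peers = vec![
        PeerInfo { index: 0, latency_ms: 90 },
        PeerInfo { index: 1, latency_ms: 2_500 }, // filtered out by the cut-off
        PeerInfo { index: 2, latency_ms: 40 },
    ];
    // Lowest-latency eligible peers come first.
    assert_eq!(sample_peers(&peers, &[], 2), vec![2, 0]);
}
```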

syncer: Mutex<Syncer<H, S, C>>,
syncer: RwLock<Syncer<H, S, C>>,
Collaborator

What is the insight behind this change? Are we afraid that too many reads will starve the writes?

Collaborator

Yeah, I am not sure we need this change, since this is simulated code anyway and the simulator runs in a single thread (so there is no difference between Mutex and RwLock).

Collaborator Author (akichidis)

Yep, this is not really needed; it was carried over as part of some experimentation.
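For readers unfamiliar with the distinction the thread is debating, a minimal standalone illustration (not code from this PR): RwLock admits many concurrent readers while Mutex serialises every access, which only matters when the code actually runs on multiple threads.

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

fn main() {
    // With RwLock, multiple reader threads may hold the lock at the same time.
    let shared = Arc::new(RwLock::new(0u64));
    let readers: Vec<_> = (0..2)
        .map(|_| {
            let s = Arc::clone(&shared);
            thread::spawn(move || *s.read().unwrap())
        })
        .collect();
    for r in readers {
        r.join().unwrap();
    }

    // With Mutex, every access is exclusive, even if it is only a read.
    let exclusive = Arc::new(Mutex::new(0u64));
    let _value = *exclusive.lock().unwrap();
}
```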

block_receive_latency: register_histogram_vec_with_registry!(
"block_receive_latency",
"The time it took for a block to reach our node",
&["authority", "proc"],
Collaborator

Is the authority the author of the block? What is proc?

Collaborator Author (akichidis)

I've refactored that metric a bit. The authority is basically the block author.
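For illustration, this is roughly how such a histogram is registered and observed with the prometheus crate. The label set used here ("authority" only) and the sample values are assumptions, since the author mentions the metric was refactored.

```rust
use prometheus::{register_histogram_vec_with_registry, HistogramVec, Registry};

fn main() {
    let registry = Registry::new();

    // Mirrors the registration quoted in the diff; label names may differ
    // after the refactor the author mentions.
    let block_receive_latency: HistogramVec = register_histogram_vec_with_registry!(
        "block_receive_latency",
        "The time it took for a block to reach our node",
        &["authority"],
        registry
    )
    .unwrap();

    // Record the receive latency of a block, labelled by its author.
    // "authority_0" and the 0.25s value are placeholders.
    block_receive_latency
        .with_label_values(&["authority_0"])
        .observe(0.25);
}
```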

@@ -158,15 +168,23 @@ impl<H: BlockHandler + 'static, C: CommitObserver + 'static> NetworkSyncer<H, C>
block_fetcher: Arc<BlockFetcher>,
block_verifier: Arc<impl BlockVerifier>,
metrics: Arc<Metrics>,
leader_timeout: Duration,
parameters: SynchronizerParameters,
cleanup_enabled: bool,
Collaborator

In which cases are we enabling/disabling cleanups?

Collaborator Author (akichidis)

That was mainly added during some experimentation; I found it convenient when investigating possible interference of cleanup with latencies. I would suggest keeping it for now, and we can always remove it later once we are closer to production. What do you think?

Comment on lines +58 to +59
let (_al_sender, al_receiver) = tokio::sync::watch::channel(Duration::from_secs(0));
let (_bl_sender, bl_receiver) = tokio::sync::watch::channel(Duration::from_secs(0));
Collaborator

Is there an easy way to indicate what these channels are for (e.g. comments or more explicit names)?
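One possible way to address this, sketched with hypothetical names (the thread does not spell out what `al` and `bl` stand for, so the naming below is only an illustration of the reviewer's suggestion):

```rust
use std::time::Duration;
use tokio::sync::watch;

fn main() {
    // Hypothetical descriptive naming for one of the latency watch channels;
    // the real meaning of `al`/`bl` is not stated in this thread.
    let (block_latency_sender, block_latency_receiver) =
        watch::channel(Duration::from_secs(0));

    // Producer side: publish the latest measured latency.
    block_latency_sender
        .send(Duration::from_millis(120))
        .unwrap();

    // Consumer side: read the most recent value without awaiting.
    assert_eq!(*block_latency_receiver.borrow(), Duration::from_millis(120));
}
```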

Comment on lines +92 to +94
if to_send.len() >= CHUNK_SIZE {
self.send(peer, NetworkMessage::Blocks(std::mem::take(&mut to_send)))?;
}
Collaborator

If we make these kinds of changes, are we running benchmarks to ensure we are not regressing?

Collaborator Author (akichidis)

You mean on AWS, right? Basically, those changes were introduced as part of our integration and experimentation on the private testnet, where we already see better results and more sustainable behaviour. We are also willing to run AWS experiments.
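For context, the chunking pattern in the quoted lines can be sketched in isolation as below; `send_in_chunks`, the callback, and the chunk size are illustrative stand-ins for the real `self.send(peer, NetworkMessage::Blocks(..))` path.

```rust
/// Illustrative chunk size; the real value lives in the synchronizer code.
const CHUNK_SIZE: usize = 10;

/// Drain `blocks` into chunks of at most CHUNK_SIZE and hand each chunk to
/// `send_chunk`, flushing whatever remains at the end. This mirrors the
/// `std::mem::take(&mut to_send)` pattern in the quoted diff.
fn send_in_chunks<B>(blocks: Vec<B>, mut send_chunk: impl FnMut(Vec<B>)) {
    let mut to_send = Vec::new();
    for block in blocks {
        to_send.push(block);
        if to_send.len() >= CHUNK_SIZE {
            send_chunk(std::mem::take(&mut to_send));
        }
    }
    if !to_send.is_empty() {
        send_chunk(to_send);
    }
}

fn main() {
    let mut chunks = 0;
    send_in_chunks((0..25).collect(), |chunk: Vec<i32>| {
        chunks += 1;
        assert!(chunk.len() <= CHUNK_SIZE);
    });
    assert_eq!(chunks, 3); // 10 + 10 + 5
}
```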

.senders
.iter()
.filter(|&(index, _)| !except.contains(index))
.filter(|&(index, (_, latency_receiver))| {
!except.contains(index) && latency_receiver.borrow().as_millis() <= CUT_OFF_MILLIS
Collaborator

Should we also implement an anti-DoS check on the helper nodes to ensure they do not allocate more than a set amount of resources per peer?

Collaborator Author (akichidis)

That's something we definitely need to do, but I would suggest doing it in a later iteration, as it's a separate domain of its own.
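Purely as an illustration of the kind of per-peer cap being discussed (not part of this PR), a naive in-flight request budget might look like the following; all names and the limit are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical per-peer limit on concurrently served sync requests.
const MAX_INFLIGHT_PER_PEER: usize = 16;

#[derive(Default)]
struct PeerBudget {
    /// peer index -> requests currently in flight
    inflight: HashMap<usize, usize>,
}

impl PeerBudget {
    /// Reserve a slot for `peer`, returning false if it has hit its limit.
    fn try_acquire(&mut self, peer: usize) -> bool {
        let count = self.inflight.entry(peer).or_insert(0);
        if *count >= MAX_INFLIGHT_PER_PEER {
            return false;
        }
        *count += 1;
        true
    }

    /// Release the slot once the request has been served.
    fn release(&mut self, peer: usize) {
        if let Some(count) = self.inflight.get_mut(&peer) {
            *count = count.saturating_sub(1);
        }
    }
}

fn main() {
    let mut budget = PeerBudget::default();
    assert!(budget.try_acquire(7));
    budget.release(7);
}
```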

@akichidis akichidis force-pushed the akichidis/improve-commit-performance-2 branch from 2d2dea9 to 9c63296 Compare December 11, 2023 17:51
…eer latency. Also send and process blocks at bulk.
@akichidis akichidis force-pushed the akichidis/synchronizer-improvements branch from f4ca0ed to cb8f5b2 Compare December 15, 2023 13:45
@akichidis akichidis changed the base branch from akichidis/improve-commit-performance-2 to main December 15, 2023 13:45
@akichidis akichidis merged commit f4bae7f into main Dec 20, 2023
1 check passed