Add support for concurrent requests to a single peer. #200

njgheorghita · 2023-07-10T20:14:40Z

Description

Add support to discv5 for multiple, concurrent network requests to a single peer. See #198

How was it fixed?

Update ActiveRequests to support tracking multiple, concurrent, active requests for any given peer.
Moved HashMapDelay to the nonce mapping since it's no longer effective on the peer mapping.
If any single request fails for a peer, all other active requests for that peer are also failed.
Replay all active requests if a new challenge is issued.
Added tests for ActiveRequests methods.

Notes & open questions

Updated active_requests methods to return a Result rather than an Option. This made sense to me to accurately reflect the cases where the mappings might be out of sync. However, I realize that the HashMap & HashMapDelay apis return an Option, so I don't feel strongly about this change either way.
This was tested in trin, and resulted in 100% success rate for utp txs over discv5.

Change checklist

Self-review
Documentation updates if relevant
Tests if relevant

src/handler/active_requests.rs

src/handler/mod.rs

njgheorghita · 2023-07-19T21:17:39Z

Thanks @divagant-martian for the review! I wasn't too familiar with using entry() so thanks for pointing out where that would be useful.

I went back to the old pattern of returning an Option rather than a Result from ActiveRequests::remove_* methods. Imo it's just cleaner if we want to use entry() to mutate ActiveRequests. As well as it more closely represents the HashMap api. This means that the code functionally ignores scenarios when the ActiveRequests mappings are out of sync (besides logging some info). However, this was the behavior before this pr, so I don't consider it a problem.
I re-introduced pending_requests to queue requests that are made before a session is established

divagant-martian

partial review for now

src/handler/active_requests.rs

divagant-martian · 2023-07-25T01:56:40Z

src/handler/active_requests.rs

+                    .position(|req| req.packet().message_nonce() == nonce)
+                    .expect("to find request call by nonce");


Back when I suggested to use a expect the case was first finding the position and then a get. In that case it was ok since we assume the standard library ensures us position is correct.

Here however, this means the RequestCall matching the nonce is always present, which it should if we don't have bugs. Since we are treating our (potential) bugs as None this shouldn't be expectd, but treated instead exactly as the Vacant variant

src/handler/active_requests.rs

divagant-martian · 2023-07-25T02:15:00Z

src/handler/active_requests.rs

+                    Entry::Vacant(_) => panic!("invalid ActiveRequests state"),
+                    Entry::Occupied(mut requests) => {
+                        let index = requests
+                            .get()
+                            .iter()
+                            .position(|req| req.packet().message_nonce() == &nonce)
+                            .expect("to find request call by nonce");


better to no panic in those to keep in line with gracefully handling potential inconsistencies

ackintosh · 2023-07-30T21:56:27Z

I'm testing this PR with a simulation. 🙂 The sequence diagram below shows what happens on session expirations.

sequenceDiagram
    participant Node1
    participant Node2
    Note over Node1: Start discv5 server
    Note over Node2: Start discv5 server

    Node1 ->> Node2: message
    Node2 ->> Node1: message
    Note over Node1,Node2: Session established

    rect rgb(100, 100, 0)
    Note over Node2: ** Session expired **
    end

    rect rgb(10, 10, 10)
    Note left of Node1: Node1 sends two requests<br> **in parallel**.
    par 
    Node1 ->> Node2: FINDNODE (id:1)
    and 
    Node1 -->> Node2: FINDNODE (id:2)
    end
    end

    Note over Node2: Send WHOAREYOU<br>since the session has been expired
    Node2 ->> Node1: WHOAREYOU (id:1)

    rect rgb(100, 100, 0)
    Note over Node2: Drop FINDNODE(id:2) request<br>since WHOAREYOU already sent.
    end

    Note over Node1: New session established with Node2

    Node1 ->> Node2: Handshake message (id:1)

    Note over Node2: New session established with Node1

    Node2 ->> Node1: NODES (id:1)

    rect rgb(100, 0, 0)
    Note over Node1: ** FINDNODE(id:2) request timed out **
    end

I observed that the FINDNODE (id:2) request had been dropped and then the request timed out. As fjl said, I think the FINDNODE (id:2) request should be replayed with the new session.

#198 (comment)

Remember that in discv5, it is assumed the session may be dropped at any moment. If the node has multiple requests in flight with a peer, and the peer answers with a challenge (WHOAREYOU), all pending requests would need to be replayed.

src/handler/mod.rs

njgheorghita · 2023-08-03T20:01:29Z

@ackintosh Thanks for the review! Made the suggested changes. Now, all active requests are replayed if an active session is re-established. Tested it out locally with testground and it appears to have solved the FINDNODE request timeout. Lmk if there are any other changes / testground simulations to run.

Btw, curious as to how you're finding testground? Looks like a pretty useful tool and something that we might want to use in trin.

ackintosh · 2023-08-03T22:19:38Z

@njgheorghita Thanks! I will review the changes.

Yeah testground is so helpful. It's easy to simulate/reproduce particular situation/problem with testground. We are also using testground for the gossipsub protocol https://github.com/sigp/gossipsub-testground. I hope this can be of reference.

src/handler/mod.rs

divagant-martian

This is progressing in the right direction but there are some bugs to be fixed

divagant-martian · 2023-08-03T23:05:35Z

src/handler/active_requests.rs

+        node_address: &NodeAddress,
+        id: &RequestId,
+    ) -> Option<RequestCall> {
+        match self.active_requests_mapping.entry(node_address.to_owned()) {


There is no difference between clone and to_owned for NodeContact since it's owned version it of the same type. Let's use clone which is easier to understand in general

Suggested change

match self.active_requests_mapping.entry(node_address.to_owned()) {

match self.active_requests_mapping.entry(node_address.clone()) {

divagant-martian · 2023-08-03T23:06:20Z

src/handler/active_requests.rs

+                    .remove(request_call.packet().message_nonce());
+                Some(request_call)


you need to check the len of requests. Otherwise you will end up with an entry with an empty list

divagant-martian · 2023-08-03T23:11:31Z

src/handler/active_requests.rs

+                    Entry::Occupied(mut requests) => {
+                        match requests
+                            .get()
+                            .iter()
+                            .position(|req| req.packet().message_nonce() == &nonce)
+                        {
+                            Some(index) => Poll::Ready(Some(Ok((
+                                node_address,
+                                requests.get_mut().remove(index),
+                            )))),
+                            None => Poll::Ready(None),
+                        }


same here, this will leave requests with empty lists

this will leave requests with empty lists

Not sure I'm understanding your suggestion here. Do you want to see an empty requests replaced with Entry::Vacant after removing the last request? Based on your suggestion on line 89, it seems like the convention is that it is possible for the active_requests_mapping entry for a node to be an empty list instead of strictly Entry::Vacant.

In the case where requests is an empty list, this logic will simply return None => Poll::Ready(None). There doesn't seem to be a need to handle the empty list case any differently? But I could be missing something.

But if you'd like to see the change where the active_requests_mapping entry for a node without any active requests cannot be an empty list and must always be Entry::Vacant, I can make those changes.

Yes. Empty list should always be vacant. Otherwise we can let this grow unbounded on keys because we forgot to remove stale ones

To be clear, on empty list use entry.remove()

divagant-martian · 2023-08-04T02:29:46Z

src/handler/mod.rs

        if let Some(current_session) = self.sessions.get_mut(&node_address) {
            current_session.update(session);
+            self.replay_active_requests::<P>(&node_address).await;


changes like this deserve a comment.
I need to think this one a bit more but the session stores the old keys in case they are needed, so I'm not sure this is necessary

Added a comment, but lmk if you feel like this needs to be removed. Although just to mention that this was the change that solved the FindNode requests timeout error described above, but if this is unnecessary I'll debug that failure some more and find a more appropriate solution

The timeout has been solved by this change however I noticed another issue about this replaying:

As highlighted by the red square below, replay_active_requests() re-sends the FINDNODE(id:1) that has already been sent in the Handshake message. Active requests that should be re-sent is only FINDNODE(id:2) in the case of this example.

sequenceDiagram participant Node1 participant Node2 Note over Node1: Start discv5 server Note over Node2: Start discv5 server Node1 ->> Node2: message Node2 ->> Node1: message Note over Node1,Node2: Session established rect rgb(100, 100, 0) Note over Node2: ** Session expired ** end rect rgb(10, 10, 10) Note left of Node1: Node1 sends two requests **in parallel**. par Node1 ->> Node2: FINDNODE (id:1) and Node1 -->> Node2: FINDNODE (id:2) end end Note over Node2: Send WHOAREYOU since the session has been expired Node2 ->> Node1: WHOAREYOU (id:1) rect rgb(100, 100, 0) Note over Node2: Drop FINDNODE(id:2) request since WHOAREYOU already sent. end Note over Node1: New session established with Node2 Node1 ->> Node2: Handshake message (id:1) Note over Node2: New session established with Node1 rect rgb(10, 10, 10) Note left of Node1: Handler::replay_active_requests() rect rgb(100, 0, 0) Node1 ->> Node2: FINDNODE (id:1) **duplicates with Handshake message (id:1)** end Node1 -->> Node2: FINDNODE (id:2) end Node2 ->> Node1: NODES (id:1) Node2 -->> Node1: NODES (id:2) rect rgb(100, 0, 0) Node2 ->> Node1: NODES (id:1) end

Loading

Cool, thanks for the diagram it was super helpful. I believe I've implemented a sufficient solution that filters out the handshake request when replaying active requests. But only in the case when Node2 sends the whoareyou. I'm not sure if I took the right approach here, but lmk if you see anything that needs changing.

src/handler/mod.rs

ackintosh · 2023-08-08T23:02:43Z

src/handler/mod.rs

                Some(Ok((node_address, _challenge))) = self.active_challenges.next() => {
-                    // A challenge has expired. There could be pending requests awaiting this
-                    // challenge. We process them here
-                    self.send_next_request::<P>(node_address).await;
+                    // A challenge has expired. There could be active requests impacted by this
+                    // challenge. We replay them here
+                    self.replay_active_requests::<P>(&node_address).await;


I think we need to send pending_requests here not active_requests.

Current implementation, requests in pending_requests remain without being sent in case of WHOAREYOU timeout:

sequenceDiagram participant Node1 participant Node2 Node2 ->> Node1: Random packet Note over Node1: active_challenges.insert() Node1 ->> Node2: WHOAREYOU rect rgb(100, 100, 0) Note over Node2: ** Discard the WHOAREYOU packet to reproduce kind of network issue. ** end rect rgb(10, 10, 10) Note left of Node1: Node1 want to send FINDNODE to Node2 but active_challenge exists. So insert requests into pending_requests. par Note over Node1: pending_requests.insert() and Note over Node1: pending_requests.insert() end end Note over Node1: The challenge in active_challenges has been expired. rect rgb(100, 0, 0) Note over Node1: FINDNODE requests remain in pending_requests without being sent. end

Loading

njgheorghita · 2023-08-21T21:38:06Z

Just a note that I've been experiencing some flaky behavior while trying to use this branch inside trin. Looks like there could be a problem with how requests are queued for a node before the session is established. Will update here once I've found a solution.

ackintosh · 2023-08-22T23:10:24Z

@njgheorghita Could you please merge current master into this branch? There're some fixes. I'm not sure if the fixes relates to that problem though.

njgheorghita · 2023-08-23T18:59:08Z

@ackintosh Doesn't seem to have fixed the flakiness, I should have some time in the next day or two to further investigate what's going on there and I'll report back here

ackintosh · 2023-08-25T04:11:09Z

@njgheorghita I'm working on adding a test for replay_active_request() and trying to fix bugs I found through the test.

I don't have permission to push commits to this PR directly, so I have created a PR to your experimental branch here: njgheorghita#2

ackintosh · 2023-10-18T23:08:50Z

After the fix has merged, I had run this on a live network and verified it works as expected.

AgeManning · 2023-11-06T01:42:50Z

Ok, @ackintosh we think this ready for a merge?

I'll do one last quick review if so.

ackintosh · 2023-11-06T23:38:04Z

@AgeManning Yes, please go ahead with the final review. 🙂

AgeManning · 2023-11-29T08:46:53Z

This seems like it's been well tested. I think the best approach now is to merge it in. We're hoping to get better visibility on errors that occur is discovery, with #218

I think we should make a release with both of these PRs in and we will be able to test at scale in Lighthouse also.

AgeManning · 2023-11-29T08:51:24Z

@divagant-martian - Want to do a final review, happy with a merge then live test?

divagant-martian · 2023-11-29T22:34:08Z

sure, I''l review this friday!

AgeManning · 2023-12-04T03:13:01Z

As soon as the review is approved lets merge this :)

divagant-martian

I'm mostly happy with the current state. Looks like the happy paths are appropriately handled. There are some places where I'd like to do some changes: Active requests can probably be changed to use a Smallvec instead of a Vec with a reasonable low value. Some places need a bit of love in handling the unhappy paths since we do things like this:

fn operation_that_should_be_atomic() {
    op1_that_works(); // this change remains even if op2 fails
    op2_that_fails()?;
}

but in all fairness the affected code already required some love.

That being said, I don't want to delay this further. The base is good, now time to battle test it.

@AgeManning please note that there is one required changed before merging, since there is an unwrap in the code

divagant-martian · 2023-12-04T16:47:36Z

src/handler/mod.rs

+            {
+                let new_packet = session
+                    .encrypt_message::<P>(self.node_id, &request_call.encode())
+                    .unwrap();


I kinda believe this but we don't really use unwrap in prod code (release builds). Can you please change this to handle the error or at least change it to an expect with the right explanation?

This is my only requirement before merging @njgheorghita

@njgheorghita - I had a quick look through the AES library we use. From what I can tell this function can fail if the size of the messages are over a specific limit. I was looking at version 0.9. I haven't looked at future versions.

I don't think it's reasonable to assume that in the current version and all possible future versions, that every possible message we try to encrypt will be successful. If there is an event that it's not successful, I dont think we should crash the entire application.

Failing to encrypt a message is not a catastrophic failure imo. I think it's fine to just log a CRIT message and continue running.

In which case we shouldn't use an unwrap() or expect() rather check the failure and if it fails log a crit.

What do you think?

Some thoughts...

We're only dealing with requests retrieved from the active_requests store. IIUC this implies a pretty strong guarantee that the encryption will be successful, since the request has already been successfully encrypted when it originated. If this is accurate, then expect might actually expose an error in the request-handling logic - though, I'm unsure whether or not that is worth crashing...

If the encryption fails, do we want to do more than just log a CRIT message? Maybe it's worth removing the request entirely from the active_requests store. Although... maybe it's alright to just leave the (unable to encrypt) active request as is and let it simply timeout?

Failing to encrypt a message is not a catastrophic failure imo.

Ultimately, I agree with this. So I'm leaning towards logging a message. The question I would need to think more about (it's been a while since I've been intimate with this code 😅 ) is whether or not we need to handle un-encryptable requests? Or if it's ok to just log the message and let the request timeout

The reason why I more or less believed this could be true is because we are replaying. In that scenario, the appropriate message in expect is why this is infallible, not simply the error occurred. Maybe could have been more clear about it.

But it seems to be replaying might no be enough guarantee from @AgeManning 's comment? In that case the error needs to be handled. Here that potentially means bobbling up the error to the application. You should look into how failed requests and failed encryption in particular is handled in other parts of the code to get a good idea of what to do in this scenario.

Also, yes to logging the error (there is no CRIT, just ERROR)

Or if it's ok to just log the message and let the request timeout

Would need to check but I assume we try to guarantee that if a request times out then it is active, removing from one place and not the other would most likely break that invariant.

Yeah all good points.

My personal preference with edge cases like this, is to just log the highest kind of error. Hopefully we have good enough monitoring tools to notice it. I tend not to try to engineer wild error cases. If we hit this, I think its fine to just let the request timeout. Usually we should always send an error back to the service, so the service knows that the request broke, but I'm not sure if its even worth doing that.
This should be a very very low probability of occurring. If things start breaking because we dont handle this error case correctly, we should hopefully be able to identify the problem from the error log.

Anyway, feel free to handle how you like, but I think just taking the easy path and logging an error should be enough.

…e request

njgheorghita · 2023-12-07T22:40:16Z

@AgeManning @divagant-martian @ackintosh Thanks everyone so much for the help on this pr! I believe all the issues have been addressed & fixed, but if there are any other final changes that you'd like to see, just lmk!

ackintosh · 2023-12-13T01:37:42Z

https://github.com/sigp/discv5/releases/tag/v0.4.0 🎉

* Version bump to v0.3.1 (#203) * fix docs (#204) * refactor for removing `Discv5` prefix (#206) * goodbye prefix * adjust docs * fix rpc decoding to reject extra data (#208) * reject extra data * reduce diff * expected_responses remains after challenge has been completed (#210) * Replace `Handler::spawn` with `build_handler()` and `Handler::start()` * Test the handler's states after the handler has been terminated * Remove expected response on handle_auth_message() * Rename variables for readability * Expose local ENR Arc (#214) * Use zero ports in tests (#216) * update dependencies (#219) * Changing port of ReponseBody::Pong to NonZeroU16 (#220) * Change `port` from u16 to NonZeroU16 * Fix tests * Fix test: the PONG port can't be zero * Fix clippy warnings * Update 'enr' dependency (#223) * Add support for concurrent requests to a single peer. (#200) Co-authored-by: ackintosh <sora.akatsuki@gmail.com> Co-authored-by: Diva M <divma@protonmail.com> Co-authored-by: Age Manning <Age@AgeManning.com> * Adjust some logs (#225) * remove log for timed out query. This is always informed in the callback * expand common logs, unify info on startup * adjust auth header log * Update src/service.rs * Appease clippy * Realised I was wrong. Don't need this log, my bad * fmt --------- Co-authored-by: Age Manning <Age@AgeManning.com> * Version bump to v0.4.0 * make tracing-subscriber a dev dep (#226) * Fix warnings and bump libp2p (#232) * Update session_cache_capacity from usize to NonZeroUsize since the argument of LruCache::new is NonZeroUsize. * Fix rustdoc * cargo fmt * Fix a merging mistake: lost validation in Message::decode --------- Co-authored-by: Age Manning <Age@AgeManning.com> Co-authored-by: Divma <26765164+divagant-martian@users.noreply.github.com> Co-authored-by: Jack McPherson <jmcph4.github@gmail.com> Co-authored-by: João Oliveira <hello@jxs.pt> Co-authored-by: Milos Stankovic <82043364+morph-dev@users.noreply.github.com> Co-authored-by: Nick Gheorghita <nickgheorghita@gmail.com> Co-authored-by: Diva M <divma@protonmail.com>

njgheorghita mentioned this pull request Jul 10, 2023

Handling multiple in-flight requests #198

Closed

njgheorghita marked this pull request as ready for review July 10, 2023 20:25

divagant-martian self-requested a review July 18, 2023 23:20

divagant-martian reviewed Jul 19, 2023

View reviewed changes

njgheorghita force-pushed the experimental branch 5 times, most recently from 15b5ed7 to 9cc6c79 Compare July 19, 2023 21:16

njgheorghita force-pushed the experimental branch 4 times, most recently from 15afbcc to 7ae82df Compare July 20, 2023 14:14

divagant-martian reviewed Jul 25, 2023

View reviewed changes

njgheorghita force-pushed the experimental branch from 5947e0c to 6bece13 Compare July 27, 2023 20:44

ackintosh reviewed Jul 30, 2023

View reviewed changes

src/handler/mod.rs Outdated Show resolved Hide resolved

ackintosh reviewed Jul 31, 2023

View reviewed changes

src/handler/mod.rs Outdated Show resolved Hide resolved

njgheorghita force-pushed the experimental branch from a766c21 to b079c01 Compare August 3, 2023 19:51

ackintosh reviewed Aug 3, 2023

View reviewed changes

src/handler/mod.rs Outdated Show resolved Hide resolved

divagant-martian suggested changes Aug 4, 2023

View reviewed changes

ackintosh reviewed Aug 8, 2023

View reviewed changes

njgheorghita force-pushed the experimental branch from 4c361d9 to 49dde30 Compare August 16, 2023 17:33

ackintosh and others added 8 commits October 16, 2023 17:32

Fix redundant clone

c053ac2

Tweak udp4 port numbers to avoid Address already in use error

b153b14

Add docstring comments to new handler methods

4fdb780

Fix replay

d791fe3

Add doc comment

13bd6fa

Add test for update_packet()

5c84d22

Add doc comment & cargo fmt

f9e75c3

Fix clippy errors

76b88d9

njgheorghita force-pushed the experimental branch from afe7466 to 76b88d9 Compare October 16, 2023 21:33

Merge latest master

742d460

AgeManning approved these changes Nov 29, 2023

View reviewed changes

fmt

f2d345c

njgheorghita and others added 3 commits November 30, 2023 18:02

Fix broken clippy lint

35cfcb3

Merge branch 'master' into experimental

30a8f2c

smmal code cleanup

3e325d6

divagant-martian approved these changes Dec 4, 2023

View reviewed changes

divagant-martian reviewed Dec 4, 2023

View reviewed changes

njgheorghita added 2 commits December 5, 2023 11:09

Replace naked unwrap with expect

9c9cdd4

Log error message instead of panicking, if unable to re-encrypt activ…

84f303a

…e request

divagant-martian merged commit 546e19c into sigp:master Dec 10, 2023
6 checks passed

		.position(\|req\| req.packet().message_nonce() == nonce)
		.expect("to find request call by nonce");

	match self.active_requests_mapping.entry(node_address.to_owned()) {
	match self.active_requests_mapping.entry(node_address.clone()) {

		.remove(request_call.packet().message_nonce());
		Some(request_call)

Add support for concurrent requests to a single peer. #200

Add support for concurrent requests to a single peer. #200

Conversation

njgheorghita commented Jul 10, 2023 • edited Loading

Description

How was it fixed?

Notes & open questions

Change checklist

njgheorghita commented Jul 19, 2023

divagant-martian left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ackintosh commented Jul 30, 2023

njgheorghita commented Aug 3, 2023

ackintosh commented Aug 3, 2023 • edited Loading

divagant-martian left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

njgheorghita commented Aug 21, 2023

ackintosh commented Aug 22, 2023

njgheorghita commented Aug 23, 2023

ackintosh commented Aug 25, 2023

ackintosh commented Oct 18, 2023

AgeManning commented Nov 6, 2023

ackintosh commented Nov 6, 2023

AgeManning commented Nov 29, 2023

AgeManning commented Nov 29, 2023

divagant-martian commented Nov 29, 2023

AgeManning commented Dec 4, 2023

divagant-martian left a comment • edited Loading

Choose a reason for hiding this comment

divagant-martian Dec 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

njgheorghita commented Dec 7, 2023

ackintosh commented Dec 13, 2023

njgheorghita commented Jul 10, 2023 •

edited

Loading

ackintosh commented Aug 3, 2023 •

edited

Loading

divagant-martian left a comment •

edited

Loading

divagant-martian Dec 4, 2023 •

edited

Loading