Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client side routing implementation #205

Merged
merged 11 commits into from
Jan 10, 2025

Conversation

madchicken
Copy link
Contributor

@madchicken madchicken commented Dec 16, 2024

This PR is a draft implementation of the client-side routing for new4j clusters.
The PR implements the ROUTE message for the unstable version of the protocol (v2).

  • Testing: I am currently testing the driver on a real cluster but I have no guarantee of the performance and stability of the routing algorithm
    - [ ] Bookmarks are not supported at the moment (the array is always sent empty)
  • The maximum supported version of the protocol is now 4.4 but we miss some of the features the protocol should expose from this version (the imp_user should be added to the transaction begin message)
  • Only the round-robin algorithm is currently implemented as a routing strategy. Maybe we want to implement a LeastUsed algo and make it configurable. This is considered acceptable for the moment
  • There is no handling of the initial routing context: it is always empty. The initial routing context now contains the URL used to connect to the cluster as other drivers do. We need to add support to parameters passed in the URL as query string
  • The retry mechanism should probably be reviewed with routing: it does not make sense to retry an operation on a failing connection, we should select another server from the cluster instead. - Retry is still in place but we consider as Permanent an error thrown by the pool when getting the connection

Additional Notes:

  • The lock mechanism for refreshing the routing table is sub-optimal (to use an euphemism). We should try to be smarter and use a different method to check for a new routing table.
    - Extra fields in the route message are not implemented yet (part of v4.4) Protocol version is now 4.4
  • I had to modify the way we serialize some fields: in a structure, the skip_field method is now rendering a null value
  • IMPORTANT: this PR includes a workaround fix to the v2 protocol for date fields deserialization that is crashing the driver. We should find a better solution to this.

@knutwalker
Copy link
Collaborator

Hey, thanks for this PR, this helps a lot!

I went over the changes and don't have anything to add, but I also did look in detail at the actual routing protocol and algorithms.

I'm ok with merging something before dealing with all limitations. In particular, I think it's fine to not support the routing context, bookmarks, only have partial 4.3 support, only having one routing strategy at the beginning. We don't need to have everything figured out upfront and can do this in steps/multiple PRs.
I would also be ok with only adding it to the new protocol implementation.

I've been thinking about the testing a bit (or a lot) lately; I don't like the amount of integration tests compared to what the unit tests do. Especially when testing bolt protocol details of a non-happy path, it can be either impossible or very difficult to setup.
The current setup is fine to actually run something against a one-and-done database, but it doesn't scale as well as I would like it to be.

I have half an idea to rewrite the core parts with the connection IO and pooling being abstracted out. I would like to be able to do more simulation testing where exact IO operations can be simulated in the tests without having to have an actual db to setup. I'd also like to move async out from the core to some outer layer. Maybe this goes in the sans-io direction, maybe it'll look a bit different.

@madchicken
Copy link
Contributor Author

Thanks for the reply @knutwalker, I will remove the implementation of the client-side routing for the stable protocol and leave it only when the unstable-bolt-protocol-impl-v2 feature is active. Now that I have the workaround for the dates deserialization bug I can do some more testing. One thing I need to understand is this: when selecting the server in the routing table, should I use the bolt protocol or stick with neo4j for the final connection? I think this is very important for the server, isn't it?

@knutwalker
Copy link
Collaborator

I will remove the implementation of the client-side routing for the stable protocol and leave it only when the unstable-bolt-protocol-impl-v2 feature is active.

great!

One thing I need to understand is this: when selecting the server in the routing table, should I use the bolt protocol or stick with neo4j for the final connection? I think this is very important for the server, isn't it?

There shouldn't be a need for any scheme. The first connection uses the scheme to determine the initial routing and the encryption to use. When connecting to a server from the routing table, you use the same encryption settings (and, when supported, routing context) and directly connect to that socket address, without going through the bolt/neo4j scheme parsing again. That is what the product drivers do, at least. That looks like this requires some changes to the pool creation, essentially using the same ConnectionInfo, but with only the host/port replaced.

@madchicken
Copy link
Contributor Author

There shouldn't be a need for any scheme. The first connection uses the scheme to determine the initial routing and the encryption to use. When connecting to a server from the routing table, you use the same encryption settings (and, when supported, routing context) and directly connect to that socket address, without going through the bolt/neo4j scheme parsing again. That is what the product drivers do, at least. That looks like this requires some changes to the pool creation, essentially using the same ConnectionInfo, but with only the host/port replaced.

Yes, sorry...I realized that too.

@madchicken
Copy link
Contributor Author

@knutwalker please take a look at my last commit where I had to modify the main Graph interface. I had essentially to expose the Operation::Read/Write parameter only for the new protocol implementation. I hope it makes sense.
I also added a note in the description of the PR about the retry mechanism.

@madchicken
Copy link
Contributor Author

@knutwalker I have finished my work on this first implementation of the client-side routing. I tested it in a simple local cluster and I am now running it on a k8s cluster and it seems to work properly. I still don't have any data about the performance impact regarding the locking mechanism implemented on the routing table the driver holds after the initialization.

@madchicken madchicken marked this pull request as ready for review January 5, 2025 17:23
@madchicken madchicken force-pushed the client-side-routing branch from 36316ae to 2651092 Compare January 6, 2025 16:31
@knutwalker
Copy link
Collaborator

4.4 IT failure could be related to #209

@madchicken madchicken force-pushed the client-side-routing branch from 8c7afc4 to 9c449c6 Compare January 8, 2025 17:05
Copy link
Collaborator

@knutwalker knutwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very good, thanks for adding all this. I do have few comments I'd like to be resolved before merging in, but it isn't anything bigger (I hope) and some are more ideas and thoughts than suggestions.

lib/src/bolt/request/route.rs Outdated Show resolved Hide resolved
lib/src/bolt/request/route.rs Outdated Show resolved Hide resolved
lib/src/bolt/request/route.rs Outdated Show resolved Hide resolved
lib/src/bolt/request/route.rs Show resolved Hide resolved
lib/src/config.rs Outdated Show resolved Hide resolved
lib/src/routing/mod.rs Show resolved Hide resolved
lib/src/routing/routed_connection_manager.rs Show resolved Hide resolved
lib/src/txn.rs Outdated Show resolved Hide resolved
lib/tests/use_default_db.rs Outdated Show resolved Hide resolved
lib/tests/use_default_db.rs Outdated Show resolved Hide resolved
 - Use round-robin balancing
 - Store routing table in a Dashmap to allow fast concurrency
 - Perform the update of the routing table using a simple mutex on the TTL of the table itself
 - read/write mode is propagated into the RUN command
 - Bolt scheme in pools to allow direct connections with no routing to provided servers
@madchicken madchicken force-pushed the client-side-routing branch from 9c449c6 to cae0599 Compare January 9, 2025 09:49
@madchicken madchicken requested a review from knutwalker January 9, 2025 11:18
@knutwalker
Copy link
Collaborator

knutwalker commented Jan 9, 2025 via email

@knutwalker
Copy link
Collaborator

knutwalker commented Jan 9, 2025 via email

Copy link
Collaborator

@knutwalker knutwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks!

@knutwalker
Copy link
Collaborator

The 4.4 test failures are persistent enough that they might actually be related to this change 🤔

}

impl Version {
pub fn add_supported_versions(bytes: &mut BytesMut) {
bytes.reserve(16);
bytes.put_u32(0x0404); // V4_4
bytes.put_u32(0x0304); // V4_3
bytes.put_u32(0x0104); // V4_1
bytes.put_u32(0x0004); // V4
bytes.put_u32(0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to remove the two put_u32(0) calls. The initial message only expected 4 versions. I guess on 5.x, the server drains every additional bytes until the next message, on 4.4 those 0 values are part of the next message, effectively sending a few empty chunks.

@knutwalker knutwalker merged commit a202b3f into neo4j-labs:main Jan 10, 2025
11 checks passed
@madchicken madchicken deleted the client-side-routing branch January 11, 2025 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants