feat: share kafka client on meta #19058
Conversation
Signed-off-by: tabVersion <tabvision@bupt.icu>
…ngwavelabs/risingwave into tab/share-kafka-client-enum
// drop the guard and acquire a new one to avoid a 10s blocking call
drop(shared_client_guard);
Will this happen?
- Caller A tries to get a client for a connection, but the cache misses, so it takes some time for caller A to build the client.
- In the meantime, caller B tries to get a client for the same connection; the cache also misses, so it builds a client as well.
- Caller A builds the client, inserts it into the map, and sets ref_count = 1.
- One second later, caller B also inserts its client into the map and sets ref_count = 1, overwriting the entry, so caller A's client leaks and will never be dropped.
I would recommend using moka to replace the HashMap:
use moka::future::Cache;
In particular, a caching structure should handle these concurrent gets correctly by letting caller B block until caller A completes its operation and inserts the cached item (i.e. the Kafka client) back.
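For illustration, a minimal sketch (not from this PR) of how moka's future::Cache::get_with coalesces concurrent gets for the same key, assuming moka with the "future" feature and tokio; FakeClient is a placeholder for the real rdkafka-based client:

```rust
use std::sync::Arc;

use moka::future::Cache;

// Placeholder for the real rdkafka-based client.
struct FakeClient {
    #[allow(dead_code)]
    brokers: String,
}

#[tokio::main]
async fn main() {
    let cache: Cache<String, Arc<FakeClient>> = Cache::new(100);

    // Two concurrent callers ask for the same broker address. moka runs only one
    // of the init futures; the other caller waits and receives the cached value,
    // so only a single client is ever built for this key.
    let key = "broker-1:9092".to_string();
    let (a, b) = tokio::join!(
        cache.get_with(key.clone(), async {
            Arc::new(FakeClient { brokers: "broker-1:9092".into() })
        }),
        cache.get_with(key.clone(), async {
            Arc::new(FakeClient { brokers: "broker-1:9092".into() })
        }),
    );

    // Both callers end up sharing the same client instance.
    assert!(Arc::ptr_eq(&a, &b));
}
```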
Could you please also run some tests and describe the improvements in the PR description?
At least sth like "Manually tested that the number of threads is reduced from xxx to yyy for zzz Kafka sources." Although the idea might be clear, the implementation is not that trivial, so we should verify it works. Besides, the background of the problem should also be mentioned.
Cargo.lock
Outdated
[[patch.unused]]
name = "quanta"
version = "0.11.0"
source = "git+https://github.com/madsim-rs/quanta.git?rev=948bdc3#948bdc3d4cd3fcfe3d52d03dd83deee96d97d770"
I guess this might break madsim
Is there a way to fix it? Directly removing the part does not help.
After removing the following from Cargo.toml, in `[patch.crates-io]`:
quanta = { git = "https://github.com/madsim-rs/quanta.git", rev = "948bdc3" }
the [[patch.unused]] section disappears, but the tests are still stuck:
Running [ 00:01:43] [=============================================================================> ] 533/1167: 12 running, 533 passed (4 slow), 22 skipped
Canceling due to interrupt: 12 tests still running
Canceling [ 00:01:43] [=============================================================================> ] 533/1167: 12 running, 533 passed (4 slow), 22 skipped
SIGINT [ 103.327s] risingwave_batch executor::generic_exchange::tests::test_exchange_multiple_sources
SIGINT [ 102.891s] risingwave_batch executor::merge_sort_exchange::tests::test_exchange_multiple_sources
SIGINT [ 102.802s] risingwave_batch executor::tests::test_clone_for_plan
SIGINT [ 102.726s] risingwave_batch task::task_manager::tests::test_task_id_conflict
SIGINT [ 83.848s] risingwave_frontend binder::bind_param::test::subquery
SIGINT [ 83.893s] risingwave_frontend binder::bind_param::test::cast_after_specific
SIGINT [ 83.899s] risingwave_frontend binder::bind_param::test::basic_value
SIGINT [ 83.887s] risingwave_frontend binder::bind_param::test::default_type
SIGINT [ 83.903s] risingwave_frontend binder::bind_param::test::basic_select
SIGINT [ 83.042s] risingwave_frontend binder::expr::value::tests::test_bind_radix
SIGINT [ 83.886s] risingwave_frontend binder::bind_param::test::infer_case
SIGINT [ 83.498s] risingwave_frontend binder::expr::value::tests::test_bind_interval
------------
The dependency on quanta is introduced by moka.
You should either revert to moka 0.12.0, or upgrade the madsim quanta to 0.12
Related: madsim-rs/quanta#2
#[derive(Debug, Clone, Deserialize, WithOptions)]
pub struct KafkaCommon {
#[derive(Debug, Clone, Deserialize, WithOptions, PartialEq, Hash, Eq)]
pub struct KafkaConnection {
It seems we will be able to use this same struct for CREATE CONNECTION later? 🤔
yes.
Ok(item_val) => {
    let share_item = SharedKafkaItem {
        client: item_val.client.clone(),
        ref_count: item_val.ref_count - 1,
I don't get why we need ref_count. Can we just use sth like get_with?
Then how do we manage the RdKafka client instance if all related sources are dropped? IIUC, if we remove the ref_count, the client instance will be kept in memory until the meta node restarts; there is no eviction policy enabled for the cache.
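To make the ref_count idea concrete, a rough sketch of the release path it enables: decrement on drop, evict when the count reaches zero. This assumes moka's 0.12.3+ entry API (and_compute_with and moka::ops::compute::Op); SharedKafkaItem mirrors the struct in this diff, while KafkaClientStub and the key type are placeholders:

```rust
use std::sync::Arc;

use moka::future::Cache;
use moka::ops::compute::Op;

// Placeholder for the real rdkafka-based client.
pub struct KafkaClientStub;

// Mirrors the SharedKafkaItem in this PR: the shared client plus a reference count.
#[derive(Clone)]
pub struct SharedKafkaItem {
    pub client: Arc<KafkaClientStub>,
    pub ref_count: i32,
}

/// Called when a source using this connection is dropped: decrement the
/// reference count, and remove the entry once no source uses it anymore,
/// so the client itself can finally be deallocated.
pub async fn release_client(cache: &Cache<String, SharedKafkaItem>, brokers: &str) {
    let _ = cache
        .entry(brokers.to_string())
        .and_compute_with(|maybe| async move {
            match maybe {
                // Last user gone: evict the entry so the client gets dropped.
                Some(entry) if entry.value().ref_count <= 1 => Op::Remove,
                // Other sources still share the client: just decrement the count.
                Some(entry) => {
                    let prev = entry.into_value();
                    Op::Put(SharedKafkaItem {
                        client: prev.client.clone(),
                        ref_count: prev.ref_count - 1,
                    })
                }
                // Nothing cached (e.g. already removed): nothing to do.
                None => Op::Nop,
            }
        })
        .await;
}
```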
As mentioned by Bugen, I think we can store Weak in the cache.
Not sure if Weak fits well with moka. We may also try dashmap or weak-table.
@BugenZhao Why do you think Weak might not work with moka? Actually I'm also thinking what's the difference between moka (cache) and dashmap (concurrent hashmap).
Actually I'm also thinking what's the difference between moka (cache) and dashmap (concurrent hashmap)
I think moka is a dashmap with an eviction policy plus a guarantee that updates are done atomically.
I think dashmap behaves similarly to moka::sync::Cache with unlimited capacity. 🤔

Why do you think Weak might not work with moka?

Because the interface does not seem to be that compatible with storing a Weak: there is no auto-eviction, and it requires the caller to keep the strong reference:
let mut client_arc: Option<Arc<KafkaClientType>> = None;
requires the caller to keep the strong reference
Isn't this expected usage? i.e., store Weak in the map, while storing Arc in the caller. Otherwise who keeps the Arc? 👀
I mean when inserting. 🤣 For example, weak-table allows you to pass a closure returning an Arc in insert_with while actually storing a Weak but returning an Arc back to the caller. With moka we need to temporarily hold the strong reference to prevent it from being deallocated.
Tested locally against a Kafka env.
command:
on main (a176ace): 1573 threads
Would you mind testing Kafka with multiple brokers, where we might see a larger difference?
Tested with Confluent with multiple AZs.
testing script
EC2 idle: 71
Why not test with more sources? Creating 5 sources at once seems to be the SDK's maximum; I got this error afterward:
a little weird but irrelevant to the issue.
Rest LGTM
@@ -71,7 +71,7 @@ jni = { version = "0.21.1", features = ["invocation"] }
 jsonbb = { workspace = true }
 jsonwebtoken = "9.2.0"
 maplit = "1.0.2"
-moka = { version = "0.12.0", features = ["future"] }
+moka = { version = "0.12.8", features = ["future"] }
Can we still use 0.12.0 in this PR, so that there are no changes in Cargo.lock and the bump of quanta can be reviewed in a separate PR?
I am afraid not. We are using and_try_compute_with, which is unavailable in v0.12.0.
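For context, a rough sketch of why the entry-level compute API matters here: it turns "get or create, then bump ref_count" into one atomic operation, avoiding the double-insert leak discussed above. This assumes moka 0.12.3+'s and_try_compute_with and moka::ops::compute::Op; SharedKafkaItem mirrors the struct in this PR, while build_client and KafkaClientStub are placeholders:

```rust
use std::sync::Arc;

use anyhow::Result;
use moka::future::Cache;
use moka::ops::compute::Op;

// Placeholder for the real rdkafka-based client.
pub struct KafkaClientStub;

// Mirrors the SharedKafkaItem in this PR.
#[derive(Clone)]
pub struct SharedKafkaItem {
    pub client: Arc<KafkaClientStub>,
    pub ref_count: i32,
}

// Placeholder for the real (slow, fallible) client construction.
async fn build_client(_brokers: &str) -> Result<KafkaClientStub> {
    Ok(KafkaClientStub)
}

/// "Get or create, then bump ref_count" as one atomic entry operation:
/// two concurrent callers for the same key cannot both observe a miss
/// and insert independent clients.
pub async fn get_client(
    cache: &Cache<String, SharedKafkaItem>,
    brokers: &str,
) -> Result<Arc<KafkaClientStub>> {
    let brokers_owned = brokers.to_string();
    cache
        .entry(brokers.to_string())
        .and_try_compute_with(move |maybe| async move {
            let item = match maybe {
                // Hit: reuse the cached client and increment the count.
                Some(entry) => {
                    let prev = entry.into_value();
                    SharedKafkaItem {
                        client: prev.client.clone(),
                        ref_count: prev.ref_count + 1,
                    }
                }
                // Miss: build a new client; an error aborts without inserting.
                None => SharedKafkaItem {
                    client: Arc::new(build_client(&brokers_owned).await?),
                    ref_count: 1,
                },
            };
            Ok::<_, anyhow::Error>(Op::Put(item))
        })
        .await?;

    // We always returned Op::Put above, so the entry exists now. (A real
    // implementation would read the value out of the returned CompResult
    // instead of doing this follow-up get.)
    Ok(cache.get(brokers).await.expect("just inserted").client)
}
```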
Would you try 0.12.3?
I think it works
You may want to check out the Cargo.lock on the main branch to downgrade the locked version of quanta.
You're right. 😢 In its repo, version 0.12.3 was released before bumping the dependency on quanta, but on crates.io it's the other way around.
Cargo.toml
Outdated
@@ -343,7 +339,7 @@ opt-level = 2
 [patch.crates-io]
 # Patch third-party crates for deterministic simulation.
-quanta = { git = "https://github.com/madsim-rs/quanta.git", rev = "948bdc3" }
+quanta = { git = "https://github.com/tabVersion/quanta.git", rev = "bb6c780894d06c0ec3f487d58c72920665b5cb0a" }
We may contribute to madsim-rs/quanta.
let's merge madsim-rs/quanta#2 first and we can switch back to madsim.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
As the title says: reuse the client if the broker addr is the same, to reduce the number of connections to the broker.
The key changes in this PR revolve around optimizing Kafka client management by introducing connection pooling. Here are the main changes:
Introduction of SHARED_KAFKA_CLIENT:
The main motivations appear to be:
Checklist
./risedev check (or alias, ./risedev c)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.