materialize-kafka: new connector #2196
Open
williamhbaker wants to merge 11 commits into main from wb/materialize-kafka
+5,364 −0
Conversation
An initial working version of the Kafka materialization, which materializes documents as Avro messages.
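For illustration, here is a minimal sketch of single-datum Avro encoding using the apache_avro crate; the connector's own schema generation is not shown, and the schema and field names here are made up. With a schema registry in play, each payload would additionally be framed per the Confluent wire format (a zero magic byte plus the 4-byte schema ID) ahead of the Avro bytes.

```rust
use apache_avro::{to_avro_datum, types::Record, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Toy schema standing in for one generated from a collection's key
    // and value fields.
    let schema = Schema::parse_str(
        r#"{"type": "record", "name": "doc",
            "fields": [{"name": "id", "type": "string"}]}"#,
    )?;

    let mut record = Record::new(&schema).expect("schema is a record");
    record.put("id", "some-key");

    // Binary-encode the single datum (no Avro object container framing).
    let bytes = to_avro_datum(&schema, record)?;
    println!("encoded {} bytes", bytes.len());
    Ok(())
}
```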
Override the default behavior of emitting librdkafka errors as ERROR logs, and log them at debug level instead. If there's an actually fatal error, the connector will encounter it through its normal processing and error out with the typical logging.
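A minimal sketch of the pattern using rdkafka's `ClientContext` hooks; the `QuietContext` name and the `tracing` calls are illustrative, not the connector's actual code:

```rust
use rdkafka::client::ClientContext;
use rdkafka::config::RDKafkaLogLevel;
use rdkafka::error::KafkaError;

struct QuietContext;

impl ClientContext for QuietContext {
    // librdkafka log lines arrive here; emit them at debug regardless of
    // the severity librdkafka assigned.
    fn log(&self, _level: RDKafkaLogLevel, fac: &str, log_message: &str) {
        tracing::debug!(fac, "librdkafka: {}", log_message);
    }

    // The default implementation logs these at ERROR; downgrade to debug,
    // since truly fatal errors surface through the normal produce/consume
    // result paths anyway.
    fn error(&self, error: KafkaError, reason: &str) {
        tracing::debug!(%error, reason, "librdkafka error");
    }
}
```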
Adds an option for messages to be encoded as JSON if a schema registry isn't available, or just if JSON encoding is preferred.
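Hypothetically, the encoding option could be modeled as a simple enum like the sketch below; the names are made up, and the real spec is captured in the snapshot tests:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical message encoding selector: Avro requires a schema
/// registry, while JSON works with or without one.
#[derive(Serialize, Deserialize, Debug, Clone, Copy)]
#[serde(rename_all = "UPPERCASE")]
enum MessageFormat {
    Avro,
    Json,
}
```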
Adds snapshot tests for the `Spec` command and also some simple verification of materialized documents.
Adds the Dockerfile and CI build steps for materialize-kafka. I ended up not trying to share code from source-kafka for now, since it was only needed in the tests, and have instead copy/pasted the Avro-to-JSON translation logic directly into the materialize-kafka tests file.
The synthetic flow_published_at field is recommended by default in all other materializations, so we'll recommend it here as well. The concept of a "synthetic" field is somewhat interesting when considering validation constraints, but for current practical purposes the only ones to be concerned with are flow_published_at and flow_document.
It may be helpful to have a little more logged information about checkpoint recovery, particularly to get an idea of how much compaction is done on the checkpoints topic over time.
…ints

The higher-level FutureProducer posed some performance challenges, so the lower-level ThreadedProducer has been swapped in. The intended usage of the FutureProducer seems to be to create futures for all of your messages and then await each of them individually. That is mostly a memory problem, since we'd have to create a future for every message in a transaction, or implement some more complicated system of backpressure. The ThreadedProducer works more directly with the producer queue, and we apply simple backpressure via a `poll` call to the producer when that queue is full. This works well in my own testing, with a good amount of throughput.

The checkpoint persistence and recovery code has been ripped out, for two reasons. First, the ThreadedProducer is a little more complicated to use (particularly in async handling), and I didn't particularly feel like figuring out how to make it work the way the prior FutureProducer implementation did. Second, Kafka has a concept of a "transaction timeout", the maximum length of time a transaction can remain open. There is always going to be some timeout, even if it's on the order of tens of minutes, and that is obviously problematic given that materializations need to handle unbounded transaction sizes; we were always going to need some kind of "non-transactional" mode of operation featuring at-least-once message delivery. This initial version of the materialization connector will only have that at-least-once mode, and we can revisit exactly-once later if there is demand for it.
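A sketch of that backpressure loop under those assumptions; the helper name is made up, and the delivery-callback context and broader error handling are elided:

```rust
use std::time::Duration;

use rdkafka::error::KafkaError;
use rdkafka::producer::{BaseRecord, DefaultProducerContext, ThreadedProducer};
use rdkafka::types::RDKafkaErrorCode;

// Hypothetical helper: enqueue one message, blocking briefly when the
// producer queue is full instead of buffering unbounded futures.
fn send_with_backpressure(
    producer: &ThreadedProducer<DefaultProducerContext>,
    topic: &str,
    key: &[u8],
    payload: &[u8],
) -> Result<(), KafkaError> {
    let mut record = BaseRecord::to(topic).key(key).payload(payload);
    loop {
        match producer.send(record) {
            Ok(()) => return Ok(()),
            // Queue full: poll to let delivery callbacks drain the queue,
            // then retry with the same record.
            Err((KafkaError::MessageProduction(RDKafkaErrorCode::QueueFull), rec)) => {
                producer.poll(Duration::from_millis(100));
                record = rec;
            }
            Err((err, _)) => return Err(err),
        }
    }
}
```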
williamhbaker force-pushed the wb/materialize-kafka branch from a728c2d to d67cde6 on December 11, 2024 at 16:35
Connecting to AWS MSK with the Rust Kafka library we are using requires the client to call "poll" so that an auth token is generated. But we need to use an admin client to create topics during the materialization `Apply`, and I could not find any way to call "poll" with the admin client to generate a token. It's possible that I'm missing something here, but I tried pretty hard to make it work. If it turns out that anybody wants to use this connector with MSK, we can revisit this later. For now I think it's best to remove MSK as an option, since it isn't going to work for most users unless they somehow pre-create all of the topics the connector needs, which is the only other workaround I can think of.
Take the selected key fields as-is, without unescaping ~0 or ~1, when building the slice of key pointers used to create the Avro schema. This fixes some inconsistencies in key serialization when key field names contain literal slashes.
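For reference, a sketch of RFC 6901's unescaping rules, which apply when evaluating pointer tokens but should not be re-applied to field names that are already plain strings (the function name is just for illustration):

```rust
/// RFC 6901: within a JSON Pointer token, "~1" encodes "/" and "~0"
/// encodes "~". The order of the replacements matters: "~1" must be
/// decoded first so that "~01" correctly decodes to "~1", not "/".
fn unescape_pointer_token(token: &str) -> String {
    token.replace("~1", "/").replace("~0", "~")
}

fn main() {
    // A field literally named "a/b" appears in a pointer token as "a~1b".
    assert_eq!(unescape_pointer_token("a~1b"), "a/b");
    assert_eq!(unescape_pointer_token("a~01b"), "a~1b");
}
```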
williamhbaker force-pushed the wb/materialize-kafka branch from d67cde6 to 70e0ea4 on December 11, 2024 at 20:17
williamhbaker force-pushed the wb/materialize-kafka branch from 70e0ea4 to 750aa09 on December 11, 2024 at 20:37
Description:
New connector for materializing data to Kafka topics. It can encode messages as JSON, or as Avro when a schema registry is configured.
There were initial attempts at an exactly-once mode of operation, in which Kafka producer transactions were used and checkpoints were written to a checkpoints topic that could later be recovered. This was removed as a simplification, and because of complications that would likely arise from Kafka's transaction timeouts, which limit how long a transaction can stay open. An at-least-once, non-transactional mode of operation will be needed regardless because of that timeout, so the connector is initially being released with only that capability; we can add transaction support later if there is an actual need for it.
Field selection works a little differently for this connector than for, say, SQL materializations: here, all top-level fields are recommended by default, so the default path gives users their entire document materialized nicely without having to do anything special. It is also possible to de-select top-level fields, or to include additional nested fields if that is needed for some reason.
Closes #2138
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
I will write new documentation for the connector along with merging this PR.
Notes for reviewers:
(anything that might help someone review this PR)