Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DBZ-6723 Allow specifying partitions for EventHub #51
DBZ-6723 Allow specifying partitions for EventHub #51
Changes from 1 commit
cef02da
45b7118
a9089ac
6c117f6
9b91da4
2ac80d8
4085b86
5c3f388
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not going to work. The
markProcessed
should be called sequentially for the records are they are made available from the input. There must not be gaps or change in order. IMHO this is not guaranteed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any advice on how to address this? When stepping through the code I see that the risk of non-sequentially calling markProcessed is that the wrong offset is ultimately written to the offsetWriter, and those offsets are only committed when
markBatchFinished()
is called.The first thing that comes to mind would be to called
committer.markProcessed(record)
as soon as it's scheduled in an EventDataBatch, as that is guaranteed to be in order. That seems to have the risk that that batch might fail to be emitted, while other batches to other EventHub partitions might succeed. That situation would result in an exception being thrown fromhandleBatch()
, meaning thatmarkBatchFinished()
would not be called. Is that sufficient to make sure the right offsets are being stored? A retry in that case would probably be fine, as that would lead to at-least-once delivery, whereas in the current situation writing out the wrong offset could lead to data-loss in case of retries?@jpechane Could you acknowledge that my train of thought here is correct? If so I'll make the change to markProcessed once records are added to a batch. If I'm wrong I'd hope you have an alternative suggestion on how to resolve this, without having to emit the batches to EventHub in serial order based on partition routing of incoming records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slknijnenburg Hi, there are generally two solutions
markBatchFinished()
. It brings larger number of duplicates in case of connector crash but it is simple and easy to implementI personally would recommend to implement option 1) to conclude the PR and handle option 2) as a follow-up (if needed)