SNOW-1161484: Use insertRows instead of insertRow for schematization #796

Closed · wants to merge 7 commits

Conversation

@sfc-gh-tzhang (Contributor) commented Mar 1, 2024

Overview

SNOW-1161484

  • Create the channel with SKIP_BATCH instead of CONTINUE, to avoid the case where KC crashes right after the good rows are added to the table and the bad rows never reach the DLQ
  • Update the schematization code to use insertRows instead of insertRow (see the sketch after this list)
    - Better performance
    - If KC is configured with a longer buffer flush time, everything is ingested as one batch, so the SDK's 1-second flush and 32 MB channel size limits won't apply
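
A minimal sketch of the resulting flow, based on the snowflake-ingest-java streaming API; the wrapper method, channel and table names, and offset-token handling are illustrative, not the connector's actual code:

import java.util.List;
import java.util.Map;
import net.snowflake.ingest.streaming.InsertValidationResponse;
import net.snowflake.ingest.streaming.OpenChannelRequest;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClient;

// Illustrative sketch: open a channel with SKIP_BATCH and ingest the whole
// buffer in a single insertRows call instead of one insertRow per record.
static InsertValidationResponse ingestBatch(
    SnowflakeStreamingIngestClient client, List<Map<String, Object>> rows, String offsetToken) {
  OpenChannelRequest request =
      OpenChannelRequest.builder("MY_CHANNEL") // illustrative channel name
          .setDBName("MY_DB")
          .setSchemaName("MY_SCHEMA")
          .setTableName("MY_TABLE")
          // SKIP_BATCH: if any row fails validation, the whole batch is
          // skipped, so no good rows are committed before the bad rows
          // have been routed to the DLQ.
          .setOnErrorOption(OpenChannelRequest.OnErrorOption.SKIP_BATCH)
          .build();
  SnowflakeStreamingIngestChannel channel = client.openChannel(request);
  InsertValidationResponse response = channel.insertRows(rows, offsetToken);
  if (response.hasErrors()) {
    // Rebuild the batch without the failed rows, report those rows to the
    // DLQ, and retry the remaining rows (see rebuildBufferWithoutErrorRows below).
  }
  return response;
}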

Pre-review checklist

  • This change should be part of a Behavior Change Release. See go/behavior-change.
  • This change has passed Merge gate tests
  • Snowpipe Changes
  • Snowpipe Streaming Changes
  • This change is TEST-ONLY
  • This change is README/Javadocs only
  • This change is protected by a config parameter <PARAMETER_NAME>, e.g. snowflake.ingestion.method.
    • Yes - Added end-to-end and unit tests.
    • No - Suggest why it is not param protected
  • Is this change protected by parameter <PARAMETER_NAME> on the server side?
    • The parameter/feature is not yet active in production (partial rollout or PrPr, see Changes for Unreleased Features and Fixes).
    • If there is an issue, it can be safely mitigated by turning the parameter off. This is also verified by a test (See go/ppp).

@sfc-gh-tzhang sfc-gh-tzhang marked this pull request as ready for review March 1, 2024 07:47
@@ -1059,7 +1059,7 @@ private SnowflakeStreamingIngestChannel openChannelForTable() {
         .setDBName(this.sfConnectorConfig.get(Utils.SF_DATABASE))
         .setSchemaName(this.sfConnectorConfig.get(Utils.SF_SCHEMA))
         .setTableName(this.tableName)
-        .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
+        .setOnErrorOption(OpenChannelRequest.OnErrorOption.SKIP_BATCH)
Collaborator

This will be a BCR?

Contributor Author

I don't think so; there is no behavior difference as far as the customer is concerned.

@sfc-gh-japatel (Collaborator) commented Mar 7, 2024

Could you consider using the checklist in the PR? Let me know if you want to change something in the template.

Also, is SKIP_BATCH needed for schematization, or is it needed in general? If it's needed in general, I would like it to be separated from this PR.

@sfc-gh-xhuang (Collaborator)

I would merge this into the next release if it is ready, but it's up to you.

Comment on lines +588 to +602
  private StreamingBuffer rebuildBufferWithoutErrorRows(
      StreamingBuffer streamingBufferToInsert,
      List<InsertValidationResponse.InsertError> insertErrors) {
    StreamingBuffer buffer = new StreamingBuffer();
    // Single cursor over insertErrors, which arrive ordered by row index.
    int errorIdx = 0;
    for (long rowIdx = 0; rowIdx < streamingBufferToInsert.getNumOfRecords(); rowIdx++) {
      if (errorIdx < insertErrors.size() && rowIdx == insertErrors.get(errorIdx).getRowIndex()) {
        // Skip the row that failed validation; it is reported to the DLQ elsewhere.
        errorIdx++;
      } else {
        buffer.insert(streamingBufferToInsert.getSinkRecord(rowIdx));
      }
    }
    return buffer;
  }
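
A unit test for this method could look like the sketch below; it assumes test-level access to StreamingBuffer and rebuildBufferWithoutErrorRows, and the SinkRecord contents are irrelevant to the skip logic:

import java.util.Arrays;
import java.util.List;
import net.snowflake.ingest.streaming.InsertValidationResponse;
import org.apache.kafka.connect.sink.SinkRecord;
import org.junit.Assert;
import org.junit.Test;

@Test
public void rebuildBufferDropsOnlyErrorRows() {
  StreamingBuffer original = new StreamingBuffer();
  for (int i = 0; i < 5; i++) {
    // Minimal records; schemas and keys don't matter for the skip logic.
    original.insert(new SinkRecord("topic", 0, null, null, null, "value" + i, i));
  }
  // Rows 1 and 3 failed validation; errors are ordered by row index.
  List<InsertValidationResponse.InsertError> errors =
      Arrays.asList(
          new InsertValidationResponse.InsertError(null, 1),
          new InsertValidationResponse.InsertError(null, 3));

  StreamingBuffer rebuilt = rebuildBufferWithoutErrorRows(original, errors);

  // 5 original rows minus 2 error rows.
  Assert.assertEquals(3, rebuilt.getNumOfRecords());
}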

Collaborator

Will this result in the validation error you added recently?

@@ -576,6 +584,22 @@ InsertRowsResponse insertBufferedRecords(StreamingBuffer streamingBufferToInsert
    return response;
  }

  /** Builds a new buffer containing only the good rows from the original buffer */
  private StreamingBuffer rebuildBufferWithoutErrorRows(
Collaborator

Is there test coverage for this new method?

Assert.assertTrue(topicPartitionChannel.isPartitionBufferEmpty());
Assert.assertEquals(0, kafkaRecordErrorReporter.getReportedRecords().size());

// Do it again without any schematization error, and we should have row in DLQ
@sfc-gh-japatel (Collaborator) commented Mar 8, 2024

I am confused by the comment; do you mean we should not have a row in the DLQ if there is no schematization error?

Or do you mean to verify the DLQ results with something along these lines: Assert.assertEquals(>1, kafkaRecordErrorReporter.getReportedRecords().size());?
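
As written, the >1 form is pseudocode; a compilable JUnit version of that check would be something like the line below (asserting at least one reported record):

Assert.assertTrue(kafkaRecordErrorReporter.getReportedRecords().size() >= 1);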

@sfc-gh-japatel (Collaborator) left a comment

I am worried about this change not having enough tests; do you feel the existing tests are enough?

Also, I highly encourage filling out the PR checklist, at least for important changes. Thanks!

@sfc-gh-japatel (Collaborator)

> Could you consider using the checklist in the PR? Let me know if you want to change something in the template.
>
> Also, is SKIP_BATCH needed for schematization, or is it needed in general? If it's needed in general, I would like it to be separated from this PR.

Talked offline; we don't need to split the PRs. SKIP_BATCH is needed and there is no behavior change.

@sfc-gh-japatel (Collaborator) left a comment

Left a couple of comments; I don't have any more concerns and will approve in the next iteration! Thank you, this will lower schematization latency 🥇

@sfc-gh-xhuang (Collaborator)

@sfc-gh-rcheng @sfc-gh-tzhang this should be merged for 2.2.2 release

@sfc-gh-rcheng (Collaborator) left a comment

I'm also a little concerned about this big a behavior change going in without additional tests, but it looks like @sfc-gh-japatel discussed this with Toby offline.

@sfc-gh-xhuang (Collaborator) commented May 10, 2024

With the latest ingest-sdk changes merged, we should be able to update the ingest-sdk version and merge this PR?
#843

@sfc-gh-xhuang (Collaborator)

@sfc-gh-tzhang the ingest-sdk update to 2.1.1 has been merged. Can this be merged now too?

Comment on lines -682 to -690
        if (extraColNames == null && nonNullableColumns == null) {
          InsertValidationResponse.InsertError newInsertError =
              new InsertValidationResponse.InsertError(
                  insertError.getRowContent(), originalSinkRecordIdx);
          newInsertError.setException(insertError.getException());
          newInsertError.setExtraColNames(insertError.getExtraColNames());
          newInsertError.setMissingNotNullColNames(insertError.getMissingNotNullColNames());
          // Simply added to the final response if it's not schema related errors
          finalResponse.addError(insertError);
Contributor

@sfc-gh-tzhang why did you remove this part? IMO we still need it to properly add the rows to the rebuilt buffer and also send only the non-schema errors to the DLQ.
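
For context, a hedged reconstruction of the intent behind the removed branch; the surrounding loop, response, and finalResponse are assumed from the original file and are not shown in this diff:

for (InsertValidationResponse.InsertError insertError : response.getInsertErrors()) {
  List<String> extraColNames = insertError.getExtraColNames();
  List<String> nonNullableColumns = insertError.getMissingNotNullColNames();
  if (extraColNames == null && nonNullableColumns == null) {
    // Not schema-related: add it to the final response so the row is routed
    // to the DLQ instead of triggering schema evolution.
    finalResponse.addError(insertError);
  } else {
    // Schema-related: evolve the table schema and retry the row.
  }
}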

@sfc-gh-wtrefon (Contributor)

Closed in favor of #866 due to merge conflicts
