source-redshift-batch: Add the 'use_schema_inference' feature flag #2187

Merged
willdonnelly merged 4 commits into main from wgd/redshift-schema-inference-flag-20241206
Dec 6, 2024

Conversation

willdonnelly
Member

@willdonnelly willdonnelly commented Dec 6, 2024

Description:

This PR introduces the `/advanced/feature_flags` config setting to the Redshift batch capture and adds a helper function in `source-boilerplate` for parsing those flags, with default values and support for `no_` prefixes. It then uses that feature flag infrastructure to add a new flag named `use_schema_inference`. When set, discovered collection schemas have `x-infer-schema: true` set so that they use the inferred schema in conjunction with the discovered schema.
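To illustrate the parsing behavior described above, here is a minimal sketch of a flag parser with defaults and `no_` prefix handling. It is written in Python for brevity and is not the helper added to `source-boilerplate`; the function name `parse_feature_flags`, its signature, and the `DEFAULTS` map are assumptions made for this example.

```python
def parse_feature_flags(flags: str, defaults: dict[str, bool]) -> dict[str, bool]:
    """Parse a comma-separated flag string on top of default settings.

    A bare name turns the flag on; a `no_` prefix turns it off.
    (Hypothetical helper, not the connector's real implementation.)
    """
    settings = dict(defaults)  # copy so the caller's defaults are not mutated
    for raw in flags.split(","):
        name = raw.strip()
        if not name:
            continue  # tolerate empty input and stray commas
        if name.startswith("no_"):
            settings[name[len("no_"):]] = False
        else:
            settings[name] = True
    return settings


# Assumed default: schema inference stays off unless requested.
DEFAULTS = {"use_schema_inference": False}

print(parse_feature_flags("", DEFAULTS))
print(parse_feature_flags("use_schema_inference", DEFAULTS))
print(parse_feature_flags("no_use_schema_inference", DEFAULTS))
```

Layering the parsed flags over a defaults map means a flag's default could later be flipped to on, and users could still opt out with the `no_` form.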

This is useful because Redshift is a data warehouse where users often use very loose declared types on the tables, and then want to be able to rely on more precise as-used-in-practice type guarantees when materializing their dataset. For example, it's tolerably common to have no declared primary keys and declare all columns with a potentially-nullable type, then manually specify a collection key made of one or more of those properties, and then want to materialize that collection to a SQL database. And currently that doesn't work because the columns designated as the key are potentially nullable in the source DB, even if there are no actual nulls in practice. Turning on schema inference is the escape hatch which allows this to work right up until the moment they actually stick a null into their source data.
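As a toy model of the nullable-key problem described above (not the platform's actual validation logic), the sketch below checks whether a key property may be null under the discovered schema alone and under the discovered and inferred schemas taken together. The column names and both schemas are made up for illustration.

```python
def allows_null(prop_schema: dict) -> bool:
    """Return True if a property's JSON-schema `type` permits null."""
    types = prop_schema.get("type")
    if types is None:
        return True  # no type constraint at all, so null is permitted
    if isinstance(types, str):
        types = [types]
    return "null" in types


# Discovered from loose table declarations: every column is potentially nullable.
discovered = {
    "id": {"type": ["integer", "null"]},
    "name": {"type": ["string", "null"]},
}
# Hypothetical inferred schema: no nulls have been observed so far.
inferred = {
    "id": {"type": "integer"},
    "name": {"type": "string"},
}

key = ["id"]  # collection key chosen manually by the user

# With only the discovered schema, the key column may be null.
nullable_discovered = [k for k in key if allows_null(discovered[k])]

# When both schemas apply, a value must satisfy each of them,
# so null is allowed only if both schemas allow it.
nullable_combined = [
    k for k in key if allows_null(discovered[k]) and allows_null(inferred[k])
]

print(nullable_discovered)  # ['id']
print(nullable_combined)    # []
```

In this model the key stops being usable again as soon as a null appears in the data and the inferred schema widens to include it, which matches the "right up until the moment" caveat above.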

Workflow steps:

Put `use_schema_inference` in the "Feature Flags" section of the advanced endpoint config. If more feature flags are added in the future, they should be comma-separated.

Note that once enabled on a particular capture, simply removing the feature flag will not turn schema inference back off. As I understand things, the use of schema inference is "sticky", and disabling it would require manually fiddling with the collection schemas.
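The workflow steps above can be sketched roughly as follows: an endpoint config carrying the flag at `/advanced/feature_flags`, and a discovery step that adds `x-infer-schema: true` when the flag is present. Only the config path, the flag name, and the annotation come from this PR; the helper names and the example table schema are assumptions, and the connector's real discovery code is not shown here.

```python
import json


def feature_flags_from_config(config: dict) -> set[str]:
    """Read /advanced/feature_flags and split it on commas."""
    raw = config.get("advanced", {}).get("feature_flags", "")
    return {flag.strip() for flag in raw.split(",") if flag.strip()}


def annotate_schema(schema: dict, flags: set[str]) -> dict:
    """Return a copy of a discovered schema, adding x-infer-schema if the flag is set."""
    out = dict(schema)
    if "use_schema_inference" in flags:
        out["x-infer-schema"] = True
    return out


# Endpoint config fragment with the flag in the advanced section.
config = json.loads('{"advanced": {"feature_flags": "use_schema_inference"}}')

# Hypothetical discovered schema for a table with a nullable column.
discovered = {
    "type": "object",
    "properties": {"id": {"type": ["integer", "null"]}},
}

annotated = annotate_schema(discovered, feature_flags_from_config(config))
print(json.dumps(annotated, indent=2))
```

Note that this sketch only models what discovery emits. It says nothing about collection schemas that already exist, which is where the stickiness mentioned above comes in.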



It looks like the Redshift snapshots were a bit stale even before
I made any changes, so this commit gets those updates out of the
way.

Implements a new feature flag for `source-redshift-batch` which
causes collection schemas to specify `x-infer-schema: true` and
thereby request schema inference be used in addition to the full
discovered schema from the connector.

This can be useful in cases where the declared types of the table
in the source DB permit things like nullability, but the dataset
doesn't actually use that and the user needs tighter constraints
on possible values for materialization purposes (in this example,
knowing that the property is never null in practice).

Just a simple test case to make sure the `use_schema_inference`
feature flag does what it was intended to.

@willdonnelly willdonnelly requested a review from a team December 6, 2024 18:36
Contributor

@Alex-Bair Alex-Bair left a comment

LGTM!

@willdonnelly willdonnelly merged commit 82315d3 into main Dec 6, 2024
52 of 53 checks passed
@willdonnelly willdonnelly deleted the wgd/redshift-schema-inference-flag-20241206 branch December 6, 2024 20:13
2 participants