
Added Webhook variants #1280

Closed
wants to merge 145 commits into from

Conversation

@jonwihl (Contributor) commented Feb 26, 2024

Description:

(Describe the high level scope of new or changed features)

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



@jonwihl jonwihl requested a review from dyaffe February 26, 2024 15:57
mdibaiee and others added 28 commits March 13, 2024 12:31
Bumps [h2](https://github.com/hyperium/h2) from 0.3.22 to 0.3.24.
- [Release notes](https://github.com/hyperium/h2/releases)
- [Changelog](https://github.com/hyperium/h2/blob/v0.3.24/CHANGELOG.md)
- [Commits](hyperium/h2@v0.3.22...v0.3.24)

---
updated-dependencies:
- dependency-name: h2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/go-jose/go-jose/v3](https://github.com/go-jose/go-jose) from 3.0.0 to 3.0.1.
- [Release notes](https://github.com/go-jose/go-jose/releases)
- [Changelog](https://github.com/go-jose/go-jose/blob/v3/CHANGELOG.md)
- [Commits](go-jose/go-jose@v3.0.0...v3.0.1)

---
updated-dependencies:
- dependency-name: github.com/go-jose/go-jose/v3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
…rialized fields

Many systems will preserve capitalization in the names of materialized fields (for example, columns
of a SQL table), but still will not allow columns with names that differ only in capitalization to
co-exist within the same table. To prevent errors during `Apply`, an additional configuration is
provided for `Validate` to consider field names to be case insensitive for the purposes of
determining ambiguous columns.

This is relevant for materializations to BigQuery, Databricks, MotherDuck, MySQL, and SQLServer.

For SQL materializations, the configuration is made available through `materialize-sql` as a new
configuration of the `sql.Dialect`. I also moved the `MaxColumnCharLength` (name adjusted slightly)
out of the `sql.Endpoint` configuration and into the `sql.Dialect`, since it seemed to thematically
fit better there alongside the case-sensitivity configuration.
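The case-insensitivity check described above could be sketched roughly as follows. This is a minimal illustration, not the actual `materialize-sql` code; the function name and return shape are assumptions:

```python
from collections import defaultdict

def ambiguous_fields(field_names, case_insensitive=True):
    """Group field names by a case-folded key and report any group with
    more than one distinct name as ambiguous. With case_insensitive=True,
    'UserID' and 'userid' would collide in systems that preserve column
    capitalization but forbid columns differing only by case."""
    groups = defaultdict(set)
    for name in field_names:
        key = name.lower() if case_insensitive else name
        groups[key].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

A `Validate` implementation would report each returned group as an unsatisfiable constraint rather than letting `Apply` fail later.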
There was a systemic logic error in how we were escaping quote characters within a quoted
identifier. This fixes a number of those. Databricks and MySQL will be fixed in later commits which
also change how they are sanitizing identifiers.
MySQL does not allow identifiers to have trailing spaces or characters with code points greater than
U+FFFF. (ref https://dev.mysql.com/doc/refman/8.0/en/identifiers.html). These disallowed characters
will be sanitized with an underscore.
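The MySQL rule above could be sketched like this; the function name and the choice to replace (rather than trim) trailing spaces are assumptions for illustration, not the connector's actual code:

```python
def sanitize_mysql_identifier(name: str) -> str:
    """Replace characters MySQL disallows in identifiers with '_':
    code points above U+FFFF, and trailing spaces."""
    chars = ['_' if ord(c) > 0xFFFF else c for c in name]
    # Replace trailing spaces rather than trimming, so the identifier
    # keeps a stable, non-empty form.
    i = len(chars) - 1
    while i >= 0 and chars[i] == ' ':
        chars[i] = '_'
        i -= 1
    return ''.join(chars)
```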
Databricks has a set of characters that are strictly disallowed for column names, regardless of how
their identifiers are quoted. These characters will be sanitized to underscores.

Databricks also has a different, not entirely overlapping set of disallowed characters for table
names. These will be sanitized as part of the resource path response, rather than returning an error
from `Validate`. This allows table names with spaces, which may be common for Databricks
materializations linked to source captures that automatically add new bindings without the user
choosing the configured table names.
…e-defined "Metadata" fields

ElasticSearch has a number of pre-defined "Metadata" fields (see
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-fields.html) that exist for
every index, and trying to create an index with a separate mapping with any of these names will fail
with a "field is defined more than once" sort of error. All we can really do is differentiate our
Flow field by appending "_flow" to the name of the field to allow the field to be created.

Doing this kind of sanitization is tricky because there are numerous places it has to be included
for things to work out end-to-end without a higher-level framework like `materialize-sql`. And such
a general framework may actually make sense to construct for materializations in general, but I'm
not tackling that just yet.
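The renaming rule above amounts to something like the following sketch (the metadata-field list shown is partial, and the function name is an assumption):

```python
# Partial list of ElasticSearch's pre-defined metadata field names; see
# the mapping-fields reference linked above for the full set.
ES_METADATA_FIELDS = {"_id", "_index", "_source", "_routing", "_ignored"}

def sanitize_es_field(name: str) -> str:
    """Append "_flow" when a Flow field would collide with one of
    ElasticSearch's pre-defined metadata fields, so the index mapping
    can be created without a "field is defined more than once" error."""
    return name + "_flow" if name in ES_METADATA_FIELDS else name
```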
Adds a set of tests that ensure destination systems can create materialized fields for Flow field
names with some pretty challenging characters.

The `materialize-motherduck` tests were enabled and ran as part of this, since they now work. Some
additional handling in the test helpers was added to support delta updates configurations, since
`materialize-motherduck` must currently always be set to delta updates.
Previously we would only normalize characters in index names that would be allowed by Flow
collection names, but that's not fully sufficient since the configured index name in the resource
configuration can be anything. It will usually be the collection name, but not always.

This extends normalization to all characters that pose problems in index names.
Previously we just had duration strings to specify polling
intervals for batch captures. This commit factors out that
duration handling into a separate 'schedule' helper package
and defines another type of schedule: `"daily at HH:MMZ"`
which consistently executes at the requested time of day (in UTC).

This refactoring will be applied to other batch SQL captures
in a followup commit.
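The two schedule forms could be parsed along these lines; this is a simplified sketch (the real `schedule` package is Go, and the duration form here only covers a single unit):

```python
import re
from datetime import timedelta

DAILY_RE = re.compile(r'^daily at ([01]?\d|2[0-3]):([0-5]\d)Z$')
DURATION_RE = re.compile(r'^(\d+)([smh])$')

def parse_schedule(s: str):
    """Parse either a simple duration ("5m") used as a polling interval,
    or "daily at HH:MMZ" for a fixed UTC time of day."""
    if m := DAILY_RE.match(s):
        return ("daily", int(m.group(1)), int(m.group(2)))
    if m := DURATION_RE.match(s):
        unit = {"s": "seconds", "m": "minutes", "h": "hours"}[m.group(2)]
        return ("interval", timedelta(**{unit: int(m.group(1))}))
    raise ValueError(f"invalid schedule: {s!r}")
```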
Adds another sentence to the description of the polling schedule
properties of the endpoint config and resource spec describing
the two accepted forms of schedule. Also adds a regexp pattern
so the UI can validate correct inputs; it's somewhat hairy, but
I tested it manually and I'm pretty sure it's correct for most
reasonable inputs, while also not using any forbidden Go
struct-tag characters.
Log some basic information about the database that is being connected to, including its version and
maximum object size setting, if we are able to.
Updates a test snapshot that was missed in 130ae00
Adds a `paths` parameter to the endpoint config, which accepts an array of
strings.  Each path provided will result in a separate discovered collection.
The paths will also be used to determine the `path` and `recommended_name` of
each discovered binding. This allows source-http-ingest to be used with
multiple collections and still use auto-discovery.

The details of the implementation were largely motivated by a desire to
preserve backward compatibility with existing captures. Existing capture
bindings generally don't use the `path` parameter, and they end up using the
flow collection name as the URL path.  So we try to preserve that behavior in
the case that no `paths` are provided in the endpoint config, with the goal of
not breaking any existing webhook deliveries.
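The discovery behavior described above could look roughly like this sketch; the field names and the rule for deriving a recommended name from a path are assumptions, not the connector's actual output:

```python
def discovered_bindings(paths):
    """Produce one discovered binding per configured URL path. With no
    paths configured, fall back to a single binding with no explicit
    path, preserving the legacy behavior where the Flow collection name
    becomes the URL path."""
    if not paths:
        return [{"recommendedName": "webhook-data", "resourceConfig": {}}]
    return [
        {
            "recommendedName": p.strip("/").replace("/", "_") or "root",
            "resourceConfig": {"path": p},
        }
        for p in paths
    ]
```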
Previously, we were ignoring query parameters altogether. Now we include them
in captured documents under `/_meta/query`.
Axum and tower-http had a few minor breaking changes. Tower-http
[improved error handling of RequestDecompressionLayer](tower-rs/tower-http#368),
which allowed removal of some annoying error conversion code.
Bumps [pillow](https://github.com/python-pillow/Pillow) from 10.1.0 to 10.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](python-pillow/Pillow@10.1.0...10.2.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
This changes materializations to provide more actionable messages when producing unsatisfiable
constraints.

Particularly for SQL materializations, which will currently report the existing column type and
proposed JSON schema of the field, the message will now also report the type of column that is
needed to accommodate the proposed JSON schema.

Overall the objective is to make it easier to understand how a user could migrate their existing
table to be compatible with a changed collection schema.
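A message of the shape described above might be assembled like this; the function name and exact wording are illustrative assumptions:

```python
def unsatisfiable_message(field, existing_type, proposed_schema, needed_type):
    """Build an unsatisfiable-constraint message that reports not just
    the existing column type and the proposed JSON schema, but also the
    column type that would accommodate the proposed schema."""
    return (
        f"field '{field}' has existing column type {existing_type}, which "
        f"is incompatible with proposed schema {proposed_schema}; migrate "
        f"the column to {needed_type} to accept the new schema"
    )
```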
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.2...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](pyca/cryptography@41.0.4...41.0.6)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
mdibaiee and others added 29 commits March 13, 2024 12:31
…lo-world

Adds labels to the docker images in order to make some connectors free to run
on Estuary's platform.  Goes along with estuary/flow#1403. Free connectors are:

- source-google-sheets-native
- materialize-google-sheets
- source-hello-world
Table creation will sometimes fail for a variety of reasons and it is hard to tell which table it is
for, or what SQL was attempted to create the table. This adds logging to allow for troubleshooting,
and a slightly improved error message.
This may improve throughput when scanning a single large table with large documents.
Some resources always have _some_ level of activity every time they're
checked, which would ordinarily prevent the resource from ever being
considered "caught up", preventing the connector from exiting.

If the LogCursor is a date-time, then additionally consider the task to
be idle if the LogCursor is less-than `interval` old (where interval is
configured on the ResourceConfig), and sleep until the cursor is
`interval` old before issuing a successive fetch.
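The idleness test described above reduces to a small predicate; this is a sketch with assumed names, not the CDK's actual code:

```python
from datetime import datetime, timedelta, timezone

def cursor_is_idle(log_cursor: datetime, interval: timedelta, now=None) -> bool:
    """Consider the task idle (caught up) if the date-time LogCursor is
    less than `interval` old, even when every check shows some activity.
    The connector would then sleep until the cursor is `interval` old
    before fetching again."""
    now = now or datetime.now(timezone.utc)
    return now - log_cursor < interval
```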
Much like FetchChangesFn, FetchPageFn is now an AsyncGenerator which
yields documents, checkpoint-able PageCursors, or completes an
iteration of a resource.

This refactoring is consistent with recent updates to FetchChangesFn,
and is motivated by use cases where the fetched "page" could be quite
large and itself composed of multiple concurrent data fetches,
implemented as constituent AsyncGenerator instances.

By using an AsyncGenerator, such an implementation can immediately
yield from across those concurrent fetches as data arrives such that
data is spilled to disk and memory pressure kept low. Then, when the
scatter / gather data fetch completes, the entire "page" is
checkpointed.
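The generator shape described above can be illustrated with a toy page fetch; the document and cursor types here are stand-ins, not the CDK's real signatures:

```python
import asyncio
from typing import AsyncGenerator, Union

async def fetch_page(cursor: int) -> AsyncGenerator[Union[dict, int], None]:
    """Yield documents for the current page, then yield a checkpoint-able
    PageCursor; returning without yielding a cursor would complete an
    iteration of the resource."""
    docs = [{"id": cursor + i} for i in range(3)]  # stand-in for a real fetch
    for doc in docs:
        yield doc
    yield cursor + len(docs)  # the next PageCursor to checkpoint
```

Because the caller consumes items as they are yielded, a large page composed of concurrent inner fetches can be spilled to disk incrementally, keeping memory pressure low until the final cursor is checkpointed.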
MySQL's handling of enum values in the context of table collation
is most vexing in a way that makes our backfill logic misbehave.
See, if you write `ORDER BY category` then MySQL orders the rows
based on the integer index of the enum values. But if you write
`WHERE category > 'CATEGORY_FOO'` then MySQL goes and casts the
enum values to strings and compares those.

There is probably a way to write some expression like `CAST('CATEGORY_FOO', ...)`
which would produce the desired enum ordering comparison in that
`WHERE` clause, but the `...` portion of that query might get
pretty hairy. We could instead go `ORDER BY CAST(category AS CHAR)`
but this would probably force a table sort and we'd prefer to
use the natural table ordering where possible.

The best solution I have been able to come up with to this problem
is to use the integer representation of the enum value as the key
for backfill purposes. That means that when serializing a backfill
row value into a scan key we convert it to the integer index which
corresponds to that string, and so when issuing the next backfill
query we effectively go like `WHERE category > 2` (except as a
parameterized query).
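The serialization step amounts to mapping the string value back to its position in the column's enum definition; a sketch, with the function name assumed:

```python
def enum_scan_key(value: str, enum_values: list[str]) -> int:
    """Serialize an enum row value to its 1-based integer index (MySQL's
    internal ordering) for use as a backfill scan key, so the next query
    can compare `WHERE category > ?` by index. Values not in the enum
    map to 0, matching MySQL's '' sentinel for illegal inputs."""
    try:
        return enum_values.index(value) + 1
    except ValueError:
        return 0
```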
The string `""` is always accepted by a MySQL enum column and any
illegal inputs are mapped to this string, which is represented by
the integer 0.

We already supported all of this (modulo a small bug in `ALTER TABLE`
parsing which I accidentally uncovered as part of this commit), but
the logic becomes a bit simpler after this refactor, which moves the
`""` option to be first in the list so that its index 0 corresponds
more directly to MySQL representing it as the integer zero.
Active sessions prevent the warehouse from shutting down via the
auto-delay configuration.
@jonwihl jonwihl closed this Mar 13, 2024
10 participants