
Added Webhook variants #1280

Closed
wants to merge 145 commits into from

Conversation

@jonwihl (Contributor) commented Feb 26, 2024

Description:

(Describe the high level scope of new or changed features)

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



@jonwihl jonwihl requested a review from dyaffe February 26, 2024 15:57
mdibaiee and others added 28 commits March 13, 2024 12:31
Bumps [h2](https://github.com/hyperium/h2) from 0.3.22 to 0.3.24.
- [Release notes](https://github.com/hyperium/h2/releases)
- [Changelog](https://github.com/hyperium/h2/blob/v0.3.24/CHANGELOG.md)
- [Commits](hyperium/h2@v0.3.22...v0.3.24)

---
updated-dependencies:
- dependency-name: h2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/go-jose/go-jose/v3](https://github.com/go-jose/go-jose) from 3.0.0 to 3.0.1.
- [Release notes](https://github.com/go-jose/go-jose/releases)
- [Changelog](https://github.com/go-jose/go-jose/blob/v3/CHANGELOG.md)
- [Commits](go-jose/go-jose@v3.0.0...v3.0.1)

---
updated-dependencies:
- dependency-name: github.com/go-jose/go-jose/v3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
…rialized fields

Many systems will preserve capitalization in the names of materialized fields (for example, columns
of a SQL table), but still will not allow columns with names that differ only in capitalization to
co-exist within the same table. To prevent errors during `Apply`, an additional configuration is
provided for `Validate` to consider field names to be case insensitive for the purposes of
determining ambiguous columns.

This is relevant for materializations to BigQuery, Databricks, MotherDuck, MySQL, and SQLServer.

For SQL materializations, the configuration is made available through `materialize-sql` as a new
configuration of the `sql.Dialect`. I also moved the `MaxColumnCharLength` (name adjusted slightly)
out of the `sql.Endpoint` configuration and into the `sql.Dialect`, since it seemed to thematically
fit better there alongside the case-sensitivity configuration.
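The case-insensitivity check described above could be sketched roughly as follows. This is a minimal illustration, not the actual `materialize-sql` code; the function name and return shape are assumptions:

```python
from collections import defaultdict

def ambiguous_fields(field_names, case_insensitive=True):
    """Group field names by a case-folded key and report any group with
    more than one distinct name as ambiguous. With case_insensitive=True,
    'UserID' and 'userid' would collide in systems that preserve column
    capitalization but forbid columns differing only by case."""
    groups = defaultdict(set)
    for name in field_names:
        key = name.lower() if case_insensitive else name
        groups[key].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

A `Validate` implementation would report each returned group as an unsatisfiable constraint rather than letting `Apply` fail later.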
There was a systemic logic error in how we were escaping quote characters within a quoted
identifier. This fixes a number of those. Databricks and MySQL will be fixed in later commits which
also change how they are sanitizing identifiers.
MySQL does not allow identifiers to have trailing spaces or characters with code points greater than
U+FFFF. (ref https://dev.mysql.com/doc/refman/8.0/en/identifiers.html). These disallowed characters
will be sanitized with an underscore.
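The MySQL rule above could be sketched like this; the function name and the choice to replace (rather than trim) trailing spaces are assumptions for illustration, not the connector's actual code:

```python
def sanitize_mysql_identifier(name: str) -> str:
    """Replace characters MySQL disallows in identifiers with '_':
    code points above U+FFFF, and trailing spaces."""
    chars = ['_' if ord(c) > 0xFFFF else c for c in name]
    # Replace trailing spaces rather than trimming, so the identifier
    # keeps a stable, non-empty form.
    i = len(chars) - 1
    while i >= 0 and chars[i] == ' ':
        chars[i] = '_'
        i -= 1
    return ''.join(chars)
```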
Databricks has a set of characters that are strictly disallowed for column names, regardless of how
their identifiers are quoted. These characters will be sanitized to underscores.

Databricks also has a different, not entirely overlapping set of disallowed characters for table
names. These will be sanitized as part of the resource path response, rather than returning an error
from `Validate`. This allows table names with spaces, which may be common for Databricks
materializations linked to source captures that automatically add new bindings without the user
choosing the configured table names.
…e-defined "Metadata" fields

ElasticSearch has a number of pre-defined "Metadata" fields (see
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-fields.html) that exist for
every index, and trying to create an index with a separate mapping with any of these names will fail
with a "field is defined more than once" sort of error. All we can really do is differentiate our
Flow field by appending "_flow" to the name of the field to allow the field to be created.

Doing this kind of sanitization is tricky because there are numerous places it has to be included
for things to work out end-to-end without a higher-level framework like `materialize-sql`. And such
a general framework may actually make sense to construct for materializations in general, but I'm
not tackling that just yet.
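The renaming rule above amounts to something like the following sketch (the metadata-field list shown is partial, and the function name is an assumption):

```python
# Partial list of ElasticSearch's pre-defined metadata field names; see
# the mapping-fields reference linked above for the full set.
ES_METADATA_FIELDS = {"_id", "_index", "_source", "_routing", "_ignored"}

def sanitize_es_field(name: str) -> str:
    """Append "_flow" when a Flow field would collide with one of
    ElasticSearch's pre-defined metadata fields, so the index mapping
    can be created without a "field is defined more than once" error."""
    return name + "_flow" if name in ES_METADATA_FIELDS else name
```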
Adds a set of tests that ensure destination systems can create materialized fields for Flow field
names with some pretty challenging characters.

The `materialize-motherduck` tests were enabled and ran as part of this, since they now work. Some
additional handling in the test helpers was added to support delta updates configurations, since
`materialize-motherduck` must currently always be set to delta updates.
Previously we would only normalize characters in index names that would be allowed by Flow
collection names, but that's not fully sufficient since the configured index name in the resource
configuration can be anything. It will usually be the collection name, but not always.

This extends normalization to all characters that pose problems in index names.
Previously we just had duration strings to specify polling
intervals for batch captures. This commit factors out that
duration handling into a separate 'schedule' helper package
and defines another type of schedule: `"daily at HH:MMZ"`
which consistently executes at the requested time of day (in UTC).

This refactoring will be applied to other batch SQL captures
in a followup commit.
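The two schedule forms could be parsed along these lines; this is a simplified sketch (the real `schedule` package is Go, and the duration form here only covers a single unit):

```python
import re
from datetime import timedelta

DAILY_RE = re.compile(r'^daily at ([01]?\d|2[0-3]):([0-5]\d)Z$')
DURATION_RE = re.compile(r'^(\d+)([smh])$')

def parse_schedule(s: str):
    """Parse either a simple duration ("5m") used as a polling interval,
    or "daily at HH:MMZ" for a fixed UTC time of day."""
    if m := DAILY_RE.match(s):
        return ("daily", int(m.group(1)), int(m.group(2)))
    if m := DURATION_RE.match(s):
        unit = {"s": "seconds", "m": "minutes", "h": "hours"}[m.group(2)]
        return ("interval", timedelta(**{unit: int(m.group(1))}))
    raise ValueError(f"invalid schedule: {s!r}")
```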
Adds another sentence to the description of the polling schedule
properties of the endpoint config and resource spec describing
the two accepted forms of schedule. Also adds a regexp pattern
so the UI can validate correct inputs; it's somewhat hairy, but
I tested it manually and I'm pretty sure it's correct for most
reasonable inputs, while also not using any forbidden Go
struct-tag characters.
Log some basic information about the database that is being connected to, including its version and
maximum object size setting, if we are able to.
Updates a test snapshot that was missed in 130ae00
Adds a `paths` parameter to the endpoint config, which accepts an array of
strings.  Each path provided will result in a separate discovered collection.
The paths will also be used to determine the `path` and `recommended_name` of
each discovered binding. This allows source-http-ingest to be used with
multiple collections and still use auto-discovery.

The details of the implementation were largely motivated by a desire to
preserve backward compatibility with existing captures. Existing capture
bindings generally don't use the `path` parameter, and they end up using the
flow collection name as the URL path.  So we try to preserve that behavior in
the case that no `paths` are provided in the endpoint config, with the goal of
not breaking any existing webhook deliveries.
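The discovery behavior described above could look roughly like this sketch; the field names and the rule for deriving a recommended name from a path are assumptions, not the connector's actual output:

```python
def discovered_bindings(paths):
    """Produce one discovered binding per configured URL path. With no
    paths configured, fall back to a single binding with no explicit
    path, preserving the legacy behavior where the Flow collection name
    becomes the URL path."""
    if not paths:
        return [{"recommendedName": "webhook-data", "resourceConfig": {}}]
    return [
        {
            "recommendedName": p.strip("/").replace("/", "_") or "root",
            "resourceConfig": {"path": p},
        }
        for p in paths
    ]
```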
Previously, we were ignoring query parameters altogether. Now we include them
in captured documents under `/_meta/query`.
Axum and tower-http had a few minor breaking changes. Tower-http
[improved error handling of RequestDecompressionLayer](tower-rs/tower-http#368),
which allowed removal of some annoying error conversion code.
Bumps [pillow](https://github.com/python-pillow/Pillow) from 10.1.0 to 10.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](python-pillow/Pillow@10.1.0...10.2.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
This changes materializations to provide more actionable messages when producing unsatisfiable
constraints.

Particularly for SQL materializations, which will currently report the existing column type and
proposed JSON schema of the field, the message will now also report the type of column that is
needed to accommodate the proposed JSON schema.

Overall the objective is to make it easier to understand how a user could migrate their existing
table to be compatible with a changed collection schema.
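A message of the shape described above might be assembled like this; the function name and exact wording are illustrative assumptions:

```python
def unsatisfiable_message(field, existing_type, proposed_schema, needed_type):
    """Build an unsatisfiable-constraint message that reports not just
    the existing column type and the proposed JSON schema, but also the
    column type that would accommodate the proposed schema."""
    return (
        f"field '{field}' has existing column type {existing_type}, which "
        f"is incompatible with proposed schema {proposed_schema}; migrate "
        f"the column to {needed_type} to accept the new schema"
    )
```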
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.2...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](pyca/cryptography@41.0.4...41.0.6)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
mdibaiee and others added 29 commits March 13, 2024 12:31
…lo-world

Adds labels to the docker images in order to make some connectors free to run
on Estuary's platform.  Goes along with estuary/flow#1403. Free connectors are:

- source-google-sheets-native
- materialize-google-sheets
- source-hello-world
Table creation will sometimes fail for a variety of reasons and it is hard to tell which table it is
for, or what SQL was attempted to create the table. This adds logging to allow for troubleshooting,
and a slightly improved error message.
This may improve throughput when scanning a single large table with large documents.
Some resources always have _some_ level of activity every time they're
checked, which would ordinarily prevent the resource from ever being
considered "caught up", preventing the connector from exiting.

If the LogCursor is a date-time, then additionally consider the task to
be idle if the LogCursor is less-than `interval` old (where interval is
configured on the ResourceConfig), and sleep until the cursor is
`interval` old before issuing a successive fetch.
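The idleness test described above reduces to a small predicate; this is a sketch with assumed names, not the CDK's actual code:

```python
from datetime import datetime, timedelta, timezone

def cursor_is_idle(log_cursor: datetime, interval: timedelta, now=None) -> bool:
    """Consider the task idle (caught up) if the date-time LogCursor is
    less than `interval` old, even when every check shows some activity.
    The connector would then sleep until the cursor is `interval` old
    before fetching again."""
    now = now or datetime.now(timezone.utc)
    return now - log_cursor < interval
```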
Much like FetchChangesFn, FetchPageFn is now an AsyncGenerator which
yields documents, checkpoint-able PageCursors, or completes an
iteration of a resource.

This refactoring is consistent with recent updates to FetchChangesFn,
and is motivated by use cases where the fetched "page" could be quite
large and itself composed of multiple concurrent data fetches,
implemented as constituent AsyncGenerator instances.

By using an AsyncGenerator, such an implementation can immediately
yield from across those concurrent fetches as data arrives such that
data is spilled to disk and memory pressure kept low. Then, when the
scatter / gather data fetch completes, the entire "page" is
checkpointed.
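The generator shape described above can be illustrated with a toy page fetch; the document and cursor types here are stand-ins, not the CDK's real signatures:

```python
import asyncio
from typing import AsyncGenerator, Union

async def fetch_page(cursor: int) -> AsyncGenerator[Union[dict, int], None]:
    """Yield documents for the current page, then yield a checkpoint-able
    PageCursor; returning without yielding a cursor would complete an
    iteration of the resource."""
    docs = [{"id": cursor + i} for i in range(3)]  # stand-in for a real fetch
    for doc in docs:
        yield doc
    yield cursor + len(docs)  # the next PageCursor to checkpoint
```

Because the caller consumes items as they are yielded, a large page composed of concurrent inner fetches can be spilled to disk incrementally, keeping memory pressure low until the final cursor is checkpointed.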
MySQL's handling of enum values in the context of table collation
is most vexing in a way that makes our backfill logic misbehave.
See, if you write `ORDER BY category` then MySQL orders the rows
based on the integer index of the enum values. But if you write
`WHERE category > 'CATEGORY_FOO'` then MySQL goes and casts the
enum values to strings and compares those.

There is probably a way to write some expression like `CAST('CATEGORY_FOO', ...)`
which would produce the desired enum ordering comparison in that
`WHERE` clause, but the `...` portion of that query might get
pretty hairy. We could instead go `ORDER BY CAST(category AS CHAR)`
but this would probably force a table sort and we'd prefer to
use the natural table ordering where possible.

The best solution I have been able to come up with to this problem
is to use the integer representation of the enum value as the key
for backfill purposes. That means that when serializing a backfill
row value into a scan key we convert it to the integer index which
corresponds to that string, and so when issuing the next backfill
query we effectively go like `WHERE category > 2` (except as a
parameterized query).
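The serialization step amounts to mapping the string value back to its position in the column's enum definition; a sketch, with the function name assumed:

```python
def enum_scan_key(value: str, enum_values: list[str]) -> int:
    """Serialize an enum row value to its 1-based integer index (MySQL's
    internal ordering) for use as a backfill scan key, so the next query
    can compare `WHERE category > ?` by index. Values not in the enum
    map to 0, matching MySQL's '' sentinel for illegal inputs."""
    try:
        return enum_values.index(value) + 1
    except ValueError:
        return 0
```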
The string `""` is always accepted by a MySQL enum column and any
illegal inputs are mapped to this string, which is represented by
the integer 0.

We already supported all of this (modulo a small bug in `ALTER TABLE`
parsing which I accidentally uncovered as part of this commit), but
the logic becomes a bit simpler after this refactor, which moves the
`""` option to be first in the list so that its index 0 corresponds
more directly to MySQL representing it as the integer zero.
Active sessions prevent the warehouse from shutting down via the
auto-delay configuration.
@jonwihl jonwihl closed this Mar 13, 2024
10 participants