GCP: Decide what to do with tables when they are removed from this repository #386
Comments
I'm also vaguely pro (2), as it would be nice to more generally have a concept of "deprecated" or "historical" data. There are many ping types we have collected over the years that we probably don't care about processing (e.g. appusage). In #334 we decided which pings we care about and don't care about for the purposes of schemas. I do like the idea of being able to make an active decision to no longer process a ping type by removing its schemas. I don't like (1) because it means the state of production and the state of generated-schemas would not be precisely the same. In this case I would prefer that we never drop schemas (hitherto the standard practice). This goes back to developing a notion of "deprecated" data, which doesn't exist for ingestion currently.
You make a good point about (1.) not matching the
I do believe that makes mozilla/bigquery-etl#291 a dependency here.
I'm not convinced that we gain much by actually moving the table to a historical location. I'd like to see a way of marking a docType as deprecated, perhaps via metadata in the JSON schema file itself. Once a schema is marked as deprecated, perhaps we'd want the Perhaps we should have a This vaguely seems like the kind of metadata we would want to maintain in GCP's Data Catalog.
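To make the "deprecated via metadata" idea concrete, here is a minimal sketch of what a check could look like. The `metadata` and `deprecated` key names are assumptions for illustration only; nothing in this thread settles on a key or on where it would live in the schema file.

```python
def is_deprecated(schema: dict) -> bool:
    """Return True if a JSON schema carries a deprecation marker.

    The "metadata" / "deprecated" keys are hypothetical names, not an
    existing convention in this repository.
    """
    return bool(schema.get("metadata", {}).get("deprecated", False))


def split_by_status(schemas: dict) -> tuple:
    """Split docTypes into (active, deprecated) lists, sorted by name.

    `schemas` maps a docType name to its parsed JSON schema.
    """
    active = sorted(name for name, s in schemas.items() if not is_deprecated(s))
    deprecated = sorted(name for name, s in schemas.items() if is_deprecated(s))
    return active, deprecated
```

With something like this, a deprecated docType keeps its schema (and therefore its table) while tooling can still tell that no new data is expected for it.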
BQ tables are auto-generated and updated as the schema changes. Once the schema is removed from this repository, the table is dropped. We shouldn't be dropping data when a schema is removed; instead, we should retain the historical data for however long the retention period is (cc @mreid-moz).
Option 1: We keep the table in the same location, allowing for the small possibility that a new schema will be written to that location. We could add automatic checking for this case; it would be especially bad if the schemas weren't compatible.
Option 2: We move that data to a historical location, such that we know it's not being updated and new data is not flowing in, and a new ping can replace it; however it will remain queryable (for the duration of the retention period).
I'm leaning towards (2.), but the downside is we either need to manually change queries to point to the new location, or move views to point there (and version views for the new data).
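For the automatic checking mentioned under Option 1, a rough sketch of a compatibility check between the retained table's schema and a new schema could look like the following. It assumes both schemas are in BigQuery's JSON schema form (lists of fields with `name`, `type`, optional `mode` and nested `fields`), and the rules are a simplification: adding NULLABLE or REPEATED fields and relaxing REQUIRED to NULLABLE are treated as compatible, everything else as a problem. The function name is made up for this example.

```python
def find_incompatibilities(old, new, prefix=""):
    """List reasons why `new` could not safely reuse a table shaped like `old`."""
    problems = []
    new_by_name = {f["name"]: f for f in new}
    old_names = {f["name"] for f in old}

    for field in old:
        path = prefix + field["name"]
        other = new_by_name.get(field["name"])
        if other is None:
            problems.append(f"{path}: missing from new schema")
            continue
        if field["type"] != other["type"]:
            problems.append(f"{path}: type {field['type']} -> {other['type']}")
            continue
        old_mode = field.get("mode", "NULLABLE")
        new_mode = other.get("mode", "NULLABLE")
        # Relaxing REQUIRED to NULLABLE is the only mode change accepted here.
        relaxed = old_mode == "REQUIRED" and new_mode == "NULLABLE"
        if old_mode != new_mode and not relaxed:
            problems.append(f"{path}: mode {old_mode} -> {new_mode}")
        if field["type"] == "RECORD":
            problems.extend(
                find_incompatibilities(
                    field.get("fields", []), other.get("fields", []), path + "."
                )
            )

    # New fields are fine unless they are REQUIRED.
    for field in new:
        if field["name"] not in old_names and field.get("mode", "NULLABLE") == "REQUIRED":
            problems.append(f"{prefix}{field['name']}: new REQUIRED field")
    return problems
```

An empty result would mean the new schema could land in the existing location; any entry would be a reason to block the change or fall back to Option 2.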