DENG 4972 add suffix usage to naming in bigquery (#851)
* Use case clarification.

* Update using_aggregates.md

Typo

* Spell check.

* Spell check.

* Bring back deleted line.

* Add the use of suffix in table naming.

* Add the use of suffix in table naming.

* Spelling
lucia-vargas-a authored Sep 27, 2024
1 parent 1e32e83 commit 8e3085d
Showing 2 changed files with 3 additions and 0 deletions.
1 change: 1 addition & 0 deletions .spelling
@@ -73,6 +73,7 @@ checkpointing
chromehangs
CircleCI
CLI
clients_daily_v6
cloudops-infra
Colaboratory
Colab
2 changes: 2 additions & 0 deletions src/cookbooks/bigquery/querying.md
@@ -69,6 +69,8 @@ The table and view types referenced above are defined as follows:
- _Live ping tables_ are the final destination for the [telemetry ingestion pipeline](https://mozilla.github.io/gcp-ingestion/). Dataflow jobs process incoming ping payloads from clients, batch them together by document type, and load the results to these tables approximately every five minutes, although a few document types are opted in to a more expensive streaming path that makes records available in BigQuery within seconds of ingestion. These tables are partitioned by date according to `submission_timestamp` and are also clustered on that same field, so it is possible to make efficient queries over short windows of recent data such as the last hour. They have a rolling expiration period of 30 days, but that window may be shortened in the future. Analyses should only use these tables if they need results for the current (partial) day.
- _Historical ping tables_ have exactly the same schema as their corresponding live ping tables, but they are populated only once per day (`12:00:00am` to `11:59:59pm` UTC) via an Airflow job and have a 25 month retention period. These tables are superior to the live ping tables for historical analysis because they never contain partial days, they have additional deduplication applied, and they are clustered on `sample_id`, allowing efficient queries on a 1% sample of clients. It is guaranteed that `document_id` is distinct within each day of each historical ping table, but it is still possible for a document to appear multiple times if a client sends the same payload across multiple UTC days. Note that this requirement is relaxed for older telemetry ping data that was backfilled from AWS; approximately 0.5% of documents are duplicated in `telemetry.main` and other historical ping tables for 2019-04-30 and earlier dates.
- _Derived tables_ are populated by nightly [Airflow](https://workflow.telemetry.mozilla.org/home) jobs and are considered an implementation detail; their structure may change at any time at the discretion of the data platform team to allow refactoring or efficiency improvements.
- Tables whose names carry no special suffix may contain `client_id` or other ID-level columns, e.g. [`clients_daily_v6`](https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/metadata.yaml).
- Tables without `client_id`-level information use the suffix `_aggregates`, e.g. [`addon_aggregates_v2`](https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/telemetry_derived/addon_aggregates_v2/metadata.yaml).
- _User-facing views_ are the schema objects that users are primarily expected to use in analyses. Many of these views correspond directly to an underlying historical ping table or derived table, but they provide the flexibility to hide deprecated columns or present additional calculated columns to users. These views are the schema contract with users and they should not change in backwards-incompatible ways without a version increase or an announcement to users about a breaking change.
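
The partitioning and clustering behavior described above can be sketched in BigQuery SQL. The queries below are illustrative only: the exact dataset and table names (`telemetry_live.main_v5`, `telemetry_stable.main_v5`) are assumptions for the sake of the example, not guaranteed by this document.

```sql
-- Live ping tables: filtering on submission_timestamp prunes both the date
-- partitions and the clusters, so a query over the last hour scans little data.
SELECT
  COUNT(*) AS ping_count
FROM
  `moz-fx-data-shared-prod.telemetry_live.main_v5`  -- live ping table (name assumed)
WHERE
  submission_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- Historical ping tables are clustered on sample_id, so restricting to one
-- sample_id value reads roughly a 1% sample of clients for that day.
SELECT
  COUNT(DISTINCT client_id) AS sampled_clients
FROM
  `moz-fx-data-shared-prod.telemetry_stable.main_v5`  -- historical ping table (name assumed)
WHERE
  DATE(submission_timestamp) = '2024-09-01'
  AND sample_id = 42;
```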

Spark and other applications relying on the BigQuery Storage API for data access need to reference derived tables or historical ping tables directly rather than user-facing views. Unless the query result is relatively large, we recommend instead that users run a query on top of user-facing views with the output saved in a destination table, which can then be accessed from Spark.
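
The recommended pattern above might be sketched as follows; the destination dataset and table names here are hypothetical, and the source view is chosen only as an example of a user-facing view:

```sql
-- Materialize a query over a user-facing view into a destination table,
-- which Spark can then read directly via the BigQuery Storage API.
CREATE OR REPLACE TABLE `moz-fx-data-shared-prod.analysis.my_analysis_table` AS
SELECT
  client_id,
  submission_date,
  active_hours_sum
FROM
  `moz-fx-data-shared-prod.telemetry.clients_daily`  -- user-facing view (example)
WHERE
  submission_date = '2024-09-01';
```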
