diff --git a/.spelling b/.spelling
index a9db9a7eb..445e62d9c 100644
--- a/.spelling
+++ b/.spelling
@@ -73,6 +73,7 @@ checkpointing
 chromehangs
 CircleCI
 CLI
+clients_daily_v6
 cloudops-infra
 Colaboratory
 Colab
diff --git a/src/cookbooks/bigquery/querying.md b/src/cookbooks/bigquery/querying.md
index 031682750..f43586ecd 100644
--- a/src/cookbooks/bigquery/querying.md
+++ b/src/cookbooks/bigquery/querying.md
@@ -69,6 +69,8 @@ The table and view types referenced above are defined as follows:
 - _Live ping tables_ are the final destination for the [telemetry ingestion pipeline](https://mozilla.github.io/gcp-ingestion/). Dataflow jobs process incoming ping payloads from clients, batch them together by document type, and load the results to these tables approximately every five minutes, although a few document types are opted in to a more expensive streaming path that makes records available in BigQuery within seconds of ingestion. These tables are partitioned by date according to `submission_timestamp` and are also clustered on that same field, so it is possible to make efficient queries over short windows of recent data such as the last hour. They have a rolling expiration period of 30 days, but that window may be shortened in the future. Analyses should only use these tables if they need results for the current (partial) day.
 - _Historical ping tables_ have exactly the same schema as their corresponding live ping tables, but they are populated only once per day (`12:00:00am` to `11:59:59pm` UTC) via an Airflow job and have a 25 month retention period. These tables are superior to the live ping tables for historical analysis because they never contain partial days, they have additional deduplication applied, and they are clustered on `sample_id`, allowing efficient queries on a 1% sample of clients. It is guaranteed that `document_id` is distinct within each day of each historical ping table, but it is still possible for a document to appear multiple times if a client sends the same payload across multiple UTC days. Note that this requirement is relaxed for older telemetry ping data that was backfilled from AWS; approximately 0.5% of documents are duplicated in `telemetry.main` and other historical ping tables for 2019-04-30 and earlier dates.
 - _Derived tables_ are populated by nightly [Airflow](https://workflow.telemetry.mozilla.org/home) jobs and are considered an implementation detail; their structure may change at any time at the discretion of the data platform team to allow refactoring or efficiency improvements.
+  - Tables (unsuffixed) may contain `client_id` or other id-level columns, e.g. [clients_daily_v6](https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/metadata.yaml).
+  - Tables without `client_id`-level information use the suffix `_aggregates`, e.g. [`addon_aggregates_v2`](https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/telemetry_derived/addon_aggregates_v2/metadata.yaml).
 - _User-facing views_ are the schema objects that users are primarily expected to use in analyses. Many of these views correspond directly to an underlying historical ping table or derived table, but they provide the flexibility to hide deprecated columns or present additional calculated columns to users. These views are the schema contract with users and they should not change in backwards-incompatible ways without a version increase or an announcement to users about a breaking change.

 Spark and other applications relying on the BigQuery Storage API for data access need to reference derived tables or historical ping tables directly rather than user-facing views. Unless the query result is relatively large, we recommend instead that users run a query on top of user-facing views with the output saved in a destination table, which can then be accessed from Spark.
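As an illustration of that recommended workflow, the sketch below uses the `google-cloud-bigquery` Python client to run a query over the user-facing `telemetry.clients_daily` view, saves the result to a destination table, and then reads that table from Spark through the spark-bigquery connector. The analysis project, dataset, destination table name, selected columns, and connector version are all placeholders, and the `sample_id` filter is only one example of using the clustered 1% sampling described above.

```python
# Hypothetical end-to-end sketch: materialize a query over a user-facing view
# into a destination table, then read that table from Spark.
from google.cloud import bigquery
from pyspark.sql import SparkSession

# Placeholder project/dataset/table; the Storage API cannot read views,
# so results are written to a real table that Spark can scan directly.
DESTINATION = "your-analysis-project.your_dataset.clients_daily_sample"

client = bigquery.Client(project="your-analysis-project")
job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(DESTINATION),
    write_disposition="WRITE_TRUNCATE",  # overwrite the table on each run
)

query = """
    SELECT submission_date, client_id, active_hours_sum
    FROM `moz-fx-data-shared-prod.telemetry.clients_daily`
    WHERE submission_date = "2021-01-01"
      AND sample_id = 42  -- one of 100 buckets, i.e. a 1% sample of clients
"""
client.query(query, job_config=job_config).result()  # wait for completion

# Read the materialized result from Spark via the spark-bigquery connector.
spark = (
    SparkSession.builder.appName("bq-destination-table-example")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2",
    )
    .getOrCreate()
)
df = spark.read.format("bigquery").option("table", DESTINATION).load()
df.show()
```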