DENG 4972 add suffix usage to naming in bigquery #851

lucia-vargas-a · 2024-09-26T16:15:01Z

Adds a clarification on the use of suffixes in table naming, to resolve the confusion and lack of uniformity that users have reported.

Typo

scholtzan · 2024-09-27T00:31:23Z

src/cookbooks/bigquery/querying.md

@@ -69,6 +69,8 @@ The table and view types referenced above are defined as follows:
 - _Live ping tables_ are the final destination for the [telemetry ingestion pipeline](https://mozilla.github.io/gcp-ingestion/). Dataflow jobs process incoming ping payloads from clients, batch them together by document type, and load the results to these tables approximately every five minutes, although a few document types are opted in to a more expensive streaming path that makes records available in BigQuery within seconds of ingestion. These tables are partitioned by date according to `submission_timestamp` and are also clustered on that same field, so it is possible to make efficient queries over short windows of recent data such as the last hour. They have a rolling expiration period of 30 days, but that window may be shortened in the future. Analyses should only use these tables if they need results for the current (partial) day.
 - _Historical ping tables_ have exactly the same schema as their corresponding live ping tables, but they are populated only once per day (`12:00:00am` to `11:59:59pm` UTC) via an Airflow job and have a 25 month retention period. These tables are superior to the live ping tables for historical analysis because they never contain partial days, they have additional deduplication applied, and they are clustered on `sample_id`, allowing efficient queries on a 1% sample of clients. It is guaranteed that `document_id` is distinct within each day of each historical ping table, but it is still possible for a document to appear multiple times if a client sends the same payload across multiple UTC days. Note that this requirement is relaxed for older telemetry ping data that was backfilled from AWS; approximately 0.5% of documents are duplicated in `telemetry.main` and other historical ping tables for 2019-04-30 and earlier dates.
 - _Derived tables_ are populated by nightly [Airflow](https://workflow.telemetry.mozilla.org/home) jobs and are considered an implementation detail; their structure may change at any time at the discretion of the data platform team to allow refactoring or efficiency improvements.
+  - Tables (unsuffixed) may contain `client_id` or other id-level columns, e.g. [clients_daily_v6](https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/metadata.yaml). 
+  - Tables without `client_id` and aggregation use the suffix `_aggregates`, e.g. `addon_aggregates_v2` (https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/telemetry_derived/addon_aggregates_v2/metadata.yaml).   


Suggested change

-  - Tables without `client_id` and aggregation use the suffix `_aggregates`, e.g. `addon_aggregates_v2` (https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/telemetry_derived/addon_aggregates_v2/metadata.yaml).
+  - Tables without `client_id`-level information use the suffix `_aggregates`, e.g. `addon_aggregates_v2` (https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/telemetry_derived/addon_aggregates_v2/metadata.yaml).

Otherwise it sounds like these tables don't aggregate data.
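
For illustration only, the naming convention discussed in this thread could be checked with a small helper like the one below. This is a minimal sketch, not part of bigquery-etl: the function name `classify_table` is hypothetical, and the assumption that table names end in a `_v<N>` version suffix is inferred solely from the two examples in the diff (`clients_daily_v6`, `addon_aggregates_v2`).

```python
import re

# Assumption: derived table names end in a version suffix such as `_v6`,
# as in the two examples quoted in this PR.
_VERSION_RE = re.compile(r"_v\d+$")


def classify_table(name: str) -> str:
    """Hypothetical helper: report what a table name implies under the convention.

    Returns "aggregates" when the name (minus its version suffix) ends in
    `_aggregates`, meaning it should hold no `client_id`-level information.
    Returns "unsuffixed" otherwise, meaning it may contain `client_id` or
    other id-level columns.
    """
    base = _VERSION_RE.sub("", name)
    if base.endswith("_aggregates"):
        return "aggregates"
    return "unsuffixed"


if __name__ == "__main__":
    for table in ("clients_daily_v6", "addon_aggregates_v2"):
        print(table, "->", classify_table(table))
```

The helper only reports what the name implies; it does not inspect the table schema, so it cannot confirm whether a given table actually follows the convention.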

lucia-vargas-a and others added 11 commits June 28, 2024 17:24

Use case clarification.

38f2d1f

Update using_aggregates.md

3b2097c

Typo

Spell check.

7f290ab

Spell check.

eabc1a5

Bring back deleted line.

a826d6b

Merge remote-tracking branch 'origin/main'

dd9bcac

Merge remote-tracking branch 'origin/main'

38871a6

Merge remote-tracking branch 'origin/main'

6e34b00

Merge remote-tracking branch 'origin/main'

fa6f71f

Merge remote-tracking branch 'origin/main'

840c83b

Add the use of suffix in table naming.

48e49ce

lucia-vargas-a requested review from badboy and scholtzan September 26, 2024 16:15

scholtzan reviewed Sep 27, 2024

lucia-vargas-a added 2 commits September 27, 2024 13:34

Add the use of suffix in table naming.

a58d427

Spelling

636e68c

lucia-vargas-a requested a review from scholtzan September 27, 2024 11:50

scholtzan approved these changes Sep 27, 2024

scholtzan merged commit 8e3085d into main Sep 27, 2024
9 checks passed

scholtzan deleted the DENG-4972_add_suffix_usage_to_naming_in_bigquery branch September 27, 2024 15:23
