Update recommendations around working with live data

mozilla · Oct 18, 2024 · 9d28bb6
1 parent 3c2c20d
commit 9d28bb6
Showing 1 changed file with 85 additions and 5 deletions.
diff --git a/src/cookbooks/live_data.md b/src/cookbooks/live_data.md
@@ -1,18 +1,58 @@
 # Working with Live Data
 
+Use cases such as real-time monitoring, dashboards, or personalized user experiences require low-latency access to data. When working with live data, engineers need to evaluate the specific requirements, analyze the cost of the expected query patterns, and select an appropriate solution from the available options. Cost is a significant factor, particularly in high-frequency query scenarios like monitoring large datasets (e.g., Firefox-scale telemetry). This guide helps in choosing the right approach based on data size and latency needs.
+
+| Option                                 | Recommended Use Case                                   | Data Size/Complexity | Latency    | Setup Complexity |
+| -------------------------------------- | ------------------------------------------------------ | -------------------- | ---------- | ---------------- |
+| 1. Querying Live Tables Directly       | Small datasets or infrequent, simple queries           | Low                  | 10-30 min  | Low              |
+| 2. Scheduled Queries                   | Periodic updates (e.g., hourly/daily) for dashboards   | Medium               | 1h or more | Medium           |
+| 3. Using Materialized Views            | Large datasets, complex queries with low-latency needs | Medium to High       | 10-30 min  | Medium to High   |
+| 4. Dataflow                            | Very low-latency streaming for large datasets          | High                 | <10 min    | High             |
+| 5. Cloud function with Pub/Sub trigger | Low-latency for smaller subsets of data                | Medium               | <10 min    | Medium to High   |
+
+## Options for working with Live Data
+
 Live ping tables are the final destination for the telemetry ingestion pipeline. Incoming ping data is loaded into these tables approximately every 10 minutes, though a delay of up to 30 minutes is normal. Data in these tables is set to expire after 30 days.
 
-Data from live ping tables is expected to be accessed through user-facing views. The names of all views accessing live data should include the `_live` suffix (for example `monitoring.topsites_click_rate_live`). Live tables are generally clustered and partitioned by `submission_timestamp`, which allows for writing more efficient queries that filter over short time windows. When querying live data by running the same query multiple times ensure that query caching is turned off, otherwise newly arrived data might not appear in the query results.
+### 1. Querying Live Tables Directly
 
-Of note, user-facing views for live ping tables, like the ones for historical ping tables, will not automatically be provisioned. Instead, it is expected for live views to be more curated, with each view being tied to a specific use case.
+Data from live ping tables is expected to be accessed through user-facing views. The names of all views accessing live data should include the `_live` suffix (for example `monitoring.topsites_click_rate_live`). Live tables are generally clustered and partitioned by `submission_timestamp`, which allows for writing more efficient queries that filter over short time windows. When running the same query against live data multiple times, ensure that query caching is turned off; otherwise, newly arrived data might not appear in the query results.
 
 Live tables in `_derived` datasets can be queried from Redash and Looker; however, it is best practice to set up user-facing views for accessing the data.
 
-## Using Materialized Views
+Directly querying live tables allows access to the most up-to-date data. A user-facing view is typically set up to pull data from the last two days from live tables, with older data coming from stable tables. To ensure fresh data is returned, caching must be disabled.
+
+- **Recommended Use:** For smaller data volumes and low-complexity queries, or when queries are infrequent.
+- **Pros:**
+  - Easy to set up via bigquery-etl.
+- **Cons:**
+  - High cost and slow performance when querying large datasets or running frequent complex queries.
+- **Example:** [Active Hub Subscriptions](https://github.com/mozilla/bigquery-etl/blob/12470a846ef379cfba42995044f592c00fbc4e5b/sql/moz-fx-data-shared-prod/hubs/active_subscription_ids/view.sql#L1-L23)
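+
+The view pattern described above (recent rows from the live table, older rows from the stable table) can be sketched as follows. This is a minimal illustration, not the definition used in bigquery-etl: the project, dataset, table, and column names are hypothetical placeholders.
+
+```sql
+-- Hypothetical user-facing view; all names below are placeholders.
+CREATE OR REPLACE VIEW
+  `my-project.my_dataset.my_metric_live`
+AS
+-- Recent data (today and yesterday): read from the live table.
+SELECT
+  submission_timestamp,
+  document_id
+FROM
+  `my-project.my_dataset_live.my_ping_v1`
+WHERE
+  DATE(submission_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
+UNION ALL
+-- Older data: read from the stable table.
+SELECT
+  submission_timestamp,
+  document_id
+FROM
+  `my-project.my_dataset_stable.my_ping_v1`
+WHERE
+  DATE(submission_timestamp) < DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
+```
+
+The two date filters are complementary, so each day is served by exactly one of the two tables.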
 
-When creating a view on live data, engineers should consider the specific use cases for the view and do some analysis of the cost associated with query patterns. In particular, cost will likely be an issue if the view is going to be used for some type of monitoring that involves frequent queries on Firefox-scale data. [BigQuery offers support for materialized views](https://cloud.google.com/bigquery/docs/materialized-views-intro). Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. Creating a materialized view that provides the relevant aggregates will be a significant cost savings when working with a large volume of live data.
+### 2. Scheduled Queries
 
-When a materialized view is appropriate, it should be created in a relevant `_derived` dataset with a version suffix (following the same conventions as we do for derived tables populated via scheduled queries). It is then made user-facing via a simple `SELECT * FROM my_materialized_view` virtual view in a relevant user-facing dataset. More complex cases may need to union together multiple materialized views at this user-facing view level.
+Scheduled queries run periodically (e.g., hourly or daily) using Airflow and write their results to derived tables. For efficiency, destination tables should be partitioned by the update interval (e.g., hourly); otherwise, destination tables might need to be rewritten on each update.
+To support backfilling, scheduled queries should read historical data from stable tables and query live tables only for recently added data (e.g., the current and/or previous day).
+
+- **Recommended Use:** For medium to high data sizes and query complexity, especially when higher lag is acceptable and update intervals match table partitioning.
+- **Pros:**
+  - Built-in monitoring and notifications via Airflow.
+  - Faster for dashboards by querying pre-aggregated data.
+- **Cons:**
+  - Slightly more complex to set up via bigquery-etl.
+  - High cost or slow performance when querying large amounts of live data or running complex queries frequently.
+  - Increased load on Airflow with frequent updates.
+  - Duplicates from live tables are written to the derived table.
+  - Potentially more complicated logic required for table updates (appending columns, rewriting partitions, ...).
+- **Other Considerations:**
+  - The partition limit (10,000 per table in BigQuery) might restrict how much data can be stored.
+  - Delays in event writes to live tables can require scheduling adjustments and cause lag in destination tables.
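+
+As a rough sketch of an hourly scheduled query over a live table, assuming a hypothetical `@submission_hour` TIMESTAMP parameter supplied by the scheduler and placeholder table names, the query could look like this. Counting distinct `document_id` values is one possible way to keep duplicates in the live table out of the aggregate; it assumes `document_id` uniquely identifies a ping.
+
+```sql
+-- Hypothetical hourly aggregation; table name and parameter are assumptions.
+SELECT
+  TIMESTAMP_TRUNC(submission_timestamp, HOUR) AS submission_hour,
+  -- Live tables may contain duplicates, so count each document once.
+  COUNT(DISTINCT document_id) AS ping_count
+FROM
+  `my-project.my_dataset_live.my_ping_v1`
+WHERE
+  -- Restrict the scan to the single hour being processed.
+  submission_timestamp >= @submission_hour
+  AND submission_timestamp < TIMESTAMP_ADD(@submission_hour, INTERVAL 1 HOUR)
+GROUP BY
+  submission_hour
+```
+
+Writing each run to the matching hourly partition of the destination table avoids rewriting the whole table.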
+
+### 3. Using Materialized Views
+
+[BigQuery offers support for materialized views](https://cloud.google.com/bigquery/docs/materialized-views-intro). Materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. Creating a materialized view that provides the relevant aggregates can yield significant cost savings when working with a large volume of live data.
+
+When a materialized view is appropriate, it should be created in a relevant `_derived` dataset with a version suffix (following the same conventions as for derived tables populated via scheduled queries). It is then made user-facing via a simple `SELECT * FROM my_materialized_view` virtual view in a relevant user-facing dataset. More complex cases may need to union together multiple materialized views at this user-facing view level. The user-facing view should read historical data from stable tables and query materialized views only for the current day.
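+
+A user-facing view that combines the two sources might look like the sketch below. It is only an illustration under assumptions: the project, dataset, and column names (`window_start`, `event_count`) are hypothetical, and the materialized view is assumed to expose hourly aggregates.
+
+```sql
+-- Hypothetical user-facing view; all names below are placeholders.
+CREATE OR REPLACE VIEW
+  `my-project.my_dataset.my_metric_live`
+AS
+-- Current day: pre-aggregated rows from the materialized view.
+SELECT
+  window_start,
+  event_count
+FROM
+  `my-project.my_dataset_derived.my_new_materialized_view_live_v1`
+WHERE
+  DATE(window_start) >= CURRENT_DATE()
+UNION ALL
+-- Earlier days: the same aggregate computed from the stable table.
+SELECT
+  TIMESTAMP_TRUNC(submission_timestamp, HOUR) AS window_start,
+  COUNT(*) AS event_count
+FROM
+  `my-project.my_dataset_stable.my_ping_v1`
+WHERE
+  DATE(submission_timestamp) < CURRENT_DATE()
+GROUP BY
+  window_start
+```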
 
 It is generally recommended that materialized views over live data have `enable_refresh=true` and `refresh_interval_minutes=10`. The `refresh_interval_minutes` parameter determines the minimum time between refreshes.
 
@@ -34,6 +74,46 @@ IF NOT EXISTS my_dataset_derived.my_new_materialized_view_live_v1
     DATE(submission_timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)  -- limit amount of data to be backfilled ($$$)
 ```
 
+- **Recommended Use:** For medium to large datasets, complex queries, and when low data latency is essential.
+- **Pros:**
+  - Cost-effective as it only queries new data.
+  - Historical data comes from stable tables, so the queried data is eventually consistent and free of duplicates.
+  - Fast query performance due to data aggregation.
+- **Cons:**
+  - Complex to set up and maintain.
+  - Materialized views querying very large datasets may fall behind, causing BigQuery to default to querying base tables directly.
+  - Materialized view limitations, such as no support for UDFs and restricted JOIN capabilities.
+  - Delay in permissions being applied after creating materialized views.
+- **Other Considerations:**
+  - Materialized views inherit expiration settings from underlying tables.
+  - Clustering can improve performance.
+- **Example:** [Experiment Monitoring](https://github.com/mozilla/bigquery-etl/tree/77f8facb93908dc22b6c94ca98e72f8c38e9534b/sql_generators/experiment_monitoring/templates)
+
+### 4. Dataflow
+
+The telemetry ingestion pipeline is built using Apache Beam and Dataflow. Additional jobs can be created to read incoming messages directly from Pub/Sub and stream them into BigQuery.
+
+- **Recommended Use:** For very low-latency data processing when higher operational costs are acceptable.
+- **Pros:**
+  - Low latency.
+- **Cons:**
+  - High setup and operational complexity.
+  - Expensive to maintain.
+- **Example:** [Contextual Services Pipeline](https://github.com/mozilla/gcp-ingestion/tree/main/ingestion-beam/src/main/java/com/mozilla/telemetry/contextualservices)
+
+### 5. Cloud function with Pub/Sub trigger
+
+Google Cloud Functions can process incoming messages in Pub/Sub and stream the data to BigQuery.
+
+- **Recommended Use:** For low-latency data processing with smaller data volumes or subsets.
+- **Pros:**
+  - Easier to set up and operate than Dataflow.
+  - Low latency.
+- **Cons:**
+  - More expensive than Dataflow for large volumes of data.
+  - Subscription filtering can reduce costs, but does not eliminate them.
+- **Example:** [Glean Debug Viewer](https://github.com/mozilla/debug-ping-view/blob/main/functions/index.js#L52)
+
 ## Tools for Visualizing Live Data
 
 Currently, live data in BigQuery could be considered "near real time" or nearline (as opposed to online), with latency usually below 1 hour to each live table in our ingestion-sink code base.