From 106a7badba45dabb27d3ba5e6b38ace4eb1d63bf Mon Sep 17 00:00:00 2001 From: <> Date: Thu, 21 Nov 2024 17:53:38 +0000 Subject: [PATCH] [ci skip] Deployed 11d854427c with MkDocs version: 1.6.1 --- bqetl/index.html | 8 ++++---- search/search_index.json | 2 +- sitemap.xml.gz | Bin 127 -> 127 bytes 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/bqetl/index.html b/bqetl/index.html index ed705bc5d3b..4eff55a2779 100644 --- a/bqetl/index.html +++ b/bqetl/index.html @@ -2621,10 +2621,10 @@
initialize
Examples
diff --git a/search/search_index.json b/search/search_index.json index 3618588eee2..c9c3cc26456 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"bqetl/","title":"bqetl CLI","text":"The bqetl
command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.
Running some commands, for example to create or query tables, will require Mozilla GCP access.
"},{"location":"bqetl/#installation","title":"Installation","text":"Follow the Quick Start to set up bigquery-etl and the bqetl CLI.
"},{"location":"bqetl/#configuration","title":"Configuration","text":"bqetl
can be configured via the bqetl_project.yaml
file. See Configuration to find available configuration options.
To list all available commands in the bqetl CLI:
$ ./bqetl\n\nUsage: bqetl [OPTIONS] COMMAND [ARGS]...\n\n CLI tools for working with bigquery-etl.\n\nOptions:\n --version Show the version and exit.\n --help Show this message and exit.\n\nCommands:\n alchemer Commands for importing alchemer data.\n dag Commands for managing DAGs.\n dependency Build and use query dependency graphs.\n dryrun Dry run SQL.\n format Format SQL.\n glam Tools for GLAM ETL.\n mozfun Commands for managing mozfun routines.\n query Commands for managing queries.\n routine Commands for managing routines.\n stripe Commands for Stripe ETL.\n view Commands for managing views.\n backfill Commands for managing backfills.\n
See help for any command:
$ ./bqetl [command] --help\n
"},{"location":"bqetl/#autocomplete","title":"Autocomplete","text":"CLI autocomplete for bqetl
can be enabled for bash and zsh shells using the script/bqetl_complete
script:
source script/bqetl_complete\n
Then pressing tab after bqetl
commands should print possible commands, e.g. for zsh:
% bqetl query<TAB><TAB>\nbackfill -- Run a backfill for a query.\ncreate -- Create a new query with name...\ninfo -- Get information about all or specific...\ninitialize -- Run a full backfill on the destination...\nrender -- Render a query Jinja template.\nrun -- Run a query.\n...\n
source script/bqetl_complete
can also be added to ~/.bashrc
or ~/.zshrc
to persist settings across shell instances.
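For example, one way to persist this for zsh, run from the root of a local bigquery-etl checkout (illustrative only):
echo \"source $(pwd)/script/bqetl_complete\" >> ~/.zshrc\n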
For more details on shell completion, see the click documentation.
"},{"location":"bqetl/#query","title":"query
","text":"Commands for managing queries.
"},{"location":"bqetl/#create","title":"create
","text":"Create a new query with name ., for example: telemetry_derived.active_profiles. Use the --project_id
option to change the project the query is added to; default is moz-fx-data-shared-prod
. Views are automatically generated in the publicly facing dataset.
Usage
$ ./bqetl query create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--owner: Owner of the query (email address)\n--dag: Name of the DAG the query should be scheduled under. If there is no DAG name specified, the query is scheduled by default in DAG bqetl_default. To skip the automated scheduling use --no_schedule. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.\n--no_schedule: Using this option creates the query without scheduling information. Use `bqetl query schedule` to add it manually if required.\n
Examples
./bqetl query create telemetry_derived.deviations_v1 \\\n --owner=example@mozilla.com\n\n\n# The query version gets autocompleted to v1. Queries are created in the\n# _derived dataset and accompanying views in the public dataset.\n./bqetl query create telemetry.deviations --owner=example@mozilla.com\n
"},{"location":"bqetl/#schedule","title":"schedule
","text":"Schedule an existing query
Usage
$ ./bqetl query schedule [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--dag: Name of the DAG the query should be scheduled under. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.\n--depends_on_past: Only execute query if previous scheduled run succeeded.\n--task_name: Custom name for the Airflow task. By default the task name is a combination of the dataset and table name.\n
Examples
./bqetl query schedule telemetry_derived.deviations_v1 \\\n --dag=bqetl_deviations\n\n\n# Set a specific name for the task\n./bqetl query schedule telemetry_derived.deviations_v1 \\\n --dag=bqetl_deviations \\\n --task-name=deviations\n
"},{"location":"bqetl/#info","title":"info
","text":"Get information about all or specific queries.
Usage
$ ./bqetl query info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Get info for specific queries\n./bqetl query info telemetry_derived.*\n\n\n# Get cost and last update timestamp information\n./bqetl query info telemetry_derived.clients_daily_v6 \\\n --cost --last_updated\n
"},{"location":"bqetl/#backfill","title":"backfill
","text":"Run a backfill for a query. Additional parameters will get passed to bq.
Usage
$ ./bqetl query backfill [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--start_date: First date to be backfilled\n--end_date: Last date to be backfilled\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--dry_run: Dry run the backfill\n--max_rows: How many rows to return in the result\n--parallelism: How many threads to run backfill in parallel\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query.\n--checks: Whether to run checks during backfill\n--custom_query_path: Name of a custom query to run the backfill. If not given, the process runs as usual.\n--checks_file_name: Name of a custom data checks file to run after each partition backfill. E.g. custom_checks.sql. Optional.\n--scheduling_overrides: Pass overrides as a JSON string for scheduling sections: parameters and/or date_partition_parameter as needed.\n
Examples
# Backfill for specific date range\n# second comment line\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31\n\n\n# Dryrun backfill for specific date range and exclude date\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31 \\\n --exclude=2021-03-03 \\\n --dry_run\n
"},{"location":"bqetl/#run","title":"run
","text":"Run a query. Additional parameters will get passed to bq. If a destination_table is set, the query result will be written to BigQuery. Without a destination_table specified, the results are not stored. If the name
is not found within the sql/
folder, bqetl assumes it hasn't been generated yet and will start the generation process for all sql_generators/
files. This generation process will take some time and run dry run calls against BigQuery, but this is expected. Additional parameters (any parameters not listed in the Options) must come after the query name. Otherwise, the first parameter that is not an option is interpreted as the query name, and since it can't be found, the generation process will start.
Usage
$ ./bqetl query run [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--public_project_id: Project with publicly accessible data\n--destination_table: Destination table name results are written to. If not set, the query result will not be written to BigQuery.\n--dataset_id: Destination dataset results are written to. If not set, determines destination dataset based on query.\n
Examples
# Run a query by name\n./bqetl query run telemetry_derived.ssl_ratios_v1\n\n\n# Run a query file\n./bqetl query run /path/to/query.sql\n\n\n# Run a query and save the result to BigQuery\n./bqetl query run telemetry_derived.ssl_ratios_v1 --project_id=moz-fx-data-shared-prod --dataset_id=telemetry_derived --destination_table=ssl_ratios_v1\n
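Because extra arguments are forwarded to bq, a parameterized run might look like the following sketch; the parameter value is illustrative, and such arguments must come after the query name as described above:
# Run a query and pass a query parameter through to bq (illustrative)\n./bqetl query run telemetry_derived.ssl_ratios_v1 --parameter=submission_date:DATE:2021-03-01\n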
"},{"location":"bqetl/#run-multipart","title":"run-multipart
","text":"Run a multipart query.
Usage
$ ./bqetl query run-multipart [OPTIONS] [query_dir]\n\nOptions:\n\n--using: comma separated list of join columns to use when combining results\n--parallelism: Maximum number of queries to execute concurrently\n--dataset_id: Default dataset, if not specified all tables must be qualified with dataset\n--project_id: GCP project ID\n--temp_dataset: Dataset where intermediate query results will be temporarily stored, formatted as PROJECT_ID.DATASET_ID\n--destination_table: table where combined results will be written\n--time_partitioning_field: time partition field on the destination table\n--clustering_fields: comma separated list of clustering fields on the destination table\n--dry_run: Print bytes that would be processed for each part and don't run queries\n--parameters: query parameter(s) to pass when running parts\n--priority: Priority for BigQuery query jobs; BATCH priority will significantly slow down queries if reserved slots are not enabled for the billing project; defaults to INTERACTIVE\n--schema_update_options: Optional options for updating the schema.\n
Examples
# Run a multipart query\n./bqetl query run_multipart /path/to/query.sql\n
"},{"location":"bqetl/#validate","title":"validate
","text":"Validate a query. Checks formatting, scheduling information and dry runs the query.
Usage
$ ./bqetl query validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --ignore-dryrun-skip.\n--no_dryrun: Skip running dryrun. Default is False.\n
Examples
./bqetl query validate telemetry_derived.clients_daily_v6\n\n\n# Validate query not in shared-prod\n./bqetl query validate \\\n --use_cloud_function=false \\\n --project_id=moz-fx-data-marketing-prod \\\n ga_derived.blogs_goals_v1\n
"},{"location":"bqetl/#initialize","title":"initialize
","text":"Run a full backfill on the destination table for the query. Using this command will: - Create the table if it doesn't exist and run a full backfill. - Run a full backfill if the table exists and is empty. - Raise an exception if the table exists and has data, or if the table exists and the schema doesn't match the query. It supports query.sql
files that use the is_init() pattern. To run in parallel per sample_id, include a @sample_id parameter in the query.
Usage
$ ./bqetl query initialize [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--dry_run: Dry run the initialization\n--parallelism: Number of threads for parallel processing\n--skip_existing: Skip initialization for existing artifacts. This ensures that artifacts, like materialized views only get initialized if they don't already exist.\n--force: Run the initialization even if the destination table contains data.\n
Examples
Examples:\n - For init.sql files: ./bqetl query initialize telemetry_derived.ssl_ratios_v1\n - For query.sql files and parallel run: ./bqetl query initialize sql/moz-fx-data-shared-prod/telemetry_derived/clients_first_seen_v2/query.sql\n
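As an illustration of the is_init() pattern mentioned above, a query.sql might branch on initialization like this sketch (the table and columns are only examples, not part of this command's docs):
SELECT\n client_id,\n DATE(submission_timestamp) AS submission_date\nFROM\n `moz-fx-data-shared-prod.telemetry_stable.main_v5`\nWHERE\n {% if is_init() %} sample_id = @sample_id {% else %} DATE(submission_timestamp) = @submission_date {% endif %}\n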
"},{"location":"bqetl/#render","title":"render
","text":"Render a query Jinja template.
Usage
$ ./bqetl query render [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--output_dir: Output directory generated SQL is written to. If not specified, rendered queries are printed to console.\n--parallelism: Number of threads for parallel processing\n
Examples
./bqetl query render telemetry_derived.ssl_ratios_v1 \\\n --output-dir=/tmp\n
"},{"location":"bqetl/#schema","title":"schema
","text":"Commands for managing query schemas.
"},{"location":"bqetl/#update","title":"update
","text":"Update the query schema based on the destination table schema and the query schema. If no schema.yaml file exists for a query, one will be created.
Usage
$ ./bqetl query schema update [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--update_downstream: Update downstream dependencies. GCP authentication required.\n--tmp_dataset: GCP datasets for creating updated tables temporarily.\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--parallelism: Number of threads for parallel processing\n--is_init: Indicates whether the `is_init()` condition should be set to true or false.\n
Examples
./bqetl query schema update telemetry_derived.clients_daily_v6\n\n# Update schema including downstream dependencies (requires GCP)\n./bqetl query schema update telemetry_derived.clients_daily_v6 --update-downstream\n
"},{"location":"bqetl/#deploy","title":"deploy
","text":"Deploy the query schema.
Usage
$ ./bqetl query schema deploy [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--force: Deploy the schema file without validating that it matches the query\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--skip_existing: Skip updating existing tables. This option ensures that only new tables get deployed.\n--skip_external_data: Skip publishing external data, such as Google Sheets.\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query. Must be fully qualified (project.dataset.table).\n--parallelism: Number of threads for parallel processing\n
Examples
./bqetl query schema deploy telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#validate_1","title":"validate
","text":"Validate the query schema
Usage
$ ./bqetl query schema validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n
Examples
./bqetl query schema validate telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#dag","title":"dag
","text":"Commands for managing DAGs.
"},{"location":"bqetl/#info_1","title":"info
","text":"Get information about available DAGs.
Usage
$ ./bqetl dag info [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--with_tasks: Include scheduled tasks\n
Examples
# Get information about all available DAGs\n./bqetl dag info\n\n# Get information about a specific DAG\n./bqetl dag info bqetl_ssl_ratios\n\n# Get information about a specific DAG including scheduled tasks\n./bqetl dag info --with_tasks bqetl_ssl_ratios\n
"},{"location":"bqetl/#create_1","title":"create
","text":"Create a new DAG with name bqetl_, for example: bqetl_search When creating new DAGs, the DAG name must have a bqetl_
prefix. Created DAGs are added to the dags.yaml
file.
Usage
$ ./bqetl dag create [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--schedule_interval: Schedule interval of the new DAG. Schedule intervals can be either in CRON format or one of: once, hourly, daily, weekly, monthly, yearly or a timedelta []d[]h[]m\n--owner: Email address of the DAG owner\n--description: Description for DAG\n--tag: Tag to apply to the DAG\n--start_date: First date for which scheduled queries should be executed\n--email: Email addresses that Airflow will send alerts to\n--retries: Number of retries Airflow will attempt in case of failures\n--retry_delay: Time period Airflow will wait after failures before running failed tasks again\n
Examples
./bqetl dag create bqetl_core \\\n--schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"Tables derived from `core` pings sent by mobile applications.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-25\n\n\n# Create DAG and overwrite default settings\n./bqetl dag create bqetl_ssl_ratios --schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"The DAG schedules SSL ratios queries.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-20 \\\n--email=example2@mozilla.com \\\n--email=example3@mozilla.com \\\n--retries=2 \\\n--retry_delay=30m\n
"},{"location":"bqetl/#generate","title":"generate
","text":"Generate Airflow DAGs from DAG definitions.
Usage
$ ./bqetl dag generate [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n
Examples
# Generate all DAGs\n./bqetl dag generate\n\n# Generate a specific DAG\n./bqetl dag generate bqetl_ssl_ratios\n
"},{"location":"bqetl/#remove","title":"remove
","text":"Remove a DAG. This will also remove the scheduling information from the queries that were scheduled as part of the DAG.
Usage
$ ./bqetl dag remove [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n
Examples
# Remove a specific DAG\n./bqetl dag remove bqetl_vrbrowser\n
"},{"location":"bqetl/#dependency","title":"dependency
","text":"Build and use query dependency graphs.
"},{"location":"bqetl/#show","title":"show
","text":"Show table references in sql files.
Usage
$ ./bqetl dependency show [OPTIONS] [paths]\n
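For example (the path is illustrative):
# Show table references for all queries under a dataset directory\n./bqetl dependency show sql/moz-fx-data-shared-prod/telemetry_derived/\n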
"},{"location":"bqetl/#record","title":"record
","text":"Record table references in metadata. Fails if metadata already contains references section.
Usage
$ ./bqetl dependency record [OPTIONS] [paths]\n
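For example (again with an illustrative path):
# Record table references in the metadata of a specific query\n./bqetl dependency record sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1/\n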
"},{"location":"bqetl/#dryrun","title":"dryrun
","text":"Dry run SQL. Uses the dryrun Cloud Function by default which only has access to shared-prod. To dryrun queries accessing tables in another project use set --use-cloud-function=false
and ensure that the command line has access to a GCP service account.
Usage
$ ./bqetl dryrun [OPTIONS] [paths]\n\nOptions:\n\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_skip: Respect or ignore query skip configuration. Default is --respect-skip.\n--project: GCP project to perform dry run in when --use_cloud_function=False\n
Examples
Examples:\n./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/\n\n# Dry run SQL with tables that are not in shared prod\n./bqetl dryrun --use-cloud-function=false sql/moz-fx-data-marketing-prod/\n
"},{"location":"bqetl/#format","title":"format
","text":"Format SQL files.
Usage
$ ./bqetl format [OPTIONS] [paths]\n\nOptions:\n\n--check: do not write changes, just return status; return code 0 indicates nothing would change; return code 1 indicates some files would be reformatted\n--parallelism: Number of threads for parallel processing\n
Examples
# Format a specific file\n./bqetl format sql/moz-fx-data-shared-prod/telemetry/core/view.sql\n\n# Format all SQL files in `sql/`\n./bqetl format sql\n\n# Format standard in (will write to standard out)\necho 'SELECT 1,2,3' | ./bqetl format\n
"},{"location":"bqetl/#routine","title":"routine
","text":"Commands for managing routines for internal use.
"},{"location":"bqetl/#create_2","title":"create
","text":"Create a new routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_prodecure flag.
Usage
$ ./bqetl routine create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n
Examples
# Create a UDF\n./bqetl routine create --udf udf.array_slice\n\n\n# Create a stored procedure\n./bqetl routine create --stored_procedure udf.events_daily\n\n\n# Create a UDF in a project other than shared-prod\n./bqetl routine create --udf udf.active_last_week --project=moz-fx-data-marketing-prod\n
"},{"location":"bqetl/#info_2","title":"info
","text":"Get routine information.
Usage
$ ./bqetl routine info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n
Examples
# Get information about all internal routines in a specific dataset\n./bqetl routine info udf.*\n\n\n# Get usage information of specific routine\n./bqetl routine info --usages udf.get_key\n
"},{"location":"bqetl/#validate_2","title":"validate
","text":"Validate formatting of routines and run tests.
Usage
$ ./bqetl routine validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n
Examples
# Validate all routines\n./bqetl routine validate\n\n\n# Validate selected routines\n./bqetl routine validate udf.*\n
"},{"location":"bqetl/#publish","title":"publish
","text":"Publish routines to BigQuery. Requires service account access.
Usage
$ ./bqetl routine publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing udfs.\n
Examples
# Publish all routines\n./bqetl routine publish\n\n\n# Publish selected routines\n./bqetl routine publish udf.*\n
"},{"location":"bqetl/#rename","title":"rename
","text":"Rename routine or routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl routine rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Rename routine\n./bqetl routine rename udf.array_slice udf.list_slice\n\n\n# Rename routine matching a specific pattern\n./bqetl routine rename udf.array_* udf.list_*\n
"},{"location":"bqetl/#mozfun","title":"mozfun
","text":"Commands for managing public mozfun routines.
"},{"location":"bqetl/#create_3","title":"create
","text":"Create a new mozfun routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_prodecure flag. UDFs are added to the mozfun
project.
Usage
$ ./bqetl mozfun create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n
Examples
# Create a UDF\n./bqetl mozfun create --udf bytes.zero_right\n\n\n# Create a stored procedure\n./bqetl mozfun create --stored_procedure event_analysis.events_daily\n
"},{"location":"bqetl/#info_3","title":"info
","text":"Get mozfun routine information.
Usage
$ ./bqetl mozfun info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n
Examples
# Get information about all internal routines in a specific dataset\n./bqetl mozfun info hist.*\n\n\n# Get usage information of specific routine\n./bqetl mozfun info --usages hist.mean\n
"},{"location":"bqetl/#validate_3","title":"validate
","text":"Validate formatting of mozfun routines and run tests.
Usage
$ ./bqetl mozfun validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n
Examples
# Validate all routines\n./bqetl mozfun validate\n\n\n# Validate selected routines\n./bqetl mozfun validate hist.*\n
"},{"location":"bqetl/#publish_1","title":"publish
","text":"Publish mozfun routines. This command is used by Airflow only.
Usage
$ ./bqetl mozfun publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing udfs.\n
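Although this command is normally only run by Airflow, a local dry run might look like the following sketch:
# Dry run publishing all mozfun routines\n./bqetl mozfun publish --dry_run\n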
"},{"location":"bqetl/#rename_1","title":"rename
","text":"Rename mozfun routine or mozfun routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl mozfun rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Rename routine\n./bqetl mozfun rename hist.extract hist.ext\n\n\n# Rename routine matching a specific pattern\n./bqetl mozfun rename *.array_* *.list_*\n\n\n# Rename routine dataset\n./bqetl mozfun rename hist.* histogram.*\n
"},{"location":"bqetl/#backfill_1","title":"backfill
","text":"Commands for managing backfills.
"},{"location":"bqetl/#create_4","title":"create
","text":"Create a new backfill entry in the backfill.yaml file. Create a backfill.yaml file if it does not already exist.
Usage
$ ./bqetl backfill create [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--start_date: First date to be backfilled. Date format: yyyy-mm-dd\n--end_date: Last date to be backfilled. Date format: yyyy-mm-dd\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--watcher: Watcher of the backfill (email address)\n--custom_query_path: Path of the custom query to run the backfill. Optional.\n--shredder_mitigation: Whether to run a backfill using an auto-generated query that mitigates the shredder effect.\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n
Examples
./bqetl backfill create moz-fx-data-shared-prod.telemetry_derived.deviations_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31 \\\n --exclude=2021-03-03\n
"},{"location":"bqetl/#validate_4","title":"validate
","text":"Validate backfill.yaml file format and content.
Usage
$ ./bqetl backfill validate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
./bqetl backfill validate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# validate all backfill.yaml files if table is not specified\nUse the `--project_id` option to change the project to be validated;\ndefault is `moz-fx-data-shared-prod`.\n\n ./bqetl backfill validate\n
"},{"location":"bqetl/#info_4","title":"info
","text":"Get backfill(s) information from all or specific table(s).
Usage
$ ./bqetl backfill info [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Filter backfills with this status.\n
Examples
# Get info for specific table.\n./bqetl backfill info moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill info\n\n\n# Get info from all tables with specific status.\n./bqetl backfill info --status=Initiate\n
"},{"location":"bqetl/#scheduled","title":"scheduled
","text":"Get information on backfill(s) that require processing.
Usage
$ ./bqetl backfill scheduled [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Whether to get backfills to process or to complete.\n--json_path: None\n
Examples
# Get info for specific table.\n./bqetl backfill scheduled moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill scheduled\n
"},{"location":"bqetl/#initiate","title":"initiate
","text":"Process entry in backfill.yaml with Initiate status that has not yet been processed.
Usage
$ ./bqetl backfill initiate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--parallelism: Maximum number of queries to execute concurrently\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Initiate backfill entry for specific table\n./bqetl backfill initiate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"bqetl/#complete","title":"complete
","text":"Complete entry in backfill.yaml with Complete status that has not yet been processed..
Usage
$ ./bqetl backfill complete [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Complete backfill entry for specific table\n./bqetl backfill complete moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"cookbooks/common_workflows/","title":"Common bigquery-etl workflows","text":"This is a quick guide of how to perform common workflows in bigquery-etl using the bqetl
CLI.
For any workflow, the bigquery-etl repository needs to be locally available, for example by cloning the repository, and the bqetl
CLI needs to be installed by running ./bqetl bootstrap
.
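A minimal setup might look like this, assuming a standard clone of the repository:
git clone https://github.com/mozilla/bigquery-etl.git\ncd bigquery-etl\n./bqetl bootstrap\n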
The Creating derived datasets tutorial provides a more detailed guide on creating scheduled queries.
./bqetl query create <dataset>.<table>_<version>
<dataset>.<table>_<version>
query.sql
file that has been created in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
to write the query./bqetl query schema update <dataset>.<table>_<version>
to generate the schema.yaml
fileschema.yaml
metadata.yaml
file in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
./bqetl query validate <dataset>.<table>_<version>
to dry run and format the query./bqetl dag info
list or create a new DAG ./bqetl dag create <bqetl_new_dag>
./bqetl query schedule <dataset>.<table>_<version> --dag <bqetl_dag>
to schedule the querybqetl_artifact_deployment
Airflow DAG./bqetl query backfill --project-id <project id> <dataset>.<table>_<version>
query.sql
file of the query to be updated and make changes./bqetl query validate <dataset>.<table>_<version>
to dry run and format the query./bqetl dag generate <bqetl_dag>
to update the DAG file./bqetl query schema update <dataset>.<table>_<version>
to make local schema.yaml
updatesbqetl_artifact_deployment
Airflow DAGWe enforce consistent SQL formatting as part of CI. After adding or changing a query, use ./bqetl format
to apply formatting rules.
Directories and files passed as arguments to ./bqetl format
will be formatted in place, with directories recursively searched for files with a .sql
extension, e.g.:
$ echo 'SELECT 1,2,3' > test.sql\n$ ./bqetl format test.sql\nmodified test.sql\n1 file(s) modified\n$ cat test.sql\nSELECT\n 1,\n 2,\n 3\n
If no arguments are specified the script will read from stdin and write to stdout, e.g.:
$ echo 'SELECT 1,2,3' | ./bqetl format\nSELECT\n 1,\n 2,\n 3\n
To turn off sql formatting for a block of SQL, wrap it in format:off
and format:on
comments, like this:
SELECT\n -- format:off\n submission_date, sample_id, client_id\n -- format:on\n
"},{"location":"cookbooks/common_workflows/#add-a-new-field-to-a-table-schema","title":"Add a new field to a table schema","text":"Adding a new field to a table schema also means that the field has to propagate to several downstream tables, which makes it a more complex case.
query.sql
file inside the <dataset>.<table>
location and add the new definitions for the field../bqetl format <path to the query>
to format the query. Alternatively, run ./bqetl format $(git ls-tree -d HEAD --name-only)
validate the format of all queries that have been modified../bqetl query validate <dataset>.<table>
to dry run the query.jobs.create
permissions in moz-fx-data-shared-prod
), run:gcloud auth login --update-adc # to authenticate to GCP
gcloud config set project mozdata # to set the project
./bqetl query validate --use-cloud-function=false --project-id=mozdata <full path to the query file>
./bqetl query schema update <dataset>.<table> --update_downstream
to make local schema.yaml updates and update schemas of downstream dependencies.--update_downstream
is optional as it takes longer. It is recommended when you know that there are downstream dependencies whose schema.yaml
need to be updated, in which case, the update will happen automatically.--force
should only be used in very specific cases, particularly the clients_last_seen
tables. It skips some checks that would otherwise catch some error scenarios.bqetl_artifact_deployment
Airflow DAGThe following is an example to update a new field in telemetry_derived.clients_daily_v6
clients_daily_v6
query.sql
file and add new field definitions../bqetl format sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/query.sql
./bqetl query validate telemetry_derived.clients_daily_v6
.gcloud auth login --update-adc
./bqetl query schema update telemetry_derived.clients_daily_v6 --update_downstream --ignore-dryrun-skip --use-cloud-function=false
.schema.yaml
files of downstream dependencies, like clients_last_seen_v1
are updated.--use-cloud-function=false
is necessary when updating tables related to clients_daily
but optional for other tables. The dry run cloud function times out when fetching the deployed table schema for some of clients_daily
s downstream dependencies. Using GCP credentials instead works, however this means users need to have permissions to run queries in moz-fx-data-shared-prod
.bqetl_artifact_deployment
Airflow DAGDeleting a field from an existing table schema should be done only when is totally neccessary. If you decide to delete it: 1. Validate if there is data in the column and make sure data it is either backed up or it can be reprocessed. 1. Follow Big Query docs recommendations for deleting. 1. If the column size exceeds the allowed limit, consider setting the field as NULL. See this search_clients_daily_v8 PR for an example.
"},{"location":"cookbooks/common_workflows/#adding-a-new-mozfun-udf","title":"Adding a new mozfun UDF","text":"./bqetl mozfun create <dataset>.<name> --udf
.udf.sql
file in sql/mozfun/<dataset>/<name>/
and add UDF the definition and tests../bqetl mozfun validate <dataset>.<name>
for formatting and running tests.mozfun
DAG and clear latest run.Internal UDFs are usually only used by specific queries. If your UDF might be useful to others consider publishing it as a mozfun
UDF.
./bqetl routine create <dataset>.<name> --udf
udf.sql
in sql/moz-fx-data-shared-prod/<dataset>/<name>/
file and add UDF definition and tests./bqetl routine validate <dataset>.<name>
for formatting and running testsbqetl_artifact_deployment
Airflow DAGThe same steps as creating a new UDF apply for creating stored procedures, except when initially creating the procedure execute ./bqetl mozfun create <dataset>.<name> --stored_procedure
or ./bqetl routine create <dataset>.<name> --stored_procedure
for internal stored procedures.
udf.sql
file and make updates./bqetl mozfun validate <dataset>.<name>
or ./bqetl routine validate <dataset>.<name>
for formatting and running tests./bqetl mozfun rename <dataset>.<name> <new_dataset>.<new_name>
To provision a new BigQuery dataset for holding tables, you'll need to create a dataset_metadata.yaml
which will cause the dataset to be automatically deployed after merging. Changes to existing datasets may trigger manual operator approval (such as changing access policies). For more on access controls, see Data Access Workgroups in Mana.
The bqetl query create
command will automatically generate a skeleton dataset_metadata.yaml
file if the query name contains a dataset that is not yet defined.
See example with commentary for telemetry_derived
:
friendly_name: Telemetry Derived\ndescription: |-\n Derived data based on pings from legacy Firefox telemetry, plus many other\n general-purpose derived tables\nlabels: {}\n\n# Base ACL should can be:\n# \"derived\" for `_derived` datasets that contain concrete tables\n# \"view\" for user-facing datasets containing virtual views\ndataset_base_acl: derived\n\n# Datasets with user-facing set to true will be created both in shared-prod\n# and in mozdata; this should be false for all `_derived` datasets\nuser_facing: false\n\n# Most datasets can have mozilla-confidential access like below, but some\n# datasets will be defined with more restricted access or with additional\n# access for services; see \"Data Access Workgroups\" link above.\nworkgroup_access:\n- role: roles/bigquery.dataViewer\n members:\n - workgroup:mozilla-confidential\n
"},{"location":"cookbooks/common_workflows/#publishing-data","title":"Publishing data","text":"See also the reference for Public Data.
metadata.yaml
file of the query to be publishedpublic_bigquery: true
and optionally public_json: true
review_bugs
mozilla-public-data
init.sql
file exists for the query, change the destination project for the created table to mozilla-public-data
moz-fx-data-shared-prod
referencing the public datasetWhen adding a new library to the Python requirements, first add the library to the requirements and then add any meta-dependencies into constraints. Constraints are discovered by installing requirements into a fresh virtual environment. A dependency should be added to either requirements.txt
or constraints.txt
, but not both.
# Create a python virtual environment (not necessary if you have already\n# run `./bqetl bootstrap`)\npython3 -m venv venv/\n\n# Activate the virtual environment\nsource venv/bin/activate\n\n# If not installed:\npip install pip-tools --constraint requirements.in\n\n# Add the dependency to requirements.in e.g. Jinja2.\necho Jinja2==2.11.1 >> requirements.in\n\n# Compile hashes for new dependencies.\npip-compile --generate-hashes requirements.in\n\n# Deactivate the python virtual environment.\ndeactivate\n
"},{"location":"cookbooks/common_workflows/#making-a-pull-request-from-a-fork","title":"Making a pull request from a fork","text":"When opening a pull-request to merge a fork, the manual-trigger-required-for-fork
CI task will fail and some integration test tasks will be skipped. A user with repository write permissions will have to run the Push to upstream workflow and provide the <username>:<branch>
of the fork as parameter. The parameter will also show up in the logs of the manual-trigger-required-for-fork
CI task together with more detailed instructions. Once the workflow has been executed, the CI tasks, including the integration tests, of the PR will be executed.
The repository documentation is built using MkDocs. To generate and check the docs locally:
./bqetl docs generate --output_dir generated_docs
generated_docs
directorymkdocs serve
to start a local mkdocs
server.Each code files in the bigquery-etl repository can have a set of owners who are responsible to review and approve changes, and are automatically assigned as PR reviewers. The query files in the repo also benefit from the metadata labels to be able to validate and identify the data that is change controlled.
Here is a sample PR with the implementation of change control for contextual services data.
mozilla > telemetry
.metadata.yaml
for the query where you want to apply change control:owners
, add the selected GitHub identity, along with the list of owners' emails.labels
, add change_controlled: true
. This enables identifying change controlled data in the BigQuery console and in the Data Catalog.CODEOWNERS
:CODEOWNERS
file located in the root of the repo./sql_generators/active_users/templates/ @mozilla/kpi_table_reviewers
.script/bqetl query validate <query_path>
./sql-generators
, first run ./script/bqetl generate <path>
and then run script/bqetl query validate <query_path>
.This guide takes you through the creation of a simple derived dataset using bigquery-etl and scheduling it using Airflow, to be updated on a daily basis. It applies to the products we ship to customers, that use (or will use) the Glean SDK.
This guide also includes the specific instructions to set it as a public dataset. Make sure you only set the dataset public if you expect the data to be available outside Mozilla. Read our public datasets reference for context.
To illustrate the overall process, we will use a simple test case and a small Glean application for which we want to generate an aggregated dataset based on the raw ping data.
If you are interested in looking at the end result, you can view the pull request at mozilla/bigquery-etl#1760.
"},{"location":"cookbooks/creating_a_derived_dataset/#background","title":"Background","text":"Mozregression is a developer tool used to help developers and community members bisect builds of Firefox to find a regression range in which a bug was introduced. It forms a key part of our quality assurance process.
In this example, we will create a table of aggregated metrics related to mozregression
, that will be used in dashboards to help prioritize feature development inside Mozilla.
Set up bigquery-etl on your system per the instructions in the README.md.
"},{"location":"cookbooks/creating_a_derived_dataset/#create-the-query","title":"Create the Query","text":"The first step is to create a query file and decide on the name of your derived dataset. In this case, we'll name it org_mozilla_mozregression_derived.mozregression_aggregates
.
The org_mozilla_mozregression_derived
part represents a BigQuery dataset, which is essentially a container of tables. By convention, we use the _derived
postfix to hold derived tables like this one.
Run:
./bqetl query create <dataset>.<table_name>\n
In our example: ./bqetl query create org_mozilla_mozregression_derived.mozregression_aggregates --dag bqetl_internal_tooling\n
This command does three things:
metadata.yaml
and query.sql
representing the query to build the dataset in sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1
sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates
.bqetl_internal_tooling
.bqetl_default
.--no-schedule
is used, queries are not scheduled. This option is available for queries that run once or should be scheduled at a later time. The query can be manually scheduled at a later time.
project.
The next step is to modify the generated metadata.yaml
and query.sql
sections with specific information.
Let's look at what the metadata.yaml
file for our example looks like. Make sure to adapt this file for your own dataset.
friendly_name: mozregression aggregates\ndescription:\n Aggregated metrics of mozregression usage\nlabels:\n incremental: true\nowners:\n - wlachance@mozilla.com\nbigquery:\n time_partitioning:\n type: day\n field: date\n require_partition_filter: true\n expiration_days: null\n clustering:\n fields:\n - app_used\n - os\n
Most of the fields are self-explanatory. incremental
means that the table is updated incrementally, e.g. a new partition gets added/updated to the destination table whenever the query is run. For non-incremental queries the entire destination is overwritten when the query is executed.
For big datasets make sure to include optimization strategies. Our aggregation is small so it is only for illustration purposes that we are including a partition by the date
field and a clustering on app_used
and os
.
Setting the dataset as public means that it will be both in Mozilla's public BigQuery project and a world-accessible JSON endpoint, and is a process that requires a data review. The required labels are: public_json
, public_bigquery
and review_bugs
which refers to the Bugzilla bug where opening this data set up to the public was approved: we'll get to that in a subsequent section.
friendly_name: mozregression aggregates\ndescription:\n Aggregated metrics of mozregression usage\nlabels:\n incremental: true\n public_json: true\n public_bigquery: true\n review_bugs:\n - 1691105\nowners:\n - wlachance@mozilla.com\nbigquery:\n time_partitioning:\n type: day\n field: date\n require_partition_filter: true\n expiration_days: null\n clustering:\n fields:\n - app_used\n - os\n
"},{"location":"cookbooks/creating_a_derived_dataset/#fill-out-the-query","title":"Fill out the query","text":"Now that we've filled out the metadata, we can look into creating a query. In many ways, this is similar to creating a SQL query to run on BigQuery in other contexts (e.g. on sql.telemetry.mozilla.org or the BigQuery console)-- the key difference is that we use a @submission_date
parameter so that the query can be run on a day's worth of data to update the underlying table incrementally.
Test your query and add it to the query.sql
file.
In our example, the query is tested in sql.telemetry.mozilla.org
, and the query.sql
file looks like this:
SELECT\n DATE(submission_timestamp) AS date,\n client_info.app_display_version AS mozregression_version,\n metrics.string.usage_variant AS mozregression_variant,\n metrics.string.usage_app AS app_used,\n normalized_os AS os,\n mozfun.norm.truncate_version(normalized_os_version, \"minor\") AS os_version,\n count(DISTINCT(client_info.client_id)) AS distinct_clients,\n count(*) AS total_uses\nFROM\n `moz-fx-data-shared-prod`.org_mozilla_mozregression.usage\nWHERE\n DATE(submission_timestamp) = @submission_date\n AND client_info.app_display_version NOT LIKE '%.dev%'\nGROUP BY\n date,\n mozregression_version,\n mozregression_variant,\n app_used,\n os,\n os_version;\n
We use the truncate_version
UDF to omit the patch level for MacOS and Linux, which should both reduce the size of the dataset as well as make it more difficult to identify individual clients in an aggregated dataset.
We also have a short clause (client_info.app_display_version NOT LIKE '%.dev%'
) to omit developer versions from the aggregates: this makes sure we're not including people developing or testing mozregression itself in our results.
Now that we've written our query, we can format it and validate it. Once that's done, we run:
./bqetl query validate <dataset>.<table>\n
For our example: ./bqetl query validate org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
If there are no problems, you should see no output."},{"location":"cookbooks/creating_a_derived_dataset/#creating-the-table-schema","title":"Creating the table schema","text":"Use bqetl to set up the schema that will be used to create the table.
Review the schema.YAML generated as an output of the following command, and make sure all data types are set correctly and according to the data expected from the query.
./bqetl query schema update <dataset>.<table>\n
For our example:
./bqetl query schema update org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
"},{"location":"cookbooks/creating_a_derived_dataset/#creating-a-dag","title":"Creating a DAG","text":"BigQuery-ETL has some facilities in it to automatically add your query to telemetry-airflow (our instance of Airflow).
Before scheduling your query, you'll need to find an Airflow DAG to run it off of. In some cases, one may already exist that makes sense to use for your dataset -- look in dags.yaml
at the root or run ./bqetl dag info
. In this particular case, there's no DAG that really makes sense -- so we'll create a new one:
./bqetl dag create <dag_name> --schedule-interval \"0 4 * * *\" --owner <email_for_notifications> --description \"Add a clear description of the DAG here\" --start-date <YYYY-MM-DD> --tag impact/<tier>\n
For our example, the starting date is 2020-06-01
and we use a schedule interval of 0 4 \\* \\* \\*
(4am UTC daily) instead of \"daily\" (12am UTC daily) to make sure this isn't competing for slots with desktop and mobile product ETL.
The --tag impact/tier3
parameter specifies that this DAG is considered \"tier 3\". For a list of valid tags and their descriptions see Airflow Tags.
When creating a new DAG, while it is still under active development and assumed to fail during this phase, the DAG can be tagged as --tag triage/no_triage
. That way it will be ignored by the person on Airflow Triage. Once the active development is done, the triage/no_triage
tag can be removed and problems will addressed during the Airflow Triage process.
./bqetl dag create bqetl_internal_tooling --schedule-interval \"0 4 * * *\" --owner wlachance@mozilla.com --description \"This DAG schedules queries for populating queries related to Mozilla's internal developer tooling (e.g. mozregression).\" --start-date 2020-06-01 --tag impact/tier_3\n
"},{"location":"cookbooks/creating_a_derived_dataset/#scheduling-your-query","title":"Scheduling your query","text":"Queries are automatically scheduled during creation in the DAG set using the option --dag
, or in the default DAG bqetl_default
when this option is not used.
If the query was created with --no-schedule
, it is possible to manually schedule the query via the bqetl
tool:
./bqetl query schedule <dataset>.<table> --dag <dag_name> --task-name <task_name>\n
Here is the command for our example. Notice the name of the table as created with the suffix _v1.
./bqetl query schedule org_mozilla_mozregression_derived.mozregression_aggregates_v1 --dag bqetl_internal_tooling --task-name mozregression_aggregates__v1\n
Note that we are scheduling the generation of the underlying table which is org_mozilla_mozregression_derived.mozregression_aggregates_v1
rather than the view.
This is for public datasets only! You can skip this step if you're only creating a dataset for Mozilla-internal use.
Before a dataset can be made public, it needs to go through data review according to our data publishing process. This means filing a bug, answering a few questions, and then finding a data steward to review your proposal.
The dataset we're using in this example is very simple and straightforward and does not have any particularly sensitive data, so the data review is very simple. You can see the full details in bug 1691105.
"},{"location":"cookbooks/creating_a_derived_dataset/#create-a-pull-request","title":"Create a Pull Request","text":"Now is a good time to create a pull request with your changes to GitHub. This is the usual git workflow:
git checkout -b <new_branch_name>\ngit add dags.yaml dags/<dag_name>.py sql/moz-fx-data-shared-prod/telemetry/<view> sql/moz-fx-data-shared-prod/<dataset>/<table>\ngit commit\ngit push origin <new_branch_name>\n
And next is the workflow for our specific example:
git checkout -b mozregression-aggregates\ngit add dags.yaml dags/bqetl_internal_tooling.py sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1\ngit commit\ngit push origin mozregression-aggregates\n
Then create your pull request, either from the GitHub web interface or the command line, per your preference.
Note At this point, the CI is expected to fail because the schema does not exist yet in BigQuery. This will be handled in the next step.
This example assumes that origin
points to your fork. Adjust the last push invocation appropriately if you have a different remote set.
Speaking of forks, note that if you're making this pull request from a fork, many jobs will currently fail due to lack of credentials. In fact, even if you're pushing to the origin, you'll get failures because the table is not yet created. That brings us to the next step, but before going further it's generally best to get someone to review your work: at this point we have more than enough for people to provide good feedback on.
"},{"location":"cookbooks/creating_a_derived_dataset/#creating-an-initial-table","title":"Creating an initial table","text":"Once the PR has been approved, deploy the schema to bqetl using this command:
./bqetl query schema deploy <schema>.<table>\n
For our example:
./bqetl query schema deploy org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
"},{"location":"cookbooks/creating_a_derived_dataset/#backfilling-a-table","title":"Backfilling a table","text":"Note For large sets of data, follow the recommended practices for backfills.
"},{"location":"cookbooks/creating_a_derived_dataset/#initiating-the-backfill","title":"Initiating the backfill:","text":"Create a backfill schedule entry to (re)-process data in your table:
bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD>\n
--shredder_mitigation
parameter in the backfill command:bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD> --shredder_mitigation\n
Fill out the missing details:
Open a Pull Request with the backfill entry, see this example. Once merged, you should receive a notification in around an hour that processing has started. Your backfill data will be temporarily placed in a staging location.
Watchers need to join the #dataops-alerts Slack channel. They will be notified via Slack when processing is complete, and you can validate your backfill data.
Validate that the backfill data looks like what you expect (calculate important metrics, look for nulls, etc.)
If the data is valid, open a Pull Request, setting the backfill status to Complete, see this example. Once merged, you should receive a notification in around an hour that swapping has started. Current production data will be backed up and the staging backfill data will be swapped into production.
You will be notified when swapping is complete.
Note. If your backfill is complex (backfill validation fails for e.g.), it is recommended to talk to someone in Data Engineering or Data SRE (#data-help) to process the backfill via the backfill DAG.
"},{"location":"cookbooks/creating_a_derived_dataset/#completing-the-pull-request","title":"Completing the Pull Request","text":"At this point, the table exists in Bigquery so you are able to: - Find and re-run the CI of your PR and make sure that all tests pass - Merge your PR.
"},{"location":"cookbooks/testing/","title":"How to Run Tests","text":"This repository uses pytest
:
# create a venv\npython3.11 -m venv venv/\n\n# install pip-tools for managing dependencies\n./venv/bin/pip install pip-tools -c requirements.in\n\n# install python dependencies with pip-sync (provided by pip-tools)\n./venv/bin/pip-sync --pip-args=--no-deps requirements.txt\n\n# run pytest with all linters and 8 workers in parallel\n./venv/bin/pytest --black --flake8 --isort --mypy-ignore-missing-imports --pydocstyle -n 8\n\n# use -k to selectively run a set of tests that matches the expression `udf`\n./venv/bin/pytest -k udf\n\n# narrow down testpaths for quicker turnaround when selecting a single test\n./venv/bin/pytest -o \"testpaths=tests/sql\" -k mobile_search_aggregates_v1\n\n# run integration tests with 4 workers in parallel\ngcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS\nexport GOOGLE_PROJECT_ID=bigquery-etl-integration-test\ngcloud config set project $GOOGLE_PROJECT_ID\n./venv/bin/pytest -m integration -n 4\n
To provide authentication credentials for the Google Cloud API the GOOGLE_APPLICATION_CREDENTIALS
environment variable must be set to the file path of the JSON file that contains the service account key. See Mozilla BigQuery API Access instructions to request credentials if you don't already have them.
Include a comment like -- Tests
 followed by one or more query statements after the UDF in the SQL file where it is defined. Within that file, each statement that does not define a temporary function is collected as a test and executed independently of the other tests in the file.
Each test must use the UDF and throw an error to fail. Assert functions defined in sql/mozfun/assert/
may be used to evaluate outputs. Tests must not use any query parameters and should not reference any tables. Each test that is expected to fail must be preceded by a comment like #xfail
, similar to a SQL dialect prefix in the BigQuery Cloud Console.
For example:
CREATE TEMP FUNCTION udf_example(option INT64) AS (\n CASE\n WHEN option > 0 then TRUE\n WHEN option = 0 then FALSE\n ELSE ERROR(\"invalid option\")\n END\n);\n-- Tests\nSELECT\n mozfun.assert.true(udf_example(1)),\n mozfun.assert.false(udf_example(0));\n#xfail\nSELECT\n udf_example(-1);\n#xfail\nSELECT\n udf_example(NULL);\n
"},{"location":"cookbooks/testing/#how-to-configure-a-generated-test","title":"How to Configure a Generated Test","text":"Queries are tested by running the query.sql
with test-input tables and comparing the result to an expected table. 1. Make a directory for test resources named tests/sql/{project}/{dataset}/{table}/{test_name}/
, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/test_single_day
- table
must match a directory named like {dataset}/{table}
, e.g. telemetry_derived/clients_last_seen_v1
- test_name
should start with test_
, e.g. test_single_day
- If test_name
is test_init
or test_script
, then the test will run the query with is_init()
set to true
or script.sql
respectively; otherwise, the test will run query.sql
1. Add .yaml
files for input tables, e.g. clients_daily_v6.yaml
- Include the dataset prefix if it's set in the tested query, e.g. analysis.clients_last_seen_v1.yaml
- Include the project prefix if it's set in the tested query, e.g. moz-fx-other-data.new_dataset.table_1.yaml
- This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"analysis.clients_last_seen_v1\", \"clients_last_seen_v1\")
1. Add .sql
files for input view queries, e.g. main_summary_v4.sql
- Don't include a CREATE ... AS
clause - Fully qualify table names as `{project}.{dataset}.table`
- Include the dataset prefix if it's set in the tested query, e.g. telemetry.main_summary_v4.sql
- This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"telemetry.main_summary_v4\", \"main_summary_v4\")
1. Add expect.yaml
to validate the result - DATE
and DATETIME
type columns in the result are coerced to strings using .isoformat()
- Columns named generated_time
are removed from the result before comparing to expect
because they should not be static - NULL
values should be omitted in expect.yaml
. If a column is expected to be NULL
don't add it to expect.yaml
. (Be careful with spreading previous rows (-<<: *base
) here) 1. Optionally add .schema.json
files for input table schemas to the table directory, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/clients_daily_v6.schema.json
. These tables will be available for every test in the suite. The schema.json
 file needs to match the table name in the query.sql
file. If it has project and dataset listed there, the schema file also needs project and dataset. 1. Optionally add query_params.yaml
to define query parameters - query_params
must be a list
Tests of is_init()
statements are supported, similarly to other generated tests. Simply name the test test_init
. The other guidelines still apply.
generated_time should be a required DATETIME field to ensure minimal validation. For input table files, the formats accepted by bq load are supported: yaml and json format are supported and must contain an array of rows which are converted in memory to ndjson before loading; use yaml for readability or ndjson for compatibility with bq load. For expect.yaml, the yaml, json and ndjson formats are supported; use yaml for readability or ndjson for compatibility with bq load. Specifying time_partitioning_field will cause the table to use it for time partitioning; yaml, json and ndjson are supported here as well, with yaml preferred for readability or json for compatibility with bq load. Query parameters are defined with name, type or type_, and value; query_parameters.yaml may be used instead of query_params.yaml, but they are mutually exclusive; yaml, json and ndjson are supported, with yaml preferred for readability. Running jobs that require credentials needs a key for the circleci service account in the bigquery-etl-integration-test project. To run CircleCI jobs locally, run circleci build and set the required environment variables GOOGLE_PROJECT_ID and GCLOUD_SERVICE_KEY
:gcloud_service_key=`cat /path/to/key_file.json`\n\n# to run a specific job, e.g. integration:\ncircleci build --job integration \\\n --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n\n# to run all jobs\ncircleci build \\\n --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n
"},{"location":"moz-fx-data-shared-prod/udf/","title":"Udf","text":""},{"location":"moz-fx-data-shared-prod/udf/#active_n_weeks_ago-udf","title":"active_n_weeks_ago (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters","title":"Parameters","text":"INPUTS
x INT64, n INT64\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#active_values_from_days_seen_map-udf","title":"active_values_from_days_seen_map (UDF)","text":"Given a map of representing activity for STRING key
s, this function returns an array of which key
s were active for the time period in question. start_offset should be at most 0. n_bits should be at most the remaining bits.
INPUTS
days_seen_bits_map ARRAY<STRUCT<key STRING, value INT64>>, start_offset INT64, n_bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_engine_searches-udf","title":"add_monthly_engine_searches (UDF)","text":"This function specifically windows searches into calendar-month windows. This means groups are not necessarily directly comparable, since different months have different numbers of days. On the first of each month, a new month is appended, and the first month is dropped. If the date is not the first of the month, the new entry is added to the last element in the array. For example, if we were adding 12 to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]: On the first of the month, the result would be [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12] On any other day of the month, the result would be [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 24] This happens for every aggregate (searches, ad clicks, etc.)
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_2","title":"Parameters","text":"INPUTS
prev STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, curr STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_searches-udf","title":"add_monthly_searches (UDF)","text":"Adds together two engine searches structs. Each engine searches struct has a MAP[engine -> search_counts_struct]. We want to add add together the prev and curr's values for a certain engine. This allows us to be flexible with the number of engines we're using.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_3","title":"Parameters","text":"INPUTS
prev ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, curr ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, submission_date DATE\n
OUTPUTS
value\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_searches_by_index-udf","title":"add_searches_by_index (UDF)","text":"Return sums of each search type grouped by the index. Results are ordered by index.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_4","title":"Parameters","text":"INPUTS
searches ARRAY<STRUCT<total_searches INT64, tagged_searches INT64, search_with_ads INT64, ad_click INT64, index INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_active_addons-udf","title":"aggregate_active_addons (UDF)","text":"This function selects most frequently occuring value for each addon_id, using the latest value in the input among ties. The type for active_addons is ARRAY>, i.e. the output of SELECT ARRAY_CONCAT_AGG(active_addons) FROM telemetry.main_summary_v4
, and is left unspecified to allow changes to the fields of the STRUCT."},{"location":"moz-fx-data-shared-prod/udf/#parameters_5","title":"Parameters","text":"
INPUTS
active_addons ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_map_first-udf","title":"aggregate_map_first (UDF)","text":"Returns an aggregated map with all the keys and the first corresponding value from the given maps
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_6","title":"Parameters","text":"INPUTS
maps ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_counts-udf","title":"aggregate_search_counts (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_7","title":"Parameters","text":"INPUTS
search_counts ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_map-udf","title":"aggregate_search_map (UDF)","text":"Aggregates the total counts of the given search counters
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_8","title":"Parameters","text":"INPUTS
engine_searches_list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_11_zeroes_then-udf","title":"array_11_zeroes_then (UDF)","text":"An array of 11 zeroes, followed by a supplied value
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_9","title":"Parameters","text":"INPUTS
val INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_drop_first_and_append-udf","title":"array_drop_first_and_append (UDF)","text":"Drop the first element of an array, and append the given element. Result is an array with the same length as the input.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_10","title":"Parameters","text":"INPUTS
arr ANY TYPE, append ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_of_12_zeroes-udf","title":"array_of_12_zeroes (UDF)","text":"An array of 12 zeroes
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_11","title":"Parameters","text":"INPUTS
) AS ( [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_slice-udf","title":"array_slice (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_12","title":"Parameters","text":"INPUTS
arr ANY TYPE, start_index INT64, end_index INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitcount_lowest_7-udf","title":"bitcount_lowest_7 (UDF)","text":"This function counts the 1s in lowest 7 bits of an INT64
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_13","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_365-udf","title":"bitmask_365 (UDF)","text":"A bitmask for 365 bits
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_14","title":"Parameters","text":"INPUTS
) AS ( CONCAT(b'\\x1F', REPEAT(b'\\xFF', 45\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_28-udf","title":"bitmask_lowest_28 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_15","title":"Parameters","text":"INPUTS
) AS ( 0x0FFFFFFF\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_7-udf","title":"bitmask_lowest_7 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_16","title":"Parameters","text":"INPUTS
) AS ( 0x7F\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_range-udf","title":"bitmask_range (UDF)","text":"Returns a bitmask that can be used to return a subset of an integer representing a bit array. The start_ordinal argument is an integer specifying the starting position of the slice, with start_ordinal = 1 indicating the first bit. The length argument is the number of bits to include in the mask. The arguments were chosen to match the semantics of the SUBSTR function; see https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#substr
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_17","title":"Parameters","text":"INPUTS
start_ordinal INT64, _length INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_active_in_range-udf","title":"bits28_active_in_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_18","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_days_since_seen-udf","title":"bits28_days_since_seen (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_19","title":"Parameters","text":"INPUTS
bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_from_string-udf","title":"bits28_from_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_20","title":"Parameters","text":"INPUTS
s STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_range-udf","title":"bits28_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_21","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_retention-udf","title":"bits28_retention (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_22","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_dates-udf","title":"bits28_to_dates (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_23","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
OUTPUTS
ARRAY<DATE>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_string-udf","title":"bits28_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_24","title":"Parameters","text":"INPUTS
bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_from_offsets-udf","title":"bits_from_offsets (UDF)","text":"Returns a bit pattern of type BYTES compactly encoding the given array of positive integer offsets. This is primarily useful to generate a compact encoding of dates on which a feature was used, with arbitrarily long history. Example aggregation: sql bits_from_offsets( ARRAY_AGG(IF(foo, DATE_DIFF(anchor_date, submission_date, DAY), NULL) IGNORE NULLS) )
The resulting value can be cast to an INT64 representing the most recent 64 days via: sql CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64)
Or representing the most recent 28 days (compatible with bits28 functions) via: sql CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64) << 36 >> 36
INPUTS
offsets ARRAY<INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_active_n_weeks_ago-udf","title":"bits_to_active_n_weeks_ago (UDF)","text":"Given a BYTE and an INT64, return whether the user was active that many weeks ago. NULL input returns NULL output.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_26","title":"Parameters","text":"INPUTS
b BYTES, n INT64\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_seen-udf","title":"bits_to_days_seen (UDF)","text":"Given a BYTE, get the number of days the user was seen. NULL input returns NULL output.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_27","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_first_seen-udf","title":"bits_to_days_since_first_seen (UDF)","text":"Given a BYTES, return the number of days since the client was first seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the result is 0-indexed, meaning that for \\x01, it will return 0. Results showed this being between 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_first_seen(b BYTES) AS (( SELECT MAX(n) FROM UNNEST(GENERATE_ARRAY( 0, 8 * BYTE_LENGTH(b))) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_seen.sql
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_28","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_seen-udf","title":"bits_to_days_since_seen (UDF)","text":"Given a BYTES, return the number of days since the client was last seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the results are 0-indexed, meaning \\x01 will return 0. Tests showed this being 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_seen(b BYTES) AS (( SELECT MIN(n) FROM UNNEST(GENERATE_ARRAY(0, 364)) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_first_seen.sql
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_29","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bool_to_365_bits-udf","title":"bool_to_365_bits (UDF)","text":"Convert a boolean to 365 bit byte array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_30","title":"Parameters","text":"INPUTS
val BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#boolean_histogram_to_boolean-udf","title":"boolean_histogram_to_boolean (UDF)","text":"Given histogram h, return TRUE if it has a value in the \"true\" bucket, or FALSE if it has a value in the \"false\" bucket, or NULL otherwise. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L309-L317
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_31","title":"Parameters","text":"INPUTS
histogram STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_28_bits-udf","title":"coalesce_adjacent_days_28_bits (UDF)","text":"We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bits representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_32","title":"Parameters","text":"INPUTS
prev INT64, curr INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_365_bits-udf","title":"coalesce_adjacent_days_365_bits (UDF)","text":"Coalesce previous data's PCD with the new data's PCD. We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bytes representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_33","title":"Parameters","text":"INPUTS
prev BYTES, curr BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_28_bits-udf","title":"combine_adjacent_days_28_bits (UDF)","text":"Combines two bit patterns. The first pattern represents activity over a 28-day period ending \"yesterday\". The second pattern represents activity as observed today (usually just 0 or 1). We shift the bits in the first pattern by one to set the new baseline as \"today\", then perform a bitwise OR of the two patterns.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_34","title":"Parameters","text":"INPUTS
prev INT64, curr INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_365_bits-udf","title":"combine_adjacent_days_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_35","title":"Parameters","text":"INPUTS
prev BYTES, curr BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_days_seen_maps-udf","title":"combine_days_seen_maps (UDF)","text":"The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs (aka maps) where each entry gives the bit pattern for days in which we saw a ping for a given user in a given key. We combine the bit patterns for the previous day and the current day, returning a single map. See udf.combine_experiment_days
for a more specific example of this approach.
INPUTS
-- prev ARRAY<STRUCT<key STRING, value INT64>>, -- curr ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_experiment_days-udf","title":"combine_experiment_days (UDF)","text":"The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs where each entry gives the bit pattern for days in which we saw a ping for a given user in a given experiment. We combine the bit patterns for the previous day and the current day, returning a single array of experiment structs.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_37","title":"Parameters","text":"INPUTS
-- prev ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>, -- curr ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#country_code_to_flag-udf","title":"country_code_to_flag (UDF)","text":"For a given two-letter ISO 3166-1 alpha-2 country code, returns a string consisting of two Unicode regional indicator symbols, which is rendered in supporting fonts (such as in the BigQuery console or STMO) as flag emoji. This is just for fun. See: - https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 - https://en.wikipedia.org/wiki/Regional_Indicator_Symbol
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_38","title":"Parameters","text":"INPUTS
country_code string\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#days_seen_bytes_to_rfm-udf","title":"days_seen_bytes_to_rfm (UDF)","text":"Return the frequency, recency, and T from a BYTE array, as defined in https://lifetimes.readthedocs.io/en/latest/Quickstart.html#the-shape-of-your-data RFM refers to Recency, Frequency, and Monetary value.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_39","title":"Parameters","text":"INPUTS
days_seen_bytes BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#days_since_created_profile_as_28_bits-udf","title":"days_since_created_profile_as_28_bits (UDF)","text":"Takes in a difference between submission date and profile creation date and returns a bit pattern representing the profile creation date IFF the profile date is the same as the submission date or no more than 6 days earlier. Analysis has shown that client-reported profile creation dates are much less reliable outside of this range and cannot be used as reliable indicators of new profile creation.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_40","title":"Parameters","text":"INPUTS
days_since_created_profile INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#deanonymize_event-udf","title":"deanonymize_event (UDF)","text":"Rename struct fields in anonymous event tuples to meaningful names.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_41","title":"Parameters","text":"INPUTS
tuple STRUCT<f0_ INT64, f1_ STRING, f2_ STRING, f3_ STRING, f4_ STRING, f5_ ARRAY<STRUCT<key STRING, value STRING>>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#decode_int64-udf","title":"decode_int64 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_42","title":"Parameters","text":"INPUTS
raw BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#dedupe_array-udf","title":"dedupe_array (UDF)","text":"Return an array containing only distinct values of the given array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_43","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_clients-udf","title":"distribution_model_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_44","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_ga_metrics-udf","title":"distribution_model_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_45","title":"Parameters","text":"INPUTS
) RETURNS STRING AS ( 'helloworld'\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_installs-udf","title":"distribution_model_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_46","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#event_code_points_to_string-udf","title":"event_code_points_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_47","title":"Parameters","text":"INPUTS
code_points ANY TYPE\n
OUTPUTS
ARRAY<INT64>\n
"},{"location":"moz-fx-data-shared-prod/udf/#experiment_search_metric_to_array-udf","title":"experiment_search_metric_to_array (UDF)","text":"Used for testing only. Reproduces the string transformations done in experiment_search_events_live_v1 materialized views.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_48","title":"Parameters","text":"INPUTS
metric ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_count_histogram_value-udf","title":"extract_count_histogram_value (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_49","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_type-udf","title":"extract_document_type (UDF)","text":"Extract the document type from a table name e.g. _TABLE_SUFFIX.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_50","title":"Parameters","text":"INPUTS
table_name STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_version-udf","title":"extract_document_version (UDF)","text":"Extract the document version from a table name e.g. _TABLE_SUFFIX.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_51","title":"Parameters","text":"INPUTS
table_name STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"This is a performance optimization compared to the more general mozfun.hist.extract for cases where only the histogram sum is needed. It must support all the same format variants as mozfun.hist.extract but this simplification is necessary to keep the main_summary query complexity in check.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_52","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_schema_validation_path-udf","title":"extract_schema_validation_path (UDF)","text":"Return a path derived from an error message in payload_bytes_error
INPUTS
error_message STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.
Fenix originally used an 8-digit app_build format>
In short it is yDDDHHmm
:
The last date seen with an 8-digit build ID is 2020-08-10.
Newer builds use a 10-digit format> where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.
This function tolerates both formats.
After using this you may wish to DATETIME_TRUNC(result, DAY)
for grouping by build date.
INPUTS
app_build STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_clients-udf","title":"funnel_derived_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_55","title":"Parameters","text":"INPUTS
os STRING, first_seen_date DATE, build_id STRING, attribution_source STRING, attribution_ua STRING, startup_profile_selection_reason STRING, distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_ga_metrics-udf","title":"funnel_derived_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_56","title":"Parameters","text":"INPUTS
device_category STRING, browser STRING, operating_system STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_installs-udf","title":"funnel_derived_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_57","title":"Parameters","text":"INPUTS
silent BOOLEAN, submission_timestamp TIMESTAMP, build_id STRING, attribution STRING, distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#ga_is_mozilla_browser-udf","title":"ga_is_mozilla_browser (UDF)","text":"Determine if a browser in a Google Analytics data is produced by Mozilla
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_58","title":"Parameters","text":"INPUTS
browser STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct-udf","title":"geo_struct (UDF)","text":"Convert geoip lookup fields to a struct, replacing '??' with NULL. Returns NULL if if required field country would be NULL. Replaces '??' with NULL because '??' is a placeholder that may be used if there was an issue during geoip lookup in hindsight.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_59","title":"Parameters","text":"INPUTS
country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct_set_defaults-udf","title":"geo_struct_set_defaults (UDF)","text":"Convert geoip lookup fields to a struct, replacing NULLs with \"??\". This allows for better joins on those fields, but needs to be changed back to NULL at the end of the query.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_60","title":"Parameters","text":"INPUTS
country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#get_key-udf","title":"get_key (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_61","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_62","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_nanos-udf","title":"glean_timespan_nanos (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_63","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_seconds-udf","title":"glean_timespan_seconds (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_64","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#gzip_length_footer-udf","title":"gzip_length_footer (UDF)","text":"Given a gzip compressed byte string, extract the uncompressed size from the footer. WARNING: THIS FUNCTION IS NOT RELIABLE FOR ARBITRARY GZIP STREAMS. It should, however, be safe to use for checking the decompressed size of payload in payload_bytes_decoded (and NOT payload_bytes_raw) because that payload is produced by the decoder and limited to conditions where the footer is accurate. From https://stackoverflow.com/a/9213826 First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 232. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 232 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.) Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.) Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_65","title":"Parameters","text":"INPUTS
compressed BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_max_key_with_nonzero_value-udf","title":"histogram_max_key_with_nonzero_value (UDF)","text":"Find the largest numeric bucket that contains a value greater than zero. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L253-L266
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_66","title":"Parameters","text":"INPUTS
histogram STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_merge-udf","title":"histogram_merge (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_67","title":"Parameters","text":"INPUTS
histogram_list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_normalize-udf","title":"histogram_normalize (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_68","title":"Parameters","text":"INPUTS
histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n
OUTPUTS
STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_percentiles-udf","title":"histogram_percentiles (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_69","title":"Parameters","text":"INPUTS
histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_mean-udf","title":"histogram_to_mean (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_70","title":"Parameters","text":"INPUTS
histogram ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_threshold_count-udf","title":"histogram_to_threshold_count (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_71","title":"Parameters","text":"INPUTS
histogram STRING, threshold INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#hmac_sha256-udf","title":"hmac_sha256 (UDF)","text":"Given a key and message, return the HMAC-SHA256 hash. This algorithm can be found in Wikipedia: https://en.wikipedia.org/wiki/HMAC#Implementation This implentation is validated against the NIST test vectors. See test/validation/hmac_sha256.py for more information.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_72","title":"Parameters","text":"INPUTS
key BYTES, message BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#int_to_365_bits-udf","title":"int_to_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_73","title":"Parameters","text":"INPUTS
value INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#int_to_hex_string-udf","title":"int_to_hex_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_74","title":"Parameters","text":"INPUTS
value INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_75","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_int_map-udf","title":"json_extract_int_map (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_76","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_mode_last-udf","title":"json_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_77","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#keyed_histogram_get_sum-udf","title":"keyed_histogram_get_sum (UDF)","text":"Take a keyed histogram of type STRUCT, extract the histogram of the given key, and return the sum value"},{"location":"moz-fx-data-shared-prod/udf/#parameters_78","title":"Parameters","text":"
INPUTS
keyed_histogram ANY TYPE, target_key STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_append_to_json_string-udf","title":"kv_array_append_to_json_string (UDF)","text":"Returns a JSON string which has the pair
appended to the provided input
JSON string. NULL is also valid for input
. Examples: udf.kv_array_append_to_json_string('{\"foo\":\"bar\"}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"foo\":\"bar\",\"baz\":\"boo\"}' udf.kv_array_append_to_json_string('{}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"baz\": \"boo\"}'
INPUTS
input STRING, arr ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_to_json_string-udf","title":"kv_array_to_json_string (UDF)","text":"Returns a JSON string representing the input key-value array. Value type must be able to be represented as a string - this function will cast to a string. At Mozilla, the schema for a map is STRUCT>>. To use this with that representation, it should be as udf.kv_array_to_json_string(struct.key_value)
."},{"location":"moz-fx-data-shared-prod/udf/#parameters_80","title":"Parameters","text":"
INPUTS
kv_arr ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#main_summary_scalars-udf","title":"main_summary_scalars (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_81","title":"Parameters","text":"INPUTS
processes ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_bing_revenue_country_to_country_code-udf","title":"map_bing_revenue_country_to_country_code (UDF)","text":"For use by LTV revenue join only. Maps the Bing country to a country code. Only keeps the country codes we want to aggregate on.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_82","title":"Parameters","text":"INPUTS
country STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_mode_last-udf","title":"map_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_83","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_revenue_country-udf","title":"map_revenue_country (UDF)","text":"Only for use by the LTV Revenue join. Maps country codes to the codes we have in the revenue dataset. Buckets small Bing countries into \"other\".
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_84","title":"Parameters","text":"INPUTS
engine STRING, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_sum-udf","title":"map_sum (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_85","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#marketing_attributable_desktop-udf","title":"marketing_attributable_desktop (UDF)","text":"This is a UDF to help distinguish if acquired desktop clients are attributable to marketing efforts or not
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_86","title":"Parameters","text":"INPUTS
medium STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#merge_scalar_user_data-udf","title":"merge_scalar_user_data (UDF)","text":"Given an array of scalar metric data that might have duplicate values for a metric, merge them into one value.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_87","title":"Parameters","text":"INPUTS
aggs ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n
OUTPUTS
ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mod_uint128-udf","title":"mod_uint128 (UDF)","text":"This function returns \"dividend mod divisor\" where the dividend and the result is encoded in bytes, and divisor is an integer.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_88","title":"Parameters","text":"INPUTS
dividend BYTES, divisor INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mode_last-udf","title":"mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_89","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_90","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#monetized_search-udf","title":"monetized_search (UDF)","text":"Stub monetized_search UDF for tests
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_91","title":"Parameters","text":"INPUTS
engine STRING, country STRING, distribution_id STRING, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#new_monthly_engine_searches_struct-udf","title":"new_monthly_engine_searches_struct (UDF)","text":"This struct represents the past year's worth of searches. Each month has its own entry, hence 12.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_92","title":"Parameters","text":"INPUTS
) AS ( STRUCT( udf.array_of_12_zeroes(\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_fenix_metrics-udf","title":"normalize_fenix_metrics (UDF)","text":"Accepts a glean metrics struct as input and returns a modified struct that nulls out histograms for older versions of the Glean SDK that reported pathological binning; see https://bugzilla.mozilla.org/show_bug.cgi?id=1592930
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_93","title":"Parameters","text":"INPUTS
telemetry_sdk_build STRING, metrics ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_baseline_client_info-udf","title":"normalize_glean_baseline_client_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_94","title":"Parameters","text":"INPUTS
client_info ANY TYPE, metrics ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_ping_info-udf","title":"normalize_glean_ping_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_95","title":"Parameters","text":"INPUTS
ping_info ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_main_payload-udf","title":"normalize_main_payload (UDF)","text":"Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_96","title":"Parameters","text":"INPUTS
payload ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_metadata-udf","title":"normalize_metadata (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_97","title":"Parameters","text":"INPUTS
metadata ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_monthly_searches-udf","title":"normalize_monthly_searches (UDF)","text":"Sum up the monthy search count arrays by normalized engine
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_98","title":"Parameters","text":"INPUTS
engine_searches ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_os-udf","title":"normalize_os (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_99","title":"Parameters","text":"INPUTS
os STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_search_engine-udf","title":"normalize_search_engine (UDF)","text":"Return normalized engine name for recognized engines This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_100","title":"Parameters","text":"INPUTS
engine STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#null_if_empty_list-udf","title":"null_if_empty_list (UDF)","text":"Return NULL if list is empty, otherwise return list. This cannot be done with NULLIF because NULLIF does not support arrays.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_101","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#one_as_365_bits-udf","title":"one_as_365_bits (UDF)","text":"One represented as a byte array of 365 bits
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_102","title":"Parameters","text":"INPUTS
) AS ( CONCAT(REPEAT(b'\\x00', 45\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_desktop-udf","title":"organic_vs_paid_desktop (UDF)","text":"This is a UDF to help distinguish desktop client attribution as being organic or paid
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_103","title":"Parameters","text":"INPUTS
medium STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_mobile-udf","title":"organic_vs_paid_mobile (UDF)","text":"This is a UDF to help distinguish mobile client attribution as being organic or paid
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_104","title":"Parameters","text":"INPUTS
adjust_network STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pack_event_properties-udf","title":"pack_event_properties (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_105","title":"Parameters","text":"INPUTS
event_properties ANY TYPE, indices ANY TYPE\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
"},{"location":"moz-fx-data-shared-prod/udf/#parquet_array_sum-udf","title":"parquet_array_sum (UDF)","text":"Sum an array from a parquet-derived field. These are lists of an element
that contain the field value.
INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#parse_desktop_telemetry_uri-udf","title":"parse_desktop_telemetry_uri (UDF)","text":"Parses and labels the components of a telemetry desktop ping submission uri Per https://docs.telemetry.mozilla.org/concepts/pipeline/http_edge_spec.html#special-handling-for-firefox-desktop-telemetry the format is /submit/telemetry/docId/docType/appName/appVersion/appUpdateChannel/appBuildID e.g. /submit/telemetry/ce39b608-f595-4c69-b6a6-f7a436604648/main/Firefox/61.0a1/nightly/20180328030202
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_107","title":"Parameters","text":"INPUTS
uri STRING\n
OUTPUTS
STRUCT<namespace STRING, document_id STRING, document_type STRING, app_name STRING, app_version STRING, app_update_channel STRING, app_build_id STRING>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#parse_iso8601_date-udf","title":"parse_iso8601_date (UDF)","text":"Take a ISO 8601 date or date and time string and return a DATE. Return null if parse fails. Possible formats: 2019-11-04, 2019-11-04T21:15:00+00:00, 2019-11-04T21:15:00Z, 20191104T211500Z
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_108","title":"Parameters","text":"INPUTS
date_str STRING\n
OUTPUTS
DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_clients-udf","title":"partner_org_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_109","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_ga_metrics-udf","title":"partner_org_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_110","title":"Parameters","text":"INPUTS
) RETURNS STRING AS ( (SELECT 'hola_world' AS partner_org\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_installs-udf","title":"partner_org_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_111","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_leading_set_bit-udf","title":"pos_of_leading_set_bit (UDF)","text":"Returns the 0-based index of the first set bit. No set bits returns NULL.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_112","title":"Parameters","text":"INPUTS
i INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_trailing_set_bit-udf","title":"pos_of_trailing_set_bit (UDF)","text":"Identical to bits28_days_since_seen. Returns a 0-based index of the rightmost set bit in the passed bit pattern or null if no bits are set (bits = 0). To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_113","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#product_info_with_baseline-udf","title":"product_info_with_baseline (UDF)","text":"Similar to mozfun.norm.product_info(), but this UDF also handles \"baseline\" apps that were introduced differentiate for certain apps whether data is sent through Glean or core pings. This UDF has been temporarily introduced as part of https://bugzilla.mozilla.org/show_bug.cgi?id=1775216
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_114","title":"Parameters","text":"INPUTS
legacy_app_name STRING, normalized_os STRING\n
OUTPUTS
STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pseudonymize_ad_id-udf","title":"pseudonymize_ad_id (UDF)","text":"Pseudonymize Ad IDs, handling opt-outs.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_115","title":"Parameters","text":"INPUTS
hashed_ad_id STRING, key BYTES\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#quantile_search_metric_contribution-udf","title":"quantile_search_metric_contribution (UDF)","text":"This function returns how much of one metric is contributed by the quantile of another metric. Quantile variable should add an offset to get the requried percentile value. Example: udf.quantile_search_metric_contribution(sap, search_with_ads, sap_percentiles[OFFSET(9)]) It returns search_with_ads if sap value in top 10% volumn else null.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_116","title":"Parameters","text":"INPUTS
metric1 FLOAT64, metric2 FLOAT64, quantile FLOAT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#round_timestamp_to_minute-udf","title":"round_timestamp_to_minute (UDF)","text":"Floor a timestamp object to the given minute interval.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_117","title":"Parameters","text":"INPUTS
timestamp_expression TIMESTAMP, minute INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#safe_crc32_uuid-udf","title":"safe_crc32_uuid (UDF)","text":"Calculate the CRC-32 hash of a 36-byte UUID, or NULL if the value isn't 36 bytes. This implementation is limited to an exact length because recursion does not work. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_118","title":"Parameters","text":"INPUTS
) AS ( [ 0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685, 2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995, 2125561021, 3887607047, 2428444049, 498536548, 1789927666, 4089016648, 2227061214, 450548861, 1843258603, 4107580753, 2211677639, 325883990, 1684777152, 4251122042, 2321926636, 335633487, 1661365465, 4195302755, 2366115317, 997073096, 1281953886, 3579855332, 2724688242, 1006888145, 1258607687, 3524101629, 2768942443, 901097722, 1119000684, 3686517206, 2898065728, 853044451, 1172266101, 3705015759, 2882616665, 651767980, 1373503546, 3369554304, 3218104598, 565507253, 1454621731, 3485111705, 3099436303, 671266974, 1594198024, 3322730930, 2970347812, 795835527, 1483230225, 3244367275, 3060149565, 1994146192, 31158534, 2563907772, 4023717930, 1907459465, 112637215, 2680153253, 3904427059, 2013776290, 251722036, 2517215374, 3775830040, 2137656763, 141376813, 2439277719, 3865271297, 1802195444, 476864866, 2238001368, 4066508878, 1812370925, 453092731, 2181625025, 4111451223, 1706088902, 314042704, 2344532202, 4240017532, 1658658271, 366619977, 2362670323, 4224994405, 1303535960, 984961486, 2747007092, 3569037538, 1256170817, 1037604311, 2765210733, 3554079995, 1131014506, 879679996, 2909243462, 3663771856, 1141124467, 855842277, 2852801631, 3708648649, 1342533948, 654459306, 3188396048, 3373015174, 1466479909, 544179635, 3110523913, 3462522015, 1591671054, 702138776, 2966460450, 3352799412, 1504918807, 783551873, 3082640443, 3233442989, 3988292384, 2596254646, 62317068, 1957810842, 3939845945, 2647816111, 81470997, 1943803523, 3814918930, 2489596804, 225274430, 2053790376, 3826175755, 2466906013, 167816743, 2097651377, 4027552580, 2265490386, 503444072, 1762050814, 4150417245, 2154129355, 426522225, 1852507879, 4275313526, 2312317920, 282753626, 1742555852, 4189708143, 2394877945, 397917763, 1622183637, 3604390888, 2714866558, 953729732, 1340076626, 3518719985, 2797360999, 1068828381, 1219638859, 3624741850, 2936675148, 906185462, 1090812512, 3747672003, 2825379669, 829329135, 1181335161, 3412177804, 3160834842, 628085408, 1382605366, 3423369109, 3138078467, 570562233, 1426400815, 3317316542, 2998733608, 733239954, 1555261956, 3268935591, 3050360625, 752459403, 1541320221, 2607071920, 3965973030, 1969922972, 40735498, 2617837225, 3943577151, 1913087877, 83908371, 2512341634, 3803740692, 2075208622, 213261112, 2463272603, 3855990285, 2094854071, 198958881, 2262029012, 4057260610, 1759359992, 534414190, 2176718541, 4139329115, 1873836001, 414664567, 2282248934, 4279200368, 1711684554, 285281116, 2405801727, 4167216745, 1634467795, 376229701, 2685067896, 3608007406, 1308918612, 956543938, 2808555105, 3495958263, 1231636301, 1047427035, 2932959818, 3654703836, 1088359270, 936918000, 2847714899, 3736837829, 1202900863, 817233897, 3183342108, 3401237130, 1404277552, 615818150, 3134207493, 3453421203, 1423857449, 601450431, 3009837614, 3294710456, 1567103746, 711928724, 3020668471, 3272380065, 1510334235, 755167117 ]\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#safe_sample_id-udf","title":"safe_sample_id (UDF)","text":"Stably hash a client_id to an integer between 0 and 99, or NULL if client_id isn't 36 bytes
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_119","title":"Parameters","text":"INPUTS
client_id STRING\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#search_counts_map_sum-udf","title":"search_counts_map_sum (UDF)","text":"Calculate the sums of search counts per source and engine
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_120","title":"Parameters","text":"INPUTS
entries ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_28_bits_one_day-udf","title":"shift_28_bits_one_day (UDF)","text":"Shift input bits one day left and drop any bits beyond 28 days.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_121","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_365_bits_one_day-udf","title":"shift_365_bits_one_day (UDF)","text":"Shift input bits one day left and drop any bits beyond 365 days.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_122","title":"Parameters","text":"INPUTS
x BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_one_day-udf","title":"shift_one_day (UDF)","text":"Returns the bitfield shifted by one day, 0 for NULL
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_123","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#smoot_usage_from_28_bits-udf","title":"smoot_usage_from_28_bits (UDF)","text":"Calculates a variety of metrics based on bit patterns of daily usage for the smoot_usage_* tables.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_124","title":"Parameters","text":"INPUTS
bit_arrays ARRAY<STRUCT<days_created_profile_bits INT64, days_active_bits INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#vector_add-udf","title":"vector_add (UDF)","text":"This function adds two vectors. The two vectors can have different length. If one vector is null, the other vector will be returned directly.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_125","title":"Parameters","text":"INPUTS
a ARRAY<INT64>, b ARRAY<INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#zero_as_365_bits-udf","title":"zero_as_365_bits (UDF)","text":"Zero represented as a 365-bit byte array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_126","title":"Parameters","text":"INPUTS
) AS ( REPEAT(b'\\x00', 46\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#zeroed_array-udf","title":"zeroed_array (UDF)","text":"Generates an array if all zeroes, of arbitrary length
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_127","title":"Parameters","text":"INPUTS
len INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/","title":"Udf js","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#bootstrap_percentile_ci-udf","title":"bootstrap_percentile_ci (UDF)","text":"Calculate a confidence interval using an efficient bootstrap sampling technique for a given percentile of a histogram. This implementation relies on the stdlib.js library and the binomial quantile function (https://github.com/stdlib-js/stats-base-dists-binomial-quantile/) for randomly sampling from a binomial distribution.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters","title":"Parameters","text":"INPUTS
percentiles ARRAY<INT64>, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>, metric STRING\n
OUTPUTS
ARRAY<STRUCT<metric STRING, statistic STRING, point FLOAT64, lower FLOAT64, upper FLOAT64, parameter STRING>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#crc32-udf","title":"crc32 (UDF)","text":"Calculate the CRC-32 hash of an input string. The implementation here could be optimized. In particular, it calculates a lookup table on every invocation which could be cached and reused. In practice, though, this implementation appears to be fast enough that further optimization is not yet warranted. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_1","title":"Parameters","text":"INPUTS
data STRING\n
OUTPUTS
INT64 DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#decode_uri_attribution-udf","title":"decode_uri_attribution (UDF)","text":"URL decodes the raw firefox_installer.install.attribution string to a STRUCT. The fields campaign, content, dlsource, dltoken, experiment, medium, source, ua, variation the string are extracted. If any value is (not+set) it is converted to (not set) to match the text from GA when the fields are not set.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_2","title":"Parameters","text":"INPUTS
attribution STRING\n
OUTPUTS
STRUCT<campaign STRING, content STRING, dlsource STRING, dltoken STRING, experiment STRING, medium STRING, source STRING, ua STRING, variation STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#extract_string_from_bytes-udf","title":"extract_string_from_bytes (UDF)","text":"Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function extracts string data from payload
which is in bytes.
INPUTS
payload BYTES\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#gunzip-udf","title":"gunzip (UDF)","text":"Unzips a GZIP string. This implementation relies on the zlib.js library (https://github.com/imaya/zlib.js) and the atob function for decoding base64.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_4","title":"Parameters","text":"INPUTS
input BYTES\n
OUTPUTS
STRING DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_mean_ci-udf","title":"jackknife_mean_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the mean of an array of values for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_5","title":"Parameters","text":"INPUTS
n_buckets INT64, values_per_bucket ARRAY<FLOAT64>\n
OUTPUTS
STRUCT<low FLOAT64, high FLOAT64, pm FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_percentile_ci-udf","title":"jackknife_percentile_ci (UDF)","text":"Calculate a confidence interval using a jackknife resampling technique for a given percentile of a histogram.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_6","title":"Parameters","text":"INPUTS
percentile FLOAT64, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>\n
OUTPUTS
STRUCT<low FLOAT64, high FLOAT64, percentile FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_ratio_ci-udf","title":"jackknife_ratio_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the weighted mean of an array of ratios for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(active_days_in_week) AS active_days_in_week, SUM(wau) AS wau FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_ratio_ci(20, ARRAY_AGG(STRUCT(CAST(active_days_in_week AS float64), CAST(wau as FLOAT64)))) AS intensity FROM bucketed GROUP BY submission_date
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_7","title":"Parameters","text":"INPUTS
n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n
OUTPUTS
intensity FROM bucketed GROUP BY submission_date */ CREATE OR REPLACE FUNCTION udf_js.jackknife_ratio_ci( n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_sum_ci-udf","title":"jackknife_sum_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the sum of an array of counts for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate count per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(dau) AS dau_sum FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_sum_ci(ARRAY_AGG(dau_sum)).* FROM bucketed GROUP BY submission_date
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_8","title":"Parameters","text":"INPUTS
n_buckets INT64, counts_per_bucket ARRAY<INT64>\n
OUTPUTS
STRUCT<total INT64, low INT64, high INT64, pm INT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_events-udf","title":"json_extract_events (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_9","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<event_process STRING, event_timestamp INT64, event_category STRING, event_object STRING, event_method STRING, event_string_value STRING, event_map_values ARRAY<STRUCT<key STRING, value STRING>>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":"Returns a parsed struct from a JSON string representing a histogram. This implementation uses JavaScript and is provided for performance comparison; see udf/udf_json_extract_histogram for a pure SQL implementation that will likely be more usable in practice.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_10","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
STRUCT<bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_keyed_histogram-udf","title":"json_extract_keyed_histogram (UDF)","text":"Returns an array of parsed structs from a JSON string representing a keyed histogram. This is likely only useful for histograms that weren't properly parsed to fields, so ended up embedded in an additional_properties JSON blob. Normally, keyed histograms will be modeled as a key/value struct where the values are JSON representations of single histograms. There is no pure SQL equivalent to this function, since BigQuery does not provide any functions for listing or iterating over keysn in a JSON map.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_11","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_missing_cols-udf","title":"json_extract_missing_cols (UDF)","text":"Extract missing columns from additional properties. More generally, get a list of nodes from a JSON blob. Array elements are indicated as [...]. param input: The JSON blob to explode param indicates_node: An array of strings. If a key's value is an object, and contains one of these values, that key is returned as a node. param known_nodes: An array of strings. If a key is in this array, it is returned as a node. Notes: - Use indicates_node for things like histograms. For example ['histogram_type'] will ensure that each histogram will be returned as a missing node, rather than the subvalues within the histogram (e.g. values, sum, etc.) - Use known_nodes if you're aware of a missing section, like ['simpleMeasurements'] See here for an example usage https://sql.telemetry.mozilla.org/queries/64460/source
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_12","title":"Parameters","text":"INPUTS
input STRING, indicates_node ARRAY<STRING>, known_nodes ARRAY<STRING>\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_active_addons-udf","title":"main_summary_active_addons (UDF)","text":"Add fields from additional_attributes to active_addons in main pings. Return an array instead of a \"map\" for backwards compatibility. The INT64 columns from BigQuery may be passed as strings, so parseInt before returning them if they will be coerced to BOOL. The fields from additional_attributes due to union types: integer or boolean for foreignInstall and userDisabled; string or number for version. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L422-L449
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_13","title":"Parameters","text":"INPUTS
active_addons ARRAY<STRUCT<key STRING, value STRUCT<app_disabled BOOL, blocklisted BOOL, description STRING, foreign_install INT64, has_binary_components BOOL, install_day INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL, name STRING, scope INT64, signed_state INT64, type STRING, update_day INT64, user_disabled INT64, version STRING>>>, active_addons_json STRING\n
OUTPUTS
ARRAY<STRUCT<addon_id STRING, blocklisted BOOL, name STRING, user_disabled BOOL, app_disabled BOOL, version STRING, scope INT64, type STRING, foreign_install BOOL, has_binary_components BOOL, install_day INT64, update_day INT64, signed_state INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_addon_scalars-udf","title":"main_summary_addon_scalars (UDF)","text":"Parse scalars from payload.processes.dynamic into map columns for each value type. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L385-L399
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_14","title":"Parameters","text":"INPUTS
dynamic_scalars_json STRING, dynamic_keyed_scalars_json STRING\n
OUTPUTS
STRUCT<keyed_boolean_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value BOOL>>>>, keyed_uint_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value INT64>>>>, string_addon_scalars ARRAY<STRUCT<key STRING, value STRING>>, keyed_string_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value STRING>>>>, uint_addon_scalars ARRAY<STRUCT<key STRING, value INT64>>, boolean_addon_scalars ARRAY<STRUCT<key STRING, value BOOL>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_disabled_addons-udf","title":"main_summary_disabled_addons (UDF)","text":"Report the ids of the addons which are in the addonDetails but not in the activeAddons. They are the disabled addons (possibly because they are legacy). We need this as addonDetails may contain both disabled and active addons. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L451-L464
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_15","title":"Parameters","text":"INPUTS
active_addon_ids ARRAY<STRING>, addon_details_json STRING\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#parse_sponsored_interaction-udf","title":"parse_sponsored_interaction (UDF)","text":"Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function parses the sponsored interaction column from payload_error_bytes.contextual_services table.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_16","title":"Parameters","text":"INPUTS
params STRING\n
OUTPUTS
STRUCT<`source` STRING, formFactor STRING, scenario STRING, interactionType STRING, contextId STRING, reportingUrl STRING, requestId STRING, submissionTimestamp TIMESTAMP, parsedReportingUrl JSON, originalDocType STRING, originalNamespace STRING, interactionCount INTEGER, flaggedFraud BOOLEAN>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#sample_id-udf","title":"sample_id (UDF)","text":"Stably hash a client_id to an integer between 0 and 99. This function is technically defined in SQL, but it calls a JS UDF implementation of a CRC-32 hash, so we defined it here to make it clear that its performance may be limited by BigQuery's JavaScript UDF environment.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_17","title":"Parameters","text":"INPUTS
client_id STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#snake_case_columns-udf","title":"snake_case_columns (UDF)","text":"This UDF takes a list of column names to snake case and transform them to be compatible with the BigQuery column naming format. Based on the existing ingestion logic https://github.com/mozilla/gcp-ingestion/blob/dad29698271e543018eddbb3b771ad7942bf4ce5/ ingestion-core/src/main/java/com/mozilla/telemetry/ingestion/core/transform/PubsubMessageToObjectNode.java#L824
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_18","title":"Parameters","text":"INPUTS
input ARRAY<STRING>\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/about/","title":"mozfun","text":"mozfun
is a public GCP project provisioning publicly accessible user-defined functions (UDFs) and other function-like resources.
Returns whether a given Addon ID is for an adblocker.
Determine if a given Addon ID is for an adblocker.
As an example, this query will give the number of users who have an adblocker installed.
SELECT\n submission_date,\n COUNT(DISTINCT client_id) AS dau,\nFROM\n mozdata.telemetry.addons\nWHERE\n mozfun.addons.is_adblocker(addon_id)\n AND submission_date >= \"2023-01-01\"\nGROUP BY\n submission_date\n
"},{"location":"mozfun/addons/#parameters","title":"Parameters","text":"INPUTS
addon_id STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/","title":"Assert","text":""},{"location":"mozfun/assert/#all_fields_null-udf","title":"all_fields_null (UDF)","text":""},{"location":"mozfun/assert/#parameters","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#approx_equals-udf","title":"approx_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_1","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE, tolerance FLOAT64\n
Source | Edit
"},{"location":"mozfun/assert/#array_empty-udf","title":"array_empty (UDF)","text":""},{"location":"mozfun/assert/#parameters_2","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#array_equals-udf","title":"array_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_3","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#array_equals_any_order-udf","title":"array_equals_any_order (UDF)","text":""},{"location":"mozfun/assert/#parameters_4","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#equals-udf","title":"equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_5","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#error-udf","title":"error (UDF)","text":""},{"location":"mozfun/assert/#parameters_6","title":"Parameters","text":"INPUTS
name STRING, expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#false-udf","title":"false (UDF)","text":""},{"location":"mozfun/assert/#parameters_7","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/assert/#histogram_equals-udf","title":"histogram_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_8","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#json_equals-udf","title":"json_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_9","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#map_entries_equals-udf","title":"map_entries_equals (UDF)","text":"Like map_equals but error message contains only the offending entry
"},{"location":"mozfun/assert/#parameters_10","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#map_equals-udf","title":"map_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_11","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#not_null-udf","title":"not_null (UDF)","text":""},{"location":"mozfun/assert/#parameters_12","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
"},{"location":"mozfun/assert/#null-udf","title":"null (UDF)","text":""},{"location":"mozfun/assert/#parameters_13","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#sql_equals-udf","title":"sql_equals (UDF)","text":"Compare SQL Strings for equality
"},{"location":"mozfun/assert/#parameters_14","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#struct_equals-udf","title":"struct_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_15","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#true-udf","title":"true (UDF)","text":""},{"location":"mozfun/assert/#parameters_16","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/bits28/","title":"bits28","text":"The bits28
functions provide an API for working with \"bit pattern\" INT64 fields, as used in the clients_last_seen
dataset for desktop Firefox and similar datasets for other applications.
A powerful feature of the clients_last_seen
methodology is that it doesn't record specific metrics like MAU and WAU directly, but rather each row stores a history of the discrete days on which a client was active in the past 28 days. We could calculate active users in a 10 day or 25 day window just as efficiently as a 7 day (WAU) or 28 day (MAU) window. But we can also define completely new metrics based on these usage histories, such as various retention definitions.
The usage history is encoded as a \"bit pattern\" where the physical type of the field is a BigQuery INT64, but logically the integer represents an array of bits, with each 1 indicating a day on which the given client was active and each 0 indicating a day on which the client was inactive.
"},{"location":"mozfun/bits28/#active_in_range-udf","title":"active_in_range (UDF)","text":"Return a boolean indicating if any bits are set in the specified range of a bit pattern. The start_offset
must be zero or a negative number indicating an offset from the rightmost bit in the pattern. n_bits is the number of bits to consider, counting right from the bit at start_offset
.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
"},{"location":"mozfun/bits28/#parameters","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/bits28/#days_since_seen-udf","title":"days_since_seen (UDF)","text":"Return the position of the rightmost set bit in an INT64 bit pattern.
To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n mozfun.bits28.days_since_seen(18)\n-- >> 1\n
"},{"location":"mozfun/bits28/#parameters_1","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#from_string-udf","title":"from_string (UDF)","text":"Convert a string representing individual bits into an INT64.
Implementation based on https://stackoverflow.com/a/51600210/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
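A usage sketch mirroring the bits28.to_string examples elsewhere on this page; the expected results are inferred, not taken from the source:
SELECT\n  mozfun.bits28.from_string('0000000000000000000000000001'),  -- 1\n  mozfun.bits28.from_string('0000000000000000000000000011')   -- 3\n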
"},{"location":"mozfun/bits28/#parameters_2","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#range-udf","title":"range (UDF)","text":"Return an INT64 representing a range of bits from a source bit pattern.
The start_offset must be zero or a negative number indicating an offset from the rightmost bit in the pattern.
n_bits is the number of bits to consider, counting right from the bit at start_offset.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n -- Signature is bits28.range(offset_to_day_0, start_bit, number_of_bits)\n mozfun.bits28.range(days_seen_bits, -13 + 0, 7) AS week_0_bits,\n mozfun.bits28.range(days_seen_bits, -13 + 7, 7) AS week_1_bits\nFROM\n `mozdata.telemetry.clients_last_seen`\nWHERE\n submission_date > '2020-01-01'\n
"},{"location":"mozfun/bits28/#parameters_3","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#retention-udf","title":"retention (UDF)","text":"Return a nested struct providing booleans indicating whether a given client was active various time periods based on the passed bit pattern.
"},{"location":"mozfun/bits28/#parameters_4","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
Source | Edit
"},{"location":"mozfun/bits28/#to_dates-udf","title":"to_dates (UDF)","text":"Convert a bit pattern into an array of the dates is represents.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
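An illustrative sketch; the returned dates and their ordering are inferred from the function description (the rightmost bit corresponds to submission_date), not taken from the source:
SELECT\n  mozfun.bits28.to_dates(3, DATE '2020-01-28')\n-- [2020-01-27, 2020-01-28] (the two rightmost bits are set)\n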
"},{"location":"mozfun/bits28/#parameters_5","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
OUTPUTS
ARRAY<DATE>\n
Source | Edit
"},{"location":"mozfun/bits28/#to_string-udf","title":"to_string (UDF)","text":"Convert an INT64 field into a 28-character string representing the individual bits.
Implementation based on https://stackoverflow.com/a/51600210/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n [mozfun.bits28.to_string(1), mozfun.bits28.to_string(2), mozfun.bits28.to_string(3)]\n-- >>> ['0000000000000000000000000001',\n-- '0000000000000000000000000010',\n-- '0000000000000000000000000011']\n
"},{"location":"mozfun/bits28/#parameters_6","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/bytes/","title":"bytes","text":""},{"location":"mozfun/bytes/#bit_pos_to_byte_pos-udf","title":"bit_pos_to_byte_pos (UDF)","text":"Given a bit position, get the byte that bit appears in. 1-indexed (to match substr), and accepts negative values.
"},{"location":"mozfun/bytes/#parameters","title":"Parameters","text":"INPUTS
bit_pos INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bytes/#extract_bits-udf","title":"extract_bits (UDF)","text":"Extract bits from a byte array. Roughly matches substr with three arguments: b: bytes - The byte string we need to extract from start: int - The position of the first bit we want to extract. Can be negative to start from the end of the byte array. One-indexed, like substring. length: int - The number of bits we want to extract
The return byte array will have CEIL(length/8) bytes. The bits of interest will start at the beginning of the byte string. In other words, the byte array will have trailing 0s for any non-relevant fields.
Examples: bytes.extract_bits(b'\\x0F\\xF0', 5, 8) = b'\\xFF' bytes.extract_bits(b'\\x0C\\xC0', -12, 8) = b'\\xCC'
"},{"location":"mozfun/bytes/#parameters_1","title":"Parameters","text":"INPUTS
b BYTES, `begin` INT64, length INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"mozfun/bytes/#zero_right-udf","title":"zero_right (UDF)","text":"Zero bits on the right of byte
"},{"location":"mozfun/bytes/#parameters_2","title":"Parameters","text":"INPUTS
b BYTES, length INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"mozfun/event_analysis/","title":"event_analysis","text":"These functions are specific for use with the events_daily
and event_types
tables. By themselves, these two tables are nearly impossible to use since the event history is compressed; however, these stored procedures should make the data accessible.
The events_daily
table is created as a result of two steps: 1. Map each event to a single UTF8 char which will represent it. 2. Group each client-day and store a string that records, using the compressed format, that client's event history for that day. The characters are ordered by the timestamp at which they appeared that day.
The best way to access this data is to create a view to do the heavy lifting. For example, to see which clients completed a certain action, you can create a view using these functions that knows what that action's representation is (using the compressed mapping from 1.) and create a regex string that checks for the presence of that event. The view makes this transparent, and allows users to simply query a boolean field representing the presence of that event on that day.
"},{"location":"mozfun/event_analysis/#aggregate_match_strings-udf","title":"aggregate_match_strings (UDF)","text":"Given an array of strings that each match a single event, aggregate those into a single regex string that will match any of the events.
"},{"location":"mozfun/event_analysis/#parameters","title":"Parameters","text":"INPUTS
match_strings ARRAY<STRING>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_count_steps_query-stored-procedure","title":"create_count_steps_query (Stored Procedure)","text":"Generate the SQL statement that can be used to create an easily queryable view on events data.
"},{"location":"mozfun/event_analysis/#parameters_1","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n
OUTPUTS
sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_events_view-stored-procedure","title":"create_events_view (Stored Procedure)","text":"Create a view that queries the events_daily
table. This view currently supports both funnels and event counts. Funnels are created as a struct, with each step in the funnel as a boolean column in the struct, indicating whether the user completed that step on that day. Event counts are simply integers.
create_events_view(\n view_name STRING,\n project STRING,\n dataset STRING,\n funnels ARRAY<STRUCT<\n funnel_name STRING,\n funnel ARRAY<STRUCT<\n step_name STRING,\n events ARRAY<STRUCT<\n category STRING,\n event_name STRING>>>>>>,\n counts ARRAY<STRUCT<\n count_name STRING,\n events ARRAY<STRUCT<\n category STRING,\n event_name STRING>>>>\n )\n
view_name
: The name of the view that will be created. This view will be in the shared-prod project, in the analysis bucket, and so will be queryable at: `moz-fx-data-shared-prod`.analysis.{view_name}\n
project
: The project where the dataset
is located.dataset
: The dataset that must contain both the events_daily
and event_types
tables.funnels
: An array of funnels that will be created. Each funnel has two parts: 1. funnel_name
: The name of the funnel is what the column representing the funnel will be named in the view. For example, with the value \"onboarding\"
, the view can be selected as follows: SELECT onboarding\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. funnel
: The ordered series of steps that make up a funnel. Each step also has: 1. step_name
: Used to name the column within the funnel and represents whether the user completed that step on that day. For example, within onboarding
a user may have completed_first_card
as a step; this can be queried at SELECT onboarding.completed_first_step\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. events
: The set of events which indicate the user completed that step of the funnel. Most of the time this is a single event. Each event has a category
and event_name
.counts
: An array of counts. Each count has two parts, similar to funnel steps: 1. count_name
: Used to name the column representing the event count. E.g. \"clicked_settings_count\"
would be queried at SELECT clicked_settings_count\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. events
: The set of events you want to count. Each event has a category
and event_name
.Because the view definitions themselves are not informative about the contents of the events fields, it is best to put your query immediately after the procedure invocation, rather than invoking the procedure and running a separate query.
This STMO query is an example of doing so. This allows viewers of the query to easily interpret what the funnel and count columns represent.
"},{"location":"mozfun/event_analysis/#structure-of-the-resulting-view","title":"Structure of the Resulting View","text":"The view will be created at
`moz-fx-data-shared-prod`.analysis.{event_name}.\n
The view will have a schema roughly matching the following:
root\n |-- submission_date: date\n |-- client_id: string\n |-- {funnel_1_name}: record\n | |-- {funnel_step_1_name} boolean\n | |-- {funnel_step_2_name} boolean\n ...\n |-- {funnel_N_name}: record\n | |-- {funnel_step_M_name}: boolean\n |-- {count_1_name}: integer\n ...\n |-- {count_N_name}: integer\n ...dimensions...\n
"},{"location":"mozfun/event_analysis/#funnels","title":"Funnels","text":"Each funnel will be a STRUCT
with nested columns representing completion of each step The types of those columns are boolean, and represent whether the user completed that step on that day.
STRUCT(\n completed_step_1 BOOLEAN,\n completed_step_2 BOOLEAN,\n ...\n) AS funnel_name\n
With one row per-user per-day, you can use COUNTIF(funnel_name.completed_step_N)
to query these fields. See below for an example.
Each event count is simply an INT64
representing the number of times the user completed those events on that day. If there are multiple events represented within one count, the values are summed. For example, if you wanted to know the number of times a user opened or closed the app, you could create a single event count with those two events.
event_count_name INT64\n
"},{"location":"mozfun/event_analysis/#examples","title":"Examples","text":"The following creates a few fields: - collection_flow
is a funnel for those that started creating a collection within Fenix, and then finished, either by adding those tabs to an existing collection or saving it as a new collection. - collection_flow_saved
represents users who started the collection flow then saved it as a new collection. - number_of_collections_created
is the number of collections created - number_of_collections_deleted
is the number of collections deleted
CALL mozfun.event_analysis.create_events_view(\n 'fenix_collection_funnels',\n 'moz-fx-data-shared-prod',\n 'org_mozilla_firefox',\n\n -- Funnels\n [\n STRUCT(\n \"collection_flow\" AS funnel_name,\n [STRUCT(\n \"started_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n STRUCT(\n \"completed_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'saved' AS event_name),\n STRUCT('collections' AS category, 'tabs_added' AS event_name)] AS events)\n ] AS funnel),\n\n STRUCT(\n \"collection_flow_saved\" AS funnel_name,\n [STRUCT(\n \"started_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n STRUCT(\n \"saved_collection\" AS step_name,\n [STRUCT('collections' AS category, 'saved' AS event_name)] AS events)\n ] AS funnel)\n ],\n\n -- Event Counts\n [\n STRUCT(\n \"number_of_collections_created\" AS count_name,\n [STRUCT('collections' AS category, 'saved' AS event_name)] AS events\n ),\n STRUCT(\n \"number_of_collections_deleted\" AS count_name,\n [STRUCT('collections' AS category, 'removed' AS event_name)] AS events\n )\n ]\n);\n
From there, you can query a few things. For example, the fraction of users who completed each step of the collection flow over time:
SELECT\n submission_date,\n COUNTIF(collection_flow.started_collection_creation) / COUNT(*) AS started_collection_creation,\n COUNTIF(collection_flow.completed_collection_creation) / COUNT(*) AS completed_collection_creation,\nFROM\n `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n submission_date\n
Or you can see the number of collections created and deleted:
SELECT\n submission_date,\n SUM(number_of_collections_created) AS number_of_collections_created,\n SUM(number_of_collections_deleted) AS number_of_collections_deleted,\nFROM\n `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n submission_date\n
"},{"location":"mozfun/event_analysis/#parameters_2","title":"Parameters","text":"INPUTS
view_name STRING, project STRING, dataset STRING, funnels ARRAY<STRUCT<funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>>>, counts ARRAY<STRUCT<count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_funnel_regex-udf","title":"create_funnel_regex (UDF)","text":"Given an array of match strings, each representing a single funnel step, aggregate them into a regex string that will match only against the entire funnel. If intermediate_steps is TRUE, this allows for there to be events that occur between the funnel steps.
"},{"location":"mozfun/event_analysis/#parameters_3","title":"Parameters","text":"INPUTS
step_regexes ARRAY<STRING>, intermediate_steps BOOLEAN\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_funnel_steps_query-stored-procedure","title":"create_funnel_steps_query (Stored Procedure)","text":"Generate the SQL statement that can be used to create an easily queryable view on events data.
"},{"location":"mozfun/event_analysis/#parameters_4","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, funnel ARRAY<STRUCT<list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
OUTPUTS
sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#escape_metachars-udf","title":"escape_metachars (UDF)","text":"Escape all metachars from a regex string. This will make the string an exact match, no matter what it contains.
"},{"location":"mozfun/event_analysis/#parameters_5","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_index_to_match_string-udf","title":"event_index_to_match_string (UDF)","text":"Given an event index string, create a match string that is an exact match in the events_daily table.
"},{"location":"mozfun/event_analysis/#parameters_6","title":"Parameters","text":"INPUTS
index STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_property_index_to_match_string-udf","title":"event_property_index_to_match_string (UDF)","text":"Given an event index and property index from an event_types
table, returns a regular expression to match corresponding events within an events_daily
table's events
string that aren't missing the specified property.
INPUTS
event_index STRING, property_index INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_property_value_to_match_string-udf","title":"event_property_value_to_match_string (UDF)","text":"Given an event index, property index, and property value from an event_types
table, returns a regular expression to match corresponding events within an events_daily
table's events
string.
INPUTS
event_index STRING, property_index INTEGER, property_value STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#extract_event_counts-udf","title":"extract_event_counts (UDF)","text":"Extract the events and their counts from an events string. This function explicitly ignores event properties, and retrieves just the counts of the top-level events.
"},{"location":"mozfun/event_analysis/#usage_1","title":"Usage","text":"extract_event_counts(\n events STRING\n)\n
events
- A comma-separated events string, where each event is represented as a string of unicode chars.
See this dashboard for example usage.
"},{"location":"mozfun/event_analysis/#parameters_9","title":"Parameters","text":"INPUTS
events STRING\n
OUTPUTS
ARRAY<STRUCT<index STRING, count INT64>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#extract_event_counts_with_properties-udf","title":"extract_event_counts_with_properties (UDF)","text":"Extract events with event properties and their associated counts. Also extracts raw events and their counts. This allows for querying with and without properties in the same dashboard.
"},{"location":"mozfun/event_analysis/#usage_2","title":"Usage","text":"extract_event_counts_with_properties(\n events STRING\n)\n
events
- A comma-separated events string, where each event is represented as a string of unicode chars.
See this query for example usage.
"},{"location":"mozfun/event_analysis/#caveats","title":"Caveats","text":"This function extracts both counts for events with each property, and for all events without their properties.
This allows us to include both total counts for an event (with any property value), and events that don't have properties.
"},{"location":"mozfun/event_analysis/#parameters_10","title":"Parameters","text":"INPUTS
events STRING\n
OUTPUTS
ARRAY<STRUCT<event_index STRING, property_index INT64, property_value_index STRING, count INT64>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#get_count_sql-stored-procedure","title":"get_count_sql (Stored Procedure)","text":"For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.
"},{"location":"mozfun/event_analysis/#parameters_11","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n
OUTPUTS
count_sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#get_funnel_steps_sql-stored-procedure","title":"get_funnel_steps_sql (Stored Procedure)","text":"For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.
"},{"location":"mozfun/event_analysis/#parameters_12","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
OUTPUTS
funnel_sql STRING\n
Source | Edit
"},{"location":"mozfun/ga/","title":"Ga","text":""},{"location":"mozfun/ga/#nullify_string-udf","title":"nullify_string (UDF)","text":"Nullify a GA string, which sometimes come in \"(not set)\" or simply \"\"
UDF for handling empty Google Analytics data.
"},{"location":"mozfun/ga/#parameters","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/","title":"Glam","text":""},{"location":"mozfun/glam/#build_hour_to_datetime-udf","title":"build_hour_to_datetime (UDF)","text":"Parses the custom build id used for Fenix builds in GLAM to a datetime.
"},{"location":"mozfun/glam/#parameters","title":"Parameters","text":"INPUTS
build_hour STRING\n
OUTPUTS
DATETIME\n
Source | Edit
"},{"location":"mozfun/glam/#build_seconds_to_hour-udf","title":"build_seconds_to_hour (UDF)","text":"Returns a custom build id generated from the build seconds of a FOG build.
"},{"location":"mozfun/glam/#parameters_1","title":"Parameters","text":"INPUTS
build_hour STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#fenix_build_to_build_hour-udf","title":"fenix_build_to_build_hour (UDF)","text":"Returns a custom build id generated from the build hour of a Fenix build.
"},{"location":"mozfun/glam/#parameters_2","title":"Parameters","text":"INPUTS
app_build_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_bucket_from_value-udf","title":"histogram_bucket_from_value (UDF)","text":""},{"location":"mozfun/glam/#parameters_3","title":"Parameters","text":"INPUTS
buckets ARRAY<STRING>, val FLOAT64\n
OUTPUTS
FLOAT64\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_buckets_cast_string_array-udf","title":"histogram_buckets_cast_string_array (UDF)","text":"Cast histogram buckets into a string array.
"},{"location":"mozfun/glam/#parameters_4","title":"Parameters","text":"INPUTS
buckets ARRAY<INT64>\n
OUTPUTS
ARRAY<STRING>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_cast_json-udf","title":"histogram_cast_json (UDF)","text":"Cast a histogram into a JSON blob.
"},{"location":"mozfun/glam/#parameters_5","title":"Parameters","text":"INPUTS
histogram ARRAY<STRUCT<key STRING, value FLOAT64>>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_cast_struct-udf","title":"histogram_cast_struct (UDF)","text":"Cast a String-based JSON histogram to an Array of Structs
"},{"location":"mozfun/glam/#parameters_6","title":"Parameters","text":"INPUTS
json_str STRING\n
OUTPUTS
ARRAY<STRUCT<KEY STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_fill_buckets-udf","title":"histogram_fill_buckets (UDF)","text":"Interpolate missing histogram buckets with empty buckets.
"},{"location":"mozfun/glam/#parameters_7","title":"Parameters","text":"INPUTS
input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_fill_buckets_dirichlet-udf","title":"histogram_fill_buckets_dirichlet (UDF)","text":"Interpolate missing histogram buckets with empty buckets so it becomes a valid estimator for the dirichlet distribution.
See: https://docs.google.com/document/d/1ipy1oFIKDvHr3R6Ku0goRjS11R1ZH1z2gygOGkSdqUg
To use this, you must first: Aggregate the histograms to the client level, to get a histogram {k1: p1, k2:p2, ..., kK: pN} where the p's are proportions(and p1, p2, ... sum to 1) and Kis the number of buckets.
This is then the client's estimated density, and every client has been reduced to one row (i.e the client's histograms are reduced to this single one and normalized).
Then add all of these across clients to get {k1: P1, k2:P2, ..., kK: PK} where P1 = sum(p1 across N clients) and P2 = sum(p2 across N clients).
Calculate the total number of buckets K, as well as the total number of profiles N reporting
Then our estimate for final density is: [{k1: ((P1 + 1/K) / (nreporting+1)), k2: ((P2 + 1/K) /(nreporting+1)), ... }
"},{"location":"mozfun/glam/#parameters_8","title":"Parameters","text":"INPUTS
input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>, total_users INT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_filter_high_values-udf","title":"histogram_filter_high_values (UDF)","text":"Prevent overflows by only keeping buckets where value is less than 2^40 allowing 2^24 entries. This value was chosen somewhat abitrarily, typically the max histogram value is somewhere on the order of ~20 bits. Negative values are incorrect and should not happen but were observed, probably due to some bit flips.
"},{"location":"mozfun/glam/#parameters_9","title":"Parameters","text":"INPUTS
aggs ARRAY<STRUCT<key STRING, value INT64>>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_from_buckets_uniform-udf","title":"histogram_from_buckets_uniform (UDF)","text":"Create an empty histogram from an array of buckets.
"},{"location":"mozfun/glam/#parameters_10","title":"Parameters","text":"INPUTS
buckets ARRAY<STRING>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_exponential_buckets-udf","title":"histogram_generate_exponential_buckets (UDF)","text":"Generate exponential buckets for a histogram.
"},{"location":"mozfun/glam/#parameters_11","title":"Parameters","text":"INPUTS
min FLOAT64, max FLOAT64, nBuckets FLOAT64\n
OUTPUTS
ARRAY<FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_functional_buckets-udf","title":"histogram_generate_functional_buckets (UDF)","text":"Generate functional buckets for a histogram. This is specific to Glean.
See: https://github.com/mozilla/glean/blob/main/glean-core/src/histogram/functional.rs
A functional bucketing algorithm. The bucket index of a given sample is determined with the following function:
i = $$ \\lfloor{n log_{\\text{base}}{(x)}}\\rfloor $$
In other words, there are n buckets for each power of base
magnitude.
INPUTS
log_base INT64, buckets_per_magnitude INT64, range_max INT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_linear_buckets-udf","title":"histogram_generate_linear_buckets (UDF)","text":"Generate linear buckets for a histogram.
"},{"location":"mozfun/glam/#parameters_13","title":"Parameters","text":"INPUTS
min FLOAT64, max FLOAT64, nBuckets FLOAT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_scalar_buckets-udf","title":"histogram_generate_scalar_buckets (UDF)","text":"Generate scalar buckets for a histogram using a fixed number of buckets.
"},{"location":"mozfun/glam/#parameters_14","title":"Parameters","text":"INPUTS
min_bucket FLOAT64, max_bucket FLOAT64, num_buckets INT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_normalized_sum-udf","title":"histogram_normalized_sum (UDF)","text":"Compute the normalized sum of an array of histograms.
"},{"location":"mozfun/glam/#parameters_15","title":"Parameters","text":"INPUTS
arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_normalized_sum_with_original-udf","title":"histogram_normalized_sum_with_original (UDF)","text":"Compute the normalized and the non-normalized sum of an array of histograms.
"},{"location":"mozfun/glam/#parameters_16","title":"Parameters","text":"INPUTS
arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64, non_norm_value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#map_from_array_offsets-udf","title":"map_from_array_offsets (UDF)","text":""},{"location":"mozfun/glam/#parameters_17","title":"Parameters","text":"INPUTS
required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#map_from_array_offsets_precise-udf","title":"map_from_array_offsets_precise (UDF)","text":""},{"location":"mozfun/glam/#parameters_18","title":"Parameters","text":"INPUTS
required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#percentile-udf","title":"percentile (UDF)","text":"Get the value of the approximate CDF at the given percentile.
"},{"location":"mozfun/glam/#parameters_19","title":"Parameters","text":"INPUTS
pct FLOAT64, histogram ARRAY<STRUCT<key STRING, value FLOAT64>>, type STRING\n
OUTPUTS
FLOAT64\n
Source | Edit
"},{"location":"mozfun/glean/","title":"glean","text":"Functions for working with Glean data.
"},{"location":"mozfun/glean/#legacy_compatible_experiments-udf","title":"legacy_compatible_experiments (UDF)","text":"Formats a Glean experiments field into a Legacy Telemetry experiments field by dropping the extra information that Glean collects
This UDF transforms the ping_info.experiments
field from Glean pings into the format for experiments
used by Legacy Telemetry pings. In particular, it drops the exta information that Glean pings collect.
If you need to combine Glean data with Legacy Telemetry data, then you can use this UDF to transform a Glean experiments field into the structure of a Legacy Telemetry one.
"},{"location":"mozfun/glean/#parameters","title":"Parameters","text":"INPUTS
ping_info__experiments ARRAY<STRUCT<key STRING, value STRUCT<branch STRING, extra STRUCT<type STRING, enrollment_id STRING>>>>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/glean/#parse_datetime-udf","title":"parse_datetime (UDF)","text":"Parses a Glean datetime metric string value as a BigQuery timestamp.
See https://mozilla.github.io/glean/book/reference/metrics/datetime.html
"},{"location":"mozfun/glean/#parameters_1","title":"Parameters","text":"INPUTS
datetime_string STRING\n
OUTPUTS
TIMESTAMP\n
Source | Edit
"},{"location":"mozfun/glean/#timespan_nanos-udf","title":"timespan_nanos (UDF)","text":"Returns the number of nanoseconds represented by a Glean timespan struct.
See https://mozilla.github.io/glean/book/user/metrics/timespan.html
"},{"location":"mozfun/glean/#parameters_2","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/glean/#timespan_seconds-udf","title":"timespan_seconds (UDF)","text":"Returns the number of seconds represented by a Glean timespan struct, rounded down to full seconds.
See https://mozilla.github.io/glean/book/user/metrics/timespan.html
"},{"location":"mozfun/glean/#parameters_3","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/google_ads/","title":"Google ads","text":""},{"location":"mozfun/google_ads/#extract_segments_from_campaign_name-udf","title":"extract_segments_from_campaign_name (UDF)","text":"Extract Segments from a campaign name. Includes region, country_code, and language.
"},{"location":"mozfun/google_ads/#parameters","title":"Parameters","text":"INPUTS
campaign_name STRING\n
OUTPUTS
STRUCT<campaign_region STRING, campaign_country_code STRING, campaign_language STRING>\n
Source | Edit
"},{"location":"mozfun/google_search_console/","title":"google_search_console","text":"Functions for use with Google Search Console data.
"},{"location":"mozfun/google_search_console/#classify_site_query-udf","title":"classify_site_query (UDF)","text":"Classify a Google search query for a site as \"Anonymized\", \"Firefox Brand\", \"Pocket Brand\", \"Mozilla Brand\", or \"Non-Brand\".
"},{"location":"mozfun/google_search_console/#parameters","title":"Parameters","text":"INPUTS
site_domain_name STRING, query STRING, search_type STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_country_code-udf","title":"extract_url_country_code (UDF)","text":"Extract the country code from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_1","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_domain_name-udf","title":"extract_url_domain_name (UDF)","text":"Extract the domain name from a URL.
"},{"location":"mozfun/google_search_console/#parameters_2","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_language_code-udf","title":"extract_url_language_code (UDF)","text":"Extract the language code from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_3","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_locale-udf","title":"extract_url_locale (UDF)","text":"Extract the locale from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_4","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_path-udf","title":"extract_url_path (UDF)","text":"Extract the path from a URL.
"},{"location":"mozfun/google_search_console/#parameters_5","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_path_segment-udf","title":"extract_url_path_segment (UDF)","text":"Extract a particular path segment from a URL.
"},{"location":"mozfun/google_search_console/#parameters_6","title":"Parameters","text":"INPUTS
url STRING, segment_number INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/hist/","title":"hist","text":"Functions for working with string encodings of histograms from desktop telemetry.
"},{"location":"mozfun/hist/#count-udf","title":"count (UDF)","text":"Given histogram h, return the count of all measurements across all buckets.
Extracts the values from the histogram and sums them, returning the total_count.
"},{"location":"mozfun/hist/#parameters","title":"Parameters","text":"INPUTS
histogram STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract-udf","title":"extract (UDF)","text":"Return a parsed struct from a string-encoded histogram.
We support a variety of compact encodings as well as the classic JSON representation as sent in main pings.
The built-in BigQuery JSON parsing functions are not powerful enough to handle all the logic here, so we resort to some string processing. This function could behave unexpectedly on poorly-formatted histogram JSON, but we expect that payload validation in the data pipeline should ensure that histograms are well formed, which gives us some flexibility.
For more on desktop telemetry histogram structure, see:
The compact encodings were originally proposed in:
SELECT\n mozfun.hist.extract(\n '{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}'\n ).sum\n-- 1\n
SELECT\n mozfun.hist.extract('5').sum\n-- 5\n
"},{"location":"mozfun/hist/#parameters_1","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"Extract a histogram sum from a JSON str representation
"},{"location":"mozfun/hist/#parameters_2","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract_keyed_hist_sum-udf","title":"extract_keyed_hist_sum (UDF)","text":"Sum of a keyed histogram, across all keys it contains.
"},{"location":"mozfun/hist/#extract-keyed-histogram-sum","title":"Extract Keyed Histogram Sum","text":"Takes a keyed histogram and returns a single number: the sum of all keys it contains. The expected input type is ARRAY<STRUCT<key STRING, value STRING>>
The return type is INT64
.
The key
field will be ignored, and the value is expected to be the compact histogram representation.
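An illustrative sketch, assuming the compact single-value encoding shown for hist.extract above (where a value of '5' encodes a histogram with sum 5):
SELECT\n mozfun.hist.extract_keyed_hist_sum(\n [STRUCT('key_a' AS key, '3' AS value), STRUCT('key_b' AS key, '5' AS value)]\n ) AS keyed_sum\n-- 8, assuming the keyed sum is the total of the per-key histogram sums\n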
INPUTS
keyed_histogram ARRAY<STRUCT<key STRING, value STRING>>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#mean-udf","title":"mean (UDF)","text":"Given histogram h, return floor(mean) of the measurements in the bucket. That is, the histogram sum divided by the number of measurements taken.
https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L292-L307
"},{"location":"mozfun/hist/#parameters_4","title":"Parameters","text":"INPUTS
histogram ANY TYPE\n
OUTPUTS
STRUCT<sum INT64, VALUES ARRAY<STRUCT<value INT64>>>\n
Source | Edit
"},{"location":"mozfun/hist/#merge-udf","title":"merge (UDF)","text":"Merge an array of histograms into a single histogram.
INPUTS
histogram_list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/hist/#normalize-udf","title":"normalize (UDF)","text":"Normalize a histogram. Set sum to 1, and normalize to 1 the histogram bucket counts.
"},{"location":"mozfun/hist/#parameters_6","title":"Parameters","text":"INPUTS
histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n
OUTPUTS
STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n
Source | Edit
"},{"location":"mozfun/hist/#percentiles-udf","title":"percentiles (UDF)","text":"Given histogram and list of percentiles,calculate what those percentiles are for the histogram. If the histogram is empty, returns NULL.
"},{"location":"mozfun/hist/#parameters_7","title":"Parameters","text":"INPUTS
histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<percentile FLOAT64, value INT64>>\n
Source | Edit
"},{"location":"mozfun/hist/#string_to_json-udf","title":"string_to_json (UDF)","text":"Convert a histogram string (in JSON or compact format) to a full histogram JSON blob.
"},{"location":"mozfun/hist/#parameters_8","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#threshold_count-udf","title":"threshold_count (UDF)","text":"Return the number of recorded observations greater than threshold for the histogram. CAUTION: Does not count any buckets that have any values less than the threshold. For example, a bucket with range (1, 10) will not be counted for a threshold of 2. Use threshold that are not bucket boundaries with caution.
https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L213-L239
"},{"location":"mozfun/hist/#parameters_9","title":"Parameters","text":"INPUTS
histogram STRING, threshold INT64\n
Source | Edit
"},{"location":"mozfun/iap/","title":"iap","text":""},{"location":"mozfun/iap/#derive_apple_subscription_interval-udf","title":"derive_apple_subscription_interval (UDF)","text":"Take output purchase_date and expires_date from mozfun.iap.parse_apple_receipt and return the subscription interval to use for accounting. Values must be DATETIME in America/Los_Angeles to get correct results because of how timezone and daylight savings impact the time of day and the length of a month.
"},{"location":"mozfun/iap/#parameters","title":"Parameters","text":"INPUTS
start DATETIME, `end` DATETIME\n
OUTPUTS
STRUCT<`interval` STRING, interval_count INT64>\n
Source | Edit
"},{"location":"mozfun/iap/#parse_android_receipt-udf","title":"parse_android_receipt (UDF)","text":"Used to parse data
field from firestore export of fxa dataset iap_google_raw. The content is documented at https://developer.android.com/google/play/billing/subscriptions and https://developers.google.com/android-publisher/api-ref/rest/v3/purchases.subscriptions
INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/iap/#parse_apple_event-udf","title":"parse_apple_event (UDF)","text":"Used to parse data
field from firestore export of fxa dataset iap_app_store_purchases_raw. The content is documented at https://developer.apple.com/documentation/appstoreservernotifications/responsebodyv2decodedpayload and https://github.com/mozilla/fxa/blob/700ed771860da450add97d62f7e6faf2ead0c6ba/packages/fxa-shared/payments/iap/apple-app-store/subscription-purchase.ts#L115-L171
INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/iap/#parse_apple_receipt-udf","title":"parse_apple_receipt (UDF)","text":"Used to parse provider_receipt_json in mozilla vpn subscriptions where provider is \"APPLE\". The content is documented at https://developer.apple.com/documentation/appstorereceipts/responsebody
"},{"location":"mozfun/iap/#parameters_3","title":"Parameters","text":"INPUTS
provider_receipt_json STRING\n
OUTPUTS
STRUCT<environment STRING, latest_receipt BYTES, latest_receipt_info ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, pending_renewal_info ARRAY<STRUCT<auto_renew_product_id STRING, auto_renew_status INT64, expiration_intent INT64, is_in_billing_retry_period INT64, original_transaction_id STRING, product_id STRING>>, receipt STRUCT<adam_id INT64, app_item_id INT64, application_version STRING, bundle_id STRING, download_id INT64, in_app ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, original_application_version STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, receipt_creation_date STRING, receipt_creation_date_ms INT64, receipt_creation_date_pst STRING, receipt_type STRING, request_date STRING, request_date_ms INT64, request_date_pst STRING, version_external_identifier INT64>, status INT64>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/iap/#scrub_apple_receipt-udf","title":"scrub_apple_receipt (UDF)","text":"Take output from mozfun.iap.parse_apple_receipt and remove fields or reduce their granularity so that the returned value can be exposed to all employees via redash.
"},{"location":"mozfun/iap/#parameters_4","title":"Parameters","text":"INPUTS
apple_receipt ANY TYPE\n
OUTPUTS
STRUCT<environment STRING, active_period STRUCT<start_date DATE, end_date DATE, start_time TIMESTAMP, end_time TIMESTAMP, `interval` STRING, interval_count INT64>, trial_period STRUCT<start_time TIMESTAMP, end_time TIMESTAMP>>\n
Source | Edit
"},{"location":"mozfun/json/","title":"json","text":"Functions for parsing Mozilla-specific JSON data types.
"},{"location":"mozfun/json/#extract_int_map-udf","title":"extract_int_map (UDF)","text":"Returns an array of key/value structs from a string representing a JSON map. Both keys and values are cast to integers.
This is the format for the \"values\" field in the desktop telemetry histogram JSON representation.
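An illustrative call (the JSON literal mirrors the histogram \"values\" format described above):
SELECT\n mozfun.json.extract_int_map('{\"0\": 1, \"1\": 3, \"16\": 5}') AS int_map\n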
"},{"location":"mozfun/json/#parameters","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/json/#from_map-udf","title":"from_map (UDF)","text":"Converts a standard \"map\" like datastructure array<struct<key, value>>
into a JSON value.
Convert the standard Array<Struct<key, value>>
style maps to JSON
values.
INPUTS
input JSON\n
OUTPUTS
json\n
Source | Edit
"},{"location":"mozfun/json/#from_nested_map-udf","title":"from_nested_map (UDF)","text":"Converts a nested JSON object with repeated key/value pairs into a nested JSON object.
Convert a JSON object like { \"metric\": [ {\"key\": \"extra\", \"value\": 2 } ] }
to a JSON
object like { \"metric\": { \"key\": 2 } }
.
This only works on JSON types.
"},{"location":"mozfun/json/#parameters_2","title":"Parameters","text":"OUTPUTS
json\n
Source | Edit
"},{"location":"mozfun/json/#js_extract_string_map-udf","title":"js_extract_string_map (UDF)","text":"Returns an array of key/value structs from a string representing a JSON map.
BigQuery Standard SQL JSON functions are insufficient to implement this function, so JS is being used and it may not perform well with large or numerous inputs.
Non-string non-null values are encoded as json.
"},{"location":"mozfun/json/#parameters_3","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/json/#mode_last-udf","title":"mode_last (UDF)","text":"Returns the most frequently occuring element in an array of json-compatible elements. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored.
"},{"location":"mozfun/json/#parameters_4","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/ltv/","title":"Ltv","text":""},{"location":"mozfun/ltv/#android_states_v1-udf","title":"android_states_v1 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"
"},{"location":"mozfun/ltv/#parameters","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_v2-udf","title":"android_states_v2 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"
"},{"location":"mozfun/ltv/#parameters_1","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_with_paid_v1-udf","title":"android_states_with_paid_v1 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_organic_2_1\" and \"0_dow1_paid_1_1\"
These states include whether a client was paid or organic.
"},{"location":"mozfun/ltv/#parameters_2","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_with_paid_v2-udf","title":"android_states_with_paid_v2 (UDF)","text":"Get the state of a user on a day, with paid/organic cohorts included. Compared to V1, these states have a \"dead\" state, determined by \"dead_time\". The model can use this state as a sink, where the client will never return if they are dead.
"},{"location":"mozfun/ltv/#parameters_3","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#desktop_states_v1-udf","title":"desktop_states_v1 (UDF)","text":"LTV states for Desktop. Results in strings like: \"0_1_1_1_1\" Where each component is 1. the age in days of the client 2. the day of week of first_seen_date 3. the day of week of submission_date 4. the activity level, possible values are 0-3, plus \"00\" for \"dead\" 5. whether the client is active on submission_date
"},{"location":"mozfun/ltv/#parameters_4","title":"Parameters","text":"INPUTS
days_since_first_seen INT64, days_since_active INT64, submission_date DATE, first_seen_date DATE, death_time INT64, pattern INT64, active INT64, max_days INT64, lookback INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#get_state_ios_v2-udf","title":"get_state_ios_v2 (UDF)","text":"LTV states for iOS.
"},{"location":"mozfun/ltv/#parameters_5","title":"Parameters","text":"INPUTS
days_since_first_seen INT64, days_since_seen INT64, submission_date DATE, death_time INT64, pattern INT64, active INT64, max_weeks INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/","title":"map","text":"Functions for working with arrays of key/value structs.
"},{"location":"mozfun/map/#extract_keyed_scalar_sum-udf","title":"extract_keyed_scalar_sum (UDF)","text":"Sums all values in a keyed scalar.
"},{"location":"mozfun/map/#extract-keyed-scalar-sum","title":"Extract Keyed Scalar Sum","text":"Takes a keyed scalar and returns a single number: the sum of all values it contains. The expected input type is ARRAY<STRUCT<key STRING, value INT64>>
The return type is INT64
.
The key
field will be ignored.
INPUTS
keyed_scalar ARRAY<STRUCT<key STRING, value INT64>>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/map/#from_lists-udf","title":"from_lists (UDF)","text":"Create a map from two arrays (like zipping)
"},{"location":"mozfun/map/#parameters_1","title":"Parameters","text":"INPUTS
keys ANY TYPE, `values` ANY TYPE\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/map/#get_key-udf","title":"get_key (UDF)","text":"Fetch the value associated with a given key from an array of key/value structs.
Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields.
"},{"location":"mozfun/map/#parameters_2","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
Source | Edit
"},{"location":"mozfun/map/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":"Fetch the value associated with a given key from an array of key/value structs.
Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields. This version matches NULL keys as well.
"},{"location":"mozfun/map/#parameters_3","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/#mode_last-udf","title":"mode_last (UDF)","text":"Combine entries from multiple maps, determine the value for each key using mozfun.stats.mode_last.
"},{"location":"mozfun/map/#parameters_4","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"mozfun/map/#set_key-udf","title":"set_key (UDF)","text":"Set a key to a value in a map. If you call map.get_key after setting, the value you set will be returned.
map.set_key
Set a key to a specific value in a map. We represent maps as Arrays of Key/Value structs: ARRAY<STRUCT<key ANY TYPE, value ANY TYPE>>
.
The type of the key and value you are setting must match the types in the map itself.
"},{"location":"mozfun/map/#parameters_5","title":"Parameters","text":"INPUTS
map ANY TYPE, new_key ANY TYPE, new_value ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/#sum-udf","title":"sum (UDF)","text":"Return the sum of values by key in an array of map entries. The expected schema for entries is ARRAY>, where the type for value must be supported by SUM, which allows numeric data types INT64, NUMERIC, and FLOAT64."},{"location":"mozfun/map/#parameters_6","title":"Parameters","text":"
INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"mozfun/marketing/","title":"Marketing","text":""},{"location":"mozfun/marketing/#parse_ad_group_name-udf","title":"parse_ad_group_name (UDF)","text":"Please provide a description for the routine
"},{"location":"mozfun/marketing/#parse-ad-group-name-udf","title":"Parse Ad Group Name UDF","text":"This function takes a ad group name and parses out known segments. These segments are things like country, language, or audience; multiple ad groups can share segments.
We use versioned ad group names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the ad group name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
See also: marketing.parse_campaign_name
, which does the same, but for campaign names.
INPUTS
ad_group_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/marketing/#parse_campaign_name-udf","title":"parse_campaign_name (UDF)","text":"Parse a campaign name. Extracts things like region, country_code, and language.
"},{"location":"mozfun/marketing/#parse-campaign-name-udf","title":"Parse Campaign Name UDF","text":"This function takes a campaign name and parses out known segments. These segments are things like country, language, or audience; multiple campaigns can share segments.
We use versioned campaign names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the campaign name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
"},{"location":"mozfun/marketing/#parameters_1","title":"Parameters","text":"INPUTS
campaign_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/marketing/#parse_creative_name-udf","title":"parse_creative_name (UDF)","text":"Parse segments from a creative name.
"},{"location":"mozfun/marketing/#parse-creative-name-udf","title":"Parse Creative Name UDF","text":"This function takes a creative name and parses out known segments. These segments are things like country, language, or audience; multiple creatives can share segments.
We use versioned creative names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the creative name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
See also: marketing.parse_campaign_name
, which does the same, but for campaign names.
INPUTS
creative_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/mobile_search/","title":"Mobile search","text":""},{"location":"mozfun/mobile_search/#normalize_app_name-udf","title":"normalize_app_name (UDF)","text":"Returns normalized_app_name and normalized_app_name_os (for mobile search tables only).
"},{"location":"mozfun/mobile_search/#normalized-app-and-os-name-for-mobile-search-related-tables","title":"Normalized app and os name for mobile search related tables","text":"Takes app name and os as input : Returns a struct of normalized_app_name and normalized_app_name_os based on discussion provided here
"},{"location":"mozfun/mobile_search/#parameters","title":"Parameters","text":"INPUTS
app_name STRING, os STRING\n
OUTPUTS
STRUCT<normalized_app_name STRING, normalized_app_name_os STRING>\n
Source | Edit
"},{"location":"mozfun/norm/","title":"norm","text":"Functions for normalizing data.
"},{"location":"mozfun/norm/#browser_version_info-udf","title":"browser_version_info (UDF)","text":"Adds metadata related to the browser version in a struct.
This is a temporary solution that allows browser version analysis. It should eventually be replaced with one or more browser version tables that serves as a source of truth for version releases.
"},{"location":"mozfun/norm/#parameters","title":"Parameters","text":"INPUTS
version_string STRING\n
OUTPUTS
STRUCT<version STRING, major_version NUMERIC, minor_version NUMERIC, patch_revision NUMERIC, is_major_release BOOLEAN>\n
Source | Edit
"},{"location":"mozfun/norm/#diff_months-udf","title":"diff_months (UDF)","text":"Determine the number of whole months after grace period between start and end. Month is dependent on timezone, so start and end must both be datetimes, or both be dates, in the correct timezone. Grace period can be used to account for billing delay, usually 1 day, and is counted after months. When inclusive is FALSE, start and end are not included in whole months. For example, diff_months(start => '2021-01-01', end => '2021-03-01', grace_period => INTERVAL 0 day, inclusive => FALSE) returns 1, because start plus two months plus grace period is not less than end. Changing inclusive to TRUE returns 2, because start plus two months plus grace period is less than or equal to end. diff_months(start => '2021-01-01', end => '2021-03-02 00:00:00.000001', grace_period => INTERVAL 1 DAY, inclusive => FALSE) returns 2, because start plus two months plus grace period is less than end.
"},{"location":"mozfun/norm/#parameters_1","title":"Parameters","text":"INPUTS
start DATETIME, `end` DATETIME, grace_period INTERVAL, inclusive BOOLEAN\n
Source | Edit
"},{"location":"mozfun/norm/#extract_version-udf","title":"extract_version (UDF)","text":"Extracts numeric version data from a version string like <major>.<minor>.<patch>
.
Note: Non-zero minor and patch versions will be floating point Numeric
.
Usage:
SELECT\n mozfun.norm.extract_version(version_string, 'major') as major_version,\n mozfun.norm.extract_version(version_string, 'minor') as minor_version,\n mozfun.norm.extract_version(version_string, 'patch') as patch_version\n
Example using \"96.05.01\"
:
SELECT\n mozfun.norm.extract_version('96.05.01', 'major') as major_version, -- 96\n mozfun.norm.extract_version('96.05.01', 'minor') as minor_version, -- 5\n mozfun.norm.extract_version('96.05.01', 'patch') as patch_version -- 1\n
"},{"location":"mozfun/norm/#parameters_2","title":"Parameters","text":"INPUTS
version_string STRING, extraction_level STRING\n
OUTPUTS
NUMERIC\n
Source | Edit
"},{"location":"mozfun/norm/#fenix_app_info-udf","title":"fenix_app_info (UDF)","text":"Returns canonical, human-understandable identification info for Fenix sources.
The Glean telemetry library for Android by design routes pings based on the Play Store appId value of the published application. As of August 2020, there have been 5 separate Play Store appId values associated with different builds of Fenix, each corresponding to different datasets in BigQuery, and the mapping of appId to logical app names (Firefox vs. Firefox Preview) and channel names (nightly, beta, or release) has changed over time; see the spreadsheet of naming history for Mozilla's mobile browsers.
This function is intended as the source of truth for how to map a specific ping in BigQuery to a logical app names and channel. It should be expected that the output of this function may evolve over time. If we rename a product or channel, we may choose to update the values here so that analyses consistently get the new name.
The first argument (app_id
) can be fairly fuzzy; it is tolerant of actual Google Play Store appId values like 'org.mozilla.firefox_beta' (mix of periods and underscores) as well as BigQuery dataset names with suffixes like 'org_mozilla_firefox_beta_stable'.
The second argument (app_build_id
) should be the value in client_info.app_build.
The function returns a STRUCT
that contains the logical app_name
and channel
as well as the Play Store app_id
in the canonical form which would appear in Play Store URLs.
Note that the naming of Fenix applications changed on 2020-07-03, so to get a continuous view of the pings associated with a logical app channel, you may need to union together tables from multiple BigQuery datasets. To see data for all Fenix channels together, it is necessary to union together tables from all 5 datasets. For basic usage information, consider using telemetry.fenix_clients_last_seen
which already handles the union. Otherwise, see the example below as a template for how construct a custom union.
Mapping of channels to datasets:
org_mozilla_firefox
org_mozilla_firefox_beta
(current) and org_mozilla_fenix
org_mozilla_fenix
(current), org_mozilla_fennec_aurora
, and org_mozilla_fenix_nightly
-- Example of a query over all Fenix builds advertised as \"Firefox Beta\"\nCREATE TEMP FUNCTION extract_fields(app_id STRING, m ANY TYPE) AS (\n (\n SELECT AS STRUCT\n m.submission_timestamp,\n m.metrics.string.geckoview_version,\n mozfun.norm.fenix_app_info(app_id, m.client_info.app_build).*\n )\n);\n\nWITH base AS (\n SELECT\n extract_fields('org_mozilla_firefox_beta', m).*\n FROM\n `mozdata.org_mozilla_firefox_beta.metrics` AS m\n UNION ALL\n SELECT\n extract_fields('org_mozilla_fenix', m).*\n FROM\n `mozdata.org_mozilla_fenix.metrics` AS m\n)\nSELECT\n DATE(submission_timestamp) AS submission_date,\n geckoview_version,\n COUNT(*)\nFROM\n base\nWHERE\n app_name = 'Fenix' -- excludes 'Firefox Preview'\n AND channel = 'beta'\n AND DATE(submission_timestamp) = '2020-08-01'\nGROUP BY\n submission_date,\n geckoview_version\n
"},{"location":"mozfun/norm/#parameters_3","title":"Parameters","text":"INPUTS
app_id STRING, app_build_id STRING\n
OUTPUTS
STRUCT<app_name STRING, channel STRING, app_id STRING>\n
Source | Edit
"},{"location":"mozfun/norm/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.
Fenix originally used an 8-digit app_build format
In short it is yDDDHHmm
:
The last date seen with an 8-digit build ID is 2020-08-10.
Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.
This function tolerates both formats.
After using this you may wish to DATETIME_TRUNC(result, DAY)
for grouping by build date.
INPUTS
app_build STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/norm/#firefox_android_package_name_to_channel-udf","title":"firefox_android_package_name_to_channel (UDF)","text":"Map Fenix package name to the channel name
"},{"location":"mozfun/norm/#parameters_5","title":"Parameters","text":"INPUTS
package_name STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/norm/#get_earliest_value-udf","title":"get_earliest_value (UDF)","text":"This UDF returns the earliest not-null value pair and datetime from a list of values and their corresponding timestamp.
The function will return the first value pair in the input array, that is not null and has the earliest timestamp.
Because there may be more than one value on the same date e.g. more than one value reported by different pings on the same date, the dates must be given as TIMESTAMPS and the values as STRING.
Usage:
SELECT\n mozfun.norm.get_earliest_value(ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>) AS <alias>\n
"},{"location":"mozfun/norm/#parameters_6","title":"Parameters","text":"INPUTS
value_set ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>\n
OUTPUTS
STRUCT<earliest_value STRING, earliest_value_source STRING, earliest_date DATETIME>\n
Source | Edit
"},{"location":"mozfun/norm/#get_windows_info-udf","title":"get_windows_info (UDF)","text":"Exract the name, the version name, the version number, and the build number corresponding to a Microsoft Windows operating system version string in the form of .. or ... for most release versions of Windows after 2007."},{"location":"mozfun/norm/#windows-names-versions-and-builds","title":"Windows Names, Versions, and Builds","text":""},{"location":"mozfun/norm/#summary","title":"Summary","text":"
This function is primarily designed to parse the field os_version
in table mozdata.default_browser_agent.default_browser
. Given a Microsoft Windows OS version string, the function returns the name of the operating system, the version name, the version number, and the build number corresponding to the operating system. As of November 2022, the parser can handle 99.89% of the os_version
values collected in table mozdata.default_browser_agent.default_browser
.
As of November 2022, the expected valid values of os_version
are either x.y.z
or w.x.y.z
where w
, x
, y
, and z
are integers.
As of November 2022, the return values for Windows 10 and Windows 11 are based on Windows 10 release information and Windows 11 release information. For 3-number version strings, the parser assumes the valid values of z
in x.y.z
are at most 5 digits in length. For 4-number version strings, the parser assumes the valid values of z
in w.x.y.z
are at most 6 digits in length. The function makes an educated effort to handle Windows Vista, Windows 7, Windows 8, and Windows 8.1 information, but does not guarantee the return values are absolutely accurate. The function assumes the presence of undocumented non-release versions of Windows 10 and Windows 11, and will return an estimated name, version number, build number but not the version name. The function does not handle other versions of Windows.
As of November 2022, the parser currently handles just over 99.89% of data in the field os_version
in table mozdata.default_browser_agent.default_browser
.
Note: Microsoft convention for build numbers for Windows 10 and 11 include two numbers, such as build number 22621.900
for version 22621
. The first number repeats the version number and the second number uniquely identifies the build within the version. To simplify data processing and data analysis, this function returns the second unique identifier as an integer instead of returning the full build number as a string.
SELECT\n `os_version`,\n mozfun.norm.get_windows_info(`os_version`) AS windows_info\nFROM `mozdata.default_browser_agent.default_browser`\nWHERE `submission_timestamp` > (CURRENT_TIMESTAMP() - INTERVAL 7 DAY) AND LEFT(document_id, 2) = '00'\nLIMIT 1000\n
"},{"location":"mozfun/norm/#mapping","title":"Mapping","text":"os_version windows_name windows_version_name windows_version_number windows_build_number 6.0.z Windows Vista 6.0 6.0 z 6.1.z Windows 7 7.0 6.1 z 6.2.z Windows 8 8.0 6.2 z 6.3.z Windows 8.1 8.1 6.3 z 10.0.10240.z Windows 10 1507 10240 z 10.0.10586.z Windows 10 1511 10586 z 10.0.14393.z Windows 10 1607 14393 z 10.0.15063.z Windows 10 1703 15063 z 10.0.16299.z Windows 10 1709 16299 z 10.0.17134.z Windows 10 1803 17134 z 10.0.17763.z Windows 10 1809 17763 z 10.0.18362.z Windows 10 1903 18362 z 10.0.18363.z Windows 10 1909 18363 z 10.0.19041.z Windows 10 2004 19041 z 10.0.19042.z Windows 10 20H2 19042 z 10.0.19043.z Windows 10 21H1 19043 z 10.0.19044.z Windows 10 21H2 19044 z 10.0.19045.z Windows 10 22H2 19045 z 10.0.y.z Windows 10 UNKNOWN y z 10.0.22000.z Windows 11 21H2 22000 z 10.0.22621.z Windows 11 22H2 22621 z 10.0.y.z Windows 11 UNKNOWN y z all other values (null) (null) (null) (null)"},{"location":"mozfun/norm/#parameters_7","title":"Parameters","text":"INPUTS
os_version STRING\n
OUTPUTS
STRUCT<name STRING, version_name STRING, version_number DECIMAL, build_number INT64>\n
Source | Edit
"},{"location":"mozfun/norm/#glean_baseline_client_info-udf","title":"glean_baseline_client_info (UDF)","text":"Accepts a glean client_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
"},{"location":"mozfun/norm/#parameters_8","title":"Parameters","text":"INPUTS
client_info ANY TYPE, metrics ANY TYPE\n
OUTPUTS
string\n
Source | Edit
"},{"location":"mozfun/norm/#glean_ping_info-udf","title":"glean_ping_info (UDF)","text":"Accepts a glean ping_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
"},{"location":"mozfun/norm/#parameters_9","title":"Parameters","text":"INPUTS
ping_info ANY TYPE\n
Source | Edit
"},{"location":"mozfun/norm/#metadata-udf","title":"metadata (UDF)","text":"Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.
"},{"location":"mozfun/norm/#parameters_10","title":"Parameters","text":"INPUTS
metadata ANY TYPE\n
OUTPUTS
`date`, CAST(NULL\n
Source | Edit
"},{"location":"mozfun/norm/#os-udf","title":"os (UDF)","text":"Normalize an operating system string to one of the three major desktop platforms, one of the two major mobile platforms, or \"Other\".
This is a reimplementation of logic used in the data pipeline to populate normalized_os
.
INPUTS
os STRING\n
Source | Edit
"},{"location":"mozfun/norm/#product_info-udf","title":"product_info (UDF)","text":"Returns a normalized app_name
and canonical_app_name
for a product based on legacy_app_name
and normalized_os
values. Thus, this function serves as a bridge to get from legacy application identifiers to the consistent identifiers we are using for reporting in 2021.
As of 2021, most Mozilla products are sending telemetry via the Glean SDK, with Glean telemetry in active development for desktop Firefox as well. The probeinfo
API is the single source of truth for metadata about applications sending Glean telemetry; the values for app_name
and canonical_app_name
returned here correspond to the \"end-to-end identifier\" values documented in the v2 Glean app listings endpoint . For non-Glean telemetry, we provide values in the same style to provide continuity as we continue the migration to Glean.
For legacy telemetry pings like main
ping for desktop and core
ping for mobile products, the legacy_app_name
given as input to this function should come from the submission URI (stored as metadata.uri.app_name
in BigQuery ping tables). For Glean pings, we have invented product
values that can be passed in to this function as the legacy_app_name
parameter.
The returned app_name
values are intended to be readable and unambiguous, but short and easy to type. They are suitable for use as a key in derived tables. product
is a deprecated field that was similar in intent.
The returned canonical_app_name
is more verbose and is suited for displaying in visualizations. canonical_name
is a synonym that we provide for historical compatibility with previous versions of this function.
The returned struct also contains boolean contributes_to_2021_kpi
as the canonical reference for whether the given application is included in KPI reporting. Additional fields may be added for future years.
The normalized_os
value that's passed in should be the top-level normalized_os
value present in any ping table or you may want to wrap a raw value in mozfun.norm.os
like mozfun.norm.product_info(app_name, mozfun.norm.os(os))
.
This function also tolerates passing in a product
value as legacy_app_name
so that this function is still useful for derived tables which have thrown away the raw app_name
value from legacy pings.
The mappings are as follows:
legacy_app_name normalized_os app_name product canonical_app_name 2019 2020 2021 Firefox * firefox_desktop Firefox Firefox for Desktop true true true Fenix Android fenix Fenix Firefox for Android (Fenix) true true true Fennec Android fennec Fennec Firefox for Android (Fennec) true true true Firefox Preview Android firefox_preview Firefox Preview Firefox Preview for Android true true true Fennec iOS firefox_ios Firefox iOS Firefox for iOS true true true FirefoxForFireTV Android firefox_fire_tv Firefox Fire TV Firefox for Fire TV false false false FirefoxConnect Android firefox_connect Firefox Echo Firefox for Echo Show true true false Zerda Android firefox_lite Firefox Lite Firefox Lite true true false Zerda_cn Android firefox_lite_cn Firefox Lite CN Firefox Lite (China) false false false Focus Android focus_android Focus Android Firefox Focus for Android true true true Focus iOS focus_ios Focus iOS Firefox Focus for iOS true true true Klar Android klar_android Klar Android Firefox Klar for Android false false false Klar iOS klar_ios Klar iOS Firefox Klar for iOS false false false Lockbox Android lockwise_android Lockwise Android Lockwise for Android true true false Lockbox iOS lockwise_ios Lockwise iOS Lockwise for iOS true true false FirefoxReality* Android firefox_reality Firefox Reality Firefox Reality false false false"},{"location":"mozfun/norm/#parameters_12","title":"Parameters","text":"INPUTS
legacy_app_name STRING, normalized_os STRING\n
OUTPUTS
STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n
Source | Edit
"},{"location":"mozfun/norm/#result_type_to_product_name-udf","title":"result_type_to_product_name (UDF)","text":"Convert urlbar result types into product-friendly names
This UDF converts result types from urlbar events (engagement, impression, abandonment) into product-friendly names.
"},{"location":"mozfun/norm/#parameters_13","title":"Parameters","text":"INPUTS
res STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/norm/#truncate_version-udf","title":"truncate_version (UDF)","text":"Truncates a version string like <major>.<minor>.<patch>
to either the major or minor version. The return value is NUMERIC
, which means that you can sort the results without fear (e.g. 100 will be categorized as greater than 80, which isn't the case when sorting lexigraphically).
For example, \"5.1.0\" would be translated to 5.1
if the parameter is \"minor\" or 5
if the parameter is major.
If the version is only a major and/or minor version, then it will be left unchanged (for example \"10\" would stay as 10
when run through this function, no matter what the arguments).
This is useful for grouping Linux and Mac operating system versions inside aggregate datasets or queries where there may be many different patch releases in the field.
"},{"location":"mozfun/norm/#parameters_14","title":"Parameters","text":"INPUTS
os_version STRING, truncation_level STRING\n
OUTPUTS
NUMERIC\n
Source | Edit
"},{"location":"mozfun/norm/#vpn_attribution-udf","title":"vpn_attribution (UDF)","text":"Accepts vpn attribution fields as input and returns a struct of normalized fields.
"},{"location":"mozfun/norm/#parameters_15","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRUCT<normalized_acquisition_channel STRING, normalized_campaign STRING, normalized_content STRING, normalized_medium STRING, normalized_source STRING, website_channel_group STRING>\n
Source | Edit
"},{"location":"mozfun/norm/#windows_version_info-udf","title":"windows_version_info (UDF)","text":"Given an unnormalized set off Windows identifiers, return a friendly version of the operating system name.
Requires os, os_version and windows_build_number.
E.G. from windows_build_number >= 22000 return Windows 11
"},{"location":"mozfun/norm/#parameters_16","title":"Parameters","text":"INPUTS
os STRING, os_version STRING, windows_build_number INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/serp_events/","title":"serp_events","text":"Functions for working with Glean SERP events.
"},{"location":"mozfun/serp_events/#ad_blocker_inferred-udf","title":"ad_blocker_inferred (UDF)","text":"Determine whether an ad blocker is inferred to be in use on a SERP. True if all loaded ads are blocked.
"},{"location":"mozfun/serp_events/#parameters","title":"Parameters","text":"INPUTS
num_loaded INT, num_blocked INT\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/serp_events/#is_ad_component-udf","title":"is_ad_component (UDF)","text":"Determine whether a SERP display component referenced in the serp events contains monetizable ads
"},{"location":"mozfun/serp_events/#parameters_1","title":"Parameters","text":"INPUTS
component STRING\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/stats/","title":"stats","text":"Statistics functions.
"},{"location":"mozfun/stats/#mode_last-udf","title":"mode_last (UDF)","text":"Returns the most frequently occuring element in an array.
In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored. See also: stats.mode_last_retain_nulls
, which retains nulls.
INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/stats/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":"Returns the most frequently occuring element in an array. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are retained. See also: `stats.mode_last, which ignores nulls.
"},{"location":"mozfun/stats/#parameters_1","title":"Parameters","text":"INPUTS
list ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/utils/","title":"Utils","text":""},{"location":"mozfun/utils/#diff_query_schemas-stored-procedure","title":"diff_query_schemas (Stored Procedure)","text":"Diff the schemas of two queries. Especially useful when the BigQuery error is truncated, and the schemas of e.g. a UNION don't match.
Diff the schemas of two queries. Especially useful when the BigQuery error is truncated, and the schemas of e.g. a UNION don't match.
Use it like:
DECLARE res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>;\nCALL mozfun.utils.diff_query_schemas(\"\"\"SELECT * FROM a\"\"\", \"\"\"SELECT * FROM b\"\"\", res);\n-- See entire schema entries, if you need context\nSELECT res;\n-- See just the elements that differ\nSELECT * FROM UNNEST(res) WHERE differs;\n
You'll be able to view the results of \"res\" to compare the schemas of the two queries, and hopefully find what doesn't match.
"},{"location":"mozfun/utils/#parameters","title":"Parameters","text":"INPUTS
query_a STRING, query_b STRING\n
OUTPUTS
res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>\n
Source | Edit
"},{"location":"mozfun/utils/#extract_utm_from_url-udf","title":"extract_utm_from_url (UDF)","text":"Extract UTM parameters from URL. Returns a STRUCT UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
This UDF extracts UTM parameters from a URL string.
UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
"},{"location":"mozfun/utils/#parameters_1","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRUCT<utm_source STRING, utm_medium STRING, utm_campaign STRING, utm_content STRING, utm_term STRING>\n
Source | Edit
"},{"location":"mozfun/utils/#get_url_path-udf","title":"get_url_path (UDF)","text":"Extract the Path from a URL
This UDF extracts path from a URL string.
The path is everything after the host and before parameters. This function returns \"/\" if there is no path.
"},{"location":"mozfun/utils/#parameters_2","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/","title":"vpn","text":"Functions for processing VPN data.
"},{"location":"mozfun/vpn/#acquisition_channel-udf","title":"acquisition_channel (UDF)","text":"Assign an acquisition channel based on utm parameters
"},{"location":"mozfun/vpn/#parameters","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/#channel_group-udf","title":"channel_group (UDF)","text":"Assign a channel group based on utm parameters
"},{"location":"mozfun/vpn/#parameters_1","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/#normalize_utm_parameters-udf","title":"normalize_utm_parameters (UDF)","text":"Normalize utm parameters to use the same NULL placeholders as Google Analytics
"},{"location":"mozfun/vpn/#parameters_2","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRUCT<utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING>\n
Source | Edit
"},{"location":"mozfun/vpn/#pricing_plan-udf","title":"pricing_plan (UDF)","text":"Combine the pricing and interval for a subscription plan into a single field
"},{"location":"mozfun/vpn/#parameters_3","title":"Parameters","text":"INPUTS
provider STRING, amount INTEGER, currency STRING, `interval` STRING, interval_count INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"reference/airflow_tags/","title":"Airflow Tags","text":""},{"location":"reference/airflow_tags/#why","title":"Why","text":"Airflow tags enable DAGs to be filtered in the web ui view to reduce the number of DAGs shown to just those that you are interested in.
Additionally, their objective is to provide a little bit more information such as their impact to make it easier to understand the DAG and impact of failures when doing Airflow triage.
More information and the discussions can be found the the original Airflow Tags Proposal (can be found within data org proposals/
folder).
We borrow the tiering system used by our integration and testing sheriffs. This is to maintain a level of consistency across different systems to ensure common language and understanding across teams. Valid tier tags include:
This tag is meant to provide guidance to a triage engineer on how to respond to a specific DAG failure when the job owner does not want the standard process to be followed.
The behaviour of bqetl
can be configured via the bqetl_project.yaml
file. This file, for example, specifies the queries that should be skipped during dryrun, views that should not be published and contains various other configurations.
The general structure of bqetl_project.yaml
is as follows:
dry_run:\n function: https://us-central1-moz-fx-data-shared-prod.cloudfunctions.net/bigquery-etl-dryrun\n test_project: bigquery-etl-integration-test\n skip:\n - sql/moz-fx-data-shared-prod/account_ecosystem_derived/desktop_clients_daily_v1/query.sql\n - sql/**/apple_ads_external*/**/query.sql\n # - ...\n\nviews:\n skip_validation:\n - sql/moz-fx-data-test-project/test/simple_view/view.sql\n - sql/moz-fx-data-shared-prod/mlhackweek_search/events/view.sql\n - sql/moz-fx-data-shared-prod/**/client_deduplication/view.sql\n # - ...\n skip_publishing:\n - activity_stream/tile_id_types/view.sql\n - pocket/pocket_reach_mau/view.sql\n # - ...\n non_user_facing_suffixes:\n - _derived\n - _external\n # - ...\n\nschema:\n skip_update:\n - sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/schema.yaml\n # - ...\n skip_prefixes:\n - pioneer\n - rally\n\nroutines:\n skip_publishing:\n - sql/moz-fx-data-shared-prod/udf/main_summary_scalars/udf.sql\n\nformatting:\n skip:\n - bigquery_etl/glam/templates/*.sql\n - sql/moz-fx-data-shared-prod/telemetry/fenix_events_v1/view.sql\n - stored_procedures/safe_crc32_uuid.sql\n # - ...\n
"},{"location":"reference/configuration/#accessing-configurations","title":"Accessing configurations","text":"ConfigLoader
can be used in the bigquery_etl tooling codebase to access configuration parameters. bqetl_project.yaml
is automatically loaded in ConfigLoader
and parameters can be accessed via a get()
method:
from bigquery_etl.config import ConfigLoader\n\nskipped_formatting = cfg.get(\"formatting\", \"skip\", fallback=[])\ndry_run_function = cfg.get(\"dry_run\", \"function\", fallback=None)\nschema_config_dict = cfg.get(\"schema\")\n
The ConfigLoader.get()
method allows multiple string parameters to reference a configuration value that is stored in a nested structure. A fallback
value can be optionally provided in case the configuration parameter is not set.
New configuration parameters can simply be added to bqetl_project.yaml
. ConfigLoader.get()
allows for these new parameters simply to be referenced without needing to be changed or updated.
Instructions on how to add data checks can be found in the Adding data checks section below.
"},{"location":"reference/data_checks/#background","title":"Background","text":"To create more confidence and trust in our data is crucial to provide some form of data checks. These checks should uncover problems as soon as possible, ideally as part of the data process creating the data. This includes checking that the data produced follows certain assumptions determined by the dataset owner. These assumptions need to be easy to define, but at the same time flexible enough to encode more complex business logic. For example, checks for null columns, for range/size properties, duplicates, table grain etc.
"},{"location":"reference/data_checks/#bqetl-data-checks-to-the-rescue","title":"bqetl Data Checks to the Rescue","text":"bqetl data checks aim to provide this ability by providing a simple interface for specifying our \"assumptions\" about the data the query should produce and checking them against the actual result.
This easy interface is achieved by providing a number of jinja templates providing \"out-of-the-box\" logic for performing a number of common checks without having to rewrite the logic. For example, checking if any nulls are present in a specific column. These templates can be found here and are available as jinja macros inside the checks.sql
files. This allows to \"configure\" the logic by passing some details relevant to our specific dataset. Check templates will get rendered as raw SQL expressions. Take a look at the examples below for practical examples.
It is also possible to write checks using raw SQL by using assertions. This is, for example, useful when writing checks for custom business logic.
"},{"location":"reference/data_checks/#two-categories-of-checks","title":"Two categories of checks","text":"Each check needs to be categorised with a marker, currently following markers are available:
#fail
indicates that the ETL pipeline should stop if this check fails (circuit-breaker pattern) and a notification is sent out. This marker should be used for checks that indicate a serious data issue.#warn
indicates that the ETL pipeline should continue even if this check fails. These type of checks can be used to indicate potential issues that might require more manual investigation.Checks can be marked by including one of the markers on the line preceeding the check definition, see Example checks.sql section for an example.
"},{"location":"reference/data_checks/#adding-data-checks","title":"Adding Data Checks","text":""},{"location":"reference/data_checks/#create-checkssql","title":"Create checks.sql","text":"Inside the query directory, which usually contains query.sql
or query.py
, metadata.yaml
and schema.yaml
, create a new file called checks.sql
(unless already exists).
Please make sure each check you add contains a marker (see: the Two categories of checks section above).
Once checks have been added, we need to regenerate the DAG
responsible for scheduling the query.
If checks.sql
already exists for the query, you can always add additional checks to the file by appending it to the list of already defined checks.
When adding additional checks there should be no need to have to regenerate the DAG responsible for scheduling the query as all checks are executed using a single Airflow task.
"},{"location":"reference/data_checks/#removing-checkssql","title":"Removing checks.sql","text":"All checks can be removed by deleting the checks.sql
file and regenerating the DAG responsible for scheduling the query.
Alternatively, specific checks can be removed by deleting them from the checks.sql
file.
Checks can either be written as raw SQL, or by referencing existing Jinja macros defined in tests/checks
which may take different parameters used to generate the SQL check expression.
Example of what a checks.sql
may look like:
-- raw SQL checks\n#fail\nASSERT (\n SELECT\n COUNTIF(ISNULL(country)) / COUNT(*)\n FROM telemetry.table_v1\n WHERE submission_date = @submission_date\n ) > 0.2\n) AS \"More than 20% of clients have country set to NULL\";\n\n-- macro checks\n#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n\n#warn\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n\n#fail\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n\n#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#data-checks-available-with-examples","title":"Data Checks Available with Examples","text":""},{"location":"reference/data_checks/#accepted_values-source","title":"accepted_values (source)","text":"Usage:
Arguments:\n\ncolumn: str - name of the column to check\nvalues: List[str] - list of accepted values\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ accepted_values(\"column_1\", [\"value_1\", \"value_2\"],\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#in_range-source","title":"in_range (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which we want to check the values of.\nmin: Optional[int] - Minimum value we should observe in the specified columns.\nmax: Optional[int] - Maximum value we should observe in the specified columns.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#is_unique-source","title":"is_unique (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which should produce a unique record.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n
"},{"location":"reference/data_checks/#min_row_countsource","title":"min_row_count(source)","text":"Usage:
Arguments:\n\nthreshold: Optional[int] - What is the minimum number of rows we expect (default: 1)\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#fail\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#not_null-source","title":"not_null (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which should not contain a null value.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n
Please keep in mind that the checks below can be combined and specified in the same checks.sql
file. For example:
#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n #fail\n {{ min_row_count(1, \"submission_date = @submission_date\") }}\n #fail\n {{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n #warn\n {{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#row_count_within_past_partitions_avgsource","title":"row_count_within_past_partitions_avg(source)","text":"Compares the row count of the current partition to the average of number_of_days
past partitions and checks if the row count is within the average +- threshold_percentage
%
Usage:
Arguments:\n\nnumber_of_days: int - Number of days we are comparing the row count to\nthreshold_percentage: int - How many percent above or below the average row count is ok.\npartition_field: Optional[str] - What column is the partition_field (default = \"submission_date\")\n
Example:
#fail\n{{ row_count_within_past_partitions_avg(7, 5, \"submission_date\") }}\n
"},{"location":"reference/data_checks/#value_lengthsource","title":"value_length(source)","text":"Checks that the column has values of specific character length.
Usage:
Arguments:\n\ncolumn: str - Column which will be checked against the `expected_length`.\nexpected_length: int - Describes the expected character length of the value inside the specified columns.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\n
Example:
#warn\n{{ value_length(column=\"country\", expected_length=2, where=\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#matches_patternsource","title":"matches_pattern(source)","text":"Checks that the column values adhere to a pattern based on a regex expression.
Usage:
Arguments:\n\ncolumn: str - Column which values will be checked against the regex.\npattern: str - Regex pattern specifying the expected shape / pattern of the values inside the column.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\nthreshold_fail_percentage: Optional[int] - Percentage of how many rows can fail the check before causing it to fail.\nmessage: Optional[str]: Custom error message.\n
Example:
#warn\n{{ matches_pattern(column=\"country\", pattern=\"^[A-Z]{2}$\", where=\"submission_date = @submission_date\", threshold_fail_percentage=10, message=\"Oops\") }}\n
"},{"location":"reference/data_checks/#running-checks-locally-commands","title":"Running checks locally / Commands","text":"To list all available commands in the bqetl data checks CLI:
$ ./bqetl check\n\nUsage: bqetl check [OPTIONS] COMMAND [ARGS]...\n\n Commands for managing and running bqetl data checks.\n\n \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\n IN ACTIVE DEVELOPMENT\n\n The current progress can be found under:\n\n https://mozilla-hub.atlassian.net/browse/DENG-919\n\n \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\nOptions:\n --help Show this message and exit.\n\nCommands:\n render Renders data check query using parameters provided (OPTIONAL).\n run Runs data checks defined for the dataset (checks.sql).\n
To see how to use a specific command, use:
$ ./bqetl check [command] --help\n
render
$ ./bqetl check render [OPTIONS] DATASET [ARGS]\n\nRenders data check query using parameters provided (OPTIONAL). The result\nis what would be used to run a check to ensure that the specified dataset\nadheres to the assumptions defined in the corresponding checks.sql file\n\nOptions:\n --project-id, --project_id TEXT\n GCP project ID\n --sql_dir, --sql-dir DIRECTORY Path to directory which contains queries.\n --help Show this message and exit.\n
"},{"location":"reference/data_checks/#example","title":"Example","text":"./bqetl check render --project_id=moz-fx-data-marketing-prod ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01\n
run
$ ./bqetl check run [OPTIONS] DATASET\n\nRuns data checks defined for the dataset (checks.sql).\n\nChecks can be validated using the `--dry_run` flag without executing them:\n\nOptions:\n --project-id, --project_id TEXT\n GCP project ID\n --sql_dir, --sql-dir DIRECTORY Path to directory which contains queries.\n --dry_run, --dry-run To dry run the query to make sure it is\n valid\n --marker TEXT Marker to filter checks.\n --help Show this message and exit.\n
"},{"location":"reference/data_checks/#examples","title":"Examples","text":"# to run checks for a specific dataset\n$ ./bqetl check run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail --marker=warn\n\n# to only dry_run the checks\n$ ./bqetl check run --dry_run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail\n
"},{"location":"reference/incremental/","title":"Incremental Queries","text":""},{"location":"reference/incremental/#benefits","title":"Benefits","text":"WRITE_TRUNCATE
mode or bq query --replace
to replace partitions atomically to prevent duplicate data@submission_date
query parametersubmission_date
matching the query parametersql/moz-fx-data-shared-prod/clients_last_seen_v1.sql
can be run serially on any 28 day period and the last day will be the same whether or not the partition preceding the first day was missing because values are only impacted by 27 preceding daysFor background, see Accessing Public Data on docs.telemetry.mozilla.org
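As a rough illustration of the atomic-replacement pattern outside of Airflow (a sketch only; the table and date are hypothetical, and the flags are standard bq CLI options):
bq query --use_legacy_sql=false --replace \\\n  --destination_table 'moz-fx-data-shared-prod:telemetry_derived.ssl_ratios_v1$20230101' \\\n  --parameter submission_date:DATE:2023-01-01 \\\n  < query.sql\n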
.
public_bigquery
flag must be set in metadata.yaml
mozilla-public-data
GCP project which is accessible by everyone, also external userspublic_json
flag must be set in metadata.yaml
000000000000.json
, 000000000001.json
, ...)incremental_export
controls how data should be exported as JSON:false
: all data of the source table gets exported to a single locationtrue
: only data that matches the submission_date
parameter is exported as JSON to a separate directory for this datemetadata.json
gets published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.jsonlast_updated
, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updatedsql/<project>/<dataset>/<table>_<version>/query.sql
e.g.<project>
defines both where the destination table resides and in which project the query job runs sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql
sql/<project>/<dataset>/<table>_<version>/query.sql
as abovesql/<project>/query_type.sql.py
e.g. sql/moz-fx-data-shared-prod/clients_daily.sql.py
--source telemetry_core_parquet_v3
to generate sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql
and using --source main_summary_v4
to generate sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql
-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet\n
moz-fx-data-shared-prod
the project prefix should be omitted to simplify testing. (Other projects do need the project prefix)_
prefix in generated column names not meant for output_bits
suffix for any integer column that represents a bit patternDATETIME
type, due to incompatibility with spark-bigquery-connector*_stable
tables instead of including custom deduplicationdocument_id
by submission_timestamp
where filtering duplicates is necessarymozdata
project which are duplicates of views in another project (commonly moz-fx-data-shared-prod
). Refer to the original view instead.{{ metrics.calculate() }}
: SELECT\n *\nFROM\n {{ metrics.calculate(\n metrics=['days_of_use', 'active_hours'],\n platform='firefox_desktop',\n group_by={'sample_id': 'sample_id', 'channel': 'application.channel'},\n where='submission_date = \"2023-01-01\"'\n ) }}\n\n-- this translates to\nSELECT\n *\nFROM\n (\n WITH clients_daily AS (\n SELECT\n client_id AS client_id,\n submission_date AS submission_date,\n COALESCE(SUM(active_hours_sum), 0) AS active_hours,\n COUNT(submission_date) AS days_of_use,\n FROM\n mozdata.telemetry.clients_daily\n GROUP BY\n client_id,\n submission_date\n )\n SELECT\n clients_daily.client_id,\n clients_daily.submission_date,\n active_hours,\n days_of_use,\n FROM\n clients_daily\n )\n
metrics
: unique reference(s) to metric definition, all metric definitions are aggregations (e.g. SUM, AVG, ...)platform
: platform to compute metrics for (e.g. firefox_desktop
, firefox_ios
, fenix
, ...)group_by
: fields used in the GROUP BY statement; this is a dictionary where the key represents the alias, the value is the field path; GROUP BY
always includes the configured client_id
and submission_date
fieldswhere
: SQL filter clausegroup_by_client_id
: Whether the field configured as client_id
(defined as part of the data source specification in metric-hub) should be part of the GROUP BY
. True
by defaultgroup_by_submission_date
: Whether the field configured as submission_date
(defined as part of the data source specification in metric-hub) should be part of the GROUP BY
. True
by default{{ metrics.data_source() }}
: SELECT\n *\nFROM\n {{ metrics.data_source(\n data_source='main',\n platform='firefox_desktop',\n where='submission_date = \"2023-01-01\"'\n ) }}\n\n-- this translates to\nSELECT\n *\nFROM\n (\n SELECT *\n FROM `mozdata.telemetry.main`\n WHERE submission_date = \"2023-01-01\"\n )\n
./bqetl query render path/to/query.sql
generated-sql
branch has rendered queries/views/UDFs./bqetl query run
does support running Jinja queriesmetadata.yaml
file should be created in the same directoryfriendly_name: SSL Ratios\ndescription: >\n Percentages of page loads Firefox users have performed that were\n conducted over SSL broken down by country.\nowners:\n - example@mozilla.com\nlabels:\n application: firefox\n incremental: true # incremental queries add data to existing tables\n schedule: daily # scheduled in Airflow to run daily\n public_json: true\n public_bigquery: true\n review_bugs:\n - 1414839 # Bugzilla bug ID of data review\n incremental_export: false # non-incremental JSON export writes all data to a single location\n
sql/<project>/<dataset>/<table>/view.sql
e.g. sql/moz-fx-data-shared-prod/telemetry/core/view.sql
fx-data-dev@mozilla.org
moz-fx-data-shared-prod
project; the scripts/publish_views
tooling can handle parsing the definitions to publish to other projects such as derived-datasets
mozdata
project which are duplicates of views in another project (commonly moz-fx-data-shared-prod
). Refer to the original view instead.BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
.sql
e.g. mode_last.sql
udf/
directory and JS UDFs must be defined in the udf_js
directoryudf_legacy/
directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto.CREATE OR REPLACE FUNCTION
syntax<dir_name>.
so, for example, all functions in udf/*.sql
are part of the udf
datasetCREATE OR REPLACE FUNCTION <dir_name>.<file_name>
scripts/publish_persistent_udfs
for publishing these UDFs to BigQuerySQL
over js
for performanceNULL
for new data and EXCEPT
to exclude from views until droppedSELECT\n job_type,\n state,\n submission_date,\n destination_dataset_id,\n destination_table_id,\n total_terabytes_billed,\n total_slot_ms,\n error_location,\n error_reason,\n error_message\nFROM\n moz-fx-data-shared-prod.monitoring.bigquery_usage\nWHERE\n submission_date <= CURRENT_DATE()\n AND destination_dataset_id LIKE \"%backfills_staging_derived%\"\n AND destination_table_id LIKE \"%{{ your table name }}%\"\nORDER BY\n submission_date DESC\n
dags.yaml
dags.yaml
, e.g., by adding the following: bqetl_ssl_ratios: # name of the DAG; must start with bqetl_\n schedule_interval: 0 2 * * * # query schedule\n description: The DAG schedules SSL ratios queries.\n default_args:\n owner: example@mozilla.com\n start_date: \"2020-04-05\" # YYYY-MM-DD\n email: [\"example@mozilla.com\"]\n retries: 2 # number of retries if the query execution fails\n retry_delay: 30m\n
bqetl_
as prefix.schedule_interval
is either defined as a CRON expression or alternatively as one of the following CRON presets: once
, hourly
, daily
, weekly
, monthly
start_date
defines the first date for which the query should be executedstart_date
is set in the past, backfilling can be done via the Airflow web interfaceemail
lists email addresses alerts should be sent to in case of failures when running the querybqetl
CLI by running bqetl dag create bqetl_ssl_ratios --schedule_interval='0 2 * * *' --owner=\"example@mozilla.com\" --start_date=\"2020-04-05\" --description=\"This DAG generates SSL ratios.\"
metadata.yaml
file that includes a scheduling
section, for example: friendly_name: SSL ratios\n# ... more metadata, see Query Metadata section above\nscheduling:\n dag_name: bqetl_ssl_ratios\n
depends_on_past
keeps query from getting executed if the previous schedule for the query hasn't succeededdate_partition_parameter
- by default set to submission_date
; can be set to null
if query doesn't write to a partitioned tableparameters
specifies a list of query parameters, e.g. [\"n_clients:INT64:500\"]
arguments
- a list of arguments passed when running the query, for example: [\"--append_table\"]
referenced_tables
- manually curated list of tables a Python or BigQuery script depends on; for query.sql
files dependencies will get determined automatically and should only be overwritten manually if really necessarymultipart
indicates whether a query is split over multiple files part1.sql
, part2.sql
, ...depends_on
defines external dependencies in telemetry-airflow that are not detected automatically: depends_on:\n - task_id: external_task\n dag_name: external_dag\n execution_delta: 1h\n
task_id
: name of task query depends ondag_name
: name of the DAG the external task is part ofexecution_delta
: time difference between the schedule_intervals
of the external DAG and the DAG the query is part ofdepends_on_tables_existing
defines tables that the ETL will await the existence of via an Airflow sensor before running: depends_on_tables_existing:\n - task_id: wait_for_foo_bar_baz\n table_id: 'foo.bar.baz_{{ ds_nodash }}'\n poke_interval: 30m\n timeout: 12h\n retries: 1\n retry_delay: 10m\n
task_id
: ID to use for the generated Airflow sensor task.table_id
: Fully qualified ID of the table to wait for, including the project and dataset.poke_interval
: Time that the sensor should wait in between each check, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default poke interval is 5 minutes).timeout
: Time allowed before the sensor times out and fails, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default timeout is 8 hours).retries
: The number of retries that should be performed if the sensor times out or otherwise fails. This parameter is optional (the default depends on how the DAG is configured).retry_delay
: Time delay between retries, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default depends on how the DAG is configured).depends_on_table_partitions_existing
defines table partitions that the ETL will await the existence of via an Airflow sensor before running: depends_on_table_partitions_existing:\n - task_id: wait_for_foo_bar_baz\n table_id: foo.bar.baz\n partition_id: '{{ ds_nodash }}'\n poke_interval: 30m\n timeout: 12h\n retries: 1\n retry_delay: 10m\n
task_id
: ID to use for the generated Airflow sensor task.table_id
: Fully qualified ID of the table to check, including the project and dataset. Note that the service account airflow-access@moz-fx-data-shared-prod.iam.gserviceaccount.com
will need to have the BigQuery Job User role on the project and read access to the dataset.partition_id
: ID of the partition to wait for.poke_interval
: Time that the sensor should wait in between each check, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default poke interval is 5 minutes).timeout
: Time allowed before the sensor times out and fails, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default timeout is 8 hours).retries
: The number of retries that should be performed if the sensor times out or otherwise fails. This parameter is optional (the default depends on how the DAG is configured).retry_delay
: Time delay between retries, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default depends on how the DAG is configured).trigger_rule
: The rule that determines when the airflow task that runs this query should run. The default is all_success
(\"trigger this task when all directly upstream tasks have succeeded\"); other rules can allow a task to run even if not all preceding tasks have succeeded. See the Airflow docs for the list of trigger rule options.destination_table
: The table to write to. If unspecified, defaults to the query destination; if None, no destination table is used (the query is simply run as-is). Note that if no destination table is specified, you will need to specify the submission_date
parameter manuallyexternal_downstream_tasks
defines external downstream dependencies for which ExternalTaskMarker
s will be added to the generated DAG. These task markers ensure that when the task is cleared for triggering a rerun, all downstream tasks are automatically cleared as well. external_downstream_tasks:\n - task_id: external_downstream_task\n dag_name: external_dag\n execution_delta: 1h\n
bqetl
CLI: ./bqetl query schedule path/to/query_v1 --dag bqetl_ssl_ratios
./bqetl dag generate
dags/
directory./bqetl dag generate bqetl_ssl_ratios
main
. CI automatically generates DAGs and writes them to the telemetry-airflow-dags repo from where Airflow will pick them updepends_on_fivetran:\n - task_id: fivetran_import_1\n - task_id: another_fivetran_import\n
<task_id>_connector_id
in the Airflow admin interface for each import taskBefore changes, such as adding new fields to existing datasets or adding new datasets, can be deployed to production, bigquery-etl's CI (continuous integration) deploys these changes to a stage environment and uses these stage artifacts to run its various checks.
Currently, the bigquery-etl-integration-test
project serves as the stage environment. CI does have read and write access, but does at no point publish actual data to this project. Only UDFs, table schemas and views are published. The project itself does not have access to any production project, like mozdata
, so stage artifacts cannot reference any other artifacts that live in production.
Deploying artifacts to stage follows the following steps: 1. Once a new pull-request gets created in bigquery-etl, CI will pull in the generated-sql
branch to determine all files that show any changes compared to what is deployed in production (it is assumed that the generated-sql
branch reflects the artifacts currently deployed in production). All of these changed artifacts (UDFs, tables and views) will be deployed to the stage environment. * This CI step runs after the generate-sql
CI step to ensure that checks will also be executed on generated queries and to ensure schema.yaml
files have been automatically created for queries. 2. The bqetl
CLI has a command to run stage deploys, which is called in the CI: ./bqetl stage deploy --dataset-suffix=$CIRCLE_SHA1 $FILE_PATHS
* --dataset-suffix
will result in the artifacts being deployed to datasets that are suffixed by the current commit hash. This is to prevent any conflicts when deploying changes for the same artifacts in parallel and helps with debugging deployed artifacts. 3. For every artifacts that gets deployed to stage all dependencies need to be determined and deployed to the stage environment as well since the stage environment doesn't have access to production. Before these artifacts get actually deployed, they need to be determined first by traversing artifact definitions. * Determining dependencies is only relevant for UDFs and views. For queries, available schema.yaml
files will simply be deployed. * For UDFs, if a UDF does call another UDF then this UDF needs to be deployed to stage as well. * For views, if a view references another view, table or UDF then each of these referenced artifacts needs to be available on stage as well, otherwise the view cannot even be deployed to stage. * If artifacts are referenced that are not defined as part of the bigquery-etl repo (like stable or live tables) then their schema will get determined and a placeholder query.sql
file will be created * Also dependencies of dependencies need to be deployed, and so on 4. Once all artifacts that need to be deployed have been determined, all references to these artifacts in existing SQL files need to be updated. These references will need to point to the stage project and the temporary datasets that artifacts will be published to. * Artifacts that get deployed are determined from the files that got changed and any artifacts that are referenced in the SQL definitions of these files, as well as their references and so on. 5. To run the deploy, all artifacts will be copied to sql/bigquery-etl-integration-test
into their corresponding temporary datasets. * Also if any existing SQL tests the are related to changed artifacts will have their referenced artifacts updated and will get copied to a bigquery-etl-integration-test
folder * The deploy is executed in the order of: UDFs, tables, views * UDFs and views get deployed in a way that ensures that the right order of deployments (e.g. dependencies need to be deployed before the views referencing them) 6. Once the deploy has been completed, the CI will use these staged artifacts to run its tests 7. After checks have succeeded, the deployed artifacts will be removed from stage * By default the table expiration is set to 1 hour * This step will also automatically remove any tables and datasets that got previously deployed, are older than an hour but haven't been removed (for example due to some CI check failing)
After CI checks have passed and the pull-request has been approved, changes can be merged to main
. Once a new version of bigquery-etl has been published the changes can be deployed to production through the bqetl_artifact_deployment
Airflow DAG. For more information on artifact deployments to production see: https://docs.telemetry.mozilla.org/concepts/pipeline/artifact_deployment.html
Local changes can be deployed to stage using the ./bqetl stage deploy
command:
./bqetl stage deploy \\\n --dataset-suffix=test \\\n --copy-sql-to-tmp-dir \\\n sql/moz-fx-data-shared-prod/firefox_ios/new_profile_activation/view.sql \\\n sql/mozfun/map/sum/udf.sql\n
Files (for example ones with changes) that should be deployed to stage need to be specified. The stage deploy
accepts the following parameters: * --dataset-suffix
is an optional suffix that will be added to the datasets deployed to stage * --copy-sql-to-tmp-dir
copies SQL stored in sql/
to a temporary folder. Reference updates and any other modifications required to run the stage deploy will be performed in this temporary directory. This is an optional parameter. If not specified, changes get applied to the files directly and can be reverted, for example, by running git checkout -- sql/
* (optional) --remove-updated-artifacts
removes artifact files that have been deployed from the \"prod\" folders. This ensures that tests don't run on outdated or undeployed artifacts.
Deployed stage artifacts can be deleted from bigquery-etl-integration-test
by running:
./bqetl stage clean --delete-expired --dataset-suffix=test\n
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"bqetl/","title":"bqetl CLI","text":"The bqetl
command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.
Running some commands, for example to create or query tables, will require Mozilla GCP access.
"},{"location":"bqetl/#installation","title":"Installation","text":"Follow the Quick Start to set up bigquery-etl and the bqetl CLI.
"},{"location":"bqetl/#configuration","title":"Configuration","text":"bqetl
can be configured via the bqetl_project.yaml
file. See Configuration to find available configuration options.
To list all available commands in the bqetl CLI:
$ ./bqetl\n\nUsage: bqetl [OPTIONS] COMMAND [ARGS]...\n\n CLI tools for working with bigquery-etl.\n\nOptions:\n --version Show the version and exit.\n --help Show this message and exit.\n\nCommands:\n alchemer Commands for importing alchemer data.\n dag Commands for managing DAGs.\n dependency Build and use query dependency graphs.\n dryrun Dry run SQL.\n format Format SQL.\n glam Tools for GLAM ETL.\n mozfun Commands for managing mozfun routines.\n query Commands for managing queries.\n routine Commands for managing routines.\n stripe Commands for Stripe ETL.\n view Commands for managing views.\n backfill Commands for managing backfills.\n
See help for any command:
$ ./bqetl [command] --help\n
"},{"location":"bqetl/#autocomplete","title":"Autocomplete","text":"CLI autocomplete for bqetl
can be enabled for bash and zsh shells using the script/bqetl_complete
script:
source script/bqetl_complete\n
Then pressing tab after bqetl
commands should print possible commands, e.g. for zsh:
% bqetl query<TAB><TAB>\nbackfill -- Run a backfill for a query.\ncreate -- Create a new query with name...\ninfo -- Get information about all or specific...\ninitialize -- Run a full backfill on the destination...\nrender -- Render a query Jinja template.\nrun -- Run a query.\n...\n
source script/bqetl_complete
can also be added to ~/.bashrc
or ~/.zshrc
to persist settings across shell instances.
For more details on shell completion, see the click documentation.
"},{"location":"bqetl/#query","title":"query
","text":"Commands for managing queries.
"},{"location":"bqetl/#create","title":"create
","text":"Create a new query with name ., for example: telemetry_derived.active_profiles. Use the --project_id
option to change the project the query is added to; default is moz-fx-data-shared-prod
. Views are automatically generated in the publicly facing dataset.
Usage
$ ./bqetl query create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--owner: Owner of the query (email address)\n--dag: Name of the DAG the query should be scheduled under.If there is no DAG name specified, the query isscheduled by default in DAG bqetl_default.To skip the automated scheduling use --no_schedule.To see available DAGs run `bqetl dag info`.To create a new DAG run `bqetl dag create`.\n--no_schedule: Using this option creates the query without scheduling information. Use `bqetl query schedule` to add it manually if required.\n
Examples
./bqetl query create telemetry_derived.deviations_v1 \\\n --owner=example@mozilla.com\n\n\n# The query version gets autocompleted to v1. Queries are created in the\n# _derived dataset and accompanying views in the public dataset.\n./bqetl query create telemetry.deviations --owner=example@mozilla.com\n
"},{"location":"bqetl/#schedule","title":"schedule
","text":"Schedule an existing query
Usage
$ ./bqetl query schedule [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--dag: Name of the DAG the query should be scheduled under. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.\n--depends_on_past: Only execute query if previous scheduled run succeeded.\n--task_name: Custom name for the Airflow task. By default the task name is a combination of the dataset and table name.\n
Examples
./bqetl query schedule telemetry_derived.deviations_v1 \\\n --dag=bqetl_deviations\n\n\n# Set a specific name for the task\n./bqetl query schedule telemetry_derived.deviations_v1 \\\n --dag=bqetl_deviations \\\n --task-name=deviations\n
"},{"location":"bqetl/#info","title":"info
","text":"Get information about all or specific queries.
Usage
$ ./bqetl query info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Get info for specific queries\n./bqetl query info telemetry_derived.*\n\n\n# Get cost and last update timestamp information\n./bqetl query info telemetry_derived.clients_daily_v6 \\\n --cost --last_updated\n
"},{"location":"bqetl/#backfill","title":"backfill
","text":"Run a backfill for a query. Additional parameters will get passed to bq.
Usage
$ ./bqetl query backfill [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--start_date: First date to be backfilled\n--end_date: Last date to be backfilled\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--dry_run: Dry run the backfill\n--max_rows: How many rows to return in the result\n--parallelism: How many threads to run backfill in parallel\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query.\n--checks: Whether to run checks during backfill\n--custom_query_path: Name of a custom query to run the backfill. If not given, the proces runs as usual.\n--checks_file_name: Name of a custom data checks file to run after each partition backfill. E.g. custom_checks.sql. Optional.\n--scheduling_overrides: Pass overrides as a JSON string for scheduling sections: parameters and/or date_partition_parameter as needed.\n
Examples
# Backfill for specific date range\n# second comment line\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31\n\n\n# Dryrun backfill for specific date range and exclude date\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31 \\\n --exclude=2021-03-03 \\\n --dry_run\n
"},{"location":"bqetl/#run","title":"run
","text":"Run a query. Additional parameters will get passed to bq. If a destination_table is set, the query result will be written to BigQuery. Without a destination_table specified, the results are not stored. If the name
is not found within the sql/
folder bqetl assumes it hasn't been generated yet and will start the generating process for all sql_generators/
files. This generation process will take some time and run dryrun calls against BigQuery but this is expected. Additional parameters (all parameters that are not specified in the Options) must come after the query-name. Otherwise the first parameter that is not an option is interpreted as the query-name and since it can't be found the generation process will start.
Usage
$ ./bqetl query run [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--public_project_id: Project with publicly accessible data\n--destination_table: Destination table name results are written to. If not set, the query result will not be written to BigQuery.\n--dataset_id: Destination dataset results are written to. If not set, determines destination dataset based on query.\n
Examples
# Run a query by name\n./bqetl query run telemetry_derived.ssl_ratios_v1\n\n\n# Run a query file\n./bqetl query run /path/to/query.sql\n\n\n# Run a query and save the result to BigQuery\n./bqetl query run telemetry_derived.ssl_ratios_v1 --project_id=moz-fx-data-shared-prod --dataset_id=telemetry_derived --destination_table=ssl_ratios_v1\n
"},{"location":"bqetl/#run-multipart","title":"run-multipart
","text":"Run a multipart query.
Usage
$ ./bqetl query run-multipart [OPTIONS] [query_dir]\n\nOptions:\n\n--using: comma separated list of join columns to use when combining results\n--parallelism: Maximum number of queries to execute concurrently\n--dataset_id: Default dataset, if not specified all tables must be qualified with dataset\n--project_id: GCP project ID\n--temp_dataset: Dataset where intermediate query results will be temporarily stored, formatted as PROJECT_ID.DATASET_ID\n--destination_table: table where combined results will be written\n--time_partitioning_field: time partition field on the destination table\n--clustering_fields: comma separated list of clustering fields on the destination table\n--dry_run: Print bytes that would be processed for each part and don't run queries\n--parameters: query parameter(s) to pass when running parts\n--priority: Priority for BigQuery query jobs; BATCH priority will significantly slow down queries if reserved slots are not enabled for the billing project; defaults to INTERACTIVE\n--schema_update_options: Optional options for updating the schema.\n
Examples
# Run a multipart query\n./bqetl query run_multipart /path/to/query.sql\n
"},{"location":"bqetl/#validate","title":"validate
","text":"Validate a query. Checks formatting, scheduling information and dry runs the query.
Usage
$ ./bqetl query validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --ignore-dryrun-skip.\n--no_dryrun: Skip running dryrun. Default is False.\n
Examples
./bqetl query validate telemetry_derived.clients_daily_v6\n\n\n# Validate query not in shared-prod\n./bqetl query validate \\\n --use_cloud_function=false \\\n --project_id=moz-fx-data-marketing-prod \\\n ga_derived.blogs_goals_v1\n
"},{"location":"bqetl/#initialize","title":"initialize
","text":"Run a full backfill on the destination table for the query. Using this command will: - Create the table if it doesn't exist and run a full backfill. - Run a full backfill if the table exists and is empty. - Raise an exception if the table exists and has data, or if the table exists and the schema doesn't match the query. It supports query.sql
files that use the is_init() pattern. To run in parallel per sample_id, include a @sample_id parameter in the query.
Usage
$ ./bqetl query initialize [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--dry_run: Dry run the initialization\n--parallelism: Number of threads for parallel processing\n--skip_existing: Skip initialization for existing artifacts, otherwise initialization is run for empty tables.\n--force: Run the initialization even if the destination table contains data.\n
Examples
Examples:\n - For init.sql files: ./bqetl query initialize telemetry_derived.ssl_ratios_v1\n - For query.sql files and parallel run: ./bqetl query initialize sql/moz-fx-data-shared-prod/telemetry_derived/clients_first_seen_v2/query.sql\n
"},{"location":"bqetl/#render","title":"render
","text":"Render a query Jinja template.
Usage
$ ./bqetl query render [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--output_dir: Output directory generated SQL is written to. If not specified, rendered queries are printed to console.\n--parallelism: Number of threads for parallel processing\n
Examples
./bqetl query render telemetry_derived.ssl_ratios_v1 \\\n --output-dir=/tmp\n
"},{"location":"bqetl/#schema","title":"schema
","text":"Commands for managing query schemas.
"},{"location":"bqetl/#update","title":"update
","text":"Update the query schema based on the destination table schema and the query schema. If no schema.yaml file exists for a query, one will be created.
Usage
$ ./bqetl query schema update [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--update_downstream: Update downstream dependencies. GCP authentication required.\n--tmp_dataset: GCP datasets for creating updated tables temporarily.\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--parallelism: Number of threads for parallel processing\n--is_init: Indicates whether the `is_init()` condition should be set to true of false.\n
Examples
./bqetl query schema update telemetry_derived.clients_daily_v6\n\n# Update schema including downstream dependencies (requires GCP)\n./bqetl query schema update telemetry_derived.clients_daily_v6 --update-downstream\n
"},{"location":"bqetl/#deploy","title":"deploy
","text":"Deploy the query schema.
Usage
$ ./bqetl query schema deploy [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--force: Deploy the schema file without validating that it matches the query\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--skip_existing: Skip updating existing tables. This option ensures that only new tables get deployed.\n--skip_external_data: Skip publishing external data, such as Google Sheets.\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query. Must be fully qualified (project.dataset.table).\n--parallelism: Number of threads for parallel processing\n
Examples
./bqetl query schema deploy telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#validate_1","title":"validate
","text":"Validate the query schema
Usage
$ ./bqetl query schema validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n
Examples
./bqetl query schema validate telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#dag","title":"dag
","text":"Commands for managing DAGs.
"},{"location":"bqetl/#info_1","title":"info
","text":"Get information about available DAGs.
Usage
$ ./bqetl dag info [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--with_tasks: Include scheduled tasks\n
Examples
# Get information about all available DAGs\n./bqetl dag info\n\n# Get information about a specific DAG\n./bqetl dag info bqetl_ssl_ratios\n\n# Get information about a specific DAG including scheduled tasks\n./bqetl dag info --with_tasks bqetl_ssl_ratios\n
"},{"location":"bqetl/#create_1","title":"create
","text":"Create a new DAG with name bqetl_, for example: bqetl_search When creating new DAGs, the DAG name must have a bqetl_
prefix. Created DAGs are added to the dags.yaml
file.
Usage
$ ./bqetl dag create [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--schedule_interval: Schedule interval of the new DAG. Schedule intervals can be either in CRON format or one of: once, hourly, daily, weekly, monthly, yearly or a timedelta []d[]h[]m\n--owner: Email address of the DAG owner\n--description: Description for DAG\n--tag: Tag to apply to the DAG\n--start_date: First date for which scheduled queries should be executed\n--email: Email addresses that Airflow will send alerts to\n--retries: Number of retries Airflow will attempt in case of failures\n--retry_delay: Time period Airflow will wait after failures before running failed tasks again\n
Examples
./bqetl dag create bqetl_core \\\n--schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"Tables derived from `core` pings sent by mobile applications.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-25\n\n\n# Create DAG and overwrite default settings\n./bqetl dag create bqetl_ssl_ratios --schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"The DAG schedules SSL ratios queries.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-20 \\\n--email=example2@mozilla.com \\\n--email=example3@mozilla.com \\\n--retries=2 \\\n--retry_delay=30m\n
"},{"location":"bqetl/#generate","title":"generate
","text":"Generate Airflow DAGs from DAG definitions.
Usage
$ ./bqetl dag generate [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n
Examples
# Generate all DAGs\n./bqetl dag generate\n\n# Generate a specific DAG\n./bqetl dag generate bqetl_ssl_ratios\n
"},{"location":"bqetl/#remove","title":"remove
","text":"Remove a DAG. This will also remove the scheduling information from the queries that were scheduled as part of the DAG.
Usage
$ ./bqetl dag remove [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n
Examples
# Remove a specific DAG\n./bqetl dag remove bqetl_vrbrowser\n
"},{"location":"bqetl/#dependency","title":"dependency
","text":"Build and use query dependency graphs.
"},{"location":"bqetl/#show","title":"show
","text":"Show table references in sql files.
Usage
$ ./bqetl dependency show [OPTIONS] [paths]\n
"},{"location":"bqetl/#record","title":"record
","text":"Record table references in metadata. Fails if metadata already contains references section.
Usage
$ ./bqetl dependency record [OPTIONS] [paths]\n
"},{"location":"bqetl/#dryrun","title":"dryrun
","text":"Dry run SQL. Uses the dryrun Cloud Function by default which only has access to shared-prod. To dryrun queries accessing tables in another project use set --use-cloud-function=false
and ensure that the command line has access to a GCP service account.
Usage
$ ./bqetl dryrun [OPTIONS] [paths]\n\nOptions:\n\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_skip: Respect or ignore query skip configuration. Default is --respect-skip.\n--project: GCP project to perform dry run in when --use_cloud_function=False\n
Examples
Examples:\n./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/\n\n# Dry run SQL with tables that are not in shared prod\n./bqetl dryrun --use-cloud-function=false sql/moz-fx-data-marketing-prod/\n
"},{"location":"bqetl/#format","title":"format
","text":"Format SQL files.
Usage
$ ./bqetl format [OPTIONS] [paths]\n\nOptions:\n\n--check: do not write changes, just return status; return code 0 indicates nothing would change; return code 1 indicates some files would be reformatted\n--parallelism: Number of threads for parallel processing\n
Examples
# Format a specific file\n./bqetl format sql/moz-fx-data-shared-prod/telemetry/core/view.sql\n\n# Format all SQL files in `sql/`\n./bqetl format sql\n\n# Format standard in (will write to standard out)\necho 'SELECT 1,2,3' | ./bqetl format\n
"},{"location":"bqetl/#routine","title":"routine
","text":"Commands for managing routines for internal use.
"},{"location":"bqetl/#create_2","title":"create
","text":"Create a new routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_prodecure flag.
Usage
$ ./bqetl routine create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n
Examples
# Create a UDF\n./bqetl routine create --udf udf.array_slice\n\n\n# Create a stored procedure\n./bqetl routine create --stored_procedure udf.events_daily\n\n\n# Create a UDF in a project other than shared-prod\n./bqetl routine create --udf udf.active_last_week --project=moz-fx-data-marketing-prod\n
"},{"location":"bqetl/#info_2","title":"info
","text":"Get routine information.
Usage
$ ./bqetl routine info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n
Examples
# Get information about all internal routines in a specific dataset\n./bqetl routine info udf.*\n\n\n# Get usage information of specific routine\n./bqetl routine info --usages udf.get_key\n
"},{"location":"bqetl/#validate_2","title":"validate
","text":"Validate formatting of routines and run tests.
Usage
$ ./bqetl routine validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n
Examples
# Validate all routines\n./bqetl routine validate\n\n\n# Validate selected routines\n./bqetl routine validate udf.*\n
"},{"location":"bqetl/#publish","title":"publish
","text":"Publish routines to BigQuery. Requires service account access.
Usage
$ ./bqetl routine publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing udfs.\n
Examples
# Publish all routines\n./bqetl routine publish\n\n\n# Publish selected routines\n./bqetl routine validate udf.*\n
"},{"location":"bqetl/#rename","title":"rename
","text":"Rename routine or routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl routine rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Rename routine\n./bqetl routine rename udf.array_slice udf.list_slice\n\n\n# Rename routine matching a specific pattern\n./bqetl routine rename udf.array_* udf.list_*\n
"},{"location":"bqetl/#mozfun","title":"mozfun
","text":"Commands for managing public mozfun routines.
"},{"location":"bqetl/#create_3","title":"create
","text":"Create a new mozfun routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_prodecure flag. UDFs are added to the mozfun
project.
Usage
$ ./bqetl mozfun create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n
Examples
# Create a UDF\n./bqetl mozfun create --udf bytes.zero_right\n\n\n# Create a stored procedure\n./bqetl mozfun create --stored_procedure event_analysis.events_daily\n
"},{"location":"bqetl/#info_3","title":"info
","text":"Get mozfun routine information.
Usage
$ ./bqetl mozfun info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n
Examples
# Get information about all internal routines in a specific dataset\n./bqetl mozfun info hist.*\n\n\n# Get usage information of specific routine\n./bqetl mozfun info --usages hist.mean\n
"},{"location":"bqetl/#validate_3","title":"validate
","text":"Validate formatting of mozfun routines and run tests.
Usage
$ ./bqetl mozfun validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n
Examples
# Validate all routines\n./bqetl mozfun validate\n\n\n# Validate selected routines\n./bqetl mozfun validate hist.*\n
"},{"location":"bqetl/#publish_1","title":"publish
","text":"Publish mozfun routines. This command is used by Airflow only.
Usage
$ ./bqetl mozfun publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing udfs.\n
"},{"location":"bqetl/#rename_1","title":"rename
","text":"Rename mozfun routine or mozfun routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl mozfun rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Rename routine\n./bqetl mozfun rename hist.extract hist.ext\n\n\n# Rename routine matching a specific pattern\n./bqetl mozfun rename *.array_* *.list_*\n\n\n# Rename routine dataset\n./bqetl mozfun rename hist.* histogram.*\n
"},{"location":"bqetl/#backfill_1","title":"backfill
","text":"Commands for managing backfills.
"},{"location":"bqetl/#create_4","title":"create
","text":"Create a new backfill entry in the backfill.yaml file. Create a backfill.yaml file if it does not already exist.
Usage
$ ./bqetl backfill create [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--start_date: First date to be backfilled. Date format: yyyy-mm-dd\n--end_date: Last date to be backfilled. Date format: yyyy-mm-dd\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--watcher: Watcher of the backfill (email address)\n--custom_query_path: Path of the custom query to run the backfill. Optional.\n--shredder_mitigation: Wether to run a backfill using an auto-generated query that mitigates shredder effect.\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n
Examples
./bqetl backfill create moz-fx-data-shared-prod.telemetry_derived.deviations_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31 \\\n --exclude=2021-03-03 \\\n
"},{"location":"bqetl/#validate_4","title":"validate
","text":"Validate backfill.yaml file format and content.
Usage
$ ./bqetl backfill validate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
./bqetl backfill validate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# validate all backfill.yaml files if table is not specified\nUse the `--project_id` option to change the project to be validated;\ndefault is `moz-fx-data-shared-prod`.\n\n ./bqetl backfill validate\n
"},{"location":"bqetl/#info_4","title":"info
","text":"Get backfill(s) information from all or specific table(s).
Usage
$ ./bqetl backfill info [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Filter backfills with this status.\n
Examples
# Get info for specific table.\n./bqetl backfill info moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill info\n\n\n# Get info from all tables with specific status.\n./bqetl backfill info --status=Initiate\n
"},{"location":"bqetl/#scheduled","title":"scheduled
","text":"Get information on backfill(s) that require processing.
Usage
$ ./bqetl backfill scheduled [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Whether to get backfills to process or to complete.\n--json_path: None\n
Examples
# Get info for specific table.\n./bqetl backfill scheduled moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill scheduled\n
"},{"location":"bqetl/#initiate","title":"initiate
","text":"Process entry in backfill.yaml with Initiate status that has not yet been processed.
Usage
$ ./bqetl backfill initiate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--parallelism: Maximum number of queries to execute concurrently\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Initiate backfill entry for specific table\n./bqetl backfill initiate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"bqetl/#complete","title":"complete
","text":"Complete entry in backfill.yaml with Complete status that has not yet been processed..
Usage
$ ./bqetl backfill complete [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Complete backfill entry for specific table\n./bqetl backfill complete moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"cookbooks/common_workflows/","title":"Common bigquery-etl workflows","text":"This is a quick guide of how to perform common workflows in bigquery-etl using the bqetl
CLI.
For any workflow, the bigquery-etl repository needs to be locally available, for example by cloning the repository, and the bqetl
CLI needs to be installed by running ./bqetl bootstrap
.
The Creating derived datasets tutorial provides a more detailed guide on creating scheduled queries.
./bqetl query create <dataset>.<table>_<version>
<dataset>.<table>_<version>
query.sql
file that has been created in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
to write the query./bqetl query schema update <dataset>.<table>_<version>
to generate the schema.yaml
fileschema.yaml
metadata.yaml
file in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
./bqetl query validate <dataset>.<table>_<version>
to dry run and format the query./bqetl dag info
list or create a new DAG ./bqetl dag create <bqetl_new_dag>
./bqetl query schedule <dataset>.<table>_<version> --dag <bqetl_dag>
to schedule the querybqetl_artifact_deployment
Airflow DAG./bqetl query backfill --project-id <project id> <dataset>.<table>_<version>
query.sql
file of the query to be updated and make changes./bqetl query validate <dataset>.<table>_<version>
to dry run and format the query./bqetl dag generate <bqetl_dag>
to update the DAG file./bqetl query schema update <dataset>.<table>_<version>
to make local schema.yaml
updatesbqetl_artifact_deployment
Airflow DAGWe enforce consistent SQL formatting as part of CI. After adding or changing a query, use ./bqetl format
to apply formatting rules.
Directories and files passed as arguments to ./bqetl format
will be formatted in place, with directories recursively searched for files with a .sql
extension, e.g.:
$ echo 'SELECT 1,2,3' > test.sql\n$ ./bqetl format test.sql\nmodified test.sql\n1 file(s) modified\n$ cat test.sql\nSELECT\n 1,\n 2,\n 3\n
If no arguments are specified the script will read from stdin and write to stdout, e.g.:
$ echo 'SELECT 1,2,3' | ./bqetl format\nSELECT\n 1,\n 2,\n 3\n
To turn off sql formatting for a block of SQL, wrap it in format:off
and format:on
comments, like this:
SELECT\n -- format:off\n submission_date, sample_id, client_id\n -- format:on\n
"},{"location":"cookbooks/common_workflows/#add-a-new-field-to-a-table-schema","title":"Add a new field to a table schema","text":"Adding a new field to a table schema also means that the field has to propagate to several downstream tables, which makes it a more complex case.
query.sql
file inside the <dataset>.<table>
location and add the new definitions for the field../bqetl format <path to the query>
to format the query. Alternatively, run ./bqetl format $(git ls-tree -d HEAD --name-only)
validate the format of all queries that have been modified../bqetl query validate <dataset>.<table>
to dry run the query.jobs.create
permissions in moz-fx-data-shared-prod
), run:gcloud auth login --update-adc # to authenticate to GCP
gcloud config set project mozdata # to set the project
./bqetl query validate --use-cloud-function=false --project-id=mozdata <full path to the query file>
./bqetl query schema update <dataset>.<table> --update_downstream
to make local schema.yaml updates and update schemas of downstream dependencies.--update_downstream
is optional as it takes longer. It is recommended when you know that there are downstream dependencies whose schema.yaml
need to be updated, in which case, the update will happen automatically.--force
should only be used in very specific cases, particularly the clients_last_seen
tables. It skips some checks that would otherwise catch some error scenarios.bqetl_artifact_deployment
Airflow DAGThe following is an example to update a new field in telemetry_derived.clients_daily_v6
clients_daily_v6
query.sql
file and add new field definitions../bqetl format sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/query.sql
./bqetl query validate telemetry_derived.clients_daily_v6
.gcloud auth login --update-adc
./bqetl query schema update telemetry_derived.clients_daily_v6 --update_downstream --ignore-dryrun-skip --use-cloud-function=false
.schema.yaml
files of downstream dependencies, like clients_last_seen_v1
are updated.--use-cloud-function=false
is necessary when updating tables related to clients_daily
but optional for other tables. The dry run cloud function times out when fetching the deployed table schema for some of clients_daily
s downstream dependencies. Using GCP credentials instead works, however this means users need to have permissions to run queries in moz-fx-data-shared-prod
.bqetl_artifact_deployment
Airflow DAGDeleting a field from an existing table schema should be done only when is totally neccessary. If you decide to delete it: 1. Validate if there is data in the column and make sure data it is either backed up or it can be reprocessed. 1. Follow Big Query docs recommendations for deleting. 1. If the column size exceeds the allowed limit, consider setting the field as NULL. See this search_clients_daily_v8 PR for an example.
"},{"location":"cookbooks/common_workflows/#adding-a-new-mozfun-udf","title":"Adding a new mozfun UDF","text":"./bqetl mozfun create <dataset>.<name> --udf
.udf.sql
file in sql/mozfun/<dataset>/<name>/
and add UDF the definition and tests../bqetl mozfun validate <dataset>.<name>
for formatting and running tests.mozfun
DAG and clear latest run.Internal UDFs are usually only used by specific queries. If your UDF might be useful to others consider publishing it as a mozfun
UDF.
./bqetl routine create <dataset>.<name> --udf
udf.sql
in sql/moz-fx-data-shared-prod/<dataset>/<name>/
file and add UDF definition and tests./bqetl routine validate <dataset>.<name>
for formatting and running testsbqetl_artifact_deployment
Airflow DAGThe same steps as creating a new UDF apply for creating stored procedures, except when initially creating the procedure execute ./bqetl mozfun create <dataset>.<name> --stored_procedure
or ./bqetl routine create <dataset>.<name> --stored_procedure
for internal stored procedures.
udf.sql
file and make updates./bqetl mozfun validate <dataset>.<name>
or ./bqetl routine validate <dataset>.<name>
for formatting and running tests./bqetl mozfun rename <dataset>.<name> <new_dataset>.<new_name>
To provision a new BigQuery dataset for holding tables, you'll need to create a dataset_metadata.yaml
which will cause the dataset to be automatically deployed after merging. Changes to existing datasets may trigger manual operator approval (such as changing access policies). For more on access controls, see Data Access Workgroups in Mana.
The bqetl query create
command will automatically generate a skeleton dataset_metadata.yaml
file if the query name contains a dataset that is not yet defined.
See example with commentary for telemetry_derived
:
friendly_name: Telemetry Derived\ndescription: |-\n Derived data based on pings from legacy Firefox telemetry, plus many other\n general-purpose derived tables\nlabels: {}\n\n# Base ACL should can be:\n# \"derived\" for `_derived` datasets that contain concrete tables\n# \"view\" for user-facing datasets containing virtual views\ndataset_base_acl: derived\n\n# Datasets with user-facing set to true will be created both in shared-prod\n# and in mozdata; this should be false for all `_derived` datasets\nuser_facing: false\n\n# Most datasets can have mozilla-confidential access like below, but some\n# datasets will be defined with more restricted access or with additional\n# access for services; see \"Data Access Workgroups\" link above.\nworkgroup_access:\n- role: roles/bigquery.dataViewer\n members:\n - workgroup:mozilla-confidential\n
"},{"location":"cookbooks/common_workflows/#publishing-data","title":"Publishing data","text":"See also the reference for Public Data.
metadata.yaml
file of the query to be publishedpublic_bigquery: true
and optionally public_json: true
review_bugs
mozilla-public-data
init.sql
file exists for the query, change the destination project for the created table to mozilla-public-data
moz-fx-data-shared-prod
referencing the public datasetWhen adding a new library to the Python requirements, first add the library to the requirements and then add any meta-dependencies into constraints. Constraints are discovered by installing requirements into a fresh virtual environment. A dependency should be added to either requirements.txt
or constraints.txt
, but not both.
# Create a python virtual environment (not necessary if you have already\n# run `./bqetl bootstrap`)\npython3 -m venv venv/\n\n# Activate the virtual environment\nsource venv/bin/activate\n\n# If not installed:\npip install pip-tools --constraint requirements.in\n\n# Add the dependency to requirements.in e.g. Jinja2.\necho Jinja2==2.11.1 >> requirements.in\n\n# Compile hashes for new dependencies.\npip-compile --generate-hashes requirements.in\n\n# Deactivate the python virtual environment.\ndeactivate\n
"},{"location":"cookbooks/common_workflows/#making-a-pull-request-from-a-fork","title":"Making a pull request from a fork","text":"When opening a pull-request to merge a fork, the manual-trigger-required-for-fork
CI task will fail and some integration test tasks will be skipped. A user with repository write permissions will have to run the Push to upstream workflow and provide the <username>:<branch>
of the fork as parameter. The parameter will also show up in the logs of the manual-trigger-required-for-fork
CI task together with more detailed instructions. Once the workflow has been executed, the CI tasks, including the integration tests, of the PR will be executed.
The repository documentation is built using MkDocs. To generate and check the docs locally:
./bqetl docs generate --output_dir generated_docs
generated_docs
directorymkdocs serve
to start a local mkdocs
server.Each code files in the bigquery-etl repository can have a set of owners who are responsible to review and approve changes, and are automatically assigned as PR reviewers. The query files in the repo also benefit from the metadata labels to be able to validate and identify the data that is change controlled.
Here is a sample PR with the implementation of change control for contextual services data.
mozilla > telemetry
.metadata.yaml
for the query where you want to apply change control:owners
, add the selected GitHub identity, along with the list of owners' emails.labels
, add change_controlled: true
. This enables identifying change controlled data in the BigQuery console and in the Data Catalog.CODEOWNERS
:CODEOWNERS
file located in the root of the repo./sql_generators/active_users/templates/ @mozilla/kpi_table_reviewers
.script/bqetl query validate <query_path>
./sql-generators
, first run ./script/bqetl generate <path>
and then run script/bqetl query validate <query_path>
.This guide takes you through the creation of a simple derived dataset using bigquery-etl and scheduling it using Airflow, to be updated on a daily basis. It applies to the products we ship to customers, that use (or will use) the Glean SDK.
This guide also includes the specific instructions to set it as a public dataset. Make sure you only set the dataset public if you expect the data to be available outside Mozilla. Read our public datasets reference for context.
To illustrate the overall process, we will use a simple test case and a small Glean application for which we want to generate an aggregated dataset based on the raw ping data.
If you are interested in looking at the end result, you can view the pull request at mozilla/bigquery-etl#1760.
"},{"location":"cookbooks/creating_a_derived_dataset/#background","title":"Background","text":"Mozregression is a developer tool used to help developers and community members bisect builds of Firefox to find a regression range in which a bug was introduced. It forms a key part of our quality assurance process.
In this example, we will create a table of aggregated metrics related to mozregression
, that will be used in dashboards to help prioritize feature development inside Mozilla.
Set up bigquery-etl on your system per the instructions in the README.md.
"},{"location":"cookbooks/creating_a_derived_dataset/#create-the-query","title":"Create the Query","text":"The first step is to create a query file and decide on the name of your derived dataset. In this case, we'll name it org_mozilla_mozregression_derived.mozregression_aggregates
.
The org_mozilla_mozregression_derived
part represents a BigQuery dataset, which is essentially a container of tables. By convention, we use the _derived
postfix to hold derived tables like this one.
Run:
./bqetl query create <dataset>.<table_name>\n
In our example: ./bqetl query create org_mozilla_mozregression_derived.mozregression_aggregates --dag bqetl_internal_tooling\n
This command does three things:
metadata.yaml
and query.sql
representing the query to build the dataset in sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1
sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates
.bqetl_internal_tooling
.bqetl_default
.--no-schedule
is used, queries are not schedule. This option is available for queries that run once or should be scheduled at a later time. The query can be manually scheduled at a later time.We generate the view to have a stable interface, while allowing the dataset backend to evolve over time. Views are automatically published to the mozdata
project.
The next step is to modify the generated metadata.yaml
and query.sql
sections with specific information.
Let's look at what the metadata.yaml
file for our example looks like. Make sure to adapt this file for your own dataset.
friendly_name: mozregression aggregates\ndescription:\n Aggregated metrics of mozregression usage\nlabels:\n incremental: true\nowners:\n - wlachance@mozilla.com\nbigquery:\n time_partitioning:\n type: day\n field: date\n require_partition_filter: true\n expiration_days: null\n clustering:\n fields:\n - app_used\n - os\n
Most of the fields are self-explanatory. incremental
means that the table is updated incrementally, e.g. a new partition gets added/updated to the destination table whenever the query is run. For non-incremental queries the entire destination is overwritten when the query is executed.
For big datasets make sure to include optimization strategies. Our aggregation is small so it is only for illustration purposes that we are including a partition by the date
field and a clustering on app_used
and os
.
Setting the dataset as public means that it will be both in Mozilla's public BigQuery project and a world-accessible JSON endpoint, and is a process that requires a data review. The required labels are: public_json
, public_bigquery
and review_bugs
which refers to the Bugzilla bug where opening this data set up to the public was approved: we'll get to that in a subsequent section.
friendly_name: mozregression aggregates\ndescription:\n Aggregated metrics of mozregression usage\nlabels:\n incremental: true\n public_json: true\n public_bigquery: true\n review_bugs:\n - 1691105\nowners:\n - wlachance@mozilla.com\nbigquery:\n time_partitioning:\n type: day\n field: date\n require_partition_filter: true\n expiration_days: null\n clustering:\n fields:\n - app_used\n - os\n
"},{"location":"cookbooks/creating_a_derived_dataset/#fill-out-the-query","title":"Fill out the query","text":"Now that we've filled out the metadata, we can look into creating a query. In many ways, this is similar to creating a SQL query to run on BigQuery in other contexts (e.g. on sql.telemetry.mozilla.org or the BigQuery console)-- the key difference is that we use a @submission_date
parameter so that the query can be run on a day's worth of data to update the underlying table incrementally.
Test your query and add it to the query.sql
file.
In our example, the query is tested in sql.telemetry.mozilla.org
, and the query.sql
file looks like this:
SELECT\n DATE(submission_timestamp) AS date,\n client_info.app_display_version AS mozregression_version,\n metrics.string.usage_variant AS mozregression_variant,\n metrics.string.usage_app AS app_used,\n normalized_os AS os,\n mozfun.norm.truncate_version(normalized_os_version, \"minor\") AS os_version,\n count(DISTINCT(client_info.client_id)) AS distinct_clients,\n count(*) AS total_uses\nFROM\n `moz-fx-data-shared-prod`.org_mozilla_mozregression.usage\nWHERE\n DATE(submission_timestamp) = @submission_date\n AND client_info.app_display_version NOT LIKE '%.dev%'\nGROUP BY\n date,\n mozregression_version,\n mozregression_variant,\n app_used,\n os,\n os_version;\n
We use the truncate_version
UDF to omit the patch level for MacOS and Linux, which should both reduce the size of the dataset as well as make it more difficult to identify individual clients in an aggregated dataset.
We also have a short clause (client_info.app_display_version NOT LIKE '%.dev%'
) to omit developer versions from the aggregates: this makes sure we're not including people developing or testing mozregression itself in our results.
Now that we've written our query, we can format it and validate it. Once that's done, we run:
./bqetl query validate <dataset>.<table>\n
For our example: ./bqetl query validate org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
If there are no problems, you should see no output."},{"location":"cookbooks/creating_a_derived_dataset/#creating-the-table-schema","title":"Creating the table schema","text":"Use bqetl to set up the schema that will be used to create the table.
Review the schema.YAML generated as an output of the following command, and make sure all data types are set correctly and according to the data expected from the query.
./bqetl query schema update <dataset>.<table>\n
For our example:
./bqetl query schema update org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
"},{"location":"cookbooks/creating_a_derived_dataset/#creating-a-dag","title":"Creating a DAG","text":"BigQuery-ETL has some facilities in it to automatically add your query to telemetry-airflow (our instance of Airflow).
Before scheduling your query, you'll need to find an Airflow DAG to run it off of. In some cases, one may already exist that makes sense to use for your dataset -- look in dags.yaml
at the root or run ./bqetl dag info
. In this particular case, there's no DAG that really makes sense -- so we'll create a new one:
./bqetl dag create <dag_name> --schedule-interval \"0 4 * * *\" --owner <email_for_notifications> --description \"Add a clear description of the DAG here\" --start-date <YYYY-MM-DD> --tag impact/<tier>\n
For our example, the starting date is 2020-06-01
and we use a schedule interval of 0 4 \\* \\* \\*
(4am UTC daily) instead of \"daily\" (12am UTC daily) to make sure this isn't competing for slots with desktop and mobile product ETL.
The --tag impact/tier3
parameter specifies that this DAG is considered \"tier 3\". For a list of valid tags and their descriptions see Airflow Tags.
When creating a new DAG, while it is still under active development and assumed to fail during this phase, the DAG can be tagged as --tag triage/no_triage
. That way it will be ignored by the person on Airflow Triage. Once the active development is done, the triage/no_triage
tag can be removed and problems will addressed during the Airflow Triage process.
./bqetl dag create bqetl_internal_tooling --schedule-interval \"0 4 * * *\" --owner wlachance@mozilla.com --description \"This DAG schedules queries for populating queries related to Mozilla's internal developer tooling (e.g. mozregression).\" --start-date 2020-06-01 --tag impact/tier_3\n
"},{"location":"cookbooks/creating_a_derived_dataset/#scheduling-your-query","title":"Scheduling your query","text":"Queries are automatically scheduled during creation in the DAG set using the option --dag
, or in the default DAG bqetl_default
when this option is not used.
If the query was created with --no-schedule
, it is possible to manually schedule the query via the bqetl
tool:
./bqetl query schedule <dataset>.<table> --dag <dag_name> --task-name <task_name>\n
Here is the command for our example. Notice the name of the table as created with the suffix _v1.
./bqetl query schedule org_mozilla_mozregression_derived.mozregression_aggregates_v1 --dag bqetl_internal_tooling --task-name mozregression_aggregates__v1\n
Note that we are scheduling the generation of the underlying table which is org_mozilla_mozregression_derived.mozregression_aggregates_v1
rather than the view.
This is for public datasets only! You can skip this step if you're only creating a dataset for Mozilla-internal use.
Before a dataset can be made public, it needs to go through data review according to our data publishing process. This means filing a bug, answering a few questions, and then finding a data steward to review your proposal.
The dataset we're using in this example is very simple and straightforward and does not have any particularly sensitive data, so the data review is very simple. You can see the full details in bug 1691105.
"},{"location":"cookbooks/creating_a_derived_dataset/#create-a-pull-request","title":"Create a Pull Request","text":"Now is a good time to create a pull request with your changes to GitHub. This is the usual git workflow:
git checkout -b <new_branch_name>\ngit add dags.yaml dags/<dag_name>.py sql/moz-fx-data-shared-prod/telemetry/<view> sql/moz-fx-data-shared-prod/<dataset>/<table>\ngit commit\ngit push origin <new_branch_name>\n
And next is the workflow for our specific example:
git checkout -b mozregression-aggregates\ngit add dags.yaml dags/bqetl_internal_tooling.py sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1\ngit commit\ngit push origin mozregression-aggregates\n
Then create your pull request, either from the GitHub web interface or the command line, per your preference.
Note At this point, the CI is expected to fail because the schema does not exist yet in BigQuery. This will be handled in the next step.
This example assumes that origin
points to your fork. Adjust the last push invocation appropriately if you have a different remote set.
Speaking of forks, note that if you're making this pull request from a fork, many jobs will currently fail due to lack of credentials. In fact, even if you're pushing to the origin, you'll get failures because the table is not yet created. That brings us to the next step, but before going further it's generally best to get someone to review your work: at this point we have more than enough for people to provide good feedback on.
"},{"location":"cookbooks/creating_a_derived_dataset/#creating-an-initial-table","title":"Creating an initial table","text":"Once the PR has been approved, deploy the schema to bqetl using this command:
./bqetl query schema deploy <schema>.<table>\n
For our example:
./bqetl query schema deploy org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
"},{"location":"cookbooks/creating_a_derived_dataset/#backfilling-a-table","title":"Backfilling a table","text":"Note For large sets of data, follow the recommended practices for backfills.
"},{"location":"cookbooks/creating_a_derived_dataset/#initiating-the-backfill","title":"Initiating the backfill:","text":"Create a backfill schedule entry to (re)-process data in your table:
bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD>\n
--shredder_mitigation
parameter in the backfill command:bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD> --shredder_mitigation\n
Fill out the missing details:
Open a Pull Request with the backfill entry, see this example. Once merged, you should receive a notification in around an hour that processing has started. Your backfill data will be temporarily placed in a staging location.
Watchers need to join the #dataops-alerts Slack channel. They will be notified via Slack when processing is complete, and you can validate your backfill data.
Validate that the backfill data looks like what you expect (calculate important metrics, look for nulls, etc.)
If the data is valid, open a Pull Request, setting the backfill status to Complete, see this example. Once merged, you should receive a notification in around an hour that swapping has started. Current production data will be backed up and the staging backfill data will be swapped into production.
You will be notified when swapping is complete.
Note. If your backfill is complex (backfill validation fails for e.g.), it is recommended to talk to someone in Data Engineering or Data SRE (#data-help) to process the backfill via the backfill DAG.
"},{"location":"cookbooks/creating_a_derived_dataset/#completing-the-pull-request","title":"Completing the Pull Request","text":"At this point, the table exists in Bigquery so you are able to: - Find and re-run the CI of your PR and make sure that all tests pass - Merge your PR.
"},{"location":"cookbooks/testing/","title":"How to Run Tests","text":"This repository uses pytest
:
# create a venv\npython3.11 -m venv venv/\n\n# install pip-tools for managing dependencies\n./venv/bin/pip install pip-tools -c requirements.in\n\n# install python dependencies with pip-sync (provided by pip-tools)\n./venv/bin/pip-sync --pip-args=--no-deps requirements.txt\n\n# run pytest with all linters and 8 workers in parallel\n./venv/bin/pytest --black --flake8 --isort --mypy-ignore-missing-imports --pydocstyle -n 8\n\n# use -k to selectively run a set of tests that matches the expression `udf`\n./venv/bin/pytest -k udf\n\n# narrow down testpaths for quicker turnaround when selecting a single test\n./venv/bin/pytest -o \"testpaths=tests/sql\" -k mobile_search_aggregates_v1\n\n# run integration tests with 4 workers in parallel\ngcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS\nexport GOOGLE_PROJECT_ID=bigquery-etl-integration-test\ngcloud config set project $GOOGLE_PROJECT_ID\n./venv/bin/pytest -m integration -n 4\n
To provide authentication credentials for the Google Cloud API the GOOGLE_APPLICATION_CREDENTIALS
environment variable must be set to the file path of the JSON file that contains the service account key. See Mozilla BigQuery API Access instructions to request credentials if you don't already have them.
Include a comment like -- Tests
followed by one or more query statements after the UDF in the SQL file where it is defined. Each statement in a SQL file that defines a UDF that does not define a temporary function is collected as a test and executed independently of other tests in the file.
Each test must use the UDF and throw an error to fail. Assert functions defined in sql/mozfun/assert/
may be used to evaluate outputs. Tests must not use any query parameters and should not reference any tables. Each test that is expected to fail must be preceded by a comment like #xfail
, similar to a SQL dialect prefix in the BigQuery Cloud Console.
For example:
CREATE TEMP FUNCTION udf_example(option INT64) AS (\n CASE\n WHEN option > 0 then TRUE\n WHEN option = 0 then FALSE\n ELSE ERROR(\"invalid option\")\n END\n);\n-- Tests\nSELECT\n mozfun.assert.true(udf_example(1)),\n mozfun.assert.false(udf_example(0));\n#xfail\nSELECT\n udf_example(-1);\n#xfail\nSELECT\n udf_example(NULL);\n
"},{"location":"cookbooks/testing/#how-to-configure-a-generated-test","title":"How to Configure a Generated Test","text":"Queries are tested by running the query.sql
with test-input tables and comparing the result to an expected table. 1. Make a directory for test resources named tests/sql/{project}/{dataset}/{table}/{test_name}/
, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/test_single_day
- table
must match a directory named like {dataset}/{table}
, e.g. telemetry_derived/clients_last_seen_v1
- test_name
should start with test_
, e.g. test_single_day
- If test_name
is test_init
or test_script
, then the query with is_init()
set to true
or script.sql
respectively; otherwise, the test will run query.sql
1. Add .yaml
files for input tables, e.g. clients_daily_v6.yaml
- Include the dataset prefix if it's set in the tested query, e.g. analysis.clients_last_seen_v1.yaml
- Include the project prefix if it's set in the tested query, e.g. moz-fx-other-data.new_dataset.table_1.yaml
- This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"analysis.clients_last_seen_v1\", \"clients_last_seen_v1\")
1. Add .sql
files for input view queries, e.g. main_summary_v4.sql
- Don't include a CREATE ... AS
clause - Fully qualify table names as `{project}.{dataset}.table`
- Include the dataset prefix if it's set in the tested query, e.g. telemetry.main_summary_v4.sql
- This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"telemetry.main_summary_v4\", \"main_summary_v4\")
1. Add expect.yaml
to validate the result - DATE
and DATETIME
type columns in the result are coerced to strings using .isoformat()
- Columns named generated_time
are removed from the result before comparing to expect
because they should not be static - NULL
values should be omitted in expect.yaml
. If a column is expected to be NULL
don't add it to expect.yaml
. (Be careful with spreading previous rows (-<<: *base
) here) 1. Optionally add .schema.json
files for input table schemas to the table directory, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/clients_daily_v6.schema.json
. These tables will be available for every test in the suite. The schema.json
file need to match the table name in the query.sql
file. If it has project and dataset listed there, the schema file also needs project and dataset. 1. Optionally add query_params.yaml
to define query parameters - query_params
must be a list
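Putting the conventions above together, a minimal test directory for the clients_last_seen_raw_v1 example might look roughly like this (a sketch only; file contents omitted):
tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/test_single_day/\n  clients_daily_v6.yaml\n  query_params.yaml\n  expect.yaml\n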
Tests of is_init()
statements are supported, similarly to other generated tests. Simply name the test test_init
. The other guidelines still apply.
generated_time
should be a required DATETIME
field to ensure minimal validationbq load
are supportedyaml
and json
format are supported and must contain an array of rows which are converted in memory to ndjson
before loadingyaml
for readability or ndjson
for compatibility with bq load
expect.yaml
yaml
, json
and ndjson
are supportedyaml
for readability or ndjson
for compatibility with bq load
time_partitioning_field
will cause the table to use it for time partitioningyaml
, json
and ndjson
are supportedyaml
for readability or json
for compatibility with bq load
name
, type
or type_
, and value
query_parameters.yaml
may be used instead of query_params.yaml
, but they are mutually exclusiveyaml
, json
and ndjson
are supportedyaml
for readabilitycircleci
service account in the biguqery-etl-integration-test
projectcircleci build
and set required environment variables GOOGLE_PROJECT_ID
and GCLOUD_SERVICE_KEY
:gcloud_service_key=`cat /path/to/key_file.json`\n\n# to run a specific job, e.g. integration:\ncircleci build --job integration \\\n --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n\n# to run all jobs\ncircleci build \\\n --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n
"},{"location":"moz-fx-data-shared-prod/udf/","title":"Udf","text":""},{"location":"moz-fx-data-shared-prod/udf/#active_n_weeks_ago-udf","title":"active_n_weeks_ago (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters","title":"Parameters","text":"INPUTS
x INT64, n INT64\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#active_values_from_days_seen_map-udf","title":"active_values_from_days_seen_map (UDF)","text":"Given a map of representing activity for STRING key
s, this function returns an array of which key
s were active for the time period in question. start_offset should be at most 0. n_bits should be at most the remaining bits.
INPUTS
days_seen_bits_map ARRAY<STRUCT<key STRING, value INT64>>, start_offset INT64, n_bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_engine_searches-udf","title":"add_monthly_engine_searches (UDF)","text":"This function specifically windows searches into calendar-month windows. This means groups are not necessarily directly comparable, since different months have different numbers of days. On the first of each month, a new month is appended, and the first month is dropped. If the date is not the first of the month, the new entry is added to the last element in the array. For example, if we were adding 12 to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]: On the first of the month, the result would be [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12] On any other day of the month, the result would be [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 24] This happens for every aggregate (searches, ad clicks, etc.)
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_2","title":"Parameters","text":"INPUTS
prev STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, curr STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_searches-udf","title":"add_monthly_searches (UDF)","text":"Adds together two engine searches structs. Each engine searches struct has a MAP[engine -> search_counts_struct]. We want to add add together the prev and curr's values for a certain engine. This allows us to be flexible with the number of engines we're using.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_3","title":"Parameters","text":"INPUTS
prev ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, curr ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, submission_date DATE\n
OUTPUTS
value\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_searches_by_index-udf","title":"add_searches_by_index (UDF)","text":"Return sums of each search type grouped by the index. Results are ordered by index.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_4","title":"Parameters","text":"INPUTS
searches ARRAY<STRUCT<total_searches INT64, tagged_searches INT64, search_with_ads INT64, ad_click INT64, index INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_active_addons-udf","title":"aggregate_active_addons (UDF)","text":"This function selects most frequently occuring value for each addon_id, using the latest value in the input among ties. The type for active_addons is ARRAY>, i.e. the output of SELECT ARRAY_CONCAT_AGG(active_addons) FROM telemetry.main_summary_v4
, and is left unspecified to allow changes to the fields of the STRUCT."},{"location":"moz-fx-data-shared-prod/udf/#parameters_5","title":"Parameters","text":"
INPUTS
active_addons ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_map_first-udf","title":"aggregate_map_first (UDF)","text":"Returns an aggregated map with all the keys and the first corresponding value from the given maps
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_6","title":"Parameters","text":"INPUTS
maps ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_counts-udf","title":"aggregate_search_counts (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_7","title":"Parameters","text":"INPUTS
search_counts ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_map-udf","title":"aggregate_search_map (UDF)","text":"Aggregates the total counts of the given search counters
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_8","title":"Parameters","text":"INPUTS
engine_searches_list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_11_zeroes_then-udf","title":"array_11_zeroes_then (UDF)","text":"An array of 11 zeroes, followed by a supplied value
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_9","title":"Parameters","text":"INPUTS
val INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_drop_first_and_append-udf","title":"array_drop_first_and_append (UDF)","text":"Drop the first element of an array, and append the given element. Result is an array with the same length as the input.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_10","title":"Parameters","text":"INPUTS
arr ANY TYPE, append ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_of_12_zeroes-udf","title":"array_of_12_zeroes (UDF)","text":"An array of 12 zeroes
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_11","title":"Parameters","text":"INPUTS
) AS ( [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_slice-udf","title":"array_slice (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_12","title":"Parameters","text":"INPUTS
arr ANY TYPE, start_index INT64, end_index INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitcount_lowest_7-udf","title":"bitcount_lowest_7 (UDF)","text":"This function counts the 1s in lowest 7 bits of an INT64
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_13","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_365-udf","title":"bitmask_365 (UDF)","text":"A bitmask for 365 bits
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_14","title":"Parameters","text":"INPUTS
) AS ( CONCAT(b'\\x1F', REPEAT(b'\\xFF', 45\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_28-udf","title":"bitmask_lowest_28 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_15","title":"Parameters","text":"INPUTS
) AS ( 0x0FFFFFFF\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_7-udf","title":"bitmask_lowest_7 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_16","title":"Parameters","text":"INPUTS
) AS ( 0x7F\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_range-udf","title":"bitmask_range (UDF)","text":"Returns a bitmask that can be used to return a subset of an integer representing a bit array. The start_ordinal argument is an integer specifying the starting position of the slice, with start_ordinal = 1 indicating the first bit. The length argument is the number of bits to include in the mask. The arguments were chosen to match the semantics of the SUBSTR function; see https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#substr
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_17","title":"Parameters","text":"INPUTS
start_ordinal INT64, _length INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_active_in_range-udf","title":"bits28_active_in_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_18","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_days_since_seen-udf","title":"bits28_days_since_seen (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_19","title":"Parameters","text":"INPUTS
bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_from_string-udf","title":"bits28_from_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_20","title":"Parameters","text":"INPUTS
s STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_range-udf","title":"bits28_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_21","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_retention-udf","title":"bits28_retention (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_22","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_dates-udf","title":"bits28_to_dates (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_23","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
OUTPUTS
ARRAY<DATE>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_string-udf","title":"bits28_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_24","title":"Parameters","text":"INPUTS
bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_from_offsets-udf","title":"bits_from_offsets (UDF)","text":"Returns a bit pattern of type BYTES compactly encoding the given array of positive integer offsets. This is primarily useful to generate a compact encoding of dates on which a feature was used, with arbitrarily long history. Example aggregation: sql bits_from_offsets( ARRAY_AGG(IF(foo, DATE_DIFF(anchor_date, submission_date, DAY), NULL) IGNORE NULLS) )
The resulting value can be cast to an INT64 representing the most recent 64 days via: sql CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64)
Or representing the most recent 28 days (compatible with bits28 functions) via: sql CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64) << 36 >> 36
INPUTS
offsets ARRAY<INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_active_n_weeks_ago-udf","title":"bits_to_active_n_weeks_ago (UDF)","text":"Given a BYTE and an INT64, return whether the user was active that many weeks ago. NULL input returns NULL output.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_26","title":"Parameters","text":"INPUTS
b BYTES, n INT64\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_seen-udf","title":"bits_to_days_seen (UDF)","text":"Given a BYTE, get the number of days the user was seen. NULL input returns NULL output.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_27","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_first_seen-udf","title":"bits_to_days_since_first_seen (UDF)","text":"Given a BYTES, return the number of days since the client was first seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the result is 0-indexed, meaning that for \\x01, it will return 0. Results showed this being between 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_first_seen(b BYTES) AS (( SELECT MAX(n) FROM UNNEST(GENERATE_ARRAY( 0, 8 * BYTE_LENGTH(b))) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_seen.sql
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_28","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_seen-udf","title":"bits_to_days_since_seen (UDF)","text":"Given a BYTES, return the number of days since the client was last seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the results are 0-indexed, meaning \\x01 will return 0. Tests showed this being 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_seen(b BYTES) AS (( SELECT MIN(n) FROM UNNEST(GENERATE_ARRAY(0, 364)) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_first_seen.sql
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_29","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bool_to_365_bits-udf","title":"bool_to_365_bits (UDF)","text":"Convert a boolean to 365 bit byte array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_30","title":"Parameters","text":"INPUTS
val BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#boolean_histogram_to_boolean-udf","title":"boolean_histogram_to_boolean (UDF)","text":"Given histogram h, return TRUE if it has a value in the \"true\" bucket, or FALSE if it has a value in the \"false\" bucket, or NULL otherwise. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L309-L317
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_31","title":"Parameters","text":"INPUTS
histogram STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_28_bits-udf","title":"coalesce_adjacent_days_28_bits (UDF)","text":"We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bits representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_32","title":"Parameters","text":"INPUTS
prev INT64, curr INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_365_bits-udf","title":"coalesce_adjacent_days_365_bits (UDF)","text":"Coalesce previous data's PCD with the new data's PCD. We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bytes representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_33","title":"Parameters","text":"INPUTS
prev BYTES, curr BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_28_bits-udf","title":"combine_adjacent_days_28_bits (UDF)","text":"Combines two bit patterns. The first pattern represents activity over a 28-day period ending \"yesterday\". The second pattern represents activity as observed today (usually just 0 or 1). We shift the bits in the first pattern by one to set the new baseline as \"today\", then perform a bitwise OR of the two patterns.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_34","title":"Parameters","text":"INPUTS
prev INT64, curr INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_365_bits-udf","title":"combine_adjacent_days_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_35","title":"Parameters","text":"INPUTS
prev BYTES, curr BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_days_seen_maps-udf","title":"combine_days_seen_maps (UDF)","text":"The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs (aka maps) where each entry gives the bit pattern for days in which we saw a ping for a given user in a given key. We combine the bit patterns for the previous day and the current day, returning a single map. See udf.combine_experiment_days
for a more specific example of this approach.
INPUTS
-- prev ARRAY<STRUCT<key STRING, value INT64>>, -- curr ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_experiment_days-udf","title":"combine_experiment_days (UDF)","text":"The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs where each entry gives the bit pattern for days in which we saw a ping for a given user in a given experiment. We combine the bit patterns for the previous day and the current day, returning a single array of experiment structs.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_37","title":"Parameters","text":"INPUTS
-- prev ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>, -- curr ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#country_code_to_flag-udf","title":"country_code_to_flag (UDF)","text":"For a given two-letter ISO 3166-1 alpha-2 country code, returns a string consisting of two Unicode regional indicator symbols, which is rendered in supporting fonts (such as in the BigQuery console or STMO) as flag emoji. This is just for fun. See: - https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 - https://en.wikipedia.org/wiki/Regional_Indicator_Symbol
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_38","title":"Parameters","text":"INPUTS
country_code string\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#days_seen_bytes_to_rfm-udf","title":"days_seen_bytes_to_rfm (UDF)","text":"Return the frequency, recency, and T from a BYTE array, as defined in https://lifetimes.readthedocs.io/en/latest/Quickstart.html#the-shape-of-your-data RFM refers to Recency, Frequency, and Monetary value.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_39","title":"Parameters","text":"INPUTS
days_seen_bytes BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#days_since_created_profile_as_28_bits-udf","title":"days_since_created_profile_as_28_bits (UDF)","text":"Takes in a difference between submission date and profile creation date and returns a bit pattern representing the profile creation date IFF the profile date is the same as the submission date or no more than 6 days earlier. Analysis has shown that client-reported profile creation dates are much less reliable outside of this range and cannot be used as reliable indicators of new profile creation.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_40","title":"Parameters","text":"INPUTS
days_since_created_profile INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#deanonymize_event-udf","title":"deanonymize_event (UDF)","text":"Rename struct fields in anonymous event tuples to meaningful names.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_41","title":"Parameters","text":"INPUTS
tuple STRUCT<f0_ INT64, f1_ STRING, f2_ STRING, f3_ STRING, f4_ STRING, f5_ ARRAY<STRUCT<key STRING, value STRING>>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#decode_int64-udf","title":"decode_int64 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_42","title":"Parameters","text":"INPUTS
raw BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#dedupe_array-udf","title":"dedupe_array (UDF)","text":"Return an array containing only distinct values of the given array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_43","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_clients-udf","title":"distribution_model_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_44","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_ga_metrics-udf","title":"distribution_model_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_45","title":"Parameters","text":"INPUTS
) RETURNS STRING AS ( 'helloworld'\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_installs-udf","title":"distribution_model_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_46","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#event_code_points_to_string-udf","title":"event_code_points_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_47","title":"Parameters","text":"INPUTS
code_points ANY TYPE\n
OUTPUTS
ARRAY<INT64>\n
"},{"location":"moz-fx-data-shared-prod/udf/#experiment_search_metric_to_array-udf","title":"experiment_search_metric_to_array (UDF)","text":"Used for testing only. Reproduces the string transformations done in experiment_search_events_live_v1 materialized views.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_48","title":"Parameters","text":"INPUTS
metric ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_count_histogram_value-udf","title":"extract_count_histogram_value (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_49","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_type-udf","title":"extract_document_type (UDF)","text":"Extract the document type from a table name e.g. _TABLE_SUFFIX.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_50","title":"Parameters","text":"INPUTS
table_name STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_version-udf","title":"extract_document_version (UDF)","text":"Extract the document version from a table name e.g. _TABLE_SUFFIX.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_51","title":"Parameters","text":"INPUTS
table_name STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"This is a performance optimization compared to the more general mozfun.hist.extract for cases where only the histogram sum is needed. It must support all the same format variants as mozfun.hist.extract but this simplification is necessary to keep the main_summary query complexity in check.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_52","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_schema_validation_path-udf","title":"extract_schema_validation_path (UDF)","text":"Return a path derived from an error message in payload_bytes_error
INPUTS
error_message STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.
Fenix originally used an 8-digit app_build format>
In short it is yDDDHHmm
:
The last date seen with an 8-digit build ID is 2020-08-10.
Newer builds use a 10-digit format> where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.
This function tolerates both formats.
After using this you may wish to DATETIME_TRUNC(result, DAY)
for grouping by build date.
INPUTS
app_build STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_clients-udf","title":"funnel_derived_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_55","title":"Parameters","text":"INPUTS
os STRING, first_seen_date DATE, build_id STRING, attribution_source STRING, attribution_ua STRING, startup_profile_selection_reason STRING, distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_ga_metrics-udf","title":"funnel_derived_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_56","title":"Parameters","text":"INPUTS
device_category STRING, browser STRING, operating_system STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_installs-udf","title":"funnel_derived_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_57","title":"Parameters","text":"INPUTS
silent BOOLEAN, submission_timestamp TIMESTAMP, build_id STRING, attribution STRING, distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#ga_is_mozilla_browser-udf","title":"ga_is_mozilla_browser (UDF)","text":"Determine if a browser in a Google Analytics data is produced by Mozilla
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_58","title":"Parameters","text":"INPUTS
browser STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct-udf","title":"geo_struct (UDF)","text":"Convert geoip lookup fields to a struct, replacing '??' with NULL. Returns NULL if if required field country would be NULL. Replaces '??' with NULL because '??' is a placeholder that may be used if there was an issue during geoip lookup in hindsight.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_59","title":"Parameters","text":"INPUTS
country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct_set_defaults-udf","title":"geo_struct_set_defaults (UDF)","text":"Convert geoip lookup fields to a struct, replacing NULLs with \"??\". This allows for better joins on those fields, but needs to be changed back to NULL at the end of the query.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_60","title":"Parameters","text":"INPUTS
country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#get_key-udf","title":"get_key (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_61","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_62","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_nanos-udf","title":"glean_timespan_nanos (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_63","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_seconds-udf","title":"glean_timespan_seconds (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_64","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#gzip_length_footer-udf","title":"gzip_length_footer (UDF)","text":"Given a gzip compressed byte string, extract the uncompressed size from the footer. WARNING: THIS FUNCTION IS NOT RELIABLE FOR ARBITRARY GZIP STREAMS. It should, however, be safe to use for checking the decompressed size of payload in payload_bytes_decoded (and NOT payload_bytes_raw) because that payload is produced by the decoder and limited to conditions where the footer is accurate. From https://stackoverflow.com/a/9213826 First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 232. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 232 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.) Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.) Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_65","title":"Parameters","text":"INPUTS
compressed BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_max_key_with_nonzero_value-udf","title":"histogram_max_key_with_nonzero_value (UDF)","text":"Find the largest numeric bucket that contains a value greater than zero. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L253-L266
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_66","title":"Parameters","text":"INPUTS
histogram STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_merge-udf","title":"histogram_merge (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_67","title":"Parameters","text":"INPUTS
histogram_list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_normalize-udf","title":"histogram_normalize (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_68","title":"Parameters","text":"INPUTS
histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n
OUTPUTS
STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_percentiles-udf","title":"histogram_percentiles (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_69","title":"Parameters","text":"INPUTS
histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_mean-udf","title":"histogram_to_mean (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_70","title":"Parameters","text":"INPUTS
histogram ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_threshold_count-udf","title":"histogram_to_threshold_count (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_71","title":"Parameters","text":"INPUTS
histogram STRING, threshold INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#hmac_sha256-udf","title":"hmac_sha256 (UDF)","text":"Given a key and message, return the HMAC-SHA256 hash. This algorithm can be found in Wikipedia: https://en.wikipedia.org/wiki/HMAC#Implementation This implentation is validated against the NIST test vectors. See test/validation/hmac_sha256.py for more information.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_72","title":"Parameters","text":"INPUTS
key BYTES, message BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#int_to_365_bits-udf","title":"int_to_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_73","title":"Parameters","text":"INPUTS
value INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#int_to_hex_string-udf","title":"int_to_hex_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_74","title":"Parameters","text":"INPUTS
value INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_75","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_int_map-udf","title":"json_extract_int_map (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_76","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_mode_last-udf","title":"json_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_77","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#keyed_histogram_get_sum-udf","title":"keyed_histogram_get_sum (UDF)","text":"Take a keyed histogram of type STRUCT, extract the histogram of the given key, and return the sum value"},{"location":"moz-fx-data-shared-prod/udf/#parameters_78","title":"Parameters","text":"
INPUTS
keyed_histogram ANY TYPE, target_key STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_append_to_json_string-udf","title":"kv_array_append_to_json_string (UDF)","text":"Returns a JSON string which has the pair
appended to the provided input
JSON string. NULL is also valid for input
. Examples: udf.kv_array_append_to_json_string('{\"foo\":\"bar\"}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"foo\":\"bar\",\"baz\":\"boo\"}' udf.kv_array_append_to_json_string('{}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"baz\": \"boo\"}'
INPUTS
input STRING, arr ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_to_json_string-udf","title":"kv_array_to_json_string (UDF)","text":"Returns a JSON string representing the input key-value array. Value type must be able to be represented as a string - this function will cast to a string. At Mozilla, the schema for a map is STRUCT>>. To use this with that representation, it should be as udf.kv_array_to_json_string(struct.key_value)
."},{"location":"moz-fx-data-shared-prod/udf/#parameters_80","title":"Parameters","text":"
INPUTS
kv_arr ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#main_summary_scalars-udf","title":"main_summary_scalars (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_81","title":"Parameters","text":"INPUTS
processes ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_bing_revenue_country_to_country_code-udf","title":"map_bing_revenue_country_to_country_code (UDF)","text":"For use by LTV revenue join only. Maps the Bing country to a country code. Only keeps the country codes we want to aggregate on.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_82","title":"Parameters","text":"INPUTS
country STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_mode_last-udf","title":"map_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_83","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_revenue_country-udf","title":"map_revenue_country (UDF)","text":"Only for use by the LTV Revenue join. Maps country codes to the codes we have in the revenue dataset. Buckets small Bing countries into \"other\".
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_84","title":"Parameters","text":"INPUTS
engine STRING, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_sum-udf","title":"map_sum (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_85","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#marketing_attributable_desktop-udf","title":"marketing_attributable_desktop (UDF)","text":"This is a UDF to help distinguish if acquired desktop clients are attributable to marketing efforts or not
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_86","title":"Parameters","text":"INPUTS
medium STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#merge_scalar_user_data-udf","title":"merge_scalar_user_data (UDF)","text":"Given an array of scalar metric data that might have duplicate values for a metric, merge them into one value.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_87","title":"Parameters","text":"INPUTS
aggs ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n
OUTPUTS
ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mod_uint128-udf","title":"mod_uint128 (UDF)","text":"This function returns \"dividend mod divisor\" where the dividend and the result is encoded in bytes, and divisor is an integer.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_88","title":"Parameters","text":"INPUTS
dividend BYTES, divisor INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mode_last-udf","title":"mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_89","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_90","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#monetized_search-udf","title":"monetized_search (UDF)","text":"Stub monetized_search UDF for tests
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_91","title":"Parameters","text":"INPUTS
engine STRING, country STRING, distribution_id STRING, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#new_monthly_engine_searches_struct-udf","title":"new_monthly_engine_searches_struct (UDF)","text":"This struct represents the past year's worth of searches. Each month has its own entry, hence 12.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_92","title":"Parameters","text":"INPUTS
) AS ( STRUCT( udf.array_of_12_zeroes(\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_fenix_metrics-udf","title":"normalize_fenix_metrics (UDF)","text":"Accepts a glean metrics struct as input and returns a modified struct that nulls out histograms for older versions of the Glean SDK that reported pathological binning; see https://bugzilla.mozilla.org/show_bug.cgi?id=1592930
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_93","title":"Parameters","text":"INPUTS
telemetry_sdk_build STRING, metrics ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_baseline_client_info-udf","title":"normalize_glean_baseline_client_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_94","title":"Parameters","text":"INPUTS
client_info ANY TYPE, metrics ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_ping_info-udf","title":"normalize_glean_ping_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_95","title":"Parameters","text":"INPUTS
ping_info ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_main_payload-udf","title":"normalize_main_payload (UDF)","text":"Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_96","title":"Parameters","text":"INPUTS
payload ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_metadata-udf","title":"normalize_metadata (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_97","title":"Parameters","text":"INPUTS
metadata ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_monthly_searches-udf","title":"normalize_monthly_searches (UDF)","text":"Sum up the monthy search count arrays by normalized engine
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_98","title":"Parameters","text":"INPUTS
engine_searches ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_os-udf","title":"normalize_os (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_99","title":"Parameters","text":"INPUTS
os STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_search_engine-udf","title":"normalize_search_engine (UDF)","text":"Return normalized engine name for recognized engines This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_100","title":"Parameters","text":"INPUTS
engine STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#null_if_empty_list-udf","title":"null_if_empty_list (UDF)","text":"Return NULL if list is empty, otherwise return list. This cannot be done with NULLIF because NULLIF does not support arrays.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_101","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#one_as_365_bits-udf","title":"one_as_365_bits (UDF)","text":"One represented as a byte array of 365 bits
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_102","title":"Parameters","text":"INPUTS
) AS ( CONCAT(REPEAT(b'\\x00', 45\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_desktop-udf","title":"organic_vs_paid_desktop (UDF)","text":"This is a UDF to help distinguish desktop client attribution as being organic or paid
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_103","title":"Parameters","text":"INPUTS
medium STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_mobile-udf","title":"organic_vs_paid_mobile (UDF)","text":"This is a UDF to help distinguish mobile client attribution as being organic or paid
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_104","title":"Parameters","text":"INPUTS
adjust_network STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pack_event_properties-udf","title":"pack_event_properties (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_105","title":"Parameters","text":"INPUTS
event_properties ANY TYPE, indices ANY TYPE\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
"},{"location":"moz-fx-data-shared-prod/udf/#parquet_array_sum-udf","title":"parquet_array_sum (UDF)","text":"Sum an array from a parquet-derived field. These are lists of an element
that contain the field value.
INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#parse_desktop_telemetry_uri-udf","title":"parse_desktop_telemetry_uri (UDF)","text":"Parses and labels the components of a telemetry desktop ping submission uri Per https://docs.telemetry.mozilla.org/concepts/pipeline/http_edge_spec.html#special-handling-for-firefox-desktop-telemetry the format is /submit/telemetry/docId/docType/appName/appVersion/appUpdateChannel/appBuildID e.g. /submit/telemetry/ce39b608-f595-4c69-b6a6-f7a436604648/main/Firefox/61.0a1/nightly/20180328030202
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_107","title":"Parameters","text":"INPUTS
uri STRING\n
OUTPUTS
STRUCT<namespace STRING, document_id STRING, document_type STRING, app_name STRING, app_version STRING, app_update_channel STRING, app_build_id STRING>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#parse_iso8601_date-udf","title":"parse_iso8601_date (UDF)","text":"Take a ISO 8601 date or date and time string and return a DATE. Return null if parse fails. Possible formats: 2019-11-04, 2019-11-04T21:15:00+00:00, 2019-11-04T21:15:00Z, 20191104T211500Z
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_108","title":"Parameters","text":"INPUTS
date_str STRING\n
OUTPUTS
DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_clients-udf","title":"partner_org_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_109","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_ga_metrics-udf","title":"partner_org_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_110","title":"Parameters","text":"INPUTS
) RETURNS STRING AS ( (SELECT 'hola_world' AS partner_org\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_installs-udf","title":"partner_org_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_111","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_leading_set_bit-udf","title":"pos_of_leading_set_bit (UDF)","text":"Returns the 0-based index of the first set bit. No set bits returns NULL.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_112","title":"Parameters","text":"INPUTS
i INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_trailing_set_bit-udf","title":"pos_of_trailing_set_bit (UDF)","text":"Identical to bits28_days_since_seen. Returns a 0-based index of the rightmost set bit in the passed bit pattern or null if no bits are set (bits = 0). To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_113","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#product_info_with_baseline-udf","title":"product_info_with_baseline (UDF)","text":"Similar to mozfun.norm.product_info(), but this UDF also handles \"baseline\" apps that were introduced differentiate for certain apps whether data is sent through Glean or core pings. This UDF has been temporarily introduced as part of https://bugzilla.mozilla.org/show_bug.cgi?id=1775216
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_114","title":"Parameters","text":"INPUTS
legacy_app_name STRING, normalized_os STRING\n
OUTPUTS
STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pseudonymize_ad_id-udf","title":"pseudonymize_ad_id (UDF)","text":"Pseudonymize Ad IDs, handling opt-outs.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_115","title":"Parameters","text":"INPUTS
hashed_ad_id STRING, key BYTES\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#quantile_search_metric_contribution-udf","title":"quantile_search_metric_contribution (UDF)","text":"This function returns how much of one metric is contributed by the quantile of another metric. Quantile variable should add an offset to get the requried percentile value. Example: udf.quantile_search_metric_contribution(sap, search_with_ads, sap_percentiles[OFFSET(9)]) It returns search_with_ads if sap value in top 10% volumn else null.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_116","title":"Parameters","text":"INPUTS
metric1 FLOAT64, metric2 FLOAT64, quantile FLOAT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#round_timestamp_to_minute-udf","title":"round_timestamp_to_minute (UDF)","text":"Floor a timestamp object to the given minute interval.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_117","title":"Parameters","text":"INPUTS
timestamp_expression TIMESTAMP, minute INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#safe_crc32_uuid-udf","title":"safe_crc32_uuid (UDF)","text":"Calculate the CRC-32 hash of a 36-byte UUID, or NULL if the value isn't 36 bytes. This implementation is limited to an exact length because recursion does not work. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_118","title":"Parameters","text":"INPUTS
) AS ( [ 0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685, 2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995, 2125561021, 3887607047, 2428444049, 498536548, 1789927666, 4089016648, 2227061214, 450548861, 1843258603, 4107580753, 2211677639, 325883990, 1684777152, 4251122042, 2321926636, 335633487, 1661365465, 4195302755, 2366115317, 997073096, 1281953886, 3579855332, 2724688242, 1006888145, 1258607687, 3524101629, 2768942443, 901097722, 1119000684, 3686517206, 2898065728, 853044451, 1172266101, 3705015759, 2882616665, 651767980, 1373503546, 3369554304, 3218104598, 565507253, 1454621731, 3485111705, 3099436303, 671266974, 1594198024, 3322730930, 2970347812, 795835527, 1483230225, 3244367275, 3060149565, 1994146192, 31158534, 2563907772, 4023717930, 1907459465, 112637215, 2680153253, 3904427059, 2013776290, 251722036, 2517215374, 3775830040, 2137656763, 141376813, 2439277719, 3865271297, 1802195444, 476864866, 2238001368, 4066508878, 1812370925, 453092731, 2181625025, 4111451223, 1706088902, 314042704, 2344532202, 4240017532, 1658658271, 366619977, 2362670323, 4224994405, 1303535960, 984961486, 2747007092, 3569037538, 1256170817, 1037604311, 2765210733, 3554079995, 1131014506, 879679996, 2909243462, 3663771856, 1141124467, 855842277, 2852801631, 3708648649, 1342533948, 654459306, 3188396048, 3373015174, 1466479909, 544179635, 3110523913, 3462522015, 1591671054, 702138776, 2966460450, 3352799412, 1504918807, 783551873, 3082640443, 3233442989, 3988292384, 2596254646, 62317068, 1957810842, 3939845945, 2647816111, 81470997, 1943803523, 3814918930, 2489596804, 225274430, 2053790376, 3826175755, 2466906013, 167816743, 2097651377, 4027552580, 2265490386, 503444072, 1762050814, 4150417245, 2154129355, 426522225, 1852507879, 4275313526, 2312317920, 282753626, 1742555852, 4189708143, 2394877945, 397917763, 1622183637, 3604390888, 2714866558, 953729732, 1340076626, 3518719985, 2797360999, 1068828381, 1219638859, 3624741850, 2936675148, 906185462, 1090812512, 3747672003, 2825379669, 829329135, 1181335161, 3412177804, 3160834842, 628085408, 1382605366, 3423369109, 3138078467, 570562233, 1426400815, 3317316542, 2998733608, 733239954, 1555261956, 3268935591, 3050360625, 752459403, 1541320221, 2607071920, 3965973030, 1969922972, 40735498, 2617837225, 3943577151, 1913087877, 83908371, 2512341634, 3803740692, 2075208622, 213261112, 2463272603, 3855990285, 2094854071, 198958881, 2262029012, 4057260610, 1759359992, 534414190, 2176718541, 4139329115, 1873836001, 414664567, 2282248934, 4279200368, 1711684554, 285281116, 2405801727, 4167216745, 1634467795, 376229701, 2685067896, 3608007406, 1308918612, 956543938, 2808555105, 3495958263, 1231636301, 1047427035, 2932959818, 3654703836, 1088359270, 936918000, 2847714899, 3736837829, 1202900863, 817233897, 3183342108, 3401237130, 1404277552, 615818150, 3134207493, 3453421203, 1423857449, 601450431, 3009837614, 3294710456, 1567103746, 711928724, 3020668471, 3272380065, 1510334235, 755167117 ]\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#safe_sample_id-udf","title":"safe_sample_id (UDF)","text":"Stably hash a client_id to an integer between 0 and 99, or NULL if client_id isn't 36 bytes
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_119","title":"Parameters","text":"INPUTS
client_id STRING\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#search_counts_map_sum-udf","title":"search_counts_map_sum (UDF)","text":"Calculate the sums of search counts per source and engine
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_120","title":"Parameters","text":"INPUTS
entries ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_28_bits_one_day-udf","title":"shift_28_bits_one_day (UDF)","text":"Shift input bits one day left and drop any bits beyond 28 days.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_121","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_365_bits_one_day-udf","title":"shift_365_bits_one_day (UDF)","text":"Shift input bits one day left and drop any bits beyond 365 days.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_122","title":"Parameters","text":"INPUTS
x BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_one_day-udf","title":"shift_one_day (UDF)","text":"Returns the bitfield shifted by one day, 0 for NULL
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_123","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#smoot_usage_from_28_bits-udf","title":"smoot_usage_from_28_bits (UDF)","text":"Calculates a variety of metrics based on bit patterns of daily usage for the smoot_usage_* tables.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_124","title":"Parameters","text":"INPUTS
bit_arrays ARRAY<STRUCT<days_created_profile_bits INT64, days_active_bits INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#vector_add-udf","title":"vector_add (UDF)","text":"This function adds two vectors. The two vectors can have different length. If one vector is null, the other vector will be returned directly.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_125","title":"Parameters","text":"INPUTS
a ARRAY<INT64>, b ARRAY<INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#zero_as_365_bits-udf","title":"zero_as_365_bits (UDF)","text":"Zero represented as a 365-bit byte array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_126","title":"Parameters","text":"INPUTS
) AS ( REPEAT(b'\\x00', 46\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#zeroed_array-udf","title":"zeroed_array (UDF)","text":"Generates an array if all zeroes, of arbitrary length
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_127","title":"Parameters","text":"INPUTS
len INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/","title":"Udf js","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#bootstrap_percentile_ci-udf","title":"bootstrap_percentile_ci (UDF)","text":"Calculate a confidence interval using an efficient bootstrap sampling technique for a given percentile of a histogram. This implementation relies on the stdlib.js library and the binomial quantile function (https://github.com/stdlib-js/stats-base-dists-binomial-quantile/) for randomly sampling from a binomial distribution.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters","title":"Parameters","text":"INPUTS
percentiles ARRAY<INT64>, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>, metric STRING\n
OUTPUTS
ARRAY<STRUCT<metric STRING, statistic STRING, point FLOAT64, lower FLOAT64, upper FLOAT64, parameter STRING>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#crc32-udf","title":"crc32 (UDF)","text":"Calculate the CRC-32 hash of an input string. The implementation here could be optimized. In particular, it calculates a lookup table on every invocation which could be cached and reused. In practice, though, this implementation appears to be fast enough that further optimization is not yet warranted. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_1","title":"Parameters","text":"INPUTS
data STRING\n
OUTPUTS
INT64 DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#decode_uri_attribution-udf","title":"decode_uri_attribution (UDF)","text":"URL decodes the raw firefox_installer.install.attribution string to a STRUCT. The fields campaign, content, dlsource, dltoken, experiment, medium, source, ua, variation the string are extracted. If any value is (not+set) it is converted to (not set) to match the text from GA when the fields are not set.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_2","title":"Parameters","text":"INPUTS
attribution STRING\n
OUTPUTS
STRUCT<campaign STRING, content STRING, dlsource STRING, dltoken STRING, experiment STRING, medium STRING, source STRING, ua STRING, variation STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#extract_string_from_bytes-udf","title":"extract_string_from_bytes (UDF)","text":"Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function extracts string data from payload
which is in bytes.
INPUTS
payload BYTES\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#gunzip-udf","title":"gunzip (UDF)","text":"Unzips a GZIP string. This implementation relies on the zlib.js library (https://github.com/imaya/zlib.js) and the atob function for decoding base64.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_4","title":"Parameters","text":"INPUTS
input BYTES\n
OUTPUTS
STRING DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_mean_ci-udf","title":"jackknife_mean_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the mean of an array of values for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_5","title":"Parameters","text":"INPUTS
n_buckets INT64, values_per_bucket ARRAY<FLOAT64>\n
OUTPUTS
STRUCT<low FLOAT64, high FLOAT64, pm FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_percentile_ci-udf","title":"jackknife_percentile_ci (UDF)","text":"Calculate a confidence interval using a jackknife resampling technique for a given percentile of a histogram.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_6","title":"Parameters","text":"INPUTS
percentile FLOAT64, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>\n
OUTPUTS
STRUCT<low FLOAT64, high FLOAT64, percentile FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_ratio_ci-udf","title":"jackknife_ratio_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the weighted mean of an array of ratios for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(active_days_in_week) AS active_days_in_week, SUM(wau) AS wau FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_ratio_ci(20, ARRAY_AGG(STRUCT(CAST(active_days_in_week AS float64), CAST(wau as FLOAT64)))) AS intensity FROM bucketed GROUP BY submission_date
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_7","title":"Parameters","text":"INPUTS
n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n
OUTPUTS
intensity FROM bucketed GROUP BY submission_date */ CREATE OR REPLACE FUNCTION udf_js.jackknife_ratio_ci( n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_sum_ci-udf","title":"jackknife_sum_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the sum of an array of counts for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate count per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(dau) AS dau_sum FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_sum_ci(ARRAY_AGG(dau_sum)).* FROM bucketed GROUP BY submission_date
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_8","title":"Parameters","text":"INPUTS
n_buckets INT64, counts_per_bucket ARRAY<INT64>\n
OUTPUTS
STRUCT<total INT64, low INT64, high INT64, pm INT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_events-udf","title":"json_extract_events (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_9","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<event_process STRING, event_timestamp INT64, event_category STRING, event_object STRING, event_method STRING, event_string_value STRING, event_map_values ARRAY<STRUCT<key STRING, value STRING>>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":"Returns a parsed struct from a JSON string representing a histogram. This implementation uses JavaScript and is provided for performance comparison; see udf/udf_json_extract_histogram for a pure SQL implementation that will likely be more usable in practice.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_10","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
STRUCT<bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_keyed_histogram-udf","title":"json_extract_keyed_histogram (UDF)","text":"Returns an array of parsed structs from a JSON string representing a keyed histogram. This is likely only useful for histograms that weren't properly parsed to fields, so ended up embedded in an additional_properties JSON blob. Normally, keyed histograms will be modeled as a key/value struct where the values are JSON representations of single histograms. There is no pure SQL equivalent to this function, since BigQuery does not provide any functions for listing or iterating over keysn in a JSON map.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_11","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_missing_cols-udf","title":"json_extract_missing_cols (UDF)","text":"Extract missing columns from additional properties. More generally, get a list of nodes from a JSON blob. Array elements are indicated as [...]. param input: The JSON blob to explode param indicates_node: An array of strings. If a key's value is an object, and contains one of these values, that key is returned as a node. param known_nodes: An array of strings. If a key is in this array, it is returned as a node. Notes: - Use indicates_node for things like histograms. For example ['histogram_type'] will ensure that each histogram will be returned as a missing node, rather than the subvalues within the histogram (e.g. values, sum, etc.) - Use known_nodes if you're aware of a missing section, like ['simpleMeasurements'] See here for an example usage https://sql.telemetry.mozilla.org/queries/64460/source
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_12","title":"Parameters","text":"INPUTS
input STRING, indicates_node ARRAY<STRING>, known_nodes ARRAY<STRING>\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_active_addons-udf","title":"main_summary_active_addons (UDF)","text":"Add fields from additional_attributes to active_addons in main pings. Return an array instead of a \"map\" for backwards compatibility. The INT64 columns from BigQuery may be passed as strings, so parseInt before returning them if they will be coerced to BOOL. The fields from additional_attributes due to union types: integer or boolean for foreignInstall and userDisabled; string or number for version. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L422-L449
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_13","title":"Parameters","text":"INPUTS
active_addons ARRAY<STRUCT<key STRING, value STRUCT<app_disabled BOOL, blocklisted BOOL, description STRING, foreign_install INT64, has_binary_components BOOL, install_day INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL, name STRING, scope INT64, signed_state INT64, type STRING, update_day INT64, user_disabled INT64, version STRING>>>, active_addons_json STRING\n
OUTPUTS
ARRAY<STRUCT<addon_id STRING, blocklisted BOOL, name STRING, user_disabled BOOL, app_disabled BOOL, version STRING, scope INT64, type STRING, foreign_install BOOL, has_binary_components BOOL, install_day INT64, update_day INT64, signed_state INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_addon_scalars-udf","title":"main_summary_addon_scalars (UDF)","text":"Parse scalars from payload.processes.dynamic into map columns for each value type. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L385-L399
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_14","title":"Parameters","text":"INPUTS
dynamic_scalars_json STRING, dynamic_keyed_scalars_json STRING\n
OUTPUTS
STRUCT<keyed_boolean_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value BOOL>>>>, keyed_uint_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value INT64>>>>, string_addon_scalars ARRAY<STRUCT<key STRING, value STRING>>, keyed_string_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value STRING>>>>, uint_addon_scalars ARRAY<STRUCT<key STRING, value INT64>>, boolean_addon_scalars ARRAY<STRUCT<key STRING, value BOOL>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_disabled_addons-udf","title":"main_summary_disabled_addons (UDF)","text":"Report the ids of the addons which are in the addonDetails but not in the activeAddons. They are the disabled addons (possibly because they are legacy). We need this as addonDetails may contain both disabled and active addons. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L451-L464
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_15","title":"Parameters","text":"INPUTS
active_addon_ids ARRAY<STRING>, addon_details_json STRING\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#parse_sponsored_interaction-udf","title":"parse_sponsored_interaction (UDF)","text":"Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function parses the sponsored interaction column from payload_error_bytes.contextual_services table.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_16","title":"Parameters","text":"INPUTS
params STRING\n
OUTPUTS
STRUCT<`source` STRING, formFactor STRING, scenario STRING, interactionType STRING, contextId STRING, reportingUrl STRING, requestId STRING, submissionTimestamp TIMESTAMP, parsedReportingUrl JSON, originalDocType STRING, originalNamespace STRING, interactionCount INTEGER, flaggedFraud BOOLEAN>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#sample_id-udf","title":"sample_id (UDF)","text":"Stably hash a client_id to an integer between 0 and 99. This function is technically defined in SQL, but it calls a JS UDF implementation of a CRC-32 hash, so we defined it here to make it clear that its performance may be limited by BigQuery's JavaScript UDF environment.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_17","title":"Parameters","text":"INPUTS
client_id STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#snake_case_columns-udf","title":"snake_case_columns (UDF)","text":"This UDF takes a list of column names to snake case and transform them to be compatible with the BigQuery column naming format. Based on the existing ingestion logic https://github.com/mozilla/gcp-ingestion/blob/dad29698271e543018eddbb3b771ad7942bf4ce5/ ingestion-core/src/main/java/com/mozilla/telemetry/ingestion/core/transform/PubsubMessageToObjectNode.java#L824
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_18","title":"Parameters","text":"INPUTS
input ARRAY<STRING>\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/about/","title":"mozfun","text":"mozfun
is a public GCP project provisioning publicly accessible user-defined functions (UDFs) and other function-like resources.
Returns whether a given Addon ID is an adblocker.
Determine if a given Addon ID is for an adblocker.
As an example, this query will give the number of users who have an adblocker installed.
SELECT\n submission_date,\n COUNT(DISTINCT client_id) AS dau,\nFROM\n mozdata.telemetry.addons\nWHERE\n mozfun.addons.is_adblocker(addon_id)\n AND submission_date >= \"2023-01-01\"\nGROUP BY\n submission_date\n
"},{"location":"mozfun/addons/#parameters","title":"Parameters","text":"INPUTS
addon_id STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/","title":"Assert","text":""},{"location":"mozfun/assert/#all_fields_null-udf","title":"all_fields_null (UDF)","text":""},{"location":"mozfun/assert/#parameters","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#approx_equals-udf","title":"approx_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_1","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE, tolerance FLOAT64\n
Source | Edit
"},{"location":"mozfun/assert/#array_empty-udf","title":"array_empty (UDF)","text":""},{"location":"mozfun/assert/#parameters_2","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#array_equals-udf","title":"array_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_3","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#array_equals_any_order-udf","title":"array_equals_any_order (UDF)","text":""},{"location":"mozfun/assert/#parameters_4","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#equals-udf","title":"equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_5","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#error-udf","title":"error (UDF)","text":""},{"location":"mozfun/assert/#parameters_6","title":"Parameters","text":"INPUTS
name STRING, expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#false-udf","title":"false (UDF)","text":""},{"location":"mozfun/assert/#parameters_7","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/assert/#histogram_equals-udf","title":"histogram_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_8","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#json_equals-udf","title":"json_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_9","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#map_entries_equals-udf","title":"map_entries_equals (UDF)","text":"Like map_equals but error message contains only the offending entry
"},{"location":"mozfun/assert/#parameters_10","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#map_equals-udf","title":"map_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_11","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#not_null-udf","title":"not_null (UDF)","text":""},{"location":"mozfun/assert/#parameters_12","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
"},{"location":"mozfun/assert/#null-udf","title":"null (UDF)","text":""},{"location":"mozfun/assert/#parameters_13","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#sql_equals-udf","title":"sql_equals (UDF)","text":"Compare SQL Strings for equality
"},{"location":"mozfun/assert/#parameters_14","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#struct_equals-udf","title":"struct_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_15","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#true-udf","title":"true (UDF)","text":""},{"location":"mozfun/assert/#parameters_16","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/bits28/","title":"bits28","text":"The bits28
functions provide an API for working with \"bit pattern\" INT64 fields, as used in the clients_last_seen
dataset for desktop Firefox and similar datasets for other applications.
A powerful feature of the clients_last_seen
methodology is that it doesn't record specific metrics like MAU and WAU directly, but rather each row stores a history of the discrete days on which a client was active in the past 28 days. We could calculate active users in a 10 day or 25 day window just as efficiently as a 7 day (WAU) or 28 day (MAU) window. But we can also define completely new metrics based on these usage histories, such as various retention definitions.
The usage history is encoded as a \"bit pattern\" where the physical type of the field is a BigQuery INT64, but logically the integer represents an array of bits, with each 1 indicating a day where the given client was active and each 0 indicating a day where the client was inactive.
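For example, the 10-day window mentioned above could be computed with bits28.active_in_range (documented below); this sketch assumes the days_seen_bits column used in the later examples:
SELECT\n COUNT(DISTINCT client_id) AS active_10_day\nFROM\n `mozdata.telemetry.clients_last_seen`\nWHERE\n submission_date = '2020-01-28'\n AND mozfun.bits28.active_in_range(days_seen_bits, -9, 10)\n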
"},{"location":"mozfun/bits28/#active_in_range-udf","title":"active_in_range (UDF)","text":"Return a boolean indicating if any bits are set in the specified range of a bit pattern. The start_offset
must be zero or a negative number indicating an offset from the rightmost bit in the pattern. n_bits is the number of bits to consider, counting right from the bit at start_offset
.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
"},{"location":"mozfun/bits28/#parameters","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/bits28/#days_since_seen-udf","title":"days_since_seen (UDF)","text":"Return the position of the rightmost set bit in an INT64 bit pattern.
To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n mozfun.bits28.days_since_seen(18)\n-- >> 1\n
"},{"location":"mozfun/bits28/#parameters_1","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#from_string-udf","title":"from_string (UDF)","text":"Convert a string representing individual bits into an INT64.
Implementation based on https://stackoverflow.com/a/51600210/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
"},{"location":"mozfun/bits28/#parameters_2","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#range-udf","title":"range (UDF)","text":"Return an INT64 representing a range of bits from a source bit pattern.
The start_offset must be zero or a negative number indicating an offset from the rightmost bit in the pattern.
n_bits is the number of bits to consider, counting right from the bit at start_offset.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n -- Signature is bits28.range(offset_to_day_0, start_bit, number_of_bits)\n mozfun.bits28.range(days_seen_bits, -13 + 0, 7) AS week_0_bits,\n mozfun.bits28.range(days_seen_bits, -13 + 7, 7) AS week_1_bits\nFROM\n `mozdata.telemetry.clients_last_seen`\nWHERE\n submission_date > '2020-01-01'\n
"},{"location":"mozfun/bits28/#parameters_3","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#retention-udf","title":"retention (UDF)","text":"Return a nested struct providing booleans indicating whether a given client was active various time periods based on the passed bit pattern.
"},{"location":"mozfun/bits28/#parameters_4","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
Source | Edit
"},{"location":"mozfun/bits28/#to_dates-udf","title":"to_dates (UDF)","text":"Convert a bit pattern into an array of the dates is represents.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
"},{"location":"mozfun/bits28/#parameters_5","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
OUTPUTS
ARRAY<DATE>\n
Source | Edit
"},{"location":"mozfun/bits28/#to_string-udf","title":"to_string (UDF)","text":"Convert an INT64 field into a 28-character string representing the individual bits.
Implementation based on https://stackoverflow.com/a/51600210/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n [mozfun.bits28.to_string(1), mozfun.bits28.to_string(2), mozfun.bits28.to_string(3)]\n-- >>> ['0000000000000000000000000001',\n-- '0000000000000000000000000010',\n-- '0000000000000000000000000011']\n
"},{"location":"mozfun/bits28/#parameters_6","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/bytes/","title":"bytes","text":""},{"location":"mozfun/bytes/#bit_pos_to_byte_pos-udf","title":"bit_pos_to_byte_pos (UDF)","text":"Given a bit position, get the byte that bit appears in. 1-indexed (to match substr), and accepts negative values.
"},{"location":"mozfun/bytes/#parameters","title":"Parameters","text":"INPUTS
bit_pos INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bytes/#extract_bits-udf","title":"extract_bits (UDF)","text":"Extract bits from a byte array. Roughly matches substr with three arguments: b: bytes - The byte string we need to extract from start: int - The position of the first bit we want to extract. Can be negative to start from the end of the byte array. One-indexed, like substring. length: int - The number of bits we want to extract
The return byte array will have CEIL(length/8) bytes. The bits of interest will start at the beginning of the byte string. In other words, the byte array will have trailing 0s for any non-relevant fields.
Examples: bytes.extract_bits(b'\\x0F\\xF0', 5, 8) = b'\\xFF' bytes.extract_bits(b'\\x0C\\xC0', -12, 8) = b'\\xCC'
"},{"location":"mozfun/bytes/#parameters_1","title":"Parameters","text":"INPUTS
b BYTES, `begin` INT64, length INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"mozfun/bytes/#zero_right-udf","title":"zero_right (UDF)","text":"Zero bits on the right of byte
"},{"location":"mozfun/bytes/#parameters_2","title":"Parameters","text":"INPUTS
b BYTES, length INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"mozfun/event_analysis/","title":"event_analysis","text":"These functions are specific for use with the events_daily
and event_types
tables. By themselves, these two tables are nearly impossible to use since the event history is compressed; however, these stored procedures should make the data accessible.
The events_daily
table is created as a result of two steps: 1. Map each event to a single UTF8 char which will represent it 2. Group each client-day and store a string that records, using the compressed format, that client's event history for that day. The characters are ordered by the timestamp at which they appeared that day.
The best way to access this data is to create a view to do the heavy lifting. For example, to see which clients completed a certain action, you can create a view using these functions that knows what that action's representation is (using the compressed mapping from 1.) and create a regex string that checks for the presence of that event. The view makes this transparent, and allows users to simply query a boolean field representing the presence of that event on that day.
"},{"location":"mozfun/event_analysis/#aggregate_match_strings-udf","title":"aggregate_match_strings (UDF)","text":"Given an array of strings that each match a single event, aggregate those into a single regex string that will match any of the events.
"},{"location":"mozfun/event_analysis/#parameters","title":"Parameters","text":"INPUTS
match_strings ARRAY<STRING>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_count_steps_query-stored-procedure","title":"create_count_steps_query (Stored Procedure)","text":"Generate the SQL statement that can be used to create an easily queryable view on events data.
"},{"location":"mozfun/event_analysis/#parameters_1","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n
OUTPUTS
sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_events_view-stored-procedure","title":"create_events_view (Stored Procedure)","text":"Create a view that queries the events_daily
table. This view currently supports both funnels and event counts. Funnels are created as a struct, with each step in the funnel as a boolean column in the struct, indicating whether the user completed that step on that day. Event counts are simply integers.
create_events_view(\n view_name STRING,\n project STRING,\n dataset STRING,\n funnels ARRAY<STRUCT<\n funnel_name STRING,\n funnel ARRAY<STRUCT<\n step_name STRING,\n events ARRAY<STRUCT<\n category STRING,\n event_name STRING>>>>>>,\n counts ARRAY<STRUCT<\n count_name STRING,\n events ARRAY<STRUCT<\n category STRING,\n event_name STRING>>>>\n )\n
view_name
: The name of the view that will be created. This view will be in the shared-prod project, in the analysis bucket, and so will be queryable at: `moz-fx-data-shared-prod`.analysis.{view_name}\n
project
: The project where the dataset
is located.dataset
: The dataset that must contain both the events_daily
and event_types
tables.funnels
: An array of funnels that will be created. Each funnel has two parts: 1. funnel_name
: The name of the funnel is what the column representing the funnel will be named in the view. For example, with the value \"onboarding\"
, the view can be selected as follows: SELECT onboarding\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. funnel
: The ordered series of steps that make up a funnel. Each step also has: 1. step_name
: Used to name the column within the funnel and represents whether the user completed that step on that day. For example, within onboarding
a user may have completed_first_card
as a step; this can be queried at SELECT onboarding.completed_first_step\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. events
: The set of events which indicate the user completed that step of the funnel. Most of the time this is a single event. Each event has a category
and event_name
.counts
: An array of counts. Each count has two parts, similar to funnel steps: 1. count_name
: Used to name the column representing the event count. E.g. \"clicked_settings_count\"
would be queried at SELECT clicked_settings_count\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. events
: The set of events you want to count. Each event has a category
and event_name
.Because the view definitions themselves are not informative about the contents of the events fields, it is best to put your query immediately after the procedure invocation, rather than invoking the procedure and running a separate query.
This STMO query is an example of doing so. This allows viewers of the query to easily interpret what the funnel and count columns represent.
"},{"location":"mozfun/event_analysis/#structure-of-the-resulting-view","title":"Structure of the Resulting View","text":"The view will be created at
`moz-fx-data-shared-prod`.analysis.{event_name}.\n
The view will have a schema roughly matching the following:
root\n |-- submission_date: date\n |-- client_id: string\n |-- {funnel_1_name}: record\n | |-- {funnel_step_1_name} boolean\n | |-- {funnel_step_2_name} boolean\n ...\n |-- {funnel_N_name}: record\n | |-- {funnel_step_M_name}: boolean\n |-- {count_1_name}: integer\n ...\n |-- {count_N_name}: integer\n ...dimensions...\n
"},{"location":"mozfun/event_analysis/#funnels","title":"Funnels","text":"Each funnel will be a STRUCT
with nested columns representing completion of each step The types of those columns are boolean, and represent whether the user completed that step on that day.
STRUCT(\n completed_step_1 BOOLEAN,\n completed_step_2 BOOLEAN,\n ...\n) AS funnel_name\n
With one row per-user per-day, you can use COUNTIF(funnel_name.completed_step_N)
to query these fields. See below for an example.
Each event count is simply an INT64
representing the number of times the user completed those events on that day. If there are multiple events represented within one count, the values are summed. For example, if you wanted to know the number of times a user opened or closed the app, you could create a single event count with those two events.
event_count_name INT64\n
"},{"location":"mozfun/event_analysis/#examples","title":"Examples","text":"The following creates a few fields: - collection_flow
is a funnel for those that started creating a collection within Fenix, and then finished, either by adding those tabs to an existing collection or saving it as a new collection. - collection_flow_saved
represents users who started the collection flow then saved it as a new collection. - number_of_collections_created
is the number of collections created - number_of_collections_deleted
is the number of collections deleted
CALL mozfun.event_analysis.create_events_view(\n 'fenix_collection_funnels',\n 'moz-fx-data-shared-prod',\n 'org_mozilla_firefox',\n\n -- Funnels\n [\n STRUCT(\n \"collection_flow\" AS funnel_name,\n [STRUCT(\n \"started_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n STRUCT(\n \"completed_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'saved' AS event_name),\n STRUCT('collections' AS category, 'tabs_added' AS event_name)] AS events)\n ] AS funnel),\n\n STRUCT(\n \"collection_flow_saved\" AS funnel_name,\n [STRUCT(\n \"started_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n STRUCT(\n \"saved_collection\" AS step_name,\n [STRUCT('collections' AS category, 'saved' AS event_name)] AS events)\n ] AS funnel)\n ],\n\n -- Event Counts\n [\n STRUCT(\n \"number_of_collections_created\" AS count_name,\n [STRUCT('collections' AS category, 'saved' AS event_name)] AS events\n ),\n STRUCT(\n \"number_of_collections_deleted\" AS count_name,\n [STRUCT('collections' AS category, 'removed' AS event_name)] AS events\n )\n ]\n);\n
From there, you can query a few things. For example, the fraction of users who completed each step of the collection flow over time:
SELECT\n submission_date,\n COUNTIF(collection_flow.started_collection_creation) / COUNT(*) AS started_collection_creation,\n COUNTIF(collection_flow.completed_collection_creation) / COUNT(*) AS completed_collection_creation,\nFROM\n `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n submission_date\n
Or you can see the number of collections created and deleted:
SELECT\n submission_date,\n SUM(number_of_collections_created) AS number_of_collections_created,\n SUM(number_of_collections_deleted) AS number_of_collections_deleted,\nFROM\n `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n submission_date\n
"},{"location":"mozfun/event_analysis/#parameters_2","title":"Parameters","text":"INPUTS
view_name STRING, project STRING, dataset STRING, funnels ARRAY<STRUCT<funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>>>, counts ARRAY<STRUCT<count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_funnel_regex-udf","title":"create_funnel_regex (UDF)","text":"Given an array of match strings, each representing a single funnel step, aggregate them into a regex string that will match only against the entire funnel. If intermediate_steps is TRUE, this allows for there to be events that occur between the funnel steps.
"},{"location":"mozfun/event_analysis/#parameters_3","title":"Parameters","text":"INPUTS
step_regexes ARRAY<STRING>, intermediate_steps BOOLEAN\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_funnel_steps_query-stored-procedure","title":"create_funnel_steps_query (Stored Procedure)","text":"Generate the SQL statement that can be used to create an easily queryable view on events data.
"},{"location":"mozfun/event_analysis/#parameters_4","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, funnel ARRAY<STRUCT<list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
OUTPUTS
sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#escape_metachars-udf","title":"escape_metachars (UDF)","text":"Escape all metachars from a regex string. This will make the string an exact match, no matter what it contains.
"},{"location":"mozfun/event_analysis/#parameters_5","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_index_to_match_string-udf","title":"event_index_to_match_string (UDF)","text":"Given an event index string, create a match string that is an exact match in the events_daily table.
"},{"location":"mozfun/event_analysis/#parameters_6","title":"Parameters","text":"INPUTS
index STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_property_index_to_match_string-udf","title":"event_property_index_to_match_string (UDF)","text":"Given an event index and property index from an event_types
table, returns a regular expression to match corresponding events within an events_daily
table's events
string that aren't missing the specified property.
INPUTS
event_index STRING, property_index INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_property_value_to_match_string-udf","title":"event_property_value_to_match_string (UDF)","text":"Given an event index, property index, and property value from an event_types
table, returns a regular expression to match corresponding events within an events_daily
table's events
string.
INPUTS
event_index STRING, property_index INTEGER, property_value STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#extract_event_counts-udf","title":"extract_event_counts (UDF)","text":"Extract the events and their counts from an events string. This function explicitly ignores event properties, and retrieves just the counts of the top-level events.
"},{"location":"mozfun/event_analysis/#usage_1","title":"Usage","text":"extract_event_counts(\n events STRING\n)\n
events
- A comma-separated events string, where each event is represented as a string of unicode chars.
See this dashboard for example usage.
"},{"location":"mozfun/event_analysis/#parameters_9","title":"Parameters","text":"INPUTS
events STRING\n
OUTPUTS
ARRAY<STRUCT<index STRING, count INT64>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#extract_event_counts_with_properties-udf","title":"extract_event_counts_with_properties (UDF)","text":"Extract events with event properties and their associated counts. Also extracts raw events and their counts. This allows for querying with and without properties in the same dashboard.
"},{"location":"mozfun/event_analysis/#usage_2","title":"Usage","text":"extract_event_counts_with_properties(\n events STRING\n)\n
events
- A comma-separated events string, where each event is represented as a string of unicode chars.
See this query for example usage.
"},{"location":"mozfun/event_analysis/#caveats","title":"Caveats","text":"This function extracts both counts for events with each property, and for all events without their properties.
This allows us to include both total counts for an event (with any property value), and events that don't have properties.
"},{"location":"mozfun/event_analysis/#parameters_10","title":"Parameters","text":"INPUTS
events STRING\n
OUTPUTS
ARRAY<STRUCT<event_index STRING, property_index INT64, property_value_index STRING, count INT64>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#get_count_sql-stored-procedure","title":"get_count_sql (Stored Procedure)","text":"For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.
"},{"location":"mozfun/event_analysis/#parameters_11","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n
OUTPUTS
count_sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#get_funnel_steps_sql-stored-procedure","title":"get_funnel_steps_sql (Stored Procedure)","text":"For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.
"},{"location":"mozfun/event_analysis/#parameters_12","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
OUTPUTS
funnel_sql STRING\n
Source | Edit
"},{"location":"mozfun/ga/","title":"Ga","text":""},{"location":"mozfun/ga/#nullify_string-udf","title":"nullify_string (UDF)","text":"Nullify a GA string, which sometimes come in \"(not set)\" or simply \"\"
UDF for handling empty Google Analytics data.
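Example (the expected outputs in the comments follow from the description above):
SELECT\n mozfun.ga.nullify_string('(not set)'), -- NULL\n mozfun.ga.nullify_string(''), -- NULL\n mozfun.ga.nullify_string('organic') -- 'organic'\n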
"},{"location":"mozfun/ga/#parameters","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/","title":"Glam","text":""},{"location":"mozfun/glam/#build_hour_to_datetime-udf","title":"build_hour_to_datetime (UDF)","text":"Parses the custom build id used for Fenix builds in GLAM to a datetime.
"},{"location":"mozfun/glam/#parameters","title":"Parameters","text":"INPUTS
build_hour STRING\n
OUTPUTS
DATETIME\n
Source | Edit
"},{"location":"mozfun/glam/#build_seconds_to_hour-udf","title":"build_seconds_to_hour (UDF)","text":"Returns a custom build id generated from the build seconds of a FOG build.
"},{"location":"mozfun/glam/#parameters_1","title":"Parameters","text":"INPUTS
build_hour STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#fenix_build_to_build_hour-udf","title":"fenix_build_to_build_hour (UDF)","text":"Returns a custom build id generated from the build hour of a Fenix build.
"},{"location":"mozfun/glam/#parameters_2","title":"Parameters","text":"INPUTS
app_build_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_bucket_from_value-udf","title":"histogram_bucket_from_value (UDF)","text":""},{"location":"mozfun/glam/#parameters_3","title":"Parameters","text":"INPUTS
buckets ARRAY<STRING>, val FLOAT64\n
OUTPUTS
FLOAT64\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_buckets_cast_string_array-udf","title":"histogram_buckets_cast_string_array (UDF)","text":"Cast histogram buckets into a string array.
"},{"location":"mozfun/glam/#parameters_4","title":"Parameters","text":"INPUTS
buckets ARRAY<INT64>\n
OUTPUTS
ARRAY<STRING>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_cast_json-udf","title":"histogram_cast_json (UDF)","text":"Cast a histogram into a JSON blob.
"},{"location":"mozfun/glam/#parameters_5","title":"Parameters","text":"INPUTS
histogram ARRAY<STRUCT<key STRING, value FLOAT64>>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_cast_struct-udf","title":"histogram_cast_struct (UDF)","text":"Cast a String-based JSON histogram to an Array of Structs
"},{"location":"mozfun/glam/#parameters_6","title":"Parameters","text":"INPUTS
json_str STRING\n
OUTPUTS
ARRAY<STRUCT<KEY STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_fill_buckets-udf","title":"histogram_fill_buckets (UDF)","text":"Interpolate missing histogram buckets with empty buckets.
"},{"location":"mozfun/glam/#parameters_7","title":"Parameters","text":"INPUTS
input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_fill_buckets_dirichlet-udf","title":"histogram_fill_buckets_dirichlet (UDF)","text":"Interpolate missing histogram buckets with empty buckets so it becomes a valid estimator for the dirichlet distribution.
See: https://docs.google.com/document/d/1ipy1oFIKDvHr3R6Ku0goRjS11R1ZH1z2gygOGkSdqUg
To use this, you must first: Aggregate the histograms to the client level, to get a histogram {k1: p1, k2: p2, ..., kK: pK} where the p's are proportions (and p1, p2, ... sum to 1) and K is the number of buckets.
This is then the client's estimated density, and every client has been reduced to one row (i.e. the client's histograms are reduced to this single one and normalized).
Then add all of these across clients to get {k1: P1, k2: P2, ..., kK: PK} where P1 = sum(p1 across N clients) and P2 = sum(p2 across N clients).
Calculate the total number of buckets K, as well as the total number of profiles N reporting.
Then our estimate for the final density is: {k1: ((P1 + 1/K) / (N + 1)), k2: ((P2 + 1/K) / (N + 1)), ...}
"},{"location":"mozfun/glam/#parameters_8","title":"Parameters","text":"INPUTS
input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>, total_users INT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_filter_high_values-udf","title":"histogram_filter_high_values (UDF)","text":"Prevent overflows by only keeping buckets where value is less than 2^40 allowing 2^24 entries. This value was chosen somewhat abitrarily, typically the max histogram value is somewhere on the order of ~20 bits. Negative values are incorrect and should not happen but were observed, probably due to some bit flips.
"},{"location":"mozfun/glam/#parameters_9","title":"Parameters","text":"INPUTS
aggs ARRAY<STRUCT<key STRING, value INT64>>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_from_buckets_uniform-udf","title":"histogram_from_buckets_uniform (UDF)","text":"Create an empty histogram from an array of buckets.
"},{"location":"mozfun/glam/#parameters_10","title":"Parameters","text":"INPUTS
buckets ARRAY<STRING>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_exponential_buckets-udf","title":"histogram_generate_exponential_buckets (UDF)","text":"Generate exponential buckets for a histogram.
"},{"location":"mozfun/glam/#parameters_11","title":"Parameters","text":"INPUTS
min FLOAT64, max FLOAT64, nBuckets FLOAT64\n
OUTPUTS
ARRAY<FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_functional_buckets-udf","title":"histogram_generate_functional_buckets (UDF)","text":"Generate functional buckets for a histogram. This is specific to Glean.
See: https://github.com/mozilla/glean/blob/main/glean-core/src/histogram/functional.rs
A functional bucketing algorithm. The bucket index of a given sample is determined with the following function:
$$ i = \\lfloor n \\cdot \\log_{\\text{base}}(x) \\rfloor $$
In other words, there are n buckets for each power of base
magnitude.
INPUTS
log_base INT64, buckets_per_magnitude INT64, range_max INT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_linear_buckets-udf","title":"histogram_generate_linear_buckets (UDF)","text":"Generate linear buckets for a histogram.
"},{"location":"mozfun/glam/#parameters_13","title":"Parameters","text":"INPUTS
min FLOAT64, max FLOAT64, nBuckets FLOAT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_scalar_buckets-udf","title":"histogram_generate_scalar_buckets (UDF)","text":"Generate scalar buckets for a histogram using a fixed number of buckets.
"},{"location":"mozfun/glam/#parameters_14","title":"Parameters","text":"INPUTS
min_bucket FLOAT64, max_bucket FLOAT64, num_buckets INT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_normalized_sum-udf","title":"histogram_normalized_sum (UDF)","text":"Compute the normalized sum of an array of histograms.
"},{"location":"mozfun/glam/#parameters_15","title":"Parameters","text":"INPUTS
arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_normalized_sum_with_original-udf","title":"histogram_normalized_sum_with_original (UDF)","text":"Compute the normalized and the non-normalized sum of an array of histograms.
"},{"location":"mozfun/glam/#parameters_16","title":"Parameters","text":"INPUTS
arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64, non_norm_value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#map_from_array_offsets-udf","title":"map_from_array_offsets (UDF)","text":""},{"location":"mozfun/glam/#parameters_17","title":"Parameters","text":"INPUTS
required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#map_from_array_offsets_precise-udf","title":"map_from_array_offsets_precise (UDF)","text":""},{"location":"mozfun/glam/#parameters_18","title":"Parameters","text":"INPUTS
required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#percentile-udf","title":"percentile (UDF)","text":"Get the value of the approximate CDF at the given percentile.
"},{"location":"mozfun/glam/#parameters_19","title":"Parameters","text":"INPUTS
pct FLOAT64, histogram ARRAY<STRUCT<key STRING, value FLOAT64>>, type STRING\n
OUTPUTS
FLOAT64\n
Source | Edit
"},{"location":"mozfun/glean/","title":"glean","text":"Functions for working with Glean data.
"},{"location":"mozfun/glean/#legacy_compatible_experiments-udf","title":"legacy_compatible_experiments (UDF)","text":"Formats a Glean experiments field into a Legacy Telemetry experiments field by dropping the extra information that Glean collects
This UDF transforms the ping_info.experiments
field from Glean pings into the format for experiments
used by Legacy Telemetry pings. In particular, it drops the exta information that Glean pings collect.
If you need to combine Glean data with Legacy Telemetry data, then you can use this UDF to transform a Glean experiments field into the structure of a Legacy Telemetry one.
"},{"location":"mozfun/glean/#parameters","title":"Parameters","text":"INPUTS
ping_info__experiments ARRAY<STRUCT<key STRING, value STRUCT<branch STRING, extra STRUCT<type STRING, enrollment_id STRING>>>>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/glean/#parse_datetime-udf","title":"parse_datetime (UDF)","text":"Parses a Glean datetime metric string value as a BigQuery timestamp.
See https://mozilla.github.io/glean/book/reference/metrics/datetime.html
"},{"location":"mozfun/glean/#parameters_1","title":"Parameters","text":"INPUTS
datetime_string STRING\n
OUTPUTS
TIMESTAMP\n
Source | Edit
"},{"location":"mozfun/glean/#timespan_nanos-udf","title":"timespan_nanos (UDF)","text":"Returns the number of nanoseconds represented by a Glean timespan struct.
See https://mozilla.github.io/glean/book/user/metrics/timespan.html
"},{"location":"mozfun/glean/#parameters_2","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/glean/#timespan_seconds-udf","title":"timespan_seconds (UDF)","text":"Returns the number of seconds represented by a Glean timespan struct, rounded down to full seconds.
See https://mozilla.github.io/glean/book/user/metrics/timespan.html
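Example (a sketch; 'millisecond' as the time_unit string is an assumption based on Glean's documented time units):
SELECT\n mozfun.glean.timespan_seconds(STRUCT('millisecond' AS time_unit, 5500 AS value))\n-- 5\n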
"},{"location":"mozfun/glean/#parameters_3","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/google_ads/","title":"Google ads","text":""},{"location":"mozfun/google_ads/#extract_segments_from_campaign_name-udf","title":"extract_segments_from_campaign_name (UDF)","text":"Extract Segments from a campaign name. Includes region, country_code, and language.
"},{"location":"mozfun/google_ads/#parameters","title":"Parameters","text":"INPUTS
campaign_name STRING\n
OUTPUTS
STRUCT<campaign_region STRING, campaign_country_code STRING, campaign_language STRING>\n
Source | Edit
"},{"location":"mozfun/google_search_console/","title":"google_search_console","text":"Functions for use with Google Search Console data.
"},{"location":"mozfun/google_search_console/#classify_site_query-udf","title":"classify_site_query (UDF)","text":"Classify a Google search query for a site as \"Anonymized\", \"Firefox Brand\", \"Pocket Brand\", \"Mozilla Brand\", or \"Non-Brand\".
"},{"location":"mozfun/google_search_console/#parameters","title":"Parameters","text":"INPUTS
site_domain_name STRING, query STRING, search_type STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_country_code-udf","title":"extract_url_country_code (UDF)","text":"Extract the country code from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_1","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_domain_name-udf","title":"extract_url_domain_name (UDF)","text":"Extract the domain name from a URL.
"},{"location":"mozfun/google_search_console/#parameters_2","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_language_code-udf","title":"extract_url_language_code (UDF)","text":"Extract the language code from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_3","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_locale-udf","title":"extract_url_locale (UDF)","text":"Extract the locale from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_4","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_path-udf","title":"extract_url_path (UDF)","text":"Extract the path from a URL.
"},{"location":"mozfun/google_search_console/#parameters_5","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_path_segment-udf","title":"extract_url_path_segment (UDF)","text":"Extract a particular path segment from a URL.
"},{"location":"mozfun/google_search_console/#parameters_6","title":"Parameters","text":"INPUTS
url STRING, segment_number INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/hist/","title":"hist","text":"Functions for working with string encodings of histograms from desktop telemetry.
"},{"location":"mozfun/hist/#count-udf","title":"count (UDF)","text":"Given histogram h, return the count of all measurements across all buckets.
Given histogram h, return the count of all measurements across all buckets.
Extracts the values from the histogram and sums them, returning the total_count.
"},{"location":"mozfun/hist/#parameters","title":"Parameters","text":"INPUTS
histogram STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract-udf","title":"extract (UDF)","text":"Return a parsed struct from a string-encoded histogram.
We support a variety of compact encodings as well as the classic JSON representation as sent in main pings.
The built-in BigQuery JSON parsing functions are not powerful enough to handle all the logic here, so we resort to some string processing. This function could behave unexpectedly on poorly-formatted histogram JSON, but we expect that payload validation in the data pipeline should ensure that histograms are well formed, which gives us some flexibility.
For more on desktop telemetry histogram structure, see:
The compact encodings were originally proposed in:
SELECT\n mozfun.hist.extract(\n '{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}'\n ).sum\n-- 1\n
SELECT\n mozfun.hist.extract('5').sum\n-- 5\n
"},{"location":"mozfun/hist/#parameters_1","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"Extract a histogram sum from a JSON str representation
"},{"location":"mozfun/hist/#parameters_2","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract_keyed_hist_sum-udf","title":"extract_keyed_hist_sum (UDF)","text":"Sum of a keyed histogram, across all keys it contains.
"},{"location":"mozfun/hist/#extract-keyed-histogram-sum","title":"Extract Keyed Histogram Sum","text":"Takes a keyed histogram and returns a single number: the sum of all keys it contains. The expected input type is ARRAY<STRUCT<key STRING, value STRING>>
The return type is INT64
.
The key
field will be ignored, and the `value is expected to be the compact histogram representation.
INPUTS
keyed_histogram ARRAY<STRUCT<key STRING, value STRING>>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#mean-udf","title":"mean (UDF)","text":"Given histogram h, return floor(mean) of the measurements in the bucket. That is, the histogram sum divided by the number of measurements taken.
https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L292-L307
"},{"location":"mozfun/hist/#parameters_4","title":"Parameters","text":"INPUTS
histogram ANY TYPE\n
OUTPUTS
STRUCT<sum INT64, VALUES ARRAY<STRUCT<value INT64>>>\n
Source | Edit
"},{"location":"mozfun/hist/#merge-udf","title":"merge (UDF)","text":"Merge an array of histograms into a single histogram.
INPUTS
histogram_list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/hist/#normalize-udf","title":"normalize (UDF)","text":"Normalize a histogram. Set sum to 1, and normalize to 1 the histogram bucket counts.
"},{"location":"mozfun/hist/#parameters_6","title":"Parameters","text":"INPUTS
histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n
OUTPUTS
STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n
Source | Edit
"},{"location":"mozfun/hist/#percentiles-udf","title":"percentiles (UDF)","text":"Given histogram and list of percentiles,calculate what those percentiles are for the histogram. If the histogram is empty, returns NULL.
"},{"location":"mozfun/hist/#parameters_7","title":"Parameters","text":"INPUTS
histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<percentile FLOAT64, value INT64>>\n
Source | Edit
"},{"location":"mozfun/hist/#string_to_json-udf","title":"string_to_json (UDF)","text":"Convert a histogram string (in JSON or compact format) to a full histogram JSON blob.
"},{"location":"mozfun/hist/#parameters_8","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#threshold_count-udf","title":"threshold_count (UDF)","text":"Return the number of recorded observations greater than threshold for the histogram. CAUTION: Does not count any buckets that have any values less than the threshold. For example, a bucket with range (1, 10) will not be counted for a threshold of 2. Use threshold that are not bucket boundaries with caution.
https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L213-L239
"},{"location":"mozfun/hist/#parameters_9","title":"Parameters","text":"INPUTS
histogram STRING, threshold INT64\n
Source | Edit
"},{"location":"mozfun/iap/","title":"iap","text":""},{"location":"mozfun/iap/#derive_apple_subscription_interval-udf","title":"derive_apple_subscription_interval (UDF)","text":"Take output purchase_date and expires_date from mozfun.iap.parse_apple_receipt and return the subscription interval to use for accounting. Values must be DATETIME in America/Los_Angeles to get correct results because of how timezone and daylight savings impact the time of day and the length of a month.
"},{"location":"mozfun/iap/#parameters","title":"Parameters","text":"INPUTS
start DATETIME, `end` DATETIME\n
OUTPUTS
STRUCT<`interval` STRING, interval_count INT64>\n
Source | Edit
"},{"location":"mozfun/iap/#parse_android_receipt-udf","title":"parse_android_receipt (UDF)","text":"Used to parse data
field from firestore export of fxa dataset iap_google_raw. The content is documented at https://developer.android.com/google/play/billing/subscriptions and https://developers.google.com/android-publisher/api-ref/rest/v3/purchases.subscriptions
INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/iap/#parse_apple_event-udf","title":"parse_apple_event (UDF)","text":"Used to parse data
field from firestore export of fxa dataset iap_app_store_purchases_raw. The content is documented at https://developer.apple.com/documentation/appstoreservernotifications/responsebodyv2decodedpayload and https://github.com/mozilla/fxa/blob/700ed771860da450add97d62f7e6faf2ead0c6ba/packages/fxa-shared/payments/iap/apple-app-store/subscription-purchase.ts#L115-L171
INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/iap/#parse_apple_receipt-udf","title":"parse_apple_receipt (UDF)","text":"Used to parse provider_receipt_json in mozilla vpn subscriptions where provider is \"APPLE\". The content is documented at https://developer.apple.com/documentation/appstorereceipts/responsebody
"},{"location":"mozfun/iap/#parameters_3","title":"Parameters","text":"INPUTS
provider_receipt_json STRING\n
OUTPUTS
STRUCT<environment STRING, latest_receipt BYTES, latest_receipt_info ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, pending_renewal_info ARRAY<STRUCT<auto_renew_product_id STRING, auto_renew_status INT64, expiration_intent INT64, is_in_billing_retry_period INT64, original_transaction_id STRING, product_id STRING>>, receipt STRUCT<adam_id INT64, app_item_id INT64, application_version STRING, bundle_id STRING, download_id INT64, in_app ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, original_application_version STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, receipt_creation_date STRING, receipt_creation_date_ms INT64, receipt_creation_date_pst STRING, receipt_type STRING, request_date STRING, request_date_ms INT64, request_date_pst STRING, version_external_identifier INT64>, status INT64>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/iap/#scrub_apple_receipt-udf","title":"scrub_apple_receipt (UDF)","text":"Take output from mozfun.iap.parse_apple_receipt and remove fields or reduce their granularity so that the returned value can be exposed to all employees via redash.
"},{"location":"mozfun/iap/#parameters_4","title":"Parameters","text":"INPUTS
apple_receipt ANY TYPE\n
OUTPUTS
STRUCT<environment STRING, active_period STRUCT<start_date DATE, end_date DATE, start_time TIMESTAMP, end_time TIMESTAMP, `interval` STRING, interval_count INT64>, trial_period STRUCT<start_time TIMESTAMP, end_time TIMESTAMP>>\n
Source | Edit
"},{"location":"mozfun/json/","title":"json","text":"Functions for parsing Mozilla-specific JSON data types.
"},{"location":"mozfun/json/#extract_int_map-udf","title":"extract_int_map (UDF)","text":"Returns an array of key/value structs from a string representing a JSON map. Both keys and values are cast to integers.
This is the format for the \"values\" field in the desktop telemetry histogram JSON representation.
"},{"location":"mozfun/json/#parameters","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/json/#from_map-udf","title":"from_map (UDF)","text":"Converts a standard \"map\" like datastructure array<struct<key, value>>
into a JSON value.
Convert the standard Array<Struct<key, value>>
style maps to JSON
values.
INPUTS
input JSON\n
OUTPUTS
json\n
Source | Edit
"},{"location":"mozfun/json/#from_nested_map-udf","title":"from_nested_map (UDF)","text":"Converts a nested JSON object with repeated key/value pairs into a nested JSON object.
Convert a JSON object like { \"metric\": [ {\"key\": \"extra\", \"value\": 2 } ] }
to a JSON
object like { \"metric\": { \"key\": 2 } }
.
This only works on JSON types.
"},{"location":"mozfun/json/#parameters_2","title":"Parameters","text":"OUTPUTS
json\n
Source | Edit
"},{"location":"mozfun/json/#js_extract_string_map-udf","title":"js_extract_string_map (UDF)","text":"Returns an array of key/value structs from a string representing a JSON map.
BigQuery Standard SQL JSON functions are insufficient to implement this function, so JS is being used and it may not perform well with large or numerous inputs.
Non-string non-null values are encoded as json.
"},{"location":"mozfun/json/#parameters_3","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/json/#mode_last-udf","title":"mode_last (UDF)","text":"Returns the most frequently occuring element in an array of json-compatible elements. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored.
"},{"location":"mozfun/json/#parameters_4","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/ltv/","title":"Ltv","text":""},{"location":"mozfun/ltv/#android_states_v1-udf","title":"android_states_v1 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"
"},{"location":"mozfun/ltv/#parameters","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_v2-udf","title":"android_states_v2 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"
"},{"location":"mozfun/ltv/#parameters_1","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_with_paid_v1-udf","title":"android_states_with_paid_v1 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_organic_2_1\" and \"0_dow1_paid_1_1\"
These states include whether a client was paid or organic.
"},{"location":"mozfun/ltv/#parameters_2","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_with_paid_v2-udf","title":"android_states_with_paid_v2 (UDF)","text":"Get the state of a user on a day, with paid/organic cohorts included. Compared to V1, these states have a \"dead\" state, determined by \"dead_time\". The model can use this state as a sink, where the client will never return if they are dead.
"},{"location":"mozfun/ltv/#parameters_3","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#desktop_states_v1-udf","title":"desktop_states_v1 (UDF)","text":"LTV states for Desktop. Results in strings like: \"0_1_1_1_1\" Where each component is 1. the age in days of the client 2. the day of week of first_seen_date 3. the day of week of submission_date 4. the activity level, possible values are 0-3, plus \"00\" for \"dead\" 5. whether the client is active on submission_date
"},{"location":"mozfun/ltv/#parameters_4","title":"Parameters","text":"INPUTS
days_since_first_seen INT64, days_since_active INT64, submission_date DATE, first_seen_date DATE, death_time INT64, pattern INT64, active INT64, max_days INT64, lookback INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#get_state_ios_v2-udf","title":"get_state_ios_v2 (UDF)","text":"LTV states for iOS.
"},{"location":"mozfun/ltv/#parameters_5","title":"Parameters","text":"INPUTS
days_since_first_seen INT64, days_since_seen INT64, submission_date DATE, death_time INT64, pattern INT64, active INT64, max_weeks INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/","title":"map","text":"Functions for working with arrays of key/value structs.
"},{"location":"mozfun/map/#extract_keyed_scalar_sum-udf","title":"extract_keyed_scalar_sum (UDF)","text":"Sums all values in a keyed scalar.
"},{"location":"mozfun/map/#extract-keyed-scalar-sum","title":"Extract Keyed Scalar Sum","text":"Takes a keyed scalar and returns a single number: the sum of all values it contains. The expected input type is ARRAY<STRUCT<key STRING, value INT64>>
The return type is INT64
.
The key
field will be ignored.
INPUTS
keyed_scalar ARRAY<STRUCT<key STRING, value INT64>>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/map/#from_lists-udf","title":"from_lists (UDF)","text":"Create a map from two arrays (like zipping)
"},{"location":"mozfun/map/#parameters_1","title":"Parameters","text":"INPUTS
keys ANY TYPE, `values` ANY TYPE\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/map/#get_key-udf","title":"get_key (UDF)","text":"Fetch the value associated with a given key from an array of key/value structs.
Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields.
"},{"location":"mozfun/map/#parameters_2","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
Source | Edit
"},{"location":"mozfun/map/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":"Fetch the value associated with a given key from an array of key/value structs.
Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields. This version matches NULL keys as well.
"},{"location":"mozfun/map/#parameters_3","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/#mode_last-udf","title":"mode_last (UDF)","text":"Combine entries from multiple maps, determine the value for each key using mozfun.stats.mode_last.
"},{"location":"mozfun/map/#parameters_4","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"mozfun/map/#set_key-udf","title":"set_key (UDF)","text":"Set a key to a value in a map. If you call map.get_key after setting, the value you set will be returned.
map.set_key
Set a key to a specific value in a map. We represent maps as Arrays of Key/Value structs: ARRAY<STRUCT<key ANY TYPE, value ANY TYPE>>
.
The type of the key and value you are setting must match the types in the map itself.
"},{"location":"mozfun/map/#parameters_5","title":"Parameters","text":"INPUTS
map ANY TYPE, new_key ANY TYPE, new_value ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/#sum-udf","title":"sum (UDF)","text":"Return the sum of values by key in an array of map entries. The expected schema for entries is ARRAY>, where the type for value must be supported by SUM, which allows numeric data types INT64, NUMERIC, and FLOAT64."},{"location":"mozfun/map/#parameters_6","title":"Parameters","text":"
INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"mozfun/marketing/","title":"Marketing","text":""},{"location":"mozfun/marketing/#parse_ad_group_name-udf","title":"parse_ad_group_name (UDF)","text":"Please provide a description for the routine
"},{"location":"mozfun/marketing/#parse-ad-group-name-udf","title":"Parse Ad Group Name UDF","text":"This function takes a ad group name and parses out known segments. These segments are things like country, language, or audience; multiple ad groups can share segments.
We use versioned ad group names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the ad group name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
See also: marketing.parse_campaign_name
, which does the same, but for campaign names.
INPUTS
ad_group_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/marketing/#parse_campaign_name-udf","title":"parse_campaign_name (UDF)","text":"Parse a campaign name. Extracts things like region, country_code, and language.
"},{"location":"mozfun/marketing/#parse-campaign-name-udf","title":"Parse Campaign Name UDF","text":"This function takes a campaign name and parses out known segments. These segments are things like country, language, or audience; multiple campaigns can share segments.
We use versioned campaign names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the campaign name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
"},{"location":"mozfun/marketing/#parameters_1","title":"Parameters","text":"INPUTS
campaign_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/marketing/#parse_creative_name-udf","title":"parse_creative_name (UDF)","text":"Parse segments from a creative name.
"},{"location":"mozfun/marketing/#parse-creative-name-udf","title":"Parse Creative Name UDF","text":"This function takes a creative name and parses out known segments. These segments are things like country, language, or audience; multiple creatives can share segments.
We use versioned creative names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the creative name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
See also: marketing.parse_campaign_name
, which does the same, but for campaign names.
INPUTS
creative_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/mobile_search/","title":"Mobile search","text":""},{"location":"mozfun/mobile_search/#normalize_app_name-udf","title":"normalize_app_name (UDF)","text":"Returns normalized_app_name and normalized_app_name_os (for mobile search tables only).
"},{"location":"mozfun/mobile_search/#normalized-app-and-os-name-for-mobile-search-related-tables","title":"Normalized app and os name for mobile search related tables","text":"Takes app name and os as input : Returns a struct of normalized_app_name and normalized_app_name_os based on discussion provided here
"},{"location":"mozfun/mobile_search/#parameters","title":"Parameters","text":"INPUTS
app_name STRING, os STRING\n
OUTPUTS
STRUCT<normalized_app_name STRING, normalized_app_name_os STRING>\n
Source | Edit
"},{"location":"mozfun/norm/","title":"norm","text":"Functions for normalizing data.
"},{"location":"mozfun/norm/#browser_version_info-udf","title":"browser_version_info (UDF)","text":"Adds metadata related to the browser version in a struct.
This is a temporary solution that allows browser version analysis. It should eventually be replaced with one or more browser version tables that serves as a source of truth for version releases.
"},{"location":"mozfun/norm/#parameters","title":"Parameters","text":"INPUTS
version_string STRING\n
OUTPUTS
STRUCT<version STRING, major_version NUMERIC, minor_version NUMERIC, patch_revision NUMERIC, is_major_release BOOLEAN>\n
Source | Edit
"},{"location":"mozfun/norm/#diff_months-udf","title":"diff_months (UDF)","text":"Determine the number of whole months after grace period between start and end. Month is dependent on timezone, so start and end must both be datetimes, or both be dates, in the correct timezone. Grace period can be used to account for billing delay, usually 1 day, and is counted after months. When inclusive is FALSE, start and end are not included in whole months. For example, diff_months(start => '2021-01-01', end => '2021-03-01', grace_period => INTERVAL 0 day, inclusive => FALSE) returns 1, because start plus two months plus grace period is not less than end. Changing inclusive to TRUE returns 2, because start plus two months plus grace period is less than or equal to end. diff_months(start => '2021-01-01', end => '2021-03-02 00:00:00.000001', grace_period => INTERVAL 1 DAY, inclusive => FALSE) returns 2, because start plus two months plus grace period is less than end.
"},{"location":"mozfun/norm/#parameters_1","title":"Parameters","text":"INPUTS
start DATETIME, `end` DATETIME, grace_period INTERVAL, inclusive BOOLEAN\n
Source | Edit
"},{"location":"mozfun/norm/#extract_version-udf","title":"extract_version (UDF)","text":"Extracts numeric version data from a version string like <major>.<minor>.<patch>
.
Note: Non-zero minor and patch versions will be floating point Numeric
.
Usage:
SELECT\n mozfun.norm.extract_version(version_string, 'major') as major_version,\n mozfun.norm.extract_version(version_string, 'minor') as minor_version,\n mozfun.norm.extract_version(version_string, 'patch') as patch_version\n
Example using \"96.05.01\"
:
SELECT\n mozfun.norm.extract_version('96.05.01', 'major') as major_version, -- 96\n mozfun.norm.extract_version('96.05.01', 'minor') as minor_version, -- 5\n mozfun.norm.extract_version('96.05.01', 'patch') as patch_version -- 1\n
"},{"location":"mozfun/norm/#parameters_2","title":"Parameters","text":"INPUTS
version_string STRING, extraction_level STRING\n
OUTPUTS
NUMERIC\n
Source | Edit
"},{"location":"mozfun/norm/#fenix_app_info-udf","title":"fenix_app_info (UDF)","text":"Returns canonical, human-understandable identification info for Fenix sources.
The Glean telemetry library for Android by design routes pings based on the Play Store appId value of the published application. As of August 2020, there have been 5 separate Play Store appId values associated with different builds of Fenix, each corresponding to different datasets in BigQuery, and the mapping of appId to logical app names (Firefox vs. Firefox Preview) and channel names (nightly, beta, or release) has changed over time; see the spreadsheet of naming history for Mozilla's mobile browsers.
This function is intended as the source of truth for how to map a specific ping in BigQuery to a logical app names and channel. It should be expected that the output of this function may evolve over time. If we rename a product or channel, we may choose to update the values here so that analyses consistently get the new name.
The first argument (app_id
) can be fairly fuzzy; it is tolerant of actual Google Play Store appId values like 'org.mozilla.firefox_beta' (mix of periods and underscores) as well as BigQuery dataset names with suffixes like 'org_mozilla_firefox_beta_stable'.
The second argument (app_build_id
) should be the value in client_info.app_build.
The function returns a STRUCT
that contains the logical app_name
and channel
as well as the Play Store app_id
in the canonical form which would appear in Play Store URLs.
Note that the naming of Fenix applications changed on 2020-07-03, so to get a continuous view of the pings associated with a logical app channel, you may need to union together tables from multiple BigQuery datasets. To see data for all Fenix channels together, it is necessary to union together tables from all 5 datasets. For basic usage information, consider using telemetry.fenix_clients_last_seen
which already handles the union. Otherwise, see the example below as a template for how construct a custom union.
Mapping of channels to datasets:
org_mozilla_firefox
org_mozilla_firefox_beta
(current) and org_mozilla_fenix
org_mozilla_fenix
(current), org_mozilla_fennec_aurora
, and org_mozilla_fenix_nightly
-- Example of a query over all Fenix builds advertised as \"Firefox Beta\"\nCREATE TEMP FUNCTION extract_fields(app_id STRING, m ANY TYPE) AS (\n (\n SELECT AS STRUCT\n m.submission_timestamp,\n m.metrics.string.geckoview_version,\n mozfun.norm.fenix_app_info(app_id, m.client_info.app_build).*\n )\n);\n\nWITH base AS (\n SELECT\n extract_fields('org_mozilla_firefox_beta', m).*\n FROM\n `mozdata.org_mozilla_firefox_beta.metrics` AS m\n UNION ALL\n SELECT\n extract_fields('org_mozilla_fenix', m).*\n FROM\n `mozdata.org_mozilla_fenix.metrics` AS m\n)\nSELECT\n DATE(submission_timestamp) AS submission_date,\n geckoview_version,\n COUNT(*)\nFROM\n base\nWHERE\n app_name = 'Fenix' -- excludes 'Firefox Preview'\n AND channel = 'beta'\n AND DATE(submission_timestamp) = '2020-08-01'\nGROUP BY\n submission_date,\n geckoview_version\n
"},{"location":"mozfun/norm/#parameters_3","title":"Parameters","text":"INPUTS
app_id STRING, app_build_id STRING\n
OUTPUTS
STRUCT<app_name STRING, channel STRING, app_id STRING>\n
Source | Edit
"},{"location":"mozfun/norm/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.
Fenix originally used an 8-digit app_build format
In short it is yDDDHHmm
:
The last date seen with an 8-digit build ID is 2020-08-10.
Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.
This function tolerates both formats.
After using this you may wish to DATETIME_TRUNC(result, DAY)
for grouping by build date.
INPUTS
app_build STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/norm/#firefox_android_package_name_to_channel-udf","title":"firefox_android_package_name_to_channel (UDF)","text":"Map Fenix package name to the channel name
"},{"location":"mozfun/norm/#parameters_5","title":"Parameters","text":"INPUTS
package_name STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/norm/#get_earliest_value-udf","title":"get_earliest_value (UDF)","text":"This UDF returns the earliest not-null value pair and datetime from a list of values and their corresponding timestamp.
The function returns the first value pair in the input array that is not null and has the earliest timestamp.
Because there may be more than one value on the same date (e.g. more than one value reported by different pings on the same date), the dates must be given as TIMESTAMPS and the values as STRING.
Usage:
SELECT\n mozfun.norm.get_earliest_value(ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>) AS <alias>\n
"},{"location":"mozfun/norm/#parameters_6","title":"Parameters","text":"INPUTS
value_set ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>\n
OUTPUTS
STRUCT<earliest_value STRING, earliest_value_source STRING, earliest_date DATETIME>\n
Source | Edit
"},{"location":"mozfun/norm/#get_windows_info-udf","title":"get_windows_info (UDF)","text":"Exract the name, the version name, the version number, and the build number corresponding to a Microsoft Windows operating system version string in the form of .. or ... for most release versions of Windows after 2007."},{"location":"mozfun/norm/#windows-names-versions-and-builds","title":"Windows Names, Versions, and Builds","text":""},{"location":"mozfun/norm/#summary","title":"Summary","text":"
This function is primarily designed to parse the field os_version
in table mozdata.default_browser_agent.default_browser
. Given a Microsoft Windows OS version string, the function returns the name of the operating system, the version name, the version number, and the build number corresponding to the operating system. As of November 2022, the parser can handle 99.89% of the os_version
values collected in table mozdata.default_browser_agent.default_browser
.
As of November 2022, the expected valid values of os_version
are either x.y.z
or w.x.y.z
where w
, x
, y
, and z
are integers.
As of November 2022, the return values for Windows 10 and Windows 11 are based on Windows 10 release information and Windows 11 release information. For 3-number version strings, the parser assumes the valid values of z
in x.y.z
are at most 5 digits in length. For 4-number version strings, the parser assumes the valid values of z
in w.x.y.z
are at most 6 digits in length. The function makes an educated effort to handle Windows Vista, Windows 7, Windows 8, and Windows 8.1 information, but does not guarantee the return values are absolutely accurate. The function assumes the presence of undocumented non-release versions of Windows 10 and Windows 11, and will return an estimated name, version number, and build number, but not the version name. The function does not handle other versions of Windows.
As of November 2022, the parser currently handles just over 99.89% of data in the field os_version
in table mozdata.default_browser_agent.default_browser
.
Note: Microsoft's convention for build numbers for Windows 10 and 11 includes two numbers, such as build number 22621.900
for version 22621
. The first number repeats the version number and the second number uniquely identifies the build within the version. To simplify data processing and data analysis, this function returns the second unique identifier as an integer instead of returning the full build number as a string.
SELECT\n `os_version`,\n mozfun.norm.get_windows_info(`os_version`) AS windows_info\nFROM `mozdata.default_browser_agent.default_browser`\nWHERE `submission_timestamp` > (CURRENT_TIMESTAMP() - INTERVAL 7 DAY) AND LEFT(document_id, 2) = '00'\nLIMIT 1000\n
"},{"location":"mozfun/norm/#mapping","title":"Mapping","text":"os_version windows_name windows_version_name windows_version_number windows_build_number 6.0.z Windows Vista 6.0 6.0 z 6.1.z Windows 7 7.0 6.1 z 6.2.z Windows 8 8.0 6.2 z 6.3.z Windows 8.1 8.1 6.3 z 10.0.10240.z Windows 10 1507 10240 z 10.0.10586.z Windows 10 1511 10586 z 10.0.14393.z Windows 10 1607 14393 z 10.0.15063.z Windows 10 1703 15063 z 10.0.16299.z Windows 10 1709 16299 z 10.0.17134.z Windows 10 1803 17134 z 10.0.17763.z Windows 10 1809 17763 z 10.0.18362.z Windows 10 1903 18362 z 10.0.18363.z Windows 10 1909 18363 z 10.0.19041.z Windows 10 2004 19041 z 10.0.19042.z Windows 10 20H2 19042 z 10.0.19043.z Windows 10 21H1 19043 z 10.0.19044.z Windows 10 21H2 19044 z 10.0.19045.z Windows 10 22H2 19045 z 10.0.y.z Windows 10 UNKNOWN y z 10.0.22000.z Windows 11 21H2 22000 z 10.0.22621.z Windows 11 22H2 22621 z 10.0.y.z Windows 11 UNKNOWN y z all other values (null) (null) (null) (null)"},{"location":"mozfun/norm/#parameters_7","title":"Parameters","text":"INPUTS
os_version STRING\n
OUTPUTS
STRUCT<name STRING, version_name STRING, version_number DECIMAL, build_number INT64>\n
Source | Edit
"},{"location":"mozfun/norm/#glean_baseline_client_info-udf","title":"glean_baseline_client_info (UDF)","text":"Accepts a glean client_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
"},{"location":"mozfun/norm/#parameters_8","title":"Parameters","text":"INPUTS
client_info ANY TYPE, metrics ANY TYPE\n
OUTPUTS
string\n
Source | Edit
"},{"location":"mozfun/norm/#glean_ping_info-udf","title":"glean_ping_info (UDF)","text":"Accepts a glean ping_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
"},{"location":"mozfun/norm/#parameters_9","title":"Parameters","text":"INPUTS
ping_info ANY TYPE\n
Source | Edit
"},{"location":"mozfun/norm/#metadata-udf","title":"metadata (UDF)","text":"Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.
"},{"location":"mozfun/norm/#parameters_10","title":"Parameters","text":"INPUTS
metadata ANY TYPE\n
OUTPUTS
`date`, CAST(NULL\n
Source | Edit
"},{"location":"mozfun/norm/#os-udf","title":"os (UDF)","text":"Normalize an operating system string to one of the three major desktop platforms, one of the two major mobile platforms, or \"Other\".
This is a reimplementation of logic used in the data pipeline to populate normalized_os
.
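A small sketch of applying it to raw values (the input strings below are only illustrative; the result is one of the platform names mentioned above, or \"Other\"):
SELECT\n  os,\n  mozfun.norm.os(os) AS normalized_os\nFROM\n  UNNEST(['Windows_NT', 'Darwin', 'Linux', 'iOS', 'Android', 'AmigaOS']) AS os\n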
INPUTS
os STRING\n
Source | Edit
"},{"location":"mozfun/norm/#product_info-udf","title":"product_info (UDF)","text":"Returns a normalized app_name
and canonical_app_name
for a product based on legacy_app_name
and normalized_os
values. Thus, this function serves as a bridge to get from legacy application identifiers to the consistent identifiers we are using for reporting in 2021.
As of 2021, most Mozilla products are sending telemetry via the Glean SDK, with Glean telemetry in active development for desktop Firefox as well. The probeinfo
API is the single source of truth for metadata about applications sending Glean telemetry; the values for app_name
and canonical_app_name
returned here correspond to the \"end-to-end identifier\" values documented in the v2 Glean app listings endpoint. For non-Glean telemetry, we provide values in the same style to provide continuity as we continue the migration to Glean.
For legacy telemetry pings like main
ping for desktop and core
ping for mobile products, the legacy_app_name
given as input to this function should come from the submission URI (stored as metadata.uri.app_name
in BigQuery ping tables). For Glean pings, we have invented product
values that can be passed in to this function as the legacy_app_name
parameter.
The returned app_name
values are intended to be readable and unambiguous, but short and easy to type. They are suitable for use as a key in derived tables. product
is a deprecated field that was similar in intent.
The returned canonical_app_name
is more verbose and is suited for displaying in visualizations. canonical_name
is a synonym that we provide for historical compatibility with previous versions of this function.
The returned struct also contains boolean contributes_to_2021_kpi
as the canonical reference for whether the given application is included in KPI reporting. Additional fields may be added for future years.
The normalized_os
value that's passed in should be the top-level normalized_os
value present in any ping table or you may want to wrap a raw value in mozfun.norm.os
like mozfun.norm.product_info(app_name, mozfun.norm.os(os))
.
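For illustration, a minimal sketch of this pattern against a legacy core ping table (the table name mozdata.telemetry.core and the raw os column are assumptions for the example):
SELECT\n  mozfun.norm.product_info(metadata.uri.app_name, mozfun.norm.os(os)).app_name AS app_name,\n  COUNT(*) AS n\nFROM\n  `mozdata.telemetry.core`\nWHERE\n  DATE(submission_timestamp) = '2021-01-01'\nGROUP BY\n  app_name\nORDER BY\n  n DESC\n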
This function also tolerates passing in a product
value as legacy_app_name
so that this function is still useful for derived tables which have thrown away the raw app_name
value from legacy pings.
The mappings are as follows:
| legacy_app_name | normalized_os | app_name | product | canonical_app_name | 2019 | 2020 | 2021 |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| Firefox | * | firefox_desktop | Firefox | Firefox for Desktop | true | true | true |\n| Fenix | Android | fenix | Fenix | Firefox for Android (Fenix) | true | true | true |\n| Fennec | Android | fennec | Fennec | Firefox for Android (Fennec) | true | true | true |\n| Firefox Preview | Android | firefox_preview | Firefox Preview | Firefox Preview for Android | true | true | true |\n| Fennec | iOS | firefox_ios | Firefox iOS | Firefox for iOS | true | true | true |\n| FirefoxForFireTV | Android | firefox_fire_tv | Firefox Fire TV | Firefox for Fire TV | false | false | false |\n| FirefoxConnect | Android | firefox_connect | Firefox Echo | Firefox for Echo Show | true | true | false |\n| Zerda | Android | firefox_lite | Firefox Lite | Firefox Lite | true | true | false |\n| Zerda_cn | Android | firefox_lite_cn | Firefox Lite CN | Firefox Lite (China) | false | false | false |\n| Focus | Android | focus_android | Focus Android | Firefox Focus for Android | true | true | true |\n| Focus | iOS | focus_ios | Focus iOS | Firefox Focus for iOS | true | true | true |\n| Klar | Android | klar_android | Klar Android | Firefox Klar for Android | false | false | false |\n| Klar | iOS | klar_ios | Klar iOS | Firefox Klar for iOS | false | false | false |\n| Lockbox | Android | lockwise_android | Lockwise Android | Lockwise for Android | true | true | false |\n| Lockbox | iOS | lockwise_ios | Lockwise iOS | Lockwise for iOS | true | true | false |\n| FirefoxReality* | Android | firefox_reality | Firefox Reality | Firefox Reality | false | false | false |\n"},{"location":"mozfun/norm/#parameters_12","title":"Parameters","text":"INPUTS
legacy_app_name STRING, normalized_os STRING\n
OUTPUTS
STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n
Source | Edit
"},{"location":"mozfun/norm/#result_type_to_product_name-udf","title":"result_type_to_product_name (UDF)","text":"Convert urlbar result types into product-friendly names
This UDF converts result types from urlbar events (engagement, impression, abandonment) into product-friendly names.
"},{"location":"mozfun/norm/#parameters_13","title":"Parameters","text":"INPUTS
res STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/norm/#truncate_version-udf","title":"truncate_version (UDF)","text":"Truncates a version string like <major>.<minor>.<patch>
to either the major or minor version. The return value is NUMERIC
, which means that you can sort the results without fear (e.g. 100 will be categorized as greater than 80, which isn't the case when sorting lexicographically).
For example, \"5.1.0\" would be translated to 5.1
if the parameter is \"minor\" or 5
if the parameter is \"major\".
If the version is only a major and/or minor version, then it will be left unchanged (for example \"10\" would stay as 10
when run through this function, no matter what the arguments).
This is useful for grouping Linux and Mac operating system versions inside aggregate datasets or queries where there may be many different patch releases in the field.
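For example, a quick sketch using the values from the description above (expected results shown as comments):
SELECT\n  mozfun.norm.truncate_version('5.1.0', 'major') AS major_version,  -- 5\n  mozfun.norm.truncate_version('5.1.0', 'minor') AS minor_version   -- 5.1\n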
"},{"location":"mozfun/norm/#parameters_14","title":"Parameters","text":"INPUTS
os_version STRING, truncation_level STRING\n
OUTPUTS
NUMERIC\n
Source | Edit
"},{"location":"mozfun/norm/#vpn_attribution-udf","title":"vpn_attribution (UDF)","text":"Accepts vpn attribution fields as input and returns a struct of normalized fields.
"},{"location":"mozfun/norm/#parameters_15","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRUCT<normalized_acquisition_channel STRING, normalized_campaign STRING, normalized_content STRING, normalized_medium STRING, normalized_source STRING, website_channel_group STRING>\n
Source | Edit
"},{"location":"mozfun/norm/#windows_version_info-udf","title":"windows_version_info (UDF)","text":"Given an unnormalized set off Windows identifiers, return a friendly version of the operating system name.
Requires os, os_version and windows_build_number.
E.g., for windows_build_number >= 22000 it returns Windows 11.
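A minimal sketch (the os and os_version literals are only illustrative placeholders; per the description, a build number of 22000 or higher maps to Windows 11):
SELECT\n  mozfun.norm.windows_version_info('Windows_NT', '10.0', 22621) AS friendly_os_name  -- expected: Windows 11\n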
"},{"location":"mozfun/norm/#parameters_16","title":"Parameters","text":"INPUTS
os STRING, os_version STRING, windows_build_number INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/serp_events/","title":"serp_events","text":"Functions for working with Glean SERP events.
"},{"location":"mozfun/serp_events/#ad_blocker_inferred-udf","title":"ad_blocker_inferred (UDF)","text":"Determine whether an ad blocker is inferred to be in use on a SERP. True if all loaded ads are blocked.
"},{"location":"mozfun/serp_events/#parameters","title":"Parameters","text":"INPUTS
num_loaded INT, num_blocked INT\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/serp_events/#is_ad_component-udf","title":"is_ad_component (UDF)","text":"Determine whether a SERP display component referenced in the serp events contains monetizable ads
"},{"location":"mozfun/serp_events/#parameters_1","title":"Parameters","text":"INPUTS
component STRING\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/stats/","title":"stats","text":"Statistics functions.
"},{"location":"mozfun/stats/#mode_last-udf","title":"mode_last (UDF)","text":"Returns the most frequently occuring element in an array.
In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored. See also: stats.mode_last_retain_nulls
, which retains nulls.
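For example (a small sketch; per the description, ties resolve to the value that appears latest in the array):
SELECT\n  mozfun.stats.mode_last(['a', 'b', 'b']) AS most_frequent,  -- 'b'\n  mozfun.stats.mode_last(['a', 'b']) AS tie_goes_to_last     -- 'b'\n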
INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/stats/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":"Returns the most frequently occuring element in an array. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are retained. See also: `stats.mode_last, which ignores nulls.
"},{"location":"mozfun/stats/#parameters_1","title":"Parameters","text":"INPUTS
list ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/utils/","title":"Utils","text":""},{"location":"mozfun/utils/#diff_query_schemas-stored-procedure","title":"diff_query_schemas (Stored Procedure)","text":"Diff the schemas of two queries. Especially useful when the BigQuery error is truncated, and the schemas of e.g. a UNION don't match.
Use it like:
DECLARE res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>;\nCALL mozfun.utils.diff_query_schemas(\"\"\"SELECT * FROM a\"\"\", \"\"\"SELECT * FROM b\"\"\", res);\n-- See entire schema entries, if you need context\nSELECT res;\n-- See just the elements that differ\nSELECT * FROM UNNEST(res) WHERE differs;\n
You'll be able to view the results of \"res\" to compare the schemas of the two queries, and hopefully find what doesn't match.
"},{"location":"mozfun/utils/#parameters","title":"Parameters","text":"INPUTS
query_a STRING, query_b STRING\n
OUTPUTS
res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>\n
Source | Edit
"},{"location":"mozfun/utils/#extract_utm_from_url-udf","title":"extract_utm_from_url (UDF)","text":"Extract UTM parameters from URL. Returns a STRUCT UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
This UDF extracts UTM parameters from a URL string.
UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
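For illustration, a minimal sketch (the URL is just an example value):
SELECT\n  mozfun.utils.extract_utm_from_url(\n    'https://www.mozilla.org/firefox/?utm_source=newsletter&utm_medium=email&utm_campaign=fx-launch'\n  ) AS utm\n
The resulting STRUCT contains the utm_source, utm_medium, utm_campaign, utm_content and utm_term fields.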
"},{"location":"mozfun/utils/#parameters_1","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRUCT<utm_source STRING, utm_medium STRING, utm_campaign STRING, utm_content STRING, utm_term STRING>\n
Source | Edit
"},{"location":"mozfun/utils/#get_url_path-udf","title":"get_url_path (UDF)","text":"Extract the Path from a URL
This UDF extracts path from a URL string.
The path is everything after the host and before parameters. This function returns \"/\" if there is no path.
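For example (expected results, following the description above, are shown as comments):
SELECT\n  mozfun.utils.get_url_path('https://www.mozilla.org/firefox/download/?v=1') AS path,  -- '/firefox/download/'\n  mozfun.utils.get_url_path('https://www.mozilla.org') AS root_path                    -- '/'\n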
"},{"location":"mozfun/utils/#parameters_2","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/","title":"vpn","text":"Functions for processing VPN data.
"},{"location":"mozfun/vpn/#acquisition_channel-udf","title":"acquisition_channel (UDF)","text":"Assign an acquisition channel based on utm parameters
"},{"location":"mozfun/vpn/#parameters","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/#channel_group-udf","title":"channel_group (UDF)","text":"Assign a channel group based on utm parameters
"},{"location":"mozfun/vpn/#parameters_1","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/#normalize_utm_parameters-udf","title":"normalize_utm_parameters (UDF)","text":"Normalize utm parameters to use the same NULL placeholders as Google Analytics
"},{"location":"mozfun/vpn/#parameters_2","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRUCT<utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING>\n
Source | Edit
"},{"location":"mozfun/vpn/#pricing_plan-udf","title":"pricing_plan (UDF)","text":"Combine the pricing and interval for a subscription plan into a single field
"},{"location":"mozfun/vpn/#parameters_3","title":"Parameters","text":"INPUTS
provider STRING, amount INTEGER, currency STRING, `interval` STRING, interval_count INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"reference/airflow_tags/","title":"Airflow Tags","text":""},{"location":"reference/airflow_tags/#why","title":"Why","text":"Airflow tags enable DAGs to be filtered in the web ui view to reduce the number of DAGs shown to just those that you are interested in.
Additionally, tags are meant to provide a little more information, such as a DAG's impact, to make it easier to understand the DAG and the impact of failures when doing Airflow triage.
More information and the discussions can be found in the original Airflow Tags Proposal (within the data org proposals/
folder).
We borrow the tiering system used by our integration and testing sheriffs. This is to maintain a level of consistency across different systems to ensure common language and understanding across teams. Valid tier tags include:
This tag is meant to provide guidance to a triage engineer on how to respond to a specific DAG failure when the job owner does not want the standard process to be followed.
The behaviour of bqetl
can be configured via the bqetl_project.yaml
file. This file, for example, specifies the queries that should be skipped during dryrun, views that should not be published and contains various other configurations.
The general structure of bqetl_project.yaml
is as follows:
dry_run:\n function: https://us-central1-moz-fx-data-shared-prod.cloudfunctions.net/bigquery-etl-dryrun\n test_project: bigquery-etl-integration-test\n skip:\n - sql/moz-fx-data-shared-prod/account_ecosystem_derived/desktop_clients_daily_v1/query.sql\n - sql/**/apple_ads_external*/**/query.sql\n # - ...\n\nviews:\n skip_validation:\n - sql/moz-fx-data-test-project/test/simple_view/view.sql\n - sql/moz-fx-data-shared-prod/mlhackweek_search/events/view.sql\n - sql/moz-fx-data-shared-prod/**/client_deduplication/view.sql\n # - ...\n skip_publishing:\n - activity_stream/tile_id_types/view.sql\n - pocket/pocket_reach_mau/view.sql\n # - ...\n non_user_facing_suffixes:\n - _derived\n - _external\n # - ...\n\nschema:\n skip_update:\n - sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/schema.yaml\n # - ...\n skip_prefixes:\n - pioneer\n - rally\n\nroutines:\n skip_publishing:\n - sql/moz-fx-data-shared-prod/udf/main_summary_scalars/udf.sql\n\nformatting:\n skip:\n - bigquery_etl/glam/templates/*.sql\n - sql/moz-fx-data-shared-prod/telemetry/fenix_events_v1/view.sql\n - stored_procedures/safe_crc32_uuid.sql\n # - ...\n
"},{"location":"reference/configuration/#accessing-configurations","title":"Accessing configurations","text":"ConfigLoader
can be used in the bigquery_etl tooling codebase to access configuration parameters. bqetl_project.yaml
is automatically loaded in ConfigLoader
and parameters can be accessed via a get()
method:
from bigquery_etl.config import ConfigLoader\n\nskipped_formatting = cfg.get(\"formatting\", \"skip\", fallback=[])\ndry_run_function = cfg.get(\"dry_run\", \"function\", fallback=None)\nschema_config_dict = cfg.get(\"schema\")\n
The ConfigLoader.get()
method allows multiple string parameters to reference a configuration value that is stored in a nested structure. A fallback
value can be optionally provided in case the configuration parameter is not set.
New configuration parameters can simply be added to bqetl_project.yaml
. ConfigLoader.get()
allows for these new parameters simply to be referenced without needing to be changed or updated.
Instructions on how to add data checks can be found in the Adding data checks section below.
"},{"location":"reference/data_checks/#background","title":"Background","text":"To create more confidence and trust in our data is crucial to provide some form of data checks. These checks should uncover problems as soon as possible, ideally as part of the data process creating the data. This includes checking that the data produced follows certain assumptions determined by the dataset owner. These assumptions need to be easy to define, but at the same time flexible enough to encode more complex business logic. For example, checks for null columns, for range/size properties, duplicates, table grain etc.
"},{"location":"reference/data_checks/#bqetl-data-checks-to-the-rescue","title":"bqetl Data Checks to the Rescue","text":"bqetl data checks aim to provide this ability by providing a simple interface for specifying our \"assumptions\" about the data the query should produce and checking them against the actual result.
This easy interface is achieved by providing a number of jinja templates providing \"out-of-the-box\" logic for performing a number of common checks without having to rewrite the logic. For example, checking if any nulls are present in a specific column. These templates can be found here and are available as jinja macros inside the checks.sql
files. This allows to \"configure\" the logic by passing some details relevant to our specific dataset. Check templates will get rendered as raw SQL expressions. Take a look at the examples below for practical examples.
It is also possible to write checks using raw SQL by using assertions. This is, for example, useful when writing checks for custom business logic.
"},{"location":"reference/data_checks/#two-categories-of-checks","title":"Two categories of checks","text":"Each check needs to be categorised with a marker, currently following markers are available:
#fail
indicates that the ETL pipeline should stop if this check fails (circuit-breaker pattern) and a notification is sent out. This marker should be used for checks that indicate a serious data issue.#warn
indicates that the ETL pipeline should continue even if this check fails. These type of checks can be used to indicate potential issues that might require more manual investigation.Checks can be marked by including one of the markers on the line preceeding the check definition, see Example checks.sql section for an example.
"},{"location":"reference/data_checks/#adding-data-checks","title":"Adding Data Checks","text":""},{"location":"reference/data_checks/#create-checkssql","title":"Create checks.sql","text":"Inside the query directory, which usually contains query.sql
or query.py
, metadata.yaml
and schema.yaml
, create a new file called checks.sql
(unless already exists).
Please make sure each check you add contains a marker (see: the Two categories of checks section above).
Once checks have been added, we need to regenerate the DAG
responsible for scheduling the query.
If checks.sql
already exists for the query, you can always add additional checks to the file by appending it to the list of already defined checks.
When adding additional checks there should be no need to have to regenerate the DAG responsible for scheduling the query as all checks are executed using a single Airflow task.
"},{"location":"reference/data_checks/#removing-checkssql","title":"Removing checks.sql","text":"All checks can be removed by deleting the checks.sql
file and regenerating the DAG responsible for scheduling the query.
Alternatively, specific checks can be removed by deleting them from the checks.sql
file.
Checks can either be written as raw SQL, or by referencing existing Jinja macros defined in tests/checks
which may take different parameters used to generate the SQL check expression.
Example of what a checks.sql
may look like:
-- raw SQL checks\n#fail\nASSERT (\n SELECT\n COUNTIF(ISNULL(country)) / COUNT(*)\n FROM telemetry.table_v1\n WHERE submission_date = @submission_date\n ) > 0.2\n) AS \"More than 20% of clients have country set to NULL\";\n\n-- macro checks\n#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n\n#warn\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n\n#fail\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n\n#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#data-checks-available-with-examples","title":"Data Checks Available with Examples","text":""},{"location":"reference/data_checks/#accepted_values-source","title":"accepted_values (source)","text":"Usage:
Arguments:\n\ncolumn: str - name of the column to check\nvalues: List[str] - list of accepted values\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ accepted_values(\"column_1\", [\"value_1\", \"value_2\"],\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#in_range-source","title":"in_range (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which we want to check the values of.\nmin: Optional[int] - Minimum value we should observe in the specified columns.\nmax: Optional[int] - Maximum value we should observe in the specified columns.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#is_unique-source","title":"is_unique (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which should produce a unique record.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n
"},{"location":"reference/data_checks/#min_row_countsource","title":"min_row_count(source)","text":"Usage:
Arguments:\n\nthreshold: Optional[int] - What is the minimum number of rows we expect (default: 1)\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#fail\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#not_null-source","title":"not_null (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which should not contain a null value.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n
Please keep in mind the below checks can be combined and specified in the same checks.sql
file. For example:
#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n #fail\n {{ min_row_count(1, \"submission_date = @submission_date\") }}\n #fail\n {{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n #warn\n {{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#row_count_within_past_partitions_avgsource","title":"row_count_within_past_partitions_avg(source)","text":"Compares the row count of the current partition to the average of number_of_days
past partitions and checks if the row count is within the average +- threshold_percentage
%
Usage:
Arguments:\n\nnumber_of_days: int - Number of days we are comparing the row count to\nthreshold_percentage: int - How many percent above or below the average row count is ok.\npartition_field: Optional[str] - What column is the partition_field (default = \"submission_date\")\n
Example:
#fail\n{{ row_count_within_past_partitions_avg(7, 5, \"submission_date\") }}\n
"},{"location":"reference/data_checks/#value_lengthsource","title":"value_length(source)","text":"Checks that the column has values of specific character length.
Usage:
Arguments:\n\ncolumn: str - Column which will be checked against the `expected_length`.\nexpected_length: int - Describes the expected character length of the value inside the specified columns.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\n
Example:
#warn\n{{ value_length(column=\"country\", expected_length=2, where=\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#matches_patternsource","title":"matches_pattern(source)","text":"Checks that the column values adhere to a pattern based on a regex expression.
Usage:
Arguments:\n\ncolumn: str - Column which values will be checked against the regex.\npattern: str - Regex pattern specifying the expected shape / pattern of the values inside the column.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\nthreshold_fail_percentage: Optional[int] - Percentage of how many rows can fail the check before causing it to fail.\nmessage: Optional[str]: Custom error message.\n
Example:
#warn\n{{ matches_pattern(column=\"country\", pattern=\"^[A-Z]{2}$\", where=\"submission_date = @submission_date\", threshold_fail_percentage=10, message=\"Oops\") }}\n
"},{"location":"reference/data_checks/#running-checks-locally-commands","title":"Running checks locally / Commands","text":"To list all available commands in the bqetl data checks CLI:
$ ./bqetl check\n\nUsage: bqetl check [OPTIONS] COMMAND [ARGS]...\n\n Commands for managing and running bqetl data checks.\n\n \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\n IN ACTIVE DEVELOPMENT\n\n The current progress can be found under:\n\n https://mozilla-hub.atlassian.net/browse/DENG-919\n\n \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\nOptions:\n --help Show this message and exit.\n\nCommands:\n render Renders data check query using parameters provided (OPTIONAL).\n run Runs data checks defined for the dataset (checks.sql).\n
To see see how to use a specific command use:
$ ./bqetl check [command] --help\n
render
$ ./bqetl check render [OPTIONS] DATASET [ARGS]\n\nRenders data check query using parameters provided (OPTIONAL). The result\nis what would be used to run a check to ensure that the specified dataset\nadheres to the assumptions defined in the corresponding checks.sql file\n\nOptions:\n --project-id, --project_id TEXT\n GCP project ID\n --sql_dir, --sql-dir DIRECTORY Path to directory which contains queries.\n --help Show this message and exit.\n
"},{"location":"reference/data_checks/#example","title":"Example","text":"./bqetl check render --project_id=moz-fx-data-marketing-prod ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01\n
run
$ ./bqetl check run [OPTIONS] DATASET\n\nRuns data checks defined for the dataset (checks.sql).\n\nChecks can be validated using the `--dry_run` flag without executing them:\n\nOptions:\n --project-id, --project_id TEXT\n GCP project ID\n --sql_dir, --sql-dir DIRECTORY Path to directory which contains queries.\n --dry_run, --dry-run To dry run the query to make sure it is\n valid\n --marker TEXT Marker to filter checks.\n --help Show this message and exit.\n
"},{"location":"reference/data_checks/#examples","title":"Examples","text":"# to run checks for a specific dataset\n$ ./bqetl check run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail --marker=warn\n\n# to only dry_run the checks\n$ ./bqetl check run --dry_run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail\n
"},{"location":"reference/incremental/","title":"Incremental Queries","text":""},{"location":"reference/incremental/#benefits","title":"Benefits","text":"WRITE_TRUNCATE
mode or bq query --replace
to replace partitions atomically to prevent duplicate data@submission_date
query parametersubmission_date
matching the query parametersql/moz-fx-data-shared-prod/clients_last_seen_v1.sql
can be run serially on any 28 day period and the last day will be the same whether or not the partition preceding the first day was missing because values are only impacted by 27 preceding daysFor background, see Accessing Public Data on docs.telemetry.mozilla.org
.
public_bigquery
flag must be set in metadata.yaml
mozilla-public-data
GCP project which is accessible by everyone, also external userspublic_json
flag must be set in metadata.yaml
000000000000.json
, 000000000001.json
, ...)incremental_export
controls how data should be exported as JSON:false
: all data of the source table gets exported to a single locationtrue
: only data that matches the submission_date
parameter is exported as JSON to a separate directory for this datemetadata.json
gets published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.jsonlast_updated
, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updatedsql/<project>/<dataset>/<table>_<version>/query.sql
e.g.<project>
defines both where the destination table resides and in which project the query job runs sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql
sql/<project>/<dataset>/<table>_<version>/query.sql
as abovesql/<project>/query_type.sql.py
e.g. sql/moz-fx-data-shared-prod/clients_daily.sql.py
--source telemetry_core_parquet_v3
to generate sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql
and using --source main_summary_v4
to generate sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql
-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet\n
moz-fx-data-shared-prod
the project prefix should be omitted to simplify testing. (Other projects do need the project prefix)_
prefix in generated column names not meant for output_bits
suffix for any integer column that represents a bit patternDATETIME
type, due to incompatibility with spark-bigquery-connector*_stable
tables instead of including custom deduplicationdocument_id
by submission_timestamp
where filtering duplicates is necessarymozdata
project which are duplicates of views in another project (commonly moz-fx-data-shared-prod
). Refer to the original view instead.{{ metrics.calculate() }}
: SELECT\n *\nFROM\n {{ metrics.calculate(\n metrics=['days_of_use', 'active_hours'],\n platform='firefox_desktop',\n group_by={'sample_id': 'sample_id', 'channel': 'application.channel'},\n where='submission_date = \"2023-01-01\"'\n ) }}\n\n-- this translates to\nSELECT\n *\nFROM\n (\n WITH clients_daily AS (\n SELECT\n client_id AS client_id,\n submission_date AS submission_date,\n COALESCE(SUM(active_hours_sum), 0) AS active_hours,\n COUNT(submission_date) AS days_of_use,\n FROM\n mozdata.telemetry.clients_daily\n GROUP BY\n client_id,\n submission_date\n )\n SELECT\n clients_daily.client_id,\n clients_daily.submission_date,\n active_hours,\n days_of_use,\n FROM\n clients_daily\n )\n
metrics
: unique reference(s) to metric definition, all metric definitions are aggregations (e.g. SUM, AVG, ...)platform
: platform to compute metrics for (e.g. firefox_desktop
, firefox_ios
, fenix
, ...)group_by
: fields used in the GROUP BY statement; this is a dictionary where the key represents the alias, the value is the field path; GROUP BY
always includes the configured client_id
and submission_date
fieldswhere
: SQL filter clausegroup_by_client_id
: Whether the field configured as client_id
(defined as part of the data source specification in metric-hub) should be part of the GROUP BY
. True
by defaultgroup_by_submission_date
: Whether the field configured as submission_date
(defined as part of the data source specification in metric-hub) should be part of the GROUP BY
. True
by default{{ metrics.data_source() }}
: SELECT\n *\nFROM\n {{ metrics.data_source(\n data_source='main',\n platform='firefox_desktop',\n where='submission_date = \"2023-01-01\"'\n ) }}\n\n-- this translates to\nSELECT\n *\nFROM\n (\n SELECT *\n FROM `mozdata.telemetry.main`\n WHERE submission_date = \"2023-01-01\"\n )\n
./bqetl query render path/to/query.sql
generated-sql
branch has rendered queries/views/UDFs./bqetl query run
does support running Jinja queriesmetadata.yaml
file should be created in the same directoryfriendly_name: SSL Ratios\ndescription: >\n Percentages of page loads Firefox users have performed that were\n conducted over SSL broken down by country.\nowners:\n - example@mozilla.com\nlabels:\n application: firefox\n incremental: true # incremental queries add data to existing tables\n schedule: daily # scheduled in Airflow to run daily\n public_json: true\n public_bigquery: true\n review_bugs:\n - 1414839 # Bugzilla bug ID of data review\n incremental_export: false # non-incremental JSON export writes all data to a single location\n
sql/<project>/<dataset>/<table>/view.sql
e.g. sql/moz-fx-data-shared-prod/telemetry/core/view.sql
fx-data-dev@mozilla.org
moz-fx-data-shared-prod
project; the scripts/publish_views
tooling can handle parsing the definitions to publish to other projects such as derived-datasets
mozdata
project which are duplicates of views in another project (commonly moz-fx-data-shared-prod
). Refer to the original view instead.BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
.sql
e.g. mode_last.sql
udf/
directory and JS UDFs must be defined in the udf_js
directoryudf_legacy/
directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto.CREATE OR REPLACE FUNCTION
syntax<dir_name>.
so, for example, all functions in udf/*.sql
are part of the udf
datasetCREATE OR REPLACE FUNCTION <dir_name>.<file_name>
scripts/publish_persistent_udfs
for publishing these UDFs to BigQuerySQL
over js
for performanceNULL
for new data and EXCEPT
to exclude from views until droppedSELECT\n job_type,\n state,\n submission_date,\n destination_dataset_id,\n destination_table_id,\n total_terabytes_billed,\n total_slot_ms,\n error_location,\n error_reason,\n error_message\nFROM\n moz-fx-data-shared-prod.monitoring.bigquery_usage\nWHERE\n submission_date <= CURRENT_DATE()\n AND destination_dataset_id LIKE \"%backfills_staging_derived%\"\n AND destination_table_id LIKE \"%{{ your table name }}%\"\nORDER BY\n submission_date DESC\n
dags.yaml
dags.yaml
, e.g., by adding the following: bqetl_ssl_ratios: # name of the DAG; must start with bqetl_\n schedule_interval: 0 2 * * * # query schedule\n description: The DAG schedules SSL ratios queries.\n default_args:\n owner: example@mozilla.com\n start_date: \"2020-04-05\" # YYYY-MM-DD\n email: [\"example@mozilla.com\"]\n retries: 2 # number of retries if the query execution fails\n retry_delay: 30m\n
bqetl_
as prefix.schedule_interval
is either defined as a CRON expression or alternatively as one of the following CRON presets: once
, hourly
, daily
, weekly
, monthly
start_date
defines the first date for which the query should be executedstart_date
is set in the past, backfilling can be done via the Airflow web interfaceemail
lists email addresses alerts should be sent to in case of failures when running the querybqetl
CLI by running bqetl dag create bqetl_ssl_ratios --schedule_interval='0 2 * * *' --owner=\"example@mozilla.com\" --start_date=\"2020-04-05\" --description=\"This DAG generates SSL ratios.\"
metadata.yaml
file that includes a scheduling
section, for example: friendly_name: SSL ratios\n# ... more metadata, see Query Metadata section above\nscheduling:\n dag_name: bqetl_ssl_ratios\n
depends_on_past
keeps query from getting executed if the previous schedule for the query hasn't succeededdate_partition_parameter
- by default set to submission_date
; can be set to null
if query doesn't write to a partitioned tableparameters
specifies a list of query parameters, e.g. [\"n_clients:INT64:500\"]
arguments
- a list of arguments passed when running the query, for example: [\"--append_table\"]
referenced_tables
- manually curated list of tables a Python or BigQuery script depends on; for query.sql
files dependencies will get determined automatically and should only be overwritten manually if really necessarymultipart
indicates whether a query is split over multiple files part1.sql
, part2.sql
, ...depends_on
defines external dependencies in telemetry-airflow that are not detected automatically: depends_on:\n - task_id: external_task\n dag_name: external_dag\n execution_delta: 1h\n
task_id
: name of task query depends ondag_name
: name of the DAG the external task is part ofexecution_delta
: time difference between the schedule_intervals
of the external DAG and the DAG the query is part ofdepends_on_tables_existing
defines tables that the ETL will await the existence of via an Airflow sensor before running: depends_on_tables_existing:\n - task_id: wait_for_foo_bar_baz\n table_id: 'foo.bar.baz_{{ ds_nodash }}'\n poke_interval: 30m\n timeout: 12h\n retries: 1\n retry_delay: 10m\n
task_id
: ID to use for the generated Airflow sensor task.table_id
: Fully qualified ID of the table to wait for, including the project and dataset.poke_interval
: Time that the sensor should wait in between each check, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default poke interval is 5 minutes).timeout
: Time allowed before the sensor times out and fails, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default timeout is 8 hours).retries
: The number of retries that should be performed if the sensor times out or otherwise fails. This parameter is optional (the default depends on how the DAG is configured).retry_delay
: Time delay between retries, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default depends on how the DAG is configured).depends_on_table_partitions_existing
defines table partitions that the ETL will await the existence of via an Airflow sensor before running: depends_on_table_partitions_existing:\n - task_id: wait_for_foo_bar_baz\n table_id: foo.bar.baz\n partition_id: '{{ ds_nodash }}'\n poke_interval: 30m\n timeout: 12h\n retries: 1\n retry_delay: 10m\n
task_id
: ID to use for the generated Airflow sensor task.table_id
: Fully qualified ID of the table to check, including the project and dataset. Note that the service account airflow-access@moz-fx-data-shared-prod.iam.gserviceaccount.com
will need to have the BigQuery Job User role on the project and read access to the dataset.partition_id
: ID of the partition to wait for.poke_interval
: Time that the sensor should wait in between each check, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default poke interval is 5 minutes).timeout
: Time allowed before the sensor times out and fails, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default timeout is 8 hours).retries
: The number of retries that should be performed if the sensor times out or otherwise fails. This parameter is optional (the default depends on how the DAG is configured).retry_delay
: Time delay between retries, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default depends on how the DAG is configured).trigger_rule
: The rule that determines when the airflow task that runs this query should run. The default is all_success
(\"trigger this task when all directly upstream tasks have succeeded\"); other rules can allow a task to run even if not all preceding tasks have succeeded. See the Airflow docs for the list of trigger rule options.destination_table
: The table to write to. If unspecified, defaults to the query destination; if None, no destination table is used (the query is simply run as-is). Note that if no destination table is specified, you will need to specify the submission_date
parameter manuallyexternal_downstream_tasks
defines external downstream dependencies for which ExternalTaskMarker
s will be added to the generated DAG. These task markers ensure that when the task is cleared for triggering a rerun, all downstream tasks are automatically cleared as well. external_downstream_tasks:\n - task_id: external_downstream_task\n dag_name: external_dag\n execution_delta: 1h\n
bqetl
CLI: ./bqetl query schedule path/to/query_v1 --dag bqetl_ssl_ratios
./bqetl dag generate
dags/
directory./bqetl dag generate bqetl_ssl_ratios
main
. CI automatically generates DAGs and writes them to the telemetry-airflow-dags repo from where Airflow will pick them updepends_on_fivetran:\n - task_id: fivetran_import_1\n - task_id: another_fivetran_import\n
<task_id>_connector_id
in the Airflow admin interface for each import taskBefore changes, such as adding new fields to existing datasets or adding new datasets, can be deployed to production, bigquery-etl's CI (continuous integration) deploys these changes to a stage environment and uses these stage artifacts to run its various checks.
Currently, the bigquery-etl-integration-test
project serves as the stage environment. CI does have read and write access, but does at no point publish actual data to this project. Only UDFs, table schemas and views are published. The project itself does not have access to any production project, like mozdata
, so stage artifacts cannot reference any other artifacts that live in production.
Deploying artifacts to stage follows the following steps: 1. Once a new pull-request gets created in bigquery-etl, CI will pull in the generated-sql
branch to determine all files that show any changes compared to what is deployed in production (it is assumed that the generated-sql
branch reflects the artifacts currently deployed in production). All of these changed artifacts (UDFs, tables and views) will be deployed to the stage environment. * This CI step runs after the generate-sql
CI step to ensure that checks will also be executed on generated queries and to ensure schema.yaml
files have been automatically created for queries. 2. The bqetl
CLI has a command to run stage deploys, which is called in the CI: ./bqetl stage deploy --dataset-suffix=$CIRCLE_SHA1 $FILE_PATHS
* --dataset-suffix
will result in the artifacts being deployed to datasets that are suffixed by the current commit hash. This is to prevent any conflicts when deploying changes for the same artifacts in parallel and helps with debugging deployed artifacts. 3. For every artifacts that gets deployed to stage all dependencies need to be determined and deployed to the stage environment as well since the stage environment doesn't have access to production. Before these artifacts get actually deployed, they need to be determined first by traversing artifact definitions. * Determining dependencies is only relevant for UDFs and views. For queries, available schema.yaml
files will simply be deployed. * For UDFs, if a UDF does call another UDF then this UDF needs to be deployed to stage as well. * For views, if a view references another view, table or UDF then each of these referenced artifacts needs to be available on stage as well, otherwise the view cannot even be deployed to stage. * If artifacts are referenced that are not defined as part of the bigquery-etl repo (like stable or live tables) then their schema will get determined and a placeholder query.sql
file will be created * Also dependencies of dependencies need to be deployed, and so on 4. Once all artifacts that need to be deployed have been determined, all references to these artifacts in existing SQL files need to be updated. These references will need to point to the stage project and the temporary datasets that artifacts will be published to. * Artifacts that get deployed are determined from the files that got changed and any artifacts that are referenced in the SQL definitions of these files, as well as their references and so on. 5. To run the deploy, all artifacts will be copied to sql/bigquery-etl-integration-test
into their corresponding temporary datasets. * Also if any existing SQL tests the are related to changed artifacts will have their referenced artifacts updated and will get copied to a bigquery-etl-integration-test
folder * The deploy is executed in the order of: UDFs, tables, views * UDFs and views get deployed in a way that ensures that the right order of deployments (e.g. dependencies need to be deployed before the views referencing them) 6. Once the deploy has been completed, the CI will use these staged artifacts to run its tests 7. After checks have succeeded, the deployed artifacts will be removed from stage * By default the table expiration is set to 1 hour * This step will also automatically remove any tables and datasets that got previously deployed, are older than an hour but haven't been removed (for example due to some CI check failing)
After CI checks have passed and the pull-request has been approved, changes can be merged to main
. Once a new version of bigquery-etl has been published the changes can be deployed to production through the bqetl_artifact_deployment
Airflow DAG. For more information on artifact deployments to production see: https://docs.telemetry.mozilla.org/concepts/pipeline/artifact_deployment.html
Local changes can be deployed to stage using the ./bqetl stage deploy
command:
./bqetl stage deploy \\\n --dataset-suffix=test \\\n --copy-sql-to-tmp-dir \\\n sql/moz-fx-data-shared-prod/firefox_ios/new_profile_activation/view.sql \\\n sql/mozfun/map/sum/udf.sql\n
Files (for example ones with changes) that should be deployed to stage need to be specified. The stage deploy
accepts the following parameters: * --dataset-suffix
is an optional suffix that will be added to the datasets deployed to stage * --copy-sql-to-tmp-dir
copies SQL stored in sql/
to a temporary folder. Reference updates and any other modifications required to run the stage deploy will be performed in this temporary directory. This is an optional parameter. If not specified, changes get applied to the files directly and can be reverted, for example, by running git checkout -- sql/
* (optional) --remove-updated-artifacts
removes artifact files that have been deployed from the \"prod\" folders. This ensures that tests don't run on outdated or undeployed artifacts.
Deployed stage artifacts can be deleted from bigquery-etl-integration-test
by running:
./bqetl stage clean --delete-expired --dataset-suffix=test\n
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index 9e000f5347fd8caa856fb8e292b3e03bbb414032..7607e6e63288faf0a885e256820b6f15379cc7ae 100644
GIT binary patch
delta 13
Ucmb=gXP58h;9#h*o5)@P02tc?h5!Hn
delta 13
Ucmb=gXP58h;Al{@oycAR02)05vj6}9