diff --git a/bqetl/index.html b/bqetl/index.html index ed705bc5d3b..4eff55a2779 100644 --- a/bqetl/index.html +++ b/bqetl/index.html @@ -2621,10 +2621,10 @@

initialize

--sql_dir: Path to directory which contains queries. --project_id: GCP project ID ---billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project. ---dry_run: Dry run the initialization ---parallelism: Number of threads for parallel processing ---skip_existing: Skip initialization for existing artifacts. This ensures that artifacts, like materialized views only get initialized if they don't already exist. +--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project. +--dry_run: Dry run the initialization +--parallelism: Number of threads for parallel processing +--skip_existing: Skip initialization for existing artifacts, otherwise initialization is run for empty tables. --force: Run the initialization even if the destination table contains data.

Examples

diff --git a/search/search_index.json b/search/search_index.json index 3618588eee2..c9c3cc26456 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"bqetl/","title":"bqetl CLI","text":"

The bqetl command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.

Running some commands, for example to create or query tables, will require Mozilla GCP access.
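
If you have access, authenticating with the gcloud CLI (assuming it is installed) typically looks like the following; the same command appears in the workflows later in this documentation:

gcloud auth login --update-adc\n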

"},{"location":"bqetl/#installation","title":"Installation","text":"

Follow the Quick Start to set up bigquery-etl and the bqetl CLI.

"},{"location":"bqetl/#configuration","title":"Configuration","text":"

bqetl can be configured via the bqetl_project.yaml file. See Configuration to find available configuration options.

"},{"location":"bqetl/#commands","title":"Commands","text":"

To list all available commands in the bqetl CLI:

$ ./bqetl\n\nUsage: bqetl [OPTIONS] COMMAND [ARGS]...\n\n  CLI tools for working with bigquery-etl.\n\nOptions:\n  --version  Show the version and exit.\n  --help     Show this message and exit.\n\nCommands:\n  alchemer    Commands for importing alchemer data.\n  dag         Commands for managing DAGs.\n  dependency  Build and use query dependency graphs.\n  dryrun      Dry run SQL.\n  format      Format SQL.\n  glam        Tools for GLAM ETL.\n  mozfun      Commands for managing mozfun routines.\n  query       Commands for managing queries.\n  routine     Commands for managing routines.\n  stripe      Commands for Stripe ETL.\n  view        Commands for managing views.\n  backfill    Commands for managing backfills.\n

See help for any command:

$ ./bqetl [command] --help\n
"},{"location":"bqetl/#autocomplete","title":"Autocomplete","text":"

CLI autocomplete for bqetl can be enabled for bash and zsh shells using the script/bqetl_complete script:

source script/bqetl_complete\n

Then pressing tab after bqetl commands should print possible commands, e.g. for zsh:

% bqetl query<TAB><TAB>\nbackfill       -- Run a backfill for a query.\ncreate         -- Create a new query with name...\ninfo           -- Get information about all or specific...\ninitialize     -- Run a full backfill on the destination...\nrender         -- Render a query Jinja template.\nrun            -- Run a query.\n...\n

source script/bqetl_complete can also be added to ~/.bashrc or ~/.zshrc to persist settings across shell instances.
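
For example, assuming the repository is checked out at /path/to/bigquery-etl (adjust the path for your setup):

echo "source /path/to/bigquery-etl/script/bqetl_complete" >> ~/.zshrc\n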

For more details on shell completion, see the click documentation.

"},{"location":"bqetl/#query","title":"query","text":"

Commands for managing queries.

"},{"location":"bqetl/#create","title":"create","text":"

Create a new query with name <dataset>.<table_name>, for example: telemetry_derived.active_profiles. Use the --project_id option to change the project the query is added to; default is moz-fx-data-shared-prod. Views are automatically generated in the publicly facing dataset.

Usage

$ ./bqetl query create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--owner: Owner of the query (email address)\n--dag: Name of the DAG the query should be scheduled under. If there is no DAG name specified, the query is scheduled by default in DAG bqetl_default. To skip the automated scheduling use --no_schedule. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.\n--no_schedule: Using this option creates the query without scheduling information. Use `bqetl query schedule` to add it manually if required.\n

Examples

./bqetl query create telemetry_derived.deviations_v1 \\\n  --owner=example@mozilla.com\n\n\n# The query version gets autocompleted to v1. Queries are created in the\n# _derived dataset and accompanying views in the public dataset.\n./bqetl query create telemetry.deviations --owner=example@mozilla.com\n
"},{"location":"bqetl/#schedule","title":"schedule","text":"

Schedule an existing query

Usage

$ ./bqetl query schedule [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--dag: Name of the DAG the query should be scheduled under. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.\n--depends_on_past: Only execute query if previous scheduled run succeeded.\n--task_name: Custom name for the Airflow task. By default the task name is a combination of the dataset and table name.\n

Examples

./bqetl query schedule telemetry_derived.deviations_v1 \\\n  --dag=bqetl_deviations\n\n\n# Set a specific name for the task\n./bqetl query schedule telemetry_derived.deviations_v1 \\\n  --dag=bqetl_deviations \\\n  --task-name=deviations\n
"},{"location":"bqetl/#info","title":"info","text":"

Get information about all or specific queries.

Usage

$ ./bqetl query info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Get info for specific queries\n./bqetl query info telemetry_derived.*\n\n\n# Get cost and last update timestamp information\n./bqetl query info telemetry_derived.clients_daily_v6 \\\n  --cost --last_updated\n
"},{"location":"bqetl/#backfill","title":"backfill","text":"

Run a backfill for a query. Additional parameters will get passed to bq.

Usage

$ ./bqetl query backfill [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--start_date: First date to be backfilled\n--end_date: Last date to be backfilled\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--dry_run: Dry run the backfill\n--max_rows: How many rows to return in the result\n--parallelism: How many threads to run backfill in parallel\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query.\n--checks: Whether to run checks during backfill\n--custom_query_path: Name of a custom query to run the backfill. If not given, the process runs as usual.\n--checks_file_name: Name of a custom data checks file to run after each partition backfill. E.g. custom_checks.sql. Optional.\n--scheduling_overrides: Pass overrides as a JSON string for scheduling sections: parameters and/or date_partition_parameter as needed.\n

Examples

# Backfill for specific date range\n# second comment line\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n  --start_date=2021-03-01 \\\n  --end_date=2021-03-31\n\n\n# Dryrun backfill for specific date range and exclude date\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n  --start_date=2021-03-01 \\\n  --end_date=2021-03-31 \\\n  --exclude=2021-03-03 \\\n  --dry_run\n
"},{"location":"bqetl/#run","title":"run","text":"

Run a query. Additional parameters will get passed to bq. If a destination_table is set, the query result will be written to BigQuery; without a destination_table specified, the results are not stored. If the name is not found within the sql/ folder, bqetl assumes it hasn't been generated yet and will start the generation process for all sql_generators/ files. This generation process takes some time and runs dry run calls against BigQuery, but this is expected. Additional parameters (all parameters that are not specified in the Options) must come after the query name; otherwise the first parameter that is not an option is interpreted as the query name and, since it can't be found, the generation process will start. An example of passing extra parameters is shown below.

Usage

$ ./bqetl query run [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--public_project_id: Project with publicly accessible data\n--destination_table: Destination table name results are written to. If not set, the query result will not be written to BigQuery.\n--dataset_id: Destination dataset results are written to. If not set, determines destination dataset based on query.\n

Examples

# Run a query by name\n./bqetl query run telemetry_derived.ssl_ratios_v1\n\n\n# Run a query file\n./bqetl query run /path/to/query.sql\n\n\n# Run a query and save the result to BigQuery\n./bqetl query run telemetry_derived.ssl_ratios_v1         --project_id=moz-fx-data-shared-prod         --dataset_id=telemetry_derived         --destination_table=ssl_ratios_v1\n
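
Extra parameters placed after the query name are passed through to bq. A sketch, assuming the standard bq --parameter flag (the parameter value is illustrative):

# Pass additional bq flags after the query name\n./bqetl query run telemetry_derived.ssl_ratios_v1 \\\n  --parameter=submission_date:DATE:2021-03-01\n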
"},{"location":"bqetl/#run-multipart","title":"run-multipart","text":"

Run a multipart query.

Usage

$ ./bqetl query run-multipart [OPTIONS] [query_dir]\n\nOptions:\n\n--using: comma separated list of join columns to use when combining results\n--parallelism: Maximum number of queries to execute concurrently\n--dataset_id: Default dataset, if not specified all tables must be qualified with dataset\n--project_id: GCP project ID\n--temp_dataset: Dataset where intermediate query results will be temporarily stored, formatted as PROJECT_ID.DATASET_ID\n--destination_table: table where combined results will be written\n--time_partitioning_field: time partition field on the destination table\n--clustering_fields: comma separated list of clustering fields on the destination table\n--dry_run: Print bytes that would be processed for each part and don't run queries\n--parameters: query parameter(s) to pass when running parts\n--priority: Priority for BigQuery query jobs; BATCH priority will significantly slow down queries if reserved slots are not enabled for the billing project; defaults to INTERACTIVE\n--schema_update_options: Optional options for updating the schema.\n

Examples

# Run a multipart query\n./bqetl query run_multipart /path/to/query.sql\n
"},{"location":"bqetl/#validate","title":"validate","text":"

Validate a query. Checks formatting, scheduling information and dry runs the query.

Usage

$ ./bqetl query validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --ignore-dryrun-skip.\n--no_dryrun: Skip running dryrun. Default is False.\n

Examples

./bqetl query validate telemetry_derived.clients_daily_v6\n\n\n# Validate query not in shared-prod\n./bqetl query validate \\\n  --use_cloud_function=false \\\n  --project_id=moz-fx-data-marketing-prod \\\n  ga_derived.blogs_goals_v1\n
"},{"location":"bqetl/#initialize","title":"initialize","text":"

Run a full backfill on the destination table for the query. Using this command will: - Create the table if it doesn't exist and run a full backfill. - Run a full backfill if the table exists and is empty. - Raise an exception if the table exists and has data, or if the table exists and the schema doesn't match the query. It supports query.sql files that use the is_init() pattern. To run in parallel per sample_id, include a @sample_id parameter in the query.

Usage

$ ./bqetl query initialize [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--dry_run: Dry run the initialization\n--parallelism: Number of threads for parallel processing\n--skip_existing: Skip initialization for existing artifacts. This ensures that artifacts, like materialized views only get initialized if they don't already exist.\n--force: Run the initialization even if the destination table contains data.\n

Examples

Examples:\n   - For init.sql files: ./bqetl query initialize telemetry_derived.ssl_ratios_v1\n   - For query.sql files and parallel run: ./bqetl query initialize sql/moz-fx-data-shared-prod/telemetry_derived/clients_first_seen_v2/query.sql\n
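
A minimal sketch of the @sample_id / is_init() pattern mentioned above (the source table and selected columns are illustrative; existing queries in the repository are the authoritative reference):

SELECT\n  DATE(submission_timestamp) AS submission_date,\n  sample_id,\n  COUNT(*) AS n\nFROM\n  `moz-fx-data-shared-prod`.telemetry.example_ping\nWHERE\n  {% if is_init() %}\n    sample_id = @sample_id\n  {% else %}\n    DATE(submission_timestamp) = @submission_date\n  {% endif %}\nGROUP BY\n  submission_date,\n  sample_id\n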
"},{"location":"bqetl/#render","title":"render","text":"

Render a query Jinja template.

Usage

$ ./bqetl query render [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--output_dir: Output directory generated SQL is written to. If not specified, rendered queries are printed to console.\n--parallelism: Number of threads for parallel processing\n

Examples

./bqetl query render telemetry_derived.ssl_ratios_v1 \\\n  --output-dir=/tmp\n
"},{"location":"bqetl/#schema","title":"schema","text":"

Commands for managing query schemas.

"},{"location":"bqetl/#update","title":"update","text":"

Update the query schema based on the destination table schema and the query schema. If no schema.yaml file exists for a query, one will be created.

Usage

$ ./bqetl query schema update [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--update_downstream: Update downstream dependencies. GCP authentication required.\n--tmp_dataset: GCP datasets for creating updated tables temporarily.\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--parallelism: Number of threads for parallel processing\n--is_init: Indicates whether the `is_init()` condition should be set to true or false.\n

Examples

./bqetl query schema update telemetry_derived.clients_daily_v6\n\n# Update schema including downstream dependencies (requires GCP)\n./bqetl query schema update telemetry_derived.clients_daily_v6 --update-downstream\n
"},{"location":"bqetl/#deploy","title":"deploy","text":"

Deploy the query schema.

Usage

$ ./bqetl query schema deploy [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--force: Deploy the schema file without validating that it matches the query\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--skip_existing: Skip updating existing tables. This option ensures that only new tables get deployed.\n--skip_external_data: Skip publishing external data, such as Google Sheets.\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query.  Must be fully qualified (project.dataset.table).\n--parallelism: Number of threads for parallel processing\n

Examples

./bqetl query schema deploy telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#validate_1","title":"validate","text":"

Validate the query schema

Usage

$ ./bqetl query schema validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n

Examples

./bqetl query schema validate telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#dag","title":"dag","text":"

Commands for managing DAGs.

"},{"location":"bqetl/#info_1","title":"info","text":"

Get information about available DAGs.

Usage

$ ./bqetl dag info [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--with_tasks: Include scheduled tasks\n

Examples

# Get information about all available DAGs\n./bqetl dag info\n\n# Get information about a specific DAG\n./bqetl dag info bqetl_ssl_ratios\n\n# Get information about a specific DAG including scheduled tasks\n./bqetl dag info --with_tasks bqetl_ssl_ratios\n
"},{"location":"bqetl/#create_1","title":"create","text":"

Create a new DAG with name bqetl_<name>, for example: bqetl_search. When creating new DAGs, the DAG name must have a bqetl_ prefix. Created DAGs are added to the dags.yaml file.

Usage

$ ./bqetl dag create [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--schedule_interval: Schedule interval of the new DAG. Schedule intervals can be either in CRON format or one of: once, hourly, daily, weekly, monthly, yearly or a timedelta []d[]h[]m\n--owner: Email address of the DAG owner\n--description: Description for DAG\n--tag: Tag to apply to the DAG\n--start_date: First date for which scheduled queries should be executed\n--email: Email addresses that Airflow will send alerts to\n--retries: Number of retries Airflow will attempt in case of failures\n--retry_delay: Time period Airflow will wait after failures before running failed tasks again\n

Examples

./bqetl dag create bqetl_core \\\n--schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"Tables derived from `core` pings sent by mobile applications.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-25\n\n\n# Create DAG and overwrite default settings\n./bqetl dag create bqetl_ssl_ratios --schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"The DAG schedules SSL ratios queries.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-20 \\\n--email=example2@mozilla.com \\\n--email=example3@mozilla.com \\\n--retries=2 \\\n--retry_delay=30m\n
"},{"location":"bqetl/#generate","title":"generate","text":"

Generate Airflow DAGs from DAG definitions.

Usage

$ ./bqetl dag generate [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n

Examples

# Generate all DAGs\n./bqetl dag generate\n\n# Generate a specific DAG\n./bqetl dag generate bqetl_ssl_ratios\n
"},{"location":"bqetl/#remove","title":"remove","text":"

Remove a DAG. This will also remove the scheduling information from the queries that were scheduled as part of the DAG.

Usage

$ ./bqetl dag remove [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n

Examples

# Remove a specific DAG\n./bqetl dag remove bqetl_vrbrowser\n
"},{"location":"bqetl/#dependency","title":"dependency","text":"

Build and use query dependency graphs.

"},{"location":"bqetl/#show","title":"show","text":"

Show table references in sql files.

Usage

$ ./bqetl dependency show [OPTIONS] [paths]\n
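
For example, to show table references for a single query file (the path is illustrative):

./bqetl dependency show sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1/query.sql\n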
"},{"location":"bqetl/#record","title":"record","text":"

Record table references in metadata. Fails if metadata already contains references section.

Usage

$ ./bqetl dependency record [OPTIONS] [paths]\n
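
For example, to record table references for all queries under a directory (the path is illustrative):

./bqetl dependency record sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1/\n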
"},{"location":"bqetl/#dryrun","title":"dryrun","text":"

Dry run SQL. Uses the dryrun Cloud Function by default, which only has access to shared-prod. To dry run queries accessing tables in another project, set --use-cloud-function=false and ensure that the command line has access to a GCP service account.

Usage

$ ./bqetl dryrun [OPTIONS] [paths]\n\nOptions:\n\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_skip: Respect or ignore query skip configuration. Default is --respect-skip.\n--project: GCP project to perform dry run in when --use_cloud_function=False\n

Examples

Examples:\n./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/\n\n# Dry run SQL with tables that are not in shared prod\n./bqetl dryrun --use-cloud-function=false sql/moz-fx-data-marketing-prod/\n
"},{"location":"bqetl/#format","title":"format","text":"

Format SQL files.

Usage

$ ./bqetl format [OPTIONS] [paths]\n\nOptions:\n\n--check: do not write changes, just return status; return code 0 indicates nothing would change; return code 1 indicates some files would be reformatted\n--parallelism: Number of threads for parallel processing\n

Examples

# Format a specific file\n./bqetl format sql/moz-fx-data-shared-prod/telemetry/core/view.sql\n\n# Format all SQL files in `sql/`\n./bqetl format sql\n\n# Format standard in (will write to standard out)\necho 'SELECT 1,2,3' | ./bqetl format\n
"},{"location":"bqetl/#routine","title":"routine","text":"

Commands for managing routines for internal use.

"},{"location":"bqetl/#create_2","title":"create","text":"

Create a new routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_procedure flag.

Usage

$ ./bqetl routine create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n

Examples

# Create a UDF\n./bqetl routine create --udf udf.array_slice\n\n\n# Create a stored procedure\n./bqetl routine create --stored_procedure udf.events_daily\n\n\n# Create a UDF in a project other than shared-prod\n./bqetl routine create --udf udf.active_last_week --project=moz-fx-data-marketing-prod\n
"},{"location":"bqetl/#info_2","title":"info","text":"

Get routine information.

Usage

$ ./bqetl routine info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n

Examples

# Get information about all internal routines in a specific dataset\n./bqetl routine info udf.*\n\n\n# Get usage information of specific routine\n./bqetl routine info --usages udf.get_key\n
"},{"location":"bqetl/#validate_2","title":"validate","text":"

Validate formatting of routines and run tests.

Usage

$ ./bqetl routine validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n

Examples

# Validate all routines\n./bqetl routine validate\n\n\n# Validate selected routines\n./bqetl routine validate udf.*\n
"},{"location":"bqetl/#publish","title":"publish","text":"

Publish routines to BigQuery. Requires service account access.

Usage

$ ./bqetl routine publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory where JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing UDFs.\n

Examples

# Publish all routines\n./bqetl routine publish\n\n\n# Publish selected routines\n./bqetl routine publish udf.*\n
"},{"location":"bqetl/#rename","title":"rename","text":"

Rename routine or routine dataset. Replaces all usages in queries with the new name.

Usage

$ ./bqetl routine rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Rename routine\n./bqetl routine rename udf.array_slice udf.list_slice\n\n\n# Rename routine matching a specific pattern\n./bqetl routine rename udf.array_* udf.list_*\n
"},{"location":"bqetl/#mozfun","title":"mozfun","text":"

Commands for managing public mozfun routines.

"},{"location":"bqetl/#create_3","title":"create","text":"

Create a new mozfun routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_procedure flag. UDFs are added to the mozfun project.

Usage

$ ./bqetl mozfun create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n

Examples

# Create a UDF\n./bqetl mozfun create --udf bytes.zero_right\n\n\n# Create a stored procedure\n./bqetl mozfun create --stored_procedure event_analysis.events_daily\n
"},{"location":"bqetl/#info_3","title":"info","text":"

Get mozfun routine information.

Usage

$ ./bqetl mozfun info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n

Examples

# Get information about all internal routines in a specific dataset\n./bqetl mozfun info hist.*\n\n\n# Get usage information of specific routine\n./bqetl mozfun info --usages hist.mean\n
"},{"location":"bqetl/#validate_3","title":"validate","text":"

Validate formatting of mozfun routines and run tests.

Usage

$ ./bqetl mozfun validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n

Examples

# Validate all routines\n./bqetl mozfun validate\n\n\n# Validate selected routines\n./bqetl mozfun validate hist.*\n
"},{"location":"bqetl/#publish_1","title":"publish","text":"

Publish mozfun routines. This command is used by Airflow only.

Usage

$ ./bqetl mozfun publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory where JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing UDFs.\n
"},{"location":"bqetl/#rename_1","title":"rename","text":"

Rename mozfun routine or mozfun routine dataset. Replaces all usages in queries with the new name.

Usage

$ ./bqetl mozfun rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Rename routine\n./bqetl mozfun rename hist.extract hist.ext\n\n\n# Rename routine matching a specific pattern\n./bqetl mozfun rename *.array_* *.list_*\n\n\n# Rename routine dataset\n./bqetl mozfun rename hist.* histogram.*\n
"},{"location":"bqetl/#backfill_1","title":"backfill","text":"

Commands for managing backfills.

"},{"location":"bqetl/#create_4","title":"create","text":"

Create a new backfill entry in the backfill.yaml file. Create a backfill.yaml file if it does not already exist.

Usage

$ ./bqetl backfill create [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--start_date: First date to be backfilled. Date format: yyyy-mm-dd\n--end_date: Last date to be backfilled. Date format: yyyy-mm-dd\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--watcher: Watcher of the backfill (email address)\n--custom_query_path: Path of the custom query to run the backfill. Optional.\n--shredder_mitigation: Whether to run a backfill using an auto-generated query that mitigates the shredder effect.\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n

Examples

./bqetl backfill create moz-fx-data-shared-prod.telemetry_derived.deviations_v1 \\\n  --start_date=2021-03-01 \\\n  --end_date=2021-03-31 \\\n  --exclude=2021-03-03 \\\n
"},{"location":"bqetl/#validate_4","title":"validate","text":"

Validate backfill.yaml file format and content.

Usage

$ ./bqetl backfill validate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

./bqetl backfill validate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# validate all backfill.yaml files if table is not specified\nUse the `--project_id` option to change the project to be validated;\ndefault is `moz-fx-data-shared-prod`.\n\n    ./bqetl backfill validate\n
"},{"location":"bqetl/#info_4","title":"info","text":"

Get backfill(s) information from all or specific table(s).

Usage

$ ./bqetl backfill info [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Filter backfills with this status.\n

Examples

# Get info for specific table.\n./bqetl backfill info moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill info\n\n\n# Get info from all tables with specific status.\n./bqetl backfill info --status=Initiate\n
"},{"location":"bqetl/#scheduled","title":"scheduled","text":"

Get information on backfill(s) that require processing.

Usage

$ ./bqetl backfill scheduled [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Whether to get backfills to process or to complete.\n--json_path: None\n

Examples

# Get info for specific table.\n./bqetl backfill scheduled moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill scheduled\n
"},{"location":"bqetl/#initiate","title":"initiate","text":"

Process entry in backfill.yaml with Initiate status that has not yet been processed.

Usage

$ ./bqetl backfill initiate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--parallelism: Maximum number of queries to execute concurrently\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Initiate backfill entry for specific table\n./bqetl backfill initiate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"bqetl/#complete","title":"complete","text":"

Complete entry in backfill.yaml with Complete status that has not yet been processed.

Usage

$ ./bqetl backfill complete [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Complete backfill entry for specific table\n./bqetl backfill complete moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"cookbooks/common_workflows/","title":"Common bigquery-etl workflows","text":"

This is a quick guide of how to perform common workflows in bigquery-etl using the bqetl CLI.

For any workflow, the bigquery-etl repository needs to be locally available, for example by cloning the repository, and the bqetl CLI needs to be installed by running ./bqetl bootstrap.
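
A typical setup, assuming the repository is cloned from the mozilla/bigquery-etl project on GitHub:

git clone https://github.com/mozilla/bigquery-etl.git\ncd bigquery-etl\n./bqetl bootstrap\n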

"},{"location":"cookbooks/common_workflows/#adding-a-new-scheduled-query","title":"Adding a new scheduled query","text":"

The Creating derived datasets tutorial provides a more detailed guide on creating scheduled queries; a condensed command sequence is sketched after the steps below.

  1. Run ./bqetl query create <dataset>.<table>_<version>
    1. Specify the desired destination dataset and table name for <dataset>.<table>_<version>
    2. Directories and files are generated automatically
  2. Open query.sql file that has been created in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/ to write the query
  3. [Optional] Run ./bqetl query schema update <dataset>.<table>_<version> to generate the schema.yaml file
  4. Open the metadata.yaml file in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
  5. Run ./bqetl query validate <dataset>.<table>_<version> to dry run and format the query
  6. To schedule the query, first select a DAG from the ./bqetl dag info list or create a new DAG ./bqetl dag create <bqetl_new_dag>
  7. Run ./bqetl query schedule <dataset>.<table>_<version> --dag <bqetl_dag> to schedule the query
  8. Create a pull request
  9. PR gets reviewed and eventually approved
  10. Merge pull-request
  11. Table deploys happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG
  12. Backfill data
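
Condensed into a hypothetical command sequence (dataset, table and DAG names are placeholders), the steps above look roughly like this:

./bqetl query create my_dataset_derived.my_table_v1\n# edit sql/moz-fx-data-shared-prod/my_dataset_derived/my_table_v1/query.sql and metadata.yaml\n./bqetl query schema update my_dataset_derived.my_table_v1\n./bqetl query validate my_dataset_derived.my_table_v1\n./bqetl query schedule my_dataset_derived.my_table_v1 --dag bqetl_my_dag\n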
"},{"location":"cookbooks/common_workflows/#update-an-existing-query","title":"Update an existing query","text":"
  1. Open the query.sql file of the query to be updated and make changes
  2. Run ./bqetl query validate <dataset>.<table>_<version> to dry run and format the query
  3. If the query scheduling metadata has changed, run ./bqetl dag generate <bqetl_dag> to update the DAG file
  4. If the query adds new columns, run ./bqetl query schema update <dataset>.<table>_<version> to make local schema.yaml updates
  5. Open PR with changes
  6. PR reviewed and approved
  7. Merge pull-request
  8. Table deploys (including schema changes) happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG
"},{"location":"cookbooks/common_workflows/#formatting-sql","title":"Formatting SQL","text":"

We enforce consistent SQL formatting as part of CI. After adding or changing a query, use ./bqetl format to apply formatting rules.

Directories and files passed as arguments to ./bqetl format will be formatted in place, with directories recursively searched for files with a .sql extension, e.g.:

$ echo 'SELECT 1,2,3' > test.sql\n$ ./bqetl format test.sql\nmodified test.sql\n1 file(s) modified\n$ cat test.sql\nSELECT\n  1,\n  2,\n  3\n

If no arguments are specified the script will read from stdin and write to stdout, e.g.:

$ echo 'SELECT 1,2,3' | ./bqetl format\nSELECT\n  1,\n  2,\n  3\n

To turn off sql formatting for a block of SQL, wrap it in format:off and format:on comments, like this:

SELECT\n  -- format:off\n  submission_date, sample_id, client_id\n  -- format:on\n
"},{"location":"cookbooks/common_workflows/#add-a-new-field-to-a-table-schema","title":"Add a new field to a table schema","text":"

Adding a new field to a table schema also means that the field has to propagate to several downstream tables, which makes it a more complex case.

  1. Open the query.sql file inside the <dataset>.<table> location and add the new definitions for the field.
  2. Run ./bqetl format <path to the query> to format the query. Alternatively, run ./bqetl format $(git ls-tree -d HEAD --name-only) to validate the format of all queries that have been modified.
  3. Run ./bqetl query validate <dataset>.<table> to dry run the query.
  4. Run ./bqetl query schema update <dataset>.<table> --update_downstream to make local schema.yaml updates and update schemas of downstream dependencies.
  5. Open a new PR with these changes.
  6. PR reviewed and approved.
  7. Find and run again the CI pipeline for the PR.
  8. Merge pull-request.
  9. Table deploys happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG

The following is an example to update a new field in telemetry_derived.clients_daily_v6

"},{"location":"cookbooks/common_workflows/#example-add-a-new-field-to-clients_daily","title":"Example: Add a new field to clients_daily","text":"
  1. Open the clients_daily_v6 query.sql file and add new field definitions.
  2. Run ./bqetl format sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/query.sql
  3. Run ./bqetl query validate telemetry_derived.clients_daily_v6.
  4. Authenticate to GCP: gcloud auth login --update-adc
  5. Run ./bqetl query schema update telemetry_derived.clients_daily_v6 --update_downstream --ignore-dryrun-skip --use-cloud-function=false.
  6. Open a PR with these changes.
  7. PR is reviewed and approved.
  8. Merge pull-request.
  9. Table deploys happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG
"},{"location":"cookbooks/common_workflows/#remove-a-field-from-a-table-schema","title":"Remove a field from a table schema","text":"

Deleting a field from an existing table schema should be done only when it is absolutely necessary. If you decide to delete it: 1. Validate whether there is data in the column and make sure it is either backed up or can be reprocessed. 2. Follow the BigQuery docs recommendations for deleting. 3. If the column size exceeds the allowed limit, consider setting the field to NULL (see the sketch below). See this search_clients_daily_v8 PR for an example.
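
A minimal sketch of the NULL approach from step 3 (column and table names are hypothetical): the column stays in the query so the schema remains stable, but it is no longer populated.

SELECT\n  client_id,\n  submission_date,\n  -- formerly a large column; intentionally emitted as NULL to reduce column size\n  CAST(NULL AS STRING) AS large_payload_column\nFROM\n  `moz-fx-data-shared-prod`.telemetry.example_ping\n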

"},{"location":"cookbooks/common_workflows/#adding-a-new-mozfun-udf","title":"Adding a new mozfun UDF","text":"
  1. Run ./bqetl mozfun create <dataset>.<name> --udf.
  2. Navigate to the udf.sql file in sql/mozfun/<dataset>/<name>/ and add the UDF definition and tests (a minimal sketch is shown after this list).
  3. Run ./bqetl mozfun validate <dataset>.<name> for formatting and running tests.
  4. Open a PR.
  5. PR gets reviewed, approved and merged.
  6. To publish UDF immediately:
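
As a reference for step 2, a minimal sketch of what a udf.sql file might contain (the dataset, function name and assert test helper are illustrative; check existing mozfun routines for the exact conventions):

-- Hypothetical UDF definition\nCREATE OR REPLACE FUNCTION examples.increment(x INT64)\nRETURNS INT64 AS (\n  x + 1\n);\n\n-- Tests\nSELECT\n  assert.equals(2, examples.increment(1));\n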
"},{"location":"cookbooks/common_workflows/#adding-a-new-internal-udf","title":"Adding a new internal UDF","text":"

Internal UDFs are usually only used by specific queries. If your UDF might be useful to others consider publishing it as a mozfun UDF.

  1. Run ./bqetl routine create <dataset>.<name> --udf
  2. Navigate to the udf.sql file in sql/moz-fx-data-shared-prod/<dataset>/<name>/ and add the UDF definition and tests
  3. Run ./bqetl routine validate <dataset>.<name> for formatting and running tests
  4. Open a PR
  5. PR gets reviewed and approved and merged
  6. UDF deploys happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG
"},{"location":"cookbooks/common_workflows/#adding-a-stored-procedure","title":"Adding a stored procedure","text":"

The same steps as for creating a new UDF apply to creating stored procedures, except that when initially creating the procedure you execute ./bqetl mozfun create <dataset>.<name> --stored_procedure, or ./bqetl routine create <dataset>.<name> --stored_procedure for internal stored procedures.

"},{"location":"cookbooks/common_workflows/#updating-an-existing-udf","title":"Updating an existing UDF","text":"
  1. Navigate to the udf.sql file and make updates
  2. Run ./bqetl mozfun validate <dataset>.<name> or ./bqetl routine validate <dataset>.<name> for formatting and running tests
  3. Open a PR
  4. PR gets reviewed, approved and merged
"},{"location":"cookbooks/common_workflows/#renaming-an-existing-udf","title":"Renaming an existing UDF","text":"
  1. Run ./bqetl mozfun rename <dataset>.<name> <new_dataset>.<new_name>
  2. Open a PR
  3. PR gets reviews, approved and merged
"},{"location":"cookbooks/common_workflows/#using-a-private-internal-udf","title":"Using a private internal UDF","text":"
  1. Follow the steps for Adding a new internal UDF above to create a stub of the private UDF. Note this should not contain actual private UDF code or logic. The directory name and function parameters should match the private UDF.
  2. Do Not publish the stub UDF. This could result in incorrect results for other users of the private UDF.
  3. Open a PR
  4. PR gets reviewed, approved and merged
"},{"location":"cookbooks/common_workflows/#creating-a-new-bigquery-dataset","title":"Creating a new BigQuery Dataset","text":"

To provision a new BigQuery dataset for holding tables, you'll need to create a dataset_metadata.yaml file, which will cause the dataset to be automatically deployed after merging. Changes to existing datasets (such as changing access policies) may trigger manual operator approval. For more on access controls, see Data Access Workgroups in Mana.

The bqetl query create command will automatically generate a skeleton dataset_metadata.yaml file if the query name contains a dataset that is not yet defined.

See example with commentary for telemetry_derived:

friendly_name: Telemetry Derived\ndescription: |-\n  Derived data based on pings from legacy Firefox telemetry, plus many other\n  general-purpose derived tables\nlabels: {}\n\n# Base ACL can be:\n#   \"derived\" for `_derived` datasets that contain concrete tables\n#   \"view\" for user-facing datasets containing virtual views\ndataset_base_acl: derived\n\n# Datasets with user-facing set to true will be created both in shared-prod\n# and in mozdata; this should be false for all `_derived` datasets\nuser_facing: false\n\n# Most datasets can have mozilla-confidential access like below, but some\n# datasets will be defined with more restricted access or with additional\n# access for services; see \"Data Access Workgroups\" link above.\nworkgroup_access:\n- role: roles/bigquery.dataViewer\n  members:\n  - workgroup:mozilla-confidential\n
"},{"location":"cookbooks/common_workflows/#publishing-data","title":"Publishing data","text":"

See also the reference for Public Data.

  1. Get a data review by following the data publishing process
  2. Update the metadata.yaml file of the query to be published
  3. If an internal dataset already exists, move it to mozilla-public-data
  4. If an init.sql file exists for the query, change the destination project for the created table to mozilla-public-data
  5. Open a PR
  6. PR gets reviewed, approved and merged
"},{"location":"cookbooks/common_workflows/#adding-new-python-requirements","title":"Adding new Python requirements","text":"

When adding a new library to the Python requirements, first add the library to the requirements and then add any meta-dependencies into constraints. Constraints are discovered by installing requirements into a fresh virtual environment. A dependency should be added to either requirements.txt or constraints.txt, but not both.

# Create a python virtual environment (not necessary if you have already\n# run `./bqetl bootstrap`)\npython3 -m venv venv/\n\n# Activate the virtual environment\nsource venv/bin/activate\n\n# If not installed:\npip install pip-tools --constraint requirements.in\n\n# Add the dependency to requirements.in e.g. Jinja2.\necho Jinja2==2.11.1 >> requirements.in\n\n# Compile hashes for new dependencies.\npip-compile --generate-hashes requirements.in\n\n# Deactivate the python virtual environment.\ndeactivate\n
"},{"location":"cookbooks/common_workflows/#making-a-pull-request-from-a-fork","title":"Making a pull request from a fork","text":"

When opening a pull-request to merge a fork, the manual-trigger-required-for-fork CI task will fail and some integration test tasks will be skipped. A user with repository write permissions will have to run the Push to upstream workflow and provide the <username>:<branch> of the fork as parameter. The parameter will also show up in the logs of the manual-trigger-required-for-fork CI task together with more detailed instructions. Once the workflow has been executed, the CI tasks, including the integration tests, of the PR will be executed.

"},{"location":"cookbooks/common_workflows/#building-the-documentation","title":"Building the Documentation","text":"

The repository documentation is built using MkDocs. To generate and check the docs locally:

  1. Run ./bqetl docs generate --output_dir generated_docs
  2. Navigate to the generated_docs directory
  3. Run mkdocs serve to start a local mkdocs server.
"},{"location":"cookbooks/common_workflows/#setting-up-change-control-to-code-files","title":"Setting up change control to code files","text":"

Each code file in the bigquery-etl repository can have a set of owners who are responsible for reviewing and approving changes and who are automatically assigned as PR reviewers. Query files in the repo also benefit from metadata labels that make it possible to validate and identify the data that is change controlled.

Here is a sample PR with the implementation of change control for contextual services data.

  1. Select or create a GitHub team or identity and add the GitHub emails of the query codeowners. A GitHub identity is particularly useful when you need to include non-@mozilla emails or to randomly assign PR reviewers from the team members. This team requires edit permissions to bigquery-etl; to achieve this, inherit the team from one that has the required permissions, e.g. mozilla > telemetry.
  2. Open the metadata.yaml for the query where you want to apply change control:
  3. Setup the CODEOWNERS:
  4. The queries labeled change_controlled are automatically validated in the CI. To run the validation locally:
"},{"location":"cookbooks/creating_a_derived_dataset/","title":"A quick guide to creating a derived dataset with BigQuery-ETL and how to set it up as a public dataset","text":"

This guide takes you through the creation of a simple derived dataset using bigquery-etl and scheduling it with Airflow so that it is updated on a daily basis. It applies to the products we ship to customers that use (or will use) the Glean SDK.

This guide also includes the specific instructions to set it as a public dataset. Make sure you only set the dataset public if you expect the data to be available outside Mozilla. Read our public datasets reference for context.

To illustrate the overall process, we will use a simple test case and a small Glean application for which we want to generate an aggregated dataset based on the raw ping data.

If you are interested in looking at the end result, you can view the pull request at mozilla/bigquery-etl#1760.

"},{"location":"cookbooks/creating_a_derived_dataset/#background","title":"Background","text":"

Mozregression is a developer tool used to help developers and community members bisect builds of Firefox to find a regression range in which a bug was introduced. It forms a key part of our quality assurance process.

In this example, we will create a table of aggregated metrics related to mozregression, that will be used in dashboards to help prioritize feature development inside Mozilla.

"},{"location":"cookbooks/creating_a_derived_dataset/#initial-steps","title":"Initial steps","text":"

Set up bigquery-etl on your system per the instructions in the README.md.

"},{"location":"cookbooks/creating_a_derived_dataset/#create-the-query","title":"Create the Query","text":"

The first step is to create a query file and decide on the name of your derived dataset. In this case, we'll name it org_mozilla_mozregression_derived.mozregression_aggregates.

The org_mozilla_mozregression_derived part represents a BigQuery dataset, which is essentially a container of tables. By convention, we use the _derived postfix to hold derived tables like this one.

Run:

./bqetl query create <dataset>.<table_name>\n
In our example:

./bqetl query create org_mozilla_mozregression_derived.mozregression_aggregates --dag bqetl_internal_tooling\n

This command does three things:

We generate the view to have a stable interface, while allowing the dataset backend to evolve over time. Views are automatically published to the mozdata project.

"},{"location":"cookbooks/creating_a_derived_dataset/#fill-out-the-yaml","title":"Fill out the YAML","text":"

The next step is to modify the generated metadata.yaml and query.sql sections with specific information.

Let's look at what the metadata.yaml file for our example looks like. Make sure to adapt this file for your own dataset.

friendly_name: mozregression aggregates\ndescription:\n  Aggregated metrics of mozregression usage\nlabels:\n  incremental: true\nowners:\n  - wlachance@mozilla.com\nbigquery:\n  time_partitioning:\n    type: day\n    field: date\n    require_partition_filter: true\n    expiration_days: null\n  clustering:\n    fields:\n    - app_used\n    - os\n

Most of the fields are self-explanatory. incremental means that the table is updated incrementally, e.g. a new partition gets added/updated to the destination table whenever the query is run. For non-incremental queries the entire destination is overwritten when the query is executed.

For big datasets make sure to include optimization strategies. Our aggregation is small so it is only for illustration purposes that we are including a partition by the date field and a clustering on app_used and os.

"},{"location":"cookbooks/creating_a_derived_dataset/#the-yaml-file-structure-for-a-public-dataset","title":"The YAML file structure for a public dataset","text":"

Setting the dataset as public means that it will be available both in Mozilla's public BigQuery project and at a world-accessible JSON endpoint, and is a process that requires a data review. The required labels are public_json, public_bigquery and review_bugs, the latter referring to the Bugzilla bug where opening this dataset up to the public was approved; we'll get to that in a subsequent section.

friendly_name: mozregression aggregates\ndescription:\n  Aggregated metrics of mozregression usage\nlabels:\n  incremental: true\n  public_json: true\n  public_bigquery: true\n  review_bugs:\n    - 1691105\nowners:\n  - wlachance@mozilla.com\nbigquery:\n  time_partitioning:\n    type: day\n    field: date\n    require_partition_filter: true\n    expiration_days: null\n  clustering:\n    fields:\n    - app_used\n    - os\n
"},{"location":"cookbooks/creating_a_derived_dataset/#fill-out-the-query","title":"Fill out the query","text":"

Now that we've filled out the metadata, we can look into creating a query. In many ways, this is similar to creating a SQL query to run on BigQuery in other contexts (e.g. on sql.telemetry.mozilla.org or the BigQuery console)-- the key difference is that we use a @submission_date parameter so that the query can be run on a day's worth of data to update the underlying table incrementally.

Test your query and add it to the query.sql file.

In our example, the query is tested in sql.telemetry.mozilla.org, and the query.sql file looks like this:

SELECT\n  DATE(submission_timestamp) AS date,\n  client_info.app_display_version AS mozregression_version,\n  metrics.string.usage_variant AS mozregression_variant,\n  metrics.string.usage_app AS app_used,\n  normalized_os AS os,\n  mozfun.norm.truncate_version(normalized_os_version, \"minor\") AS os_version,\n  count(DISTINCT(client_info.client_id)) AS distinct_clients,\n  count(*) AS total_uses\nFROM\n  `moz-fx-data-shared-prod`.org_mozilla_mozregression.usage\nWHERE\n  DATE(submission_timestamp) = @submission_date\n  AND client_info.app_display_version NOT LIKE '%.dev%'\nGROUP BY\n  date,\n  mozregression_version,\n  mozregression_variant,\n  app_used,\n  os,\n  os_version;\n

We use the truncate_version UDF to omit the patch level for MacOS and Linux, which should both reduce the size of the dataset as well as make it more difficult to identify individual clients in an aggregated dataset.

We also have a short clause (client_info.app_display_version NOT LIKE '%.dev%') to omit developer versions from the aggregates: this makes sure we're not including people developing or testing mozregression itself in our results.

"},{"location":"cookbooks/creating_a_derived_dataset/#formatting-and-validating-the-query","title":"Formatting and validating the query","text":"

Now that we've written our query, we can format it and validate it. Once that's done, we run:

./bqetl query validate <dataset>.<table>\n
For our example:
./bqetl query validate org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
If there are no problems, you should see no output.

"},{"location":"cookbooks/creating_a_derived_dataset/#creating-the-table-schema","title":"Creating the table schema","text":"

Use bqetl to set up the schema that will be used to create the table.

Review the schema.yaml file generated as output of the following command, and make sure all data types are set correctly and match the data expected from the query.

./bqetl query schema update <dataset>.<table>\n

For our example:

./bqetl query schema update org_mozilla_mozregression_derived.mozregression_aggregates_v1\n

"},{"location":"cookbooks/creating_a_derived_dataset/#creating-a-dag","title":"Creating a DAG","text":"

BigQuery-ETL has some facilities in it to automatically add your query to telemetry-airflow (our instance of Airflow).

Before scheduling your query, you'll need to find an Airflow DAG to run it off of. In some cases, one may already exist that makes sense to use for your dataset -- look in dags.yaml at the root or run ./bqetl dag info. In this particular case, there's no DAG that really makes sense -- so we'll create a new one:

./bqetl dag create <dag_name> --schedule-interval \"0 4 * * *\" --owner <email_for_notifications> --description \"Add a clear description of the DAG here\" --start-date <YYYY-MM-DD> --tag impact/<tier>\n

For our example, the starting date is 2020-06-01 and we use a schedule interval of 0 4 \\* \\* \\* (4am UTC daily) instead of \"daily\" (12am UTC daily) to make sure this isn't competing for slots with desktop and mobile product ETL.

The --tag impact/tier3 parameter specifies that this DAG is considered \"tier 3\". For a list of valid tags and their descriptions see Airflow Tags.

When creating a new DAG that is still under active development and expected to fail during this phase, the DAG can be tagged as --tag triage/no_triage. That way it will be ignored by the person on Airflow Triage. Once active development is done, the triage/no_triage tag can be removed and problems will be addressed during the Airflow Triage process.

./bqetl dag create bqetl_internal_tooling --schedule-interval \"0 4 * * *\" --owner wlachance@mozilla.com --description \"This DAG schedules queries for populating queries related to Mozilla's internal developer tooling (e.g. mozregression).\" --start-date 2020-06-01 --tag impact/tier_3\n
"},{"location":"cookbooks/creating_a_derived_dataset/#scheduling-your-query","title":"Scheduling your query","text":"

Queries are automatically scheduled during creation in the DAG set using the option --dag, or in the default DAG bqetl_default when this option is not used.

If the query was created with --no-schedule, it is possible to manually schedule the query via the bqetl tool:

./bqetl query schedule <dataset>.<table> --dag <dag_name> --task-name <task_name>\n

Here is the command for our example. Notice the name of the table as created with the suffix _v1.

./bqetl query schedule org_mozilla_mozregression_derived.mozregression_aggregates_v1 --dag bqetl_internal_tooling --task-name mozregression_aggregates__v1\n

Note that we are scheduling the generation of the underlying table which is org_mozilla_mozregression_derived.mozregression_aggregates_v1 rather than the view.

"},{"location":"cookbooks/creating_a_derived_dataset/#get-data-review","title":"Get Data Review","text":"

This is for public datasets only! You can skip this step if you're only creating a dataset for Mozilla-internal use.

Before a dataset can be made public, it needs to go through data review according to our data publishing process. This means filing a bug, answering a few questions, and then finding a data steward to review your proposal.

The dataset we're using in this example is very simple and straightforward and does not have any particularly sensitive data, so the data review is very simple. You can see the full details in bug 1691105.

"},{"location":"cookbooks/creating_a_derived_dataset/#create-a-pull-request","title":"Create a Pull Request","text":"

Now is a good time to create a pull request with your changes to GitHub. This is the usual git workflow:

git checkout -b <new_branch_name>\ngit add dags.yaml dags/<dag_name>.py sql/moz-fx-data-shared-prod/telemetry/<view> sql/moz-fx-data-shared-prod/<dataset>/<table>\ngit commit\ngit push origin <new_branch_name>\n

And next is the workflow for our specific example:

git checkout -b mozregression-aggregates\ngit add dags.yaml dags/bqetl_internal_tooling.py sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1\ngit commit\ngit push origin mozregression-aggregates\n

Then create your pull request, either from the GitHub web interface or the command line, per your preference.

Note At this point, the CI is expected to fail because the schema does not exist yet in BigQuery. This will be handled in the next step.

This example assumes that origin points to your fork. Adjust the last push invocation appropriately if you have a different remote set.

Speaking of forks, note that if you're making this pull request from a fork, many jobs will currently fail due to lack of credentials. In fact, even if you're pushing to the origin, you'll get failures because the table is not yet created. That brings us to the next step, but before going further it's generally best to get someone to review your work: at this point we have more than enough for people to provide good feedback on.

"},{"location":"cookbooks/creating_a_derived_dataset/#creating-an-initial-table","title":"Creating an initial table","text":"

Once the PR has been approved, deploy the schema to BigQuery using this bqetl command:

./bqetl query schema deploy <schema>.<table>\n

For our example:

./bqetl query schema deploy org_mozilla_mozregression_derived.mozregression_aggregates_v1\n

"},{"location":"cookbooks/creating_a_derived_dataset/#backfilling-a-table","title":"Backfilling a table","text":"

Note For large sets of data, follow the recommended practices for backfills.

"},{"location":"cookbooks/creating_a_derived_dataset/#initiating-the-backfill","title":"Initiating the backfill:","text":"
  1. Create a backfill schedule entry to (re)-process data in your table:

    bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD>\n
    bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD> --shredder_mitigation\n
  2. Fill out the missing details:

  3. Open a Pull Request with the backfill entry, see this example. Once merged, you should receive a notification in around an hour that processing has started. Your backfill data will be temporarily placed in a staging location.

  4. Watchers need to join the #dataops-alerts Slack channel. They will be notified via Slack when processing is complete, at which point the backfill data can be validated.

"},{"location":"cookbooks/creating_a_derived_dataset/#completing-the-backfill","title":"Completing the backfill:","text":"
  1. Validate that the backfill data looks like what you expect (calculate important metrics, look for nulls, etc.); a minimal validation sketch is shown after this list.

  2. If the data is valid, open a Pull Request, setting the backfill status to Complete, see this example. Once merged, you should receive a notification in around an hour that swapping has started. Current production data will be backed up and the staging backfill data will be swapped into production.

  3. You will be notified when swapping is complete.
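
As a rough starting point for step 1, a sketch like the following compares per-day row counts and looks for unexpected NULLs in the staging data. The staging location placeholder and the client_id column are assumptions; substitute the staging destination from your backfill notification and the columns that matter for your table.

SELECT\n  submission_date,\n  COUNT(*) AS row_count,\n  COUNTIF(client_id IS NULL) AS null_client_ids  -- client_id is an assumed column; use your table's key columns\nFROM\n  `<staging_project>.<dataset>.<table>`  -- hypothetical staging destination from the backfill notification\nGROUP BY\n  submission_date\nORDER BY\n  submission_date\n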

Note: If your backfill is complex (e.g. backfill validation fails), it is recommended to talk to someone in Data Engineering or Data SRE (#data-help) about processing the backfill via the backfill DAG.

"},{"location":"cookbooks/creating_a_derived_dataset/#completing-the-pull-request","title":"Completing the Pull Request","text":"

At this point, the table exists in BigQuery, so you are able to: - Find and re-run the CI of your PR and make sure that all tests pass. - Merge your PR.

"},{"location":"cookbooks/testing/","title":"How to Run Tests","text":"

This repository uses pytest:

# create a venv\npython3.11 -m venv venv/\n\n# install pip-tools for managing dependencies\n./venv/bin/pip install pip-tools -c requirements.in\n\n# install python dependencies with pip-sync (provided by pip-tools)\n./venv/bin/pip-sync --pip-args=--no-deps requirements.txt\n\n# run pytest with all linters and 8 workers in parallel\n./venv/bin/pytest --black --flake8 --isort --mypy-ignore-missing-imports --pydocstyle -n 8\n\n# use -k to selectively run a set of tests that matches the expression `udf`\n./venv/bin/pytest -k udf\n\n# narrow down testpaths for quicker turnaround when selecting a single test\n./venv/bin/pytest -o \"testpaths=tests/sql\" -k mobile_search_aggregates_v1\n\n# run integration tests with 4 workers in parallel\ngcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS\nexport GOOGLE_PROJECT_ID=bigquery-etl-integration-test\ngcloud config set project $GOOGLE_PROJECT_ID\n./venv/bin/pytest -m integration -n 4\n

To provide authentication credentials for the Google Cloud API the GOOGLE_APPLICATION_CREDENTIALS environment variable must be set to the file path of the JSON file that contains the service account key. See Mozilla BigQuery API Access instructions to request credentials if you don't already have them.

"},{"location":"cookbooks/testing/#how-to-configure-a-udf-test","title":"How to Configure a UDF Test","text":"

Include a comment like -- Tests followed by one or more query statements after the UDF in the SQL file where it is defined. Each statement in such a SQL file that does not define a temporary function is collected as a test and executed independently of the other tests in the file.

Each test must use the UDF and throw an error to fail. Assert functions defined in sql/mozfun/assert/ may be used to evaluate outputs. Tests must not use any query parameters and should not reference any tables. Each test that is expected to fail must be preceded by a comment like #xfail, similar to a SQL dialect prefix in the BigQuery Cloud Console.

For example:

CREATE TEMP FUNCTION udf_example(option INT64) AS (\n  CASE\n  WHEN option > 0 then TRUE\n  WHEN option = 0 then FALSE\n  ELSE ERROR(\"invalid option\")\n  END\n);\n-- Tests\nSELECT\n  mozfun.assert.true(udf_example(1)),\n  mozfun.assert.false(udf_example(0));\n#xfail\nSELECT\n  udf_example(-1);\n#xfail\nSELECT\n  udf_example(NULL);\n
"},{"location":"cookbooks/testing/#how-to-configure-a-generated-test","title":"How to Configure a Generated Test","text":"

Queries are tested by running the query.sql with test-input tables and comparing the result to an expected table.

1. Make a directory for test resources named tests/sql/{project}/{dataset}/{table}/{test_name}/, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/test_single_day
    - table must match a directory named like {dataset}/{table}, e.g. telemetry_derived/clients_last_seen_v1
    - test_name should start with test_, e.g. test_single_day
    - If test_name is test_init or test_script, then the query will be run with is_init() set to true, or script.sql will be run, respectively; otherwise, the test will run query.sql
2. Add .yaml files for input tables, e.g. clients_daily_v6.yaml
    - Include the dataset prefix if it's set in the tested query, e.g. analysis.clients_last_seen_v1.yaml
    - Include the project prefix if it's set in the tested query, e.g. moz-fx-other-data.new_dataset.table_1.yaml
    - This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"analysis.clients_last_seen_v1\", \"clients_last_seen_v1\")
3. Add .sql files for input view queries, e.g. main_summary_v4.sql
    - Don't include a CREATE ... AS clause
    - Fully qualify table names as `{project}.{dataset}.table`
    - Include the dataset prefix if it's set in the tested query, e.g. telemetry.main_summary_v4.sql
    - This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"telemetry.main_summary_v4\", \"main_summary_v4\")
4. Add expect.yaml to validate the result
    - DATE and DATETIME type columns in the result are coerced to strings using .isoformat()
    - Columns named generated_time are removed from the result before comparing to expect because they should not be static
    - NULL values should be omitted in expect.yaml; if a column is expected to be NULL, don't add it to expect.yaml (be careful with spreading previous rows (-<<: *base) here)
5. Optionally add .schema.json files for input table schemas to the table directory, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/clients_daily_v6.schema.json. These tables will be available for every test in the suite. The schema.json file needs to match the table name in the query.sql file; if the table name there includes a project and dataset, the schema file name also needs the project and dataset.
6. Optionally add query_params.yaml to define query parameters
    - query_params must be a list

"},{"location":"cookbooks/testing/#init-tests","title":"Init Tests","text":"

Tests of is_init() statements are supported, similarly to other generated tests. Simply name the test test_init. The other guidelines still apply.

"},{"location":"cookbooks/testing/#additional-guidelines-and-options","title":"Additional Guidelines and Options","text":""},{"location":"cookbooks/testing/#how-to-run-circleci-locally","title":"How to Run CircleCI Locally","text":"
gcloud_service_key=`cat /path/to/key_file.json`\n\n# to run a specific job, e.g. integration:\ncircleci build --job integration \\\n  --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n  --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n\n# to run all jobs\ncircleci build \\\n  --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n  --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n
"},{"location":"moz-fx-data-shared-prod/udf/","title":"Udf","text":""},{"location":"moz-fx-data-shared-prod/udf/#active_n_weeks_ago-udf","title":"active_n_weeks_ago (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters","title":"Parameters","text":"

INPUTS

x INT64, n INT64\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#active_values_from_days_seen_map-udf","title":"active_values_from_days_seen_map (UDF)","text":"

Given a map representing activity for STRING keys, this function returns an array of which keys were active for the time period in question. start_offset should be at most 0. n_bits should be at most the remaining bits.
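
A minimal illustrative query, assuming bits28-style offsets where (-6, 7) covers the most recent 7 days; the keys and bit values here are made up:

SELECT\n  udf.active_values_from_days_seen_map(\n    [STRUCT('sync' AS key, 1 AS value), STRUCT('pocket' AS key, 0 AS value)],\n    -6,\n    7\n  ) AS active_keys  -- 'sync' (bit 0 set) should be returned; 'pocket' (no bits set) should not\n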

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_1","title":"Parameters","text":"

INPUTS

days_seen_bits_map ARRAY<STRUCT<key STRING, value INT64>>, start_offset INT64, n_bits INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_engine_searches-udf","title":"add_monthly_engine_searches (UDF)","text":"

This function specifically windows searches into calendar-month windows. This means groups are not necessarily directly comparable, since different months have different numbers of days. On the first of each month, a new month is appended, and the first month is dropped. If the date is not the first of the month, the new entry is added to the last element in the array. For example, if we were adding 12 to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]: On the first of the month, the result would be [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12] On any other day of the month, the result would be [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 24] This happens for every aggregate (searches, ad clicks, etc.)

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_2","title":"Parameters","text":"

INPUTS

prev STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, curr STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, submission_date DATE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_searches-udf","title":"add_monthly_searches (UDF)","text":"

Adds together two engine searches structs. Each engine searches struct has a MAP[engine -> search_counts_struct]. We want to add together prev's and curr's values for a certain engine. This allows us to be flexible with the number of engines we're using.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_3","title":"Parameters","text":"

INPUTS

prev ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, curr ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, submission_date DATE\n

OUTPUTS

value\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#add_searches_by_index-udf","title":"add_searches_by_index (UDF)","text":"

Return sums of each search type grouped by the index. Results are ordered by index.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_4","title":"Parameters","text":"

INPUTS

searches ARRAY<STRUCT<total_searches INT64, tagged_searches INT64, search_with_ads INT64, ad_click INT64, index INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_active_addons-udf","title":"aggregate_active_addons (UDF)","text":"

This function selects the most frequently occurring value for each addon_id, using the latest value in the input among ties. The type for active_addons is an ARRAY of STRUCTs, i.e. the output of SELECT ARRAY_CONCAT_AGG(active_addons) FROM telemetry.main_summary_v4, and is left unspecified to allow changes to the fields of the STRUCT."},{"location":"moz-fx-data-shared-prod/udf/#parameters_5","title":"Parameters","text":"

INPUTS

active_addons ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_map_first-udf","title":"aggregate_map_first (UDF)","text":"

Returns an aggregated map with all the keys and the first corresponding value from the given maps

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_6","title":"Parameters","text":"

INPUTS

maps ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_counts-udf","title":"aggregate_search_counts (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_7","title":"Parameters","text":"

INPUTS

search_counts ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_map-udf","title":"aggregate_search_map (UDF)","text":"

Aggregates the total counts of the given search counters

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_8","title":"Parameters","text":"

INPUTS

engine_searches_list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#array_11_zeroes_then-udf","title":"array_11_zeroes_then (UDF)","text":"

An array of 11 zeroes, followed by a supplied value

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_9","title":"Parameters","text":"

INPUTS

val INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#array_drop_first_and_append-udf","title":"array_drop_first_and_append (UDF)","text":"

Drop the first element of an array, and append the given element. Result is an array with the same length as the input.
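
A quick example based on the behaviour described above:

SELECT\n  udf.array_drop_first_and_append([1, 2, 3], 4) AS arr  -- expected: [2, 3, 4]\n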

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_10","title":"Parameters","text":"

INPUTS

arr ANY TYPE, append ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#array_of_12_zeroes-udf","title":"array_of_12_zeroes (UDF)","text":"

An array of 12 zeroes

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_11","title":"Parameters","text":"

INPUTS

) AS ( [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#array_slice-udf","title":"array_slice (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_12","title":"Parameters","text":"

INPUTS

arr ANY TYPE, start_index INT64, end_index INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitcount_lowest_7-udf","title":"bitcount_lowest_7 (UDF)","text":"

This function counts the 1s in lowest 7 bits of an INT64

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_13","title":"Parameters","text":"

INPUTS

x INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_365-udf","title":"bitmask_365 (UDF)","text":"

A bitmask for 365 bits

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_14","title":"Parameters","text":"

INPUTS

) AS ( CONCAT(b'\\x1F', REPEAT(b'\\xFF', 45\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_28-udf","title":"bitmask_lowest_28 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_15","title":"Parameters","text":"

INPUTS

) AS ( 0x0FFFFFFF\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_7-udf","title":"bitmask_lowest_7 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_16","title":"Parameters","text":"

INPUTS

) AS ( 0x7F\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_range-udf","title":"bitmask_range (UDF)","text":"

Returns a bitmask that can be used to return a subset of an integer representing a bit array. The start_ordinal argument is an integer specifying the starting position of the slice, with start_ordinal = 1 indicating the first bit. The length argument is the number of bits to include in the mask. The arguments were chosen to match the semantics of the SUBSTR function; see https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#substr

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_17","title":"Parameters","text":"

INPUTS

start_ordinal INT64, _length INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_active_in_range-udf","title":"bits28_active_in_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_18","title":"Parameters","text":"

INPUTS

bits INT64, start_offset INT64, n_bits INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_days_since_seen-udf","title":"bits28_days_since_seen (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_19","title":"Parameters","text":"

INPUTS

bits INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_from_string-udf","title":"bits28_from_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_20","title":"Parameters","text":"

INPUTS

s STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_range-udf","title":"bits28_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_21","title":"Parameters","text":"

INPUTS

bits INT64, start_offset INT64, n_bits INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_retention-udf","title":"bits28_retention (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_22","title":"Parameters","text":"

INPUTS

bits INT64, submission_date DATE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_dates-udf","title":"bits28_to_dates (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_23","title":"Parameters","text":"

INPUTS

bits INT64, submission_date DATE\n

OUTPUTS

ARRAY<DATE>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_string-udf","title":"bits28_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_24","title":"Parameters","text":"

INPUTS

bits INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits_from_offsets-udf","title":"bits_from_offsets (UDF)","text":"

Returns a bit pattern of type BYTES compactly encoding the given array of positive integer offsets. This is primarily useful to generate a compact encoding of dates on which a feature was used, with arbitrarily long history. Example aggregation: bits_from_offsets( ARRAY_AGG(IF(foo, DATE_DIFF(anchor_date, submission_date, DAY), NULL) IGNORE NULLS) ) The resulting value can be cast to an INT64 representing the most recent 64 days via: CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64) Or representing the most recent 28 days (compatible with bits28 functions) via: CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64) << 36 >> 36

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_25","title":"Parameters","text":"

INPUTS

offsets ARRAY<INT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_active_n_weeks_ago-udf","title":"bits_to_active_n_weeks_ago (UDF)","text":"

Given a BYTE and an INT64, return whether the user was active that many weeks ago. NULL input returns NULL output.
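
A small sketch, assuming week 0 is the most recent week:

SELECT\n  udf.bits_to_active_n_weeks_ago(b'\\x01', 0) AS active_in_most_recent_week  -- bit 0 set, so this should be TRUE\n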

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_26","title":"Parameters","text":"

INPUTS

b BYTES, n INT64\n

OUTPUTS

BOOL\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_seen-udf","title":"bits_to_days_seen (UDF)","text":"

Given a BYTE, get the number of days the user was seen. NULL input returns NULL output.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_27","title":"Parameters","text":"

INPUTS

b BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_first_seen-udf","title":"bits_to_days_since_first_seen (UDF)","text":"

Given a BYTES, return the number of days since the client was first seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the result is 0-indexed, meaning that for \\x01, it will return 0. Results showed this being between 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_first_seen(b BYTES) AS (( SELECT MAX(n) FROM UNNEST(GENERATE_ARRAY( 0, 8 * BYTE_LENGTH(b))) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_seen.sql

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_28","title":"Parameters","text":"

INPUTS

b BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_seen-udf","title":"bits_to_days_since_seen (UDF)","text":"

Given a BYTES, return the number of days since the client was last seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the results are 0-indexed, meaning \\x01 will return 0. Tests showed this being 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_seen(b BYTES) AS (( SELECT MIN(n) FROM UNNEST(GENERATE_ARRAY(0, 364)) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_first_seen.sql
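
Two small examples that follow directly from the description above:

SELECT\n  udf.bits_to_days_since_seen(b'\\x01') AS most_recent_day,  -- lowest bit set: returns 0\n  udf.bits_to_days_since_seen(b'\\x00') AS never_seen  -- no bits set: returns NULL\n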

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_29","title":"Parameters","text":"

INPUTS

b BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bool_to_365_bits-udf","title":"bool_to_365_bits (UDF)","text":"

Convert a boolean to 365 bit byte array

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_30","title":"Parameters","text":"

INPUTS

val BOOLEAN\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#boolean_histogram_to_boolean-udf","title":"boolean_histogram_to_boolean (UDF)","text":"

Given histogram h, return TRUE if it has a value in the \"true\" bucket, or FALSE if it has a value in the \"false\" bucket, or NULL otherwise. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L309-L317

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_31","title":"Parameters","text":"

INPUTS

histogram STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_28_bits-udf","title":"coalesce_adjacent_days_28_bits (UDF)","text":"

We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bits representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_32","title":"Parameters","text":"

INPUTS

prev INT64, curr INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_365_bits-udf","title":"coalesce_adjacent_days_365_bits (UDF)","text":"

Coalesce previous data's PCD with the new data's PCD. We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bytes representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_33","title":"Parameters","text":"

INPUTS

prev BYTES, curr BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_28_bits-udf","title":"combine_adjacent_days_28_bits (UDF)","text":"

Combines two bit patterns. The first pattern represents activity over a 28-day period ending \"yesterday\". The second pattern represents activity as observed today (usually just 0 or 1). We shift the bits in the first pattern by one to set the new baseline as \"today\", then perform a bitwise OR of the two patterns.
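
A hedged sketch of the typical incremental pattern; the CTEs and literal bit values below are made up and are not the actual clients_last_seen query:

WITH previous_day AS (\n  SELECT 'client-a' AS client_id, 3 AS days_seen_bits  -- active on the two most recent days of the prior window\n),\ncurrent_day AS (\n  SELECT 'client-a' AS client_id, 1 AS days_seen_bits  -- active today\n)\nSELECT\n  client_id,\n  -- shift yesterday's pattern by one day, then OR in today's bit: (3 << 1) | 1 = 7\n  udf.combine_adjacent_days_28_bits(prev.days_seen_bits, curr.days_seen_bits) AS days_seen_bits\nFROM\n  previous_day AS prev\nJOIN\n  current_day AS curr\nUSING\n  (client_id)\n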

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_34","title":"Parameters","text":"

INPUTS

prev INT64, curr INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_365_bits-udf","title":"combine_adjacent_days_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_35","title":"Parameters","text":"

INPUTS

prev BYTES, curr BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#combine_days_seen_maps-udf","title":"combine_days_seen_maps (UDF)","text":"

The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs (aka maps) where each entry gives the bit pattern for days in which we saw a ping for a given user in a given key. We combine the bit patterns for the previous day and the current day, returning a single map. See udf.combine_experiment_days for a more specific example of this approach.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_36","title":"Parameters","text":"

INPUTS

-- prev ARRAY<STRUCT<key STRING, value INT64>>, -- curr ARRAY<STRUCT<key STRING, value INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#combine_experiment_days-udf","title":"combine_experiment_days (UDF)","text":"

The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs where each entry gives the bit pattern for days in which we saw a ping for a given user in a given experiment. We combine the bit patterns for the previous day and the current day, returning a single array of experiment structs.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_37","title":"Parameters","text":"

INPUTS

-- prev ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>, -- curr ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#country_code_to_flag-udf","title":"country_code_to_flag (UDF)","text":"

For a given two-letter ISO 3166-1 alpha-2 country code, returns a string consisting of two Unicode regional indicator symbols, which is rendered in supporting fonts (such as in the BigQuery console or STMO) as flag emoji. This is just for fun. See: - https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 - https://en.wikipedia.org/wiki/Regional_Indicator_Symbol
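
For example:

SELECT\n  udf.country_code_to_flag('DE') AS flag  -- two regional indicator symbols that render as the German flag emoji\n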

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_38","title":"Parameters","text":"

INPUTS

country_code string\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#days_seen_bytes_to_rfm-udf","title":"days_seen_bytes_to_rfm (UDF)","text":"

Return the frequency, recency, and T from a BYTE array, as defined in https://lifetimes.readthedocs.io/en/latest/Quickstart.html#the-shape-of-your-data RFM refers to Recency, Frequency, and Monetary value.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_39","title":"Parameters","text":"

INPUTS

days_seen_bytes BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#days_since_created_profile_as_28_bits-udf","title":"days_since_created_profile_as_28_bits (UDF)","text":"

Takes in a difference between submission date and profile creation date and returns a bit pattern representing the profile creation date IFF the profile date is the same as the submission date or no more than 6 days earlier. Analysis has shown that client-reported profile creation dates are much less reliable outside of this range and cannot be used as reliable indicators of new profile creation.
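
A small sketch using literal dates; the dates are made up, with the profile creation date falling 2 days before submission, inside the accepted 0-6 day window:

SELECT\n  udf.days_since_created_profile_as_28_bits(\n    DATE_DIFF(DATE '2024-01-10', DATE '2024-01-08', DAY)  -- 2 days between profile creation and submission\n  ) AS new_profile_bits\n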

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_40","title":"Parameters","text":"

INPUTS

days_since_created_profile INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#deanonymize_event-udf","title":"deanonymize_event (UDF)","text":"

Rename struct fields in anonymous event tuples to meaningful names.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_41","title":"Parameters","text":"

INPUTS

tuple STRUCT<f0_ INT64, f1_ STRING, f2_ STRING, f3_ STRING, f4_ STRING, f5_ ARRAY<STRUCT<key STRING, value STRING>>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#decode_int64-udf","title":"decode_int64 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_42","title":"Parameters","text":"

INPUTS

raw BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#dedupe_array-udf","title":"dedupe_array (UDF)","text":"

Return an array containing only distinct values of the given array
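
A quick example:

SELECT\n  udf.dedupe_array([1, 1, 2, 3, 3]) AS deduped  -- contains 1, 2 and 3 exactly once (ordering is not specified here)\n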

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_43","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_clients-udf","title":"distribution_model_clients (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_44","title":"Parameters","text":"

INPUTS

distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_ga_metrics-udf","title":"distribution_model_ga_metrics (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_45","title":"Parameters","text":"

INPUTS

) RETURNS STRING AS ( 'helloworld'\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_installs-udf","title":"distribution_model_installs (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_46","title":"Parameters","text":"

INPUTS

distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#event_code_points_to_string-udf","title":"event_code_points_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_47","title":"Parameters","text":"

INPUTS

code_points ANY TYPE\n

OUTPUTS

ARRAY<INT64>\n
"},{"location":"moz-fx-data-shared-prod/udf/#experiment_search_metric_to_array-udf","title":"experiment_search_metric_to_array (UDF)","text":"

Used for testing only. Reproduces the string transformations done in experiment_search_events_live_v1 materialized views.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_48","title":"Parameters","text":"

INPUTS

metric ARRAY<STRUCT<key STRING, value INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_count_histogram_value-udf","title":"extract_count_histogram_value (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_49","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_type-udf","title":"extract_document_type (UDF)","text":"

Extract the document type from a table name e.g. _TABLE_SUFFIX.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_50","title":"Parameters","text":"

INPUTS

table_name STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_version-udf","title":"extract_document_version (UDF)","text":"

Extract the document version from a table name e.g. _TABLE_SUFFIX.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_51","title":"Parameters","text":"

INPUTS

table_name STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"

This is a performance optimization compared to the more general mozfun.hist.extract for cases where only the histogram sum is needed. It must support all the same format variants as mozfun.hist.extract but this simplification is necessary to keep the main_summary query complexity in check.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_52","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_schema_validation_path-udf","title":"extract_schema_validation_path (UDF)","text":"

Return a path derived from an error message in payload_bytes_error

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_53","title":"Parameters","text":"

INPUTS

error_message STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"

Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.

Fenix originally used an 8-digit app_build format.

In short it is yDDDHHmm: the last digit of the year, the day of the year, the hour, and the minute.

The last date seen with an 8-digit build ID is 2020-08-10.

Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.

This function tolerates both formats.

After using this you may wish to DATETIME_TRUNC(result, DAY) for grouping by build date.
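
A hedged sketch of that grouping pattern; the dataset and table placeholders are stand-ins for an actual Fenix ping table:

SELECT\n  DATETIME_TRUNC(udf.fenix_build_to_datetime(client_info.app_build), DAY) AS build_date,\n  COUNT(*) AS pings\nFROM\n  `moz-fx-data-shared-prod.<fenix_dataset>.<ping_table>`  -- hypothetical Fenix ping table with a Glean client_info struct\nGROUP BY\n  build_date\nORDER BY\n  build_date\n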

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_54","title":"Parameters","text":"

INPUTS

app_build STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_clients-udf","title":"funnel_derived_clients (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_55","title":"Parameters","text":"

INPUTS

os STRING, first_seen_date DATE, build_id STRING, attribution_source STRING, attribution_ua STRING, startup_profile_selection_reason STRING, distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_ga_metrics-udf","title":"funnel_derived_ga_metrics (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_56","title":"Parameters","text":"

INPUTS

device_category STRING, browser STRING, operating_system STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_installs-udf","title":"funnel_derived_installs (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_57","title":"Parameters","text":"

INPUTS

silent BOOLEAN, submission_timestamp TIMESTAMP, build_id STRING, attribution STRING, distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#ga_is_mozilla_browser-udf","title":"ga_is_mozilla_browser (UDF)","text":"

Determine if a browser in a Google Analytics data is produced by Mozilla

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_58","title":"Parameters","text":"

INPUTS

browser STRING\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct-udf","title":"geo_struct (UDF)","text":"

Convert geoip lookup fields to a struct, replacing '??' with NULL. Returns NULL if the required field country would be NULL. Replaces '??' with NULL because '??' is a placeholder that may be used if there was an issue during geoip lookup.
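
A hedged usage sketch; the metadata.geo.* field paths and the table placeholder are assumptions based on typical ping metadata:

SELECT\n  udf.geo_struct(\n    metadata.geo.country,\n    metadata.geo.city,\n    metadata.geo.subdivision1,\n    metadata.geo.subdivision2\n  ) AS geo\nFROM\n  `<project>.<dataset>.<ping_table>`  -- hypothetical ping table\n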

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_59","title":"Parameters","text":"

INPUTS

country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct_set_defaults-udf","title":"geo_struct_set_defaults (UDF)","text":"

Convert geoip lookup fields to a struct, replacing NULLs with \"??\". This allows for better joins on those fields, but needs to be changed back to NULL at the end of the query.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_60","title":"Parameters","text":"

INPUTS

country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#get_key-udf","title":"get_key (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_61","title":"Parameters","text":"

INPUTS

map ANY TYPE, k ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_62","title":"Parameters","text":"

INPUTS

map ANY TYPE, k ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_nanos-udf","title":"glean_timespan_nanos (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_63","title":"Parameters","text":"

INPUTS

timespan STRUCT<time_unit STRING, value INT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_seconds-udf","title":"glean_timespan_seconds (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_64","title":"Parameters","text":"

INPUTS

timespan STRUCT<time_unit STRING, value INT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#gzip_length_footer-udf","title":"gzip_length_footer (UDF)","text":"

Given a gzip compressed byte string, extract the uncompressed size from the footer. WARNING: THIS FUNCTION IS NOT RELIABLE FOR ARBITRARY GZIP STREAMS. It should, however, be safe to use for checking the decompressed size of payload in payload_bytes_decoded (and NOT payload_bytes_raw) because that payload is produced by the decoder and limited to conditions where the footer is accurate. From https://stackoverflow.com/a/9213826 First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.) Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.) Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_65","title":"Parameters","text":"

INPUTS

compressed BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_max_key_with_nonzero_value-udf","title":"histogram_max_key_with_nonzero_value (UDF)","text":"

Find the largest numeric bucket that contains a value greater than zero. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L253-L266

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_66","title":"Parameters","text":"

INPUTS

histogram STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_merge-udf","title":"histogram_merge (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_67","title":"Parameters","text":"

INPUTS

histogram_list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_normalize-udf","title":"histogram_normalize (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_68","title":"Parameters","text":"

INPUTS

histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n

OUTPUTS

STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_percentiles-udf","title":"histogram_percentiles (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_69","title":"Parameters","text":"

INPUTS

histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_mean-udf","title":"histogram_to_mean (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_70","title":"Parameters","text":"

INPUTS

histogram ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_threshold_count-udf","title":"histogram_to_threshold_count (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_71","title":"Parameters","text":"

INPUTS

histogram STRING, threshold INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#hmac_sha256-udf","title":"hmac_sha256 (UDF)","text":"

Given a key and message, return the HMAC-SHA256 hash. This algorithm can be found in Wikipedia: https://en.wikipedia.org/wiki/HMAC#Implementation This implementation is validated against the NIST test vectors. See test/validation/hmac_sha256.py for more information.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_72","title":"Parameters","text":"

INPUTS

key BYTES, message BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#int_to_365_bits-udf","title":"int_to_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_73","title":"Parameters","text":"

INPUTS

value INT64\n

OUTPUTS

BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#int_to_hex_string-udf","title":"int_to_hex_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_74","title":"Parameters","text":"

INPUTS

value INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_75","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_int_map-udf","title":"json_extract_int_map (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_76","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#json_mode_last-udf","title":"json_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_77","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#keyed_histogram_get_sum-udf","title":"keyed_histogram_get_sum (UDF)","text":"

Take a keyed histogram of type STRUCT, extract the histogram of the given key, and return the sum value"},{"location":"moz-fx-data-shared-prod/udf/#parameters_78","title":"Parameters","text":"

INPUTS

keyed_histogram ANY TYPE, target_key STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_append_to_json_string-udf","title":"kv_array_append_to_json_string (UDF)","text":"

Returns a JSON string which has the pair appended to the provided input JSON string. NULL is also valid for input. Examples: udf.kv_array_append_to_json_string('{\"foo\":\"bar\"}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"foo\":\"bar\",\"baz\":\"boo\"}' udf.kv_array_append_to_json_string('{}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"baz\": \"boo\"}'

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_79","title":"Parameters","text":"

INPUTS

input STRING, arr ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_to_json_string-udf","title":"kv_array_to_json_string (UDF)","text":"

Returns a JSON string representing the input key-value array. Value type must be able to be represented as a string - this function will cast to a string. At Mozilla, the schema for a map is a STRUCT containing a key_value ARRAY of key/value STRUCTs. To use this with that representation, it should be as udf.kv_array_to_json_string(struct.key_value)."},{"location":"moz-fx-data-shared-prod/udf/#parameters_80","title":"Parameters","text":"

INPUTS

kv_arr ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#main_summary_scalars-udf","title":"main_summary_scalars (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_81","title":"Parameters","text":"

INPUTS

processes ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#map_bing_revenue_country_to_country_code-udf","title":"map_bing_revenue_country_to_country_code (UDF)","text":"

For use by LTV revenue join only. Maps the Bing country to a country code. Only keeps the country codes we want to aggregate on.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_82","title":"Parameters","text":"

INPUTS

country STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#map_mode_last-udf","title":"map_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_83","title":"Parameters","text":"

INPUTS

entries ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#map_revenue_country-udf","title":"map_revenue_country (UDF)","text":"

Only for use by the LTV Revenue join. Maps country codes to the codes we have in the revenue dataset. Buckets small Bing countries into \"other\".

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_84","title":"Parameters","text":"

INPUTS

engine STRING, country STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#map_sum-udf","title":"map_sum (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_85","title":"Parameters","text":"

INPUTS

entries ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#marketing_attributable_desktop-udf","title":"marketing_attributable_desktop (UDF)","text":"

This is a UDF to help distinguish if acquired desktop clients are attributable to marketing efforts or not

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_86","title":"Parameters","text":"

INPUTS

medium STRING\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#merge_scalar_user_data-udf","title":"merge_scalar_user_data (UDF)","text":"

Given an array of scalar metric data that might have duplicate values for a metric, merge them into one value.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_87","title":"Parameters","text":"

INPUTS

aggs ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n

OUTPUTS

ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#mod_uint128-udf","title":"mod_uint128 (UDF)","text":"

This function returns \"dividend mod divisor\" where the dividend and the result are encoded in bytes, and divisor is an integer.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_88","title":"Parameters","text":"

INPUTS

dividend BYTES, divisor INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#mode_last-udf","title":"mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_89","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_90","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#monetized_search-udf","title":"monetized_search (UDF)","text":"

Stub monetized_search UDF for tests

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_91","title":"Parameters","text":"

INPUTS

engine STRING, country STRING, distribution_id STRING, submission_date DATE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#new_monthly_engine_searches_struct-udf","title":"new_monthly_engine_searches_struct (UDF)","text":"

This struct represents the past year's worth of searches. Each month has its own entry, hence 12.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_92","title":"Parameters","text":"

INPUTS

) AS ( STRUCT( udf.array_of_12_zeroes(\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_fenix_metrics-udf","title":"normalize_fenix_metrics (UDF)","text":"

Accepts a glean metrics struct as input and returns a modified struct that nulls out histograms for older versions of the Glean SDK that reported pathological binning; see https://bugzilla.mozilla.org/show_bug.cgi?id=1592930

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_93","title":"Parameters","text":"

INPUTS

telemetry_sdk_build STRING, metrics ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_baseline_client_info-udf","title":"normalize_glean_baseline_client_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_94","title":"Parameters","text":"

INPUTS

client_info ANY TYPE, metrics ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_ping_info-udf","title":"normalize_glean_ping_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_95","title":"Parameters","text":"

INPUTS

ping_info ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_main_payload-udf","title":"normalize_main_payload (UDF)","text":"

Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_96","title":"Parameters","text":"

INPUTS

payload ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_metadata-udf","title":"normalize_metadata (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_97","title":"Parameters","text":"

INPUTS

metadata ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_monthly_searches-udf","title":"normalize_monthly_searches (UDF)","text":"

Sum up the monthly search count arrays by normalized engine

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_98","title":"Parameters","text":"

INPUTS

engine_searches ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_os-udf","title":"normalize_os (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_99","title":"Parameters","text":"

INPUTS

os STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_search_engine-udf","title":"normalize_search_engine (UDF)","text":"

Return normalized engine name for recognized engines This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_100","title":"Parameters","text":"

INPUTS

engine STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#null_if_empty_list-udf","title":"null_if_empty_list (UDF)","text":"

Return NULL if list is empty, otherwise return list. This cannot be done with NULLIF because NULLIF does not support arrays.
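
For example:

SELECT\n  udf.null_if_empty_list(ARRAY<INT64>[]) AS empty_list,  -- expected: NULL\n  udf.null_if_empty_list([1, 2, 3]) AS non_empty_list  -- expected: [1, 2, 3]\n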

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_101","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#one_as_365_bits-udf","title":"one_as_365_bits (UDF)","text":"

One represented as a byte array of 365 bits

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_102","title":"Parameters","text":"

INPUTS

) AS ( CONCAT(REPEAT(b'\\x00', 45\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_desktop-udf","title":"organic_vs_paid_desktop (UDF)","text":"

This is a UDF to help distinguish desktop client attribution as being organic or paid

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_103","title":"Parameters","text":"

INPUTS

medium STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_mobile-udf","title":"organic_vs_paid_mobile (UDF)","text":"

This is a UDF to help distinguish mobile client attribution as being organic or paid

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_104","title":"Parameters","text":"

INPUTS

adjust_network STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#pack_event_properties-udf","title":"pack_event_properties (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_105","title":"Parameters","text":"

INPUTS

event_properties ANY TYPE, indices ANY TYPE\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n
"},{"location":"moz-fx-data-shared-prod/udf/#parquet_array_sum-udf","title":"parquet_array_sum (UDF)","text":"

Sum an array from a parquet-derived field. These are lists of an element that contain the field value.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_106","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#parse_desktop_telemetry_uri-udf","title":"parse_desktop_telemetry_uri (UDF)","text":"

Parses and labels the components of a telemetry desktop ping submission uri Per https://docs.telemetry.mozilla.org/concepts/pipeline/http_edge_spec.html#special-handling-for-firefox-desktop-telemetry the format is /submit/telemetry/docId/docType/appName/appVersion/appUpdateChannel/appBuildID e.g. /submit/telemetry/ce39b608-f595-4c69-b6a6-f7a436604648/main/Firefox/61.0a1/nightly/20180328030202
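
Using the example submission path from the description above:

SELECT\n  udf.parse_desktop_telemetry_uri(\n    '/submit/telemetry/ce39b608-f595-4c69-b6a6-f7a436604648/main/Firefox/61.0a1/nightly/20180328030202'\n  ) AS parsed  -- fields should include document_type 'main', app_name 'Firefox', app_update_channel 'nightly'\n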

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_107","title":"Parameters","text":"

INPUTS

uri STRING\n

OUTPUTS

STRUCT<namespace STRING, document_id STRING, document_type STRING, app_name STRING, app_version STRING, app_update_channel STRING, app_build_id STRING>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#parse_iso8601_date-udf","title":"parse_iso8601_date (UDF)","text":"

Take an ISO 8601 date or date-and-time string and return a DATE. Return NULL if the parse fails. Possible formats: 2019-11-04, 2019-11-04T21:15:00+00:00, 2019-11-04T21:15:00Z, 20191104T211500Z
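
Examples based on the formats listed above:

SELECT\n  udf.parse_iso8601_date('2019-11-04') AS d1,  -- expected: 2019-11-04\n  udf.parse_iso8601_date('2019-11-04T21:15:00Z') AS d2,  -- expected: 2019-11-04\n  udf.parse_iso8601_date('not a date') AS d3  -- expected: NULL (parse failure)\n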

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_108","title":"Parameters","text":"

INPUTS

date_str STRING\n

OUTPUTS

DATE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_clients-udf","title":"partner_org_clients (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_109","title":"Parameters","text":"

INPUTS

distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_ga_metrics-udf","title":"partner_org_ga_metrics (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_110","title":"Parameters","text":"

INPUTS

) RETURNS STRING AS ( (SELECT 'hola_world' AS partner_org\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_installs-udf","title":"partner_org_installs (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_111","title":"Parameters","text":"

INPUTS

distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_leading_set_bit-udf","title":"pos_of_leading_set_bit (UDF)","text":"

Returns the 0-based index of the first set bit. No set bits returns NULL.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_112","title":"Parameters","text":"

INPUTS

i INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_trailing_set_bit-udf","title":"pos_of_trailing_set_bit (UDF)","text":"

Identical to bits28_days_since_seen. Returns a 0-based index of the rightmost set bit in the passed bit pattern or null if no bits are set (bits = 0). To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
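
As an illustrative sketch (assuming the function is referenced as udf.pos_of_trailing_set_bit within this dataset), the bit trick described above can be written directly in SQL; 18 is binary 10010, so its rightmost set bit is at position 1:

SELECT\n  udf.pos_of_trailing_set_bit(18) AS via_udf,                  -- expected 1\n  CAST(LOG(18 & -18, 2) AS INT64) AS via_and_with_complement   -- 1\n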

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_113","title":"Parameters","text":"

INPUTS

bits INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#product_info_with_baseline-udf","title":"product_info_with_baseline (UDF)","text":"

Similar to mozfun.norm.product_info(), but this UDF also handles \"baseline\" apps that were introduced to differentiate, for certain apps, whether data is sent through Glean or core pings. This UDF has been temporarily introduced as part of https://bugzilla.mozilla.org/show_bug.cgi?id=1775216

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_114","title":"Parameters","text":"

INPUTS

legacy_app_name STRING, normalized_os STRING\n

OUTPUTS

STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#pseudonymize_ad_id-udf","title":"pseudonymize_ad_id (UDF)","text":"

Pseudonymize Ad IDs, handling opt-outs.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_115","title":"Parameters","text":"

INPUTS

hashed_ad_id STRING, key BYTES\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#quantile_search_metric_contribution-udf","title":"quantile_search_metric_contribution (UDF)","text":"

This function returns how much of one metric is contributed by the quantile of another metric. The quantile variable should add an offset to get the required percentile value. Example: udf.quantile_search_metric_contribution(sap, search_with_ads, sap_percentiles[OFFSET(9)]) It returns search_with_ads if the sap value is in the top 10% by volume, else NULL.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_116","title":"Parameters","text":"

INPUTS

metric1 FLOAT64, metric2 FLOAT64, quantile FLOAT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#round_timestamp_to_minute-udf","title":"round_timestamp_to_minute (UDF)","text":"

Floor a timestamp object to the given minute interval.
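
For example (a sketch of the expected flooring behavior), rounding down to a 5-minute interval:

SELECT\n  udf.round_timestamp_to_minute(TIMESTAMP '2023-01-01 12:34:56', 5)\n-- expected: 2023-01-01 12:30:00 UTC\n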

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_117","title":"Parameters","text":"

INPUTS

timestamp_expression TIMESTAMP, minute INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#safe_crc32_uuid-udf","title":"safe_crc32_uuid (UDF)","text":"

Calculate the CRC-32 hash of a 36-byte UUID, or NULL if the value isn't 36 bytes. This implementation is limited to an exact length because recursion does not work. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_118","title":"Parameters","text":"

INPUTS

) AS ( [ 0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685, 2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995, 2125561021, 3887607047, 2428444049, 498536548, 1789927666, 4089016648, 2227061214, 450548861, 1843258603, 4107580753, 2211677639, 325883990, 1684777152, 4251122042, 2321926636, 335633487, 1661365465, 4195302755, 2366115317, 997073096, 1281953886, 3579855332, 2724688242, 1006888145, 1258607687, 3524101629, 2768942443, 901097722, 1119000684, 3686517206, 2898065728, 853044451, 1172266101, 3705015759, 2882616665, 651767980, 1373503546, 3369554304, 3218104598, 565507253, 1454621731, 3485111705, 3099436303, 671266974, 1594198024, 3322730930, 2970347812, 795835527, 1483230225, 3244367275, 3060149565, 1994146192, 31158534, 2563907772, 4023717930, 1907459465, 112637215, 2680153253, 3904427059, 2013776290, 251722036, 2517215374, 3775830040, 2137656763, 141376813, 2439277719, 3865271297, 1802195444, 476864866, 2238001368, 4066508878, 1812370925, 453092731, 2181625025, 4111451223, 1706088902, 314042704, 2344532202, 4240017532, 1658658271, 366619977, 2362670323, 4224994405, 1303535960, 984961486, 2747007092, 3569037538, 1256170817, 1037604311, 2765210733, 3554079995, 1131014506, 879679996, 2909243462, 3663771856, 1141124467, 855842277, 2852801631, 3708648649, 1342533948, 654459306, 3188396048, 3373015174, 1466479909, 544179635, 3110523913, 3462522015, 1591671054, 702138776, 2966460450, 3352799412, 1504918807, 783551873, 3082640443, 3233442989, 3988292384, 2596254646, 62317068, 1957810842, 3939845945, 2647816111, 81470997, 1943803523, 3814918930, 2489596804, 225274430, 2053790376, 3826175755, 2466906013, 167816743, 2097651377, 4027552580, 2265490386, 503444072, 1762050814, 4150417245, 2154129355, 426522225, 1852507879, 4275313526, 2312317920, 282753626, 1742555852, 4189708143, 2394877945, 397917763, 1622183637, 3604390888, 2714866558, 953729732, 1340076626, 3518719985, 2797360999, 1068828381, 1219638859, 3624741850, 2936675148, 906185462, 1090812512, 3747672003, 2825379669, 829329135, 1181335161, 3412177804, 3160834842, 628085408, 1382605366, 3423369109, 3138078467, 570562233, 1426400815, 3317316542, 2998733608, 733239954, 1555261956, 3268935591, 3050360625, 752459403, 1541320221, 2607071920, 3965973030, 1969922972, 40735498, 2617837225, 3943577151, 1913087877, 83908371, 2512341634, 3803740692, 2075208622, 213261112, 2463272603, 3855990285, 2094854071, 198958881, 2262029012, 4057260610, 1759359992, 534414190, 2176718541, 4139329115, 1873836001, 414664567, 2282248934, 4279200368, 1711684554, 285281116, 2405801727, 4167216745, 1634467795, 376229701, 2685067896, 3608007406, 1308918612, 956543938, 2808555105, 3495958263, 1231636301, 1047427035, 2932959818, 3654703836, 1088359270, 936918000, 2847714899, 3736837829, 1202900863, 817233897, 3183342108, 3401237130, 1404277552, 615818150, 3134207493, 3453421203, 1423857449, 601450431, 3009837614, 3294710456, 1567103746, 711928724, 3020668471, 3272380065, 1510334235, 755167117 ]\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#safe_sample_id-udf","title":"safe_sample_id (UDF)","text":"

Stably hash a client_id to an integer between 0 and 99, or NULL if client_id isn't 36 bytes

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_119","title":"Parameters","text":"

INPUTS

client_id STRING\n

OUTPUTS

BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#search_counts_map_sum-udf","title":"search_counts_map_sum (UDF)","text":"

Calculate the sums of search counts per source and engine

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_120","title":"Parameters","text":"

INPUTS

entries ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#shift_28_bits_one_day-udf","title":"shift_28_bits_one_day (UDF)","text":"

Shift input bits one day left and drop any bits beyond 28 days.
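
A minimal sketch of the expected behavior: the lowest bit moves up one day, while a bit already at the 28th position falls out of the window:

SELECT\n  udf.shift_28_bits_one_day(1) AS shifted_low_bit,        -- expected 2\n  udf.shift_28_bits_one_day(1 << 27) AS shifted_high_bit  -- expected 0 (dropped)\n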

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_121","title":"Parameters","text":"

INPUTS

x INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#shift_365_bits_one_day-udf","title":"shift_365_bits_one_day (UDF)","text":"

Shift input bits one day left and drop any bits beyond 365 days.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_122","title":"Parameters","text":"

INPUTS

x BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#shift_one_day-udf","title":"shift_one_day (UDF)","text":"

Returns the bitfield shifted by one day, 0 for NULL

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_123","title":"Parameters","text":"

INPUTS

x INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#smoot_usage_from_28_bits-udf","title":"smoot_usage_from_28_bits (UDF)","text":"

Calculates a variety of metrics based on bit patterns of daily usage for the smoot_usage_* tables.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_124","title":"Parameters","text":"

INPUTS

bit_arrays ARRAY<STRUCT<days_created_profile_bits INT64, days_active_bits INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#vector_add-udf","title":"vector_add (UDF)","text":"

This function adds two vectors. The two vectors can have different length. If one vector is null, the other vector will be returned directly.
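
A minimal sketch of the documented behavior, including the NULL case (element-wise addition shown for equal-length vectors; see the UDF source for the exact handling of differing lengths):

SELECT\n  udf.vector_add([1, 2, 3], [10, 20, 30]) AS summed,                   -- expected [11, 22, 33]\n  udf.vector_add([1, 2, 3], CAST(NULL AS ARRAY<INT64>)) AS with_null   -- expected [1, 2, 3]\n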

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_125","title":"Parameters","text":"

INPUTS

a ARRAY<INT64>, b ARRAY<INT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#zero_as_365_bits-udf","title":"zero_as_365_bits (UDF)","text":"

Zero represented as a 365-bit byte array

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_126","title":"Parameters","text":"

INPUTS

) AS ( REPEAT(b'\\x00', 46\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#zeroed_array-udf","title":"zeroed_array (UDF)","text":"

Generates an array of all zeroes, of arbitrary length.
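
For example (a sketch of the expected output):

SELECT\n  udf.zeroed_array(4) AS zeros  -- expected [0, 0, 0, 0]\n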

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_127","title":"Parameters","text":"

INPUTS

len INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/","title":"Udf js","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#bootstrap_percentile_ci-udf","title":"bootstrap_percentile_ci (UDF)","text":"

Calculate a confidence interval using an efficient bootstrap sampling technique for a given percentile of a histogram. This implementation relies on the stdlib.js library and the binomial quantile function (https://github.com/stdlib-js/stats-base-dists-binomial-quantile/) for randomly sampling from a binomial distribution.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters","title":"Parameters","text":"

INPUTS

percentiles ARRAY<INT64>, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>, metric STRING\n

OUTPUTS

ARRAY<STRUCT<metric STRING, statistic STRING, point FLOAT64, lower FLOAT64, upper FLOAT64, parameter STRING>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#crc32-udf","title":"crc32 (UDF)","text":"

Calculate the CRC-32 hash of an input string. The implementation here could be optimized. In particular, it calculates a lookup table on every invocation which could be cached and reused. In practice, though, this implementation appears to be fast enough that further optimization is not yet warranted. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_1","title":"Parameters","text":"

INPUTS

data STRING\n

OUTPUTS

INT64 DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#decode_uri_attribution-udf","title":"decode_uri_attribution (UDF)","text":"

URL decodes the raw firefox_installer.install.attribution string to a STRUCT. The fields campaign, content, dlsource, dltoken, experiment, medium, source, ua, and variation are extracted from the string. If any value is (not+set) it is converted to (not set) to match the text from GA when the fields are not set.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_2","title":"Parameters","text":"

INPUTS

attribution STRING\n

OUTPUTS

STRUCT<campaign STRING, content STRING, dlsource STRING, dltoken STRING, experiment STRING, medium STRING, source STRING, ua STRING, variation STRING>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#extract_string_from_bytes-udf","title":"extract_string_from_bytes (UDF)","text":"

Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function extracts string data from payload which is in bytes.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_3","title":"Parameters","text":"

INPUTS

payload BYTES\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#gunzip-udf","title":"gunzip (UDF)","text":"

Unzips a GZIP string. This implementation relies on the zlib.js library (https://github.com/imaya/zlib.js) and the atob function for decoding base64.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_4","title":"Parameters","text":"

INPUTS

input BYTES\n

OUTPUTS

STRING DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_mean_ci-udf","title":"jackknife_mean_ci (UDF)","text":"

Calculates a confidence interval using a jackknife resampling technique for the mean of an array of values for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_5","title":"Parameters","text":"

INPUTS

n_buckets INT64, values_per_bucket ARRAY<FLOAT64>\n

OUTPUTS

STRUCT<low FLOAT64, high FLOAT64, pm FLOAT64>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_percentile_ci-udf","title":"jackknife_percentile_ci (UDF)","text":"

Calculate a confidence interval using a jackknife resampling technique for a given percentile of a histogram.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_6","title":"Parameters","text":"

INPUTS

percentile FLOAT64, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>\n

OUTPUTS

STRUCT<low FLOAT64, high FLOAT64, percentile FLOAT64>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_ratio_ci-udf","title":"jackknife_ratio_ci (UDF)","text":"

Calculates a confidence interval using a jackknife resampling technique for the weighted mean of an array of ratios for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(active_days_in_week) AS active_days_in_week, SUM(wau) AS wau FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_ratio_ci(20, ARRAY_AGG(STRUCT(CAST(active_days_in_week AS float64), CAST(wau as FLOAT64)))) AS intensity FROM bucketed GROUP BY submission_date

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_7","title":"Parameters","text":"

INPUTS

n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n

OUTPUTS

intensity FROM bucketed GROUP BY submission_date */ CREATE OR REPLACE FUNCTION udf_js.jackknife_ratio_ci( n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_sum_ci-udf","title":"jackknife_sum_ci (UDF)","text":"

Calculates a confidence interval using a jackknife resampling technique for the sum of an array of counts for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate count per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(dau) AS dau_sum FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_sum_ci(ARRAY_AGG(dau_sum)).* FROM bucketed GROUP BY submission_date

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_8","title":"Parameters","text":"

INPUTS

n_buckets INT64, counts_per_bucket ARRAY<INT64>\n

OUTPUTS

STRUCT<total INT64, low INT64, high INT64, pm INT64>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_events-udf","title":"json_extract_events (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_9","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

ARRAY<STRUCT<event_process STRING, event_timestamp INT64, event_category STRING, event_object STRING, event_method STRING, event_string_value STRING, event_map_values ARRAY<STRUCT<key STRING, value STRING>>>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":"

Returns a parsed struct from a JSON string representing a histogram. This implementation uses JavaScript and is provided for performance comparison; see udf/udf_json_extract_histogram for a pure SQL implementation that will likely be more usable in practice.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_10","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

STRUCT<bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_keyed_histogram-udf","title":"json_extract_keyed_histogram (UDF)","text":"

Returns an array of parsed structs from a JSON string representing a keyed histogram. This is likely only useful for histograms that weren't properly parsed to fields, and so ended up embedded in an additional_properties JSON blob. Normally, keyed histograms will be modeled as a key/value struct where the values are JSON representations of single histograms. There is no pure SQL equivalent to this function, since BigQuery does not provide any functions for listing or iterating over keys in a JSON map.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_11","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_missing_cols-udf","title":"json_extract_missing_cols (UDF)","text":"

Extract missing columns from additional properties. More generally, get a list of nodes from a JSON blob. Array elements are indicated as [...]. param input: The JSON blob to explode param indicates_node: An array of strings. If a key's value is an object, and contains one of these values, that key is returned as a node. param known_nodes: An array of strings. If a key is in this array, it is returned as a node. Notes: - Use indicates_node for things like histograms. For example ['histogram_type'] will ensure that each histogram will be returned as a missing node, rather than the subvalues within the histogram (e.g. values, sum, etc.) - Use known_nodes if you're aware of a missing section, like ['simpleMeasurements'] See here for an example usage https://sql.telemetry.mozilla.org/queries/64460/source

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_12","title":"Parameters","text":"

INPUTS

input STRING, indicates_node ARRAY<STRING>, known_nodes ARRAY<STRING>\n

OUTPUTS

ARRAY<STRING>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_active_addons-udf","title":"main_summary_active_addons (UDF)","text":"

Add fields from additional_attributes to active_addons in main pings. Return an array instead of a \"map\" for backwards compatibility. The INT64 columns from BigQuery may be passed as strings, so parseInt before returning them if they will be coerced to BOOL. The fields from additional_attributes have union types: integer or boolean for foreignInstall and userDisabled; string or number for version. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L422-L449

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_13","title":"Parameters","text":"

INPUTS

active_addons ARRAY<STRUCT<key STRING, value STRUCT<app_disabled BOOL, blocklisted BOOL, description STRING, foreign_install INT64, has_binary_components BOOL, install_day INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL, name STRING, scope INT64, signed_state INT64, type STRING, update_day INT64, user_disabled INT64, version STRING>>>, active_addons_json STRING\n

OUTPUTS

ARRAY<STRUCT<addon_id STRING, blocklisted BOOL, name STRING, user_disabled BOOL, app_disabled BOOL, version STRING, scope INT64, type STRING, foreign_install BOOL, has_binary_components BOOL, install_day INT64, update_day INT64, signed_state INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_addon_scalars-udf","title":"main_summary_addon_scalars (UDF)","text":"

Parse scalars from payload.processes.dynamic into map columns for each value type. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L385-L399

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_14","title":"Parameters","text":"

INPUTS

dynamic_scalars_json STRING, dynamic_keyed_scalars_json STRING\n

OUTPUTS

STRUCT<keyed_boolean_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value BOOL>>>>, keyed_uint_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value INT64>>>>, string_addon_scalars ARRAY<STRUCT<key STRING, value STRING>>, keyed_string_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value STRING>>>>, uint_addon_scalars ARRAY<STRUCT<key STRING, value INT64>>, boolean_addon_scalars ARRAY<STRUCT<key STRING, value BOOL>>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_disabled_addons-udf","title":"main_summary_disabled_addons (UDF)","text":"

Report the ids of the addons which are in the addonDetails but not in the activeAddons. They are the disabled addons (possibly because they are legacy). We need this as addonDetails may contain both disabled and active addons. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L451-L464

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_15","title":"Parameters","text":"

INPUTS

active_addon_ids ARRAY<STRING>, addon_details_json STRING\n

OUTPUTS

ARRAY<STRING>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#parse_sponsored_interaction-udf","title":"parse_sponsored_interaction (UDF)","text":"

Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function parses the sponsored interaction column from payload_error_bytes.contextual_services table.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_16","title":"Parameters","text":"

INPUTS

params STRING\n

OUTPUTS

STRUCT<`source` STRING, formFactor STRING, scenario STRING, interactionType STRING, contextId STRING, reportingUrl STRING, requestId STRING, submissionTimestamp TIMESTAMP, parsedReportingUrl JSON, originalDocType STRING, originalNamespace STRING, interactionCount INTEGER, flaggedFraud BOOLEAN>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#sample_id-udf","title":"sample_id (UDF)","text":"

Stably hash a client_id to an integer between 0 and 99. This function is technically defined in SQL, but it calls a JS UDF implementation of a CRC-32 hash, so we defined it here to make it clear that its performance may be limited by BigQuery's JavaScript UDF environment.
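
A minimal sketch; the client_id below is a made-up example UUID:

SELECT\n  udf_js.sample_id('12345678-90ab-cdef-1234-567890abcdef') AS sample_id\n-- expected: an integer between 0 and 99\n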

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_17","title":"Parameters","text":"

INPUTS

client_id STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#snake_case_columns-udf","title":"snake_case_columns (UDF)","text":"

This UDF takes a list of column names, snake-cases them, and transforms them to be compatible with the BigQuery column naming format. Based on the existing ingestion logic: https://github.com/mozilla/gcp-ingestion/blob/dad29698271e543018eddbb3b771ad7942bf4ce5/ingestion-core/src/main/java/com/mozilla/telemetry/ingestion/core/transform/PubsubMessageToObjectNode.java#L824

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_18","title":"Parameters","text":"

INPUTS

input ARRAY<STRING>\n

OUTPUTS

ARRAY<STRING>DETERMINISTIC\n

Source | Edit

"},{"location":"mozfun/about/","title":"mozfun","text":"

mozfun is a public GCP project provisioning publicly accessible user-defined functions (UDFs) and other function-like resources.

"},{"location":"mozfun/addons/","title":"Addons","text":""},{"location":"mozfun/addons/#is_adblocker-udf","title":"is_adblocker (UDF)","text":"

Returns whether a given Addon ID is an adblocker.

Determine if a given Addon ID is for an adblocker.

As an example, this query will give the number of users who have an adblocker installed.

SELECT\n    submission_date,\n    COUNT(DISTINCT client_id) AS dau,\nFROM\n    mozdata.telemetry.addons\nWHERE\n    mozfun.addons.is_adblocker(addon_id)\n    AND submission_date >= \"2023-01-01\"\nGROUP BY\n    submission_date\n

"},{"location":"mozfun/addons/#parameters","title":"Parameters","text":"

INPUTS

addon_id STRING\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/","title":"Assert","text":""},{"location":"mozfun/assert/#all_fields_null-udf","title":"all_fields_null (UDF)","text":""},{"location":"mozfun/assert/#parameters","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#approx_equals-udf","title":"approx_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_1","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE, tolerance FLOAT64\n

Source | Edit

"},{"location":"mozfun/assert/#array_empty-udf","title":"array_empty (UDF)","text":""},{"location":"mozfun/assert/#parameters_2","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#array_equals-udf","title":"array_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_3","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#array_equals_any_order-udf","title":"array_equals_any_order (UDF)","text":""},{"location":"mozfun/assert/#parameters_4","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#equals-udf","title":"equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_5","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#error-udf","title":"error (UDF)","text":""},{"location":"mozfun/assert/#parameters_6","title":"Parameters","text":"

INPUTS

name STRING, expected ANY TYPE, actual ANY TYPE\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/#false-udf","title":"false (UDF)","text":""},{"location":"mozfun/assert/#parameters_7","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

OUTPUTS

BOOL\n

Source | Edit

"},{"location":"mozfun/assert/#histogram_equals-udf","title":"histogram_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_8","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/#json_equals-udf","title":"json_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_9","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#map_entries_equals-udf","title":"map_entries_equals (UDF)","text":"

Like map_equals, but the error message contains only the offending entry

"},{"location":"mozfun/assert/#parameters_10","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/#map_equals-udf","title":"map_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_11","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/#not_null-udf","title":"not_null (UDF)","text":""},{"location":"mozfun/assert/#parameters_12","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n
"},{"location":"mozfun/assert/#null-udf","title":"null (UDF)","text":""},{"location":"mozfun/assert/#parameters_13","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#sql_equals-udf","title":"sql_equals (UDF)","text":"

Compare SQL Strings for equality

"},{"location":"mozfun/assert/#parameters_14","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#struct_equals-udf","title":"struct_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_15","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#true-udf","title":"true (UDF)","text":""},{"location":"mozfun/assert/#parameters_16","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/bits28/","title":"bits28","text":"

The bits28 functions provide an API for working with \"bit pattern\" INT64 fields, as used in the clients_last_seen dataset for desktop Firefox and similar datasets for other applications.

A powerful feature of the clients_last_seen methodology is that it doesn't record specific metrics like MAU and WAU directly, but rather each row stores a history of the discrete days on which a client was active in the past 28 days. We could calculate active users in a 10 day or 25 day window just as efficiently as a 7 day (WAU) or 28 day (MAU) window. But we can also define completely new metrics based on these usage histories, such as various retention definitions.

The usage history is encoded as a \"bit pattern\" where the physical type of the field is a BigQuery INT64, but logically the integer represents an array of bits, with each 1 indicating a day where the given client was active and each 0 indicating a day where the client was inactive.

"},{"location":"mozfun/bits28/#active_in_range-udf","title":"active_in_range (UDF)","text":"

Return a boolean indicating if any bits are set in the specified range of a bit pattern. The start_offset must be zero or a negative number indicating an offset from the rightmost bit in the pattern. n_bits is the number of bits to consider, counting right from the bit at start_offset.

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
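
For example, a sketch of a WAU-style check over the 7 most recent days of the pattern (offsets -6 through 0):

SELECT\n  mozfun.bits28.active_in_range(days_seen_bits, -6, 7) AS active_in_last_7_days\nFROM\n  `mozdata.telemetry.clients_last_seen`\nWHERE\n  submission_date > '2020-01-01'\n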

"},{"location":"mozfun/bits28/#parameters","title":"Parameters","text":"

INPUTS

bits INT64, start_offset INT64, n_bits INT64\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/bits28/#days_since_seen-udf","title":"days_since_seen (UDF)","text":"

Return the position of the rightmost set bit in an INT64 bit pattern.

To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference

SELECT\n  mozfun.bits28.days_since_seen(18)\n-- >> 1\n
"},{"location":"mozfun/bits28/#parameters_1","title":"Parameters","text":"

INPUTS

bits INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/bits28/#from_string-udf","title":"from_string (UDF)","text":"

Convert a string representing individual bits into an INT64.

Implementation based on https://stackoverflow.com/a/51600210/1260237

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
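
A minimal sketch, mirroring the bits28.to_string examples elsewhere in this suite:

SELECT\n  mozfun.bits28.from_string('0000000000000000000000000011')\n-- >> 3\n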

"},{"location":"mozfun/bits28/#parameters_2","title":"Parameters","text":"

INPUTS

s STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/bits28/#range-udf","title":"range (UDF)","text":"

Return an INT64 representing a range of bits from a source bit pattern.

The start_offset must be zero or a negative number indicating an offset from the rightmost bit in the pattern.

n_bits is the number of bits to consider, counting right from the bit at start_offset.

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference

SELECT\n  -- Signature is bits28.range(offset_to_day_0, start_bit, number_of_bits)\n  mozfun.bits28.range(days_seen_bits, -13 + 0, 7) AS week_0_bits,\n  mozfun.bits28.range(days_seen_bits, -13 + 7, 7) AS week_1_bits\nFROM\n  `mozdata.telemetry.clients_last_seen`\nWHERE\n  submission_date > '2020-01-01'\n
"},{"location":"mozfun/bits28/#parameters_3","title":"Parameters","text":"

INPUTS

bits INT64, start_offset INT64, n_bits INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/bits28/#retention-udf","title":"retention (UDF)","text":"

Return a nested struct providing booleans indicating whether a given client was active various time periods based on the passed bit pattern.

"},{"location":"mozfun/bits28/#parameters_4","title":"Parameters","text":"

INPUTS

bits INT64, submission_date DATE\n

Source | Edit

"},{"location":"mozfun/bits28/#to_dates-udf","title":"to_dates (UDF)","text":"

Convert a bit pattern into an array of the dates it represents.

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference

"},{"location":"mozfun/bits28/#parameters_5","title":"Parameters","text":"

INPUTS

bits INT64, submission_date DATE\n

OUTPUTS

ARRAY<DATE>\n

Source | Edit

"},{"location":"mozfun/bits28/#to_string-udf","title":"to_string (UDF)","text":"

Convert an INT64 field into a 28-character string representing the individual bits.

Implementation based on https://stackoverflow.com/a/51600210/1260237

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference

SELECT\n  [mozfun.bits28.to_string(1), mozfun.bits28.to_string(2), mozfun.bits28.to_string(3)]\n-- >>> ['0000000000000000000000000001',\n--      '0000000000000000000000000010',\n--      '0000000000000000000000000011']\n
"},{"location":"mozfun/bits28/#parameters_6","title":"Parameters","text":"

INPUTS

bits INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/bytes/","title":"bytes","text":""},{"location":"mozfun/bytes/#bit_pos_to_byte_pos-udf","title":"bit_pos_to_byte_pos (UDF)","text":"

Given a bit position, get the byte that bit appears in. 1-indexed (to match substr), and accepts negative values.
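
A sketch of the expected mapping: bits 1 through 8 fall in byte 1, bit 9 falls in byte 2, and negative positions count from the end of the byte array:

SELECT\n  mozfun.bytes.bit_pos_to_byte_pos(9) AS ninth_bit,   -- expected 2\n  mozfun.bytes.bit_pos_to_byte_pos(-1) AS last_bit    -- expected -1 (the last byte)\n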

"},{"location":"mozfun/bytes/#parameters","title":"Parameters","text":"

INPUTS

bit_pos INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/bytes/#extract_bits-udf","title":"extract_bits (UDF)","text":"

Extract bits from a byte array. Roughly matches substr with three arguments: b: bytes - the byte string to extract from; start: int - the position of the first bit to extract (can be negative to start from the end of the byte array; one-indexed, like substring); length: int - the number of bits to extract

The return byte array will have CEIL(length/8) bytes. The bits of interest will start at the beginning of the byte string. In other words, the byte array will have trailing 0s for any non-relevant fields.

Examples: bytes.extract_bits(b'\\x0F\\xF0', 5, 8) = b'\\xFF'; bytes.extract_bits(b'\\x0C\\xC0', -12, 8) = b'\\xCC'

"},{"location":"mozfun/bytes/#parameters_1","title":"Parameters","text":"

INPUTS

b BYTES, `begin` INT64, length INT64\n

OUTPUTS

BYTES\n

Source | Edit

"},{"location":"mozfun/bytes/#zero_right-udf","title":"zero_right (UDF)","text":"

Zero bits on the right of byte

"},{"location":"mozfun/bytes/#parameters_2","title":"Parameters","text":"

INPUTS

b BYTES, length INT64\n

OUTPUTS

BYTES\n

Source | Edit

"},{"location":"mozfun/event_analysis/","title":"event_analysis","text":"

These functions are specific for use with the events_daily and event_types tables. By themselves, these two tables are nearly impossible to use since the event history is compressed; however, these stored procedures should make the data accessible.

The events_daily table is created as a result of two steps: 1. Map each event to a single UTF8 char which will represent it 2. Group each client-day and store a string that records, using the compressed format, that client's event history for that day. The characters are ordered by the timestamp at which they appeared that day.

The best way to access this data is to create a view to do the heavy lifting. For example, to see which clients completed a certain action, you can create a view using these functions that knows what that action's representation is (using the compressed mapping from 1.) and create a regex string that checks for the presence of that event. The view makes this transparent, and allows users to simply query a boolean field representing the presence of that event on that day.

"},{"location":"mozfun/event_analysis/#aggregate_match_strings-udf","title":"aggregate_match_strings (UDF)","text":"

Given an array of strings that each match a single event, aggregate those into a single regex string that will match any of the events.

"},{"location":"mozfun/event_analysis/#parameters","title":"Parameters","text":"

INPUTS

match_strings ARRAY<STRING>\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#create_count_steps_query-stored-procedure","title":"create_count_steps_query (Stored Procedure)","text":"

Generate the SQL statement that can be used to create an easily queryable view on events data.

"},{"location":"mozfun/event_analysis/#parameters_1","title":"Parameters","text":"

INPUTS

project STRING, dataset STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n

OUTPUTS

sql STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#create_events_view-stored-procedure","title":"create_events_view (Stored Procedure)","text":"

Create a view that queries the events_daily table. This view currently supports both funnels and event counts. Funnels are created as a struct, with each step in the funnel as a boolean column in the struct, indicating whether the user completed that step on that day. Event counts are simply integers.

"},{"location":"mozfun/event_analysis/#usage","title":"Usage","text":"
create_events_view(\n    view_name STRING,\n    project STRING,\n    dataset STRING,\n    funnels ARRAY<STRUCT<\n        funnel_name STRING,\n        funnel ARRAY<STRUCT<\n            step_name STRING,\n            events ARRAY<STRUCT<\n                category STRING,\n                event_name STRING>>>>>>,\n    counts ARRAY<STRUCT<\n        count_name STRING,\n        events ARRAY<STRUCT<\n            category STRING,\n            event_name STRING>>>>\n  )\n
"},{"location":"mozfun/event_analysis/#recommended-pattern","title":"Recommended Pattern","text":"

Because the view definitions themselves are not informative about the contents of the events fields, it is best to put your query immediately after the procedure invocation, rather than invoking the procedure and running a separate query.

This STMO query is an example of doing so. This allows viewers of the query to easily interpret what the funnel and count columns represent.

"},{"location":"mozfun/event_analysis/#structure-of-the-resulting-view","title":"Structure of the Resulting View","text":"

The view will be created at

`moz-fx-data-shared-prod`.analysis.{event_name}.\n

The view will have a schema roughly matching the following:

root\n |-- submission_date: date\n |-- client_id: string\n |-- {funnel_1_name}: record\n |  |-- {funnel_step_1_name} boolean\n |  |-- {funnel_step_2_name} boolean\n ...\n |-- {funnel_N_name}: record\n |  |-- {funnel_step_M_name}: boolean\n |-- {count_1_name}: integer\n ...\n |-- {count_N_name}: integer\n ...dimensions...\n

"},{"location":"mozfun/event_analysis/#funnels","title":"Funnels","text":"

Each funnel will be a STRUCT with nested columns representing completion of each step. The types of those columns are boolean, and represent whether the user completed that step on that day.

STRUCT(\n    completed_step_1 BOOLEAN,\n    completed_step_2 BOOLEAN,\n    ...\n) AS funnel_name\n

With one row per-user per-day, you can use COUNTIF(funnel_name.completed_step_N) to query these fields. See below for an example.

"},{"location":"mozfun/event_analysis/#event-counts","title":"Event Counts","text":"

Each event count is simply an INT64 representing the number of times the user completed those events on that day. If there are multiple events represented within one count, the values are summed. For example, if you wanted to know the number of times a user opened or closed the app, you could create a single event count with those two events.

event_count_name INT64\n
"},{"location":"mozfun/event_analysis/#examples","title":"Examples","text":"

The following creates a few fields: - collection_flow is a funnel for those that started creating a collection within Fenix, and then finished, either by adding those tabs to an existing collection or saving it as a new collection. - collection_flow_saved represents users who started the collection flow then saved it as a new collection. - number_of_collections_created is the number of collections created - number_of_collections_deleted is the number of collections deleted

CALL mozfun.event_analysis.create_events_view(\n  'fenix_collection_funnels',\n  'moz-fx-data-shared-prod',\n  'org_mozilla_firefox',\n\n  -- Funnels\n  [\n    STRUCT(\n      \"collection_flow\" AS funnel_name,\n      [STRUCT(\n        \"started_collection_creation\" AS step_name,\n        [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n      STRUCT(\n        \"completed_collection_creation\" AS step_name,\n        [STRUCT('collections' AS category, 'saved' AS event_name),\n        STRUCT('collections' AS category, 'tabs_added' AS event_name)] AS events)\n    ] AS funnel),\n\n    STRUCT(\n      \"collection_flow_saved\" AS funnel_name,\n      [STRUCT(\n        \"started_collection_creation\" AS step_name,\n        [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n      STRUCT(\n        \"saved_collection\" AS step_name,\n        [STRUCT('collections' AS category, 'saved' AS event_name)] AS events)\n    ] AS funnel)\n  ],\n\n  -- Event Counts\n  [\n    STRUCT(\n      \"number_of_collections_created\" AS count_name,\n      [STRUCT('collections' AS category, 'saved' AS event_name)] AS events\n    ),\n    STRUCT(\n      \"number_of_collections_deleted\" AS count_name,\n      [STRUCT('collections' AS category, 'removed' AS event_name)] AS events\n    )\n  ]\n);\n

From there, you can query a few things. For example, the fraction of users who completed each step of the collection flow over time:

SELECT\n    submission_date,\n    COUNTIF(collection_flow.started_collection_creation) / COUNT(*) AS started_collection_creation,\n    COUNTIF(collection_flow.completed_collection_creation) / COUNT(*) AS completed_collection_creation,\nFROM\n    `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n    submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n    submission_date\n

Or you can see the number of collections created and deleted:

SELECT\n    submission_date,\n    SUM(number_of_collections_created) AS number_of_collections_created,\n    SUM(number_of_collections_deleted) AS number_of_collections_deleted,\nFROM\n    `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n    submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n    submission_date\n

"},{"location":"mozfun/event_analysis/#parameters_2","title":"Parameters","text":"

INPUTS

view_name STRING, project STRING, dataset STRING, funnels ARRAY<STRUCT<funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>>>, counts ARRAY<STRUCT<count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>\n

Source | Edit

"},{"location":"mozfun/event_analysis/#create_funnel_regex-udf","title":"create_funnel_regex (UDF)","text":"

Given an array of match strings, each representing a single funnel step, aggregate them into a regex string that will match only against the entire funnel. If intermediate_steps is TRUE, this allows for there to be events that occur between the funnel steps.

"},{"location":"mozfun/event_analysis/#parameters_3","title":"Parameters","text":"

INPUTS

step_regexes ARRAY<STRING>, intermediate_steps BOOLEAN\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#create_funnel_steps_query-stored-procedure","title":"create_funnel_steps_query (Stored Procedure)","text":"

Generate the SQL statement that can be used to create an easily queryable view on events data.

"},{"location":"mozfun/event_analysis/#parameters_4","title":"Parameters","text":"

INPUTS

project STRING, dataset STRING, funnel ARRAY<STRUCT<list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n

OUTPUTS

sql STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#escape_metachars-udf","title":"escape_metachars (UDF)","text":"

Escape all metachars from a regex string. This will make the string an exact match, no matter what it contains.

"},{"location":"mozfun/event_analysis/#parameters_5","title":"Parameters","text":"

INPUTS

s STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#event_index_to_match_string-udf","title":"event_index_to_match_string (UDF)","text":"

Given an event index string, create a match string that is an exact match in the events_daily table.

"},{"location":"mozfun/event_analysis/#parameters_6","title":"Parameters","text":"

INPUTS

index STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#event_property_index_to_match_string-udf","title":"event_property_index_to_match_string (UDF)","text":"

Given an event index and property index from an event_types table, returns a regular expression to match corresponding events within an events_daily table's events string that aren't missing the specified property.

"},{"location":"mozfun/event_analysis/#parameters_7","title":"Parameters","text":"

INPUTS

event_index STRING, property_index INTEGER\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#event_property_value_to_match_string-udf","title":"event_property_value_to_match_string (UDF)","text":"

Given an event index, property index, and property value from an event_types table, returns a regular expression to match corresponding events within an events_daily table's events string.

"},{"location":"mozfun/event_analysis/#parameters_8","title":"Parameters","text":"

INPUTS

event_index STRING, property_index INTEGER, property_value STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#extract_event_counts-udf","title":"extract_event_counts (UDF)","text":"

Extract the events and their counts from an events string. This function explicitly ignores event properties, and retrieves just the counts of the top-level events.

"},{"location":"mozfun/event_analysis/#usage_1","title":"Usage","text":"
extract_event_counts(\n    events STRING\n)\n

events - A comma-separated events string, where each event is represented as a string of unicode chars.

"},{"location":"mozfun/event_analysis/#example","title":"Example","text":"

See this dashboard for example usage.

"},{"location":"mozfun/event_analysis/#parameters_9","title":"Parameters","text":"

INPUTS

events STRING\n

OUTPUTS

ARRAY<STRUCT<index STRING, count INT64>>\n

Source | Edit

"},{"location":"mozfun/event_analysis/#extract_event_counts_with_properties-udf","title":"extract_event_counts_with_properties (UDF)","text":"

Extract events with event properties and their associated counts. Also extracts raw events and their counts. This allows for querying with and without properties in the same dashboard.

"},{"location":"mozfun/event_analysis/#usage_2","title":"Usage","text":"
extract_event_counts_with_properties(\n    events STRING\n)\n

events - A comma-separated events string, where each event is represented as a string of unicode chars.

"},{"location":"mozfun/event_analysis/#example_1","title":"Example","text":"

See this query for example usage.

"},{"location":"mozfun/event_analysis/#caveats","title":"Caveats","text":"

This function extracts both counts for events with each property, and for all events without their properties.

This allows us to include both total counts for an event (with any property value), and events that don't have properties.

"},{"location":"mozfun/event_analysis/#parameters_10","title":"Parameters","text":"

INPUTS

events STRING\n

OUTPUTS

ARRAY<STRUCT<event_index STRING, property_index INT64, property_value_index STRING, count INT64>>\n

Source | Edit

"},{"location":"mozfun/event_analysis/#get_count_sql-stored-procedure","title":"get_count_sql (Stored Procedure)","text":"

For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.

"},{"location":"mozfun/event_analysis/#parameters_11","title":"Parameters","text":"

INPUTS

project STRING, dataset STRING, count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n

OUTPUTS

count_sql STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#get_funnel_steps_sql-stored-procedure","title":"get_funnel_steps_sql (Stored Procedure)","text":"

For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.

"},{"location":"mozfun/event_analysis/#parameters_12","title":"Parameters","text":"

INPUTS

project STRING, dataset STRING, funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n

OUTPUTS

funnel_sql STRING\n

Source | Edit

"},{"location":"mozfun/ga/","title":"Ga","text":""},{"location":"mozfun/ga/#nullify_string-udf","title":"nullify_string (UDF)","text":"

Nullify a GA string, which sometimes comes in as \"(not set)\" or simply \"\"

UDF for handling empty Google Analytics data.
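
A minimal sketch of the expected behavior:

SELECT\n  mozfun.ga.nullify_string('(not set)') AS a,  -- expected NULL\n  mozfun.ga.nullify_string('') AS b,           -- expected NULL\n  mozfun.ga.nullify_string('organic') AS c     -- expected 'organic'\n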

"},{"location":"mozfun/ga/#parameters","title":"Parameters","text":"

INPUTS

s STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/glam/","title":"Glam","text":""},{"location":"mozfun/glam/#build_hour_to_datetime-udf","title":"build_hour_to_datetime (UDF)","text":"

Parses the custom build id used for Fenix builds in GLAM to a datetime.

"},{"location":"mozfun/glam/#parameters","title":"Parameters","text":"

INPUTS

build_hour STRING\n

OUTPUTS

DATETIME\n

Source | Edit

"},{"location":"mozfun/glam/#build_seconds_to_hour-udf","title":"build_seconds_to_hour (UDF)","text":"

Returns a custom build id generated from the build seconds of a FOG build.

"},{"location":"mozfun/glam/#parameters_1","title":"Parameters","text":"

INPUTS

build_hour STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/glam/#fenix_build_to_build_hour-udf","title":"fenix_build_to_build_hour (UDF)","text":"

Returns a custom build id generated from the build hour of a Fenix build.

"},{"location":"mozfun/glam/#parameters_2","title":"Parameters","text":"

INPUTS

app_build_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_bucket_from_value-udf","title":"histogram_bucket_from_value (UDF)","text":""},{"location":"mozfun/glam/#parameters_3","title":"Parameters","text":"

INPUTS

buckets ARRAY<STRING>, val FLOAT64\n

OUTPUTS

FLOAT64\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_buckets_cast_string_array-udf","title":"histogram_buckets_cast_string_array (UDF)","text":"

Cast histogram buckets into a string array.

"},{"location":"mozfun/glam/#parameters_4","title":"Parameters","text":"

INPUTS

buckets ARRAY<INT64>\n

OUTPUTS

ARRAY<STRING>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_cast_json-udf","title":"histogram_cast_json (UDF)","text":"

Cast a histogram into a JSON blob.

"},{"location":"mozfun/glam/#parameters_5","title":"Parameters","text":"

INPUTS

histogram ARRAY<STRUCT<key STRING, value FLOAT64>>\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_cast_struct-udf","title":"histogram_cast_struct (UDF)","text":"

Cast a String-based JSON histogram to an Array of Structs

"},{"location":"mozfun/glam/#parameters_6","title":"Parameters","text":"

INPUTS

json_str STRING\n

OUTPUTS

ARRAY<STRUCT<KEY STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_fill_buckets-udf","title":"histogram_fill_buckets (UDF)","text":"

Interpolate missing histogram buckets with empty buckets.

"},{"location":"mozfun/glam/#parameters_7","title":"Parameters","text":"

INPUTS

input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_fill_buckets_dirichlet-udf","title":"histogram_fill_buckets_dirichlet (UDF)","text":"

Interpolate missing histogram buckets with empty buckets so it becomes a valid estimator for the dirichlet distribution.

See: https://docs.google.com/document/d/1ipy1oFIKDvHr3R6Ku0goRjS11R1ZH1z2gygOGkSdqUg

To use this, you must first: Aggregate the histograms to the client level, to get a histogram {k1: p1, k2: p2, ..., kK: pK} where the p's are proportions (p1, p2, ... sum to 1) and K is the number of buckets.

This is then the client's estimated density, and every client has been reduced to one row (i.e. the client's histograms are reduced to this single one and normalized).

Then add all of these across clients to get {k1: P1, k2:P2, ..., kK: PK} where P1 = sum(p1 across N clients) and P2 = sum(p2 across N clients).

Calculate the total number of buckets K, as well as the total number of profiles N reporting

Then our estimate for the final density is: {k1: ((P1 + 1/K) / (N + 1)), k2: ((P2 + 1/K) / (N + 1)), ... }
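
Written as a single formula, with N the number of reporting clients as above, the filled value for bucket k is:

$$ \hat{P}_k = \frac{P_k + 1/K}{N + 1} $$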

"},{"location":"mozfun/glam/#parameters_8","title":"Parameters","text":"

INPUTS

input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>, total_users INT64\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_filter_high_values-udf","title":"histogram_filter_high_values (UDF)","text":"

Prevent overflows by only keeping buckets where the value is less than 2^40, allowing 2^24 entries. This value was chosen somewhat arbitrarily; typically the max histogram value is somewhere on the order of ~20 bits. Negative values are incorrect and should not happen, but were observed, probably due to some bit flips.

"},{"location":"mozfun/glam/#parameters_9","title":"Parameters","text":"

INPUTS

aggs ARRAY<STRUCT<key STRING, value INT64>>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value INT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_from_buckets_uniform-udf","title":"histogram_from_buckets_uniform (UDF)","text":"

Create an empty histogram from an array of buckets.

"},{"location":"mozfun/glam/#parameters_10","title":"Parameters","text":"

INPUTS

buckets ARRAY<STRING>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_generate_exponential_buckets-udf","title":"histogram_generate_exponential_buckets (UDF)","text":"

Generate exponential buckets for a histogram.

"},{"location":"mozfun/glam/#parameters_11","title":"Parameters","text":"

INPUTS

min FLOAT64, max FLOAT64, nBuckets FLOAT64\n

OUTPUTS

ARRAY<FLOAT64>DETERMINISTIC\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_generate_functional_buckets-udf","title":"histogram_generate_functional_buckets (UDF)","text":"

Generate functional buckets for a histogram. This is specific to Glean.

See: https://github.com/mozilla/glean/blob/main/glean-core/src/histogram/functional.rs

A functional bucketing algorithm. The bucket index of a given sample is determined with the following function:

$$ i = \lfloor n \log_{\text{base}}(x) \rfloor $$

In other words, there are n buckets for each power of base magnitude.

"},{"location":"mozfun/glam/#parameters_12","title":"Parameters","text":"

INPUTS

log_base INT64, buckets_per_magnitude INT64, range_max INT64\n

OUTPUTS

ARRAY<FLOAT64>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_generate_linear_buckets-udf","title":"histogram_generate_linear_buckets (UDF)","text":"

Generate linear buckets for a histogram.

"},{"location":"mozfun/glam/#parameters_13","title":"Parameters","text":"

INPUTS

min FLOAT64, max FLOAT64, nBuckets FLOAT64\n

OUTPUTS

ARRAY<FLOAT64>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_generate_scalar_buckets-udf","title":"histogram_generate_scalar_buckets (UDF)","text":"

Generate scalar buckets for a histogram using a fixed number of buckets.

"},{"location":"mozfun/glam/#parameters_14","title":"Parameters","text":"

INPUTS

min_bucket FLOAT64, max_bucket FLOAT64, num_buckets INT64\n

OUTPUTS

ARRAY<FLOAT64>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_normalized_sum-udf","title":"histogram_normalized_sum (UDF)","text":"

Compute the normalized sum of an array of histograms.

"},{"location":"mozfun/glam/#parameters_15","title":"Parameters","text":"

INPUTS

arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_normalized_sum_with_original-udf","title":"histogram_normalized_sum_with_original (UDF)","text":"

Compute the normalized and the non-normalized sum of an array of histograms.

"},{"location":"mozfun/glam/#parameters_16","title":"Parameters","text":"

INPUTS

arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64, non_norm_value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#map_from_array_offsets-udf","title":"map_from_array_offsets (UDF)","text":""},{"location":"mozfun/glam/#parameters_17","title":"Parameters","text":"

INPUTS

required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#map_from_array_offsets_precise-udf","title":"map_from_array_offsets_precise (UDF)","text":""},{"location":"mozfun/glam/#parameters_18","title":"Parameters","text":"

INPUTS

required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#percentile-udf","title":"percentile (UDF)","text":"

Get the value of the approximate CDF at the given percentile.

"},{"location":"mozfun/glam/#parameters_19","title":"Parameters","text":"

INPUTS

pct FLOAT64, histogram ARRAY<STRUCT<key STRING, value FLOAT64>>, type STRING\n

OUTPUTS

FLOAT64\n

Source | Edit

"},{"location":"mozfun/glean/","title":"glean","text":"

Functions for working with Glean data.

"},{"location":"mozfun/glean/#legacy_compatible_experiments-udf","title":"legacy_compatible_experiments (UDF)","text":"

Formats a Glean experiments field into a Legacy Telemetry experiments field by dropping the extra information that Glean collects

This UDF transforms the ping_info.experiments field from Glean pings into the format for experiments used by Legacy Telemetry pings. In particular, it drops the extra information that Glean pings collect.

If you need to combine Glean data with Legacy Telemetry data, then you can use this UDF to transform a Glean experiments field into the structure of a Legacy Telemetry one.
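
A minimal usage sketch with a hand-built experiments value (the experiment name, branch, and enrollment_id are hypothetical):

SELECT\n  mozfun.glean.legacy_compatible_experiments(\n    [\n      STRUCT(\n        'my-experiment' AS key,\n        STRUCT(\n          'treatment' AS branch,\n          STRUCT('glean' AS type, 'abc123' AS enrollment_id) AS extra\n        ) AS value\n      )\n    ]\n  )\n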

"},{"location":"mozfun/glean/#parameters","title":"Parameters","text":"

INPUTS

ping_info__experiments ARRAY<STRUCT<key STRING, value STRUCT<branch STRING, extra STRUCT<type STRING, enrollment_id STRING>>>>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/glean/#parse_datetime-udf","title":"parse_datetime (UDF)","text":"

Parses a Glean datetime metric string value as a BigQuery timestamp.

See https://mozilla.github.io/glean/book/reference/metrics/datetime.html
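
A minimal usage sketch (the datetime string below is a hypothetical ISO 8601 value of the kind Glean records):

SELECT\n  mozfun.glean.parse_datetime('2023-03-01T14:30:00.000-08:00')\n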

"},{"location":"mozfun/glean/#parameters_1","title":"Parameters","text":"

INPUTS

datetime_string STRING\n

OUTPUTS

TIMESTAMP\n

Source | Edit

"},{"location":"mozfun/glean/#timespan_nanos-udf","title":"timespan_nanos (UDF)","text":"

Returns the number of nanoseconds represented by a Glean timespan struct.

See https://mozilla.github.io/glean/book/user/metrics/timespan.html

"},{"location":"mozfun/glean/#parameters_2","title":"Parameters","text":"

INPUTS

timespan STRUCT<time_unit STRING, value INT64>\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/glean/#timespan_seconds-udf","title":"timespan_seconds (UDF)","text":"

Returns the number of seconds represented by a Glean timespan struct, rounded down to full seconds.

See https://mozilla.github.io/glean/book/user/metrics/timespan.html
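
A minimal usage sketch (assuming 'millisecond' is a valid time_unit value):

SELECT\n  mozfun.glean.timespan_seconds(STRUCT('millisecond' AS time_unit, 2500 AS value))\n-- expected to round down to 2\n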

"},{"location":"mozfun/glean/#parameters_3","title":"Parameters","text":"

INPUTS

timespan STRUCT<time_unit STRING, value INT64>\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/google_ads/","title":"Google ads","text":""},{"location":"mozfun/google_ads/#extract_segments_from_campaign_name-udf","title":"extract_segments_from_campaign_name (UDF)","text":"

Extract Segments from a campaign name. Includes region, country_code, and language.

"},{"location":"mozfun/google_ads/#parameters","title":"Parameters","text":"

INPUTS

campaign_name STRING\n

OUTPUTS

STRUCT<campaign_region STRING, campaign_country_code STRING, campaign_language STRING>\n

Source | Edit

"},{"location":"mozfun/google_search_console/","title":"google_search_console","text":"

Functions for use with Google Search Console data.

"},{"location":"mozfun/google_search_console/#classify_site_query-udf","title":"classify_site_query (UDF)","text":"

Classify a Google search query for a site as \"Anonymized\", \"Firefox Brand\", \"Pocket Brand\", \"Mozilla Brand\", or \"Non-Brand\".

"},{"location":"mozfun/google_search_console/#parameters","title":"Parameters","text":"

INPUTS

site_domain_name STRING, query STRING, search_type STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_country_code-udf","title":"extract_url_country_code (UDF)","text":"

Extract the country code from a URL if it's present.

"},{"location":"mozfun/google_search_console/#parameters_1","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_domain_name-udf","title":"extract_url_domain_name (UDF)","text":"

Extract the domain name from a URL.

"},{"location":"mozfun/google_search_console/#parameters_2","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_language_code-udf","title":"extract_url_language_code (UDF)","text":"

Extract the language code from a URL if it's present.

"},{"location":"mozfun/google_search_console/#parameters_3","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_locale-udf","title":"extract_url_locale (UDF)","text":"

Extract the locale from a URL if it's present.

"},{"location":"mozfun/google_search_console/#parameters_4","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_path-udf","title":"extract_url_path (UDF)","text":"

Extract the path from a URL.

"},{"location":"mozfun/google_search_console/#parameters_5","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_path_segment-udf","title":"extract_url_path_segment (UDF)","text":"

Extract a particular path segment from a URL.

"},{"location":"mozfun/google_search_console/#parameters_6","title":"Parameters","text":"

INPUTS

url STRING, segment_number INTEGER\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/hist/","title":"hist","text":"

Functions for working with string encodings of histograms from desktop telemetry.

"},{"location":"mozfun/hist/#count-udf","title":"count (UDF)","text":"

Given histogram h, return the count of all measurements across all buckets.

Extracts the values from the histogram and sums them, returning the total_count.

"},{"location":"mozfun/hist/#parameters","title":"Parameters","text":"

INPUTS

histogram STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#extract-udf","title":"extract (UDF)","text":"

Return a parsed struct from a string-encoded histogram.

We support a variety of compact encodings as well as the classic JSON representation as sent in main pings.

The built-in BigQuery JSON parsing functions are not powerful enough to handle all the logic here, so we resort to some string processing. This function could behave unexpectedly on poorly-formatted histogram JSON, but we expect that payload validation in the data pipeline should ensure that histograms are well formed, which gives us some flexibility.

For more on desktop telemetry histogram structure, see:

The compact encodings were originally proposed in:

SELECT\n  mozfun.hist.extract(\n    '{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}'\n  ).sum\n-- 1\n
SELECT\n  mozfun.hist.extract('5').sum\n-- 5\n
"},{"location":"mozfun/hist/#parameters_1","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"

Extract a histogram sum from a JSON string representation

"},{"location":"mozfun/hist/#parameters_2","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#extract_keyed_hist_sum-udf","title":"extract_keyed_hist_sum (UDF)","text":"

Sum of a keyed histogram, across all keys it contains.

"},{"location":"mozfun/hist/#extract-keyed-histogram-sum","title":"Extract Keyed Histogram Sum","text":"

Takes a keyed histogram and returns a single number: the sum of all keys it contains. The expected input type is ARRAY<STRUCT<key STRING, value STRING>>

The return type is INT64.

The key field will be ignored, and the `value` is expected to be the compact histogram representation.
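
A minimal usage sketch using the compact histogram representation (hypothetical keys and values):

SELECT\n  mozfun.hist.extract_keyed_hist_sum(\n    [STRUCT('key_a' AS key, '5' AS value), STRUCT('key_b' AS key, '3' AS value)]\n  )\n-- expected: 8\n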

"},{"location":"mozfun/hist/#parameters_3","title":"Parameters","text":"

INPUTS

keyed_histogram ARRAY<STRUCT<key STRING, value STRING>>\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#mean-udf","title":"mean (UDF)","text":"

Given histogram h, return floor(mean) of the measurements in the bucket. That is, the histogram sum divided by the number of measurements taken.

https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L292-L307

"},{"location":"mozfun/hist/#parameters_4","title":"Parameters","text":"

INPUTS

histogram ANY TYPE\n

OUTPUTS

STRUCT<sum INT64, VALUES ARRAY<STRUCT<value INT64>>>\n

Source | Edit

"},{"location":"mozfun/hist/#merge-udf","title":"merge (UDF)","text":"

Merge an array of histograms into a single histogram.

"},{"location":"mozfun/hist/#parameters_5","title":"Parameters","text":"

INPUTS

histogram_list ANY TYPE\n

Source | Edit

"},{"location":"mozfun/hist/#normalize-udf","title":"normalize (UDF)","text":"

Normalize a histogram. Set sum to 1, and normalize the histogram bucket counts so they sum to 1.

"},{"location":"mozfun/hist/#parameters_6","title":"Parameters","text":"

INPUTS

histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n

OUTPUTS

STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n

Source | Edit

"},{"location":"mozfun/hist/#percentiles-udf","title":"percentiles (UDF)","text":"

Given a histogram and a list of percentiles, calculate what those percentiles are for the histogram. If the histogram is empty, returns NULL.

"},{"location":"mozfun/hist/#parameters_7","title":"Parameters","text":"

INPUTS

histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n

OUTPUTS

ARRAY<STRUCT<percentile FLOAT64, value INT64>>\n

Source | Edit

"},{"location":"mozfun/hist/#string_to_json-udf","title":"string_to_json (UDF)","text":"

Convert a histogram string (in JSON or compact format) to a full histogram JSON blob.

"},{"location":"mozfun/hist/#parameters_8","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#threshold_count-udf","title":"threshold_count (UDF)","text":"

Return the number of recorded observations greater than threshold for the histogram. CAUTION: Does not count any buckets that have any values less than the threshold. For example, a bucket with range (1, 10) will not be counted for a threshold of 2. Use thresholds that are not bucket boundaries with caution.

https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L213-L239
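
A minimal usage sketch (hypothetical histogram string; note the caution above about thresholds that fall inside a bucket range):

SELECT\n  mozfun.hist.threshold_count(\n    '{\"bucket_count\":3,\"histogram_type\":1,\"sum\":6,\"range\":[1,10],\"values\":{\"1\":2,\"10\":1}}',\n    10\n  )\n-- only buckets whose entire range is at or above 10 are counted\n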

"},{"location":"mozfun/hist/#parameters_9","title":"Parameters","text":"

INPUTS

histogram STRING, threshold INT64\n

Source | Edit

"},{"location":"mozfun/iap/","title":"iap","text":""},{"location":"mozfun/iap/#derive_apple_subscription_interval-udf","title":"derive_apple_subscription_interval (UDF)","text":"

Take output purchase_date and expires_date from mozfun.iap.parse_apple_receipt and return the subscription interval to use for accounting. Values must be DATETIME in America/Los_Angeles to get correct results because of how timezone and daylight savings impact the time of day and the length of a month.

"},{"location":"mozfun/iap/#parameters","title":"Parameters","text":"

INPUTS

start DATETIME, `end` DATETIME\n

OUTPUTS

STRUCT<`interval` STRING, interval_count INT64>\n

Source | Edit

"},{"location":"mozfun/iap/#parse_android_receipt-udf","title":"parse_android_receipt (UDF)","text":"

Used to parse data field from firestore export of fxa dataset iap_google_raw. The content is documented at https://developer.android.com/google/play/billing/subscriptions and https://developers.google.com/android-publisher/api-ref/rest/v3/purchases.subscriptions

"},{"location":"mozfun/iap/#parameters_1","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"mozfun/iap/#parse_apple_event-udf","title":"parse_apple_event (UDF)","text":"

Used to parse data field from firestore export of fxa dataset iap_app_store_purchases_raw. The content is documented at https://developer.apple.com/documentation/appstoreservernotifications/responsebodyv2decodedpayload and https://github.com/mozilla/fxa/blob/700ed771860da450add97d62f7e6faf2ead0c6ba/packages/fxa-shared/payments/iap/apple-app-store/subscription-purchase.ts#L115-L171

"},{"location":"mozfun/iap/#parameters_2","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"mozfun/iap/#parse_apple_receipt-udf","title":"parse_apple_receipt (UDF)","text":"

Used to parse provider_receipt_json in mozilla vpn subscriptions where provider is \"APPLE\". The content is documented at https://developer.apple.com/documentation/appstorereceipts/responsebody

"},{"location":"mozfun/iap/#parameters_3","title":"Parameters","text":"

INPUTS

provider_receipt_json STRING\n

OUTPUTS

STRUCT<environment STRING, latest_receipt BYTES, latest_receipt_info ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, pending_renewal_info ARRAY<STRUCT<auto_renew_product_id STRING, auto_renew_status INT64, expiration_intent INT64, is_in_billing_retry_period INT64, original_transaction_id STRING, product_id STRING>>, receipt STRUCT<adam_id INT64, app_item_id INT64, application_version STRING, bundle_id STRING, download_id INT64, in_app ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, original_application_version STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, receipt_creation_date STRING, receipt_creation_date_ms INT64, receipt_creation_date_pst STRING, receipt_type STRING, request_date STRING, request_date_ms INT64, request_date_pst STRING, version_external_identifier INT64>, status INT64>DETERMINISTIC\n

Source | Edit

"},{"location":"mozfun/iap/#scrub_apple_receipt-udf","title":"scrub_apple_receipt (UDF)","text":"

Take output from mozfun.iap.parse_apple_receipt and remove fields or reduce their granularity so that the returned value can be exposed to all employees via redash.

"},{"location":"mozfun/iap/#parameters_4","title":"Parameters","text":"

INPUTS

apple_receipt ANY TYPE\n

OUTPUTS

STRUCT<environment STRING, active_period STRUCT<start_date DATE, end_date DATE, start_time TIMESTAMP, end_time TIMESTAMP, `interval` STRING, interval_count INT64>, trial_period STRUCT<start_time TIMESTAMP, end_time TIMESTAMP>>\n

Source | Edit

"},{"location":"mozfun/json/","title":"json","text":"

Functions for parsing Mozilla-specific JSON data types.

"},{"location":"mozfun/json/#extract_int_map-udf","title":"extract_int_map (UDF)","text":"

Returns an array of key/value structs from a string representing a JSON map. Both keys and values are cast to integers.

This is the format for the \"values\" field in the desktop telemetry histogram JSON representation.
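
A minimal usage sketch (hypothetical input):

SELECT\n  mozfun.json.extract_int_map('{\"0\": 1, \"5\": 2}')\n-- expected: an array of key/value structs with integer keys and values, e.g. (0, 1) and (5, 2)\n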

"},{"location":"mozfun/json/#parameters","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"mozfun/json/#from_map-udf","title":"from_map (UDF)","text":"

Converts a standard \"map\"-like data structure array<struct<key, value>> into a JSON value.

Convert the standard Array<Struct<key, value>> style maps to JSON values.

"},{"location":"mozfun/json/#parameters_1","title":"Parameters","text":"

INPUTS

input JSON\n

OUTPUTS

json\n

Source | Edit

"},{"location":"mozfun/json/#from_nested_map-udf","title":"from_nested_map (UDF)","text":"

Converts a nested JSON object with repeated key/value pairs into a nested JSON object.

Convert a JSON object like { \"metric\": [ {\"key\": \"extra\", \"value\": 2 } ] } to a JSON object like { \"metric\": { \"extra\": 2 } }.

This only works on JSON types.

"},{"location":"mozfun/json/#parameters_2","title":"Parameters","text":"

OUTPUTS

json\n

Source | Edit

"},{"location":"mozfun/json/#js_extract_string_map-udf","title":"js_extract_string_map (UDF)","text":"

Returns an array of key/value structs from a string representing a JSON map.

BigQuery Standard SQL JSON functions are insufficient to implement this function, so JS is being used and it may not perform well with large or numerous inputs.

Non-string non-null values are encoded as json.
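
A minimal usage sketch (hypothetical input; note the non-string value):

SELECT\n  mozfun.json.js_extract_string_map('{\"a\": \"x\", \"b\": 1}')\n-- 'b' has a non-string value, so it is expected to come back JSON-encoded as '1'\n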

"},{"location":"mozfun/json/#parameters_3","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/json/#mode_last-udf","title":"mode_last (UDF)","text":"

Returns the most frequently occurring element in an array of json-compatible elements. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored.

"},{"location":"mozfun/json/#parameters_4","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"mozfun/ltv/","title":"Ltv","text":""},{"location":"mozfun/ltv/#android_states_v1-udf","title":"android_states_v1 (UDF)","text":"

LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"

"},{"location":"mozfun/ltv/#parameters","title":"Parameters","text":"

INPUTS

adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#android_states_v2-udf","title":"android_states_v2 (UDF)","text":"

LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"

"},{"location":"mozfun/ltv/#parameters_1","title":"Parameters","text":"

INPUTS

adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#android_states_with_paid_v1-udf","title":"android_states_with_paid_v1 (UDF)","text":"

LTV states for Android. Results in strings like: \"1_dow3_organic_2_1\" and \"0_dow1_paid_1_1\"

These states include whether a client was paid or organic.

"},{"location":"mozfun/ltv/#parameters_2","title":"Parameters","text":"

INPUTS

adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#android_states_with_paid_v2-udf","title":"android_states_with_paid_v2 (UDF)","text":"

Get the state of a user on a day, with paid/organic cohorts included. Compared to V1, these states have a \"dead\" state, determined by \"death_time\". The model can use this state as a sink, where the client will never return if they are dead.

"},{"location":"mozfun/ltv/#parameters_3","title":"Parameters","text":"

INPUTS

adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#desktop_states_v1-udf","title":"desktop_states_v1 (UDF)","text":"

LTV states for Desktop. Results in strings like: \"0_1_1_1_1\", where each component is: 1. the age in days of the client; 2. the day of week of first_seen_date; 3. the day of week of submission_date; 4. the activity level (possible values are 0-3, plus \"00\" for \"dead\"); 5. whether the client is active on submission_date.

"},{"location":"mozfun/ltv/#parameters_4","title":"Parameters","text":"

INPUTS

days_since_first_seen INT64, days_since_active INT64, submission_date DATE, first_seen_date DATE, death_time INT64, pattern INT64, active INT64, max_days INT64, lookback INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#get_state_ios_v2-udf","title":"get_state_ios_v2 (UDF)","text":"

LTV states for iOS.

"},{"location":"mozfun/ltv/#parameters_5","title":"Parameters","text":"

INPUTS

days_since_first_seen INT64, days_since_seen INT64, submission_date DATE, death_time INT64, pattern INT64, active INT64, max_weeks INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/map/","title":"map","text":"

Functions for working with arrays of key/value structs.

"},{"location":"mozfun/map/#extract_keyed_scalar_sum-udf","title":"extract_keyed_scalar_sum (UDF)","text":"

Sums all values in a keyed scalar.

"},{"location":"mozfun/map/#extract-keyed-scalar-sum","title":"Extract Keyed Scalar Sum","text":"

Takes a keyed scalar and returns a single number: the sum of all values it contains. The expected input type is ARRAY<STRUCT<key STRING, value INT64>>

The return type is INT64.

The key field will be ignored.

"},{"location":"mozfun/map/#parameters","title":"Parameters","text":"

INPUTS

keyed_scalar ARRAY<STRUCT<key STRING, value INT64>>\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/map/#from_lists-udf","title":"from_lists (UDF)","text":"

Create a map from two arrays (like zipping)

"},{"location":"mozfun/map/#parameters_1","title":"Parameters","text":"

INPUTS

keys ANY TYPE, `values` ANY TYPE\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/map/#get_key-udf","title":"get_key (UDF)","text":"

Fetch the value associated with a given key from an array of key/value structs.

Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields.
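
A minimal usage sketch (hypothetical map):

SELECT\n  mozfun.map.get_key(\n    [STRUCT('a' AS key, 1 AS value), STRUCT('b' AS key, 2 AS value)],\n    'b'\n  )\n-- expected: 2\n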

"},{"location":"mozfun/map/#parameters_2","title":"Parameters","text":"

INPUTS

map ANY TYPE, k ANY TYPE\n

Source | Edit

"},{"location":"mozfun/map/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":"

Fetch the value associated with a given key from an array of key/value structs.

Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields. This version matches NULL keys as well.

"},{"location":"mozfun/map/#parameters_3","title":"Parameters","text":"

INPUTS

map ANY TYPE, k ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/map/#mode_last-udf","title":"mode_last (UDF)","text":"

Combine entries from multiple maps, determining the value for each key using mozfun.stats.mode_last.

"},{"location":"mozfun/map/#parameters_4","title":"Parameters","text":"

INPUTS

entries ANY TYPE\n

Source | Edit

"},{"location":"mozfun/map/#set_key-udf","title":"set_key (UDF)","text":"

Set a key to a value in a map. If you call map.get_key after setting, the value you set will be returned.

map.set_key

Set a key to a specific value in a map. We represent maps as Arrays of Key/Value structs: ARRAY<STRUCT<key ANY TYPE, value ANY TYPE>>.

The type of the key and value you are setting must match the types in the map itself.
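
A minimal usage sketch (hypothetical map):

SELECT\n  mozfun.map.set_key([STRUCT('a' AS key, 1 AS value)], 'b', 2)\n-- expected: a map containing both ('a', 1) and ('b', 2)\n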

"},{"location":"mozfun/map/#parameters_5","title":"Parameters","text":"

INPUTS

map ANY TYPE, new_key ANY TYPE, new_value ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/map/#sum-udf","title":"sum (UDF)","text":"

Return the sum of values by key in an array of map entries. The expected schema for entries is ARRAY<STRUCT<key, value>>, where the type for value must be supported by SUM, which allows numeric data types INT64, NUMERIC, and FLOAT64."},{"location":"mozfun/map/#parameters_6","title":"Parameters","text":"

INPUTS

entries ANY TYPE\n

Source | Edit

"},{"location":"mozfun/marketing/","title":"Marketing","text":""},{"location":"mozfun/marketing/#parse_ad_group_name-udf","title":"parse_ad_group_name (UDF)","text":"

Parse segments from an ad group name. Extracts things like country, language, or audience.

"},{"location":"mozfun/marketing/#parse-ad-group-name-udf","title":"Parse Ad Group Name UDF","text":"

This function takes an ad group name and parses out known segments. These segments are things like country, language, or audience; multiple ad groups can share segments.

We use versioned ad group names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the ad group name. We track the versions in this spreadsheet.

For a history of this naming scheme, see the original proposal.

See also: marketing.parse_campaign_name, which does the same, but for campaign names.

"},{"location":"mozfun/marketing/#parameters","title":"Parameters","text":"

INPUTS

ad_group_name STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/marketing/#parse_campaign_name-udf","title":"parse_campaign_name (UDF)","text":"

Parse a campaign name. Extracts things like region, country_code, and language.

"},{"location":"mozfun/marketing/#parse-campaign-name-udf","title":"Parse Campaign Name UDF","text":"

This function takes a campaign name and parses out known segments. These segments are things like country, language, or audience; multiple campaigns can share segments.

We use versioned campaign names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the campaign name. We track the versions in this spreadsheet.

For a history of this naming scheme, see the original proposal.
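
A minimal usage sketch; the campaign name below is invented and may not match any real naming version, so treat the output shape (an array of key/value segments) as the point of the example:

SELECT\n  mozfun.marketing.parse_campaign_name('gads_v1_example_campaign_name') AS segments\n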

"},{"location":"mozfun/marketing/#parameters_1","title":"Parameters","text":"

INPUTS

campaign_name STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/marketing/#parse_creative_name-udf","title":"parse_creative_name (UDF)","text":"

Parse segments from a creative name.

"},{"location":"mozfun/marketing/#parse-creative-name-udf","title":"Parse Creative Name UDF","text":"

This function takes a creative name and parses out known segments. These segments are things like country, language, or audience; multiple creatives can share segments.

We use versioned creative names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the creative name. We track the versions in this spreadsheet.

For a history of this naming scheme, see the original proposal.

See also: marketing.parse_campaign_name, which does the same, but for campaign names.

"},{"location":"mozfun/marketing/#parameters_2","title":"Parameters","text":"

INPUTS

creative_name STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/mobile_search/","title":"Mobile search","text":""},{"location":"mozfun/mobile_search/#normalize_app_name-udf","title":"normalize_app_name (UDF)","text":"

Returns normalized_app_name and normalized_app_name_os (for mobile search tables only).

"},{"location":"mozfun/mobile_search/#normalized-app-and-os-name-for-mobile-search-related-tables","title":"Normalized app and os name for mobile search related tables","text":"

Takes app name and os as input and returns a struct of normalized_app_name and normalized_app_name_os, based on the discussion provided here.

"},{"location":"mozfun/mobile_search/#parameters","title":"Parameters","text":"

INPUTS

app_name STRING, os STRING\n

OUTPUTS

STRUCT<normalized_app_name STRING, normalized_app_name_os STRING>\n

Source | Edit

"},{"location":"mozfun/norm/","title":"norm","text":"

Functions for normalizing data.

"},{"location":"mozfun/norm/#browser_version_info-udf","title":"browser_version_info (UDF)","text":"

Adds metadata related to the browser version in a struct.

This is a temporary solution that allows browser version analysis. It should eventually be replaced with one or more browser version tables that serves as a source of truth for version releases.

"},{"location":"mozfun/norm/#parameters","title":"Parameters","text":"

INPUTS

version_string STRING\n

OUTPUTS

STRUCT<version STRING, major_version NUMERIC, minor_version NUMERIC, patch_revision NUMERIC, is_major_release BOOLEAN>\n

Source | Edit

"},{"location":"mozfun/norm/#diff_months-udf","title":"diff_months (UDF)","text":"

Determine the number of whole months after grace period between start and end. Month is dependent on timezone, so start and end must both be datetimes, or both be dates, in the correct timezone. Grace period can be used to account for billing delay, usually 1 day, and is counted after months. When inclusive is FALSE, start and end are not included in whole months. For example, diff_months(start => '2021-01-01', end => '2021-03-01', grace_period => INTERVAL 0 day, inclusive => FALSE) returns 1, because start plus two months plus grace period is not less than end. Changing inclusive to TRUE returns 2, because start plus two months plus grace period is less than or equal to end. diff_months(start => '2021-01-01', end => '2021-03-02 00:00:00.000001', grace_period => INTERVAL 1 DAY, inclusive => FALSE) returns 2, because start plus two months plus grace period is less than end.
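
The examples from the description, written as a runnable query (positional arguments; the prose above uses named-argument notation):

SELECT\n  mozfun.norm.diff_months(DATETIME '2021-01-01', DATETIME '2021-03-01', INTERVAL 0 DAY, FALSE),  -- 1\n  mozfun.norm.diff_months(DATETIME '2021-01-01', DATETIME '2021-03-01', INTERVAL 0 DAY, TRUE)  -- 2\n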

"},{"location":"mozfun/norm/#parameters_1","title":"Parameters","text":"

INPUTS

start DATETIME, `end` DATETIME, grace_period INTERVAL, inclusive BOOLEAN\n

Source | Edit

"},{"location":"mozfun/norm/#extract_version-udf","title":"extract_version (UDF)","text":"

Extracts numeric version data from a version string like <major>.<minor>.<patch>.

Note: Non-zero minor and patch versions will be floating point Numeric.

Usage:

SELECT\n    mozfun.norm.extract_version(version_string, 'major') as major_version,\n    mozfun.norm.extract_version(version_string, 'minor') as minor_version,\n    mozfun.norm.extract_version(version_string, 'patch') as patch_version\n

Example using \"96.05.01\":

SELECT\n    mozfun.norm.extract_version('96.05.01', 'major') as major_version, -- 96\n    mozfun.norm.extract_version('96.05.01', 'minor') as minor_version, -- 5\n    mozfun.norm.extract_version('96.05.01', 'patch') as patch_version  -- 1\n
"},{"location":"mozfun/norm/#parameters_2","title":"Parameters","text":"

INPUTS

version_string STRING, extraction_level STRING\n

OUTPUTS

NUMERIC\n

Source | Edit

"},{"location":"mozfun/norm/#fenix_app_info-udf","title":"fenix_app_info (UDF)","text":"

Returns canonical, human-understandable identification info for Fenix sources.

The Glean telemetry library for Android by design routes pings based on the Play Store appId value of the published application. As of August 2020, there have been 5 separate Play Store appId values associated with different builds of Fenix, each corresponding to different datasets in BigQuery, and the mapping of appId to logical app names (Firefox vs. Firefox Preview) and channel names (nightly, beta, or release) has changed over time; see the spreadsheet of naming history for Mozilla's mobile browsers.

This function is intended as the source of truth for how to map a specific ping in BigQuery to a logical app name and channel. It should be expected that the output of this function may evolve over time. If we rename a product or channel, we may choose to update the values here so that analyses consistently get the new name.

The first argument (app_id) can be fairly fuzzy; it is tolerant of actual Google Play Store appId values like 'org.mozilla.firefox_beta' (mix of periods and underscores) as well as BigQuery dataset names with suffixes like 'org_mozilla_firefox_beta_stable'.

The second argument (app_build_id) should be the value in client_info.app_build.

The function returns a STRUCT that contains the logical app_name and channel as well as the Play Store app_id in the canonical form which would appear in Play Store URLs.

Note that the naming of Fenix applications changed on 2020-07-03, so to get a continuous view of the pings associated with a logical app channel, you may need to union together tables from multiple BigQuery datasets. To see data for all Fenix channels together, it is necessary to union together tables from all 5 datasets. For basic usage information, consider using telemetry.fenix_clients_last_seen which already handles the union. Otherwise, see the example below as a template for how to construct a custom union.

Mapping of channels to datasets:

-- Example of a query over all Fenix builds advertised as \"Firefox Beta\"\nCREATE TEMP FUNCTION extract_fields(app_id STRING, m ANY TYPE) AS (\n  (\n    SELECT AS STRUCT\n      m.submission_timestamp,\n      m.metrics.string.geckoview_version,\n      mozfun.norm.fenix_app_info(app_id, m.client_info.app_build).*\n  )\n);\n\nWITH base AS (\n  SELECT\n    extract_fields('org_mozilla_firefox_beta', m).*\n  FROM\n    `mozdata.org_mozilla_firefox_beta.metrics` AS m\n  UNION ALL\n  SELECT\n    extract_fields('org_mozilla_fenix', m).*\n  FROM\n    `mozdata.org_mozilla_fenix.metrics` AS m\n)\nSELECT\n  DATE(submission_timestamp) AS submission_date,\n  geckoview_version,\n  COUNT(*)\nFROM\n  base\nWHERE\n  app_name = 'Fenix'  -- excludes 'Firefox Preview'\n  AND channel = 'beta'\n  AND DATE(submission_timestamp) = '2020-08-01'\nGROUP BY\n  submission_date,\n  geckoview_version\n
"},{"location":"mozfun/norm/#parameters_3","title":"Parameters","text":"

INPUTS

app_id STRING, app_build_id STRING\n

OUTPUTS

STRUCT<app_name STRING, channel STRING, app_id STRING>\n

Source | Edit

"},{"location":"mozfun/norm/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"

Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.

Fenix originally used an 8-digit app_build format

In short it is yDDDHHmm:

The last date seen with an 8-digit build ID is 2020-08-10.

Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.

This function tolerates both formats.

After using this you may wish to DATETIME_TRUNC(result, DAY) for grouping by build date.

"},{"location":"mozfun/norm/#parameters_4","title":"Parameters","text":"

INPUTS

app_build STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/norm/#firefox_android_package_name_to_channel-udf","title":"firefox_android_package_name_to_channel (UDF)","text":"

Map Fenix package name to the channel name

"},{"location":"mozfun/norm/#parameters_5","title":"Parameters","text":"

INPUTS

package_name STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/norm/#get_earliest_value-udf","title":"get_earliest_value (UDF)","text":"

This UDF returns the earliest not-null value pair and datetime from a list of values and their corresponding timestamp.

The function will return the first value pair in the input array that is not null and has the earliest timestamp.

Because there may be more than one value on the same date e.g. more than one value reported by different pings on the same date, the dates must be given as TIMESTAMPS and the values as STRING.

Usage:

SELECT\n   mozfun.norm.get_earliest_value(ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>) AS <alias>\n
"},{"location":"mozfun/norm/#parameters_6","title":"Parameters","text":"

INPUTS

value_set ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>\n

OUTPUTS

STRUCT<earliest_value STRING, earliest_value_source STRING, earliest_date DATETIME>\n

Source | Edit

"},{"location":"mozfun/norm/#get_windows_info-udf","title":"get_windows_info (UDF)","text":"

Extract the name, the version name, the version number, and the build number corresponding to a Microsoft Windows operating system version string in the form of x.y.z or w.x.y.z for most release versions of Windows after 2007."},{"location":"mozfun/norm/#windows-names-versions-and-builds","title":"Windows Names, Versions, and Builds","text":""},{"location":"mozfun/norm/#summary","title":"Summary","text":"

This function is primarily designed to parse the field os_version in table mozdata.default_browser_agent.default_browser. Given a Microsoft Windows OS version string, the function returns the name of the operating system, the version name, the version number, and the build number corresponding to the operating system. As of November 2022, the parser can handle 99.89% of the os_version values collected in table mozdata.default_browser_agent.default_browser.

"},{"location":"mozfun/norm/#status-as-of-november-2022","title":"Status as of November 2022","text":"

As of November 2022, the expected valid values of os_version are either x.y.z or w.x.y.z where w, x, y, and z are integers.

As of November 2022, the return values for Windows 10 and Windows 11 are based on Windows 10 release information and Windows 11 release information. For 3-number version strings, the parser assumes the valid values of z in x.y.z are at most 5 digits in length. For 4-number version strings, the parser assumes the valid values of z in w.x.y.z are at most 6 digits in length. The function makes an educated effort to handle Windows Vista, Windows 7, Windows 8, and Windows 8.1 information, but does not guarantee the return values are absolutely accurate. The function assumes the presence of undocumented non-release versions of Windows 10 and Windows 11, and will return an estimated name, version number, build number but not the version name. The function does not handle other versions of Windows.

As of November 2022, the parser currently handles just over 99.89% of data in the field os_version in table mozdata.default_browser_agent.default_browser.

"},{"location":"mozfun/norm/#build-number-conventions","title":"Build number conventions","text":"

Note: Microsoft's convention for build numbers for Windows 10 and 11 includes two numbers, such as build number 22621.900 for version 22621. The first number repeats the version number and the second number uniquely identifies the build within the version. To simplify data processing and data analysis, this function returns the second unique identifier as an integer instead of returning the full build number as a string.

"},{"location":"mozfun/norm/#example-usage","title":"Example usage","text":"
SELECT\n  `os_version`,\n  mozfun.norm.get_windows_info(`os_version`) AS windows_info\nFROM `mozdata.default_browser_agent.default_browser`\nWHERE `submission_timestamp` > (CURRENT_TIMESTAMP() - INTERVAL 7 DAY) AND LEFT(document_id, 2) = '00'\nLIMIT 1000\n
"},{"location":"mozfun/norm/#mapping","title":"Mapping","text":"os_version windows_name windows_version_name windows_version_number windows_build_number 6.0.z Windows Vista 6.0 6.0 z 6.1.z Windows 7 7.0 6.1 z 6.2.z Windows 8 8.0 6.2 z 6.3.z Windows 8.1 8.1 6.3 z 10.0.10240.z Windows 10 1507 10240 z 10.0.10586.z Windows 10 1511 10586 z 10.0.14393.z Windows 10 1607 14393 z 10.0.15063.z Windows 10 1703 15063 z 10.0.16299.z Windows 10 1709 16299 z 10.0.17134.z Windows 10 1803 17134 z 10.0.17763.z Windows 10 1809 17763 z 10.0.18362.z Windows 10 1903 18362 z 10.0.18363.z Windows 10 1909 18363 z 10.0.19041.z Windows 10 2004 19041 z 10.0.19042.z Windows 10 20H2 19042 z 10.0.19043.z Windows 10 21H1 19043 z 10.0.19044.z Windows 10 21H2 19044 z 10.0.19045.z Windows 10 22H2 19045 z 10.0.y.z Windows 10 UNKNOWN y z 10.0.22000.z Windows 11 21H2 22000 z 10.0.22621.z Windows 11 22H2 22621 z 10.0.y.z Windows 11 UNKNOWN y z all other values (null) (null) (null) (null)"},{"location":"mozfun/norm/#parameters_7","title":"Parameters","text":"

INPUTS

os_version STRING\n

OUTPUTS

STRUCT<name STRING, version_name STRING, version_number DECIMAL, build_number INT64>\n

Source | Edit

"},{"location":"mozfun/norm/#glean_baseline_client_info-udf","title":"glean_baseline_client_info (UDF)","text":"

Accepts a glean client_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.

"},{"location":"mozfun/norm/#parameters_8","title":"Parameters","text":"

INPUTS

client_info ANY TYPE, metrics ANY TYPE\n

OUTPUTS

string\n

Source | Edit

"},{"location":"mozfun/norm/#glean_ping_info-udf","title":"glean_ping_info (UDF)","text":"

Accepts a glean ping_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.

"},{"location":"mozfun/norm/#parameters_9","title":"Parameters","text":"

INPUTS

ping_info ANY TYPE\n

Source | Edit

"},{"location":"mozfun/norm/#metadata-udf","title":"metadata (UDF)","text":"

Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.

"},{"location":"mozfun/norm/#parameters_10","title":"Parameters","text":"

INPUTS

metadata ANY TYPE\n

OUTPUTS

`date`, CAST(NULL\n

Source | Edit

"},{"location":"mozfun/norm/#os-udf","title":"os (UDF)","text":"

Normalize an operating system string to one of the three major desktop platforms, one of the two major mobile platforms, or \"Other\".

This is a reimplementation of logic used in the data pipeline to populate normalized_os.
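
A minimal usage sketch (the expected output assumes 'Windows_NT' normalizes to 'Windows', consistent with the pipeline logic):

SELECT\n  mozfun.norm.os('Windows_NT') AS normalized_os\n-- expected: 'Windows'\n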

"},{"location":"mozfun/norm/#parameters_11","title":"Parameters","text":"

INPUTS

os STRING\n

Source | Edit

"},{"location":"mozfun/norm/#product_info-udf","title":"product_info (UDF)","text":"

Returns a normalized app_name and canonical_app_name for a product based on legacy_app_name and normalized_os values. Thus, this function serves as a bridge to get from legacy application identifiers to the consistent identifiers we are using for reporting in 2021.

As of 2021, most Mozilla products are sending telemetry via the Glean SDK, with Glean telemetry in active development for desktop Firefox as well. The probeinfo API is the single source of truth for metadata about applications sending Glean telemetry; the values for app_name and canonical_app_name returned here correspond to the \"end-to-end identifier\" values documented in the v2 Glean app listings endpoint . For non-Glean telemetry, we provide values in the same style to provide continuity as we continue the migration to Glean.

For legacy telemetry pings like main ping for desktop and core ping for mobile products, the legacy_app_name given as input to this function should come from the submission URI (stored as metadata.uri.app_name in BigQuery ping tables). For Glean pings, we have invented product values that can be passed in to this function as the legacy_app_name parameter.

The returned app_name values are intended to be readable and unambiguous, but short and easy to type. They are suitable for use as a key in derived tables. product is a deprecated field that was similar in intent.

The returned canonical_app_name is more verbose and is suited for displaying in visualizations. canonical_name is a synonym that we provide for historical compatibility with previous versions of this function.

The returned struct also contains boolean contributes_to_2021_kpi as the canonical reference for whether the given application is included in KPI reporting. Additional fields may be added for future years.

The normalized_os value that's passed in should be the top-level normalized_os value present in any ping table or you may want to wrap a raw value in mozfun.norm.os like mozfun.norm.product_info(app_name, mozfun.norm.os(os)).

This function also tolerates passing in a product value as legacy_app_name so that this function is still useful for derived tables which have thrown away the raw app_name value from legacy pings.

The mappings are as follows:

legacy_app_name normalized_os app_name product canonical_app_name 2019 2020 2021 Firefox * firefox_desktop Firefox Firefox for Desktop true true true Fenix Android fenix Fenix Firefox for Android (Fenix) true true true Fennec Android fennec Fennec Firefox for Android (Fennec) true true true Firefox Preview Android firefox_preview Firefox Preview Firefox Preview for Android true true true Fennec iOS firefox_ios Firefox iOS Firefox for iOS true true true FirefoxForFireTV Android firefox_fire_tv Firefox Fire TV Firefox for Fire TV false false false FirefoxConnect Android firefox_connect Firefox Echo Firefox for Echo Show true true false Zerda Android firefox_lite Firefox Lite Firefox Lite true true false Zerda_cn Android firefox_lite_cn Firefox Lite CN Firefox Lite (China) false false false Focus Android focus_android Focus Android Firefox Focus for Android true true true Focus iOS focus_ios Focus iOS Firefox Focus for iOS true true true Klar Android klar_android Klar Android Firefox Klar for Android false false false Klar iOS klar_ios Klar iOS Firefox Klar for iOS false false false Lockbox Android lockwise_android Lockwise Android Lockwise for Android true true false Lockbox iOS lockwise_ios Lockwise iOS Lockwise for iOS true true false FirefoxReality* Android firefox_reality Firefox Reality Firefox Reality false false false"},{"location":"mozfun/norm/#parameters_12","title":"Parameters","text":"

INPUTS

legacy_app_name STRING, normalized_os STRING\n

OUTPUTS

STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n

Source | Edit

"},{"location":"mozfun/norm/#result_type_to_product_name-udf","title":"result_type_to_product_name (UDF)","text":"

Convert urlbar result types into product-friendly names

This UDF converts result types from urlbar events (engagement, impression, abandonment) into product-friendly names.

"},{"location":"mozfun/norm/#parameters_13","title":"Parameters","text":"

INPUTS

res STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/norm/#truncate_version-udf","title":"truncate_version (UDF)","text":"

Truncates a version string like <major>.<minor>.<patch> to either the major or minor version. The return value is NUMERIC, which means that you can sort the results without fear (e.g. 100 will be categorized as greater than 80, which isn't the case when sorting lexicographically).

For example, \"5.1.0\" would be translated to 5.1 if the parameter is \"minor\" or 5 if the parameter is major.

If the version is only a major and/or minor version, then it will be left unchanged (for example \"10\" would stay as 10 when run through this function, no matter what the arguments).

This is useful for grouping Linux and Mac operating system versions inside aggregate datasets or queries where there may be many different patch releases in the field.
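
The example from the description as a runnable query:

SELECT\n  mozfun.norm.truncate_version('5.1.0', 'minor') AS minor_truncated,  -- 5.1\n  mozfun.norm.truncate_version('5.1.0', 'major') AS major_truncated  -- 5\n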

"},{"location":"mozfun/norm/#parameters_14","title":"Parameters","text":"

INPUTS

os_version STRING, truncation_level STRING\n

OUTPUTS

NUMERIC\n

Source | Edit

"},{"location":"mozfun/norm/#vpn_attribution-udf","title":"vpn_attribution (UDF)","text":"

Accepts vpn attribution fields as input and returns a struct of normalized fields.

"},{"location":"mozfun/norm/#parameters_15","title":"Parameters","text":"

INPUTS

utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n

OUTPUTS

STRUCT<normalized_acquisition_channel STRING, normalized_campaign STRING, normalized_content STRING, normalized_medium STRING, normalized_source STRING, website_channel_group STRING>\n

Source | Edit

"},{"location":"mozfun/norm/#windows_version_info-udf","title":"windows_version_info (UDF)","text":"

Given an unnormalized set of Windows identifiers, return a friendly version of the operating system name.

Requires os, os_version and windows_build_number.

E.g., windows_build_number >= 22000 returns Windows 11.

"},{"location":"mozfun/norm/#parameters_16","title":"Parameters","text":"

INPUTS

os STRING, os_version STRING, windows_build_number INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/serp_events/","title":"serp_events","text":"

Functions for working with Glean SERP events.

"},{"location":"mozfun/serp_events/#ad_blocker_inferred-udf","title":"ad_blocker_inferred (UDF)","text":"

Determine whether an ad blocker is inferred to be in use on a SERP. True if all loaded ads are blocked.

"},{"location":"mozfun/serp_events/#parameters","title":"Parameters","text":"

INPUTS

num_loaded INT, num_blocked INT\n

OUTPUTS

BOOL\n

Source | Edit

"},{"location":"mozfun/serp_events/#is_ad_component-udf","title":"is_ad_component (UDF)","text":"

Determine whether a SERP display component referenced in the serp events contains monetizable ads

"},{"location":"mozfun/serp_events/#parameters_1","title":"Parameters","text":"

INPUTS

component STRING\n

OUTPUTS

BOOL\n

Source | Edit

"},{"location":"mozfun/stats/","title":"stats","text":"

Statistics functions.

"},{"location":"mozfun/stats/#mode_last-udf","title":"mode_last (UDF)","text":"

Returns the most frequently occurring element in an array.

In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored. See also: stats.mode_last_retain_nulls, which retains nulls.
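
A minimal usage sketch (hypothetical list):

SELECT\n  mozfun.stats.mode_last(['a', 'b', 'b', 'c', 'c'])\n-- 'b' and 'c' are tied; 'c' appears latest in the array, so 'c' is expected\n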

"},{"location":"mozfun/stats/#parameters","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"mozfun/stats/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":"

Returns the most frequently occurring element in an array. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are retained. See also: stats.mode_last, which ignores nulls.

"},{"location":"mozfun/stats/#parameters_1","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/utils/","title":"Utils","text":""},{"location":"mozfun/utils/#diff_query_schemas-stored-procedure","title":"diff_query_schemas (Stored Procedure)","text":"

Diff the schemas of two queries. Especially useful when the BigQuery error is truncated, and the schemas of e.g. a UNION don't match.

Use it like:

DECLARE res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>;\nCALL mozfun.utils.diff_query_schemas(\"\"\"SELECT * FROM a\"\"\", \"\"\"SELECT * FROM b\"\"\", res);\n-- See entire schema entries, if you need context\nSELECT res;\n-- See just the elements that differ\nSELECT * FROM UNNEST(res) WHERE differs;\n

You'll be able to view the results of \"res\" to compare the schemas of the two queries, and hopefully find what doesn't match.

"},{"location":"mozfun/utils/#parameters","title":"Parameters","text":"

INPUTS

query_a STRING, query_b STRING\n

OUTPUTS

res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>\n

Source | Edit

"},{"location":"mozfun/utils/#extract_utm_from_url-udf","title":"extract_utm_from_url (UDF)","text":"

Extract UTM parameters from a URL. Returns a STRUCT. UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.

This UDF extracts UTM parameters from a URL string.

UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
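
A minimal usage sketch (hypothetical URL):

SELECT\n  mozfun.utils.extract_utm_from_url(\n    'https://www.mozilla.org/?utm_source=newsletter&utm_medium=email&utm_campaign=example'\n  ) AS utm\n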

"},{"location":"mozfun/utils/#parameters_1","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRUCT<utm_source STRING, utm_medium STRING, utm_campaign STRING, utm_content STRING, utm_term STRING>\n

Source | Edit

"},{"location":"mozfun/utils/#get_url_path-udf","title":"get_url_path (UDF)","text":"

Extract the Path from a URL

This UDF extracts path from a URL string.

The path is everything after the host and before parameters. This function returns \"/\" if there is no path.
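
A minimal usage sketch (hypothetical URLs):

SELECT\n  mozfun.utils.get_url_path('https://example.com/en-US/firefox/new/?foo=bar'),  -- expected: '/en-US/firefox/new/'\n  mozfun.utils.get_url_path('https://example.com')  -- expected: '/'\n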

"},{"location":"mozfun/utils/#parameters_2","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/vpn/","title":"vpn","text":"

Functions for processing VPN data.

"},{"location":"mozfun/vpn/#acquisition_channel-udf","title":"acquisition_channel (UDF)","text":"

Assign an acquisition channel based on utm parameters

"},{"location":"mozfun/vpn/#parameters","title":"Parameters","text":"

INPUTS

utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/vpn/#channel_group-udf","title":"channel_group (UDF)","text":"

Assign a channel group based on utm parameters

"},{"location":"mozfun/vpn/#parameters_1","title":"Parameters","text":"

INPUTS

utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/vpn/#normalize_utm_parameters-udf","title":"normalize_utm_parameters (UDF)","text":"

Normalize utm parameters to use the same NULL placeholders as Google Analytics

"},{"location":"mozfun/vpn/#parameters_2","title":"Parameters","text":"

INPUTS

utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n

OUTPUTS

STRUCT<utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING>\n

Source | Edit

"},{"location":"mozfun/vpn/#pricing_plan-udf","title":"pricing_plan (UDF)","text":"

Combine the pricing and interval for a subscription plan into a single field

"},{"location":"mozfun/vpn/#parameters_3","title":"Parameters","text":"

INPUTS

provider STRING, amount INTEGER, currency STRING, `interval` STRING, interval_count INTEGER\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"reference/airflow_tags/","title":"Airflow Tags","text":""},{"location":"reference/airflow_tags/#why","title":"Why","text":"

Airflow tags enable DAGs to be filtered in the web ui view to reduce the number of DAGs shown to just those that you are interested in.

Additionally, tags provide extra information, such as a DAG's impact, to make it easier to understand the DAG and the impact of failures during Airflow triage.

More information and the related discussions can be found in the original Airflow Tags Proposal (located within the data org proposals/ folder).

"},{"location":"reference/airflow_tags/#valid-tags","title":"Valid tags","text":""},{"location":"reference/airflow_tags/#impacttier-tag","title":"impact/tier tag","text":"

We borrow the tiering system used by our integration and testing sheriffs. This is to maintain a level of consistency across different systems to ensure common language and understanding across teams. Valid tier tags include:

"},{"location":"reference/airflow_tags/#triage-tag","title":"triage/ tag","text":"

This tag is meant to provide guidance to a triage engineer on how to respond to a specific DAG failure when the job owner does not want the standard process to be followed.

"},{"location":"reference/configuration/","title":"Configuration","text":"

The behaviour of bqetl can be configured via the bqetl_project.yaml file. This file, for example, specifies the queries that should be skipped during dry run and the views that should not be published, and it contains various other configuration options.

The general structure of bqetl_project.yaml is as follows:

dry_run:\n  function: https://us-central1-moz-fx-data-shared-prod.cloudfunctions.net/bigquery-etl-dryrun\n  test_project: bigquery-etl-integration-test\n  skip:\n  - sql/moz-fx-data-shared-prod/account_ecosystem_derived/desktop_clients_daily_v1/query.sql\n  - sql/**/apple_ads_external*/**/query.sql\n  # - ...\n\nviews:\n  skip_validation:\n  - sql/moz-fx-data-test-project/test/simple_view/view.sql\n  - sql/moz-fx-data-shared-prod/mlhackweek_search/events/view.sql\n  - sql/moz-fx-data-shared-prod/**/client_deduplication/view.sql\n  # - ...\n  skip_publishing:\n  - activity_stream/tile_id_types/view.sql\n  - pocket/pocket_reach_mau/view.sql\n  # - ...\n  non_user_facing_suffixes:\n  - _derived\n  - _external\n  # - ...\n\nschema:\n  skip_update:\n  - sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/schema.yaml\n  # - ...\n  skip_prefixes:\n  - pioneer\n  - rally\n\nroutines:\n  skip_publishing:\n  - sql/moz-fx-data-shared-prod/udf/main_summary_scalars/udf.sql\n\nformatting:\n  skip:\n  - bigquery_etl/glam/templates/*.sql\n  - sql/moz-fx-data-shared-prod/telemetry/fenix_events_v1/view.sql\n  - stored_procedures/safe_crc32_uuid.sql\n  # - ...\n
"},{"location":"reference/configuration/#accessing-configurations","title":"Accessing configurations","text":"

ConfigLoader can be used in the bigquery_etl tooling codebase to access configuration parameters. bqetl_project.yaml is automatically loaded in ConfigLoader and parameters can be accessed via a get() method:

from bigquery_etl.config import ConfigLoader\n\nskipped_formatting = ConfigLoader.get(\"formatting\", \"skip\", fallback=[])\ndry_run_function = ConfigLoader.get(\"dry_run\", \"function\", fallback=None)\nschema_config_dict = ConfigLoader.get(\"schema\")\n

The ConfigLoader.get() method allows multiple string parameters to reference a configuration value that is stored in a nested structure. A fallback value can be optionally provided in case the configuration parameter is not set.

"},{"location":"reference/configuration/#adding-configuration-parameters","title":"Adding configuration parameters","text":"

New configuration parameters can simply be added to bqetl_project.yaml. ConfigLoader.get() can reference these new parameters right away, without any changes to ConfigLoader itself.
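
As a hypothetical sketch, after adding a new section (here called custom_feature) to bqetl_project.yaml, the value can be read immediately:

# bqetl_project.yaml (hypothetical addition):\n#   custom_feature:\n#     enabled_datasets:\n#     - telemetry_derived\n\nfrom bigquery_etl.config import ConfigLoader\n\nenabled_datasets = ConfigLoader.get(\"custom_feature\", \"enabled_datasets\", fallback=[])\n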

"},{"location":"reference/data_checks/","title":"bqetl Data Checks","text":"

Instructions on how to add data checks can be found in the Adding data checks section below.

"},{"location":"reference/data_checks/#background","title":"Background","text":"

To create more confidence and trust in our data, it is crucial to provide some form of data checks. These checks should uncover problems as soon as possible, ideally as part of the data process creating the data. This includes checking that the data produced follows certain assumptions determined by the dataset owner. These assumptions need to be easy to define, but at the same time flexible enough to encode more complex business logic. For example: checks for null columns, range/size properties, duplicates, table grain, etc.

"},{"location":"reference/data_checks/#bqetl-data-checks-to-the-rescue","title":"bqetl Data Checks to the Rescue","text":"

bqetl data checks aim to provide this ability through a simple interface for specifying our \"assumptions\" about the data a query should produce and checking them against the actual result.

This interface is provided by a number of Jinja templates with \"out-of-the-box\" logic for common checks, such as checking whether any nulls are present in a specific column, so the underlying logic does not have to be rewritten each time. These templates can be found here and are available as Jinja macros inside checks.sql files. The logic can be \"configured\" by passing details relevant to the specific dataset, and the check templates get rendered as raw SQL expressions. Take a look at the examples below for practical usage.

It is also possible to write checks using raw SQL by using assertions. This is, for example, useful when writing checks for custom business logic.

"},{"location":"reference/data_checks/#two-categories-of-checks","title":"Two categories of checks","text":"

Each check needs to be categorised with a marker; currently the following markers are available:

Checks can be marked by including one of the markers on the line preceding the check definition; see the Example checks.sql section below for an example.

"},{"location":"reference/data_checks/#adding-data-checks","title":"Adding Data Checks","text":""},{"location":"reference/data_checks/#create-checkssql","title":"Create checks.sql","text":"

Inside the query directory, which usually contains query.sql or query.py, metadata.yaml and schema.yaml, create a new file called checks.sql (unless it already exists).

Please make sure each check you add contains a marker (see: the Two categories of checks section above).

Once checks have been added, we need to regenerate the DAG responsible for scheduling the query.

"},{"location":"reference/data_checks/#update-checkssql","title":"Update checks.sql","text":"

If checks.sql already exists for the query, you can always add additional checks to the file by appending them to the list of already defined checks.

When adding additional checks there is no need to regenerate the DAG responsible for scheduling the query, as all checks are executed using a single Airflow task.

"},{"location":"reference/data_checks/#removing-checkssql","title":"Removing checks.sql","text":"

All checks can be removed by deleting the checks.sql file and regenerating the DAG responsible for scheduling the query.

Alternatively, specific checks can be removed by deleting them from the checks.sql file.

"},{"location":"reference/data_checks/#example-checkssql","title":"Example checks.sql","text":"

Checks can either be written as raw SQL or by referencing existing Jinja macros defined in tests/checks, which take different parameters used to generate the SQL check expression.

Example of what a checks.sql may look like:

-- raw SQL checks\n#fail\nASSERT (\n  (\n    SELECT\n      COUNTIF(ISNULL(country)) / COUNT(*)\n    FROM telemetry.table_v1\n    WHERE submission_date = @submission_date\n  ) > 0.2\n) AS \"More than 20% of clients have country set to NULL\";\n\n-- macro checks\n#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n\n#warn\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n\n#fail\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\") }}\n\n#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#data-checks-available-with-examples","title":"Data Checks Available with Examples","text":""},{"location":"reference/data_checks/#accepted_values-source","title":"accepted_values (source)","text":"

Usage:

Arguments:\n\ncolumn: str - name of the column to check\nvalues: List[str] - list of accepted values\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#warn\n{{ accepted_values(\"column_1\", [\"value_1\", \"value_2\"],\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#in_range-source","title":"in_range (source)","text":"

Usage:

Arguments:\n\ncolumns: List[str] - A list of columns which we want to check the values of.\nmin: Optional[int] - Minimum value we should observe in the specified columns.\nmax: Optional[int] - Maximum value we should observe in the specified columns.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#is_unique-source","title":"is_unique (source)","text":"

Usage:

Arguments:\n\ncolumns: List[str] - A list of columns which should produce a unique record.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#warn\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n
"},{"location":"reference/data_checks/#min_row_countsource","title":"min_row_count(source)","text":"

Usage:

Arguments:\n\nthreshold: Optional[int] - What is the minimum number of rows we expect (default: 1)\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#fail\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#not_null-source","title":"not_null (source)","text":"

Usage:

Arguments:\n\ncolumns: List[str] - A list of columns which should not contain a null value.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n

Please keep in mind that the checks below can be combined and specified in the same checks.sql file. For example:

#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n\n#fail\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n\n#fail\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\") }}\n\n#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#row_count_within_past_partitions_avgsource","title":"row_count_within_past_partitions_avg(source)","text":"

Compares the row count of the current partition to the average of number_of_days past partitions and checks if the row count is within the average +- threshold_percentage %

Usage:

Arguments:\n\nnumber_of_days: int - Number of days we are comparing the row count to\nthreshold_percentage: int - How many percent above or below the average row count is ok.\npartition_field: Optional[str] - What column is the partition_field (default = \"submission_date\")\n

Example:

#fail\n{{ row_count_within_past_partitions_avg(7, 5, \"submission_date\") }}\n

"},{"location":"reference/data_checks/#value_lengthsource","title":"value_length(source)","text":"

Checks that the column has values of specific character length.

Usage:

Arguments:\n\ncolumn: str - Column which will be checked against the `expected_length`.\nexpected_length: int - Describes the expected character length of the value inside the specified columns.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\n

Example:

#warn\n{{ value_length(column=\"country\", expected_length=2, where=\"submission_date = @submission_date\") }}\n

"},{"location":"reference/data_checks/#matches_patternsource","title":"matches_pattern(source)","text":"

Checks that the column values adhere to a pattern based on a regex expression.

Usage:

Arguments:\n\ncolumn: str - Column which values will be checked against the regex.\npattern: str - Regex pattern specifying the expected shape / pattern of the values inside the column.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\nthreshold_fail_percentage: Optional[int] - Percentage of how many rows can fail the check before causing it to fail.\nmessage: Optional[str]: Custom error message.\n

Example:

#warn\n{{ matches_pattern(column=\"country\", pattern=\"^[A-Z]{2}$\", where=\"submission_date = @submission_date\", threshold_fail_percentage=10, message=\"Oops\") }}\n

"},{"location":"reference/data_checks/#running-checks-locally-commands","title":"Running checks locally / Commands","text":"

To list all available commands in the bqetl data checks CLI:

$ ./bqetl check\n\nUsage: bqetl check [OPTIONS] COMMAND [ARGS]...\n\n  Commands for managing and running bqetl data checks.\n\n  \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\n  IN ACTIVE DEVELOPMENT\n\n  The current progress can be found under:\n\n          https://mozilla-hub.atlassian.net/browse/DENG-919\n\n  \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  render  Renders data check query using parameters provided (OPTIONAL).\n  run     Runs data checks defined for the dataset (checks.sql).\n

To see how to use a specific command, use:

$ ./bqetl check [command] --help\n

render

"},{"location":"reference/data_checks/#usage","title":"Usage","text":"
$ ./bqetl check render [OPTIONS] DATASET [ARGS]\n\nRenders data check query using parameters provided (OPTIONAL). The result\nis what would be used to run a check to ensure that the specified dataset\nadheres to the assumptions defined in the corresponding checks.sql file\n\nOptions:\n  --project-id, --project_id TEXT\n                                  GCP project ID\n  --sql_dir, --sql-dir DIRECTORY  Path to directory which contains queries.\n  --help                          Show this message and exit.\n
"},{"location":"reference/data_checks/#example","title":"Example","text":"
./bqetl check render --project_id=moz-fx-data-marketing-prod ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01\n

run

"},{"location":"reference/data_checks/#usage_1","title":"Usage","text":"
$ ./bqetl check run [OPTIONS] DATASET\n\nRuns data checks defined for the dataset (checks.sql).\n\nChecks can be validated using the `--dry_run` flag without executing them:\n\nOptions:\n  --project-id, --project_id TEXT\n                                  GCP project ID\n  --sql_dir, --sql-dir DIRECTORY  Path to directory which contains queries.\n  --dry_run, --dry-run            To dry run the query to make sure it is\n                                  valid\n  --marker TEXT                   Marker to filter checks.\n  --help                          Show this message and exit.\n
"},{"location":"reference/data_checks/#examples","title":"Examples","text":"
# to run checks for a specific dataset\n$ ./bqetl check run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail --marker=warn\n\n# to only dry_run the checks\n$ ./bqetl check run --dry_run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail\n
"},{"location":"reference/incremental/","title":"Incremental Queries","text":""},{"location":"reference/incremental/#benefits","title":"Benefits","text":""},{"location":"reference/incremental/#properties","title":"Properties","text":""},{"location":"reference/public_data/","title":"Public Data","text":"

For background, see Accessing Public Data on docs.telemetry.mozilla.org.

"},{"location":"reference/recommended_practices/","title":"Recommended practices","text":""},{"location":"reference/recommended_practices/#queries","title":"Queries","text":""},{"location":"reference/recommended_practices/#querying-metrics","title":"Querying Metrics","text":""},{"location":"reference/recommended_practices/#query-metadata","title":"Query Metadata","text":"
friendly_name: SSL Ratios\ndescription: >\n  Percentages of page loads Firefox users have performed that were\n  conducted over SSL broken down by country.\nowners:\n  - example@mozilla.com\nlabels:\n  application: firefox\n  incremental: true # incremental queries add data to existing tables\n  schedule: daily # scheduled in Airflow to run daily\n  public_json: true\n  public_bigquery: true\n  review_bugs:\n    - 1414839 # Bugzilla bug ID of data review\n  incremental_export: false # non-incremental JSON export writes all data to a single location\n
"},{"location":"reference/recommended_practices/#views","title":"Views","text":""},{"location":"reference/recommended_practices/#udfs","title":"UDFs","text":""},{"location":"reference/recommended_practices/#large-backfills","title":"Large Backfills","text":""},{"location":"reference/scheduling/","title":"Scheduling Queries in Airflow","text":""},{"location":"reference/stage-deploys-continuous-integration/","title":"Stage Deploys","text":""},{"location":"reference/stage-deploys-continuous-integration/#stage-deploys-in-continuous-integration","title":"Stage Deploys in Continuous Integration","text":"

Before changes, such as adding new fields to existing datasets or adding new datasets, can be deployed to production, bigquery-etl's CI (continuous integration) deploys these changes to a stage environment and uses these stage artifacts to run its various checks.

Currently, the bigquery-etl-integration-test project serves as the stage environment. CI has read and write access, but at no point publishes actual data to this project; only UDFs, table schemas and views are published. The project itself does not have access to any production project, like mozdata, so stage artifacts cannot reference any other artifacts that live in production.

Deploying artifacts to stage follows these steps:

1. Once a new pull-request gets created in bigquery-etl, CI pulls in the generated-sql branch to determine all files that show changes compared to what is deployed in production (it is assumed that the generated-sql branch reflects the artifacts currently deployed in production). All of these changed artifacts (UDFs, tables and views) will be deployed to the stage environment.
    * This CI step runs after the generate-sql CI step to ensure that checks are also executed on generated queries and that schema.yaml files have been automatically created for queries.
2. The bqetl CLI has a command to run stage deploys, which is called in CI: ./bqetl stage deploy --dataset-suffix=$CIRCLE_SHA1 $FILE_PATHS
    * --dataset-suffix results in the artifacts being deployed to datasets that are suffixed by the current commit hash. This prevents conflicts when deploying changes for the same artifacts in parallel and helps with debugging deployed artifacts.
3. For every artifact that gets deployed to stage, all of its dependencies need to be determined and deployed to the stage environment as well, since the stage environment doesn't have access to production. Before these artifacts are actually deployed, they are determined by traversing the artifact definitions.
    * Determining dependencies is only relevant for UDFs and views. For queries, available schema.yaml files are simply deployed.
    * For UDFs, if a UDF calls another UDF then that UDF needs to be deployed to stage as well.
    * For views, if a view references another view, table or UDF then each of these referenced artifacts needs to be available on stage as well, otherwise the view cannot be deployed to stage.
    * If referenced artifacts are not defined as part of the bigquery-etl repo (like stable or live tables), their schema is determined and a placeholder query.sql file is created.
    * Dependencies of dependencies also need to be deployed, and so on.
4. Once all artifacts that need to be deployed have been determined, all references to these artifacts in existing SQL files are updated to point to the stage project and the temporary datasets that the artifacts will be published to.
    * Artifacts that get deployed are determined from the files that got changed and any artifacts that are referenced in the SQL definitions of these files, as well as their references, and so on.
5. To run the deploy, all artifacts are copied to sql/bigquery-etl-integration-test into their corresponding temporary datasets.
    * Any existing SQL tests that are related to changed artifacts will have their referenced artifacts updated and get copied to a bigquery-etl-integration-test folder.
    * The deploy is executed in the order: UDFs, tables, views.
    * UDFs and views get deployed in a way that ensures the right order of deployments (e.g. dependencies are deployed before the views referencing them).
6. Once the deploy has completed, CI uses these staged artifacts to run its tests.
7. After checks have succeeded, the deployed artifacts are removed from stage.
    * By default the table expiration is set to 1 hour.
    * This step also automatically removes any tables and datasets that were previously deployed and are older than an hour but haven't been removed (for example due to a CI check failing).

After CI checks have passed and the pull-request has been approved, changes can be merged to main. Once a new version of bigquery-etl has been published, the changes can be deployed to production through the bqetl_artifact_deployment Airflow DAG. For more information on artifact deployments to production see: https://docs.telemetry.mozilla.org/concepts/pipeline/artifact_deployment.html

"},{"location":"reference/stage-deploys-continuous-integration/#local-deploys-to-stage","title":"Local Deploys to Stage","text":"

Local changes can be deployed to stage using the ./bqetl stage deploy command:

./bqetl stage deploy \\\n  --dataset-suffix=test \\\n  --copy-sql-to-tmp-dir \\\n  sql/moz-fx-data-shared-prod/firefox_ios/new_profile_activation/view.sql \\\n  sql/mozfun/map/sum/udf.sql\n

Files (for example, ones with changes) that should be deployed to stage need to be specified. The stage deploy accepts the following parameters:

* --dataset-suffix is an optional suffix that will be added to the datasets deployed to stage.
* --copy-sql-to-tmp-dir copies SQL stored in sql/ to a temporary folder. Reference updates and any other modifications required to run the stage deploy are performed in this temporary directory. This is an optional parameter; if not specified, changes get applied to the files directly and can be reverted, for example, by running git checkout -- sql/.
* --remove-updated-artifacts (optional) removes artifact files that have been deployed from the \"prod\" folders. This ensures that tests don't run on outdated or undeployed artifacts.

Deployed stage artifacts can be deleted from bigquery-etl-integration-test by running:

./bqetl stage clean --delete-expired --dataset-suffix=test\n
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"bqetl/","title":"bqetl CLI","text":"

The bqetl command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.

Running some commands, for example to create or query tables, will require Mozilla GCP access.

"},{"location":"bqetl/#installation","title":"Installation","text":"

Follow the Quick Start to set up bigquery-etl and the bqetl CLI.

"},{"location":"bqetl/#configuration","title":"Configuration","text":"

bqetl can be configured via the bqetl_project.yaml file. See Configuration to find available configuration options.

"},{"location":"bqetl/#commands","title":"Commands","text":"

To list all available commands in the bqetl CLI:

$ ./bqetl\n\nUsage: bqetl [OPTIONS] COMMAND [ARGS]...\n\n  CLI tools for working with bigquery-etl.\n\nOptions:\n  --version  Show the version and exit.\n  --help     Show this message and exit.\n\nCommands:\n  alchemer    Commands for importing alchemer data.\n  dag         Commands for managing DAGs.\n  dependency  Build and use query dependency graphs.\n  dryrun      Dry run SQL.\n  format      Format SQL.\n  glam        Tools for GLAM ETL.\n  mozfun      Commands for managing mozfun routines.\n  query       Commands for managing queries.\n  routine     Commands for managing routines.\n  stripe      Commands for Stripe ETL.\n  view        Commands for managing views.\n  backfill    Commands for managing backfills.\n

See help for any command:

$ ./bqetl [command] --help\n
"},{"location":"bqetl/#autocomplete","title":"Autocomplete","text":"

CLI autocomplete for bqetl can be enabled for bash and zsh shells using the script/bqetl_complete script:

source script/bqetl_complete\n

Then pressing tab after bqetl commands should print possible commands, e.g. for zsh:

% bqetl query<TAB><TAB>\nbackfill       -- Run a backfill for a query.\ncreate         -- Create a new query with name...\ninfo           -- Get information about all or specific...\ninitialize     -- Run a full backfill on the destination...\nrender         -- Render a query Jinja template.\nrun            -- Run a query.\n...\n

source script/bqetl_complete can also be added to ~/.bashrc or ~/.zshrc to persist settings across shell instances.

For more details on shell completion, see the click documentation.

"},{"location":"bqetl/#query","title":"query","text":"

Commands for managing queries.

"},{"location":"bqetl/#create","title":"create","text":"

Create a new query with name <dataset>.<table_name>, for example: telemetry_derived.active_profiles. Use the --project_id option to change the project the query is added to; default is moz-fx-data-shared-prod. Views are automatically generated in the publicly facing dataset.

Usage

$ ./bqetl query create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--owner: Owner of the query (email address)\n--dag: Name of the DAG the query should be scheduled under. If there is no DAG name specified, the query is scheduled by default in DAG bqetl_default. To skip the automated scheduling use --no_schedule. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.\n--no_schedule: Using this option creates the query without scheduling information. Use `bqetl query schedule` to add it manually if required.\n

Examples

./bqetl query create telemetry_derived.deviations_v1 \\\n  --owner=example@mozilla.com\n\n\n# The query version gets autocompleted to v1. Queries are created in the\n# _derived dataset and accompanying views in the public dataset.\n./bqetl query create telemetry.deviations --owner=example@mozilla.com\n
"},{"location":"bqetl/#schedule","title":"schedule","text":"

Schedule an existing query

Usage

$ ./bqetl query schedule [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--dag: Name of the DAG the query should be scheduled under. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.\n--depends_on_past: Only execute query if previous scheduled run succeeded.\n--task_name: Custom name for the Airflow task. By default the task name is a combination of the dataset and table name.\n

Examples

./bqetl query schedule telemetry_derived.deviations_v1 \\\n  --dag=bqetl_deviations\n\n\n# Set a specific name for the task\n./bqetl query schedule telemetry_derived.deviations_v1 \\\n  --dag=bqetl_deviations \\\n  --task-name=deviations\n
"},{"location":"bqetl/#info","title":"info","text":"

Get information about all or specific queries.

Usage

$ ./bqetl query info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Get info for specific queries\n./bqetl query info telemetry_derived.*\n\n\n# Get cost and last update timestamp information\n./bqetl query info telemetry_derived.clients_daily_v6 \\\n  --cost --last_updated\n
"},{"location":"bqetl/#backfill","title":"backfill","text":"

Run a backfill for a query. Additional parameters will get passed to bq.

Usage

$ ./bqetl query backfill [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--start_date: First date to be backfilled\n--end_date: Last date to be backfilled\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--dry_run: Dry run the backfill\n--max_rows: How many rows to return in the result\n--parallelism: How many threads to run backfill in parallel\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query.\n--checks: Whether to run checks during backfill\n--custom_query_path: Name of a custom query to run the backfill. If not given, the process runs as usual.\n--checks_file_name: Name of a custom data checks file to run after each partition backfill. E.g. custom_checks.sql. Optional.\n--scheduling_overrides: Pass overrides as a JSON string for scheduling sections: parameters and/or date_partition_parameter as needed.\n

Examples

# Backfill for specific date range\n# second comment line\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n  --start_date=2021-03-01 \\\n  --end_date=2021-03-31\n\n\n# Dryrun backfill for specific date range and exclude date\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n  --start_date=2021-03-01 \\\n  --end_date=2021-03-31 \\\n  --exclude=2021-03-03 \\\n  --dry_run\n
"},{"location":"bqetl/#run","title":"run","text":"

Run a query. Additional parameters will get passed to bq. If a destination_table is set, the query result will be written to BigQuery; without a destination_table specified, the results are not stored. If the name is not found within the sql/ folder, bqetl assumes it hasn't been generated yet and will start the generation process for all sql_generators/ files. This generation process takes some time and runs dry run calls against BigQuery, but this is expected. Additional parameters (all parameters that are not specified in the Options) must come after the query name; otherwise the first parameter that is not an option is interpreted as the query name, and since it can't be found the generation process will start.

Usage

$ ./bqetl query run [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--public_project_id: Project with publicly accessible data\n--destination_table: Destination table name results are written to. If not set, the query result will not be written to BigQuery.\n--dataset_id: Destination dataset results are written to. If not set, determines destination dataset based on query.\n

Examples

# Run a query by name\n./bqetl query run telemetry_derived.ssl_ratios_v1\n\n\n# Run a query file\n./bqetl query run /path/to/query.sql\n\n\n# Run a query and save the result to BigQuery\n./bqetl query run telemetry_derived.ssl_ratios_v1         --project_id=moz-fx-data-shared-prod         --dataset_id=telemetry_derived         --destination_table=ssl_ratios_v1\n
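
Additional parameters placed after the query name are passed through to bq; for example (hypothetical, assuming the query expects a @submission_date parameter):

# Pass a bq query parameter through to the query\n./bqetl query run telemetry_derived.ssl_ratios_v1 \\\n  --parameter=submission_date:DATE:2023-05-01\n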
"},{"location":"bqetl/#run-multipart","title":"run-multipart","text":"

Run a multipart query.

Usage

$ ./bqetl query run-multipart [OPTIONS] [query_dir]\n\nOptions:\n\n--using: comma separated list of join columns to use when combining results\n--parallelism: Maximum number of queries to execute concurrently\n--dataset_id: Default dataset, if not specified all tables must be qualified with dataset\n--project_id: GCP project ID\n--temp_dataset: Dataset where intermediate query results will be temporarily stored, formatted as PROJECT_ID.DATASET_ID\n--destination_table: table where combined results will be written\n--time_partitioning_field: time partition field on the destination table\n--clustering_fields: comma separated list of clustering fields on the destination table\n--dry_run: Print bytes that would be processed for each part and don't run queries\n--parameters: query parameter(s) to pass when running parts\n--priority: Priority for BigQuery query jobs; BATCH priority will significantly slow down queries if reserved slots are not enabled for the billing project; defaults to INTERACTIVE\n--schema_update_options: Optional options for updating the schema.\n

Examples

# Run a multipart query\n./bqetl query run_multipart /path/to/query.sql\n
"},{"location":"bqetl/#validate","title":"validate","text":"

Validate a query. Checks formatting, scheduling information and dry runs the query.

Usage

$ ./bqetl query validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --ignore-dryrun-skip.\n--no_dryrun: Skip running dryrun. Default is False.\n

Examples

./bqetl query validate telemetry_derived.clients_daily_v6\n\n\n# Validate query not in shared-prod\n./bqetl query validate \\\n  --use_cloud_function=false \\\n  --project_id=moz-fx-data-marketing-prod \\\n  ga_derived.blogs_goals_v1\n
"},{"location":"bqetl/#initialize","title":"initialize","text":"

Run a full backfill on the destination table for the query. Using this command will:

- Create the table if it doesn't exist and run a full backfill.
- Run a full backfill if the table exists and is empty.
- Raise an exception if the table exists and has data, or if the table exists and the schema doesn't match the query.

It supports query.sql files that use the is_init() pattern. To run in parallel per sample_id, include a @sample_id parameter in the query.

Usage

$ ./bqetl query initialize [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--dry_run: Dry run the initialization\n--parallelism: Number of threads for parallel processing\n--skip_existing: Skip initialization for existing artifacts, otherwise initialization is run for empty tables.\n--force: Run the initialization even if the destination table contains data.\n

Examples

Examples:\n   - For init.sql files: ./bqetl query initialize telemetry_derived.ssl_ratios_v1\n   - For query.sql files and parallel run: ./bqetl query initialize sql/moz-fx-data-shared-prod/telemetry_derived/clients_first_seen_v2/query.sql\n
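
A minimal sketch of a query.sql that uses the is_init() pattern (table and column names are hypothetical); on initialization the full history is processed, afterwards only a single partition:

SELECT\n  submission_date,\n  COUNT(*) AS row_count\nFROM\n  telemetry_derived.example_table_v1  -- hypothetical source table\nWHERE\n  {% if is_init() %}\n    submission_date >= '2020-01-01'  -- full history during initialization\n  {% else %}\n    submission_date = @submission_date  -- single partition afterwards\n  {% endif %}\nGROUP BY\n  submission_date\n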
"},{"location":"bqetl/#render","title":"render","text":"

Render a query Jinja template.

Usage

$ ./bqetl query render [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--output_dir: Output directory generated SQL is written to. If not specified, rendered queries are printed to console.\n--parallelism: Number of threads for parallel processing\n

Examples

./bqetl query render telemetry_derived.ssl_ratios_v1 \\\n  --output-dir=/tmp\n
"},{"location":"bqetl/#schema","title":"schema","text":"

Commands for managing query schemas.

"},{"location":"bqetl/#update","title":"update","text":"

Update the query schema based on the destination table schema and the query schema. If no schema.yaml file exists for a query, one will be created.

Usage

$ ./bqetl query schema update [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--update_downstream: Update downstream dependencies. GCP authentication required.\n--tmp_dataset: GCP datasets for creating updated tables temporarily.\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--parallelism: Number of threads for parallel processing\n--is_init: Indicates whether the `is_init()` condition should be set to true or false.\n

Examples

./bqetl query schema update telemetry_derived.clients_daily_v6\n\n# Update schema including downstream dependencies (requires GCP)\n./bqetl query schema update telemetry_derived.clients_daily_v6 --update-downstream\n
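
For reference, a minimal sketch of what a generated schema.yaml may roughly look like (field names are illustrative):

fields:\n- name: submission_date\n  type: DATE\n  mode: NULLABLE\n  description: Date of the ping submission\n- name: client_count\n  type: INTEGER\n  mode: NULLABLE\n  description: Number of distinct clients\n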
"},{"location":"bqetl/#deploy","title":"deploy","text":"

Deploy the query schema.

Usage

$ ./bqetl query schema deploy [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--force: Deploy the schema file without validating that it matches the query\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--skip_existing: Skip updating existing tables. This option ensures that only new tables get deployed.\n--skip_external_data: Skip publishing external data, such as Google Sheets.\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query.  Must be fully qualified (project.dataset.table).\n--parallelism: Number of threads for parallel processing\n

Examples

./bqetl query schema deploy telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#validate_1","title":"validate","text":"

Validate the query schema

Usage

$ ./bqetl query schema validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n

Examples

./bqetl query schema validate telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#dag","title":"dag","text":"

Commands for managing DAGs.

"},{"location":"bqetl/#info_1","title":"info","text":"

Get information about available DAGs.

Usage

$ ./bqetl dag info [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--with_tasks: Include scheduled tasks\n

Examples

# Get information about all available DAGs\n./bqetl dag info\n\n# Get information about a specific DAG\n./bqetl dag info bqetl_ssl_ratios\n\n# Get information about a specific DAG including scheduled tasks\n./bqetl dag info --with_tasks bqetl_ssl_ratios\n
"},{"location":"bqetl/#create_1","title":"create","text":"

Create a new DAG with name bqetl_<name>, for example: bqetl_search. When creating new DAGs, the DAG name must have a bqetl_ prefix. Created DAGs are added to the dags.yaml file.

Usage

$ ./bqetl dag create [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--schedule_interval: Schedule interval of the new DAG. Schedule intervals can be either in CRON format or one of: once, hourly, daily, weekly, monthly, yearly or a timedelta []d[]h[]m\n--owner: Email address of the DAG owner\n--description: Description for DAG\n--tag: Tag to apply to the DAG\n--start_date: First date for which scheduled queries should be executed\n--email: Email addresses that Airflow will send alerts to\n--retries: Number of retries Airflow will attempt in case of failures\n--retry_delay: Time period Airflow will wait after failures before running failed tasks again\n

Examples

./bqetl dag create bqetl_core \\\n--schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"Tables derived from `core` pings sent by mobile applications.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-25\n\n\n# Create DAG and overwrite default settings\n./bqetl dag create bqetl_ssl_ratios --schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"The DAG schedules SSL ratios queries.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-20 \\\n--email=example2@mozilla.com \\\n--email=example3@mozilla.com \\\n--retries=2 \\\n--retry_delay=30m\n
"},{"location":"bqetl/#generate","title":"generate","text":"

Generate Airflow DAGs from DAG definitions.

Usage

$ ./bqetl dag generate [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n

Examples

# Generate all DAGs\n./bqetl dag generate\n\n# Generate a specific DAG\n./bqetl dag generate bqetl_ssl_ratios\n
"},{"location":"bqetl/#remove","title":"remove","text":"

Remove a DAG. This will also remove the scheduling information from the queries that were scheduled as part of the DAG.

Usage

$ ./bqetl dag remove [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n

Examples

# Remove a specific DAG\n./bqetl dag remove bqetl_vrbrowser\n
"},{"location":"bqetl/#dependency","title":"dependency","text":"

Build and use query dependency graphs.

"},{"location":"bqetl/#show","title":"show","text":"

Show table references in sql files.

Usage

$ ./bqetl dependency show [OPTIONS] [paths]\n
"},{"location":"bqetl/#record","title":"record","text":"

Record table references in metadata. Fails if metadata already contains references section.

Usage

$ ./bqetl dependency record [OPTIONS] [paths]\n
"},{"location":"bqetl/#dryrun","title":"dryrun","text":"

Dry run SQL. Uses the dryrun Cloud Function by default, which only has access to shared-prod. To dry run queries accessing tables in another project, set --use-cloud-function=false and ensure that the command line has access to a GCP service account.

Usage

$ ./bqetl dryrun [OPTIONS] [paths]\n\nOptions:\n\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_skip: Respect or ignore query skip configuration. Default is --respect-skip.\n--project: GCP project to perform dry run in when --use_cloud_function=False\n

Examples

Examples:\n./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/\n\n# Dry run SQL with tables that are not in shared prod\n./bqetl dryrun --use-cloud-function=false sql/moz-fx-data-marketing-prod/\n
"},{"location":"bqetl/#format","title":"format","text":"

Format SQL files.

Usage

$ ./bqetl format [OPTIONS] [paths]\n\nOptions:\n\n--check: do not write changes, just return status; return code 0 indicates nothing would change; return code 1 indicates some files would be reformatted\n--parallelism: Number of threads for parallel processing\n

Examples

# Format a specific file\n./bqetl format sql/moz-fx-data-shared-prod/telemetry/core/view.sql\n\n# Format all SQL files in `sql/`\n./bqetl format sql\n\n# Format standard in (will write to standard out)\necho 'SELECT 1,2,3' | ./bqetl format\n
"},{"location":"bqetl/#routine","title":"routine","text":"

Commands for managing routines for internal use.

"},{"location":"bqetl/#create_2","title":"create","text":"

Create a new routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_procedure flag.

Usage

$ ./bqetl routine create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n

Examples

# Create a UDF\n./bqetl routine create --udf udf.array_slice\n\n\n# Create a stored procedure\n./bqetl routine create --stored_procedure udf.events_daily\n\n\n# Create a UDF in a project other than shared-prod\n./bqetl routine create --udf udf.active_last_week --project=moz-fx-data-marketing-prod\n
"},{"location":"bqetl/#info_2","title":"info","text":"

Get routine information.

Usage

$ ./bqetl routine info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n

Examples

# Get information about all internal routines in a specific dataset\n./bqetl routine info udf.*\n\n\n# Get usage information of specific routine\n./bqetl routine info --usages udf.get_key\n
"},{"location":"bqetl/#validate_2","title":"validate","text":"

Validate formatting of routines and run tests.

Usage

$ ./bqetl routine validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n

Examples

# Validate all routines\n./bqetl routine validate\n\n\n# Validate selected routines\n./bqetl routine validate udf.*\n
"},{"location":"bqetl/#publish","title":"publish","text":"

Publish routines to BigQuery. Requires service account access.

Usage

$ ./bqetl routine publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory where JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing UDFs.\n

Examples

# Publish all routines\n./bqetl routine publish\n\n\n# Publish selected routines\n./bqetl routine publish udf.*\n
"},{"location":"bqetl/#rename","title":"rename","text":"

Rename routine or routine dataset. Replaces all usages in queries with the new name.

Usage

$ ./bqetl routine rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Rename routine\n./bqetl routine rename udf.array_slice udf.list_slice\n\n\n# Rename routine matching a specific pattern\n./bqetl routine rename udf.array_* udf.list_*\n
"},{"location":"bqetl/#mozfun","title":"mozfun","text":"

Commands for managing public mozfun routines.

"},{"location":"bqetl/#create_3","title":"create","text":"

Create a new mozfun routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_procedure flag. UDFs are added to the mozfun project.

Usage

$ ./bqetl mozfun create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n

Examples

# Create a UDF\n./bqetl mozfun create --udf bytes.zero_right\n\n\n# Create a stored procedure\n./bqetl mozfun create --stored_procedure event_analysis.events_daily\n
"},{"location":"bqetl/#info_3","title":"info","text":"

Get mozfun routine information.

Usage

$ ./bqetl mozfun info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n

Examples

# Get information about all internal routines in a specific dataset\n./bqetl mozfun info hist.*\n\n\n# Get usage information of specific routine\n./bqetl mozfun info --usages hist.mean\n
"},{"location":"bqetl/#validate_3","title":"validate","text":"

Validate formatting of mozfun routines and run tests.

Usage

$ ./bqetl mozfun validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n

Examples

# Validate all routines\n./bqetl mozfun validate\n\n\n# Validate selected routines\n./bqetl mozfun validate hist.*\n
"},{"location":"bqetl/#publish_1","title":"publish","text":"

Publish mozfun routines. This command is used by Airflow only.

Usage

$ ./bqetl mozfun publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory where JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing udfs.\n
"},{"location":"bqetl/#rename_1","title":"rename","text":"

Rename mozfun routine or mozfun routine dataset. Replaces all usages in queries with the new name.

Usage

$ ./bqetl mozfun rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Rename routine\n./bqetl mozfun rename hist.extract hist.ext\n\n\n# Rename routine matching a specific pattern\n./bqetl mozfun rename *.array_* *.list_*\n\n\n# Rename routine dataset\n./bqetl mozfun rename hist.* histogram.*\n
"},{"location":"bqetl/#backfill_1","title":"backfill","text":"

Commands for managing backfills.

"},{"location":"bqetl/#create_4","title":"create","text":"

Create a new backfill entry in the backfill.yaml file, creating the backfill.yaml file if it does not already exist.

Usage

$ ./bqetl backfill create [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--start_date: First date to be backfilled. Date format: yyyy-mm-dd\n--end_date: Last date to be backfilled. Date format: yyyy-mm-dd\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--watcher: Watcher of the backfill (email address)\n--custom_query_path: Path of the custom query to run the backfill. Optional.\n--shredder_mitigation: Whether to run a backfill using an auto-generated query that mitigates the shredder effect.\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n

Examples

./bqetl backfill create moz-fx-data-shared-prod.telemetry_derived.deviations_v1 \\\n  --start_date=2021-03-01 \\\n  --end_date=2021-03-31 \\\n  --exclude=2021-03-03 \\\n
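
Running the command adds an entry to the table's backfill.yaml. As a rough, hypothetical sketch (the exact fields are defined by the backfill tooling), an entry may look like:

2021-05-03:  # entry date\n  start_date: 2021-03-01\n  end_date: 2021-03-31\n  excluded_dates:\n  - 2021-03-03\n  reason: backfill after fixing an upstream data issue\n  watchers:\n  - example@mozilla.com\n  status: Initiate\n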
"},{"location":"bqetl/#validate_4","title":"validate","text":"

Validate backfill.yaml file format and content.

Usage

$ ./bqetl backfill validate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

./bqetl backfill validate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# validate all backfill.yaml files if table is not specified\nUse the `--project_id` option to change the project to be validated;\ndefault is `moz-fx-data-shared-prod`.\n\n    ./bqetl backfill validate\n
"},{"location":"bqetl/#info_4","title":"info","text":"

Get backfill(s) information from all or specific table(s).

Usage

$ ./bqetl backfill info [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Filter backfills with this status.\n

Examples

# Get info for specific table.\n./bqetl backfill info moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill info\n\n\n# Get info from all tables with specific status.\n./bqetl backfill info --status=Initiate\n
"},{"location":"bqetl/#scheduled","title":"scheduled","text":"

Get information on backfill(s) that require processing.

Usage

$ ./bqetl backfill scheduled [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Whether to get backfills to process or to complete.\n--json_path: None\n

Examples

# Get info for specific table.\n./bqetl backfill scheduled moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill scheduled\n
"},{"location":"bqetl/#initiate","title":"initiate","text":"

Process entry in backfill.yaml with Initiate status that has not yet been processed.

Usage

$ ./bqetl backfill initiate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--parallelism: Maximum number of queries to execute concurrently\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Initiate backfill entry for specific table\n./bqetl backfill initiate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"bqetl/#complete","title":"complete","text":"

Complete entry in backfill.yaml with Complete status that has not yet been processed.

Usage

$ ./bqetl backfill complete [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n

Examples

# Complete backfill entry for specific table\n./bqetl backfill complete moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"cookbooks/common_workflows/","title":"Common bigquery-etl workflows","text":"

This is a quick guide of how to perform common workflows in bigquery-etl using the bqetl CLI.

For any workflow, the bigquery-etl repository needs to be locally available, for example by cloning the repository, and the bqetl CLI needs to be installed by running ./bqetl bootstrap.

"},{"location":"cookbooks/common_workflows/#adding-a-new-scheduled-query","title":"Adding a new scheduled query","text":"

The Creating derived datasets tutorial provides a more detailed guide on creating scheduled queries.

  1. Run ./bqetl query create <dataset>.<table>_<version>
    1. Specify the desired destination dataset and table name for <dataset>.<table>_<version>
    2. Directories and files are generated automatically
  2. Open query.sql file that has been created in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/ to write the query
  3. [Optional] Run ./bqetl query schema update <dataset>.<table>_<version> to generate the schema.yaml file
  4. Open the metadata.yaml file in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
  5. Run ./bqetl query validate <dataset>.<table>_<version> to dry run and format the query
  6. To schedule the query, first select a DAG from the ./bqetl dag info list or create a new DAG ./bqetl dag create <bqetl_new_dag>
  7. Run ./bqetl query schedule <dataset>.<table>_<version> --dag <bqetl_dag> to schedule the query
  8. Create a pull request
  9. PR gets reviewed and eventually approved
  10. Merge pull-request
  11. Table deploys happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG
  12. Backfill data
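For orientation, a minimal incremental query.sql might look like the following sketch; the table and column names here are purely illustrative, and the @submission_date parameter is supplied for each scheduled run:

SELECT\n  DATE(submission_timestamp) AS submission_date,\n  normalized_channel,\n  COUNT(DISTINCT client_id) AS client_count\nFROM\n  `moz-fx-data-shared-prod.telemetry.main`\nWHERE\n  DATE(submission_timestamp) = @submission_date\nGROUP BY\n  submission_date,\n  normalized_channel\n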
"},{"location":"cookbooks/common_workflows/#update-an-existing-query","title":"Update an existing query","text":"
  1. Open the query.sql file of the query to be updated and make changes
  2. Run ./bqetl query validate <dataset>.<table>_<version> to dry run and format the query
  3. If the query scheduling metadata has changed, run ./bqetl dag generate <bqetl_dag> to update the DAG file
  4. If the query adds new columns, run ./bqetl query schema update <dataset>.<table>_<version> to make local schema.yaml updates
  5. Open PR with changes
  6. PR reviewed and approved
  7. Merge pull-request
  8. Table deploys (including schema changes) happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG
"},{"location":"cookbooks/common_workflows/#formatting-sql","title":"Formatting SQL","text":"

We enforce consistent SQL formatting as part of CI. After adding or changing a query, use ./bqetl format to apply formatting rules.

Directories and files passed as arguments to ./bqetl format will be formatted in place, with directories recursively searched for files with a .sql extension, e.g.:

$ echo 'SELECT 1,2,3' > test.sql\n$ ./bqetl format test.sql\nmodified test.sql\n1 file(s) modified\n$ cat test.sql\nSELECT\n  1,\n  2,\n  3\n

If no arguments are specified the script will read from stdin and write to stdout, e.g.:

$ echo 'SELECT 1,2,3' | ./bqetl format\nSELECT\n  1,\n  2,\n  3\n

To turn off sql formatting for a block of SQL, wrap it in format:off and format:on comments, like this:

SELECT\n  -- format:off\n  submission_date, sample_id, client_id\n  -- format:on\n
"},{"location":"cookbooks/common_workflows/#add-a-new-field-to-a-table-schema","title":"Add a new field to a table schema","text":"

Adding a new field to a table schema also means that the field has to propagate to several downstream tables, which makes it a more complex case.

  1. Open the query.sql file inside the <dataset>.<table> location and add the new definitions for the field.
  2. Run ./bqetl format <path to the query> to format the query. Alternatively, run ./bqetl format $(git ls-tree -d HEAD --name-only) to validate the format of all queries that have been modified.
  3. Run ./bqetl query validate <dataset>.<table> to dry run the query.
  4. Run ./bqetl query schema update <dataset>.<table> --update_downstream to make local schema.yaml updates and update schemas of downstream dependencies.
  5. Open a new PR with these changes.
  6. PR reviewed and approved.
  7. Find and run again the CI pipeline for the PR.
  8. Merge pull-request.
  9. Table deploys happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG

The following is an example to update a new field in telemetry_derived.clients_daily_v6

"},{"location":"cookbooks/common_workflows/#example-add-a-new-field-to-clients_daily","title":"Example: Add a new field to clients_daily","text":"
  1. Open the clients_daily_v6 query.sql file and add new field definitions.
  2. Run ./bqetl format sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/query.sql
  3. Run ./bqetl query validate telemetry_derived.clients_daily_v6.
  4. Authenticate to GCP: gcloud auth login --update-adc
  5. Run ./bqetl query schema update telemetry_derived.clients_daily_v6 --update_downstream --ignore-dryrun-skip --use-cloud-function=false.
  6. Open a PR with these changes.
  7. PR is reviewed and approved.
  8. Merge pull-request.
  9. Table deploys happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG
"},{"location":"cookbooks/common_workflows/#remove-a-field-from-a-table-schema","title":"Remove a field from a table schema","text":"

Deleting a field from an existing table schema should be done only when it is absolutely necessary. If you decide to delete it:
  1. Validate whether there is data in the column and make sure the data is either backed up or can be reprocessed.
  2. Follow the BigQuery docs recommendations for deleting.
  3. If the column size exceeds the allowed limit, consider setting the field to NULL instead (a minimal sketch is shown below). See this search_clients_daily_v8 PR for an example.
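As a sketch of the third option above (keeping the field in the schema but no longer populating it), assuming a hypothetical table and column name:

SELECT\n  -- keep the column in the schema, but stop populating it\n  * REPLACE (CAST(NULL AS STRING) AS legacy_field)\nFROM\n  `moz-fx-data-shared-prod.telemetry_derived.example_table_v1`\nWHERE\n  submission_date = @submission_date\n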

"},{"location":"cookbooks/common_workflows/#adding-a-new-mozfun-udf","title":"Adding a new mozfun UDF","text":"
  1. Run ./bqetl mozfun create <dataset>.<name> --udf.
  2. Navigate to the udf.sql file in sql/mozfun/<dataset>/<name>/ and add the UDF definition and tests.
  3. Run ./bqetl mozfun validate <dataset>.<name> for formatting and running tests.
  4. Open a PR.
  5. PR gets reviewed, approved and merged.
  6. To publish UDF immediately:
"},{"location":"cookbooks/common_workflows/#adding-a-new-internal-udf","title":"Adding a new internal UDF","text":"

Internal UDFs are usually only used by specific queries. If your UDF might be useful to others consider publishing it as a mozfun UDF.

  1. Run ./bqetl routine create <dataset>.<name> --udf
  2. Navigate to the udf.sql file in sql/moz-fx-data-shared-prod/<dataset>/<name>/ and add the UDF definition and tests
  3. Run ./bqetl routine validate <dataset>.<name> for formatting and running tests
  4. Open a PR
  5. PR gets reviewed and approved and merged
  6. UDF deploys happen on a nightly cadence through the bqetl_artifact_deployment Airflow DAG
"},{"location":"cookbooks/common_workflows/#adding-a-stored-procedure","title":"Adding a stored procedure","text":"

The same steps as creating a new UDF apply for creating stored procedures, except when initially creating the procedure execute ./bqetl mozfun create <dataset>.<name> --stored_procedure or ./bqetl routine create <dataset>.<name> --stored_procedure for internal stored procedures.

"},{"location":"cookbooks/common_workflows/#updating-an-existing-udf","title":"Updating an existing UDF","text":"
  1. Navigate to the udf.sql file and make updates
  2. Run ./bqetl mozfun validate <dataset>.<name> or ./bqetl routine validate <dataset>.<name> for formatting and running tests
  3. Open a PR
  4. PR gets reviewed, approved and merged
"},{"location":"cookbooks/common_workflows/#renaming-an-existing-udf","title":"Renaming an existing UDF","text":"
  1. Run ./bqetl mozfun rename <dataset>.<name> <new_dataset>.<new_name>
  2. Open a PR
  3. PR gets reviewed, approved and merged
"},{"location":"cookbooks/common_workflows/#using-a-private-internal-udf","title":"Using a private internal UDF","text":"
  1. Follow the steps for Adding a new internal UDF above to create a stub of the private UDF. Note this should not contain actual private UDF code or logic. The directory name and function parameters should match the private UDF.
  2. Do Not publish the stub UDF. This could result in incorrect results for other users of the private UDF.
  3. Open a PR
  4. PR gets reviewed, approved and merged
"},{"location":"cookbooks/common_workflows/#creating-a-new-bigquery-dataset","title":"Creating a new BigQuery Dataset","text":"

To provision a new BigQuery dataset for holding tables, you'll need to create a dataset_metadata.yaml which will cause the dataset to be automatically deployed after merging. Changes to existing datasets may trigger manual operator approval (such as changing access policies). For more on access controls, see Data Access Workgroups in Mana.

The bqetl query create command will automatically generate a skeleton dataset_metadata.yaml file if the query name contains a dataset that is not yet defined.

See example with commentary for telemetry_derived:

friendly_name: Telemetry Derived\ndescription: |-\n  Derived data based on pings from legacy Firefox telemetry, plus many other\n  general-purpose derived tables\nlabels: {}\n\n# Base ACL should can be:\n#   \"derived\" for `_derived` datasets that contain concrete tables\n#   \"view\" for user-facing datasets containing virtual views\ndataset_base_acl: derived\n\n# Datasets with user-facing set to true will be created both in shared-prod\n# and in mozdata; this should be false for all `_derived` datasets\nuser_facing: false\n\n# Most datasets can have mozilla-confidential access like below, but some\n# datasets will be defined with more restricted access or with additional\n# access for services; see \"Data Access Workgroups\" link above.\nworkgroup_access:\n- role: roles/bigquery.dataViewer\n  members:\n  - workgroup:mozilla-confidential\n
"},{"location":"cookbooks/common_workflows/#publishing-data","title":"Publishing data","text":"

See also the reference for Public Data.

  1. Get a data review by following the data publishing process
  2. Update the metadata.yaml file of the query to be published
  3. If an internal dataset already exists, move it to mozilla-public-data
  4. If an init.sql file exists for the query, change the destination project for the created table to mozilla-public-data
  5. Open a PR
  6. PR gets reviewed, approved and merged
"},{"location":"cookbooks/common_workflows/#adding-new-python-requirements","title":"Adding new Python requirements","text":"

When adding a new library to the Python requirements, first add the library to the requirements and then add any meta-dependencies into constraints. Constraints are discovered by installing requirements into a fresh virtual environment. A dependency should be added to either requirements.txt or constraints.txt, but not both.

# Create a python virtual environment (not necessary if you have already\n# run `./bqetl bootstrap`)\npython3 -m venv venv/\n\n# Activate the virtual environment\nsource venv/bin/activate\n\n# If not installed:\npip install pip-tools --constraint requirements.in\n\n# Add the dependency to requirements.in e.g. Jinja2.\necho Jinja2==2.11.1 >> requirements.in\n\n# Compile hashes for new dependencies.\npip-compile --generate-hashes requirements.in\n\n# Deactivate the python virtual environment.\ndeactivate\n
"},{"location":"cookbooks/common_workflows/#making-a-pull-request-from-a-fork","title":"Making a pull request from a fork","text":"

When opening a pull-request to merge a fork, the manual-trigger-required-for-fork CI task will fail and some integration test tasks will be skipped. A user with repository write permissions will have to run the Push to upstream workflow and provide the <username>:<branch> of the fork as parameter. The parameter will also show up in the logs of the manual-trigger-required-for-fork CI task together with more detailed instructions. Once the workflow has been executed, the CI tasks, including the integration tests, of the PR will be executed.

"},{"location":"cookbooks/common_workflows/#building-the-documentation","title":"Building the Documentation","text":"

The repository documentation is built using MkDocs. To generate and check the docs locally:

  1. Run ./bqetl docs generate --output_dir generated_docs
  2. Navigate to the generated_docs directory
  3. Run mkdocs serve to start a local mkdocs server.
"},{"location":"cookbooks/common_workflows/#setting-up-change-control-to-code-files","title":"Setting up change control to code files","text":"

Each code file in the bigquery-etl repository can have a set of owners who are responsible for reviewing and approving changes and who are automatically assigned as PR reviewers. Query files in the repo also use metadata labels so that change-controlled data can be validated and identified.

Here is a sample PR with the implementation of change control for contextual services data.

  1. Select or create a GitHub team or identity and add the GitHub emails of the query codeowners. A GitHub identity is particularly useful when you need to include non-@mozilla emails or to randomly assign PR reviewers from the team members. This team requires edit permissions to bigquery-etl; to achieve this, inherit the team from one that has the required permissions, e.g. mozilla > telemetry.
  2. Open the metadata.yaml for the query where you want to apply change control:
  3. Setup the CODEOWNERS:
  4. The queries labeled change_controlled are automatically validated in the CI. To run the validation locally:
"},{"location":"cookbooks/creating_a_derived_dataset/","title":"A quick guide to creating a derived dataset with BigQuery-ETL and how to set it up as a public dataset","text":"

This guide takes you through the creation of a simple derived dataset using bigquery-etl and scheduling it using Airflow so that it is updated on a daily basis. It applies to the products we ship to customers that use (or will use) the Glean SDK.

This guide also includes the specific instructions to set it as a public dataset. Make sure you only set the dataset public if you expect the data to be available outside Mozilla. Read our public datasets reference for context.

To illustrate the overall process, we will use a simple test case and a small Glean application for which we want to generate an aggregated dataset based on the raw ping data.

If you are interested in looking at the end result, you can view the pull request at mozilla/bigquery-etl#1760.

"},{"location":"cookbooks/creating_a_derived_dataset/#background","title":"Background","text":"

Mozregression is a developer tool used to help developers and community members bisect builds of Firefox to find a regression range in which a bug was introduced. It forms a key part of our quality assurance process.

In this example, we will create a table of aggregated metrics related to mozregression, that will be used in dashboards to help prioritize feature development inside Mozilla.

"},{"location":"cookbooks/creating_a_derived_dataset/#initial-steps","title":"Initial steps","text":"

Set up bigquery-etl on your system per the instructions in the README.md.

"},{"location":"cookbooks/creating_a_derived_dataset/#create-the-query","title":"Create the Query","text":"

The first step is to create a query file and decide on the name of your derived dataset. In this case, we'll name it org_mozilla_mozregression_derived.mozregression_aggregates.

The org_mozilla_mozregression_derived part represents a BigQuery dataset, which is essentially a container of tables. By convention, we use the _derived postfix to hold derived tables like this one.

Run:

./bqetl query create <dataset>.<table_name>\n
In our example:

./bqetl query create org_mozilla_mozregression_derived.mozregression_aggregates --dag bqetl_internal_tooling\n

This command does three things:

We generate the view to have a stable interface, while allowing the dataset backend to evolve over time. Views are automatically published to the mozdata project.

"},{"location":"cookbooks/creating_a_derived_dataset/#fill-out-the-yaml","title":"Fill out the YAML","text":"

The next step is to fill out the generated metadata.yaml and query.sql files with specific information.

Let's look at what the metadata.yaml file for our example looks like. Make sure to adapt this file for your own dataset.

friendly_name: mozregression aggregates\ndescription:\n  Aggregated metrics of mozregression usage\nlabels:\n  incremental: true\nowners:\n  - wlachance@mozilla.com\nbigquery:\n  time_partitioning:\n    type: day\n    field: date\n    require_partition_filter: true\n    expiration_days: null\n  clustering:\n    fields:\n    - app_used\n    - os\n

Most of the fields are self-explanatory. incremental means that the table is updated incrementally, e.g. a new partition gets added/updated to the destination table whenever the query is run. For non-incremental queries the entire destination is overwritten when the query is executed.

For big datasets make sure to include optimization strategies. Our aggregation is small so it is only for illustration purposes that we are including a partition by the date field and a clustering on app_used and os.

"},{"location":"cookbooks/creating_a_derived_dataset/#the-yaml-file-structure-for-a-public-dataset","title":"The YAML file structure for a public dataset","text":"

Setting the dataset as public means that it will be available both in Mozilla's public BigQuery project and at a world-accessible JSON endpoint, and is a process that requires a data review. The required labels are public_json, public_bigquery, and review_bugs, which refers to the Bugzilla bug where opening this dataset up to the public was approved; we'll get to that in a subsequent section.

friendly_name: mozregression aggregates\ndescription:\n  Aggregated metrics of mozregression usage\nlabels:\n  incremental: true\n  public_json: true\n  public_bigquery: true\n  review_bugs:\n    - 1691105\nowners:\n  - wlachance@mozilla.com\nbigquery:\n  time_partitioning:\n    type: day\n    field: date\n    require_partition_filter: true\n    expiration_days: null\n  clustering:\n    fields:\n    - app_used\n    - os\n
"},{"location":"cookbooks/creating_a_derived_dataset/#fill-out-the-query","title":"Fill out the query","text":"

Now that we've filled out the metadata, we can look into creating a query. In many ways, this is similar to creating a SQL query to run on BigQuery in other contexts (e.g. on sql.telemetry.mozilla.org or the BigQuery console); the key difference is that we use a @submission_date parameter so that the query can be run on a day's worth of data to update the underlying table incrementally.

Test your query and add it to the query.sql file.

In our example, the query is tested in sql.telemetry.mozilla.org, and the query.sql file looks like this:

SELECT\n  DATE(submission_timestamp) AS date,\n  client_info.app_display_version AS mozregression_version,\n  metrics.string.usage_variant AS mozregression_variant,\n  metrics.string.usage_app AS app_used,\n  normalized_os AS os,\n  mozfun.norm.truncate_version(normalized_os_version, \"minor\") AS os_version,\n  count(DISTINCT(client_info.client_id)) AS distinct_clients,\n  count(*) AS total_uses\nFROM\n  `moz-fx-data-shared-prod`.org_mozilla_mozregression.usage\nWHERE\n  DATE(submission_timestamp) = @submission_date\n  AND client_info.app_display_version NOT LIKE '%.dev%'\nGROUP BY\n  date,\n  mozregression_version,\n  mozregression_variant,\n  app_used,\n  os,\n  os_version;\n

We use the truncate_version UDF to omit the patch level for MacOS and Linux, which should both reduce the size of the dataset as well as make it more difficult to identify individual clients in an aggregated dataset.
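As a rough illustration of what this truncation does (the input value is made up and the exact output depends on the UDF implementation):

SELECT\n  -- hypothetical input; 'minor' is expected to keep only the major.minor part\n  mozfun.norm.truncate_version('10.15.7', 'minor') AS truncated_version\n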

We also have a short clause (client_info.app_display_version NOT LIKE '%.dev%') to omit developer versions from the aggregates: this makes sure we're not including people developing or testing mozregression itself in our results.

"},{"location":"cookbooks/creating_a_derived_dataset/#formatting-and-validating-the-query","title":"Formatting and validating the query","text":"

Now that we've written our query, we can format it and validate it. Once that's done, we run:

./bqetl query validate <dataset>.<table>\n
For our example:
./bqetl query validate org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
If there are no problems, you should see no output.

"},{"location":"cookbooks/creating_a_derived_dataset/#creating-the-table-schema","title":"Creating the table schema","text":"

Use bqetl to set up the schema that will be used to create the table.

Review the schema.yaml file generated as output of the following command, and make sure all data types are set correctly and match the data expected from the query.

./bqetl query schema update <dataset>.<table>\n

For our example:

./bqetl query schema update org_mozilla_mozregression_derived.mozregression_aggregates_v1\n

"},{"location":"cookbooks/creating_a_derived_dataset/#creating-a-dag","title":"Creating a DAG","text":"

BigQuery-ETL has some facilities in it to automatically add your query to telemetry-airflow (our instance of Airflow).

Before scheduling your query, you'll need to find an Airflow DAG to run it from. In some cases, one may already exist that makes sense to use for your dataset; look in dags.yaml at the root or run ./bqetl dag info. In this particular case, there's no DAG that really makes sense, so we'll create a new one:

./bqetl dag create <dag_name> --schedule-interval \"0 4 * * *\" --owner <email_for_notifications> --description \"Add a clear description of the DAG here\" --start-date <YYYY-MM-DD> --tag impact/<tier>\n

For our example, the starting date is 2020-06-01 and we use a schedule interval of 0 4 * * * (4am UTC daily) instead of \"daily\" (12am UTC daily) to make sure this isn't competing for slots with desktop and mobile product ETL.

The --tag impact/tier3 parameter specifies that this DAG is considered \"tier 3\". For a list of valid tags and their descriptions see Airflow Tags.

While a new DAG is still under active development and expected to fail during this phase, it can be tagged with --tag triage/no_triage. That way it will be ignored by the person on Airflow Triage. Once active development is done, the triage/no_triage tag can be removed and problems will be addressed during the Airflow Triage process.

./bqetl dag create bqetl_internal_tooling --schedule-interval \"0 4 * * *\" --owner wlachance@mozilla.com --description \"This DAG schedules queries for populating queries related to Mozilla's internal developer tooling (e.g. mozregression).\" --start-date 2020-06-01 --tag impact/tier_3\n
"},{"location":"cookbooks/creating_a_derived_dataset/#scheduling-your-query","title":"Scheduling your query","text":"

Queries are automatically scheduled during creation in the DAG set using the option --dag, or in the default DAG bqetl_default when this option is not used.

If the query was created with --no-schedule, it is possible to manually schedule the query via the bqetl tool:

./bqetl query schedule <dataset>.<table> --dag <dag_name> --task-name <task_name>\n

Here is the command for our example. Notice the name of the table as created with the suffix _v1.

./bqetl query schedule org_mozilla_mozregression_derived.mozregression_aggregates_v1 --dag bqetl_internal_tooling --task-name mozregression_aggregates__v1\n

Note that we are scheduling the generation of the underlying table which is org_mozilla_mozregression_derived.mozregression_aggregates_v1 rather than the view.

"},{"location":"cookbooks/creating_a_derived_dataset/#get-data-review","title":"Get Data Review","text":"

This is for public datasets only! You can skip this step if you're only creating a dataset for Mozilla-internal use.

Before a dataset can be made public, it needs to go through data review according to our data publishing process. This means filing a bug, answering a few questions, and then finding a data steward to review your proposal.

The dataset we're using in this example is very simple and straightforward and does not have any particularly sensitive data, so the data review is very simple. You can see the full details in bug 1691105.

"},{"location":"cookbooks/creating_a_derived_dataset/#create-a-pull-request","title":"Create a Pull Request","text":"

Now is a good time to create a pull request with your changes to GitHub. This is the usual git workflow:

git checkout -b <new_branch_name>\ngit add dags.yaml dags/<dag_name>.py sql/moz-fx-data-shared-prod/telemetry/<view> sql/moz-fx-data-shared-prod/<dataset>/<table>\ngit commit\ngit push origin <new_branch_name>\n

And next is the workflow for our specific example:

git checkout -b mozregression-aggregates\ngit add dags.yaml dags/bqetl_internal_tooling.py sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1\ngit commit\ngit push origin mozregression-aggregates\n

Then create your pull request, either from the GitHub web interface or the command line, per your preference.

Note At this point, the CI is expected to fail because the schema does not exist yet in BigQuery. This will be handled in the next step.

This example assumes that origin points to your fork. Adjust the last push invocation appropriately if you have a different remote set.

Speaking of forks, note that if you're making this pull request from a fork, many jobs will currently fail due to lack of credentials. In fact, even if you're pushing to the origin, you'll get failures because the table is not yet created. That brings us to the next step, but before going further it's generally best to get someone to review your work: at this point we have more than enough for people to provide good feedback on.

"},{"location":"cookbooks/creating_a_derived_dataset/#creating-an-initial-table","title":"Creating an initial table","text":"

Once the PR has been approved, deploy the schema to BigQuery using this command:

./bqetl query schema deploy <schema>.<table>\n

For our example:

./bqetl query schema deploy org_mozilla_mozregression_derived.mozregression_aggregates_v1\n

"},{"location":"cookbooks/creating_a_derived_dataset/#backfilling-a-table","title":"Backfilling a table","text":"

Note For large sets of data, follow the recommended practices for backfills.

"},{"location":"cookbooks/creating_a_derived_dataset/#initiating-the-backfill","title":"Initiating the backfill:","text":"
  1. Create a backfill schedule entry to (re)-process data in your table:

    bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD>\n
    bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD> --shredder_mitigation\n
  2. Fill out the missing details:

  3. Open a Pull Request with the backfill entry, see this example. Once merged, you should receive a notification in around an hour that processing has started. Your backfill data will be temporarily placed in a staging location.

  4. Watchers need to join the #dataops-alerts Slack channel. They will be notified via Slack when processing is complete, and you can validate your backfill data.

"},{"location":"cookbooks/creating_a_derived_dataset/#completing-the-backfill","title":"Completing the backfill:","text":"
  1. Validate that the backfill data looks like what you expect (calculate important metrics, look for nulls, etc.)

  2. If the data is valid, open a Pull Request, setting the backfill status to Complete, see this example. Once merged, you should receive a notification in around an hour that swapping has started. Current production data will be backed up and the staging backfill data will be swapped into production.

  3. You will be notified when swapping is complete.

Note. If your backfill is complex (e.g. the backfill validation fails), it is recommended to talk to someone in Data Engineering or Data SRE (#data-help) about processing the backfill via the backfill DAG.

"},{"location":"cookbooks/creating_a_derived_dataset/#completing-the-pull-request","title":"Completing the Pull Request","text":"

At this point, the table exists in BigQuery, so you are able to:
  - Find and re-run the CI of your PR and make sure that all tests pass.
  - Merge your PR.

"},{"location":"cookbooks/testing/","title":"How to Run Tests","text":"

This repository uses pytest:

# create a venv\npython3.11 -m venv venv/\n\n# install pip-tools for managing dependencies\n./venv/bin/pip install pip-tools -c requirements.in\n\n# install python dependencies with pip-sync (provided by pip-tools)\n./venv/bin/pip-sync --pip-args=--no-deps requirements.txt\n\n# run pytest with all linters and 8 workers in parallel\n./venv/bin/pytest --black --flake8 --isort --mypy-ignore-missing-imports --pydocstyle -n 8\n\n# use -k to selectively run a set of tests that matches the expression `udf`\n./venv/bin/pytest -k udf\n\n# narrow down testpaths for quicker turnaround when selecting a single test\n./venv/bin/pytest -o \"testpaths=tests/sql\" -k mobile_search_aggregates_v1\n\n# run integration tests with 4 workers in parallel\ngcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS\nexport GOOGLE_PROJECT_ID=bigquery-etl-integration-test\ngcloud config set project $GOOGLE_PROJECT_ID\n./venv/bin/pytest -m integration -n 4\n

To provide authentication credentials for the Google Cloud API the GOOGLE_APPLICATION_CREDENTIALS environment variable must be set to the file path of the JSON file that contains the service account key. See Mozilla BigQuery API Access instructions to request credentials if you don't already have them.

"},{"location":"cookbooks/testing/#how-to-configure-a-udf-test","title":"How to Configure a UDF Test","text":"

Include a comment like -- Tests followed by one or more query statements after the UDF in the SQL file where it is defined. In a SQL file that defines a UDF, each statement that does not define a temporary function is collected as a test and executed independently of the other tests in the file.

Each test must use the UDF and throw an error to fail. Assert functions defined in sql/mozfun/assert/ may be used to evaluate outputs. Tests must not use any query parameters and should not reference any tables. Each test that is expected to fail must be preceded by a comment like #xfail, similar to a SQL dialect prefix in the BigQuery Cloud Console.

For example:

CREATE TEMP FUNCTION udf_example(option INT64) AS (\n  CASE\n  WHEN option > 0 then TRUE\n  WHEN option = 0 then FALSE\n  ELSE ERROR(\"invalid option\")\n  END\n);\n-- Tests\nSELECT\n  mozfun.assert.true(udf_example(1)),\n  mozfun.assert.false(udf_example(0));\n#xfail\nSELECT\n  udf_example(-1);\n#xfail\nSELECT\n  udf_example(NULL);\n
"},{"location":"cookbooks/testing/#how-to-configure-a-generated-test","title":"How to Configure a Generated Test","text":"

Queries are tested by running the query.sql with test-input tables and comparing the result to an expected table.
  1. Make a directory for test resources named tests/sql/{project}/{dataset}/{table}/{test_name}/, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/test_single_day
    - table must match a directory named like {dataset}/{table}, e.g. telemetry_derived/clients_last_seen_v1
    - test_name should start with test_, e.g. test_single_day
    - If test_name is test_init or test_script, then the test will run the query with is_init() set to true or script.sql, respectively; otherwise, the test will run query.sql
  2. Add .yaml files for input tables, e.g. clients_daily_v6.yaml
    - Include the dataset prefix if it's set in the tested query, e.g. analysis.clients_last_seen_v1.yaml
    - Include the project prefix if it's set in the tested query, e.g. moz-fx-other-data.new_dataset.table_1.yaml
    - This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"analysis.clients_last_seen_v1\", \"clients_last_seen_v1\")
  3. Add .sql files for input view queries, e.g. main_summary_v4.sql (a minimal sketch of such a file is shown after this list)
    - Don't include a CREATE ... AS clause
    - Fully qualify table names as `{project}.{dataset}.table`
    - Include the dataset prefix if it's set in the tested query, e.g. telemetry.main_summary_v4.sql
    - This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"telemetry.main_summary_v4\", \"main_summary_v4\")
  4. Add expect.yaml to validate the result
    - DATE and DATETIME type columns in the result are coerced to strings using .isoformat()
    - Columns named generated_time are removed from the result before comparing to expect because they should not be static
    - NULL values should be omitted in expect.yaml. If a column is expected to be NULL, don't add it to expect.yaml. (Be careful with spreading previous rows (-<<: *base) here)
  5. Optionally add .schema.json files for input table schemas to the table directory, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/clients_daily_v6.schema.json. These tables will be available for every test in the suite. The schema.json file needs to match the table name in the query.sql file. If it has project and dataset listed there, the schema file also needs project and dataset.
  6. Optionally add query_params.yaml to define query parameters
    - query_params must be a list
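For illustration, a minimal input view file following the rules above might look like this sketch (the file name, columns, and table are hypothetical); note there is no CREATE ... AS clause and the table name is fully qualified:

-- tests/sql/.../test_single_day/main_summary_v4.sql (hypothetical)\nSELECT\n  submission_date,\n  client_id,\n  search_counts\nFROM\n  `moz-fx-data-shared-prod.telemetry_derived.main_summary_v4`\n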

"},{"location":"cookbooks/testing/#init-tests","title":"Init Tests","text":"

Tests of is_init() statements are supported, similarly to other generated tests. Simply name the test test_init. The other guidelines still apply.
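As a rough sketch (not taken from any specific query in the repository), a query that needs different behavior during initialization typically wraps the initialization branch in an is_init() Jinja conditional, which is the code path a test_init test exercises; the table and columns below are hypothetical:

{% if is_init() %}\nSELECT\n  DATE(submission_timestamp) AS submission_date,\n  COUNT(*) AS ping_count\nFROM\n  `moz-fx-data-shared-prod.telemetry.main`\nGROUP BY\n  submission_date\n{% else %}\nSELECT\n  DATE(submission_timestamp) AS submission_date,\n  COUNT(*) AS ping_count\nFROM\n  `moz-fx-data-shared-prod.telemetry.main`\nWHERE\n  DATE(submission_timestamp) = @submission_date\nGROUP BY\n  submission_date\n{% endif %}\n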

"},{"location":"cookbooks/testing/#additional-guidelines-and-options","title":"Additional Guidelines and Options","text":""},{"location":"cookbooks/testing/#how-to-run-circleci-locally","title":"How to Run CircleCI Locally","text":"
gcloud_service_key=`cat /path/to/key_file.json`\n\n# to run a specific job, e.g. integration:\ncircleci build --job integration \\\n  --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n  --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n\n# to run all jobs\ncircleci build \\\n  --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n  --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n
"},{"location":"moz-fx-data-shared-prod/udf/","title":"Udf","text":""},{"location":"moz-fx-data-shared-prod/udf/#active_n_weeks_ago-udf","title":"active_n_weeks_ago (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters","title":"Parameters","text":"

INPUTS

x INT64, n INT64\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#active_values_from_days_seen_map-udf","title":"active_values_from_days_seen_map (UDF)","text":"

Given a map representing activity for STRING keys, this function returns an array of the keys that were active for the time period in question. start_offset should be at most 0. n_bits should be at most the remaining bits.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_1","title":"Parameters","text":"

INPUTS

days_seen_bits_map ARRAY<STRUCT<key STRING, value INT64>>, start_offset INT64, n_bits INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_engine_searches-udf","title":"add_monthly_engine_searches (UDF)","text":"

This function specifically windows searches into calendar-month windows. This means groups are not necessarily directly comparable, since different months have different numbers of days. On the first of each month, a new month is appended, and the first month is dropped. If the date is not the first of the month, the new entry is added to the last element in the array. For example, if we were adding 12 to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]: On the first of the month, the result would be [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12] On any other day of the month, the result would be [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 24] This happens for every aggregate (searches, ad clicks, etc.)

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_2","title":"Parameters","text":"

INPUTS

prev STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, curr STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, submission_date DATE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_searches-udf","title":"add_monthly_searches (UDF)","text":"

Adds together two engine searches structs. Each engine searches struct has a MAP[engine -> search_counts_struct]. We want to add together the prev and curr values for a certain engine. This allows us to be flexible with the number of engines we're using.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_3","title":"Parameters","text":"

INPUTS

prev ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, curr ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, submission_date DATE\n

OUTPUTS

value\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#add_searches_by_index-udf","title":"add_searches_by_index (UDF)","text":"

Return sums of each search type grouped by the index. Results are ordered by index.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_4","title":"Parameters","text":"

INPUTS

searches ARRAY<STRUCT<total_searches INT64, tagged_searches INT64, search_with_ads INT64, ad_click INT64, index INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_active_addons-udf","title":"aggregate_active_addons (UDF)","text":"

This function selects the most frequently occurring value for each addon_id, using the latest value in the input among ties. The type for active_addons is ARRAY<STRUCT<...>>, i.e. the output of SELECT ARRAY_CONCAT_AGG(active_addons) FROM telemetry.main_summary_v4, and is left unspecified to allow changes to the fields of the STRUCT."},{"location":"moz-fx-data-shared-prod/udf/#parameters_5","title":"Parameters","text":"

INPUTS

active_addons ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_map_first-udf","title":"aggregate_map_first (UDF)","text":"

Returns an aggregated map with all the keys and the first corresponding value from the given maps

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_6","title":"Parameters","text":"

INPUTS

maps ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_counts-udf","title":"aggregate_search_counts (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_7","title":"Parameters","text":"

INPUTS

search_counts ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_map-udf","title":"aggregate_search_map (UDF)","text":"

Aggregates the total counts of the given search counters

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_8","title":"Parameters","text":"

INPUTS

engine_searches_list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#array_11_zeroes_then-udf","title":"array_11_zeroes_then (UDF)","text":"

An array of 11 zeroes, followed by a supplied value

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_9","title":"Parameters","text":"

INPUTS

val INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#array_drop_first_and_append-udf","title":"array_drop_first_and_append (UDF)","text":"

Drop the first element of an array, and append the given element. Result is an array with the same length as the input.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_10","title":"Parameters","text":"

INPUTS

arr ANY TYPE, append ANY TYPE\n

Source | Edit
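A small usage sketch for array_drop_first_and_append based on the description above; the expected result is inferred from the documented behavior rather than verified against the implementation:

SELECT\n  -- drops the leading 0 and appends 9, keeping the length at 4\n  `moz-fx-data-shared-prod.udf.array_drop_first_and_append`([0, 1, 2, 3], 9) AS result\n-- expected: [1, 2, 3, 9]\n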

"},{"location":"moz-fx-data-shared-prod/udf/#array_of_12_zeroes-udf","title":"array_of_12_zeroes (UDF)","text":"

An array of 12 zeroes

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_11","title":"Parameters","text":"

INPUTS

) AS ( [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#array_slice-udf","title":"array_slice (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_12","title":"Parameters","text":"

INPUTS

arr ANY TYPE, start_index INT64, end_index INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitcount_lowest_7-udf","title":"bitcount_lowest_7 (UDF)","text":"

This function counts the 1s in lowest 7 bits of an INT64

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_13","title":"Parameters","text":"

INPUTS

x INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_365-udf","title":"bitmask_365 (UDF)","text":"

A bitmask for 365 bits

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_14","title":"Parameters","text":"

INPUTS

) AS ( CONCAT(b'\\x1F', REPEAT(b'\\xFF', 45\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_28-udf","title":"bitmask_lowest_28 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_15","title":"Parameters","text":"

INPUTS

) AS ( 0x0FFFFFFF\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_7-udf","title":"bitmask_lowest_7 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_16","title":"Parameters","text":"

INPUTS

) AS ( 0x7F\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_range-udf","title":"bitmask_range (UDF)","text":"

Returns a bitmask that can be used to return a subset of an integer representing a bit array. The start_ordinal argument is an integer specifying the starting position of the slice, with start_ordinal = 1 indicating the first bit. The length argument is the number of bits to include in the mask. The arguments were chosen to match the semantics of the SUBSTR function; see https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#substr

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_17","title":"Parameters","text":"

INPUTS

start_ordinal INT64, _length INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_active_in_range-udf","title":"bits28_active_in_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_18","title":"Parameters","text":"

INPUTS

bits INT64, start_offset INT64, n_bits INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_days_since_seen-udf","title":"bits28_days_since_seen (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_19","title":"Parameters","text":"

INPUTS

bits INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_from_string-udf","title":"bits28_from_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_20","title":"Parameters","text":"

INPUTS

s STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_range-udf","title":"bits28_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_21","title":"Parameters","text":"

INPUTS

bits INT64, start_offset INT64, n_bits INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_retention-udf","title":"bits28_retention (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_22","title":"Parameters","text":"

INPUTS

bits INT64, submission_date DATE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_dates-udf","title":"bits28_to_dates (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_23","title":"Parameters","text":"

INPUTS

bits INT64, submission_date DATE\n

OUTPUTS

ARRAY<DATE>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_string-udf","title":"bits28_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_24","title":"Parameters","text":"

INPUTS

bits INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits_from_offsets-udf","title":"bits_from_offsets (UDF)","text":"

Returns a bit pattern of type BYTES compactly encoding the given array of positive integer offsets. This is primarily useful to generate a compact encoding of dates on which a feature was used, with arbitrarily long history. Example aggregation: bits_from_offsets( ARRAY_AGG(IF(foo, DATE_DIFF(anchor_date, submission_date, DAY), NULL) IGNORE NULLS) ) The resulting value can be cast to an INT64 representing the most recent 64 days via: CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64) Or representing the most recent 28 days (compatible with bits28 functions) via: CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64) << 36 >> 36

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_25","title":"Parameters","text":"

INPUTS

offsets ARRAY<INT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_active_n_weeks_ago-udf","title":"bits_to_active_n_weeks_ago (UDF)","text":"

Given a BYTE and an INT64, return whether the user was active that many weeks ago. NULL input returns NULL output.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_26","title":"Parameters","text":"

INPUTS

b BYTES, n INT64\n

OUTPUTS

BOOL\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_seen-udf","title":"bits_to_days_seen (UDF)","text":"

Given a BYTE, get the number of days the user was seen. NULL input returns NULL output.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_27","title":"Parameters","text":"

INPUTS

b BYTES\n

Source | Edit
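A usage sketch for bits_to_days_seen, assuming the function simply counts the set bits, which matches the description above; the byte literal is illustrative only:

SELECT\n  -- b'\\x0f' has four bits set, so four days seen\n  `moz-fx-data-shared-prod.udf.bits_to_days_seen`(b'\\x0f') AS days_seen\n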

"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_first_seen-udf","title":"bits_to_days_since_first_seen (UDF)","text":"

Given a BYTES, return the number of days since the client was first seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the result is 0-indexed, meaning that for \\x01, it will return 0. Results showed this being between 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_first_seen(b BYTES) AS (( SELECT MAX(n) FROM UNNEST(GENERATE_ARRAY( 0, 8 * BYTE_LENGTH(b))) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_seen.sql

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_28","title":"Parameters","text":"

INPUTS

b BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_seen-udf","title":"bits_to_days_since_seen (UDF)","text":"

Given a BYTES, return the number of days since the client was last seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the results are 0-indexed, meaning \\x01 will return 0. Tests showed this being 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_seen(b BYTES) AS (( SELECT MIN(n) FROM UNNEST(GENERATE_ARRAY(0, 364)) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_first_seen.sql

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_29","title":"Parameters","text":"

INPUTS

b BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#bool_to_365_bits-udf","title":"bool_to_365_bits (UDF)","text":"

Convert a boolean to 365 bit byte array

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_30","title":"Parameters","text":"

INPUTS

val BOOLEAN\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#boolean_histogram_to_boolean-udf","title":"boolean_histogram_to_boolean (UDF)","text":"

Given histogram h, return TRUE if it has a value in the \"true\" bucket, or FALSE if it has a value in the \"false\" bucket, or NULL otherwise. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L309-L317

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_31","title":"Parameters","text":"

INPUTS

histogram STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_28_bits-udf","title":"coalesce_adjacent_days_28_bits (UDF)","text":"

We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bits representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_32","title":"Parameters","text":"

INPUTS

prev INT64, curr INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_365_bits-udf","title":"coalesce_adjacent_days_365_bits (UDF)","text":"

Coalesce previous data's PCD with the new data's PCD. We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bytes representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_33","title":"Parameters","text":"

INPUTS

prev BYTES, curr BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_28_bits-udf","title":"combine_adjacent_days_28_bits (UDF)","text":"

Combines two bit patterns. The first pattern represents activity over a 28-day period ending \"yesterday\". The second pattern represents activity as observed today (usually just 0 or 1). We shift the bits in the first pattern by one to set the new baseline as \"today\", then perform a bitwise OR of the two patterns.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_34","title":"Parameters","text":"

INPUTS

prev INT64, curr INT64\n

Source | Edit
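A hypothetical sketch of how combine_adjacent_days_28_bits is typically used when carrying a last-seen bit pattern forward by one day; the table and column names are illustrative, not from a real query:

WITH previous_day AS (\n  SELECT 'client-a' AS client_id, 1 AS days_seen_bits\n),\ncurrent_day AS (\n  SELECT 'client-a' AS client_id, 1 AS days_seen_bits\n)\nSELECT\n  client_id,\n  -- shift yesterday's bits by one day and OR in today's activity\n  `moz-fx-data-shared-prod.udf.combine_adjacent_days_28_bits`(\n    COALESCE(prev.days_seen_bits, 0),\n    COALESCE(curr.days_seen_bits, 0)\n  ) AS days_seen_bits\nFROM\n  previous_day AS prev\nFULL JOIN\n  current_day AS curr\n  USING (client_id)\n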

"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_365_bits-udf","title":"combine_adjacent_days_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_35","title":"Parameters","text":"

INPUTS

prev BYTES, curr BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#combine_days_seen_maps-udf","title":"combine_days_seen_maps (UDF)","text":"

The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs (aka maps) where each entry gives the bit pattern for days in which we saw a ping for a given user in a given key. We combine the bit patterns for the previous day and the current day, returning a single map. See udf.combine_experiment_days for a more specific example of this approach.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_36","title":"Parameters","text":"

INPUTS

-- prev ARRAY<STRUCT<key STRING, value INT64>>, -- curr ARRAY<STRUCT<key STRING, value INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#combine_experiment_days-udf","title":"combine_experiment_days (UDF)","text":"

The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs where each entry gives the bit pattern for days in which we saw a ping for a given user in a given experiment. We combine the bit patterns for the previous day and the current day, returning a single array of experiment structs.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_37","title":"Parameters","text":"

INPUTS

-- prev ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>, -- curr ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#country_code_to_flag-udf","title":"country_code_to_flag (UDF)","text":"

For a given two-letter ISO 3166-1 alpha-2 country code, returns a string consisting of two Unicode regional indicator symbols, which is rendered in supporting fonts (such as in the BigQuery console or STMO) as flag emoji. This is just for fun. See: - https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 - https://en.wikipedia.org/wiki/Regional_Indicator_Symbol

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_38","title":"Parameters","text":"

INPUTS

country_code string\n

Source | Edit
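A small usage sketch for country_code_to_flag; the output is described in a comment since emoji rendering depends on the client:

SELECT\n  `moz-fx-data-shared-prod.udf.country_code_to_flag`('DE') AS flag\n-- expected: the two regional indicator symbols for 'DE', rendered as a German flag emoji\n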

"},{"location":"moz-fx-data-shared-prod/udf/#days_seen_bytes_to_rfm-udf","title":"days_seen_bytes_to_rfm (UDF)","text":"

Return the frequency, recency, and T from a BYTE array, as defined in https://lifetimes.readthedocs.io/en/latest/Quickstart.html#the-shape-of-your-data RFM refers to Recency, Frequency, and Monetary value.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_39","title":"Parameters","text":"

INPUTS

days_seen_bytes BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#days_since_created_profile_as_28_bits-udf","title":"days_since_created_profile_as_28_bits (UDF)","text":"

Takes in a difference between submission date and profile creation date and returns a bit pattern representing the profile creation date IFF the profile date is the same as the submission date or no more than 6 days earlier. Analysis has shown that client-reported profile creation dates are much less reliable outside of this range and cannot be used as reliable indicators of new profile creation.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_40","title":"Parameters","text":"

INPUTS

days_since_created_profile INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#deanonymize_event-udf","title":"deanonymize_event (UDF)","text":"

Rename struct fields in anonymous event tuples to meaningful names.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_41","title":"Parameters","text":"

INPUTS

tuple STRUCT<f0_ INT64, f1_ STRING, f2_ STRING, f3_ STRING, f4_ STRING, f5_ ARRAY<STRUCT<key STRING, value STRING>>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#decode_int64-udf","title":"decode_int64 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_42","title":"Parameters","text":"

INPUTS

raw BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#dedupe_array-udf","title":"dedupe_array (UDF)","text":"

Return an array containing only distinct values of the given array

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_43","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_clients-udf","title":"distribution_model_clients (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_44","title":"Parameters","text":"

INPUTS

distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_ga_metrics-udf","title":"distribution_model_ga_metrics (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_45","title":"Parameters","text":"

INPUTS

) RETURNS STRING AS ( 'helloworld'\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_installs-udf","title":"distribution_model_installs (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_46","title":"Parameters","text":"

INPUTS

distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#event_code_points_to_string-udf","title":"event_code_points_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_47","title":"Parameters","text":"

INPUTS

code_points ANY TYPE\n

OUTPUTS

ARRAY<INT64>\n
"},{"location":"moz-fx-data-shared-prod/udf/#experiment_search_metric_to_array-udf","title":"experiment_search_metric_to_array (UDF)","text":"

Used for testing only. Reproduces the string transformations done in experiment_search_events_live_v1 materialized views.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_48","title":"Parameters","text":"

INPUTS

metric ARRAY<STRUCT<key STRING, value INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_count_histogram_value-udf","title":"extract_count_histogram_value (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_49","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_type-udf","title":"extract_document_type (UDF)","text":"

Extract the document type from a table name e.g. _TABLE_SUFFIX.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_50","title":"Parameters","text":"

INPUTS

table_name STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_version-udf","title":"extract_document_version (UDF)","text":"

Extract the document version from a table name e.g. _TABLE_SUFFIX.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_51","title":"Parameters","text":"

INPUTS

table_name STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"

This is a performance optimization compared to the more general mozfun.hist.extract for cases where only the histogram sum is needed. It must support all the same format variants as mozfun.hist.extract but this simplification is necessary to keep the main_summary query complexity in check.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_52","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#extract_schema_validation_path-udf","title":"extract_schema_validation_path (UDF)","text":"

Return a path derived from an error message in payload_bytes_error

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_53","title":"Parameters","text":"

INPUTS

error_message STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"

Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.

Fenix originally used an 8-digit app_build format.

In short, it is yDDDHHmm.

The last date seen with an 8-digit build ID is 2020-08-10.

Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.

This function tolerates both formats.

After using this you may wish to DATETIME_TRUNC(result, DAY) for grouping by build date.
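For example, a hedged sketch of grouping pings by build date; the table reference is illustrative only and not prescribed by this UDF:

SELECT\n  DATETIME_TRUNC(udf.fenix_build_to_datetime(client_info.app_build), DAY) AS build_date,  -- truncate to the build date\n  COUNT(*) AS ping_count\nFROM\n  mozdata.org_mozilla_firefox.baseline  -- illustrative Fenix table\nGROUP BY\n  build_date\n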

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_54","title":"Parameters","text":"

INPUTS

app_build STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_clients-udf","title":"funnel_derived_clients (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_55","title":"Parameters","text":"

INPUTS

os STRING, first_seen_date DATE, build_id STRING, attribution_source STRING, attribution_ua STRING, startup_profile_selection_reason STRING, distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_ga_metrics-udf","title":"funnel_derived_ga_metrics (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_56","title":"Parameters","text":"

INPUTS

device_category STRING, browser STRING, operating_system STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_installs-udf","title":"funnel_derived_installs (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_57","title":"Parameters","text":"

INPUTS

silent BOOLEAN, submission_timestamp TIMESTAMP, build_id STRING, attribution STRING, distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#ga_is_mozilla_browser-udf","title":"ga_is_mozilla_browser (UDF)","text":"

Determine if a browser in Google Analytics data is produced by Mozilla

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_58","title":"Parameters","text":"

INPUTS

browser STRING\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct-udf","title":"geo_struct (UDF)","text":"

Convert geoip lookup fields to a struct, replacing '??' with NULL. Returns NULL if the required field country would be NULL. Replaces '??' with NULL because '??' is a placeholder that may be used if there was an issue during geoip lookup.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_59","title":"Parameters","text":"

INPUTS

country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct_set_defaults-udf","title":"geo_struct_set_defaults (UDF)","text":"

Convert geoip lookup fields to a struct, replacing NULLs with \"??\". This allows for better joins on those fields, but needs to be changed back to NULL at the end of the query.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_60","title":"Parameters","text":"

INPUTS

country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#get_key-udf","title":"get_key (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_61","title":"Parameters","text":"

INPUTS

map ANY TYPE, k ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_62","title":"Parameters","text":"

INPUTS

map ANY TYPE, k ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_nanos-udf","title":"glean_timespan_nanos (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_63","title":"Parameters","text":"

INPUTS

timespan STRUCT<time_unit STRING, value INT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_seconds-udf","title":"glean_timespan_seconds (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_64","title":"Parameters","text":"

INPUTS

timespan STRUCT<time_unit STRING, value INT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#gzip_length_footer-udf","title":"gzip_length_footer (UDF)","text":"

Given a gzip compressed byte string, extract the uncompressed size from the footer. WARNING: THIS FUNCTION IS NOT RELIABLE FOR ARBITRARY GZIP STREAMS. It should, however, be safe to use for checking the decompressed size of payload in payload_bytes_decoded (and NOT payload_bytes_raw) because that payload is produced by the decoder and limited to conditions where the footer is accurate. From https://stackoverflow.com/a/9213826 First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.) Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.) Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_65","title":"Parameters","text":"

INPUTS

compressed BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_max_key_with_nonzero_value-udf","title":"histogram_max_key_with_nonzero_value (UDF)","text":"

Find the largest numeric bucket that contains a value greater than zero. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L253-L266

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_66","title":"Parameters","text":"

INPUTS

histogram STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_merge-udf","title":"histogram_merge (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_67","title":"Parameters","text":"

INPUTS

histogram_list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_normalize-udf","title":"histogram_normalize (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_68","title":"Parameters","text":"

INPUTS

histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n

OUTPUTS

STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_percentiles-udf","title":"histogram_percentiles (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_69","title":"Parameters","text":"

INPUTS

histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_mean-udf","title":"histogram_to_mean (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_70","title":"Parameters","text":"

INPUTS

histogram ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_threshold_count-udf","title":"histogram_to_threshold_count (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_71","title":"Parameters","text":"

INPUTS

histogram STRING, threshold INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#hmac_sha256-udf","title":"hmac_sha256 (UDF)","text":"

Given a key and message, return the HMAC-SHA256 hash. This algorithm can be found in Wikipedia: https://en.wikipedia.org/wiki/HMAC#Implementation This implementation is validated against the NIST test vectors. See test/validation/hmac_sha256.py for more information.
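A minimal invocation sketch; the literal byte inputs are chosen only for illustration:

SELECT\n  udf.hmac_sha256(b'key', b'message')  -- HMAC-SHA256 digest of the message under the given key\n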

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_72","title":"Parameters","text":"

INPUTS

key BYTES, message BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#int_to_365_bits-udf","title":"int_to_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_73","title":"Parameters","text":"

INPUTS

value INT64\n

OUTPUTS

BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#int_to_hex_string-udf","title":"int_to_hex_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_74","title":"Parameters","text":"

INPUTS

value INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_75","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_int_map-udf","title":"json_extract_int_map (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_76","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#json_mode_last-udf","title":"json_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_77","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#keyed_histogram_get_sum-udf","title":"keyed_histogram_get_sum (UDF)","text":"

Take a keyed histogram of type STRUCT, extract the histogram of the given key, and return the sum value"},{"location":"moz-fx-data-shared-prod/udf/#parameters_78","title":"Parameters","text":"

INPUTS

keyed_histogram ANY TYPE, target_key STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_append_to_json_string-udf","title":"kv_array_append_to_json_string (UDF)","text":"

Returns a JSON string which has the pair appended to the provided input JSON string. NULL is also valid for input. Examples: udf.kv_array_append_to_json_string('{\"foo\":\"bar\"}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"foo\":\"bar\",\"baz\":\"boo\"}' udf.kv_array_append_to_json_string('{}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"baz\": \"boo\"}'

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_79","title":"Parameters","text":"

INPUTS

input STRING, arr ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_to_json_string-udf","title":"kv_array_to_json_string (UDF)","text":"

Returns a JSON string representing the input key-value array. Value type must be able to be represented as a string - this function will cast to a string. At Mozilla, the schema for a map is STRUCT<key_value ARRAY<STRUCT<key, value>>>. To use this with that representation, it should be as udf.kv_array_to_json_string(struct.key_value)."},{"location":"moz-fx-data-shared-prod/udf/#parameters_80","title":"Parameters","text":"

INPUTS

kv_arr ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#main_summary_scalars-udf","title":"main_summary_scalars (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_81","title":"Parameters","text":"

INPUTS

processes ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#map_bing_revenue_country_to_country_code-udf","title":"map_bing_revenue_country_to_country_code (UDF)","text":"

For use by LTV revenue join only. Maps the Bing country to a country code. Only keeps the country codes we want to aggregate on.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_82","title":"Parameters","text":"

INPUTS

country STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#map_mode_last-udf","title":"map_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_83","title":"Parameters","text":"

INPUTS

entries ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#map_revenue_country-udf","title":"map_revenue_country (UDF)","text":"

Only for use by the LTV Revenue join. Maps country codes to the codes we have in the revenue dataset. Buckets small Bing countries into \"other\".

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_84","title":"Parameters","text":"

INPUTS

engine STRING, country STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#map_sum-udf","title":"map_sum (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_85","title":"Parameters","text":"

INPUTS

entries ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#marketing_attributable_desktop-udf","title":"marketing_attributable_desktop (UDF)","text":"

This UDF helps determine whether acquired desktop clients are attributable to marketing efforts or not

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_86","title":"Parameters","text":"

INPUTS

medium STRING\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#merge_scalar_user_data-udf","title":"merge_scalar_user_data (UDF)","text":"

Given an array of scalar metric data that might have duplicate values for a metric, merge them into one value.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_87","title":"Parameters","text":"

INPUTS

aggs ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n

OUTPUTS

ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#mod_uint128-udf","title":"mod_uint128 (UDF)","text":"

This function returns "dividend mod divisor" where the dividend and the result are encoded in bytes, and the divisor is an integer.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_88","title":"Parameters","text":"

INPUTS

dividend BYTES, divisor INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#mode_last-udf","title":"mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_89","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_90","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#monetized_search-udf","title":"monetized_search (UDF)","text":"

Stub monetized_search UDF for tests

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_91","title":"Parameters","text":"

INPUTS

engine STRING, country STRING, distribution_id STRING, submission_date DATE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#new_monthly_engine_searches_struct-udf","title":"new_monthly_engine_searches_struct (UDF)","text":"

This struct represents the past year's worth of searches. Each month has its own entry, hence 12.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_92","title":"Parameters","text":"

INPUTS

) AS ( STRUCT( udf.array_of_12_zeroes(\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_fenix_metrics-udf","title":"normalize_fenix_metrics (UDF)","text":"

Accepts a glean metrics struct as input and returns a modified struct that nulls out histograms for older versions of the Glean SDK that reported pathological binning; see https://bugzilla.mozilla.org/show_bug.cgi?id=1592930

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_93","title":"Parameters","text":"

INPUTS

telemetry_sdk_build STRING, metrics ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_baseline_client_info-udf","title":"normalize_glean_baseline_client_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_94","title":"Parameters","text":"

INPUTS

client_info ANY TYPE, metrics ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_ping_info-udf","title":"normalize_glean_ping_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_95","title":"Parameters","text":"

INPUTS

ping_info ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_main_payload-udf","title":"normalize_main_payload (UDF)","text":"

Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_96","title":"Parameters","text":"

INPUTS

payload ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_metadata-udf","title":"normalize_metadata (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_97","title":"Parameters","text":"

INPUTS

metadata ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_monthly_searches-udf","title":"normalize_monthly_searches (UDF)","text":"

Sum up the monthly search count arrays by normalized engine

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_98","title":"Parameters","text":"

INPUTS

engine_searches ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_os-udf","title":"normalize_os (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_99","title":"Parameters","text":"

INPUTS

os STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#normalize_search_engine-udf","title":"normalize_search_engine (UDF)","text":"

Return normalized engine name for recognized engines This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_100","title":"Parameters","text":"

INPUTS

engine STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#null_if_empty_list-udf","title":"null_if_empty_list (UDF)","text":"

Return NULL if list is empty, otherwise return list. This cannot be done with NULLIF because NULLIF does not support arrays.
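A small sketch of the behavior described above; the values are illustrative:

SELECT\n  udf.null_if_empty_list(ARRAY<INT64>[]) AS empty_input,  -- expected: NULL\n  udf.null_if_empty_list([1, 2, 3]) AS non_empty_input    -- expected: [1, 2, 3]\n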

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_101","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#one_as_365_bits-udf","title":"one_as_365_bits (UDF)","text":"

One represented as a byte array of 365 bits

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_102","title":"Parameters","text":"

INPUTS

) AS ( CONCAT(REPEAT(b'\\x00', 45\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_desktop-udf","title":"organic_vs_paid_desktop (UDF)","text":"

This UDF helps classify desktop client attribution as organic or paid

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_103","title":"Parameters","text":"

INPUTS

medium STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_mobile-udf","title":"organic_vs_paid_mobile (UDF)","text":"

This UDF helps classify mobile client attribution as organic or paid

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_104","title":"Parameters","text":"

INPUTS

adjust_network STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#pack_event_properties-udf","title":"pack_event_properties (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_105","title":"Parameters","text":"

INPUTS

event_properties ANY TYPE, indices ANY TYPE\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n
"},{"location":"moz-fx-data-shared-prod/udf/#parquet_array_sum-udf","title":"parquet_array_sum (UDF)","text":"

Sum an array from a parquet-derived field. These are lists of an element that contain the field value.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_106","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#parse_desktop_telemetry_uri-udf","title":"parse_desktop_telemetry_uri (UDF)","text":"

Parses and labels the components of a telemetry desktop ping submission uri Per https://docs.telemetry.mozilla.org/concepts/pipeline/http_edge_spec.html#special-handling-for-firefox-desktop-telemetry the format is /submit/telemetry/docId/docType/appName/appVersion/appUpdateChannel/appBuildID e.g. /submit/telemetry/ce39b608-f595-4c69-b6a6-f7a436604648/main/Firefox/61.0a1/nightly/20180328030202

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_107","title":"Parameters","text":"

INPUTS

uri STRING\n

OUTPUTS

STRUCT<namespace STRING, document_id STRING, document_type STRING, app_name STRING, app_version STRING, app_update_channel STRING, app_build_id STRING>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#parse_iso8601_date-udf","title":"parse_iso8601_date (UDF)","text":"

Take an ISO 8601 date or date-and-time string and return a DATE. Return NULL if parsing fails. Possible formats: 2019-11-04, 2019-11-04T21:15:00+00:00, 2019-11-04T21:15:00Z, 20191104T211500Z
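A sketch showing the listed formats; the expected results are inferred from the description:

SELECT\n  udf.parse_iso8601_date('2019-11-04'),\n  udf.parse_iso8601_date('2019-11-04T21:15:00+00:00'),\n  udf.parse_iso8601_date('20191104T211500Z')\n-- each expected to return DATE '2019-11-04'\n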

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_108","title":"Parameters","text":"

INPUTS

date_str STRING\n

OUTPUTS

DATE\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_clients-udf","title":"partner_org_clients (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_109","title":"Parameters","text":"

INPUTS

distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_ga_metrics-udf","title":"partner_org_ga_metrics (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_110","title":"Parameters","text":"

INPUTS

) RETURNS STRING AS ( (SELECT 'hola_world' AS partner_org\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_installs-udf","title":"partner_org_installs (UDF)","text":"

This is a stub implementation for use with tests; real implementation is in private-bigquery-etl

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_111","title":"Parameters","text":"

INPUTS

distribution_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_leading_set_bit-udf","title":"pos_of_leading_set_bit (UDF)","text":"

Returns the 0-based index of the first set bit, or NULL if no bits are set.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_112","title":"Parameters","text":"

INPUTS

i INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_trailing_set_bit-udf","title":"pos_of_trailing_set_bit (UDF)","text":"

Identical to bits28_days_since_seen. Returns a 0-based index of the rightmost set bit in the passed bit pattern or null if no bits are set (bits = 0). To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
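A sketch of the equivalent arithmetic, assuming the description above; 18 is binary 10010, so the rightmost set bit is at position 1:

SELECT\n  udf.pos_of_trailing_set_bit(18) AS via_udf,    -- expected: 1\n  CAST(LOG(18 & -18, 2) AS INT64) AS via_log2    -- isolate the rightmost set bit, then take the base-2 logarithm\n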

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_113","title":"Parameters","text":"

INPUTS

bits INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#product_info_with_baseline-udf","title":"product_info_with_baseline (UDF)","text":"

Similar to mozfun.norm.product_info(), but this UDF also handles "baseline" apps that were introduced to differentiate, for certain apps, whether data is sent through Glean or core pings. This UDF has been temporarily introduced as part of https://bugzilla.mozilla.org/show_bug.cgi?id=1775216

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_114","title":"Parameters","text":"

INPUTS

legacy_app_name STRING, normalized_os STRING\n

OUTPUTS

STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#pseudonymize_ad_id-udf","title":"pseudonymize_ad_id (UDF)","text":"

Pseudonymize Ad IDs, handling opt-outs.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_115","title":"Parameters","text":"

INPUTS

hashed_ad_id STRING, key BYTES\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#quantile_search_metric_contribution-udf","title":"quantile_search_metric_contribution (UDF)","text":"

This function returns how much of one metric is contributed by the quantile of another metric. The quantile variable should add an offset to get the required percentile value. Example: udf.quantile_search_metric_contribution(sap, search_with_ads, sap_percentiles[OFFSET(9)]) returns search_with_ads if the sap value is in the top 10% of volume, else null.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_116","title":"Parameters","text":"

INPUTS

metric1 FLOAT64, metric2 FLOAT64, quantile FLOAT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#round_timestamp_to_minute-udf","title":"round_timestamp_to_minute (UDF)","text":"

Floor a timestamp object to the given minute interval.
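For instance, flooring to a 5-minute interval; a sketch with illustrative values:

SELECT\n  udf.round_timestamp_to_minute(TIMESTAMP '2019-01-01 12:34:56', 5)\n-- expected: 2019-01-01 12:30:00 UTC\n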

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_117","title":"Parameters","text":"

INPUTS

timestamp_expression TIMESTAMP, minute INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#safe_crc32_uuid-udf","title":"safe_crc32_uuid (UDF)","text":"

Calculate the CRC-32 hash of a 36-byte UUID, or NULL if the value isn't 36 bytes. This implementation is limited to an exact length because recursion does not work. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_118","title":"Parameters","text":"

INPUTS

) AS ( [ 0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685, 2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995, 2125561021, 3887607047, 2428444049, 498536548, 1789927666, 4089016648, 2227061214, 450548861, 1843258603, 4107580753, 2211677639, 325883990, 1684777152, 4251122042, 2321926636, 335633487, 1661365465, 4195302755, 2366115317, 997073096, 1281953886, 3579855332, 2724688242, 1006888145, 1258607687, 3524101629, 2768942443, 901097722, 1119000684, 3686517206, 2898065728, 853044451, 1172266101, 3705015759, 2882616665, 651767980, 1373503546, 3369554304, 3218104598, 565507253, 1454621731, 3485111705, 3099436303, 671266974, 1594198024, 3322730930, 2970347812, 795835527, 1483230225, 3244367275, 3060149565, 1994146192, 31158534, 2563907772, 4023717930, 1907459465, 112637215, 2680153253, 3904427059, 2013776290, 251722036, 2517215374, 3775830040, 2137656763, 141376813, 2439277719, 3865271297, 1802195444, 476864866, 2238001368, 4066508878, 1812370925, 453092731, 2181625025, 4111451223, 1706088902, 314042704, 2344532202, 4240017532, 1658658271, 366619977, 2362670323, 4224994405, 1303535960, 984961486, 2747007092, 3569037538, 1256170817, 1037604311, 2765210733, 3554079995, 1131014506, 879679996, 2909243462, 3663771856, 1141124467, 855842277, 2852801631, 3708648649, 1342533948, 654459306, 3188396048, 3373015174, 1466479909, 544179635, 3110523913, 3462522015, 1591671054, 702138776, 2966460450, 3352799412, 1504918807, 783551873, 3082640443, 3233442989, 3988292384, 2596254646, 62317068, 1957810842, 3939845945, 2647816111, 81470997, 1943803523, 3814918930, 2489596804, 225274430, 2053790376, 3826175755, 2466906013, 167816743, 2097651377, 4027552580, 2265490386, 503444072, 1762050814, 4150417245, 2154129355, 426522225, 1852507879, 4275313526, 2312317920, 282753626, 1742555852, 4189708143, 2394877945, 397917763, 1622183637, 3604390888, 2714866558, 953729732, 1340076626, 3518719985, 2797360999, 1068828381, 1219638859, 3624741850, 2936675148, 906185462, 1090812512, 3747672003, 2825379669, 829329135, 1181335161, 3412177804, 3160834842, 628085408, 1382605366, 3423369109, 3138078467, 570562233, 1426400815, 3317316542, 2998733608, 733239954, 1555261956, 3268935591, 3050360625, 752459403, 1541320221, 2607071920, 3965973030, 1969922972, 40735498, 2617837225, 3943577151, 1913087877, 83908371, 2512341634, 3803740692, 2075208622, 213261112, 2463272603, 3855990285, 2094854071, 198958881, 2262029012, 4057260610, 1759359992, 534414190, 2176718541, 4139329115, 1873836001, 414664567, 2282248934, 4279200368, 1711684554, 285281116, 2405801727, 4167216745, 1634467795, 376229701, 2685067896, 3608007406, 1308918612, 956543938, 2808555105, 3495958263, 1231636301, 1047427035, 2932959818, 3654703836, 1088359270, 936918000, 2847714899, 3736837829, 1202900863, 817233897, 3183342108, 3401237130, 1404277552, 615818150, 3134207493, 3453421203, 1423857449, 601450431, 3009837614, 3294710456, 1567103746, 711928724, 3020668471, 3272380065, 1510334235, 755167117 ]\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#safe_sample_id-udf","title":"safe_sample_id (UDF)","text":"

Stably hash a client_id to an integer between 0 and 99, or NULL if client_id isn't 36 bytes

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_119","title":"Parameters","text":"

INPUTS

client_id STRING\n

OUTPUTS

BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#search_counts_map_sum-udf","title":"search_counts_map_sum (UDF)","text":"

Calculate the sums of search counts per source and engine

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_120","title":"Parameters","text":"

INPUTS

entries ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#shift_28_bits_one_day-udf","title":"shift_28_bits_one_day (UDF)","text":"

Shift input bits one day left and drop any bits beyond 28 days.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_121","title":"Parameters","text":"

INPUTS

x INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#shift_365_bits_one_day-udf","title":"shift_365_bits_one_day (UDF)","text":"

Shift input bits one day left and drop any bits beyond 365 days.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_122","title":"Parameters","text":"

INPUTS

x BYTES\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#shift_one_day-udf","title":"shift_one_day (UDF)","text":"

Returns the bitfield shifted by one day, 0 for NULL

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_123","title":"Parameters","text":"

INPUTS

x INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#smoot_usage_from_28_bits-udf","title":"smoot_usage_from_28_bits (UDF)","text":"

Calculates a variety of metrics based on bit patterns of daily usage for the smoot_usage_* tables.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_124","title":"Parameters","text":"

INPUTS

bit_arrays ARRAY<STRUCT<days_created_profile_bits INT64, days_active_bits INT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#vector_add-udf","title":"vector_add (UDF)","text":"

This function adds two vectors. The two vectors can have different lengths. If one vector is null, the other vector will be returned directly.

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_125","title":"Parameters","text":"

INPUTS

a ARRAY<INT64>, b ARRAY<INT64>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#zero_as_365_bits-udf","title":"zero_as_365_bits (UDF)","text":"

Zero represented as a 365-bit byte array

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_126","title":"Parameters","text":"

INPUTS

) AS ( REPEAT(b'\\x00', 46\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf/#zeroed_array-udf","title":"zeroed_array (UDF)","text":"

Generates an array of all zeroes, of arbitrary length
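A minimal sketch:

SELECT\n  udf.zeroed_array(4)\n-- expected: [0, 0, 0, 0]\n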

"},{"location":"moz-fx-data-shared-prod/udf/#parameters_127","title":"Parameters","text":"

INPUTS

len INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/","title":"Udf js","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#bootstrap_percentile_ci-udf","title":"bootstrap_percentile_ci (UDF)","text":"

Calculate a confidence interval using an efficient bootstrap sampling technique for a given percentile of a histogram. This implementation relies on the stdlib.js library and the binomial quantile function (https://github.com/stdlib-js/stats-base-dists-binomial-quantile/) for randomly sampling from a binomial distribution.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters","title":"Parameters","text":"

INPUTS

percentiles ARRAY<INT64>, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>, metric STRING\n

OUTPUTS

ARRAY<STRUCT<metric STRING, statistic STRING, point FLOAT64, lower FLOAT64, upper FLOAT64, parameter STRING>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#crc32-udf","title":"crc32 (UDF)","text":"

Calculate the CRC-32 hash of an input string. The implementation here could be optimized. In particular, it calculates a lookup table on every invocation which could be cached and reused. In practice, though, this implementation appears to be fast enough that further optimization is not yet warranted. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_1","title":"Parameters","text":"

INPUTS

data STRING\n

OUTPUTS

INT64 DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#decode_uri_attribution-udf","title":"decode_uri_attribution (UDF)","text":"

URL decodes the raw firefox_installer.install.attribution string to a STRUCT. The fields campaign, content, dlsource, dltoken, experiment, medium, source, ua, and variation are extracted from the string. If any value is (not+set) it is converted to (not set) to match the text from GA when the fields are not set.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_2","title":"Parameters","text":"

INPUTS

attribution STRING\n

OUTPUTS

STRUCT<campaign STRING, content STRING, dlsource STRING, dltoken STRING, experiment STRING, medium STRING, source STRING, ua STRING, variation STRING>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#extract_string_from_bytes-udf","title":"extract_string_from_bytes (UDF)","text":"

Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function extracts string data from payload which is in bytes.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_3","title":"Parameters","text":"

INPUTS

payload BYTES\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#gunzip-udf","title":"gunzip (UDF)","text":"

Unzips a GZIP string. This implementation relies on the zlib.js library (https://github.com/imaya/zlib.js) and the atob function for decoding base64.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_4","title":"Parameters","text":"

INPUTS

input BYTES\n

OUTPUTS

STRING DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_mean_ci-udf","title":"jackknife_mean_ci (UDF)","text":"

Calculates a confidence interval using a jackknife resampling technique for the mean of an array of values for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_5","title":"Parameters","text":"

INPUTS

n_buckets INT64, values_per_bucket ARRAY<FLOAT64>\n

OUTPUTS

STRUCT<low FLOAT64, high FLOAT64, pm FLOAT64>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_percentile_ci-udf","title":"jackknife_percentile_ci (UDF)","text":"

Calculate a confidence interval using a jackknife resampling technique for a given percentile of a histogram.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_6","title":"Parameters","text":"

INPUTS

percentile FLOAT64, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>\n

OUTPUTS

STRUCT<low FLOAT64, high FLOAT64, percentile FLOAT64>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_ratio_ci-udf","title":"jackknife_ratio_ci (UDF)","text":"

Calculates a confidence interval using a jackknife resampling technique for the weighted mean of an array of ratios for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(active_days_in_week) AS active_days_in_week, SUM(wau) AS wau FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_ratio_ci(20, ARRAY_AGG(STRUCT(CAST(active_days_in_week AS float64), CAST(wau as FLOAT64)))) AS intensity FROM bucketed GROUP BY submission_date

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_7","title":"Parameters","text":"

INPUTS

n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n

OUTPUTS

intensity FROM bucketed GROUP BY submission_date */ CREATE OR REPLACE FUNCTION udf_js.jackknife_ratio_ci( n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_sum_ci-udf","title":"jackknife_sum_ci (UDF)","text":"

Calculates a confidence interval using a jackknife resampling technique for the sum of an array of counts for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate count per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(dau) AS dau_sum FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_sum_ci(ARRAY_AGG(dau_sum)).* FROM bucketed GROUP BY submission_date

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_8","title":"Parameters","text":"

INPUTS

n_buckets INT64, counts_per_bucket ARRAY<INT64>\n

OUTPUTS

STRUCT<total INT64, low INT64, high INT64, pm INT64>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_events-udf","title":"json_extract_events (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_9","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

ARRAY<STRUCT<event_process STRING, event_timestamp INT64, event_category STRING, event_object STRING, event_method STRING, event_string_value STRING, event_map_values ARRAY<STRUCT<key STRING, value STRING>>>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":"

Returns a parsed struct from a JSON string representing a histogram. This implementation uses JavaScript and is provided for performance comparison; see udf/udf_json_extract_histogram for a pure SQL implementation that will likely be more usable in practice.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_10","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

STRUCT<bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_keyed_histogram-udf","title":"json_extract_keyed_histogram (UDF)","text":"

Returns an array of parsed structs from a JSON string representing a keyed histogram. This is likely only useful for histograms that weren't properly parsed to fields, so ended up embedded in an additional_properties JSON blob. Normally, keyed histograms will be modeled as a key/value struct where the values are JSON representations of single histograms. There is no pure SQL equivalent to this function, since BigQuery does not provide any functions for listing or iterating over keys in a JSON map.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_11","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_missing_cols-udf","title":"json_extract_missing_cols (UDF)","text":"

Extract missing columns from additional properties. More generally, get a list of nodes from a JSON blob. Array elements are indicated as [...]. param input: The JSON blob to explode param indicates_node: An array of strings. If a key's value is an object, and contains one of these values, that key is returned as a node. param known_nodes: An array of strings. If a key is in this array, it is returned as a node. Notes: - Use indicates_node for things like histograms. For example ['histogram_type'] will ensure that each histogram will be returned as a missing node, rather than the subvalues within the histogram (e.g. values, sum, etc.) - Use known_nodes if you're aware of a missing section, like ['simpleMeasurements'] See here for an example usage https://sql.telemetry.mozilla.org/queries/64460/source

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_12","title":"Parameters","text":"

INPUTS

input STRING, indicates_node ARRAY<STRING>, known_nodes ARRAY<STRING>\n

OUTPUTS

ARRAY<STRING>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_active_addons-udf","title":"main_summary_active_addons (UDF)","text":"

Add fields from additional_attributes to active_addons in main pings. Return an array instead of a \"map\" for backwards compatibility. The INT64 columns from BigQuery may be passed as strings, so parseInt before returning them if they will be coerced to BOOL. The fields from additional_attributes due to union types: integer or boolean for foreignInstall and userDisabled; string or number for version. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L422-L449

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_13","title":"Parameters","text":"

INPUTS

active_addons ARRAY<STRUCT<key STRING, value STRUCT<app_disabled BOOL, blocklisted BOOL, description STRING, foreign_install INT64, has_binary_components BOOL, install_day INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL, name STRING, scope INT64, signed_state INT64, type STRING, update_day INT64, user_disabled INT64, version STRING>>>, active_addons_json STRING\n

OUTPUTS

ARRAY<STRUCT<addon_id STRING, blocklisted BOOL, name STRING, user_disabled BOOL, app_disabled BOOL, version STRING, scope INT64, type STRING, foreign_install BOOL, has_binary_components BOOL, install_day INT64, update_day INT64, signed_state INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_addon_scalars-udf","title":"main_summary_addon_scalars (UDF)","text":"

Parse scalars from payload.processes.dynamic into map columns for each value type. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L385-L399

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_14","title":"Parameters","text":"

INPUTS

dynamic_scalars_json STRING, dynamic_keyed_scalars_json STRING\n

OUTPUTS

STRUCT<keyed_boolean_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value BOOL>>>>, keyed_uint_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value INT64>>>>, string_addon_scalars ARRAY<STRUCT<key STRING, value STRING>>, keyed_string_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value STRING>>>>, uint_addon_scalars ARRAY<STRUCT<key STRING, value INT64>>, boolean_addon_scalars ARRAY<STRUCT<key STRING, value BOOL>>>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_disabled_addons-udf","title":"main_summary_disabled_addons (UDF)","text":"

Report the ids of the addons which are in the addonDetails but not in the activeAddons. They are the disabled addons (possibly because they are legacy). We need this as addonDetails may contain both disabled and active addons. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L451-L464

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_15","title":"Parameters","text":"

INPUTS

active_addon_ids ARRAY<STRING>, addon_details_json STRING\n

OUTPUTS

ARRAY<STRING>DETERMINISTIC\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#parse_sponsored_interaction-udf","title":"parse_sponsored_interaction (UDF)","text":"

Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function parses the sponsored interaction column from the payload_bytes_error.contextual_services table.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_16","title":"Parameters","text":"

INPUTS

params STRING\n

OUTPUTS

STRUCT<`source` STRING, formFactor STRING, scenario STRING, interactionType STRING, contextId STRING, reportingUrl STRING, requestId STRING, submissionTimestamp TIMESTAMP, parsedReportingUrl JSON, originalDocType STRING, originalNamespace STRING, interactionCount INTEGER, flaggedFraud BOOLEAN>\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#sample_id-udf","title":"sample_id (UDF)","text":"

Stably hash a client_id to an integer between 0 and 99. This function is technically defined in SQL, but it calls a JS UDF implementation of a CRC-32 hash, so we defined it here to make it clear that its performance may be limited by BigQuery's JavaScript UDF environment.

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_17","title":"Parameters","text":"

INPUTS

client_id STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"moz-fx-data-shared-prod/udf_js/#snake_case_columns-udf","title":"snake_case_columns (UDF)","text":"

This UDF takes a list of column names, converts them to snake_case, and transforms them to be compatible with the BigQuery column naming format. Based on the existing ingestion logic: https://github.com/mozilla/gcp-ingestion/blob/dad29698271e543018eddbb3b771ad7942bf4ce5/ingestion-core/src/main/java/com/mozilla/telemetry/ingestion/core/transform/PubsubMessageToObjectNode.java#L824

"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_18","title":"Parameters","text":"

INPUTS

input ARRAY<STRING>\n

OUTPUTS

ARRAY<STRING>DETERMINISTIC\n

Source | Edit

"},{"location":"mozfun/about/","title":"mozfun","text":"

mozfun is a public GCP project provisioning publicly accessible user-defined functions (UDFs) and other function-like resources.

"},{"location":"mozfun/addons/","title":"Addons","text":""},{"location":"mozfun/addons/#is_adblocker-udf","title":"is_adblocker (UDF)","text":"

Returns whether a given Addon ID is an adblocker.

Determine if a given Addon ID is for an adblocker.

As an example, this query will give the number of users who have an adblocker installed.

SELECT\n    submission_date,\n    COUNT(DISTINCT client_id) AS dau,\nFROM\n    mozdata.telemetry.addons\nWHERE\n    mozfun.addons.is_adblocker(addon_id)\n    AND submission_date >= \"2023-01-01\"\nGROUP BY\n    submission_date\n

"},{"location":"mozfun/addons/#parameters","title":"Parameters","text":"

INPUTS

addon_id STRING\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/","title":"Assert","text":""},{"location":"mozfun/assert/#all_fields_null-udf","title":"all_fields_null (UDF)","text":""},{"location":"mozfun/assert/#parameters","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#approx_equals-udf","title":"approx_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_1","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE, tolerance FLOAT64\n

Source | Edit

"},{"location":"mozfun/assert/#array_empty-udf","title":"array_empty (UDF)","text":""},{"location":"mozfun/assert/#parameters_2","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#array_equals-udf","title":"array_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_3","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#array_equals_any_order-udf","title":"array_equals_any_order (UDF)","text":""},{"location":"mozfun/assert/#parameters_4","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#equals-udf","title":"equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_5","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#error-udf","title":"error (UDF)","text":""},{"location":"mozfun/assert/#parameters_6","title":"Parameters","text":"

INPUTS

name STRING, expected ANY TYPE, actual ANY TYPE\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/#false-udf","title":"false (UDF)","text":""},{"location":"mozfun/assert/#parameters_7","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

OUTPUTS

BOOL\n

Source | Edit

"},{"location":"mozfun/assert/#histogram_equals-udf","title":"histogram_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_8","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/#json_equals-udf","title":"json_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_9","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#map_entries_equals-udf","title":"map_entries_equals (UDF)","text":"

Like map_equals, but the error message contains only the offending entry

"},{"location":"mozfun/assert/#parameters_10","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/#map_equals-udf","title":"map_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_11","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/assert/#not_null-udf","title":"not_null (UDF)","text":""},{"location":"mozfun/assert/#parameters_12","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n
"},{"location":"mozfun/assert/#null-udf","title":"null (UDF)","text":""},{"location":"mozfun/assert/#parameters_13","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#sql_equals-udf","title":"sql_equals (UDF)","text":"

Compare SQL Strings for equality

"},{"location":"mozfun/assert/#parameters_14","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#struct_equals-udf","title":"struct_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_15","title":"Parameters","text":"

INPUTS

expected ANY TYPE, actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/assert/#true-udf","title":"true (UDF)","text":""},{"location":"mozfun/assert/#parameters_16","title":"Parameters","text":"

INPUTS

actual ANY TYPE\n

Source | Edit

"},{"location":"mozfun/bits28/","title":"bits28","text":"

The bits28 functions provide an API for working with \"bit pattern\" INT64 fields, as used in the clients_last_seen dataset for desktop Firefox and similar datasets for other applications.

A powerful feature of the clients_last_seen methodology is that it doesn't record specific metrics like MAU and WAU directly, but rather each row stores a history of the discrete days on which a client was active in the past 28 days. We could calculate active users in a 10 day or 25 day window just as efficiently as a 7 day (WAU) or 28 day (MAU) window. But we can also define completely new metrics based on these usage histories, such as various retention definitions.

The usage history is encoded as a "bit pattern" where the physical type of the field is a BigQuery INT64, but logically the integer represents an array of bits, with each 1 indicating a day where the given client was active and each 0 indicating a day where the client was inactive.
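As a hedged sketch, WAU and MAU can both be derived from the same bit pattern by checking whether any bit is set in the last 7 or 28 days, using bits28.active_in_range (described below); the date filter is illustrative:

SELECT\n  submission_date,\n  COUNTIF(mozfun.bits28.active_in_range(days_seen_bits, -6, 7)) AS wau,\n  COUNTIF(mozfun.bits28.active_in_range(days_seen_bits, -27, 28)) AS mau\nFROM\n  mozdata.telemetry.clients_last_seen\nWHERE\n  submission_date = '2020-01-28'\nGROUP BY\n  submission_date\n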

"},{"location":"mozfun/bits28/#active_in_range-udf","title":"active_in_range (UDF)","text":"

Return a boolean indicating if any bits are set in the specified range of a bit pattern. The start_offset must be zero or a negative number indicating an offset from the rightmost bit in the pattern. n_bits is the number of bits to consider, counting right from the bit at start_offset.

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference

"},{"location":"mozfun/bits28/#parameters","title":"Parameters","text":"

INPUTS

bits INT64, start_offset INT64, n_bits INT64\n

OUTPUTS

BOOLEAN\n

Source | Edit

"},{"location":"mozfun/bits28/#days_since_seen-udf","title":"days_since_seen (UDF)","text":"

Return the position of the rightmost set bit in an INT64 bit pattern.

To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference

SELECT\n  mozfun.bits28.days_since_seen(18)\n-- >> 1\n
"},{"location":"mozfun/bits28/#parameters_1","title":"Parameters","text":"

INPUTS

bits INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/bits28/#from_string-udf","title":"from_string (UDF)","text":"

Convert a string representing individual bits into an INT64.

Implementation based on https://stackoverflow.com/a/51600210/1260237

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
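A minimal sketch (the inverse of bits28.to_string):

SELECT\n  mozfun.bits28.from_string('0000000000000000000000000011')\n-- expected: 3\n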

"},{"location":"mozfun/bits28/#parameters_2","title":"Parameters","text":"

INPUTS

s STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/bits28/#range-udf","title":"range (UDF)","text":"

Return an INT64 representing a range of bits from a source bit pattern.

The start_offset must be zero or a negative number indicating an offset from the rightmost bit in the pattern.

n_bits is the number of bits to consider, counting right from the bit at start_offset.

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference

SELECT\n  -- Signature is bits28.range(offset_to_day_0, start_bit, number_of_bits)\n  mozfun.bits28.range(days_seen_bits, -13 + 0, 7) AS week_0_bits,\n  mozfun.bits28.range(days_seen_bits, -13 + 7, 7) AS week_1_bits\nFROM\n  `mozdata.telemetry.clients_last_seen`\nWHERE\n  submission_date > '2020-01-01'\n
"},{"location":"mozfun/bits28/#parameters_3","title":"Parameters","text":"

INPUTS

bits INT64, start_offset INT64, n_bits INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/bits28/#retention-udf","title":"retention (UDF)","text":"

Return a nested struct providing booleans indicating whether a given client was active various time periods based on the passed bit pattern.

"},{"location":"mozfun/bits28/#parameters_4","title":"Parameters","text":"

INPUTS

bits INT64, submission_date DATE\n

Source | Edit

"},{"location":"mozfun/bits28/#to_dates-udf","title":"to_dates (UDF)","text":"

Convert a bit pattern into an array of the dates it represents.

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
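
A usage sketch: the integer 18 has bits set 1 and 4 days before the given submission date, so the result should contain those two dates (exact output ordering is not asserted here):

SELECT\n  mozfun.bits28.to_dates(18, DATE '2020-01-28')\n-- >> e.g. [2020-01-24, 2020-01-27]\n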

"},{"location":"mozfun/bits28/#parameters_5","title":"Parameters","text":"

INPUTS

bits INT64, submission_date DATE\n

OUTPUTS

ARRAY<DATE>\n

Source | Edit

"},{"location":"mozfun/bits28/#to_string-udf","title":"to_string (UDF)","text":"

Convert an INT64 field into a 28-character string representing the individual bits.

Implementation based on https://stackoverflow.com/a/51600210/1260237

See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference

SELECT\n  [mozfun.bits28.to_string(1), mozfun.bits28.to_string(2), mozfun.bits28.to_string(3)]\n-- >>> ['0000000000000000000000000001',\n--      '0000000000000000000000000010',\n--      '0000000000000000000000000011']\n
"},{"location":"mozfun/bits28/#parameters_6","title":"Parameters","text":"

INPUTS

bits INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/bytes/","title":"bytes","text":""},{"location":"mozfun/bytes/#bit_pos_to_byte_pos-udf","title":"bit_pos_to_byte_pos (UDF)","text":"

Given a bit position, get the byte that bit appears in. 1-indexed (to match substr), and accepts negative values.

"},{"location":"mozfun/bytes/#parameters","title":"Parameters","text":"

INPUTS

bit_pos INT64\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/bytes/#extract_bits-udf","title":"extract_bits (UDF)","text":"

Extract bits from a byte array. Roughly matches substr with three arguments: b: bytes - the byte string to extract from; start: int - the position of the first bit to extract (can be negative to start from the end of the byte array; one-indexed, like substring); length: int - the number of bits to extract.

The return byte array will have CEIL(length/8) bytes. The bits of interest will start at the beginning of the byte string. In other words, the byte array will have trailing 0s for any non-relevant fields.

Examples: bytes.extract_bits(b'\\x0F\\xF0', 5, 8) = b'\\xFF'; bytes.extract_bits(b'\\x0C\\xC0', -12, 8) = b'\\xCC'

"},{"location":"mozfun/bytes/#parameters_1","title":"Parameters","text":"

INPUTS

b BYTES, `begin` INT64, length INT64\n

OUTPUTS

BYTES\n

Source | Edit

"},{"location":"mozfun/bytes/#zero_right-udf","title":"zero_right (UDF)","text":"

Zero bits on the right of byte

"},{"location":"mozfun/bytes/#parameters_2","title":"Parameters","text":"

INPUTS

b BYTES, length INT64\n

OUTPUTS

BYTES\n

Source | Edit

"},{"location":"mozfun/event_analysis/","title":"event_analysis","text":"

These functions are specific for use with the events_daily and event_types tables. By themselves, these two tables are nearly impossible to use since the event history is compressed; however, these stored procedures should make the data accessible.

The events_daily table is created as a result of two steps: 1. Map each event to a single UTF8 char which will represent it 2. Group each client-day and store a string that records, using the compressed format, that client's event history for that day. The characters are ordered by the timestamp at which they appeared that day.

The best way to access this data is to create a view to do the heavy lifting. For example, to see which clients completed a certain action, you can create a view using these functions that knows what that action's representation is (using the compressed mapping from 1.) and create a regex string that checks for the presence of that event. The view makes this transparent, and allows users to simply query a boolean field representing the presence of that event on that day.

"},{"location":"mozfun/event_analysis/#aggregate_match_strings-udf","title":"aggregate_match_strings (UDF)","text":"

Given an array of strings that each match a single event, aggregate those into a single regex string that will match any of the events.

"},{"location":"mozfun/event_analysis/#parameters","title":"Parameters","text":"

INPUTS

match_strings ARRAY<STRING>\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#create_count_steps_query-stored-procedure","title":"create_count_steps_query (Stored Procedure)","text":"

Generate the SQL statement that can be used to create an easily queryable view on events data.

"},{"location":"mozfun/event_analysis/#parameters_1","title":"Parameters","text":"

INPUTS

project STRING, dataset STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n

OUTPUTS

sql STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#create_events_view-stored-procedure","title":"create_events_view (Stored Procedure)","text":"

Create a view that queries the events_daily table. This view currently supports both funnels and event counts. Funnels are created as a struct, with each step in the funnel as a boolean column in the struct, indicating whether the user completed that step on that day. Event counts are simply integers.

"},{"location":"mozfun/event_analysis/#usage","title":"Usage","text":"
create_events_view(\n    view_name STRING,\n    project STRING,\n    dataset STRING,\n    funnels ARRAY<STRUCT<\n        funnel_name STRING,\n        funnel ARRAY<STRUCT<\n            step_name STRING,\n            events ARRAY<STRUCT<\n                category STRING,\n                event_name STRING>>>>>>,\n    counts ARRAY<STRUCT<\n        count_name STRING,\n        events ARRAY<STRUCT<\n            category STRING,\n            event_name STRING>>>>\n  )\n
"},{"location":"mozfun/event_analysis/#recommended-pattern","title":"Recommended Pattern","text":"

Because the view definitions themselves are not informative about the contents of the events fields, it is best to put your query immediately after the procedure invocation, rather than invoking the procedure and running a separate query.

This STMO query is an example of doing so. This allows viewers of the query to easily interpret what the funnel and count columns represent.

"},{"location":"mozfun/event_analysis/#structure-of-the-resulting-view","title":"Structure of the Resulting View","text":"

The view will be created at

`moz-fx-data-shared-prod`.analysis.{event_name}.\n

The view will have a schema roughly matching the following:

root\n |-- submission_date: date\n |-- client_id: string\n |-- {funnel_1_name}: record\n |  |-- {funnel_step_1_name} boolean\n |  |-- {funnel_step_2_name} boolean\n ...\n |-- {funnel_N_name}: record\n |  |-- {funnel_step_M_name}: boolean\n |-- {count_1_name}: integer\n ...\n |-- {count_N_name}: integer\n ...dimensions...\n

"},{"location":"mozfun/event_analysis/#funnels","title":"Funnels","text":"

Each funnel will be a STRUCT with nested columns representing completion of each step. The types of those columns are boolean, and represent whether the user completed that step on that day.

STRUCT(\n    completed_step_1 BOOLEAN,\n    completed_step_2 BOOLEAN,\n    ...\n) AS funnel_name\n

With one row per-user per-day, you can use COUNTIF(funnel_name.completed_step_N) to query these fields. See below for an example.

"},{"location":"mozfun/event_analysis/#event-counts","title":"Event Counts","text":"

Each event count is simply an INT64 representing the number of times the user completed those events on that day. If there are multiple events represented within one count, the values are summed. For example, if you wanted to know the number of times a user opened or closed the app, you could create a single event count with those two events.

event_count_name INT64\n
"},{"location":"mozfun/event_analysis/#examples","title":"Examples","text":"

The following creates a few fields: - collection_flow is a funnel for those who started creating a collection within Fenix, and then finished, either by adding those tabs to an existing collection or saving it as a new collection. - collection_flow_saved represents users who started the collection flow then saved it as a new collection. - number_of_collections_created is the number of collections created. - number_of_collections_deleted is the number of collections deleted.

CALL mozfun.event_analysis.create_events_view(\n  'fenix_collection_funnels',\n  'moz-fx-data-shared-prod',\n  'org_mozilla_firefox',\n\n  -- Funnels\n  [\n    STRUCT(\n      \"collection_flow\" AS funnel_name,\n      [STRUCT(\n        \"started_collection_creation\" AS step_name,\n        [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n      STRUCT(\n        \"completed_collection_creation\" AS step_name,\n        [STRUCT('collections' AS category, 'saved' AS event_name),\n        STRUCT('collections' AS category, 'tabs_added' AS event_name)] AS events)\n    ] AS funnel),\n\n    STRUCT(\n      \"collection_flow_saved\" AS funnel_name,\n      [STRUCT(\n        \"started_collection_creation\" AS step_name,\n        [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n      STRUCT(\n        \"saved_collection\" AS step_name,\n        [STRUCT('collections' AS category, 'saved' AS event_name)] AS events)\n    ] AS funnel)\n  ],\n\n  -- Event Counts\n  [\n    STRUCT(\n      \"number_of_collections_created\" AS count_name,\n      [STRUCT('collections' AS category, 'saved' AS event_name)] AS events\n    ),\n    STRUCT(\n      \"number_of_collections_deleted\" AS count_name,\n      [STRUCT('collections' AS category, 'removed' AS event_name)] AS events\n    )\n  ]\n);\n

From there, you can query a few things. For example, the fraction of users who completed each step of the collection flow over time:

SELECT\n    submission_date,\n    COUNTIF(collection_flow.started_collection_creation) / COUNT(*) AS started_collection_creation,\n    COUNTIF(collection_flow.completed_collection_creation) / COUNT(*) AS completed_collection_creation,\nFROM\n    `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n    submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n    submission_date\n

Or you can see the number of collections created and deleted:

SELECT\n    submission_date,\n    SUM(number_of_collections_created) AS number_of_collections_created,\n    SUM(number_of_collections_deleted) AS number_of_collections_deleted,\nFROM\n    `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n    submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n    submission_date\n

"},{"location":"mozfun/event_analysis/#parameters_2","title":"Parameters","text":"

INPUTS

view_name STRING, project STRING, dataset STRING, funnels ARRAY<STRUCT<funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>>>, counts ARRAY<STRUCT<count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>\n

Source | Edit

"},{"location":"mozfun/event_analysis/#create_funnel_regex-udf","title":"create_funnel_regex (UDF)","text":"

Given an array of match strings, each representing a single funnel step, aggregate them into a regex string that will match only against the entire funnel. If intermediate_steps is TRUE, this allows for there to be events that occur between the funnel steps.

"},{"location":"mozfun/event_analysis/#parameters_3","title":"Parameters","text":"

INPUTS

step_regexes ARRAY<STRING>, intermediate_steps BOOLEAN\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#create_funnel_steps_query-stored-procedure","title":"create_funnel_steps_query (Stored Procedure)","text":"

Generate the SQL statement that can be used to create an easily queryable view on events data.

"},{"location":"mozfun/event_analysis/#parameters_4","title":"Parameters","text":"

INPUTS

project STRING, dataset STRING, funnel ARRAY<STRUCT<list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n

OUTPUTS

sql STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#escape_metachars-udf","title":"escape_metachars (UDF)","text":"

Escape all metachars from a regex string. This will make the string an exact match, no matter what it contains.

"},{"location":"mozfun/event_analysis/#parameters_5","title":"Parameters","text":"

INPUTS

s STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#event_index_to_match_string-udf","title":"event_index_to_match_string (UDF)","text":"

Given an event index string, create a match string that is an exact match in the events_daily table.

"},{"location":"mozfun/event_analysis/#parameters_6","title":"Parameters","text":"

INPUTS

index STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#event_property_index_to_match_string-udf","title":"event_property_index_to_match_string (UDF)","text":"

Given an event index and property index from an event_types table, returns a regular expression to match corresponding events within an events_daily table's events string that aren't missing the specified property.

"},{"location":"mozfun/event_analysis/#parameters_7","title":"Parameters","text":"

INPUTS

event_index STRING, property_index INTEGER\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#event_property_value_to_match_string-udf","title":"event_property_value_to_match_string (UDF)","text":"

Given an event index, property index, and property value from an event_types table, returns a regular expression to match corresponding events within an events_daily table's events string.

"},{"location":"mozfun/event_analysis/#parameters_8","title":"Parameters","text":"

INPUTS

event_index STRING, property_index INTEGER, property_value STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#extract_event_counts-udf","title":"extract_event_counts (UDF)","text":"

Extract the events and their counts from an events string. This function explicitly ignores event properties, and retrieves just the counts of the top-level events.

"},{"location":"mozfun/event_analysis/#usage_1","title":"Usage","text":"
extract_event_counts(\n    events STRING\n)\n

events - A comma-separated events string, where each event is represented as a string of unicode chars.

"},{"location":"mozfun/event_analysis/#example","title":"Example","text":"

See this dashboard for example usage.

"},{"location":"mozfun/event_analysis/#parameters_9","title":"Parameters","text":"

INPUTS

events STRING\n

OUTPUTS

ARRAY<STRUCT<index STRING, count INT64>>\n

Source | Edit

"},{"location":"mozfun/event_analysis/#extract_event_counts_with_properties-udf","title":"extract_event_counts_with_properties (UDF)","text":"

Extract events with event properties and their associated counts. Also extracts raw events and their counts. This allows for querying with and without properties in the same dashboard.

"},{"location":"mozfun/event_analysis/#usage_2","title":"Usage","text":"
extract_event_counts_with_properties(\n    events STRING\n)\n

events - A comma-separated events string, where each event is represented as a string of unicode chars.

"},{"location":"mozfun/event_analysis/#example_1","title":"Example","text":"

See this query for example usage.

"},{"location":"mozfun/event_analysis/#caveats","title":"Caveats","text":"

This function extracts both counts for events with each property, and for all events without their properties.

This allows us to include both total counts for an event (with any property value), and events that don't have properties.

"},{"location":"mozfun/event_analysis/#parameters_10","title":"Parameters","text":"

INPUTS

events STRING\n

OUTPUTS

ARRAY<STRUCT<event_index STRING, property_index INT64, property_value_index STRING, count INT64>>\n

Source | Edit

"},{"location":"mozfun/event_analysis/#get_count_sql-stored-procedure","title":"get_count_sql (Stored Procedure)","text":"

For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.

"},{"location":"mozfun/event_analysis/#parameters_11","title":"Parameters","text":"

INPUTS

project STRING, dataset STRING, count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n

OUTPUTS

count_sql STRING\n

Source | Edit

"},{"location":"mozfun/event_analysis/#get_funnel_steps_sql-stored-procedure","title":"get_funnel_steps_sql (Stored Procedure)","text":"

For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.

"},{"location":"mozfun/event_analysis/#parameters_12","title":"Parameters","text":"

INPUTS

project STRING, dataset STRING, funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n

OUTPUTS

funnel_sql STRING\n

Source | Edit

"},{"location":"mozfun/ga/","title":"Ga","text":""},{"location":"mozfun/ga/#nullify_string-udf","title":"nullify_string (UDF)","text":"

Nullify a GA string, which sometimes comes in as \"(not set)\" or simply \"\".

UDF for handling empty Google Analytics data.
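
A minimal sketch of the expected behavior (handling of other placeholder values is not asserted here):

SELECT\n  mozfun.ga.nullify_string('(not set)') AS not_set_becomes_null,\n  mozfun.ga.nullify_string('') AS empty_becomes_null,\n  mozfun.ga.nullify_string('organic') AS unchanged\n-- >> NULL, NULL, 'organic'\n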

"},{"location":"mozfun/ga/#parameters","title":"Parameters","text":"

INPUTS

s STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/glam/","title":"Glam","text":""},{"location":"mozfun/glam/#build_hour_to_datetime-udf","title":"build_hour_to_datetime (UDF)","text":"

Parses the custom build id used for Fenix builds in GLAM to a datetime.

"},{"location":"mozfun/glam/#parameters","title":"Parameters","text":"

INPUTS

build_hour STRING\n

OUTPUTS

DATETIME\n

Source | Edit

"},{"location":"mozfun/glam/#build_seconds_to_hour-udf","title":"build_seconds_to_hour (UDF)","text":"

Returns a custom build id generated from the build seconds of a FOG build.

"},{"location":"mozfun/glam/#parameters_1","title":"Parameters","text":"

INPUTS

build_hour STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/glam/#fenix_build_to_build_hour-udf","title":"fenix_build_to_build_hour (UDF)","text":"

Returns a custom build id generated from the build hour of a Fenix build.

"},{"location":"mozfun/glam/#parameters_2","title":"Parameters","text":"

INPUTS

app_build_id STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_bucket_from_value-udf","title":"histogram_bucket_from_value (UDF)","text":""},{"location":"mozfun/glam/#parameters_3","title":"Parameters","text":"

INPUTS

buckets ARRAY<STRING>, val FLOAT64\n

OUTPUTS

FLOAT64\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_buckets_cast_string_array-udf","title":"histogram_buckets_cast_string_array (UDF)","text":"

Cast histogram buckets into a string array.

"},{"location":"mozfun/glam/#parameters_4","title":"Parameters","text":"

INPUTS

buckets ARRAY<INT64>\n

OUTPUTS

ARRAY<STRING>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_cast_json-udf","title":"histogram_cast_json (UDF)","text":"

Cast a histogram into a JSON blob.

"},{"location":"mozfun/glam/#parameters_5","title":"Parameters","text":"

INPUTS

histogram ARRAY<STRUCT<key STRING, value FLOAT64>>\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_cast_struct-udf","title":"histogram_cast_struct (UDF)","text":"

Cast a String-based JSON histogram to an Array of Structs

"},{"location":"mozfun/glam/#parameters_6","title":"Parameters","text":"

INPUTS

json_str STRING\n

OUTPUTS

ARRAY<STRUCT<KEY STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_fill_buckets-udf","title":"histogram_fill_buckets (UDF)","text":"

Interpolate missing histogram buckets with empty buckets.

"},{"location":"mozfun/glam/#parameters_7","title":"Parameters","text":"

INPUTS

input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_fill_buckets_dirichlet-udf","title":"histogram_fill_buckets_dirichlet (UDF)","text":"

Interpolate missing histogram buckets with empty buckets so it becomes a valid estimator for the dirichlet distribution.

See: https://docs.google.com/document/d/1ipy1oFIKDvHr3R6Ku0goRjS11R1ZH1z2gygOGkSdqUg

To use this, you must first: Aggregate the histograms to the client level, to get a histogram {k1: p1, k2: p2, ..., kK: pK} where the p's are proportions (p1, p2, ... sum to 1) and K is the number of buckets.

This is then the client's estimated density, and every client has been reduced to one row (i.e. the client's histograms are reduced to this single one and normalized).

Then add all of these across clients to get {k1: P1, k2: P2, ..., kK: PK} where P1 = sum(p1 across N clients) and P2 = sum(p2 across N clients).

Calculate the total number of buckets K, as well as the total number of profiles N reporting.

Then our estimate for the final density is: {k1: (P1 + 1/K) / (N + 1), k2: (P2 + 1/K) / (N + 1), ...}

"},{"location":"mozfun/glam/#parameters_8","title":"Parameters","text":"

INPUTS

input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>, total_users INT64\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_filter_high_values-udf","title":"histogram_filter_high_values (UDF)","text":"

Prevent overflows by only keeping buckets where the value is less than 2^40, allowing 2^24 entries. This value was chosen somewhat arbitrarily; typically the max histogram value is somewhere on the order of ~20 bits. Negative values are incorrect and should not happen, but were observed, probably due to some bit flips.

"},{"location":"mozfun/glam/#parameters_9","title":"Parameters","text":"

INPUTS

aggs ARRAY<STRUCT<key STRING, value INT64>>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value INT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_from_buckets_uniform-udf","title":"histogram_from_buckets_uniform (UDF)","text":"

Create an empty histogram from an array of buckets.

"},{"location":"mozfun/glam/#parameters_10","title":"Parameters","text":"

INPUTS

buckets ARRAY<STRING>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_generate_exponential_buckets-udf","title":"histogram_generate_exponential_buckets (UDF)","text":"

Generate exponential buckets for a histogram.

"},{"location":"mozfun/glam/#parameters_11","title":"Parameters","text":"

INPUTS

min FLOAT64, max FLOAT64, nBuckets FLOAT64\n

OUTPUTS

ARRAY<FLOAT64>DETERMINISTIC\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_generate_functional_buckets-udf","title":"histogram_generate_functional_buckets (UDF)","text":"

Generate functional buckets for a histogram. This is specific to Glean.

See: https://github.com/mozilla/glean/blob/main/glean-core/src/histogram/functional.rs

A functional bucketing algorithm. The bucket index of a given sample is determined with the following function:

$$ i = \\lfloor n \\log_{\\text{base}}(x) \\rfloor $$

In other words, there are n buckets for each power of base magnitude.

"},{"location":"mozfun/glam/#parameters_12","title":"Parameters","text":"

INPUTS

log_base INT64, buckets_per_magnitude INT64, range_max INT64\n

OUTPUTS

ARRAY<FLOAT64>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_generate_linear_buckets-udf","title":"histogram_generate_linear_buckets (UDF)","text":"

Generate linear buckets for a histogram.

"},{"location":"mozfun/glam/#parameters_13","title":"Parameters","text":"

INPUTS

min FLOAT64, max FLOAT64, nBuckets FLOAT64\n

OUTPUTS

ARRAY<FLOAT64>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_generate_scalar_buckets-udf","title":"histogram_generate_scalar_buckets (UDF)","text":"

Generate scalar buckets for a histogram using a fixed number of buckets.

"},{"location":"mozfun/glam/#parameters_14","title":"Parameters","text":"

INPUTS

min_bucket FLOAT64, max_bucket FLOAT64, num_buckets INT64\n

OUTPUTS

ARRAY<FLOAT64>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_normalized_sum-udf","title":"histogram_normalized_sum (UDF)","text":"

Compute the normalized sum of an array of histograms.

"},{"location":"mozfun/glam/#parameters_15","title":"Parameters","text":"

INPUTS

arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#histogram_normalized_sum_with_original-udf","title":"histogram_normalized_sum_with_original (UDF)","text":"

Compute the normalized and the non-normalized sum of an array of histograms.

"},{"location":"mozfun/glam/#parameters_16","title":"Parameters","text":"

INPUTS

arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64, non_norm_value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#map_from_array_offsets-udf","title":"map_from_array_offsets (UDF)","text":""},{"location":"mozfun/glam/#parameters_17","title":"Parameters","text":"

INPUTS

required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#map_from_array_offsets_precise-udf","title":"map_from_array_offsets_precise (UDF)","text":""},{"location":"mozfun/glam/#parameters_18","title":"Parameters","text":"

INPUTS

required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value FLOAT64>>\n

Source | Edit

"},{"location":"mozfun/glam/#percentile-udf","title":"percentile (UDF)","text":"

Get the value of the approximate CDF at the given percentile.

"},{"location":"mozfun/glam/#parameters_19","title":"Parameters","text":"

INPUTS

pct FLOAT64, histogram ARRAY<STRUCT<key STRING, value FLOAT64>>, type STRING\n

OUTPUTS

FLOAT64\n

Source | Edit

"},{"location":"mozfun/glean/","title":"glean","text":"

Functions for working with Glean data.

"},{"location":"mozfun/glean/#legacy_compatible_experiments-udf","title":"legacy_compatible_experiments (UDF)","text":"

Formats a Glean experiments field into a Legacy Telemetry experiments field by dropping the extra information that Glean collects.

This UDF transforms the ping_info.experiments field from Glean pings into the format for experiments used by Legacy Telemetry pings. In particular, it drops the extra information that Glean pings collect.

If you need to combine Glean data with Legacy Telemetry data, then you can use this UDF to transform a Glean experiments field into the structure of a Legacy Telemetry one.

"},{"location":"mozfun/glean/#parameters","title":"Parameters","text":"

INPUTS

ping_info__experiments ARRAY<STRUCT<key STRING, value STRUCT<branch STRING, extra STRUCT<type STRING, enrollment_id STRING>>>>\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/glean/#parse_datetime-udf","title":"parse_datetime (UDF)","text":"

Parses a Glean datetime metric string value as a BigQuery timestamp.

See https://mozilla.github.io/glean/book/reference/metrics/datetime.html
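
A usage sketch with an ISO 8601 string of the kind Glean records (the input value is illustrative; the result is expected as a UTC timestamp):

SELECT\n  mozfun.glean.parse_datetime('2021-12-01T14:30:00.000-05:00')\n-- >> 2021-12-01 19:30:00 UTC\n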

"},{"location":"mozfun/glean/#parameters_1","title":"Parameters","text":"

INPUTS

datetime_string STRING\n

OUTPUTS

TIMESTAMP\n

Source | Edit

"},{"location":"mozfun/glean/#timespan_nanos-udf","title":"timespan_nanos (UDF)","text":"

Returns the number of nanoseconds represented by a Glean timespan struct.

See https://mozilla.github.io/glean/book/user/metrics/timespan.html

"},{"location":"mozfun/glean/#parameters_2","title":"Parameters","text":"

INPUTS

timespan STRUCT<time_unit STRING, value INT64>\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/glean/#timespan_seconds-udf","title":"timespan_seconds (UDF)","text":"

Returns the number of seconds represented by a Glean timespan struct, rounded down to full seconds.

See https://mozilla.github.io/glean/book/user/metrics/timespan.html
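
A usage sketch, assuming 'millisecond' is one of the valid Glean time_unit values; the result is rounded down to full seconds:

SELECT\n  mozfun.glean.timespan_seconds(STRUCT('millisecond' AS time_unit, 2500 AS value))\n-- >> 2\n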

"},{"location":"mozfun/glean/#parameters_3","title":"Parameters","text":"

INPUTS

timespan STRUCT<time_unit STRING, value INT64>\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/google_ads/","title":"Google ads","text":""},{"location":"mozfun/google_ads/#extract_segments_from_campaign_name-udf","title":"extract_segments_from_campaign_name (UDF)","text":"

Extract Segments from a campaign name. Includes region, country_code, and language.

"},{"location":"mozfun/google_ads/#parameters","title":"Parameters","text":"

INPUTS

campaign_name STRING\n

OUTPUTS

STRUCT<campaign_region STRING, campaign_country_code STRING, campaign_language STRING>\n

Source | Edit

"},{"location":"mozfun/google_search_console/","title":"google_search_console","text":"

Functions for use with Google Search Console data.

"},{"location":"mozfun/google_search_console/#classify_site_query-udf","title":"classify_site_query (UDF)","text":"

Classify a Google search query for a site as \"Anonymized\", \"Firefox Brand\", \"Pocket Brand\", \"Mozilla Brand\", or \"Non-Brand\".

"},{"location":"mozfun/google_search_console/#parameters","title":"Parameters","text":"

INPUTS

site_domain_name STRING, query STRING, search_type STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_country_code-udf","title":"extract_url_country_code (UDF)","text":"

Extract the country code from a URL if it's present.

"},{"location":"mozfun/google_search_console/#parameters_1","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_domain_name-udf","title":"extract_url_domain_name (UDF)","text":"

Extract the domain name from a URL.

"},{"location":"mozfun/google_search_console/#parameters_2","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_language_code-udf","title":"extract_url_language_code (UDF)","text":"

Extract the language code from a URL if it's present.

"},{"location":"mozfun/google_search_console/#parameters_3","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_locale-udf","title":"extract_url_locale (UDF)","text":"

Extract the locale from a URL if it's present.

"},{"location":"mozfun/google_search_console/#parameters_4","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_path-udf","title":"extract_url_path (UDF)","text":"

Extract the path from a URL.

"},{"location":"mozfun/google_search_console/#parameters_5","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/google_search_console/#extract_url_path_segment-udf","title":"extract_url_path_segment (UDF)","text":"

Extract a particular path segment from a URL.

"},{"location":"mozfun/google_search_console/#parameters_6","title":"Parameters","text":"

INPUTS

url STRING, segment_number INTEGER\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/hist/","title":"hist","text":"

Functions for working with string encodings of histograms from desktop telemetry.

"},{"location":"mozfun/hist/#count-udf","title":"count (UDF)","text":"

Given histogram h, return the count of all measurements across all buckets.

Extracts the values from the histogram and sums them, returning the total_count.

"},{"location":"mozfun/hist/#parameters","title":"Parameters","text":"

INPUTS

histogram STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#extract-udf","title":"extract (UDF)","text":"

Return a parsed struct from a string-encoded histogram.

We support a variety of compact encodings as well as the classic JSON representation as sent in main pings.

The built-in BigQuery JSON parsing functions are not powerful enough to handle all the logic here, so we resort to some string processing. This function could behave unexpectedly on poorly-formatted histogram JSON, but we expect that payload validation in the data pipeline should ensure that histograms are well formed, which gives us some flexibility.

For more on desktop telemetry histogram structure, see:

The compact encodings were originally proposed in:

SELECT\n  mozfun.hist.extract(\n    '{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}'\n  ).sum\n-- 1\n
SELECT\n  mozfun.hist.extract('5').sum\n-- 5\n
"},{"location":"mozfun/hist/#parameters_1","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"

Extract a histogram sum from a JSON str representation
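
For example, using the same JSON histogram string as in the hist.extract example above (its sum field is 1):

SELECT\n  mozfun.hist.extract_histogram_sum(\n    '{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}'\n  )\n-- 1\n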

"},{"location":"mozfun/hist/#parameters_2","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#extract_keyed_hist_sum-udf","title":"extract_keyed_hist_sum (UDF)","text":"

Sum of a keyed histogram, across all keys it contains.

"},{"location":"mozfun/hist/#extract-keyed-histogram-sum","title":"Extract Keyed Histogram Sum","text":"

Takes a keyed histogram and returns a single number: the sum across all keys it contains. The expected input type is ARRAY<STRUCT<key STRING, value STRING>>.

The return type is INT64.

The key field will be ignored, and the value field is expected to be the compact histogram representation.

"},{"location":"mozfun/hist/#parameters_3","title":"Parameters","text":"

INPUTS

keyed_histogram ARRAY<STRUCT<key STRING, value STRING>>\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#mean-udf","title":"mean (UDF)","text":"

Given histogram h, return floor(mean) of the measurements in the bucket. That is, the histogram sum divided by the number of measurements taken.

https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L292-L307
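
A usage sketch, assuming the struct returned by hist.extract is an accepted input; the example histogram has sum 1 and a single recorded measurement, so the floor of the mean is 1:

SELECT\n  mozfun.hist.mean(\n    mozfun.hist.extract('{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}')\n  )\n-- 1\n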

"},{"location":"mozfun/hist/#parameters_4","title":"Parameters","text":"

INPUTS

histogram ANY TYPE\n

OUTPUTS

STRUCT<sum INT64, VALUES ARRAY<STRUCT<value INT64>>>\n

Source | Edit

"},{"location":"mozfun/hist/#merge-udf","title":"merge (UDF)","text":"

Merge an array of histograms into a single histogram.

"},{"location":"mozfun/hist/#parameters_5","title":"Parameters","text":"

INPUTS

histogram_list ANY TYPE\n

Source | Edit

"},{"location":"mozfun/hist/#normalize-udf","title":"normalize (UDF)","text":"

Normalize a histogram. Set the sum to 1, and normalize the histogram bucket counts so that they sum to 1.

"},{"location":"mozfun/hist/#parameters_6","title":"Parameters","text":"

INPUTS

histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n

OUTPUTS

STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n

Source | Edit

"},{"location":"mozfun/hist/#percentiles-udf","title":"percentiles (UDF)","text":"

Given a histogram and a list of percentiles, calculate what those percentiles are for the histogram. If the histogram is empty, returns NULL.
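
A usage sketch, assuming a parsed histogram such as the output of hist.extract is an accepted input; the result is an array of percentile/value structs whose values depend on the histogram contents:

SELECT\n  mozfun.hist.percentiles(\n    mozfun.hist.extract('{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}'),\n    [0.5, 0.95]\n  )\n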

"},{"location":"mozfun/hist/#parameters_7","title":"Parameters","text":"

INPUTS

histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n

OUTPUTS

ARRAY<STRUCT<percentile FLOAT64, value INT64>>\n

Source | Edit

"},{"location":"mozfun/hist/#string_to_json-udf","title":"string_to_json (UDF)","text":"

Convert a histogram string (in JSON or compact format) to a full histogram JSON blob.

"},{"location":"mozfun/hist/#parameters_8","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/hist/#threshold_count-udf","title":"threshold_count (UDF)","text":"

Return the number of recorded observations greater than the threshold for the histogram. CAUTION: Does not count any buckets that have any values less than the threshold. For example, a bucket with range (1, 10) will not be counted for a threshold of 2. Use thresholds that are not bucket boundaries with caution.

https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L213-L239

"},{"location":"mozfun/hist/#parameters_9","title":"Parameters","text":"

INPUTS

histogram STRING, threshold INT64\n

Source | Edit

"},{"location":"mozfun/iap/","title":"iap","text":""},{"location":"mozfun/iap/#derive_apple_subscription_interval-udf","title":"derive_apple_subscription_interval (UDF)","text":"

Take output purchase_date and expires_date from mozfun.iap.parse_apple_receipt and return the subscription interval to use for accounting. Values must be DATETIME in America/Los_Angeles to get correct results because of how timezone and daylight savings impact the time of day and the length of a month.

"},{"location":"mozfun/iap/#parameters","title":"Parameters","text":"

INPUTS

start DATETIME, `end` DATETIME\n

OUTPUTS

STRUCT<`interval` STRING, interval_count INT64>\n

Source | Edit

"},{"location":"mozfun/iap/#parse_android_receipt-udf","title":"parse_android_receipt (UDF)","text":"

Used to parse data field from firestore export of fxa dataset iap_google_raw. The content is documented at https://developer.android.com/google/play/billing/subscriptions and https://developers.google.com/android-publisher/api-ref/rest/v3/purchases.subscriptions

"},{"location":"mozfun/iap/#parameters_1","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"mozfun/iap/#parse_apple_event-udf","title":"parse_apple_event (UDF)","text":"

Used to parse data field from firestore export of fxa dataset iap_app_store_purchases_raw. The content is documented at https://developer.apple.com/documentation/appstoreservernotifications/responsebodyv2decodedpayload and https://github.com/mozilla/fxa/blob/700ed771860da450add97d62f7e6faf2ead0c6ba/packages/fxa-shared/payments/iap/apple-app-store/subscription-purchase.ts#L115-L171

"},{"location":"mozfun/iap/#parameters_2","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"mozfun/iap/#parse_apple_receipt-udf","title":"parse_apple_receipt (UDF)","text":"

Used to parse provider_receipt_json in mozilla vpn subscriptions where provider is \"APPLE\". The content is documented at https://developer.apple.com/documentation/appstorereceipts/responsebody

"},{"location":"mozfun/iap/#parameters_3","title":"Parameters","text":"

INPUTS

provider_receipt_json STRING\n

OUTPUTS

STRUCT<environment STRING, latest_receipt BYTES, latest_receipt_info ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, pending_renewal_info ARRAY<STRUCT<auto_renew_product_id STRING, auto_renew_status INT64, expiration_intent INT64, is_in_billing_retry_period INT64, original_transaction_id STRING, product_id STRING>>, receipt STRUCT<adam_id INT64, app_item_id INT64, application_version STRING, bundle_id STRING, download_id INT64, in_app ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, original_application_version STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, receipt_creation_date STRING, receipt_creation_date_ms INT64, receipt_creation_date_pst STRING, receipt_type STRING, request_date STRING, request_date_ms INT64, request_date_pst STRING, version_external_identifier INT64>, status INT64>DETERMINISTIC\n

Source | Edit

"},{"location":"mozfun/iap/#scrub_apple_receipt-udf","title":"scrub_apple_receipt (UDF)","text":"

Take output from mozfun.iap.parse_apple_receipt and remove fields or reduce their granularity so that the returned value can be exposed to all employees via redash.

"},{"location":"mozfun/iap/#parameters_4","title":"Parameters","text":"

INPUTS

apple_receipt ANY TYPE\n

OUTPUTS

STRUCT<environment STRING, active_period STRUCT<start_date DATE, end_date DATE, start_time TIMESTAMP, end_time TIMESTAMP, `interval` STRING, interval_count INT64>, trial_period STRUCT<start_time TIMESTAMP, end_time TIMESTAMP>>\n

Source | Edit

"},{"location":"mozfun/json/","title":"json","text":"

Functions for parsing Mozilla-specific JSON data types.

"},{"location":"mozfun/json/#extract_int_map-udf","title":"extract_int_map (UDF)","text":"

Returns an array of key/value structs from a string representing a JSON map. Both keys and values are cast to integers.

This is the format for the \"values\" field in the desktop telemetry histogram JSON representation.
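
A minimal sketch using a histogram-style \"values\" map; the output follows the key/value struct convention used throughout mozfun:

SELECT\n  mozfun.json.extract_int_map('{\"0\":1,\"5\":2,\"10\":0}')\n-- >> [{key: 0, value: 1}, {key: 5, value: 2}, {key: 10, value: 0}]\n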

"},{"location":"mozfun/json/#parameters","title":"Parameters","text":"

INPUTS

input STRING\n

Source | Edit

"},{"location":"mozfun/json/#from_map-udf","title":"from_map (UDF)","text":"

Converts a standard \"map\"-like data structure array<struct<key, value>> into a JSON value.

Convert the standard Array<Struct<key, value>> style maps to JSON values.

"},{"location":"mozfun/json/#parameters_1","title":"Parameters","text":"

INPUTS

input JSON\n

OUTPUTS

json\n

Source | Edit

"},{"location":"mozfun/json/#from_nested_map-udf","title":"from_nested_map (UDF)","text":"

Converts a nested JSON object with repeated key/value pairs into a nested JSON object.

Convert a JSON object like { \"metric\": [ {\"key\": \"extra\", \"value\": 2 } ] } to a JSON object like { \"metric\": { \"key\": 2 } }.

This only works on JSON types.

"},{"location":"mozfun/json/#parameters_2","title":"Parameters","text":"

OUTPUTS

json\n

Source | Edit

"},{"location":"mozfun/json/#js_extract_string_map-udf","title":"js_extract_string_map (UDF)","text":"

Returns an array of key/value structs from a string representing a JSON map.

BigQuery Standard SQL JSON functions are insufficient to implement this function, so JS is being used and it may not perform well with large or numerous inputs.

Non-string non-null values are encoded as json.
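
A usage sketch; per the note above, the non-string value is expected to be re-encoded as JSON:

SELECT\n  mozfun.json.js_extract_string_map('{\"a\": \"foo\", \"b\": 3}')\n-- >> [{key: 'a', value: 'foo'}, {key: 'b', value: '3'}]\n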

"},{"location":"mozfun/json/#parameters_3","title":"Parameters","text":"

INPUTS

input STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/json/#mode_last-udf","title":"mode_last (UDF)","text":"

Returns the most frequently occurring element in an array of json-compatible elements. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored.
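
For example:

SELECT\n  mozfun.json.mode_last(['input', 'input', 'output'])\n-- >> 'input'\n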

"},{"location":"mozfun/json/#parameters_4","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"mozfun/ltv/","title":"Ltv","text":""},{"location":"mozfun/ltv/#android_states_v1-udf","title":"android_states_v1 (UDF)","text":"

LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"

"},{"location":"mozfun/ltv/#parameters","title":"Parameters","text":"

INPUTS

adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#android_states_v2-udf","title":"android_states_v2 (UDF)","text":"

LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"

"},{"location":"mozfun/ltv/#parameters_1","title":"Parameters","text":"

INPUTS

adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#android_states_with_paid_v1-udf","title":"android_states_with_paid_v1 (UDF)","text":"

LTV states for Android. Results in strings like: \"1_dow3_organic_2_1\" and \"0_dow1_paid_1_1\"

These states include whether a client was paid or organic.

"},{"location":"mozfun/ltv/#parameters_2","title":"Parameters","text":"

INPUTS

adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#android_states_with_paid_v2-udf","title":"android_states_with_paid_v2 (UDF)","text":"

Get the state of a user on a day, with paid/organic cohorts included. Compared to V1, these states have a \"dead\" state, determined by \"dead_time\". The model can use this state as a sink, where the client will never return if they are dead.

"},{"location":"mozfun/ltv/#parameters_3","title":"Parameters","text":"

INPUTS

adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#desktop_states_v1-udf","title":"desktop_states_v1 (UDF)","text":"

LTV states for Desktop. Results in strings like: \"0_1_1_1_1\", where each component is: 1. the age in days of the client; 2. the day of week of first_seen_date; 3. the day of week of submission_date; 4. the activity level (possible values are 0-3, plus \"00\" for \"dead\"); 5. whether the client is active on submission_date.

"},{"location":"mozfun/ltv/#parameters_4","title":"Parameters","text":"

INPUTS

days_since_first_seen INT64, days_since_active INT64, submission_date DATE, first_seen_date DATE, death_time INT64, pattern INT64, active INT64, max_days INT64, lookback INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/ltv/#get_state_ios_v2-udf","title":"get_state_ios_v2 (UDF)","text":"

LTV states for iOS.

"},{"location":"mozfun/ltv/#parameters_5","title":"Parameters","text":"

INPUTS

days_since_first_seen INT64, days_since_seen INT64, submission_date DATE, death_time INT64, pattern INT64, active INT64, max_weeks INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/map/","title":"map","text":"

Functions for working with arrays of key/value structs.

"},{"location":"mozfun/map/#extract_keyed_scalar_sum-udf","title":"extract_keyed_scalar_sum (UDF)","text":"

Sums all values in a keyed scalar.

"},{"location":"mozfun/map/#extract-keyed-scalar-sum","title":"Extract Keyed Scalar Sum","text":"

Takes a keyed scalar and returns a single number: the sum of all values it contains. The expected input type is ARRAY<STRUCT<key STRING, value INT64>>

The return type is INT64.

The key field will be ignored.

"},{"location":"mozfun/map/#parameters","title":"Parameters","text":"

INPUTS

keyed_scalar ARRAY<STRUCT<key STRING, value INT64>>\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/map/#from_lists-udf","title":"from_lists (UDF)","text":"

Create a map from two arrays (like zipping)
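
A minimal sketch of the zipping behavior (output shown as key/value structs):

SELECT\n  mozfun.map.from_lists(['os', 'channel'], ['Android', 'beta'])\n-- >> [{key: 'os', value: 'Android'}, {key: 'channel', value: 'beta'}]\n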

"},{"location":"mozfun/map/#parameters_1","title":"Parameters","text":"

INPUTS

keys ANY TYPE, `values` ANY TYPE\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/map/#get_key-udf","title":"get_key (UDF)","text":"

Fetch the value associated with a given key from an array of key/value structs.

Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields.
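
For example:

SELECT\n  mozfun.map.get_key(\n    [STRUCT('os' AS key, 'Android' AS value), STRUCT('channel' AS key, 'beta' AS value)],\n    'channel'\n  )\n-- >> 'beta'\n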

"},{"location":"mozfun/map/#parameters_2","title":"Parameters","text":"

INPUTS

map ANY TYPE, k ANY TYPE\n

Source | Edit

"},{"location":"mozfun/map/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":"

Fetch the value associated with a given key from an array of key/value structs.

Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields. This version matches NULL keys as well.

"},{"location":"mozfun/map/#parameters_3","title":"Parameters","text":"

INPUTS

map ANY TYPE, k ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/map/#mode_last-udf","title":"mode_last (UDF)","text":"

Combine entries from multiple maps, determine the value for each key using mozfun.stats.mode_last.

"},{"location":"mozfun/map/#parameters_4","title":"Parameters","text":"

INPUTS

entries ANY TYPE\n

Source | Edit

"},{"location":"mozfun/map/#set_key-udf","title":"set_key (UDF)","text":"

Set a key to a value in a map. If you call map.get_key after setting, the value you set will be returned.

map.set_key

Set a key to a specific value in a map. We represent maps as Arrays of Key/Value structs: ARRAY<STRUCT<key ANY TYPE, value ANY TYPE>>.

The type of the key and value you are setting must match the types in the map itself.
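
A minimal sketch; per the description above, a subsequent map.get_key on the result returns the value that was set:

SELECT\n  mozfun.map.set_key(\n    [STRUCT('os' AS key, 'Android' AS value)],\n    'channel',\n    'beta'\n  )\n-- >> a map containing both the 'os' and 'channel' entries\n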

"},{"location":"mozfun/map/#parameters_5","title":"Parameters","text":"

INPUTS

map ANY TYPE, new_key ANY TYPE, new_value ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/map/#sum-udf","title":"sum (UDF)","text":"

Return the sum of values by key in an array of map entries. The expected schema for entries is ARRAY<STRUCT<key, value>>, where the type for value must be supported by SUM, which allows numeric data types INT64, NUMERIC, and FLOAT64."},{"location":"mozfun/map/#parameters_6","title":"Parameters","text":"

INPUTS

entries ANY TYPE\n

Source | Edit

"},{"location":"mozfun/marketing/","title":"Marketing","text":""},{"location":"mozfun/marketing/#parse_ad_group_name-udf","title":"parse_ad_group_name (UDF)","text":"

Parse an ad group name into its known segments.

"},{"location":"mozfun/marketing/#parse-ad-group-name-udf","title":"Parse Ad Group Name UDF","text":"

This function takes an ad group name and parses out known segments. These segments are things like country, language, or audience; multiple ad groups can share segments.

We use versioned ad group names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the ad group name. We track the versions in this spreadsheet.

For a history of this naming scheme, see the original proposal.

See also: marketing.parse_campaign_name, which does the same, but for campaign names.

"},{"location":"mozfun/marketing/#parameters","title":"Parameters","text":"

INPUTS

ad_group_name STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/marketing/#parse_campaign_name-udf","title":"parse_campaign_name (UDF)","text":"

Parse a campaign name. Extracts things like region, country_code, and language.

"},{"location":"mozfun/marketing/#parse-campaign-name-udf","title":"Parse Campaign Name UDF","text":"

This function takes a campaign name and parses out known segments. These segments are things like country, language, or audience; multiple campaigns can share segments.

We use versioned campaign names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the campaign name. We track the versions in this spreadsheet.

For a history of this naming scheme, see the original proposal.

"},{"location":"mozfun/marketing/#parameters_1","title":"Parameters","text":"

INPUTS

campaign_name STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/marketing/#parse_creative_name-udf","title":"parse_creative_name (UDF)","text":"

Parse segments from a creative name.

"},{"location":"mozfun/marketing/#parse-creative-name-udf","title":"Parse Creative Name UDF","text":"

This function takes a creative name and parses out known segments. These segments are things like country, language, or audience; multiple creatives can share segments.

We use versioned creative names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the creative name. We track the versions in this spreadsheet.

For a history of this naming scheme, see the original proposal.

See also: marketing.parse_campaign_name, which does the same, but for campaign names.

"},{"location":"mozfun/marketing/#parameters_2","title":"Parameters","text":"

INPUTS

creative_name STRING\n

OUTPUTS

ARRAY<STRUCT<key STRING, value STRING>>\n

Source | Edit

"},{"location":"mozfun/mobile_search/","title":"Mobile search","text":""},{"location":"mozfun/mobile_search/#normalize_app_name-udf","title":"normalize_app_name (UDF)","text":"

Returns normalized_app_name and normalized_app_name_os (for mobile search tables only).

"},{"location":"mozfun/mobile_search/#normalized-app-and-os-name-for-mobile-search-related-tables","title":"Normalized app and os name for mobile search related tables","text":"

Takes app name and os as input: returns a struct of normalized_app_name and normalized_app_name_os based on the discussion provided here

"},{"location":"mozfun/mobile_search/#parameters","title":"Parameters","text":"

INPUTS

app_name STRING, os STRING\n

OUTPUTS

STRUCT<normalized_app_name STRING, normalized_app_name_os STRING>\n

Source | Edit

"},{"location":"mozfun/norm/","title":"norm","text":"

Functions for normalizing data.

"},{"location":"mozfun/norm/#browser_version_info-udf","title":"browser_version_info (UDF)","text":"

Adds metadata related to the browser version in a struct.

This is a temporary solution that allows browser version analysis. It should eventually be replaced with one or more browser version tables that serves as a source of truth for version releases.
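
A usage sketch; the full set of field values for a given string depends on the implementation, but the struct shape follows the OUTPUTS listed below:

SELECT\n  mozfun.norm.browser_version_info('96.0.1').major_version\n-- >> 96\n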

"},{"location":"mozfun/norm/#parameters","title":"Parameters","text":"

INPUTS

version_string STRING\n

OUTPUTS

STRUCT<version STRING, major_version NUMERIC, minor_version NUMERIC, patch_revision NUMERIC, is_major_release BOOLEAN>\n

Source | Edit

"},{"location":"mozfun/norm/#diff_months-udf","title":"diff_months (UDF)","text":"

Determine the number of whole months after grace period between start and end. Month is dependent on timezone, so start and end must both be datetimes, or both be dates, in the correct timezone. Grace period can be used to account for billing delay, usually 1 day, and is counted after months. When inclusive is FALSE, start and end are not included in whole months. For example, diff_months(start => '2021-01-01', end => '2021-03-01', grace_period => INTERVAL 0 day, inclusive => FALSE) returns 1, because start plus two months plus grace period is not less than end. Changing inclusive to TRUE returns 2, because start plus two months plus grace period is less than or equal to end. diff_months(start => '2021-01-01', end => '2021-03-02 00:00:00.000001', grace_period => INTERVAL 1 DAY, inclusive => FALSE) returns 2, because start plus two months plus grace period is less than end.
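
The prose examples above translate to calls like the following (positional arguments shown):

SELECT\n  mozfun.norm.diff_months(DATETIME '2021-01-01', DATETIME '2021-03-01', INTERVAL 0 DAY, FALSE),  -- 1\n  mozfun.norm.diff_months(DATETIME '2021-01-01', DATETIME '2021-03-01', INTERVAL 0 DAY, TRUE)    -- 2\n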

"},{"location":"mozfun/norm/#parameters_1","title":"Parameters","text":"

INPUTS

start DATETIME, `end` DATETIME, grace_period INTERVAL, inclusive BOOLEAN\n

Source | Edit

"},{"location":"mozfun/norm/#extract_version-udf","title":"extract_version (UDF)","text":"

Extracts numeric version data from a version string like <major>.<minor>.<patch>.

Note: Non-zero minor and patch versions will be floating point Numeric.

Usage:

SELECT\n    mozfun.norm.extract_version(version_string, 'major') as major_version,\n    mozfun.norm.extract_version(version_string, 'minor') as minor_version,\n    mozfun.norm.extract_version(version_string, 'patch') as patch_version\n

Example using \"96.05.01\":

SELECT\n    mozfun.norm.extract_version('96.05.01', 'major') as major_version, -- 96\n    mozfun.norm.extract_version('96.05.01', 'minor') as minor_version, -- 5\n    mozfun.norm.extract_version('96.05.01', 'patch') as patch_version  -- 1\n
"},{"location":"mozfun/norm/#parameters_2","title":"Parameters","text":"

INPUTS

version_string STRING, extraction_level STRING\n

OUTPUTS

NUMERIC\n

Source | Edit

"},{"location":"mozfun/norm/#fenix_app_info-udf","title":"fenix_app_info (UDF)","text":"

Returns canonical, human-understandable identification info for Fenix sources.

The Glean telemetry library for Android by design routes pings based on the Play Store appId value of the published application. As of August 2020, there have been 5 separate Play Store appId values associated with different builds of Fenix, each corresponding to different datasets in BigQuery, and the mapping of appId to logical app names (Firefox vs. Firefox Preview) and channel names (nightly, beta, or release) has changed over time; see the spreadsheet of naming history for Mozilla's mobile browsers.

This function is intended as the source of truth for how to map a specific ping in BigQuery to a logical app name and channel. It should be expected that the output of this function may evolve over time. If we rename a product or channel, we may choose to update the values here so that analyses consistently get the new name.

The first argument (app_id) can be fairly fuzzy; it is tolerant of actual Google Play Store appId values like 'org.mozilla.firefox_beta' (mix of periods and underscores) as well as BigQuery dataset names with suffixes like 'org_mozilla_firefox_beta_stable'.

The second argument (app_build_id) should be the value in client_info.app_build.

The function returns a STRUCT that contains the logical app_name and channel as well as the Play Store app_id in the canonical form which would appear in Play Store URLs.

Note that the naming of Fenix applications changed on 2020-07-03, so to get a continuous view of the pings associated with a logical app channel, you may need to union together tables from multiple BigQuery datasets. To see data for all Fenix channels together, it is necessary to union together tables from all 5 datasets. For basic usage information, consider using telemetry.fenix_clients_last_seen which already handles the union. Otherwise, see the example below as a template for how to construct a custom union.

Mapping of channels to datasets:

-- Example of a query over all Fenix builds advertised as \"Firefox Beta\"\nCREATE TEMP FUNCTION extract_fields(app_id STRING, m ANY TYPE) AS (\n  (\n    SELECT AS STRUCT\n      m.submission_timestamp,\n      m.metrics.string.geckoview_version,\n      mozfun.norm.fenix_app_info(app_id, m.client_info.app_build).*\n  )\n);\n\nWITH base AS (\n  SELECT\n    extract_fields('org_mozilla_firefox_beta', m).*\n  FROM\n    `mozdata.org_mozilla_firefox_beta.metrics` AS m\n  UNION ALL\n  SELECT\n    extract_fields('org_mozilla_fenix', m).*\n  FROM\n    `mozdata.org_mozilla_fenix.metrics` AS m\n)\nSELECT\n  DATE(submission_timestamp) AS submission_date,\n  geckoview_version,\n  COUNT(*)\nFROM\n  base\nWHERE\n  app_name = 'Fenix'  -- excludes 'Firefox Preview'\n  AND channel = 'beta'\n  AND DATE(submission_timestamp) = '2020-08-01'\nGROUP BY\n  submission_date,\n  geckoview_version\n
"},{"location":"mozfun/norm/#parameters_3","title":"Parameters","text":"

INPUTS

app_id STRING, app_build_id STRING\n

OUTPUTS

STRUCT<app_name STRING, channel STRING, app_id STRING>\n

Source | Edit

"},{"location":"mozfun/norm/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"

Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.

Fenix originally used an 8-digit app_build format.

In short, the format is yDDDHHmm.

The last date seen with an 8-digit build ID is 2020-08-10.

Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.

This function tolerates both formats.

After using this you may wish to DATETIME_TRUNC(result, DAY) for grouping by build date.
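For example, grouping pings by build date might look like the following sketch; the baseline table name is an illustrative assumption, and any Fenix ping table with client_info.app_build works the same way:

SELECT\n  DATETIME_TRUNC(mozfun.norm.fenix_build_to_datetime(client_info.app_build), DAY) AS build_date,\n  COUNT(*) AS pings\nFROM\n  `mozdata.org_mozilla_firefox.baseline`\nWHERE\n  DATE(submission_timestamp) = '2023-01-01'\nGROUP BY\n  build_date\n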

"},{"location":"mozfun/norm/#parameters_4","title":"Parameters","text":"

INPUTS

app_build STRING\n

OUTPUTS

INT64\n

Source | Edit

"},{"location":"mozfun/norm/#firefox_android_package_name_to_channel-udf","title":"firefox_android_package_name_to_channel (UDF)","text":"

Map Fenix package name to the channel name
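Usage sketch; 'org.mozilla.firefox_beta' is the Play Store package name mentioned in the fenix_app_info docs, and the expected output is an assumption:

SELECT\n  mozfun.norm.firefox_android_package_name_to_channel('org.mozilla.firefox_beta') AS channel  -- expected 'beta' (assumption)\n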

"},{"location":"mozfun/norm/#parameters_5","title":"Parameters","text":"

INPUTS

package_name STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/norm/#get_earliest_value-udf","title":"get_earliest_value (UDF)","text":"

This UDF returns the earliest not-null value pair and datetime from a list of values and their corresponding timestamp.

The function will return the first value pair in the input array that is not null and has the earliest timestamp.

Because there may be more than one value on the same date (e.g. more than one value reported by different pings on the same date), the dates must be given as DATETIME values and the values as STRING.

Usage:

SELECT\n   mozfun.norm.get_earliest_value(ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>) AS <alias>\n
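A concrete invocation with an inline array; the values and timestamps are illustrative:

SELECT\n  mozfun.norm.get_earliest_value([\n    STRUCT('value_a' AS value, 'ping_a' AS value_source, DATETIME '2023-01-02' AS value_date),\n    STRUCT('value_b' AS value, 'ping_b' AS value_source, DATETIME '2023-01-01' AS value_date)\n  ]).earliest_value AS earliest_value  -- 'value_b', the non-null value with the earliest timestamp\n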
"},{"location":"mozfun/norm/#parameters_6","title":"Parameters","text":"

INPUTS

value_set ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>\n

OUTPUTS

STRUCT<earliest_value STRING, earliest_value_source STRING, earliest_date DATETIME>\n

Source | Edit

"},{"location":"mozfun/norm/#get_windows_info-udf","title":"get_windows_info (UDF)","text":"

Extract the name, the version name, the version number, and the build number corresponding to a Microsoft Windows operating system version string in the form x.y.z or w.x.y.z, for most release versions of Windows after 2007."},{"location":"mozfun/norm/#windows-names-versions-and-builds","title":"Windows Names, Versions, and Builds","text":""},{"location":"mozfun/norm/#summary","title":"Summary","text":"

This function is primarily designed to parse the field os_version in table mozdata.default_browser_agent.default_browser. Given a Microsoft Windows OS version string, the function returns the name of the operating system, the version name, the version number, and the build number corresponding to the operating system. As of November 2022, the parser can handle 99.89% of the os_version values collected in table mozdata.default_browser_agent.default_browser.

"},{"location":"mozfun/norm/#status-as-of-november-2022","title":"Status as of November 2022","text":"

As of November 2022, the expected valid values of os_version are either x.y.z or w.x.y.z where w, x, y, and z are integers.

As of November 2022, the return values for Windows 10 and Windows 11 are based on Windows 10 release information and Windows 11 release information. For 3-number version strings, the parser assumes the valid values of z in x.y.z are at most 5 digits in length. For 4-number version strings, the parser assumes the valid values of z in w.x.y.z are at most 6 digits in length. The function makes an educated effort to handle Windows Vista, Windows 7, Windows 8, and Windows 8.1 information, but does not guarantee the return values are absolutely accurate. The function assumes the presence of undocumented non-release versions of Windows 10 and Windows 11, and will return an estimated name, version number, build number but not the version name. The function does not handle other versions of Windows.

As of November 2022, the parser currently handles just over 99.89% of data in the field os_version in table mozdata.default_browser_agent.default_browser.

"},{"location":"mozfun/norm/#build-number-conventions","title":"Build number conventions","text":"

Note: Microsoft's convention for build numbers for Windows 10 and 11 includes two numbers, such as build number 22621.900 for version 22621. The first number repeats the version number and the second uniquely identifies the build within the version. To simplify data processing and analysis, this function returns the second, unique identifier as an integer instead of returning the full build number as a string.

"},{"location":"mozfun/norm/#example-usage","title":"Example usage","text":"
SELECT\n  `os_version`,\n  mozfun.norm.get_windows_info(`os_version`) AS windows_info\nFROM `mozdata.default_browser_agent.default_browser`\nWHERE `submission_timestamp` > (CURRENT_TIMESTAMP() - INTERVAL 7 DAY) AND LEFT(document_id, 2) = '00'\nLIMIT 1000\n
"},{"location":"mozfun/norm/#mapping","title":"Mapping","text":"os_version windows_name windows_version_name windows_version_number windows_build_number 6.0.z Windows Vista 6.0 6.0 z 6.1.z Windows 7 7.0 6.1 z 6.2.z Windows 8 8.0 6.2 z 6.3.z Windows 8.1 8.1 6.3 z 10.0.10240.z Windows 10 1507 10240 z 10.0.10586.z Windows 10 1511 10586 z 10.0.14393.z Windows 10 1607 14393 z 10.0.15063.z Windows 10 1703 15063 z 10.0.16299.z Windows 10 1709 16299 z 10.0.17134.z Windows 10 1803 17134 z 10.0.17763.z Windows 10 1809 17763 z 10.0.18362.z Windows 10 1903 18362 z 10.0.18363.z Windows 10 1909 18363 z 10.0.19041.z Windows 10 2004 19041 z 10.0.19042.z Windows 10 20H2 19042 z 10.0.19043.z Windows 10 21H1 19043 z 10.0.19044.z Windows 10 21H2 19044 z 10.0.19045.z Windows 10 22H2 19045 z 10.0.y.z Windows 10 UNKNOWN y z 10.0.22000.z Windows 11 21H2 22000 z 10.0.22621.z Windows 11 22H2 22621 z 10.0.y.z Windows 11 UNKNOWN y z all other values (null) (null) (null) (null)"},{"location":"mozfun/norm/#parameters_7","title":"Parameters","text":"

INPUTS

os_version STRING\n

OUTPUTS

STRUCT<name STRING, version_name STRING, version_number DECIMAL, build_number INT64>\n

Source | Edit

"},{"location":"mozfun/norm/#glean_baseline_client_info-udf","title":"glean_baseline_client_info (UDF)","text":"

Accepts a glean client_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.

"},{"location":"mozfun/norm/#parameters_8","title":"Parameters","text":"

INPUTS

client_info ANY TYPE, metrics ANY TYPE\n

OUTPUTS

string\n

Source | Edit

"},{"location":"mozfun/norm/#glean_ping_info-udf","title":"glean_ping_info (UDF)","text":"

Accepts a glean ping_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.

"},{"location":"mozfun/norm/#parameters_9","title":"Parameters","text":"

INPUTS

ping_info ANY TYPE\n

Source | Edit

"},{"location":"mozfun/norm/#metadata-udf","title":"metadata (UDF)","text":"

Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.

"},{"location":"mozfun/norm/#parameters_10","title":"Parameters","text":"

INPUTS

metadata ANY TYPE\n

OUTPUTS

`date`, CAST(NULL\n

Source | Edit

"},{"location":"mozfun/norm/#os-udf","title":"os (UDF)","text":"

Normalize an operating system string to one of the three major desktop platforms, one of the two major mobile platforms, or \"Other\".

This is a reimplementation of logic used in the data pipeline to populate normalized_os.

"},{"location":"mozfun/norm/#parameters_11","title":"Parameters","text":"

INPUTS

os STRING\n

Source | Edit

"},{"location":"mozfun/norm/#product_info-udf","title":"product_info (UDF)","text":"

Returns a normalized app_name and canonical_app_name for a product based on legacy_app_name and normalized_os values. Thus, this function serves as a bridge to get from legacy application identifiers to the consistent identifiers we are using for reporting in 2021.

As of 2021, most Mozilla products are sending telemetry via the Glean SDK, with Glean telemetry in active development for desktop Firefox as well. The probeinfo API is the single source of truth for metadata about applications sending Glean telemetry; the values for app_name and canonical_app_name returned here correspond to the \"end-to-end identifier\" values documented in the v2 Glean app listings endpoint. For non-Glean telemetry, we provide values in the same style to provide continuity as we continue the migration to Glean.

For legacy telemetry pings like main ping for desktop and core ping for mobile products, the legacy_app_name given as input to this function should come from the submission URI (stored as metadata.uri.app_name in BigQuery ping tables). For Glean pings, we have invented product values that can be passed in to this function as the legacy_app_name parameter.

The returned app_name values are intended to be readable and unambiguous, but short and easy to type. They are suitable for use as a key in derived tables. product is a deprecated field that was similar in intent.

The returned canonical_app_name is more verbose and is suited for displaying in visualizations. canonical_name is a synonym that we provide for historical compatibility with previous versions of this function.

The returned struct also contains boolean contributes_to_2021_kpi as the canonical reference for whether the given application is included in KPI reporting. Additional fields may be added for future years.

The normalized_os value that's passed in should be the top-level normalized_os value present in any ping table; alternatively, you can wrap a raw value in mozfun.norm.os, for example mozfun.norm.product_info(app_name, mozfun.norm.os(os)).
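A minimal sketch using literal inputs; the expected outputs for the 'Fenix' / 'Android' pair follow the mapping table below:

SELECT\n  mozfun.norm.product_info('Fenix', 'Android').app_name AS app_name,                      -- 'fenix'\n  mozfun.norm.product_info('Fenix', 'Android').canonical_app_name AS canonical_app_name  -- 'Firefox for Android (Fenix)'\n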

This function also tolerates passing in a product value as legacy_app_name so that this function is still useful for derived tables which have thrown away the raw app_name value from legacy pings.

The mappings are as follows:

legacy_app_name normalized_os app_name product canonical_app_name 2019 2020 2021 Firefox * firefox_desktop Firefox Firefox for Desktop true true true Fenix Android fenix Fenix Firefox for Android (Fenix) true true true Fennec Android fennec Fennec Firefox for Android (Fennec) true true true Firefox Preview Android firefox_preview Firefox Preview Firefox Preview for Android true true true Fennec iOS firefox_ios Firefox iOS Firefox for iOS true true true FirefoxForFireTV Android firefox_fire_tv Firefox Fire TV Firefox for Fire TV false false false FirefoxConnect Android firefox_connect Firefox Echo Firefox for Echo Show true true false Zerda Android firefox_lite Firefox Lite Firefox Lite true true false Zerda_cn Android firefox_lite_cn Firefox Lite CN Firefox Lite (China) false false false Focus Android focus_android Focus Android Firefox Focus for Android true true true Focus iOS focus_ios Focus iOS Firefox Focus for iOS true true true Klar Android klar_android Klar Android Firefox Klar for Android false false false Klar iOS klar_ios Klar iOS Firefox Klar for iOS false false false Lockbox Android lockwise_android Lockwise Android Lockwise for Android true true false Lockbox iOS lockwise_ios Lockwise iOS Lockwise for iOS true true false FirefoxReality* Android firefox_reality Firefox Reality Firefox Reality false false false"},{"location":"mozfun/norm/#parameters_12","title":"Parameters","text":"

INPUTS

legacy_app_name STRING, normalized_os STRING\n

OUTPUTS

STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n

Source | Edit

"},{"location":"mozfun/norm/#result_type_to_product_name-udf","title":"result_type_to_product_name (UDF)","text":"

Convert urlbar result types into product-friendly names

This UDF converts result types from urlbar events (engagement, impression, abandonment) into product-friendly names.

"},{"location":"mozfun/norm/#parameters_13","title":"Parameters","text":"

INPUTS

res STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/norm/#truncate_version-udf","title":"truncate_version (UDF)","text":"

Truncates a version string like <major>.<minor>.<patch> to either the major or minor version. The return value is NUMERIC, which means that you can sort the results reliably (e.g. 100 will be categorized as greater than 80, which isn't the case when sorting lexicographically).

For example, \"5.1.0\" would be translated to 5.1 if the parameter is \"minor\" or 5 if the parameter is \"major\".

If the version is only a major and/or minor version, then it will be left unchanged (for example, \"10\" would stay as 10 when run through this function, regardless of the arguments).

This is useful for grouping Linux and Mac operating system versions inside aggregate datasets or queries where there may be many different patch releases in the field.
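The example above as a runnable query:

SELECT\n  mozfun.norm.truncate_version('5.1.0', 'major') AS major_version,  -- 5\n  mozfun.norm.truncate_version('5.1.0', 'minor') AS minor_version   -- 5.1\n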

"},{"location":"mozfun/norm/#parameters_14","title":"Parameters","text":"

INPUTS

os_version STRING, truncation_level STRING\n

OUTPUTS

NUMERIC\n

Source | Edit

"},{"location":"mozfun/norm/#vpn_attribution-udf","title":"vpn_attribution (UDF)","text":"

Accepts vpn attribution fields as input and returns a struct of normalized fields.

"},{"location":"mozfun/norm/#parameters_15","title":"Parameters","text":"

INPUTS

utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n

OUTPUTS

STRUCT<normalized_acquisition_channel STRING, normalized_campaign STRING, normalized_content STRING, normalized_medium STRING, normalized_source STRING, website_channel_group STRING>\n

Source | Edit

"},{"location":"mozfun/norm/#windows_version_info-udf","title":"windows_version_info (UDF)","text":"

Given an unnormalized set of Windows identifiers, return a friendly version of the operating system name.

Requires os, os_version and windows_build_number.

For example, a windows_build_number >= 22000 returns Windows 11.
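Usage sketch; the os and os_version literals are illustrative telemetry-style values, and the build number 22621 is above the 22000 threshold mentioned above:

SELECT\n  mozfun.norm.windows_version_info('Windows_NT', '10.0', 22621) AS friendly_os_name  -- 'Windows 11'\n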

"},{"location":"mozfun/norm/#parameters_16","title":"Parameters","text":"

INPUTS

os STRING, os_version STRING, windows_build_number INT64\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/serp_events/","title":"serp_events","text":"

Functions for working with Glean SERP events.

"},{"location":"mozfun/serp_events/#ad_blocker_inferred-udf","title":"ad_blocker_inferred (UDF)","text":"

Determine whether an ad blocker is inferred to be in use on a SERP. True if all loaded ads are blocked.
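Usage sketch, with the arguments in the INPUTS order below (num_loaded, num_blocked):

SELECT\n  mozfun.serp_events.ad_blocker_inferred(5, 5) AS all_ads_blocked,   -- TRUE: every loaded ad was blocked\n  mozfun.serp_events.ad_blocker_inferred(5, 2) AS some_ads_blocked  -- FALSE: only some loaded ads were blocked\n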

"},{"location":"mozfun/serp_events/#parameters","title":"Parameters","text":"

INPUTS

num_loaded INT, num_blocked INT\n

OUTPUTS

BOOL\n

Source | Edit

"},{"location":"mozfun/serp_events/#is_ad_component-udf","title":"is_ad_component (UDF)","text":"

Determine whether a SERP display component referenced in the serp events contains monetizable ads

"},{"location":"mozfun/serp_events/#parameters_1","title":"Parameters","text":"

INPUTS

component STRING\n

OUTPUTS

BOOL\n

Source | Edit

"},{"location":"mozfun/stats/","title":"stats","text":"

Statistics functions.

"},{"location":"mozfun/stats/#mode_last-udf","title":"mode_last (UDF)","text":"

Returns the most frequently occurring element in an array.

In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored. See also: stats.mode_last_retain_nulls, which retains nulls.
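For example, with a tie between 'b' and 'c', the value appearing latest in the array wins:

SELECT\n  mozfun.stats.mode_last(['a', 'b', 'b', 'c', 'c']) AS most_frequent_value  -- 'c'\n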

"},{"location":"mozfun/stats/#parameters","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

Source | Edit

"},{"location":"mozfun/stats/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":"

Returns the most frequently occurring element in an array. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are retained. See also: stats.mode_last, which ignores nulls.

"},{"location":"mozfun/stats/#parameters_1","title":"Parameters","text":"

INPUTS

list ANY TYPE\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/utils/","title":"Utils","text":""},{"location":"mozfun/utils/#diff_query_schemas-stored-procedure","title":"diff_query_schemas (Stored Procedure)","text":"

Diff the schemas of two queries. Especially useful when the BigQuery error is truncated, and the schemas of e.g. a UNION don't match.


Use it like:

DECLARE res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>;\nCALL mozfun.utils.diff_query_schemas(\"\"\"SELECT * FROM a\"\"\", \"\"\"SELECT * FROM b\"\"\", res);\n-- See entire schema entries, if you need context\nSELECT res;\n-- See just the elements that differ\nSELECT * FROM UNNEST(res) WHERE differs;\n

You'll be able to view the results of \"res\" to compare the schemas of the two queries, and hopefully find what doesn't match.

"},{"location":"mozfun/utils/#parameters","title":"Parameters","text":"

INPUTS

query_a STRING, query_b STRING\n

OUTPUTS

res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>\n

Source | Edit

"},{"location":"mozfun/utils/#extract_utm_from_url-udf","title":"extract_utm_from_url (UDF)","text":"

Extract UTM parameters from a URL and return them as a STRUCT.

This UDF extracts UTM parameters from a URL string.

UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
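Usage sketch; the URL and its parameter values are illustrative, and the returned struct has the fields listed under OUTPUTS below:

SELECT\n  mozfun.utils.extract_utm_from_url(\n    'https://www.mozilla.org/firefox/?utm_source=newsletter&utm_medium=email&utm_campaign=welcome'\n  ).utm_source AS utm_source  -- 'newsletter'\n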

"},{"location":"mozfun/utils/#parameters_1","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRUCT<utm_source STRING, utm_medium STRING, utm_campaign STRING, utm_content STRING, utm_term STRING>\n

Source | Edit

"},{"location":"mozfun/utils/#get_url_path-udf","title":"get_url_path (UDF)","text":"

Extract the Path from a URL

This UDF extracts path from a URL string.

The path is everything after the host and before parameters. This function returns \"/\" if there is no path.
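Usage sketch; the URLs are illustrative, and the second call shows the documented '/' fallback when there is no path:

SELECT\n  mozfun.utils.get_url_path('https://www.mozilla.org/en-US/firefox/new/?utm_source=newsletter') AS path,  -- '/en-US/firefox/new/'\n  mozfun.utils.get_url_path('https://www.mozilla.org') AS root_path                                        -- '/'\n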

"},{"location":"mozfun/utils/#parameters_2","title":"Parameters","text":"

INPUTS

url STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/vpn/","title":"vpn","text":"

Functions for processing VPN data.

"},{"location":"mozfun/vpn/#acquisition_channel-udf","title":"acquisition_channel (UDF)","text":"

Assign an acquisition channel based on utm parameters

"},{"location":"mozfun/vpn/#parameters","title":"Parameters","text":"

INPUTS

utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/vpn/#channel_group-udf","title":"channel_group (UDF)","text":"

Assign a channel group based on utm parameters

"},{"location":"mozfun/vpn/#parameters_1","title":"Parameters","text":"

INPUTS

utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"mozfun/vpn/#normalize_utm_parameters-udf","title":"normalize_utm_parameters (UDF)","text":"

Normalize utm parameters to use the same NULL placeholders as Google Analytics

"},{"location":"mozfun/vpn/#parameters_2","title":"Parameters","text":"

INPUTS

utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n

OUTPUTS

STRUCT<utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING>\n

Source | Edit

"},{"location":"mozfun/vpn/#pricing_plan-udf","title":"pricing_plan (UDF)","text":"

Combine the pricing and interval for a subscription plan into a single field

"},{"location":"mozfun/vpn/#parameters_3","title":"Parameters","text":"

INPUTS

provider STRING, amount INTEGER, currency STRING, `interval` STRING, interval_count INTEGER\n

OUTPUTS

STRING\n

Source | Edit

"},{"location":"reference/airflow_tags/","title":"Airflow Tags","text":""},{"location":"reference/airflow_tags/#why","title":"Why","text":"

Airflow tags enable DAGs to be filtered in the web UI view, reducing the number of DAGs shown to just those you are interested in.

Additionally, tags provide extra information, such as a DAG's impact, to make it easier to understand the DAG and the impact of failures when doing Airflow triage.

More information and the discussions can be found in the original Airflow Tags Proposal (can be found within the data org proposals/ folder).

"},{"location":"reference/airflow_tags/#valid-tags","title":"Valid tags","text":""},{"location":"reference/airflow_tags/#impacttier-tag","title":"impact/tier tag","text":"

We borrow the tiering system used by our integration and testing sheriffs. This is to maintain a level of consistency across different systems to ensure common language and understanding across teams. Valid tier tags include:

"},{"location":"reference/airflow_tags/#triage-tag","title":"triage/ tag","text":"

This tag is meant to provide guidance to a triage engineer on how to respond to a specific DAG failure when the job owner does not want the standard process to be followed.

"},{"location":"reference/configuration/","title":"Configuration","text":"

The behaviour of bqetl can be configured via the bqetl_project.yaml file. This file, for example, specifies the queries that should be skipped during dryrun, views that should not be published and contains various other configurations.

The general structure of bqetl_project.yaml is as follows:

dry_run:\n  function: https://us-central1-moz-fx-data-shared-prod.cloudfunctions.net/bigquery-etl-dryrun\n  test_project: bigquery-etl-integration-test\n  skip:\n  - sql/moz-fx-data-shared-prod/account_ecosystem_derived/desktop_clients_daily_v1/query.sql\n  - sql/**/apple_ads_external*/**/query.sql\n  # - ...\n\nviews:\n  skip_validation:\n  - sql/moz-fx-data-test-project/test/simple_view/view.sql\n  - sql/moz-fx-data-shared-prod/mlhackweek_search/events/view.sql\n  - sql/moz-fx-data-shared-prod/**/client_deduplication/view.sql\n  # - ...\n  skip_publishing:\n  - activity_stream/tile_id_types/view.sql\n  - pocket/pocket_reach_mau/view.sql\n  # - ...\n  non_user_facing_suffixes:\n  - _derived\n  - _external\n  # - ...\n\nschema:\n  skip_update:\n  - sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/schema.yaml\n  # - ...\n  skip_prefixes:\n  - pioneer\n  - rally\n\nroutines:\n  skip_publishing:\n  - sql/moz-fx-data-shared-prod/udf/main_summary_scalars/udf.sql\n\nformatting:\n  skip:\n  - bigquery_etl/glam/templates/*.sql\n  - sql/moz-fx-data-shared-prod/telemetry/fenix_events_v1/view.sql\n  - stored_procedures/safe_crc32_uuid.sql\n  # - ...\n
"},{"location":"reference/configuration/#accessing-configurations","title":"Accessing configurations","text":"

ConfigLoader can be used in the bigquery_etl tooling codebase to access configuration parameters. bqetl_project.yaml is automatically loaded in ConfigLoader and parameters can be accessed via a get() method:

from bigquery_etl.config import ConfigLoader\n\nskipped_formatting = ConfigLoader.get(\"formatting\", \"skip\", fallback=[])\ndry_run_function = ConfigLoader.get(\"dry_run\", \"function\", fallback=None)\nschema_config_dict = ConfigLoader.get(\"schema\")\n

The ConfigLoader.get() method allows multiple string parameters to reference a configuration value that is stored in a nested structure. A fallback value can be optionally provided in case the configuration parameter is not set.

"},{"location":"reference/configuration/#adding-configuration-parameters","title":"Adding configuration parameters","text":"

New configuration parameters can simply be added to bqetl_project.yaml. ConfigLoader.get() allows for these new parameters simply to be referenced without needing to be changed or updated.

"},{"location":"reference/data_checks/","title":"bqetl Data Checks","text":"

Instructions on how to add data checks can be found in the Adding data checks section below.

"},{"location":"reference/data_checks/#background","title":"Background","text":"

To create more confidence and trust in our data, it is crucial to provide some form of data checks. These checks should uncover problems as soon as possible, ideally as part of the data process creating the data. This includes checking that the data produced follows certain assumptions determined by the dataset owner. These assumptions need to be easy to define, but at the same time flexible enough to encode more complex business logic. For example, checks for null columns, range/size properties, duplicates, table grain, etc.

"},{"location":"reference/data_checks/#bqetl-data-checks-to-the-rescue","title":"bqetl Data Checks to the Rescue","text":"

bqetl data checks aim to provide this ability through a simple interface for specifying our \"assumptions\" about the data a query should produce and checking them against the actual result.

This interface is achieved through a number of Jinja templates that provide \"out-of-the-box\" logic for common checks without having to rewrite it each time, for example checking if any nulls are present in a specific column. These templates can be found here and are available as Jinja macros inside the checks.sql files. The logic can be \"configured\" by passing details relevant to the specific dataset. Check templates get rendered as raw SQL expressions. Take a look at the examples below for practical usage.

It is also possible to write checks using raw SQL by using assertions. This is, for example, useful when writing checks for custom business logic.

"},{"location":"reference/data_checks/#two-categories-of-checks","title":"Two categories of checks","text":"

Each check needs to be categorised with a marker; currently, the following markers are available:

Checks can be marked by including one of the markers on the line preceding the check definition; see the Example checks.sql section for an example.

"},{"location":"reference/data_checks/#adding-data-checks","title":"Adding Data Checks","text":""},{"location":"reference/data_checks/#create-checkssql","title":"Create checks.sql","text":"

Inside the query directory, which usually contains query.sql or query.py, metadata.yaml and schema.yaml, create a new file called checks.sql (unless already exists).

Please make sure each check you add contains a marker (see: the Two categories of checks section above).

Once checks have been added, we need to regenerate the DAG responsible for scheduling the query.

"},{"location":"reference/data_checks/#update-checkssql","title":"Update checks.sql","text":"

If checks.sql already exists for the query, you can always add additional checks to the file by appending it to the list of already defined checks.

When adding additional checks there is no need to regenerate the DAG responsible for scheduling the query, as all checks are executed using a single Airflow task.

"},{"location":"reference/data_checks/#removing-checkssql","title":"Removing checks.sql","text":"

All checks can be removed by deleting the checks.sql file and regenerating the DAG responsible for scheduling the query.

Alternatively, specific checks can be removed by deleting them from the checks.sql file.

"},{"location":"reference/data_checks/#example-checkssql","title":"Example checks.sql","text":"

Checks can either be written as raw SQL, or by referencing existing Jinja macros defined in tests/checks which may take different parameters used to generate the SQL check expression.

Example of what a checks.sql may look like:

-- raw SQL checks\n#fail\nASSERT (\n  (\n    SELECT\n      COUNTIF(ISNULL(country)) / COUNT(*)\n    FROM telemetry.table_v1\n    WHERE submission_date = @submission_date\n  ) > 0.2\n) AS \"More than 20% of clients have country set to NULL\";\n\n-- macro checks\n#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n\n#warn\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n\n#fail\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n\n#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#data-checks-available-with-examples","title":"Data Checks Available with Examples","text":""},{"location":"reference/data_checks/#accepted_values-source","title":"accepted_values (source)","text":"

Usage:

Arguments:\n\ncolumn: str - name of the column to check\nvalues: List[str] - list of accepted values\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#warn\n{{ accepted_values(\"column_1\", [\"value_1\", \"value_2\"],\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#in_range-source","title":"in_range (source)","text":"

Usage:

Arguments:\n\ncolumns: List[str] - A list of columns which we want to check the values of.\nmin: Optional[int] - Minimum value we should observe in the specified columns.\nmax: Optional[int] - Maximum value we should observe in the specified columns.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#is_unique-source","title":"is_unique (source)","text":"

Usage:

Arguments:\n\ncolumns: List[str] - A list of columns which should produce a unique record.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#warn\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n
"},{"location":"reference/data_checks/#min_row_countsource","title":"min_row_count(source)","text":"

Usage:

Arguments:\n\nthreshold: Optional[int] - What is the minimum number of rows we expect (default: 1)\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#fail\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#not_null-source","title":"not_null (source)","text":"

Usage:

Arguments:\n\ncolumns: List[str] - A list of columns which should not contain a null value.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n

Example:

#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n

Please keep in mind that these checks can be combined and specified in the same checks.sql file. For example:

#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n #fail\n {{ min_row_count(1, \"submission_date = @submission_date\") }}\n #fail\n {{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n #warn\n {{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#row_count_within_past_partitions_avgsource","title":"row_count_within_past_partitions_avg(source)","text":"

Compares the row count of the current partition to the average row count of the past number_of_days partitions and checks whether the row count is within the average ± threshold_percentage %.

Usage:

Arguments:\n\nnumber_of_days: int - Number of days we are comparing the row count to\nthreshold_percentage: int - How many percent above or below the average row count is ok.\npartition_field: Optional[str] - What column is the partition_field (default = \"submission_date\")\n

Example:

#fail\n{{ row_count_within_past_partitions_avg(7, 5, \"submission_date\") }}\n

"},{"location":"reference/data_checks/#value_lengthsource","title":"value_length(source)","text":"

Checks that the column has values of specific character length.

Usage:

Arguments:\n\ncolumn: str - Column which will be checked against the `expected_length`.\nexpected_length: int - Describes the expected character length of the value inside the specified columns.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\n

Example:

#warn\n{{ value_length(column=\"country\", expected_length=2, where=\"submission_date = @submission_date\") }}\n

"},{"location":"reference/data_checks/#matches_patternsource","title":"matches_pattern(source)","text":"

Checks that the column values adhere to a pattern based on a regex expression.

Usage:

Arguments:\n\ncolumn: str - Column which values will be checked against the regex.\npattern: str - Regex pattern specifying the expected shape / pattern of the values inside the column.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\nthreshold_fail_percentage: Optional[int] - Percentage of how many rows can fail the check before causing it to fail.\nmessage: Optional[str]: Custom error message.\n

Example:

#warn\n{{ matches_pattern(column=\"country\", pattern=\"^[A-Z]{2}$\", where=\"submission_date = @submission_date\", threshold_fail_percentage=10, message=\"Oops\") }}\n

"},{"location":"reference/data_checks/#running-checks-locally-commands","title":"Running checks locally / Commands","text":"

To list all available commands in the bqetl data checks CLI:

$ ./bqetl check\n\nUsage: bqetl check [OPTIONS] COMMAND [ARGS]...\n\n  Commands for managing and running bqetl data checks.\n\n  \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\n  IN ACTIVE DEVELOPMENT\n\n  The current progress can be found under:\n\n          https://mozilla-hub.atlassian.net/browse/DENG-919\n\n  \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  render  Renders data check query using parameters provided (OPTIONAL).\n  run     Runs data checks defined for the dataset (checks.sql).\n

To see how to use a specific command, use:

$ ./bqetl check [command] --help\n

render

"},{"location":"reference/data_checks/#usage","title":"Usage","text":"
$ ./bqetl check render [OPTIONS] DATASET [ARGS]\n\nRenders data check query using parameters provided (OPTIONAL). The result\nis what would be used to run a check to ensure that the specified dataset\nadheres to the assumptions defined in the corresponding checks.sql file\n\nOptions:\n  --project-id, --project_id TEXT\n                                  GCP project ID\n  --sql_dir, --sql-dir DIRECTORY  Path to directory which contains queries.\n  --help                          Show this message and exit.\n
"},{"location":"reference/data_checks/#example","title":"Example","text":"
./bqetl check render --project_id=moz-fx-data-marketing-prod ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01\n

run

"},{"location":"reference/data_checks/#usage_1","title":"Usage","text":"
$ ./bqetl check run [OPTIONS] DATASET\n\nRuns data checks defined for the dataset (checks.sql).\n\nChecks can be validated using the `--dry_run` flag without executing them:\n\nOptions:\n  --project-id, --project_id TEXT\n                                  GCP project ID\n  --sql_dir, --sql-dir DIRECTORY  Path to directory which contains queries.\n  --dry_run, --dry-run            To dry run the query to make sure it is\n                                  valid\n  --marker TEXT                   Marker to filter checks.\n  --help                          Show this message and exit.\n
"},{"location":"reference/data_checks/#examples","title":"Examples","text":"
# to run checks for a specific dataset\n$ ./bqetl check run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail --marker=warn\n\n# to only dry_run the checks\n$ ./bqetl check run --dry_run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail\n
"},{"location":"reference/incremental/","title":"Incremental Queries","text":""},{"location":"reference/incremental/#benefits","title":"Benefits","text":""},{"location":"reference/incremental/#properties","title":"Properties","text":""},{"location":"reference/public_data/","title":"Public Data","text":"

For background, see Accessing Public Data on docs.telemetry.mozilla.org.

"},{"location":"reference/recommended_practices/","title":"Recommended practices","text":""},{"location":"reference/recommended_practices/#queries","title":"Queries","text":""},{"location":"reference/recommended_practices/#querying-metrics","title":"Querying Metrics","text":""},{"location":"reference/recommended_practices/#query-metadata","title":"Query Metadata","text":"
friendly_name: SSL Ratios\ndescription: >\n  Percentages of page loads Firefox users have performed that were\n  conducted over SSL broken down by country.\nowners:\n  - example@mozilla.com\nlabels:\n  application: firefox\n  incremental: true # incremental queries add data to existing tables\n  schedule: daily # scheduled in Airflow to run daily\n  public_json: true\n  public_bigquery: true\n  review_bugs:\n    - 1414839 # Bugzilla bug ID of data review\n  incremental_export: false # non-incremental JSON export writes all data to a single location\n
"},{"location":"reference/recommended_practices/#views","title":"Views","text":""},{"location":"reference/recommended_practices/#udfs","title":"UDFs","text":""},{"location":"reference/recommended_practices/#large-backfills","title":"Large Backfills","text":""},{"location":"reference/scheduling/","title":"Scheduling Queries in Airflow","text":""},{"location":"reference/stage-deploys-continuous-integration/","title":"Stage Deploys","text":""},{"location":"reference/stage-deploys-continuous-integration/#stage-deploys-in-continuous-integration","title":"Stage Deploys in Continuous Integration","text":"

Before changes, such as adding new fields to existing datasets or adding new datasets, can be deployed to production, bigquery-etl's CI (continuous integration) deploys these changes to a stage environment and uses these stage artifacts to run its various checks.

Currently, the bigquery-etl-integration-test project serves as the stage environment. CI has read and write access to this project, but at no point publishes actual data to it. Only UDFs, table schemas and views are published. The project itself does not have access to any production project, like mozdata, so stage artifacts cannot reference any artifacts that live in production.

Deploying artifacts to stage follows these steps: 1. Once a new pull-request gets created in bigquery-etl, CI pulls in the generated-sql branch to determine all files that show any changes compared to what is deployed in production (it is assumed that the generated-sql branch reflects the artifacts currently deployed in production). All of these changed artifacts (UDFs, tables and views) will be deployed to the stage environment. * This CI step runs after the generate-sql CI step to ensure that checks are also executed on generated queries and that schema.yaml files have been automatically created for queries. 2. The bqetl CLI has a command to run stage deploys, which is called in CI: ./bqetl stage deploy --dataset-suffix=$CIRCLE_SHA1 $FILE_PATHS * --dataset-suffix results in the artifacts being deployed to datasets that are suffixed by the current commit hash. This prevents conflicts when deploying changes for the same artifacts in parallel and helps with debugging deployed artifacts. 3. For every artifact that gets deployed to stage, all dependencies need to be determined and deployed to the stage environment as well, since the stage environment doesn't have access to production. These dependencies are determined by traversing artifact definitions. * Determining dependencies is only relevant for UDFs and views. For queries, available schema.yaml files are simply deployed. * For UDFs, if a UDF calls another UDF, then that UDF needs to be deployed to stage as well. * For views, if a view references another view, table or UDF, then each of these referenced artifacts needs to be available on stage as well, otherwise the view cannot be deployed to stage. * If artifacts are referenced that are not defined as part of the bigquery-etl repo (like stable or live tables), then their schema is determined and a placeholder query.sql file is created. * Dependencies of dependencies also need to be deployed, and so on. 4. Once all artifacts that need to be deployed have been determined, all references to these artifacts in existing SQL files need to be updated. These references need to point to the stage project and the temporary datasets that artifacts will be published to. * Artifacts that get deployed are determined from the files that got changed and any artifacts that are referenced in the SQL definitions of these files, as well as their references, and so on. 5. To run the deploy, all artifacts are copied to sql/bigquery-etl-integration-test into their corresponding temporary datasets. * Any existing SQL tests related to changed artifacts also have their referenced artifacts updated and get copied to a bigquery-etl-integration-test folder. * The deploy is executed in the order: UDFs, tables, views. * UDFs and views get deployed in a way that ensures the right order of deployments (e.g. dependencies are deployed before the views referencing them). 6. Once the deploy has been completed, CI uses these staged artifacts to run its tests. 7. After checks have succeeded, the deployed artifacts are removed from stage. * By default the table expiration is set to 1 hour. * This step also automatically removes any tables and datasets that were previously deployed, are older than an hour, but haven't been removed yet (for example because a CI check failed).

After CI checks have passed and the pull-request has been approved, changes can be merged to main. Once a new version of bigquery-etl has been published the changes can be deployed to production through the bqetl_artifact_deployment Airflow DAG. For more information on artifact deployments to production see: https://docs.telemetry.mozilla.org/concepts/pipeline/artifact_deployment.html

"},{"location":"reference/stage-deploys-continuous-integration/#local-deploys-to-stage","title":"Local Deploys to Stage","text":"

Local changes can be deployed to stage using the ./bqetl stage deploy command:

./bqetl stage deploy \\\n  --dataset-suffix=test \\\n  --copy-sql-to-tmp-dir \\\n  sql/moz-fx-data-shared-prod/firefox_ios/new_profile_activation/view.sql \\\n  sql/mozfun/map/sum/udf.sql\n

Files that should be deployed to stage (for example, files with changes) need to be specified. The stage deploy accepts the following parameters: * --dataset-suffix is an optional suffix that will be added to the datasets deployed to stage * --copy-sql-to-tmp-dir copies SQL stored in sql/ to a temporary folder. Reference updates and any other modifications required to run the stage deploy will be performed in this temporary directory. This is an optional parameter; if not specified, changes get applied to the files directly and can be reverted, for example, by running git checkout -- sql/ * (optional) --remove-updated-artifacts removes artifact files that have been deployed from the \"prod\" folders. This ensures that tests don't run on outdated or undeployed artifacts.

Deployed stage artifacts can be deleted from bigquery-etl-integration-test by running:

./bqetl stage clean --delete-expired --dataset-suffix=test\n
"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 9e000f5347f..7607e6e6328 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ