diff --git a/bqetl/index.html b/bqetl/index.html index ed705bc5d3b..4eff55a2779 100644 --- a/bqetl/index.html +++ b/bqetl/index.html @@ -2621,10 +2621,10 @@
initialize
Examples
diff --git a/search/search_index.json b/search/search_index.json index 3618588eee2..c9c3cc26456 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"bqetl/","title":"bqetl CLI","text":"The bqetl
command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.
Running some commands, for example to create or query tables, will require Mozilla GCP access.
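For example, before running such commands you would typically authenticate to GCP first (the same command is used in the workflows later in this guide):
gcloud auth login --update-adc  # authenticate to GCP\n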
"},{"location":"bqetl/#installation","title":"Installation","text":"Follow the Quick Start to set up bigquery-etl and the bqetl CLI.
"},{"location":"bqetl/#configuration","title":"Configuration","text":"bqetl
can be configured via the bqetl_project.yaml
file. See Configuration to find available configuration options.
To list all available commands in the bqetl CLI:
$ ./bqetl\n\nUsage: bqetl [OPTIONS] COMMAND [ARGS]...\n\n CLI tools for working with bigquery-etl.\n\nOptions:\n --version Show the version and exit.\n --help Show this message and exit.\n\nCommands:\n alchemer Commands for importing alchemer data.\n dag Commands for managing DAGs.\n dependency Build and use query dependency graphs.\n dryrun Dry run SQL.\n format Format SQL.\n glam Tools for GLAM ETL.\n mozfun Commands for managing mozfun routines.\n query Commands for managing queries.\n routine Commands for managing routines.\n stripe Commands for Stripe ETL.\n view Commands for managing views.\n backfill Commands for managing backfills.\n
See help for any command:
$ ./bqetl [command] --help\n
"},{"location":"bqetl/#autocomplete","title":"Autocomplete","text":"CLI autocomplete for bqetl
can be enabled for bash and zsh shells using the script/bqetl_complete
script:
source script/bqetl_complete\n
Then pressing tab after bqetl
commands should print possible commands, e.g. for zsh:
% bqetl query<TAB><TAB>\nbackfill -- Run a backfill for a query.\ncreate -- Create a new query with name...\ninfo -- Get information about all or specific...\ninitialize -- Run a full backfill on the destination...\nrender -- Render a query Jinja template.\nrun -- Run a query.\n...\n
source script/bqetl_complete
can also be added to ~/.bashrc
or ~/.zshrc
to persist settings across shell instances.
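For example, to persist it for zsh (assuming bqetl is run from the repository root):
echo 'source script/bqetl_complete' >> ~/.zshrc\nsource ~/.zshrc\n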
For more details on shell completion, see the click documentation.
"},{"location":"bqetl/#query","title":"query
","text":"Commands for managing queries.
"},{"location":"bqetl/#create","title":"create
","text":"Create a new query with name ., for example: telemetry_derived.active_profiles. Use the --project_id
option to change the project the query is added to; default is moz-fx-data-shared-prod
. Views are automatically generated in the publicly facing dataset.
Usage
$ ./bqetl query create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--owner: Owner of the query (email address)\n--dag: Name of the DAG the query should be scheduled under.If there is no DAG name specified, the query isscheduled by default in DAG bqetl_default.To skip the automated scheduling use --no_schedule.To see available DAGs run `bqetl dag info`.To create a new DAG run `bqetl dag create`.\n--no_schedule: Using this option creates the query without scheduling information. Use `bqetl query schedule` to add it manually if required.\n
Examples
./bqetl query create telemetry_derived.deviations_v1 \\\n --owner=example@mozilla.com\n\n\n# The query version gets autocompleted to v1. Queries are created in the\n# _derived dataset and accompanying views in the public dataset.\n./bqetl query create telemetry.deviations --owner=example@mozilla.com\n
"},{"location":"bqetl/#schedule","title":"schedule
","text":"Schedule an existing query
Usage
$ ./bqetl query schedule [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--dag: Name of the DAG the query should be scheduled under. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.\n--depends_on_past: Only execute query if previous scheduled run succeeded.\n--task_name: Custom name for the Airflow task. By default the task name is a combination of the dataset and table name.\n
Examples
./bqetl query schedule telemetry_derived.deviations_v1 \\\n --dag=bqetl_deviations\n\n\n# Set a specific name for the task\n./bqetl query schedule telemetry_derived.deviations_v1 \\\n --dag=bqetl_deviations \\\n --task-name=deviations\n
"},{"location":"bqetl/#info","title":"info
","text":"Get information about all or specific queries.
Usage
$ ./bqetl query info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Get info for specific queries\n./bqetl query info telemetry_derived.*\n\n\n# Get cost and last update timestamp information\n./bqetl query info telemetry_derived.clients_daily_v6 \\\n --cost --last_updated\n
"},{"location":"bqetl/#backfill","title":"backfill
","text":"Run a backfill for a query. Additional parameters will get passed to bq.
Usage
$ ./bqetl query backfill [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--start_date: First date to be backfilled\n--end_date: Last date to be backfilled\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--dry_run: Dry run the backfill\n--max_rows: How many rows to return in the result\n--parallelism: How many threads to run backfill in parallel\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query.\n--checks: Whether to run checks during backfill\n--custom_query_path: Name of a custom query to run the backfill. If not given, the process runs as usual.\n--checks_file_name: Name of a custom data checks file to run after each partition backfill. E.g. custom_checks.sql. Optional.\n--scheduling_overrides: Pass overrides as a JSON string for scheduling sections: parameters and/or date_partition_parameter as needed.\n
Examples
# Backfill for specific date range\n# second comment line\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31\n\n\n# Dryrun backfill for specific date range and exclude date\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31 \\\n --exclude=2021-03-03 \\\n --dry_run\n
"},{"location":"bqetl/#run","title":"run
","text":"Run a query. Additional parameters will get passed to bq. If a destination_table is set, the query result will be written to BigQuery. Without a destination_table specified, the results are not stored. If the name
is not found within the sql/
folder, bqetl assumes it hasn't been generated yet and will start the generation process for all sql_generators/
files. This generation process takes some time and issues dry run calls against BigQuery, which is expected. Additional parameters (any parameters not listed under Options) must come after the query name; otherwise, the first parameter that is not an option is interpreted as the query name and, since it cannot be found, the generation process will start.
Usage
$ ./bqetl query run [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--public_project_id: Project with publicly accessible data\n--destination_table: Destination table name results are written to. If not set, the query result will not be written to BigQuery.\n--dataset_id: Destination dataset results are written to. If not set, determines destination dataset based on query.\n
Examples
# Run a query by name\n./bqetl query run telemetry_derived.ssl_ratios_v1\n\n\n# Run a query file\n./bqetl query run /path/to/query.sql\n\n\n# Run a query and save the result to BigQuery\n./bqetl query run telemetry_derived.ssl_ratios_v1 --project_id=moz-fx-data-shared-prod --dataset_id=telemetry_derived --destination_table=ssl_ratios_v1\n
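Because additional parameters are forwarded to bq, something like the following should also work (illustrative; --parameter is a standard bq flag rather than a bqetl option):
# Forward an extra bq parameter after the query name\n./bqetl query run telemetry_derived.ssl_ratios_v1 --parameter=submission_date:DATE:2021-03-01\n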
"},{"location":"bqetl/#run-multipart","title":"run-multipart
","text":"Run a multipart query.
Usage
$ ./bqetl query run-multipart [OPTIONS] [query_dir]\n\nOptions:\n\n--using: comma separated list of join columns to use when combining results\n--parallelism: Maximum number of queries to execute concurrently\n--dataset_id: Default dataset, if not specified all tables must be qualified with dataset\n--project_id: GCP project ID\n--temp_dataset: Dataset where intermediate query results will be temporarily stored, formatted as PROJECT_ID.DATASET_ID\n--destination_table: table where combined results will be written\n--time_partitioning_field: time partition field on the destination table\n--clustering_fields: comma separated list of clustering fields on the destination table\n--dry_run: Print bytes that would be processed for each part and don't run queries\n--parameters: query parameter(s) to pass when running parts\n--priority: Priority for BigQuery query jobs; BATCH priority will significantly slow down queries if reserved slots are not enabled for the billing project; defaults to INTERACTIVE\n--schema_update_options: Optional options for updating the schema.\n
Examples
# Run a multipart query\n./bqetl query run_multipart /path/to/query.sql\n
"},{"location":"bqetl/#validate","title":"validate
","text":"Validate a query. Checks formatting, scheduling information and dry runs the query.
Usage
$ ./bqetl query validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --ignore-dryrun-skip.\n--no_dryrun: Skip running dryrun. Default is False.\n
Examples
./bqetl query validate telemetry_derived.clients_daily_v6\n\n\n# Validate query not in shared-prod\n./bqetl query validate \\\n --use_cloud_function=false \\\n --project_id=moz-fx-data-marketing-prod \\\n ga_derived.blogs_goals_v1\n
"},{"location":"bqetl/#initialize","title":"initialize
","text":"Run a full backfill on the destination table for the query. Using this command will: - Create the table if it doesn't exist and run a full backfill. - Run a full backfill if the table exists and is empty. - Raise an exception if the table exists and has data, or if the table exists and the schema doesn't match the query. It supports query.sql
files that use the is_init() pattern. To run in parallel per sample_id, include a @sample_id parameter in the query.
Usage
$ ./bqetl query initialize [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--dry_run: Dry run the initialization\n--parallelism: Number of threads for parallel processing\n--skip_existing: Skip initialization for existing artifacts. This ensures that artifacts, like materialized views only get initialized if they don't already exist.\n--force: Run the initialization even if the destination table contains data.\n
Examples
Examples:\n - For init.sql files: ./bqetl query initialize telemetry_derived.ssl_ratios_v1\n - For query.sql files and parallel run: ./bqetl query initialize sql/moz-fx-data-shared-prod/telemetry_derived/clients_first_seen_v2/query.sql\n
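A minimal sketch of what the is_init() pattern in such a query.sql might look like (the table and field names below are illustrative, not taken from the repository):
SELECT\n  DATE(submission_timestamp) AS submission_date,\n  COUNT(*) AS ping_count\nFROM\n  `moz-fx-data-shared-prod`.telemetry_stable.main_v5\nWHERE\n{% if is_init() %}\n  -- initialization: backfill the full history\n  DATE(submission_timestamp) >= '2020-01-01'\n{% else %}\n  -- regular incremental run: process a single day\n  DATE(submission_timestamp) = @submission_date\n{% endif %}\nGROUP BY\n  submission_date\n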
"},{"location":"bqetl/#render","title":"render
","text":"Render a query Jinja template.
Usage
$ ./bqetl query render [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--output_dir: Output directory generated SQL is written to. If not specified, rendered queries are printed to console.\n--parallelism: Number of threads for parallel processing\n
Examples
./bqetl query render telemetry_derived.ssl_ratios_v1 \\\n --output-dir=/tmp\n
"},{"location":"bqetl/#schema","title":"schema
","text":"Commands for managing query schemas.
"},{"location":"bqetl/#update","title":"update
","text":"Update the query schema based on the destination table schema and the query schema. If no schema.yaml file exists for a query, one will be created.
Usage
$ ./bqetl query schema update [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--update_downstream: Update downstream dependencies. GCP authentication required.\n--tmp_dataset: GCP datasets for creating updated tables temporarily.\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--parallelism: Number of threads for parallel processing\n--is_init: Indicates whether the `is_init()` condition should be set to true or false.\n
Examples
./bqetl query schema update telemetry_derived.clients_daily_v6\n\n# Update schema including downstream dependencies (requires GCP)\n./bqetl query schema update telemetry_derived.clients_daily_v6 --update-downstream\n
"},{"location":"bqetl/#deploy","title":"deploy
","text":"Deploy the query schema.
Usage
$ ./bqetl query schema deploy [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--force: Deploy the schema file without validating that it matches the query\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--skip_existing: Skip updating existing tables. This option ensures that only new tables get deployed.\n--skip_external_data: Skip publishing external data, such as Google Sheets.\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query. Must be fully qualified (project.dataset.table).\n--parallelism: Number of threads for parallel processing\n
Examples
./bqetl query schema deploy telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#validate_1","title":"validate
","text":"Validate the query schema
Usage
$ ./bqetl query schema validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n
Examples
./bqetl query schema validate telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#dag","title":"dag
","text":"Commands for managing DAGs.
"},{"location":"bqetl/#info_1","title":"info
","text":"Get information about available DAGs.
Usage
$ ./bqetl dag info [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--with_tasks: Include scheduled tasks\n
Examples
# Get information about all available DAGs\n./bqetl dag info\n\n# Get information about a specific DAG\n./bqetl dag info bqetl_ssl_ratios\n\n# Get information about a specific DAG including scheduled tasks\n./bqetl dag info --with_tasks bqetl_ssl_ratios\n
"},{"location":"bqetl/#create_1","title":"create
","text":"Create a new DAG with name bqetl_, for example: bqetl_search When creating new DAGs, the DAG name must have a bqetl_
prefix. Created DAGs are added to the dags.yaml
file.
Usage
$ ./bqetl dag create [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--schedule_interval: Schedule interval of the new DAG. Schedule intervals can be either in CRON format or one of: once, hourly, daily, weekly, monthly, yearly or a timedelta []d[]h[]m\n--owner: Email address of the DAG owner\n--description: Description for DAG\n--tag: Tag to apply to the DAG\n--start_date: First date for which scheduled queries should be executed\n--email: Email addresses that Airflow will send alerts to\n--retries: Number of retries Airflow will attempt in case of failures\n--retry_delay: Time period Airflow will wait after failures before running failed tasks again\n
Examples
./bqetl dag create bqetl_core \\\n--schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"Tables derived from `core` pings sent by mobile applications.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-25\n\n\n# Create DAG and overwrite default settings\n./bqetl dag create bqetl_ssl_ratios --schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"The DAG schedules SSL ratios queries.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-20 \\\n--email=example2@mozilla.com \\\n--email=example3@mozilla.com \\\n--retries=2 \\\n--retry_delay=30m\n
"},{"location":"bqetl/#generate","title":"generate
","text":"Generate Airflow DAGs from DAG definitions.
Usage
$ ./bqetl dag generate [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n
Examples
# Generate all DAGs\n./bqetl dag generate\n\n# Generate a specific DAG\n./bqetl dag generate bqetl_ssl_ratios\n
"},{"location":"bqetl/#remove","title":"remove
","text":"Remove a DAG. This will also remove the scheduling information from the queries that were scheduled as part of the DAG.
Usage
$ ./bqetl dag remove [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n
Examples
# Remove a specific DAG\n./bqetl dag remove bqetl_vrbrowser\n
"},{"location":"bqetl/#dependency","title":"dependency
","text":"Build and use query dependency graphs.
"},{"location":"bqetl/#show","title":"show
","text":"Show table references in sql files.
Usage
$ ./bqetl dependency show [OPTIONS] [paths]\n
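Example (the path below is illustrative; any SQL file or directory under sql/ can be passed):
# Show table references for a single query\n./bqetl dependency show sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1/query.sql\n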
"},{"location":"bqetl/#record","title":"record
","text":"Record table references in metadata. Fails if metadata already contains references section.
Usage
$ ./bqetl dependency record [OPTIONS] [paths]\n
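Example (illustrative path; the command fails if the metadata already contains a references section):
# Record table references in the query's metadata\n./bqetl dependency record sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1/\n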
"},{"location":"bqetl/#dryrun","title":"dryrun
","text":"Dry run SQL. Uses the dryrun Cloud Function by default which only has access to shared-prod. To dryrun queries accessing tables in another project use set --use-cloud-function=false
and ensure that the command line has access to a GCP service account.
Usage
$ ./bqetl dryrun [OPTIONS] [paths]\n\nOptions:\n\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_skip: Respect or ignore query skip configuration. Default is --respect-skip.\n--project: GCP project to perform dry run in when --use_cloud_function=False\n
Examples
Examples:\n./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/\n\n# Dry run SQL with tables that are not in shared prod\n./bqetl dryrun --use-cloud-function=false sql/moz-fx-data-marketing-prod/\n
"},{"location":"bqetl/#format","title":"format
","text":"Format SQL files.
Usage
$ ./bqetl format [OPTIONS] [paths]\n\nOptions:\n\n--check: do not write changes, just return status; return code 0 indicates nothing would change; return code 1 indicates some files would be reformatted\n--parallelism: Number of threads for parallel processing\n
Examples
# Format a specific file\n./bqetl format sql/moz-fx-data-shared-prod/telemetry/core/view.sql\n\n# Format all SQL files in `sql/`\n./bqetl format sql\n\n# Format standard in (will write to standard out)\necho 'SELECT 1,2,3' | ./bqetl format\n
"},{"location":"bqetl/#routine","title":"routine
","text":"Commands for managing routines for internal use.
"},{"location":"bqetl/#create_2","title":"create
","text":"Create a new routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_prodecure flag.
Usage
$ ./bqetl routine create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n
Examples
# Create a UDF\n./bqetl routine create --udf udf.array_slice\n\n\n# Create a stored procedure\n./bqetl routine create --stored_procedure udf.events_daily\n\n\n# Create a UDF in a project other than shared-prod\n./bqetl routine create --udf udf.active_last_week --project=moz-fx-data-marketing-prod\n
"},{"location":"bqetl/#info_2","title":"info
","text":"Get routine information.
Usage
$ ./bqetl routine info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n
Examples
# Get information about all internal routines in a specific dataset\n./bqetl routine info udf.*\n\n\n# Get usage information of specific routine\n./bqetl routine info --usages udf.get_key\n
"},{"location":"bqetl/#validate_2","title":"validate
","text":"Validate formatting of routines and run tests.
Usage
$ ./bqetl routine validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n
Examples
# Validate all routines\n./bqetl routine validate\n\n\n# Validate selected routines\n./bqetl routine validate udf.*\n
"},{"location":"bqetl/#publish","title":"publish
","text":"Publish routines to BigQuery. Requires service account access.
Usage
$ ./bqetl routine publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing udfs.\n
Examples
# Publish all routines\n./bqetl routine publish\n\n\n# Publish selected routines\n./bqetl routine publish udf.*\n
"},{"location":"bqetl/#rename","title":"rename
","text":"Rename routine or routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl routine rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Rename routine\n./bqetl routine rename udf.array_slice udf.list_slice\n\n\n# Rename routine matching a specific pattern\n./bqetl routine rename udf.array_* udf.list_*\n
"},{"location":"bqetl/#mozfun","title":"mozfun
","text":"Commands for managing public mozfun routines.
"},{"location":"bqetl/#create_3","title":"create
","text":"Create a new mozfun routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_prodecure flag. UDFs are added to the mozfun
project.
Usage
$ ./bqetl mozfun create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n
Examples
# Create a UDF\n./bqetl mozfun create --udf bytes.zero_right\n\n\n# Create a stored procedure\n./bqetl mozfun create --stored_procedure event_analysis.events_daily\n
"},{"location":"bqetl/#info_3","title":"info
","text":"Get mozfun routine information.
Usage
$ ./bqetl mozfun info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n
Examples
# Get information about all internal routines in a specific dataset\n./bqetl mozfun info hist.*\n\n\n# Get usage information of specific routine\n./bqetl mozfun info --usages hist.mean\n
"},{"location":"bqetl/#validate_3","title":"validate
","text":"Validate formatting of mozfun routines and run tests.
Usage
$ ./bqetl mozfun validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n
Examples
# Validate all routines\n./bqetl mozfun validate\n\n\n# Validate selected routines\n./bqetl mozfun validate hist.*\n
"},{"location":"bqetl/#publish_1","title":"publish
","text":"Publish mozfun routines. This command is used by Airflow only.
Usage
$ ./bqetl mozfun publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing udfs.\n
"},{"location":"bqetl/#rename_1","title":"rename
","text":"Rename mozfun routine or mozfun routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl mozfun rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Rename routine\n./bqetl mozfun rename hist.extract hist.ext\n\n\n# Rename routine matching a specific pattern\n./bqetl mozfun rename *.array_* *.list_*\n\n\n# Rename routine dataset\n./bqetl mozfun rename hist.* histogram.*\n
"},{"location":"bqetl/#backfill_1","title":"backfill
","text":"Commands for managing backfills.
"},{"location":"bqetl/#create_4","title":"create
","text":"Create a new backfill entry in the backfill.yaml file. Create a backfill.yaml file if it does not already exist.
Usage
$ ./bqetl backfill create [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--start_date: First date to be backfilled. Date format: yyyy-mm-dd\n--end_date: Last date to be backfilled. Date format: yyyy-mm-dd\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--watcher: Watcher of the backfill (email address)\n--custom_query_path: Path of the custom query to run the backfill. Optional.\n--shredder_mitigation: Whether to run a backfill using an auto-generated query that mitigates shredder effect.\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n
Examples
./bqetl backfill create moz-fx-data-shared-prod.telemetry_derived.deviations_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31 \\\n --exclude=2021-03-03 \\\n
"},{"location":"bqetl/#validate_4","title":"validate
","text":"Validate backfill.yaml file format and content.
Usage
$ ./bqetl backfill validate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
./bqetl backfill validate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# validate all backfill.yaml files if table is not specified\nUse the `--project_id` option to change the project to be validated;\ndefault is `moz-fx-data-shared-prod`.\n\n ./bqetl backfill validate\n
"},{"location":"bqetl/#info_4","title":"info
","text":"Get backfill(s) information from all or specific table(s).
Usage
$ ./bqetl backfill info [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Filter backfills with this status.\n
Examples
# Get info for specific table.\n./bqetl backfill info moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill info\n\n\n# Get info from all tables with specific status.\n./bqetl backfill info --status=Initiate\n
"},{"location":"bqetl/#scheduled","title":"scheduled
","text":"Get information on backfill(s) that require processing.
Usage
$ ./bqetl backfill scheduled [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Whether to get backfills to process or to complete.\n--json_path: None\n
Examples
# Get info for specific table.\n./bqetl backfill scheduled moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill scheduled\n
"},{"location":"bqetl/#initiate","title":"initiate
","text":"Process entry in backfill.yaml with Initiate status that has not yet been processed.
Usage
$ ./bqetl backfill initiate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--parallelism: Maximum number of queries to execute concurrently\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Initiate backfill entry for specific table\n./bqetl backfill initiate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"bqetl/#complete","title":"complete
","text":"Complete entry in backfill.yaml with Complete status that has not yet been processed..
Usage
$ ./bqetl backfill complete [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Complete backfill entry for specific table\n./bqetl backfill complete moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"cookbooks/common_workflows/","title":"Common bigquery-etl workflows","text":"This is a quick guide of how to perform common workflows in bigquery-etl using the bqetl
CLI.
For any workflow, the bigquery-etl repository needs to be locally available, for example by cloning the repository, and the bqetl
CLI needs to be installed by running ./bqetl bootstrap
.
The Creating derived datasets tutorial provides a more detailed guide on creating scheduled queries.
To add a new scheduled query: 1. Run ./bqetl query create <dataset>.<table>_<version>. 2. Open the query.sql file that has been created in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/ to write the query. 3. Run ./bqetl query schema update <dataset>.<table>_<version> to generate the schema.yaml file, and optionally add column descriptions to schema.yaml. 4. Open the metadata.yaml file in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/ and fill in the query metadata. 5. Run ./bqetl query validate <dataset>.<table>_<version> to dry run and format the query. 6. Select a DAG from the ./bqetl dag info list or create a new DAG with ./bqetl dag create <bqetl_new_dag>. 7. Run ./bqetl query schedule <dataset>.<table>_<version> --dag <bqetl_dag> to schedule the query. 8. Once the change is merged, the table is deployed via the bqetl_artifact_deployment Airflow DAG. 9. Backfill the table with ./bqetl query backfill --project-id <project id> <dataset>.<table>_<version>.
To update an existing query: 1. Open the query.sql file of the query to be updated and make changes. 2. Run ./bqetl query validate <dataset>.<table>_<version> to dry run and format the query. 3. Run ./bqetl dag generate <bqetl_dag> to update the DAG file. 4. Run ./bqetl query schema update <dataset>.<table>_<version> to make local schema.yaml updates. 5. Once the change is merged, the updated table is deployed via the bqetl_artifact_deployment Airflow DAG.
Airflow DAGWe enforce consistent SQL formatting as part of CI. After adding or changing a query, use ./bqetl format
to apply formatting rules.
Directories and files passed as arguments to ./bqetl format
will be formatted in place, with directories recursively searched for files with a .sql
extension, e.g.:
$ echo 'SELECT 1,2,3' > test.sql\n$ ./bqetl format test.sql\nmodified test.sql\n1 file(s) modified\n$ cat test.sql\nSELECT\n 1,\n 2,\n 3\n
If no arguments are specified the script will read from stdin and write to stdout, e.g.:
$ echo 'SELECT 1,2,3' | ./bqetl format\nSELECT\n 1,\n 2,\n 3\n
To turn off sql formatting for a block of SQL, wrap it in format:off
and format:on
comments, like this:
SELECT\n -- format:off\n submission_date, sample_id, client_id\n -- format:on\n
"},{"location":"cookbooks/common_workflows/#add-a-new-field-to-a-table-schema","title":"Add a new field to a table schema","text":"Adding a new field to a table schema also means that the field has to propagate to several downstream tables, which makes it a more complex case.
query.sql
file inside the <dataset>.<table>
location and add the new definitions for the field../bqetl format <path to the query>
to format the query. Alternatively, run ./bqetl format $(git ls-tree -d HEAD --name-only)
validate the format of all queries that have been modified../bqetl query validate <dataset>.<table>
to dry run the query.jobs.create
permissions in moz-fx-data-shared-prod
), run:gcloud auth login --update-adc # to authenticate to GCP
gcloud config set project mozdata # to set the project
./bqetl query validate --use-cloud-function=false --project-id=mozdata <full path to the query file>
./bqetl query schema update <dataset>.<table> --update_downstream
to make local schema.yaml updates and update schemas of downstream dependencies.--update_downstream
is optional as it takes longer. It is recommended when you know that there are downstream dependencies whose schema.yaml
need to be updated, in which case, the update will happen automatically.--force
should only be used in very specific cases, particularly the clients_last_seen
tables. It skips some checks that would otherwise catch some error scenarios.bqetl_artifact_deployment
Airflow DAGThe following is an example to update a new field in telemetry_derived.clients_daily_v6
clients_daily_v6
query.sql
file and add new field definitions../bqetl format sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/query.sql
./bqetl query validate telemetry_derived.clients_daily_v6
.gcloud auth login --update-adc
./bqetl query schema update telemetry_derived.clients_daily_v6 --update_downstream --ignore-dryrun-skip --use-cloud-function=false
.schema.yaml
files of downstream dependencies, like clients_last_seen_v1
are updated.--use-cloud-function=false
is necessary when updating tables related to clients_daily
but optional for other tables. The dry run cloud function times out when fetching the deployed table schema for some of clients_daily
s downstream dependencies. Using GCP credentials instead works, however this means users need to have permissions to run queries in moz-fx-data-shared-prod
.bqetl_artifact_deployment
Airflow DAGDeleting a field from an existing table schema should be done only when is totally neccessary. If you decide to delete it: 1. Validate if there is data in the column and make sure data it is either backed up or it can be reprocessed. 1. Follow Big Query docs recommendations for deleting. 1. If the column size exceeds the allowed limit, consider setting the field as NULL. See this search_clients_daily_v8 PR for an example.
"},{"location":"cookbooks/common_workflows/#adding-a-new-mozfun-udf","title":"Adding a new mozfun UDF","text":"./bqetl mozfun create <dataset>.<name> --udf
.udf.sql
file in sql/mozfun/<dataset>/<name>/
and add UDF the definition and tests../bqetl mozfun validate <dataset>.<name>
for formatting and running tests.mozfun
DAG and clear latest run.Internal UDFs are usually only used by specific queries. If your UDF might be useful to others consider publishing it as a mozfun
UDF.
./bqetl routine create <dataset>.<name> --udf
udf.sql
in sql/moz-fx-data-shared-prod/<dataset>/<name>/
file and add UDF definition and tests./bqetl routine validate <dataset>.<name>
for formatting and running testsbqetl_artifact_deployment
Airflow DAGThe same steps as creating a new UDF apply for creating stored procedures, except when initially creating the procedure execute ./bqetl mozfun create <dataset>.<name> --stored_procedure
or ./bqetl routine create <dataset>.<name> --stored_procedure
for internal stored procedures.
udf.sql
file and make updates./bqetl mozfun validate <dataset>.<name>
or ./bqetl routine validate <dataset>.<name>
for formatting and running tests./bqetl mozfun rename <dataset>.<name> <new_dataset>.<new_name>
To provision a new BigQuery dataset for holding tables, you'll need to create a dataset_metadata.yaml
which will cause the dataset to be automatically deployed after merging. Changes to existing datasets may trigger manual operator approval (such as changing access policies). For more on access controls, see Data Access Workgroups in Mana.
The bqetl query create
command will automatically generate a skeleton dataset_metadata.yaml
file if the query name contains a dataset that is not yet defined.
See example with commentary for telemetry_derived
:
friendly_name: Telemetry Derived\ndescription: |-\n Derived data based on pings from legacy Firefox telemetry, plus many other\n general-purpose derived tables\nlabels: {}\n\n# Base ACL should can be:\n# \"derived\" for `_derived` datasets that contain concrete tables\n# \"view\" for user-facing datasets containing virtual views\ndataset_base_acl: derived\n\n# Datasets with user-facing set to true will be created both in shared-prod\n# and in mozdata; this should be false for all `_derived` datasets\nuser_facing: false\n\n# Most datasets can have mozilla-confidential access like below, but some\n# datasets will be defined with more restricted access or with additional\n# access for services; see \"Data Access Workgroups\" link above.\nworkgroup_access:\n- role: roles/bigquery.dataViewer\n members:\n - workgroup:mozilla-confidential\n
"},{"location":"cookbooks/common_workflows/#publishing-data","title":"Publishing data","text":"See also the reference for Public Data.
metadata.yaml
file of the query to be publishedpublic_bigquery: true
and optionally public_json: true
review_bugs
mozilla-public-data
init.sql
file exists for the query, change the destination project for the created table to mozilla-public-data
moz-fx-data-shared-prod
referencing the public datasetWhen adding a new library to the Python requirements, first add the library to the requirements and then add any meta-dependencies into constraints. Constraints are discovered by installing requirements into a fresh virtual environment. A dependency should be added to either requirements.txt
or constraints.txt
, but not both.
# Create a python virtual environment (not necessary if you have already\n# run `./bqetl bootstrap`)\npython3 -m venv venv/\n\n# Activate the virtual environment\nsource venv/bin/activate\n\n# If not installed:\npip install pip-tools --constraint requirements.in\n\n# Add the dependency to requirements.in e.g. Jinja2.\necho Jinja2==2.11.1 >> requirements.in\n\n# Compile hashes for new dependencies.\npip-compile --generate-hashes requirements.in\n\n# Deactivate the python virtual environment.\ndeactivate\n
"},{"location":"cookbooks/common_workflows/#making-a-pull-request-from-a-fork","title":"Making a pull request from a fork","text":"When opening a pull-request to merge a fork, the manual-trigger-required-for-fork
CI task will fail and some integration test tasks will be skipped. A user with repository write permissions will have to run the Push to upstream workflow and provide the <username>:<branch>
of the fork as parameter. The parameter will also show up in the logs of the manual-trigger-required-for-fork
CI task together with more detailed instructions. Once the workflow has been executed, the CI tasks, including the integration tests, of the PR will be executed.
The repository documentation is built using MkDocs. To generate and check the docs locally:
./bqetl docs generate --output_dir generated_docs
generated_docs
directorymkdocs serve
to start a local mkdocs
server.Each code files in the bigquery-etl repository can have a set of owners who are responsible to review and approve changes, and are automatically assigned as PR reviewers. The query files in the repo also benefit from the metadata labels to be able to validate and identify the data that is change controlled.
Here is a sample PR with the implementation of change control for contextual services data.
mozilla > telemetry
.metadata.yaml
for the query where you want to apply change control:owners
, add the selected GitHub identity, along with the list of owners' emails.labels
, add change_controlled: true
. This enables identifying change controlled data in the BigQuery console and in the Data Catalog.CODEOWNERS
:CODEOWNERS
file located in the root of the repo./sql_generators/active_users/templates/ @mozilla/kpi_table_reviewers
.script/bqetl query validate <query_path>
./sql-generators
, first run ./script/bqetl generate <path>
and then run script/bqetl query validate <query_path>
.This guide takes you through the creation of a simple derived dataset using bigquery-etl and scheduling it using Airflow, to be updated on a daily basis. It applies to the products we ship to customers, that use (or will use) the Glean SDK.
This guide also includes the specific instructions to set it as a public dataset. Make sure you only set the dataset public if you expect the data to be available outside Mozilla. Read our public datasets reference for context.
To illustrate the overall process, we will use a simple test case and a small Glean application for which we want to generate an aggregated dataset based on the raw ping data.
If you are interested in looking at the end result, you can view the pull request at mozilla/bigquery-etl#1760.
"},{"location":"cookbooks/creating_a_derived_dataset/#background","title":"Background","text":"Mozregression is a developer tool used to help developers and community members bisect builds of Firefox to find a regression range in which a bug was introduced. It forms a key part of our quality assurance process.
In this example, we will create a table of aggregated metrics related to mozregression
, that will be used in dashboards to help prioritize feature development inside Mozilla.
Set up bigquery-etl on your system per the instructions in the README.md.
"},{"location":"cookbooks/creating_a_derived_dataset/#create-the-query","title":"Create the Query","text":"The first step is to create a query file and decide on the name of your derived dataset. In this case, we'll name it org_mozilla_mozregression_derived.mozregression_aggregates
.
The org_mozilla_mozregression_derived
part represents a BigQuery dataset, which is essentially a container of tables. By convention, we use the _derived
postfix to hold derived tables like this one.
Run:
./bqetl query create <dataset>.<table_name>\n
In our example: ./bqetl query create org_mozilla_mozregression_derived.mozregression_aggregates --dag bqetl_internal_tooling\n
This command does three things:
metadata.yaml
and query.sql
representing the query to build the dataset in sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1
sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates
.bqetl_internal_tooling
.bqetl_default
.--no-schedule
is used, queries are not schedule. This option is available for queries that run once or should be scheduled at a later time. The query can be manually scheduled at a later time.We generate the view to have a stable interface, while allowing the dataset backend to evolve over time. Views are automatically published to the mozdata
project.
The next step is to modify the generated metadata.yaml
and query.sql
sections with specific information.
Let's look at what the metadata.yaml
file for our example looks like. Make sure to adapt this file for your own dataset.
friendly_name: mozregression aggregates\ndescription:\n Aggregated metrics of mozregression usage\nlabels:\n incremental: true\nowners:\n - wlachance@mozilla.com\nbigquery:\n time_partitioning:\n type: day\n field: date\n require_partition_filter: true\n expiration_days: null\n clustering:\n fields:\n - app_used\n - os\n
Most of the fields are self-explanatory. incremental
means that the table is updated incrementally, e.g. a new partition gets added/updated to the destination table whenever the query is run. For non-incremental queries the entire destination is overwritten when the query is executed.
For big datasets make sure to include optimization strategies. Our aggregation is small so it is only for illustration purposes that we are including a partition by the date
field and a clustering on app_used
and os
.
Setting the dataset as public means that it will be both in Mozilla's public BigQuery project and a world-accessible JSON endpoint, and is a process that requires a data review. The required labels are: public_json
, public_bigquery
and review_bugs
which refers to the Bugzilla bug where opening this data set up to the public was approved: we'll get to that in a subsequent section.
friendly_name: mozregression aggregates\ndescription:\n Aggregated metrics of mozregression usage\nlabels:\n incremental: true\n public_json: true\n public_bigquery: true\n review_bugs:\n - 1691105\nowners:\n - wlachance@mozilla.com\nbigquery:\n time_partitioning:\n type: day\n field: date\n require_partition_filter: true\n expiration_days: null\n clustering:\n fields:\n - app_used\n - os\n
"},{"location":"cookbooks/creating_a_derived_dataset/#fill-out-the-query","title":"Fill out the query","text":"Now that we've filled out the metadata, we can look into creating a query. In many ways, this is similar to creating a SQL query to run on BigQuery in other contexts (e.g. on sql.telemetry.mozilla.org or the BigQuery console)-- the key difference is that we use a @submission_date
parameter so that the query can be run on a day's worth of data to update the underlying table incrementally.
Test your query and add it to the query.sql
file.
In our example, the query is tested in sql.telemetry.mozilla.org
, and the query.sql
file looks like this:
SELECT\n DATE(submission_timestamp) AS date,\n client_info.app_display_version AS mozregression_version,\n metrics.string.usage_variant AS mozregression_variant,\n metrics.string.usage_app AS app_used,\n normalized_os AS os,\n mozfun.norm.truncate_version(normalized_os_version, \"minor\") AS os_version,\n count(DISTINCT(client_info.client_id)) AS distinct_clients,\n count(*) AS total_uses\nFROM\n `moz-fx-data-shared-prod`.org_mozilla_mozregression.usage\nWHERE\n DATE(submission_timestamp) = @submission_date\n AND client_info.app_display_version NOT LIKE '%.dev%'\nGROUP BY\n date,\n mozregression_version,\n mozregression_variant,\n app_used,\n os,\n os_version;\n
We use the truncate_version
UDF to omit the patch level for MacOS and Linux, which should both reduce the size of the dataset as well as make it more difficult to identify individual clients in an aggregated dataset.
We also have a short clause (client_info.app_display_version NOT LIKE '%.dev%'
) to omit developer versions from the aggregates: this makes sure we're not including people developing or testing mozregression itself in our results.
Now that we've written our query, we can format it and validate it. Once that's done, we run:
./bqetl query validate <dataset>.<table>\n
For our example: ./bqetl query validate org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
If there are no problems, you should see no output."},{"location":"cookbooks/creating_a_derived_dataset/#creating-the-table-schema","title":"Creating the table schema","text":"Use bqetl to set up the schema that will be used to create the table.
Review the schema.YAML generated as an output of the following command, and make sure all data types are set correctly and according to the data expected from the query.
./bqetl query schema update <dataset>.<table>\n
For our example:
./bqetl query schema update org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
"},{"location":"cookbooks/creating_a_derived_dataset/#creating-a-dag","title":"Creating a DAG","text":"BigQuery-ETL has some facilities in it to automatically add your query to telemetry-airflow (our instance of Airflow).
Before scheduling your query, you'll need to find an Airflow DAG to run it off of. In some cases, one may already exist that makes sense to use for your dataset -- look in dags.yaml
at the root or run ./bqetl dag info
. In this particular case, there's no DAG that really makes sense -- so we'll create a new one:
./bqetl dag create <dag_name> --schedule-interval \"0 4 * * *\" --owner <email_for_notifications> --description \"Add a clear description of the DAG here\" --start-date <YYYY-MM-DD> --tag impact/<tier>\n
For our example, the starting date is 2020-06-01
and we use a schedule interval of 0 4 \\* \\* \\*
(4am UTC daily) instead of \"daily\" (12am UTC daily) to make sure this isn't competing for slots with desktop and mobile product ETL.
The --tag impact/tier3
parameter specifies that this DAG is considered \"tier 3\". For a list of valid tags and their descriptions see Airflow Tags.
When creating a new DAG, while it is still under active development and assumed to fail during this phase, the DAG can be tagged as --tag triage/no_triage
. That way it will be ignored by the person on Airflow Triage. Once the active development is done, the triage/no_triage
tag can be removed and problems will addressed during the Airflow Triage process.
./bqetl dag create bqetl_internal_tooling --schedule-interval \"0 4 * * *\" --owner wlachance@mozilla.com --description \"This DAG schedules queries for populating queries related to Mozilla's internal developer tooling (e.g. mozregression).\" --start-date 2020-06-01 --tag impact/tier_3\n
"},{"location":"cookbooks/creating_a_derived_dataset/#scheduling-your-query","title":"Scheduling your query","text":"Queries are automatically scheduled during creation in the DAG set using the option --dag
, or in the default DAG bqetl_default
when this option is not used.
If the query was created with --no-schedule
, it is possible to manually schedule the query via the bqetl
tool:
./bqetl query schedule <dataset>.<table> --dag <dag_name> --task-name <task_name>\n
Here is the command for our example. Notice the name of the table as created with the suffix _v1.
./bqetl query schedule org_mozilla_mozregression_derived.mozregression_aggregates_v1 --dag bqetl_internal_tooling --task-name mozregression_aggregates__v1\n
Note that we are scheduling the generation of the underlying table which is org_mozilla_mozregression_derived.mozregression_aggregates_v1
rather than the view.
This is for public datasets only! You can skip this step if you're only creating a dataset for Mozilla-internal use.
Before a dataset can be made public, it needs to go through data review according to our data publishing process. This means filing a bug, answering a few questions, and then finding a data steward to review your proposal.
The dataset we're using in this example is very simple and straightforward and does not have any particularly sensitive data, so the data review is very simple. You can see the full details in bug 1691105.
"},{"location":"cookbooks/creating_a_derived_dataset/#create-a-pull-request","title":"Create a Pull Request","text":"Now is a good time to create a pull request with your changes to GitHub. This is the usual git workflow:
git checkout -b <new_branch_name>\ngit add dags.yaml dags/<dag_name>.py sql/moz-fx-data-shared-prod/telemetry/<view> sql/moz-fx-data-shared-prod/<dataset>/<table>\ngit commit\ngit push origin <new_branch_name>\n
And next is the workflow for our specific example:
git checkout -b mozregression-aggregates\ngit add dags.yaml dags/bqetl_internal_tooling.py sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1\ngit commit\ngit push origin mozregression-aggregates\n
Then create your pull request, either from the GitHub web interface or the command line, per your preference.
Note At this point, the CI is expected to fail because the schema does not exist yet in BigQuery. This will be handled in the next step.
This example assumes that origin
points to your fork. Adjust the last push invocation appropriately if you have a different remote set.
Speaking of forks, note that if you're making this pull request from a fork, many jobs will currently fail due to lack of credentials. In fact, even if you're pushing to the origin, you'll get failures because the table is not yet created. That brings us to the next step, but before going further it's generally best to get someone to review your work: at this point we have more than enough for people to provide good feedback on.
"},{"location":"cookbooks/creating_a_derived_dataset/#creating-an-initial-table","title":"Creating an initial table","text":"Once the PR has been approved, deploy the schema to bqetl using this command:
./bqetl query schema deploy <schema>.<table>\n
For our example:
./bqetl query schema deploy org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
"},{"location":"cookbooks/creating_a_derived_dataset/#backfilling-a-table","title":"Backfilling a table","text":"Note For large sets of data, follow the recommended practices for backfills.
"},{"location":"cookbooks/creating_a_derived_dataset/#initiating-the-backfill","title":"Initiating the backfill:","text":"Create a backfill schedule entry to (re)-process data in your table:
bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD>\n
--shredder_mitigation
parameter in the backfill command:bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD> --shredder_mitigation\n
Fill out the missing details:
Open a Pull Request with the backfill entry, see this example. Once merged, you should receive a notification in around an hour that processing has started. Your backfill data will be temporarily placed in a staging location.
Watchers need to join the #dataops-alerts Slack channel. They will be notified via Slack when processing is complete, and you can validate your backfill data.
Validate that the backfill data looks like what you expect (calculate important metrics, look for nulls, etc.)
If the data is valid, open a Pull Request, setting the backfill status to Complete, see this example. Once merged, you should receive a notification in around an hour that swapping has started. Current production data will be backed up and the staging backfill data will be swapped into production.
You will be notified when swapping is complete.
Note. If your backfill is complex (backfill validation fails for e.g.), it is recommended to talk to someone in Data Engineering or Data SRE (#data-help) to process the backfill via the backfill DAG.
"},{"location":"cookbooks/creating_a_derived_dataset/#completing-the-pull-request","title":"Completing the Pull Request","text":"At this point, the table exists in Bigquery so you are able to: - Find and re-run the CI of your PR and make sure that all tests pass - Merge your PR.
"},{"location":"cookbooks/testing/","title":"How to Run Tests","text":"This repository uses pytest
:
# create a venv\npython3.11 -m venv venv/\n\n# install pip-tools for managing dependencies\n./venv/bin/pip install pip-tools -c requirements.in\n\n# install python dependencies with pip-sync (provided by pip-tools)\n./venv/bin/pip-sync --pip-args=--no-deps requirements.txt\n\n# run pytest with all linters and 8 workers in parallel\n./venv/bin/pytest --black --flake8 --isort --mypy-ignore-missing-imports --pydocstyle -n 8\n\n# use -k to selectively run a set of tests that matches the expression `udf`\n./venv/bin/pytest -k udf\n\n# narrow down testpaths for quicker turnaround when selecting a single test\n./venv/bin/pytest -o \"testpaths=tests/sql\" -k mobile_search_aggregates_v1\n\n# run integration tests with 4 workers in parallel\ngcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS\nexport GOOGLE_PROJECT_ID=bigquery-etl-integration-test\ngcloud config set project $GOOGLE_PROJECT_ID\n./venv/bin/pytest -m integration -n 4\n
To provide authentication credentials for the Google Cloud API the GOOGLE_APPLICATION_CREDENTIALS
environment variable must be set to the file path of the JSON file that contains the service account key. See Mozilla BigQuery API Access instructions to request credentials if you don't already have them.
Include a comment like -- Tests
followed by one or more query statements after the UDF in the SQL file where it is defined. Each statement in that file that does not itself define a temporary function is collected as a test and executed independently of the other tests in the file.
Each test must use the UDF and throw an error to fail. Assert functions defined in sql/mozfun/assert/
may be used to evaluate outputs. Tests must not use any query parameters and should not reference any tables. Each test that is expected to fail must be preceded by a comment like #xfail
, similar to a SQL dialect prefix in the BigQuery Cloud Console.
For example:
CREATE TEMP FUNCTION udf_example(option INT64) AS (\n CASE\n WHEN option > 0 then TRUE\n WHEN option = 0 then FALSE\n ELSE ERROR(\"invalid option\")\n END\n);\n-- Tests\nSELECT\n mozfun.assert.true(udf_example(1)),\n mozfun.assert.false(udf_example(0));\n#xfail\nSELECT\n udf_example(-1);\n#xfail\nSELECT\n udf_example(NULL);\n
"},{"location":"cookbooks/testing/#how-to-configure-a-generated-test","title":"How to Configure a Generated Test","text":"Queries are tested by running the query.sql
with test-input tables and comparing the result to an expected table. 1. Make a directory for test resources named tests/sql/{project}/{dataset}/{table}/{test_name}/
, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/test_single_day
- table
must match a directory named like {dataset}/{table}
, e.g. telemetry_derived/clients_last_seen_v1
- test_name
should start with test_
, e.g. test_single_day
- If test_name
is test_init
or test_script
, then the query with is_init()
set to true
or script.sql
respectively; otherwise, the test will run query.sql
1. Add .yaml
files for input tables, e.g. clients_daily_v6.yaml
- Include the dataset prefix if it's set in the tested query, e.g. analysis.clients_last_seen_v1.yaml
- Include the project prefix if it's set in the tested query, e.g. moz-fx-other-data.new_dataset.table_1.yaml
- This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"analysis.clients_last_seen_v1\", \"clients_last_seen_v1\")
1. Add .sql
files for input view queries, e.g. main_summary_v4.sql
- Don't include a CREATE ... AS
clause - Fully qualify table names as `{project}.{dataset}.table`
- Include the dataset prefix if it's set in the tested query, e.g. telemetry.main_summary_v4.sql
- This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"telemetry.main_summary_v4\", \"main_summary_v4\")
1. Add expect.yaml
to validate the result - DATE
and DATETIME
type columns in the result are coerced to strings using .isoformat()
- Columns named generated_time
are removed from the result before comparing to expect
because they should not be static - NULL
values should be omitted in expect.yaml
. If a column is expected to be NULL
don't add it to expect.yaml
. (Be careful with spreading previous rows (-<<: *base
) here) 1. Optionally add .schema.json
files for input table schemas to the table directory, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/clients_daily_v6.schema.json
. These tables will be available for every test in the suite. The schema.json
file need to match the table name in the query.sql
file. If it has project and dataset listed there, the schema file also needs project and dataset. 1. Optionally add query_params.yaml
to define query parameters - query_params
must be a list
Tests of is_init()
statements are supported, similarly to other generated tests. Simply name the test test_init
. The other guidelines still apply.
Additional guidelines and options:
- generated_time should be a required DATETIME field to ensure minimal validation
- Input table files: all formats supported by bq load are supported; yaml and json format are supported and must contain an array of rows, which are converted in memory to ndjson before loading; preferred formats are yaml for readability or ndjson for compatibility with bq load
- expect.yaml: the yaml, json and ndjson file extensions are supported; preferred formats are yaml for readability or ndjson for compatibility with bq load
- Schema files: setting the description of a top-level field to time_partitioning_field will cause the table to use it for time partitioning; yaml, json and ndjson are supported, with yaml preferred for readability or json for compatibility with bq load
- Query parameters: scalar query params should be defined as a dict with keys name, type or type_, and value; query_parameters.yaml may be used instead of query_params.yaml, but they are mutually exclusive; yaml, json and ndjson are supported, with yaml preferred for readability
To run CircleCI jobs locally, download a JSON key file for the circleci service account in the bigquery-etl-integration-test project, then run circleci build and set the required environment variables GOOGLE_PROJECT_ID and GCLOUD_SERVICE_KEY
:gcloud_service_key=`cat /path/to/key_file.json`\n\n# to run a specific job, e.g. integration:\ncircleci build --job integration \\\n --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n\n# to run all jobs\ncircleci build \\\n --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n
"},{"location":"moz-fx-data-shared-prod/udf/","title":"Udf","text":""},{"location":"moz-fx-data-shared-prod/udf/#active_n_weeks_ago-udf","title":"active_n_weeks_ago (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters","title":"Parameters","text":"INPUTS
x INT64, n INT64\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#active_values_from_days_seen_map-udf","title":"active_values_from_days_seen_map (UDF)","text":"Given a map of representing activity for STRING key
s, this function returns an array of which key
s were active for the time period in question. start_offset should be at most 0. n_bits should be at most the remaining bits.
INPUTS
days_seen_bits_map ARRAY<STRUCT<key STRING, value INT64>>, start_offset INT64, n_bits INT64\n
Source | Edit
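For illustration, a minimal sketch of a call; the literal map and the expected result are assumptions based on the description above, taking bit 0 as the most recent day:
SELECT
  udf.active_values_from_days_seen_map(
    [STRUCT('fxa' AS key, 1 AS value), STRUCT('sync' AS key, 2 AS value)],
    0,  -- start_offset: 0 = the most recent day
    1   -- n_bits: consider a single day
  );
-- expected (per the description): ['fxa'], since only that key has the most recent bit set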
"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_engine_searches-udf","title":"add_monthly_engine_searches (UDF)","text":"This function specifically windows searches into calendar-month windows. This means groups are not necessarily directly comparable, since different months have different numbers of days. On the first of each month, a new month is appended, and the first month is dropped. If the date is not the first of the month, the new entry is added to the last element in the array. For example, if we were adding 12 to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]: On the first of the month, the result would be [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12] On any other day of the month, the result would be [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 24] This happens for every aggregate (searches, ad clicks, etc.)
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_2","title":"Parameters","text":"INPUTS
prev STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, curr STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_searches-udf","title":"add_monthly_searches (UDF)","text":"Adds together two engine searches structs. Each engine searches struct has a MAP[engine -> search_counts_struct]. We want to add add together the prev and curr's values for a certain engine. This allows us to be flexible with the number of engines we're using.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_3","title":"Parameters","text":"INPUTS
prev ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, curr ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, submission_date DATE\n
OUTPUTS
value\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_searches_by_index-udf","title":"add_searches_by_index (UDF)","text":"Return sums of each search type grouped by the index. Results are ordered by index.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_4","title":"Parameters","text":"INPUTS
searches ARRAY<STRUCT<total_searches INT64, tagged_searches INT64, search_with_ads INT64, ad_click INT64, index INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_active_addons-udf","title":"aggregate_active_addons (UDF)","text":"This function selects most frequently occuring value for each addon_id, using the latest value in the input among ties. The type for active_addons is ARRAY>, i.e. the output of SELECT ARRAY_CONCAT_AGG(active_addons) FROM telemetry.main_summary_v4
, and is left unspecified to allow changes to the fields of the STRUCT."},{"location":"moz-fx-data-shared-prod/udf/#parameters_5","title":"Parameters","text":"
INPUTS
active_addons ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_map_first-udf","title":"aggregate_map_first (UDF)","text":"Returns an aggregated map with all the keys and the first corresponding value from the given maps
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_6","title":"Parameters","text":"INPUTS
maps ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_counts-udf","title":"aggregate_search_counts (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_7","title":"Parameters","text":"INPUTS
search_counts ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_map-udf","title":"aggregate_search_map (UDF)","text":"Aggregates the total counts of the given search counters
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_8","title":"Parameters","text":"INPUTS
engine_searches_list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_11_zeroes_then-udf","title":"array_11_zeroes_then (UDF)","text":"An array of 11 zeroes, followed by a supplied value
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_9","title":"Parameters","text":"INPUTS
val INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_drop_first_and_append-udf","title":"array_drop_first_and_append (UDF)","text":"Drop the first element of an array, and append the given element. Result is an array with the same length as the input.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_10","title":"Parameters","text":"INPUTS
arr ANY TYPE, append ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_of_12_zeroes-udf","title":"array_of_12_zeroes (UDF)","text":"An array of 12 zeroes
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_11","title":"Parameters","text":"INPUTS
) AS ( [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_slice-udf","title":"array_slice (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_12","title":"Parameters","text":"INPUTS
arr ANY TYPE, start_index INT64, end_index INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitcount_lowest_7-udf","title":"bitcount_lowest_7 (UDF)","text":"This function counts the 1s in lowest 7 bits of an INT64
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_13","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_365-udf","title":"bitmask_365 (UDF)","text":"A bitmask for 365 bits
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_14","title":"Parameters","text":"INPUTS
) AS ( CONCAT(b'\\x1F', REPEAT(b'\\xFF', 45\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_28-udf","title":"bitmask_lowest_28 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_15","title":"Parameters","text":"INPUTS
) AS ( 0x0FFFFFFF\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_7-udf","title":"bitmask_lowest_7 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_16","title":"Parameters","text":"INPUTS
) AS ( 0x7F\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_range-udf","title":"bitmask_range (UDF)","text":"Returns a bitmask that can be used to return a subset of an integer representing a bit array. The start_ordinal argument is an integer specifying the starting position of the slice, with start_ordinal = 1 indicating the first bit. The length argument is the number of bits to include in the mask. The arguments were chosen to match the semantics of the SUBSTR function; see https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#substr
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_17","title":"Parameters","text":"INPUTS
start_ordinal INT64, _length INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_active_in_range-udf","title":"bits28_active_in_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_18","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_days_since_seen-udf","title":"bits28_days_since_seen (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_19","title":"Parameters","text":"INPUTS
bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_from_string-udf","title":"bits28_from_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_20","title":"Parameters","text":"INPUTS
s STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_range-udf","title":"bits28_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_21","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_retention-udf","title":"bits28_retention (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_22","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_dates-udf","title":"bits28_to_dates (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_23","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
OUTPUTS
ARRAY<DATE>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_string-udf","title":"bits28_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_24","title":"Parameters","text":"INPUTS
bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_from_offsets-udf","title":"bits_from_offsets (UDF)","text":"Returns a bit pattern of type BYTES compactly encoding the given array of positive integer offsets. This is primarily useful to generate a compact encoding of dates on which a feature was used, with arbitrarily long history. Example aggregation: sql bits_from_offsets( ARRAY_AGG(IF(foo, DATE_DIFF(anchor_date, submission_date, DAY), NULL) IGNORE NULLS) )
The resulting value can be cast to an INT64 representing the most recent 64 days via: sql CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64)
Or representing the most recent 28 days (compatible with bits28 functions) via: sql CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64) << 36 >> 36
INPUTS
offsets ARRAY<INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_active_n_weeks_ago-udf","title":"bits_to_active_n_weeks_ago (UDF)","text":"Given a BYTE and an INT64, return whether the user was active that many weeks ago. NULL input returns NULL output.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_26","title":"Parameters","text":"INPUTS
b BYTES, n INT64\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_seen-udf","title":"bits_to_days_seen (UDF)","text":"Given a BYTE, get the number of days the user was seen. NULL input returns NULL output.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_27","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_first_seen-udf","title":"bits_to_days_since_first_seen (UDF)","text":"Given a BYTES, return the number of days since the client was first seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the result is 0-indexed, meaning that for \\x01, it will return 0. Results showed this being between 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_first_seen(b BYTES) AS (( SELECT MAX(n) FROM UNNEST(GENERATE_ARRAY( 0, 8 * BYTE_LENGTH(b))) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_seen.sql
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_28","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_seen-udf","title":"bits_to_days_since_seen (UDF)","text":"Given a BYTES, return the number of days since the client was last seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the results are 0-indexed, meaning \\x01 will return 0. Tests showed this being 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_seen(b BYTES) AS (( SELECT MIN(n) FROM UNNEST(GENERATE_ARRAY(0, 364)) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_first_seen.sql
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_29","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bool_to_365_bits-udf","title":"bool_to_365_bits (UDF)","text":"Convert a boolean to 365 bit byte array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_30","title":"Parameters","text":"INPUTS
val BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#boolean_histogram_to_boolean-udf","title":"boolean_histogram_to_boolean (UDF)","text":"Given histogram h, return TRUE if it has a value in the \"true\" bucket, or FALSE if it has a value in the \"false\" bucket, or NULL otherwise. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L309-L317
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_31","title":"Parameters","text":"INPUTS
histogram STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_28_bits-udf","title":"coalesce_adjacent_days_28_bits (UDF)","text":"We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bits representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_32","title":"Parameters","text":"INPUTS
prev INT64, curr INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_365_bits-udf","title":"coalesce_adjacent_days_365_bits (UDF)","text":"Coalesce previous data's PCD with the new data's PCD. We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bytes representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_33","title":"Parameters","text":"INPUTS
prev BYTES, curr BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_28_bits-udf","title":"combine_adjacent_days_28_bits (UDF)","text":"Combines two bit patterns. The first pattern represents activity over a 28-day period ending \"yesterday\". The second pattern represents activity as observed today (usually just 0 or 1). We shift the bits in the first pattern by one to set the new baseline as \"today\", then perform a bitwise OR of the two patterns.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_34","title":"Parameters","text":"INPUTS
prev INT64, curr INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_365_bits-udf","title":"combine_adjacent_days_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_35","title":"Parameters","text":"INPUTS
prev BYTES, curr BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_days_seen_maps-udf","title":"combine_days_seen_maps (UDF)","text":"The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs (aka maps) where each entry gives the bit pattern for days in which we saw a ping for a given user in a given key. We combine the bit patterns for the previous day and the current day, returning a single map. See udf.combine_experiment_days
for a more specific example of this approach.
INPUTS
-- prev ARRAY<STRUCT<key STRING, value INT64>>, -- curr ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_experiment_days-udf","title":"combine_experiment_days (UDF)","text":"The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs where each entry gives the bit pattern for days in which we saw a ping for a given user in a given experiment. We combine the bit patterns for the previous day and the current day, returning a single array of experiment structs.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_37","title":"Parameters","text":"INPUTS
-- prev ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>, -- curr ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#country_code_to_flag-udf","title":"country_code_to_flag (UDF)","text":"For a given two-letter ISO 3166-1 alpha-2 country code, returns a string consisting of two Unicode regional indicator symbols, which is rendered in supporting fonts (such as in the BigQuery console or STMO) as flag emoji. This is just for fun. See: - https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 - https://en.wikipedia.org/wiki/Regional_Indicator_Symbol
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_38","title":"Parameters","text":"INPUTS
country_code string\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#days_seen_bytes_to_rfm-udf","title":"days_seen_bytes_to_rfm (UDF)","text":"Return the frequency, recency, and T from a BYTE array, as defined in https://lifetimes.readthedocs.io/en/latest/Quickstart.html#the-shape-of-your-data RFM refers to Recency, Frequency, and Monetary value.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_39","title":"Parameters","text":"INPUTS
days_seen_bytes BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#days_since_created_profile_as_28_bits-udf","title":"days_since_created_profile_as_28_bits (UDF)","text":"Takes in a difference between submission date and profile creation date and returns a bit pattern representing the profile creation date IFF the profile date is the same as the submission date or no more than 6 days earlier. Analysis has shown that client-reported profile creation dates are much less reliable outside of this range and cannot be used as reliable indicators of new profile creation.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_40","title":"Parameters","text":"INPUTS
days_since_created_profile INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#deanonymize_event-udf","title":"deanonymize_event (UDF)","text":"Rename struct fields in anonymous event tuples to meaningful names.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_41","title":"Parameters","text":"INPUTS
tuple STRUCT<f0_ INT64, f1_ STRING, f2_ STRING, f3_ STRING, f4_ STRING, f5_ ARRAY<STRUCT<key STRING, value STRING>>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#decode_int64-udf","title":"decode_int64 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_42","title":"Parameters","text":"INPUTS
raw BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#dedupe_array-udf","title":"dedupe_array (UDF)","text":"Return an array containing only distinct values of the given array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_43","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_clients-udf","title":"distribution_model_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_44","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_ga_metrics-udf","title":"distribution_model_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_45","title":"Parameters","text":"INPUTS
) RETURNS STRING AS ( 'helloworld'\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_installs-udf","title":"distribution_model_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_46","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#event_code_points_to_string-udf","title":"event_code_points_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_47","title":"Parameters","text":"INPUTS
code_points ANY TYPE\n
OUTPUTS
ARRAY<INT64>\n
"},{"location":"moz-fx-data-shared-prod/udf/#experiment_search_metric_to_array-udf","title":"experiment_search_metric_to_array (UDF)","text":"Used for testing only. Reproduces the string transformations done in experiment_search_events_live_v1 materialized views.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_48","title":"Parameters","text":"INPUTS
metric ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_count_histogram_value-udf","title":"extract_count_histogram_value (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_49","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_type-udf","title":"extract_document_type (UDF)","text":"Extract the document type from a table name e.g. _TABLE_SUFFIX.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_50","title":"Parameters","text":"INPUTS
table_name STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_version-udf","title":"extract_document_version (UDF)","text":"Extract the document version from a table name e.g. _TABLE_SUFFIX.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_51","title":"Parameters","text":"INPUTS
table_name STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"This is a performance optimization compared to the more general mozfun.hist.extract for cases where only the histogram sum is needed. It must support all the same format variants as mozfun.hist.extract but this simplification is necessary to keep the main_summary query complexity in check.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_52","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_schema_validation_path-udf","title":"extract_schema_validation_path (UDF)","text":"Return a path derived from an error message in payload_bytes_error
INPUTS
error_message STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.
Fenix originally used an 8-digit app_build format>
In short it is yDDDHHmm
:
The last date seen with an 8-digit build ID is 2020-08-10.
Newer builds use a 10-digit format> where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.
This function tolerates both formats.
After using this you may wish to DATETIME_TRUNC(result, DAY)
for grouping by build date.
INPUTS
app_build STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_clients-udf","title":"funnel_derived_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_55","title":"Parameters","text":"INPUTS
os STRING, first_seen_date DATE, build_id STRING, attribution_source STRING, attribution_ua STRING, startup_profile_selection_reason STRING, distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_ga_metrics-udf","title":"funnel_derived_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_56","title":"Parameters","text":"INPUTS
device_category STRING, browser STRING, operating_system STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_installs-udf","title":"funnel_derived_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_57","title":"Parameters","text":"INPUTS
silent BOOLEAN, submission_timestamp TIMESTAMP, build_id STRING, attribution STRING, distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#ga_is_mozilla_browser-udf","title":"ga_is_mozilla_browser (UDF)","text":"Determine if a browser in a Google Analytics data is produced by Mozilla
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_58","title":"Parameters","text":"INPUTS
browser STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct-udf","title":"geo_struct (UDF)","text":"Convert geoip lookup fields to a struct, replacing '??' with NULL. Returns NULL if if required field country would be NULL. Replaces '??' with NULL because '??' is a placeholder that may be used if there was an issue during geoip lookup in hindsight.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_59","title":"Parameters","text":"INPUTS
country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct_set_defaults-udf","title":"geo_struct_set_defaults (UDF)","text":"Convert geoip lookup fields to a struct, replacing NULLs with \"??\". This allows for better joins on those fields, but needs to be changed back to NULL at the end of the query.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_60","title":"Parameters","text":"INPUTS
country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#get_key-udf","title":"get_key (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_61","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_62","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_nanos-udf","title":"glean_timespan_nanos (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_63","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_seconds-udf","title":"glean_timespan_seconds (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_64","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#gzip_length_footer-udf","title":"gzip_length_footer (UDF)","text":"Given a gzip compressed byte string, extract the uncompressed size from the footer. WARNING: THIS FUNCTION IS NOT RELIABLE FOR ARBITRARY GZIP STREAMS. It should, however, be safe to use for checking the decompressed size of payload in payload_bytes_decoded (and NOT payload_bytes_raw) because that payload is produced by the decoder and limited to conditions where the footer is accurate. From https://stackoverflow.com/a/9213826 First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 232. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 232 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.) Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.) Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_65","title":"Parameters","text":"INPUTS
compressed BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_max_key_with_nonzero_value-udf","title":"histogram_max_key_with_nonzero_value (UDF)","text":"Find the largest numeric bucket that contains a value greater than zero. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L253-L266
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_66","title":"Parameters","text":"INPUTS
histogram STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_merge-udf","title":"histogram_merge (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_67","title":"Parameters","text":"INPUTS
histogram_list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_normalize-udf","title":"histogram_normalize (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_68","title":"Parameters","text":"INPUTS
histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n
OUTPUTS
STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_percentiles-udf","title":"histogram_percentiles (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_69","title":"Parameters","text":"INPUTS
histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_mean-udf","title":"histogram_to_mean (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_70","title":"Parameters","text":"INPUTS
histogram ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_threshold_count-udf","title":"histogram_to_threshold_count (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_71","title":"Parameters","text":"INPUTS
histogram STRING, threshold INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#hmac_sha256-udf","title":"hmac_sha256 (UDF)","text":"Given a key and message, return the HMAC-SHA256 hash. This algorithm can be found in Wikipedia: https://en.wikipedia.org/wiki/HMAC#Implementation This implentation is validated against the NIST test vectors. See test/validation/hmac_sha256.py for more information.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_72","title":"Parameters","text":"INPUTS
key BYTES, message BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#int_to_365_bits-udf","title":"int_to_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_73","title":"Parameters","text":"INPUTS
value INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#int_to_hex_string-udf","title":"int_to_hex_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_74","title":"Parameters","text":"INPUTS
value INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_75","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_int_map-udf","title":"json_extract_int_map (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_76","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_mode_last-udf","title":"json_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_77","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#keyed_histogram_get_sum-udf","title":"keyed_histogram_get_sum (UDF)","text":"Take a keyed histogram of type STRUCT, extract the histogram of the given key, and return the sum value"},{"location":"moz-fx-data-shared-prod/udf/#parameters_78","title":"Parameters","text":"
INPUTS
keyed_histogram ANY TYPE, target_key STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_append_to_json_string-udf","title":"kv_array_append_to_json_string (UDF)","text":"Returns a JSON string which has the pair
appended to the provided input
JSON string. NULL is also valid for input
. Examples: udf.kv_array_append_to_json_string('{\"foo\":\"bar\"}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"foo\":\"bar\",\"baz\":\"boo\"}' udf.kv_array_append_to_json_string('{}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"baz\": \"boo\"}'
INPUTS
input STRING, arr ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_to_json_string-udf","title":"kv_array_to_json_string (UDF)","text":"Returns a JSON string representing the input key-value array. Value type must be able to be represented as a string - this function will cast to a string. At Mozilla, the schema for a map is STRUCT>>. To use this with that representation, it should be as udf.kv_array_to_json_string(struct.key_value)
."},{"location":"moz-fx-data-shared-prod/udf/#parameters_80","title":"Parameters","text":"
INPUTS
kv_arr ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#main_summary_scalars-udf","title":"main_summary_scalars (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_81","title":"Parameters","text":"INPUTS
processes ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_bing_revenue_country_to_country_code-udf","title":"map_bing_revenue_country_to_country_code (UDF)","text":"For use by LTV revenue join only. Maps the Bing country to a country code. Only keeps the country codes we want to aggregate on.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_82","title":"Parameters","text":"INPUTS
country STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_mode_last-udf","title":"map_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_83","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_revenue_country-udf","title":"map_revenue_country (UDF)","text":"Only for use by the LTV Revenue join. Maps country codes to the codes we have in the revenue dataset. Buckets small Bing countries into \"other\".
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_84","title":"Parameters","text":"INPUTS
engine STRING, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_sum-udf","title":"map_sum (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_85","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#marketing_attributable_desktop-udf","title":"marketing_attributable_desktop (UDF)","text":"This is a UDF to help distinguish if acquired desktop clients are attributable to marketing efforts or not
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_86","title":"Parameters","text":"INPUTS
medium STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#merge_scalar_user_data-udf","title":"merge_scalar_user_data (UDF)","text":"Given an array of scalar metric data that might have duplicate values for a metric, merge them into one value.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_87","title":"Parameters","text":"INPUTS
aggs ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n
OUTPUTS
ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mod_uint128-udf","title":"mod_uint128 (UDF)","text":"This function returns \"dividend mod divisor\" where the dividend and the result is encoded in bytes, and divisor is an integer.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_88","title":"Parameters","text":"INPUTS
dividend BYTES, divisor INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mode_last-udf","title":"mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_89","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_90","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#monetized_search-udf","title":"monetized_search (UDF)","text":"Stub monetized_search UDF for tests
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_91","title":"Parameters","text":"INPUTS
engine STRING, country STRING, distribution_id STRING, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#new_monthly_engine_searches_struct-udf","title":"new_monthly_engine_searches_struct (UDF)","text":"This struct represents the past year's worth of searches. Each month has its own entry, hence 12.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_92","title":"Parameters","text":"INPUTS
) AS ( STRUCT( udf.array_of_12_zeroes(\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_fenix_metrics-udf","title":"normalize_fenix_metrics (UDF)","text":"Accepts a glean metrics struct as input and returns a modified struct that nulls out histograms for older versions of the Glean SDK that reported pathological binning; see https://bugzilla.mozilla.org/show_bug.cgi?id=1592930
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_93","title":"Parameters","text":"INPUTS
telemetry_sdk_build STRING, metrics ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_baseline_client_info-udf","title":"normalize_glean_baseline_client_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_94","title":"Parameters","text":"INPUTS
client_info ANY TYPE, metrics ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_ping_info-udf","title":"normalize_glean_ping_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_95","title":"Parameters","text":"INPUTS
ping_info ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_main_payload-udf","title":"normalize_main_payload (UDF)","text":"Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_96","title":"Parameters","text":"INPUTS
payload ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_metadata-udf","title":"normalize_metadata (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_97","title":"Parameters","text":"INPUTS
metadata ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_monthly_searches-udf","title":"normalize_monthly_searches (UDF)","text":"Sum up the monthy search count arrays by normalized engine
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_98","title":"Parameters","text":"INPUTS
engine_searches ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_os-udf","title":"normalize_os (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_99","title":"Parameters","text":"INPUTS
os STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_search_engine-udf","title":"normalize_search_engine (UDF)","text":"Return normalized engine name for recognized engines This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_100","title":"Parameters","text":"INPUTS
engine STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#null_if_empty_list-udf","title":"null_if_empty_list (UDF)","text":"Return NULL if list is empty, otherwise return list. This cannot be done with NULLIF because NULLIF does not support arrays.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_101","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#one_as_365_bits-udf","title":"one_as_365_bits (UDF)","text":"One represented as a byte array of 365 bits
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_102","title":"Parameters","text":"INPUTS
) AS ( CONCAT(REPEAT(b'\\x00', 45\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_desktop-udf","title":"organic_vs_paid_desktop (UDF)","text":"This is a UDF to help distinguish desktop client attribution as being organic or paid
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_103","title":"Parameters","text":"INPUTS
medium STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_mobile-udf","title":"organic_vs_paid_mobile (UDF)","text":"This is a UDF to help distinguish mobile client attribution as being organic or paid
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_104","title":"Parameters","text":"INPUTS
adjust_network STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pack_event_properties-udf","title":"pack_event_properties (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_105","title":"Parameters","text":"INPUTS
event_properties ANY TYPE, indices ANY TYPE\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
"},{"location":"moz-fx-data-shared-prod/udf/#parquet_array_sum-udf","title":"parquet_array_sum (UDF)","text":"Sum an array from a parquet-derived field. These are lists of an element
that contain the field value.
INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#parse_desktop_telemetry_uri-udf","title":"parse_desktop_telemetry_uri (UDF)","text":"Parses and labels the components of a telemetry desktop ping submission uri Per https://docs.telemetry.mozilla.org/concepts/pipeline/http_edge_spec.html#special-handling-for-firefox-desktop-telemetry the format is /submit/telemetry/docId/docType/appName/appVersion/appUpdateChannel/appBuildID e.g. /submit/telemetry/ce39b608-f595-4c69-b6a6-f7a436604648/main/Firefox/61.0a1/nightly/20180328030202
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_107","title":"Parameters","text":"INPUTS
uri STRING\n
OUTPUTS
STRUCT<namespace STRING, document_id STRING, document_type STRING, app_name STRING, app_version STRING, app_update_channel STRING, app_build_id STRING>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#parse_iso8601_date-udf","title":"parse_iso8601_date (UDF)","text":"Take a ISO 8601 date or date and time string and return a DATE. Return null if parse fails. Possible formats: 2019-11-04, 2019-11-04T21:15:00+00:00, 2019-11-04T21:15:00Z, 20191104T211500Z
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_108","title":"Parameters","text":"INPUTS
date_str STRING\n
OUTPUTS
DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_clients-udf","title":"partner_org_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_109","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_ga_metrics-udf","title":"partner_org_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_110","title":"Parameters","text":"INPUTS
) RETURNS STRING AS ( (SELECT 'hola_world' AS partner_org\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_installs-udf","title":"partner_org_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_111","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_leading_set_bit-udf","title":"pos_of_leading_set_bit (UDF)","text":"Returns the 0-based index of the first set bit. No set bits returns NULL.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_112","title":"Parameters","text":"INPUTS
i INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_trailing_set_bit-udf","title":"pos_of_trailing_set_bit (UDF)","text":"Identical to bits28_days_since_seen. Returns a 0-based index of the rightmost set bit in the passed bit pattern or null if no bits are set (bits = 0). To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_113","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#product_info_with_baseline-udf","title":"product_info_with_baseline (UDF)","text":"Similar to mozfun.norm.product_info(), but this UDF also handles \"baseline\" apps that were introduced differentiate for certain apps whether data is sent through Glean or core pings. This UDF has been temporarily introduced as part of https://bugzilla.mozilla.org/show_bug.cgi?id=1775216
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_114","title":"Parameters","text":"INPUTS
legacy_app_name STRING, normalized_os STRING\n
OUTPUTS
STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pseudonymize_ad_id-udf","title":"pseudonymize_ad_id (UDF)","text":"Pseudonymize Ad IDs, handling opt-outs.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_115","title":"Parameters","text":"INPUTS
hashed_ad_id STRING, key BYTES\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#quantile_search_metric_contribution-udf","title":"quantile_search_metric_contribution (UDF)","text":"This function returns how much of one metric is contributed by the quantile of another metric. Quantile variable should add an offset to get the requried percentile value. Example: udf.quantile_search_metric_contribution(sap, search_with_ads, sap_percentiles[OFFSET(9)]) It returns search_with_ads if sap value in top 10% volumn else null.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_116","title":"Parameters","text":"INPUTS
metric1 FLOAT64, metric2 FLOAT64, quantile FLOAT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#round_timestamp_to_minute-udf","title":"round_timestamp_to_minute (UDF)","text":"Floor a timestamp object to the given minute interval.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_117","title":"Parameters","text":"INPUTS
timestamp_expression TIMESTAMP, minute INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#safe_crc32_uuid-udf","title":"safe_crc32_uuid (UDF)","text":"Calculate the CRC-32 hash of a 36-byte UUID, or NULL if the value isn't 36 bytes. This implementation is limited to an exact length because recursion does not work. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_118","title":"Parameters","text":"INPUTS
) AS ( [ 0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685, 2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995, 2125561021, 3887607047, 2428444049, 498536548, 1789927666, 4089016648, 2227061214, 450548861, 1843258603, 4107580753, 2211677639, 325883990, 1684777152, 4251122042, 2321926636, 335633487, 1661365465, 4195302755, 2366115317, 997073096, 1281953886, 3579855332, 2724688242, 1006888145, 1258607687, 3524101629, 2768942443, 901097722, 1119000684, 3686517206, 2898065728, 853044451, 1172266101, 3705015759, 2882616665, 651767980, 1373503546, 3369554304, 3218104598, 565507253, 1454621731, 3485111705, 3099436303, 671266974, 1594198024, 3322730930, 2970347812, 795835527, 1483230225, 3244367275, 3060149565, 1994146192, 31158534, 2563907772, 4023717930, 1907459465, 112637215, 2680153253, 3904427059, 2013776290, 251722036, 2517215374, 3775830040, 2137656763, 141376813, 2439277719, 3865271297, 1802195444, 476864866, 2238001368, 4066508878, 1812370925, 453092731, 2181625025, 4111451223, 1706088902, 314042704, 2344532202, 4240017532, 1658658271, 366619977, 2362670323, 4224994405, 1303535960, 984961486, 2747007092, 3569037538, 1256170817, 1037604311, 2765210733, 3554079995, 1131014506, 879679996, 2909243462, 3663771856, 1141124467, 855842277, 2852801631, 3708648649, 1342533948, 654459306, 3188396048, 3373015174, 1466479909, 544179635, 3110523913, 3462522015, 1591671054, 702138776, 2966460450, 3352799412, 1504918807, 783551873, 3082640443, 3233442989, 3988292384, 2596254646, 62317068, 1957810842, 3939845945, 2647816111, 81470997, 1943803523, 3814918930, 2489596804, 225274430, 2053790376, 3826175755, 2466906013, 167816743, 2097651377, 4027552580, 2265490386, 503444072, 1762050814, 4150417245, 2154129355, 426522225, 1852507879, 4275313526, 2312317920, 282753626, 1742555852, 4189708143, 2394877945, 397917763, 1622183637, 3604390888, 2714866558, 953729732, 1340076626, 3518719985, 2797360999, 1068828381, 1219638859, 3624741850, 2936675148, 906185462, 1090812512, 3747672003, 2825379669, 829329135, 1181335161, 3412177804, 3160834842, 628085408, 1382605366, 3423369109, 3138078467, 570562233, 1426400815, 3317316542, 2998733608, 733239954, 1555261956, 3268935591, 3050360625, 752459403, 1541320221, 2607071920, 3965973030, 1969922972, 40735498, 2617837225, 3943577151, 1913087877, 83908371, 2512341634, 3803740692, 2075208622, 213261112, 2463272603, 3855990285, 2094854071, 198958881, 2262029012, 4057260610, 1759359992, 534414190, 2176718541, 4139329115, 1873836001, 414664567, 2282248934, 4279200368, 1711684554, 285281116, 2405801727, 4167216745, 1634467795, 376229701, 2685067896, 3608007406, 1308918612, 956543938, 2808555105, 3495958263, 1231636301, 1047427035, 2932959818, 3654703836, 1088359270, 936918000, 2847714899, 3736837829, 1202900863, 817233897, 3183342108, 3401237130, 1404277552, 615818150, 3134207493, 3453421203, 1423857449, 601450431, 3009837614, 3294710456, 1567103746, 711928724, 3020668471, 3272380065, 1510334235, 755167117 ]\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#safe_sample_id-udf","title":"safe_sample_id (UDF)","text":"Stably hash a client_id to an integer between 0 and 99, or NULL if client_id isn't 36 bytes
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_119","title":"Parameters","text":"INPUTS
client_id STRING\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#search_counts_map_sum-udf","title":"search_counts_map_sum (UDF)","text":"Calculate the sums of search counts per source and engine
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_120","title":"Parameters","text":"INPUTS
entries ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_28_bits_one_day-udf","title":"shift_28_bits_one_day (UDF)","text":"Shift input bits one day left and drop any bits beyond 28 days.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_121","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_365_bits_one_day-udf","title":"shift_365_bits_one_day (UDF)","text":"Shift input bits one day left and drop any bits beyond 365 days.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_122","title":"Parameters","text":"INPUTS
x BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_one_day-udf","title":"shift_one_day (UDF)","text":"Returns the bitfield shifted by one day, 0 for NULL
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_123","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#smoot_usage_from_28_bits-udf","title":"smoot_usage_from_28_bits (UDF)","text":"Calculates a variety of metrics based on bit patterns of daily usage for the smoot_usage_* tables.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_124","title":"Parameters","text":"INPUTS
bit_arrays ARRAY<STRUCT<days_created_profile_bits INT64, days_active_bits INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#vector_add-udf","title":"vector_add (UDF)","text":"This function adds two vectors. The two vectors can have different length. If one vector is null, the other vector will be returned directly.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_125","title":"Parameters","text":"INPUTS
a ARRAY<INT64>, b ARRAY<INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#zero_as_365_bits-udf","title":"zero_as_365_bits (UDF)","text":"Zero represented as a 365-bit byte array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_126","title":"Parameters","text":"INPUTS
) AS ( REPEAT(b'\\x00', 46\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#zeroed_array-udf","title":"zeroed_array (UDF)","text":"Generates an array if all zeroes, of arbitrary length
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_127","title":"Parameters","text":"INPUTS
len INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/","title":"Udf js","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#bootstrap_percentile_ci-udf","title":"bootstrap_percentile_ci (UDF)","text":"Calculate a confidence interval using an efficient bootstrap sampling technique for a given percentile of a histogram. This implementation relies on the stdlib.js library and the binomial quantile function (https://github.com/stdlib-js/stats-base-dists-binomial-quantile/) for randomly sampling from a binomial distribution.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters","title":"Parameters","text":"INPUTS
percentiles ARRAY<INT64>, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>, metric STRING\n
OUTPUTS
ARRAY<STRUCT<metric STRING, statistic STRING, point FLOAT64, lower FLOAT64, upper FLOAT64, parameter STRING>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#crc32-udf","title":"crc32 (UDF)","text":"Calculate the CRC-32 hash of an input string. The implementation here could be optimized. In particular, it calculates a lookup table on every invocation which could be cached and reused. In practice, though, this implementation appears to be fast enough that further optimization is not yet warranted. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_1","title":"Parameters","text":"INPUTS
data STRING\n
OUTPUTS
INT64 DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#decode_uri_attribution-udf","title":"decode_uri_attribution (UDF)","text":"URL decodes the raw firefox_installer.install.attribution string to a STRUCT. The fields campaign, content, dlsource, dltoken, experiment, medium, source, ua, variation the string are extracted. If any value is (not+set) it is converted to (not set) to match the text from GA when the fields are not set.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_2","title":"Parameters","text":"INPUTS
attribution STRING\n
OUTPUTS
STRUCT<campaign STRING, content STRING, dlsource STRING, dltoken STRING, experiment STRING, medium STRING, source STRING, ua STRING, variation STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#extract_string_from_bytes-udf","title":"extract_string_from_bytes (UDF)","text":"Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function extracts string data from payload
which is in bytes.
INPUTS
payload BYTES\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#gunzip-udf","title":"gunzip (UDF)","text":"Unzips a GZIP string. This implementation relies on the zlib.js library (https://github.com/imaya/zlib.js) and the atob function for decoding base64.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_4","title":"Parameters","text":"INPUTS
input BYTES\n
OUTPUTS
STRING DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_mean_ci-udf","title":"jackknife_mean_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the mean of an array of values for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_5","title":"Parameters","text":"INPUTS
n_buckets INT64, values_per_bucket ARRAY<FLOAT64>\n
OUTPUTS
STRUCT<low FLOAT64, high FLOAT64, pm FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_percentile_ci-udf","title":"jackknife_percentile_ci (UDF)","text":"Calculate a confidence interval using a jackknife resampling technique for a given percentile of a histogram.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_6","title":"Parameters","text":"INPUTS
percentile FLOAT64, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>\n
OUTPUTS
STRUCT<low FLOAT64, high FLOAT64, percentile FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_ratio_ci-udf","title":"jackknife_ratio_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the weighted mean of an array of ratios for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(active_days_in_week) AS active_days_in_week, SUM(wau) AS wau FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_ratio_ci(20, ARRAY_AGG(STRUCT(CAST(active_days_in_week AS float64), CAST(wau as FLOAT64)))) AS intensity FROM bucketed GROUP BY submission_date
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_7","title":"Parameters","text":"INPUTS
n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n
OUTPUTS
intensity FROM bucketed GROUP BY submission_date */ CREATE OR REPLACE FUNCTION udf_js.jackknife_ratio_ci( n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_sum_ci-udf","title":"jackknife_sum_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the sum of an array of counts for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate count per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(dau) AS dau_sum FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_sum_ci(ARRAY_AGG(dau_sum)).* FROM bucketed GROUP BY submission_date
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_8","title":"Parameters","text":"INPUTS
n_buckets INT64, counts_per_bucket ARRAY<INT64>\n
OUTPUTS
STRUCT<total INT64, low INT64, high INT64, pm INT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_events-udf","title":"json_extract_events (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_9","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<event_process STRING, event_timestamp INT64, event_category STRING, event_object STRING, event_method STRING, event_string_value STRING, event_map_values ARRAY<STRUCT<key STRING, value STRING>>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":"Returns a parsed struct from a JSON string representing a histogram. This implementation uses JavaScript and is provided for performance comparison; see udf/udf_json_extract_histogram for a pure SQL implementation that will likely be more usable in practice.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_10","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
STRUCT<bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_keyed_histogram-udf","title":"json_extract_keyed_histogram (UDF)","text":"Returns an array of parsed structs from a JSON string representing a keyed histogram. This is likely only useful for histograms that weren't properly parsed to fields, so ended up embedded in an additional_properties JSON blob. Normally, keyed histograms will be modeled as a key/value struct where the values are JSON representations of single histograms. There is no pure SQL equivalent to this function, since BigQuery does not provide any functions for listing or iterating over keysn in a JSON map.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_11","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_missing_cols-udf","title":"json_extract_missing_cols (UDF)","text":"Extract missing columns from additional properties. More generally, get a list of nodes from a JSON blob. Array elements are indicated as [...]. param input: The JSON blob to explode param indicates_node: An array of strings. If a key's value is an object, and contains one of these values, that key is returned as a node. param known_nodes: An array of strings. If a key is in this array, it is returned as a node. Notes: - Use indicates_node for things like histograms. For example ['histogram_type'] will ensure that each histogram will be returned as a missing node, rather than the subvalues within the histogram (e.g. values, sum, etc.) - Use known_nodes if you're aware of a missing section, like ['simpleMeasurements'] See here for an example usage https://sql.telemetry.mozilla.org/queries/64460/source
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_12","title":"Parameters","text":"INPUTS
input STRING, indicates_node ARRAY<STRING>, known_nodes ARRAY<STRING>\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_active_addons-udf","title":"main_summary_active_addons (UDF)","text":"Add fields from additional_attributes to active_addons in main pings. Return an array instead of a \"map\" for backwards compatibility. The INT64 columns from BigQuery may be passed as strings, so parseInt before returning them if they will be coerced to BOOL. The fields from additional_attributes due to union types: integer or boolean for foreignInstall and userDisabled; string or number for version. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L422-L449
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_13","title":"Parameters","text":"INPUTS
active_addons ARRAY<STRUCT<key STRING, value STRUCT<app_disabled BOOL, blocklisted BOOL, description STRING, foreign_install INT64, has_binary_components BOOL, install_day INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL, name STRING, scope INT64, signed_state INT64, type STRING, update_day INT64, user_disabled INT64, version STRING>>>, active_addons_json STRING\n
OUTPUTS
ARRAY<STRUCT<addon_id STRING, blocklisted BOOL, name STRING, user_disabled BOOL, app_disabled BOOL, version STRING, scope INT64, type STRING, foreign_install BOOL, has_binary_components BOOL, install_day INT64, update_day INT64, signed_state INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_addon_scalars-udf","title":"main_summary_addon_scalars (UDF)","text":"Parse scalars from payload.processes.dynamic into map columns for each value type. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L385-L399
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_14","title":"Parameters","text":"INPUTS
dynamic_scalars_json STRING, dynamic_keyed_scalars_json STRING\n
OUTPUTS
STRUCT<keyed_boolean_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value BOOL>>>>, keyed_uint_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value INT64>>>>, string_addon_scalars ARRAY<STRUCT<key STRING, value STRING>>, keyed_string_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value STRING>>>>, uint_addon_scalars ARRAY<STRUCT<key STRING, value INT64>>, boolean_addon_scalars ARRAY<STRUCT<key STRING, value BOOL>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_disabled_addons-udf","title":"main_summary_disabled_addons (UDF)","text":"Report the ids of the addons which are in the addonDetails but not in the activeAddons. They are the disabled addons (possibly because they are legacy). We need this as addonDetails may contain both disabled and active addons. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L451-L464
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_15","title":"Parameters","text":"INPUTS
active_addon_ids ARRAY<STRING>, addon_details_json STRING\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#parse_sponsored_interaction-udf","title":"parse_sponsored_interaction (UDF)","text":"Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function parses the sponsored interaction column from payload_error_bytes.contextual_services table.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_16","title":"Parameters","text":"INPUTS
params STRING\n
OUTPUTS
STRUCT<`source` STRING, formFactor STRING, scenario STRING, interactionType STRING, contextId STRING, reportingUrl STRING, requestId STRING, submissionTimestamp TIMESTAMP, parsedReportingUrl JSON, originalDocType STRING, originalNamespace STRING, interactionCount INTEGER, flaggedFraud BOOLEAN>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#sample_id-udf","title":"sample_id (UDF)","text":"Stably hash a client_id to an integer between 0 and 99. This function is technically defined in SQL, but it calls a JS UDF implementation of a CRC-32 hash, so we defined it here to make it clear that its performance may be limited by BigQuery's JavaScript UDF environment.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_17","title":"Parameters","text":"INPUTS
client_id STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#snake_case_columns-udf","title":"snake_case_columns (UDF)","text":"This UDF takes a list of column names to snake case and transform them to be compatible with the BigQuery column naming format. Based on the existing ingestion logic https://github.com/mozilla/gcp-ingestion/blob/dad29698271e543018eddbb3b771ad7942bf4ce5/ ingestion-core/src/main/java/com/mozilla/telemetry/ingestion/core/transform/PubsubMessageToObjectNode.java#L824
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_18","title":"Parameters","text":"INPUTS
input ARRAY<STRING>\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/about/","title":"mozfun","text":"mozfun
is a public GCP project provisioning publicly accessible user-defined functions (UDFs) and other function-like resources.
Returns whether a given Addon ID is an adblocker.
Determine if a given Addon ID is for an adblocker.
As an example, this query will give the number of users who have an adblocker installed.
SELECT\n submission_date,\n COUNT(DISTINCT client_id) AS dau,\nFROM\n mozdata.telemetry.addons\nWHERE\n mozfun.addons.is_adblocker(addon_id)\n AND submission_date >= \"2023-01-01\"\nGROUP BY\n submission_date\n
"},{"location":"mozfun/addons/#parameters","title":"Parameters","text":"INPUTS
addon_id STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/","title":"Assert","text":""},{"location":"mozfun/assert/#all_fields_null-udf","title":"all_fields_null (UDF)","text":""},{"location":"mozfun/assert/#parameters","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#approx_equals-udf","title":"approx_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_1","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE, tolerance FLOAT64\n
Source | Edit
"},{"location":"mozfun/assert/#array_empty-udf","title":"array_empty (UDF)","text":""},{"location":"mozfun/assert/#parameters_2","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#array_equals-udf","title":"array_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_3","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#array_equals_any_order-udf","title":"array_equals_any_order (UDF)","text":""},{"location":"mozfun/assert/#parameters_4","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#equals-udf","title":"equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_5","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#error-udf","title":"error (UDF)","text":""},{"location":"mozfun/assert/#parameters_6","title":"Parameters","text":"INPUTS
name STRING, expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#false-udf","title":"false (UDF)","text":""},{"location":"mozfun/assert/#parameters_7","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/assert/#histogram_equals-udf","title":"histogram_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_8","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#json_equals-udf","title":"json_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_9","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#map_entries_equals-udf","title":"map_entries_equals (UDF)","text":"Like map_equals but error message contains only the offending entry
"},{"location":"mozfun/assert/#parameters_10","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#map_equals-udf","title":"map_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_11","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#not_null-udf","title":"not_null (UDF)","text":""},{"location":"mozfun/assert/#parameters_12","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
"},{"location":"mozfun/assert/#null-udf","title":"null (UDF)","text":""},{"location":"mozfun/assert/#parameters_13","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#sql_equals-udf","title":"sql_equals (UDF)","text":"Compare SQL Strings for equality
"},{"location":"mozfun/assert/#parameters_14","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#struct_equals-udf","title":"struct_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_15","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#true-udf","title":"true (UDF)","text":""},{"location":"mozfun/assert/#parameters_16","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/bits28/","title":"bits28","text":"The bits28
functions provide an API for working with \"bit pattern\" INT64 fields, as used in the clients_last_seen
dataset for desktop Firefox and similar datasets for other applications.
A powerful feature of the clients_last_seen
methodology is that it doesn't record specific metrics like MAU and WAU directly, but rather each row stores a history of the discrete days on which a client was active in the past 28 days. We could calculate active users in a 10 day or 25 day window just as efficiently as a 7 day (WAU) or 28 day (MAU) window. But we can also define completely new metrics based on these usage histories, such as various retention definitions.
The usage history is encoded as a \"bit pattern\" where the physical type of the field is a BigQuery INT64, but logically the integer represents an array of bits, with each 1 indicating a day where the given client was active and each 0 indicating a day where the client was inactive.
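For example, rendering a small bit pattern as a string makes the encoding concrete (the value 18 is illustrative):
SELECT\n  mozfun.bits28.to_string(18)\n-- >> '0000000000000000000000010010'\n-- the client was active 1 day ago and 4 days ago, relative to submission_date\n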
"},{"location":"mozfun/bits28/#active_in_range-udf","title":"active_in_range (UDF)","text":"Return a boolean indicating if any bits are set in the specified range of a bit pattern. The start_offset
must be zero or a negative number indicating an offset from the rightmost bit in the pattern. n_bits is the number of bits to consider, counting right from the bit at start_offset
.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
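A usage sketch on clients_last_seen (the date filter is illustrative); passing -6 and 7 checks the most recent 7 days:
SELECT\n  client_id,\n  mozfun.bits28.active_in_range(days_seen_bits, -6, 7) AS active_in_last_7_days\nFROM\n  `mozdata.telemetry.clients_last_seen`\nWHERE\n  submission_date = '2020-01-28'\n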
"},{"location":"mozfun/bits28/#parameters","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/bits28/#days_since_seen-udf","title":"days_since_seen (UDF)","text":"Return the position of the rightmost set bit in an INT64 bit pattern.
To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n mozfun.bits28.days_since_seen(18)\n-- >> 1\n
"},{"location":"mozfun/bits28/#parameters_1","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#from_string-udf","title":"from_string (UDF)","text":"Convert a string representing individual bits into an INT64.
Implementation based on https://stackoverflow.com/a/51600210/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
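For example, from_string is the inverse of to_string (the value shown is illustrative):
SELECT\n  mozfun.bits28.from_string('0000000000000000000000010010')\n-- >> 18\n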
"},{"location":"mozfun/bits28/#parameters_2","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#range-udf","title":"range (UDF)","text":"Return an INT64 representing a range of bits from a source bit pattern.
The start_offset must be zero or a negative number indicating an offset from the rightmost bit in the pattern.
n_bits is the number of bits to consider, counting right from the bit at start_offset.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n -- Signature is bits28.range(offset_to_day_0, start_bit, number_of_bits)\n mozfun.bits28.range(days_seen_bits, -13 + 0, 7) AS week_0_bits,\n mozfun.bits28.range(days_seen_bits, -13 + 7, 7) AS week_1_bits\nFROM\n `mozdata.telemetry.clients_last_seen`\nWHERE\n submission_date > '2020-01-01'\n
"},{"location":"mozfun/bits28/#parameters_3","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#retention-udf","title":"retention (UDF)","text":"Return a nested struct providing booleans indicating whether a given client was active various time periods based on the passed bit pattern.
"},{"location":"mozfun/bits28/#parameters_4","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
Source | Edit
"},{"location":"mozfun/bits28/#to_dates-udf","title":"to_dates (UDF)","text":"Convert a bit pattern into an array of the dates is represents.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
"},{"location":"mozfun/bits28/#parameters_5","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
OUTPUTS
ARRAY<DATE>\n
Source | Edit
"},{"location":"mozfun/bits28/#to_string-udf","title":"to_string (UDF)","text":"Convert an INT64 field into a 28-character string representing the individual bits.
Implementation based on https://stackoverflow.com/a/51600210/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n [mozfun.bits28.to_string(1), mozfun.bits28.to_string(2), mozfun.bits28.to_string(3)]\n-- >>> ['0000000000000000000000000001',\n-- '0000000000000000000000000010',\n-- '0000000000000000000000000011']\n
"},{"location":"mozfun/bits28/#parameters_6","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/bytes/","title":"bytes","text":""},{"location":"mozfun/bytes/#bit_pos_to_byte_pos-udf","title":"bit_pos_to_byte_pos (UDF)","text":"Given a bit position, get the byte that bit appears in. 1-indexed (to match substr), and accepts negative values.
"},{"location":"mozfun/bytes/#parameters","title":"Parameters","text":"INPUTS
bit_pos INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bytes/#extract_bits-udf","title":"extract_bits (UDF)","text":"Extract bits from a byte array. Roughly matches substr with three arguments: b: bytes - The byte string we need to extract from start: int - The position of the first bit we want to extract. Can be negative to start from the end of the byte array. One-indexed, like substring. length: int - The number of bits we want to extract
The return byte array will have CEIL(length/8) bytes. The bits of interest will start at the beginning of the byte string. In other words, the byte array will have trailing 0s for any non-relevant fields.
Examples: bytes.extract_bits(b'\\x0F\\xF0', 5, 8) = b'\\xFF' bytes.extract_bits(b'\\x0C\\xC0', -12, 8) = b'\\xCC'
"},{"location":"mozfun/bytes/#parameters_1","title":"Parameters","text":"INPUTS
b BYTES, `begin` INT64, length INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"mozfun/bytes/#zero_right-udf","title":"zero_right (UDF)","text":"Zero bits on the right of byte
"},{"location":"mozfun/bytes/#parameters_2","title":"Parameters","text":"INPUTS
b BYTES, length INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"mozfun/event_analysis/","title":"event_analysis","text":"These functions are specific for use with the events_daily
and event_types
tables. By themselves, these two tables are nearly impossible to use since the event history is compressed; however, these stored procedures should make the data accessible.
The events_daily
table is created as a result of two steps: 1. Map each event to a single UTF8 char which will represent it. 2. Group each client-day and store a string that records, using the compressed format, that client's event history for that day. The characters are ordered by the timestamp at which they appeared that day.
The best way to access this data is to create a view to do the heavy lifting. For example, to see which clients completed a certain action, you can create a view using these functions that knows what that action's representation is (using the compressed mapping from 1.) and create a regex string that checks for the presence of that event. The view makes this transparent, and allows users to simply query a boolean field representing the presence of that event on that day.
"},{"location":"mozfun/event_analysis/#aggregate_match_strings-udf","title":"aggregate_match_strings (UDF)","text":"Given an array of strings that each match a single event, aggregate those into a single regex string that will match any of the events.
"},{"location":"mozfun/event_analysis/#parameters","title":"Parameters","text":"INPUTS
match_strings ARRAY<STRING>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_count_steps_query-stored-procedure","title":"create_count_steps_query (Stored Procedure)","text":"Generate the SQL statement that can be used to create an easily queryable view on events data.
"},{"location":"mozfun/event_analysis/#parameters_1","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n
OUTPUTS
sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_events_view-stored-procedure","title":"create_events_view (Stored Procedure)","text":"Create a view that queries the events_daily
table. This view currently supports both funnels and event counts. Funnels are created as a struct, with each step in the funnel as a boolean column in the struct, indicating whether the user completed that step on that day. Event counts are simply integers.
create_events_view(\n view_name STRING,\n project STRING,\n dataset STRING,\n funnels ARRAY<STRUCT<\n funnel_name STRING,\n funnel ARRAY<STRUCT<\n step_name STRING,\n events ARRAY<STRUCT<\n category STRING,\n event_name STRING>>>>>>,\n counts ARRAY<STRUCT<\n count_name STRING,\n events ARRAY<STRUCT<\n category STRING,\n event_name STRING>>>>\n )\n
view_name
: The name of the view that will be created. This view will be in the shared-prod project, in the analysis bucket, and so will be queryable at: `moz-fx-data-shared-prod`.analysis.{view_name}\n
project
: The project where the dataset
is located.dataset
: The dataset that must contain both the events_daily
and event_types
tables.funnels
: An array of funnels that will be created. Each funnel has two parts: 1. funnel_name
: The name of the funnel is what the column representing the funnel will be named in the view. For example, with the value \"onboarding\"
, the view can be selected as follows: SELECT onboarding\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. funnel
: The ordered series of steps that make up a funnel. Each step also has: 1. step_name
: Used to name the column within the funnel and represents whether the user completed that step on that day. For example, within onboarding
a user may have completed_first_card
as a step; this can be queried at SELECT onboarding.completed_first_step\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. events
: The set of events which indicate the user completed that step of the funnel. Most of the time this is a single event. Each event has a category
and event_name
.counts
: An array of counts. Each count has two parts, similar to funnel steps: 1. count_name
: Used to name the column representing the event count. E.g. \"clicked_settings_count\"
would be queried at SELECT clicked_settings_count\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. events
: The set of events you want to count. Each event has a category
and event_name
.Because the view definitions themselves are not informative about the contents of the events fields, it is best to put your query immediately after the procedure invocation, rather than invoking the procedure and running a separate query.
This STMO query is an example of doing so. This allows viewers of the query to easily interpret what the funnel and count columns represent.
"},{"location":"mozfun/event_analysis/#structure-of-the-resulting-view","title":"Structure of the Resulting View","text":"The view will be created at
`moz-fx-data-shared-prod`.analysis.{event_name}.\n
The view will have a schema roughly matching the following:
root\n |-- submission_date: date\n |-- client_id: string\n |-- {funnel_1_name}: record\n | |-- {funnel_step_1_name} boolean\n | |-- {funnel_step_2_name} boolean\n ...\n |-- {funnel_N_name}: record\n | |-- {funnel_step_M_name}: boolean\n |-- {count_1_name}: integer\n ...\n |-- {count_N_name}: integer\n ...dimensions...\n
"},{"location":"mozfun/event_analysis/#funnels","title":"Funnels","text":"Each funnel will be a STRUCT
with nested columns representing completion of each step The types of those columns are boolean, and represent whether the user completed that step on that day.
STRUCT(\n completed_step_1 BOOLEAN,\n completed_step_2 BOOLEAN,\n ...\n) AS funnel_name\n
With one row per-user per-day, you can use COUNTIF(funnel_name.completed_step_N)
to query these fields. See below for an example.
Each event count is simply an INT64
representing the number of times the user completed those events on that day. If there are multiple events represented within one count, the values are summed. For example, if you wanted to know the number of times a user opened or closed the app, you could create a single event count with those two events.
event_count_name INT64\n
"},{"location":"mozfun/event_analysis/#examples","title":"Examples","text":"The following creates a few fields: - collection_flow
is a funnel for those that started creating a collection within Fenix, and then finished, either by adding those tabs to an existing collection or saving it as a new collection. - collection_flow_saved
represents users who started the collection flow then saved it as a new collection. - number_of_collections_created
is the number of collections created - number_of_collections_deleted
is the number of collections deleted
CALL mozfun.event_analysis.create_events_view(\n 'fenix_collection_funnels',\n 'moz-fx-data-shared-prod',\n 'org_mozilla_firefox',\n\n -- Funnels\n [\n STRUCT(\n \"collection_flow\" AS funnel_name,\n [STRUCT(\n \"started_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n STRUCT(\n \"completed_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'saved' AS event_name),\n STRUCT('collections' AS category, 'tabs_added' AS event_name)] AS events)\n ] AS funnel),\n\n STRUCT(\n \"collection_flow_saved\" AS funnel_name,\n [STRUCT(\n \"started_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n STRUCT(\n \"saved_collection\" AS step_name,\n [STRUCT('collections' AS category, 'saved' AS event_name)] AS events)\n ] AS funnel)\n ],\n\n -- Event Counts\n [\n STRUCT(\n \"number_of_collections_created\" AS count_name,\n [STRUCT('collections' AS category, 'saved' AS event_name)] AS events\n ),\n STRUCT(\n \"number_of_collections_deleted\" AS count_name,\n [STRUCT('collections' AS category, 'removed' AS event_name)] AS events\n )\n ]\n);\n
From there, you can query a few things. For example, the fraction of users who completed each step of the collection flow over time:
SELECT\n submission_date,\n COUNTIF(collection_flow.started_collection_creation) / COUNT(*) AS started_collection_creation,\n COUNTIF(collection_flow.completed_collection_creation) / COUNT(*) AS completed_collection_creation,\nFROM\n `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n submission_date\n
Or you can see the number of collections created and deleted:
SELECT\n submission_date,\n SUM(number_of_collections_created) AS number_of_collections_created,\n SUM(number_of_collections_deleted) AS number_of_collections_deleted,\nFROM\n `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n submission_date\n
"},{"location":"mozfun/event_analysis/#parameters_2","title":"Parameters","text":"INPUTS
view_name STRING, project STRING, dataset STRING, funnels ARRAY<STRUCT<funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>>>, counts ARRAY<STRUCT<count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_funnel_regex-udf","title":"create_funnel_regex (UDF)","text":"Given an array of match strings, each representing a single funnel step, aggregate them into a regex string that will match only against the entire funnel. If intermediate_steps is TRUE, this allows for there to be events that occur between the funnel steps.
"},{"location":"mozfun/event_analysis/#parameters_3","title":"Parameters","text":"INPUTS
step_regexes ARRAY<STRING>, intermediate_steps BOOLEAN\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_funnel_steps_query-stored-procedure","title":"create_funnel_steps_query (Stored Procedure)","text":"Generate the SQL statement that can be used to create an easily queryable view on events data.
"},{"location":"mozfun/event_analysis/#parameters_4","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, funnel ARRAY<STRUCT<list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
OUTPUTS
sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#escape_metachars-udf","title":"escape_metachars (UDF)","text":"Escape all metachars from a regex string. This will make the string an exact match, no matter what it contains.
"},{"location":"mozfun/event_analysis/#parameters_5","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_index_to_match_string-udf","title":"event_index_to_match_string (UDF)","text":"Given an event index string, create a match string that is an exact match in the events_daily table.
"},{"location":"mozfun/event_analysis/#parameters_6","title":"Parameters","text":"INPUTS
index STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_property_index_to_match_string-udf","title":"event_property_index_to_match_string (UDF)","text":"Given an event index and property index from an event_types
table, returns a regular expression to match corresponding events within an events_daily
table's events
string that aren't missing the specified property.
INPUTS
event_index STRING, property_index INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_property_value_to_match_string-udf","title":"event_property_value_to_match_string (UDF)","text":"Given an event index, property index, and property value from an event_types
table, returns a regular expression to match corresponding events within an events_daily
table's events
string.
INPUTS
event_index STRING, property_index INTEGER, property_value STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#extract_event_counts-udf","title":"extract_event_counts (UDF)","text":"Extract the events and their counts from an events string. This function explicitly ignores event properties, and retrieves just the counts of the top-level events.
"},{"location":"mozfun/event_analysis/#usage_1","title":"Usage","text":"extract_event_counts(\n events STRING\n)\n
events
- A comma-separated events string, where each event is represented as a string of unicode chars.
See this dashboard for example usage.
"},{"location":"mozfun/event_analysis/#parameters_9","title":"Parameters","text":"INPUTS
events STRING\n
OUTPUTS
ARRAY<STRUCT<index STRING, count INT64>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#extract_event_counts_with_properties-udf","title":"extract_event_counts_with_properties (UDF)","text":"Extract events with event properties and their associated counts. Also extracts raw events and their counts. This allows for querying with and without properties in the same dashboard.
"},{"location":"mozfun/event_analysis/#usage_2","title":"Usage","text":"extract_event_counts_with_properties(\n events STRING\n)\n
events
- A comma-separated events string, where each event is represented as a string of unicode chars.
See this query for example usage.
"},{"location":"mozfun/event_analysis/#caveats","title":"Caveats","text":"This function extracts both counts for events with each property, and for all events without their properties.
This allows us to include both total counts for an event (with any property value), and events that don't have properties.
"},{"location":"mozfun/event_analysis/#parameters_10","title":"Parameters","text":"INPUTS
events STRING\n
OUTPUTS
ARRAY<STRUCT<event_index STRING, property_index INT64, property_value_index STRING, count INT64>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#get_count_sql-stored-procedure","title":"get_count_sql (Stored Procedure)","text":"For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.
"},{"location":"mozfun/event_analysis/#parameters_11","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n
OUTPUTS
count_sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#get_funnel_steps_sql-stored-procedure","title":"get_funnel_steps_sql (Stored Procedure)","text":"For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.
"},{"location":"mozfun/event_analysis/#parameters_12","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
OUTPUTS
funnel_sql STRING\n
Source | Edit
"},{"location":"mozfun/ga/","title":"Ga","text":""},{"location":"mozfun/ga/#nullify_string-udf","title":"nullify_string (UDF)","text":"Nullify a GA string, which sometimes come in \"(not set)\" or simply \"\"
UDF for handling empty Google Analytics data.
"},{"location":"mozfun/ga/#parameters","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/","title":"Glam","text":""},{"location":"mozfun/glam/#build_hour_to_datetime-udf","title":"build_hour_to_datetime (UDF)","text":"Parses the custom build id used for Fenix builds in GLAM to a datetime.
"},{"location":"mozfun/glam/#parameters","title":"Parameters","text":"INPUTS
build_hour STRING\n
OUTPUTS
DATETIME\n
Source | Edit
"},{"location":"mozfun/glam/#build_seconds_to_hour-udf","title":"build_seconds_to_hour (UDF)","text":"Returns a custom build id generated from the build seconds of a FOG build.
"},{"location":"mozfun/glam/#parameters_1","title":"Parameters","text":"INPUTS
build_hour STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#fenix_build_to_build_hour-udf","title":"fenix_build_to_build_hour (UDF)","text":"Returns a custom build id generated from the build hour of a Fenix build.
"},{"location":"mozfun/glam/#parameters_2","title":"Parameters","text":"INPUTS
app_build_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_bucket_from_value-udf","title":"histogram_bucket_from_value (UDF)","text":""},{"location":"mozfun/glam/#parameters_3","title":"Parameters","text":"INPUTS
buckets ARRAY<STRING>, val FLOAT64\n
OUTPUTS
FLOAT64\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_buckets_cast_string_array-udf","title":"histogram_buckets_cast_string_array (UDF)","text":"Cast histogram buckets into a string array.
"},{"location":"mozfun/glam/#parameters_4","title":"Parameters","text":"INPUTS
buckets ARRAY<INT64>\n
OUTPUTS
ARRAY<STRING>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_cast_json-udf","title":"histogram_cast_json (UDF)","text":"Cast a histogram into a JSON blob.
"},{"location":"mozfun/glam/#parameters_5","title":"Parameters","text":"INPUTS
histogram ARRAY<STRUCT<key STRING, value FLOAT64>>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_cast_struct-udf","title":"histogram_cast_struct (UDF)","text":"Cast a String-based JSON histogram to an Array of Structs
"},{"location":"mozfun/glam/#parameters_6","title":"Parameters","text":"INPUTS
json_str STRING\n
OUTPUTS
ARRAY<STRUCT<KEY STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_fill_buckets-udf","title":"histogram_fill_buckets (UDF)","text":"Interpolate missing histogram buckets with empty buckets.
"},{"location":"mozfun/glam/#parameters_7","title":"Parameters","text":"INPUTS
input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_fill_buckets_dirichlet-udf","title":"histogram_fill_buckets_dirichlet (UDF)","text":"Interpolate missing histogram buckets with empty buckets so it becomes a valid estimator for the dirichlet distribution.
See: https://docs.google.com/document/d/1ipy1oFIKDvHr3R6Ku0goRjS11R1ZH1z2gygOGkSdqUg
To use this, you must first: Aggregate the histograms to the client level, to get a histogram {k1: p1, k2:p2, ..., kK: pN} where the p's are proportions(and p1, p2, ... sum to 1) and Kis the number of buckets.
This is then the client's estimated density, and every client has been reduced to one row (i.e the client's histograms are reduced to this single one and normalized).
Then add all of these across clients to get {k1: P1, k2:P2, ..., kK: PK} where P1 = sum(p1 across N clients) and P2 = sum(p2 across N clients).
Calculate the total number of buckets K, as well as the total number of profiles N reporting
Then our estimate for final density is: [{k1: ((P1 + 1/K) / (nreporting+1)), k2: ((P2 + 1/K) /(nreporting+1)), ... }
"},{"location":"mozfun/glam/#parameters_8","title":"Parameters","text":"INPUTS
input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>, total_users INT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_filter_high_values-udf","title":"histogram_filter_high_values (UDF)","text":"Prevent overflows by only keeping buckets where value is less than 2^40 allowing 2^24 entries. This value was chosen somewhat abitrarily, typically the max histogram value is somewhere on the order of ~20 bits. Negative values are incorrect and should not happen but were observed, probably due to some bit flips.
"},{"location":"mozfun/glam/#parameters_9","title":"Parameters","text":"INPUTS
aggs ARRAY<STRUCT<key STRING, value INT64>>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_from_buckets_uniform-udf","title":"histogram_from_buckets_uniform (UDF)","text":"Create an empty histogram from an array of buckets.
"},{"location":"mozfun/glam/#parameters_10","title":"Parameters","text":"INPUTS
buckets ARRAY<STRING>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_exponential_buckets-udf","title":"histogram_generate_exponential_buckets (UDF)","text":"Generate exponential buckets for a histogram.
"},{"location":"mozfun/glam/#parameters_11","title":"Parameters","text":"INPUTS
min FLOAT64, max FLOAT64, nBuckets FLOAT64\n
OUTPUTS
ARRAY<FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_functional_buckets-udf","title":"histogram_generate_functional_buckets (UDF)","text":"Generate functional buckets for a histogram. This is specific to Glean.
See: https://github.com/mozilla/glean/blob/main/glean-core/src/histogram/functional.rs
A functional bucketing algorithm. The bucket index of a given sample is determined with the following function:
i = $$ \\lfloor{n log_{\\text{base}}{(x)}}\\rfloor $$
In other words, there are n buckets for each power of base
magnitude.
INPUTS
log_base INT64, buckets_per_magnitude INT64, range_max INT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_linear_buckets-udf","title":"histogram_generate_linear_buckets (UDF)","text":"Generate linear buckets for a histogram.
"},{"location":"mozfun/glam/#parameters_13","title":"Parameters","text":"INPUTS
min FLOAT64, max FLOAT64, nBuckets FLOAT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_scalar_buckets-udf","title":"histogram_generate_scalar_buckets (UDF)","text":"Generate scalar buckets for a histogram using a fixed number of buckets.
"},{"location":"mozfun/glam/#parameters_14","title":"Parameters","text":"INPUTS
min_bucket FLOAT64, max_bucket FLOAT64, num_buckets INT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_normalized_sum-udf","title":"histogram_normalized_sum (UDF)","text":"Compute the normalized sum of an array of histograms.
"},{"location":"mozfun/glam/#parameters_15","title":"Parameters","text":"INPUTS
arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_normalized_sum_with_original-udf","title":"histogram_normalized_sum_with_original (UDF)","text":"Compute the normalized and the non-normalized sum of an array of histograms.
"},{"location":"mozfun/glam/#parameters_16","title":"Parameters","text":"INPUTS
arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64, non_norm_value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#map_from_array_offsets-udf","title":"map_from_array_offsets (UDF)","text":""},{"location":"mozfun/glam/#parameters_17","title":"Parameters","text":"INPUTS
required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#map_from_array_offsets_precise-udf","title":"map_from_array_offsets_precise (UDF)","text":""},{"location":"mozfun/glam/#parameters_18","title":"Parameters","text":"INPUTS
required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#percentile-udf","title":"percentile (UDF)","text":"Get the value of the approximate CDF at the given percentile.
"},{"location":"mozfun/glam/#parameters_19","title":"Parameters","text":"INPUTS
pct FLOAT64, histogram ARRAY<STRUCT<key STRING, value FLOAT64>>, type STRING\n
OUTPUTS
FLOAT64\n
Source | Edit
"},{"location":"mozfun/glean/","title":"glean","text":"Functions for working with Glean data.
"},{"location":"mozfun/glean/#legacy_compatible_experiments-udf","title":"legacy_compatible_experiments (UDF)","text":"Formats a Glean experiments field into a Legacy Telemetry experiments field by dropping the extra information that Glean collects
This UDF transforms the ping_info.experiments
field from Glean pings into the format for experiments
used by Legacy Telemetry pings. In particular, it drops the exta information that Glean pings collect.
If you need to combine Glean data with Legacy Telemetry data, then you can use this UDF to transform a Glean experiments field into the structure of a Legacy Telemetry one.
"},{"location":"mozfun/glean/#parameters","title":"Parameters","text":"INPUTS
ping_info__experiments ARRAY<STRUCT<key STRING, value STRUCT<branch STRING, extra STRUCT<type STRING, enrollment_id STRING>>>>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/glean/#parse_datetime-udf","title":"parse_datetime (UDF)","text":"Parses a Glean datetime metric string value as a BigQuery timestamp.
See https://mozilla.github.io/glean/book/reference/metrics/datetime.html
"},{"location":"mozfun/glean/#parameters_1","title":"Parameters","text":"INPUTS
datetime_string STRING\n
OUTPUTS
TIMESTAMP\n
Source | Edit
"},{"location":"mozfun/glean/#timespan_nanos-udf","title":"timespan_nanos (UDF)","text":"Returns the number of nanoseconds represented by a Glean timespan struct.
See https://mozilla.github.io/glean/book/user/metrics/timespan.html
"},{"location":"mozfun/glean/#parameters_2","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/glean/#timespan_seconds-udf","title":"timespan_seconds (UDF)","text":"Returns the number of seconds represented by a Glean timespan struct, rounded down to full seconds.
See https://mozilla.github.io/glean/book/user/metrics/timespan.html
"},{"location":"mozfun/glean/#parameters_3","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/google_ads/","title":"Google ads","text":""},{"location":"mozfun/google_ads/#extract_segments_from_campaign_name-udf","title":"extract_segments_from_campaign_name (UDF)","text":"Extract Segments from a campaign name. Includes region, country_code, and language.
"},{"location":"mozfun/google_ads/#parameters","title":"Parameters","text":"INPUTS
campaign_name STRING\n
OUTPUTS
STRUCT<campaign_region STRING, campaign_country_code STRING, campaign_language STRING>\n
Source | Edit
"},{"location":"mozfun/google_search_console/","title":"google_search_console","text":"Functions for use with Google Search Console data.
"},{"location":"mozfun/google_search_console/#classify_site_query-udf","title":"classify_site_query (UDF)","text":"Classify a Google search query for a site as \"Anonymized\", \"Firefox Brand\", \"Pocket Brand\", \"Mozilla Brand\", or \"Non-Brand\".
"},{"location":"mozfun/google_search_console/#parameters","title":"Parameters","text":"INPUTS
site_domain_name STRING, query STRING, search_type STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_country_code-udf","title":"extract_url_country_code (UDF)","text":"Extract the country code from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_1","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_domain_name-udf","title":"extract_url_domain_name (UDF)","text":"Extract the domain name from a URL.
"},{"location":"mozfun/google_search_console/#parameters_2","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_language_code-udf","title":"extract_url_language_code (UDF)","text":"Extract the language code from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_3","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_locale-udf","title":"extract_url_locale (UDF)","text":"Extract the locale from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_4","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_path-udf","title":"extract_url_path (UDF)","text":"Extract the path from a URL.
"},{"location":"mozfun/google_search_console/#parameters_5","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_path_segment-udf","title":"extract_url_path_segment (UDF)","text":"Extract a particular path segment from a URL.
"},{"location":"mozfun/google_search_console/#parameters_6","title":"Parameters","text":"INPUTS
url STRING, segment_number INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/hist/","title":"hist","text":"Functions for working with string encodings of histograms from desktop telemetry.
"},{"location":"mozfun/hist/#count-udf","title":"count (UDF)","text":"Given histogram h, return the count of all measurements across all buckets.
Extracts the values from the histogram and sums them, returning the total_count.
"},{"location":"mozfun/hist/#parameters","title":"Parameters","text":"INPUTS
histogram STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract-udf","title":"extract (UDF)","text":"Return a parsed struct from a string-encoded histogram.
We support a variety of compact encodings as well as the classic JSON representation as sent in main pings.
The built-in BigQuery JSON parsing functions are not powerful enough to handle all the logic here, so we resort to some string processing. This function could behave unexpectedly on poorly-formatted histogram JSON, but we expect that payload validation in the data pipeline should ensure that histograms are well formed, which gives us some flexibility.
For more on desktop telemetry histogram structure, and for the original proposals of the compact encodings, see the linked references in the source documentation.
SELECT\n mozfun.hist.extract(\n '{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}'\n ).sum\n-- 1\n
SELECT\n mozfun.hist.extract('5').sum\n-- 5\n
"},{"location":"mozfun/hist/#parameters_1","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"Extract a histogram sum from a JSON str representation
"},{"location":"mozfun/hist/#parameters_2","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract_keyed_hist_sum-udf","title":"extract_keyed_hist_sum (UDF)","text":"Sum of a keyed histogram, across all keys it contains.
"},{"location":"mozfun/hist/#extract-keyed-histogram-sum","title":"Extract Keyed Histogram Sum","text":"Takes a keyed histogram and returns a single number: the sum of all keys it contains. The expected input type is ARRAY<STRUCT<key STRING, value STRING>>
The return type is INT64
.
The key
field will be ignored, and the `value` field is expected to be the compact histogram representation.
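A small sketch using the compact single-number encoding (where, as in the hist.extract example above, each value encodes a histogram whose sum is that number):
SELECT\n  mozfun.hist.extract_keyed_hist_sum(\n    [STRUCT('key1' AS key, '5' AS value), STRUCT('key2', '3')]\n  ) AS keyed_hist_sum\n-- expected: 8, assuming the compact encodings represent histograms with sums 5 and 3\n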
INPUTS
keyed_histogram ARRAY<STRUCT<key STRING, value STRING>>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#mean-udf","title":"mean (UDF)","text":"Given histogram h, return floor(mean) of the measurements in the bucket. That is, the histogram sum divided by the number of measurements taken.
https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L292-L307
"},{"location":"mozfun/hist/#parameters_4","title":"Parameters","text":"INPUTS
histogram ANY TYPE\n
OUTPUTS
STRUCT<sum INT64, VALUES ARRAY<STRUCT<value INT64>>>\n
Source | Edit
"},{"location":"mozfun/hist/#merge-udf","title":"merge (UDF)","text":"Merge an array of histograms into a single histogram.
INPUTS
histogram_list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/hist/#normalize-udf","title":"normalize (UDF)","text":"Normalize a histogram. Set sum to 1, and normalize to 1 the histogram bucket counts.
"},{"location":"mozfun/hist/#parameters_6","title":"Parameters","text":"INPUTS
histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n
OUTPUTS
STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n
Source | Edit
"},{"location":"mozfun/hist/#percentiles-udf","title":"percentiles (UDF)","text":"Given histogram and list of percentiles,calculate what those percentiles are for the histogram. If the histogram is empty, returns NULL.
"},{"location":"mozfun/hist/#parameters_7","title":"Parameters","text":"INPUTS
histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<percentile FLOAT64, value INT64>>\n
Source | Edit
"},{"location":"mozfun/hist/#string_to_json-udf","title":"string_to_json (UDF)","text":"Convert a histogram string (in JSON or compact format) to a full histogram JSON blob.
"},{"location":"mozfun/hist/#parameters_8","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#threshold_count-udf","title":"threshold_count (UDF)","text":"Return the number of recorded observations greater than threshold for the histogram. CAUTION: Does not count any buckets that have any values less than the threshold. For example, a bucket with range (1, 10) will not be counted for a threshold of 2. Use threshold that are not bucket boundaries with caution.
https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L213-L239
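A call sketch reusing the JSON histogram literal from the hist.extract example above; the exact count returned is not asserted here:
SELECT\n  mozfun.hist.threshold_count(\n    '{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}',\n    1\n  ) AS observations_over_threshold\n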
"},{"location":"mozfun/hist/#parameters_9","title":"Parameters","text":"INPUTS
histogram STRING, threshold INT64\n
Source | Edit
"},{"location":"mozfun/iap/","title":"iap","text":""},{"location":"mozfun/iap/#derive_apple_subscription_interval-udf","title":"derive_apple_subscription_interval (UDF)","text":"Take output purchase_date and expires_date from mozfun.iap.parse_apple_receipt and return the subscription interval to use for accounting. Values must be DATETIME in America/Los_Angeles to get correct results because of how timezone and daylight savings impact the time of day and the length of a month.
"},{"location":"mozfun/iap/#parameters","title":"Parameters","text":"INPUTS
start DATETIME, `end` DATETIME\n
OUTPUTS
STRUCT<`interval` STRING, interval_count INT64>\n
Source | Edit
"},{"location":"mozfun/iap/#parse_android_receipt-udf","title":"parse_android_receipt (UDF)","text":"Used to parse data
field from firestore export of fxa dataset iap_google_raw. The content is documented at https://developer.android.com/google/play/billing/subscriptions and https://developers.google.com/android-publisher/api-ref/rest/v3/purchases.subscriptions
INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/iap/#parse_apple_event-udf","title":"parse_apple_event (UDF)","text":"Used to parse data
field from firestore export of fxa dataset iap_app_store_purchases_raw. The content is documented at https://developer.apple.com/documentation/appstoreservernotifications/responsebodyv2decodedpayload and https://github.com/mozilla/fxa/blob/700ed771860da450add97d62f7e6faf2ead0c6ba/packages/fxa-shared/payments/iap/apple-app-store/subscription-purchase.ts#L115-L171
INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/iap/#parse_apple_receipt-udf","title":"parse_apple_receipt (UDF)","text":"Used to parse provider_receipt_json in mozilla vpn subscriptions where provider is \"APPLE\". The content is documented at https://developer.apple.com/documentation/appstorereceipts/responsebody
"},{"location":"mozfun/iap/#parameters_3","title":"Parameters","text":"INPUTS
provider_receipt_json STRING\n
OUTPUTS
STRUCT<environment STRING, latest_receipt BYTES, latest_receipt_info ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, pending_renewal_info ARRAY<STRUCT<auto_renew_product_id STRING, auto_renew_status INT64, expiration_intent INT64, is_in_billing_retry_period INT64, original_transaction_id STRING, product_id STRING>>, receipt STRUCT<adam_id INT64, app_item_id INT64, application_version STRING, bundle_id STRING, download_id INT64, in_app ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, original_application_version STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, receipt_creation_date STRING, receipt_creation_date_ms INT64, receipt_creation_date_pst STRING, receipt_type STRING, request_date STRING, request_date_ms INT64, request_date_pst STRING, version_external_identifier INT64>, status INT64>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/iap/#scrub_apple_receipt-udf","title":"scrub_apple_receipt (UDF)","text":"Take output from mozfun.iap.parse_apple_receipt and remove fields or reduce their granularity so that the returned value can be exposed to all employees via redash.
"},{"location":"mozfun/iap/#parameters_4","title":"Parameters","text":"INPUTS
apple_receipt ANY TYPE\n
OUTPUTS
STRUCT<environment STRING, active_period STRUCT<start_date DATE, end_date DATE, start_time TIMESTAMP, end_time TIMESTAMP, `interval` STRING, interval_count INT64>, trial_period STRUCT<start_time TIMESTAMP, end_time TIMESTAMP>>\n
Source | Edit
"},{"location":"mozfun/json/","title":"json","text":"Functions for parsing Mozilla-specific JSON data types.
"},{"location":"mozfun/json/#extract_int_map-udf","title":"extract_int_map (UDF)","text":"Returns an array of key/value structs from a string representing a JSON map. Both keys and values are cast to integers.
This is the format for the \"values\" field in the desktop telemetry histogram JSON representation.
"},{"location":"mozfun/json/#parameters","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/json/#from_map-udf","title":"from_map (UDF)","text":"Converts a standard \"map\" like datastructure array<struct<key, value>>
into a JSON value.
Convert the standard Array<Struct<key, value>>
style maps to JSON
values.
INPUTS
input JSON\n
OUTPUTS
json\n
Source | Edit
"},{"location":"mozfun/json/#from_nested_map-udf","title":"from_nested_map (UDF)","text":"Converts a nested JSON object with repeated key/value pairs into a nested JSON object.
Convert a JSON object like { \"metric\": [ {\"key\": \"extra\", \"value\": 2 } ] }
to a JSON
object like { \"metric\": { \"key\": 2 } }
.
This only works on JSON types.
"},{"location":"mozfun/json/#parameters_2","title":"Parameters","text":"OUTPUTS
json\n
Source | Edit
"},{"location":"mozfun/json/#js_extract_string_map-udf","title":"js_extract_string_map (UDF)","text":"Returns an array of key/value structs from a string representing a JSON map.
BigQuery Standard SQL JSON functions are insufficient to implement this function, so JS is being used and it may not perform well with large or numerous inputs.
Non-string non-null values are encoded as json.
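A minimal sketch (the JSON literal is an assumed example):
SELECT\n  mozfun.json.js_extract_string_map('{\"search_engine\": \"duckduckgo\", \"count\": 2}') AS string_map\n-- returns an ARRAY<STRUCT<key STRING, value STRING>>; the non-string value is encoded as JSON\n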
"},{"location":"mozfun/json/#parameters_3","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/json/#mode_last-udf","title":"mode_last (UDF)","text":"Returns the most frequently occuring element in an array of json-compatible elements. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored.
"},{"location":"mozfun/json/#parameters_4","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/ltv/","title":"Ltv","text":""},{"location":"mozfun/ltv/#android_states_v1-udf","title":"android_states_v1 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"
"},{"location":"mozfun/ltv/#parameters","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_v2-udf","title":"android_states_v2 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"
"},{"location":"mozfun/ltv/#parameters_1","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_with_paid_v1-udf","title":"android_states_with_paid_v1 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_organic_2_1\" and \"0_dow1_paid_1_1\"
These states include whether a client was paid or organic.
"},{"location":"mozfun/ltv/#parameters_2","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_with_paid_v2-udf","title":"android_states_with_paid_v2 (UDF)","text":"Get the state of a user on a day, with paid/organic cohorts included. Compared to V1, these states have a \"dead\" state, determined by \"dead_time\". The model can use this state as a sink, where the client will never return if they are dead.
"},{"location":"mozfun/ltv/#parameters_3","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#desktop_states_v1-udf","title":"desktop_states_v1 (UDF)","text":"LTV states for Desktop. Results in strings like: \"0_1_1_1_1\" Where each component is 1. the age in days of the client 2. the day of week of first_seen_date 3. the day of week of submission_date 4. the activity level, possible values are 0-3, plus \"00\" for \"dead\" 5. whether the client is active on submission_date
"},{"location":"mozfun/ltv/#parameters_4","title":"Parameters","text":"INPUTS
days_since_first_seen INT64, days_since_active INT64, submission_date DATE, first_seen_date DATE, death_time INT64, pattern INT64, active INT64, max_days INT64, lookback INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#get_state_ios_v2-udf","title":"get_state_ios_v2 (UDF)","text":"LTV states for iOS.
"},{"location":"mozfun/ltv/#parameters_5","title":"Parameters","text":"INPUTS
days_since_first_seen INT64, days_since_seen INT64, submission_date DATE, death_time INT64, pattern INT64, active INT64, max_weeks INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/","title":"map","text":"Functions for working with arrays of key/value structs.
"},{"location":"mozfun/map/#extract_keyed_scalar_sum-udf","title":"extract_keyed_scalar_sum (UDF)","text":"Sums all values in a keyed scalar.
"},{"location":"mozfun/map/#extract-keyed-scalar-sum","title":"Extract Keyed Scalar Sum","text":"Takes a keyed scalar and returns a single number: the sum of all values it contains. The expected input type is ARRAY<STRUCT<key STRING, value INT64>>
The return type is INT64
.
The key
field will be ignored.
INPUTS
keyed_scalar ARRAY<STRUCT<key STRING, value INT64>>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/map/#from_lists-udf","title":"from_lists (UDF)","text":"Create a map from two arrays (like zipping)
"},{"location":"mozfun/map/#parameters_1","title":"Parameters","text":"INPUTS
keys ANY TYPE, `values` ANY TYPE\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/map/#get_key-udf","title":"get_key (UDF)","text":"Fetch the value associated with a given key from an array of key/value structs.
Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields.
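For example (a sketch with an inline map literal):
SELECT\n  mozfun.map.get_key(\n    [STRUCT('a' AS key, 1 AS value), STRUCT('b', 2)],\n    'b'\n  ) AS value_for_b\n-- expected: 2\n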
"},{"location":"mozfun/map/#parameters_2","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
Source | Edit
"},{"location":"mozfun/map/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":"Fetch the value associated with a given key from an array of key/value structs.
Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields. This version matches NULL keys as well.
"},{"location":"mozfun/map/#parameters_3","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/#mode_last-udf","title":"mode_last (UDF)","text":"Combine entries from multiple maps, determine the value for each key using mozfun.stats.mode_last.
"},{"location":"mozfun/map/#parameters_4","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"mozfun/map/#set_key-udf","title":"set_key (UDF)","text":"Set a key to a value in a map. If you call map.get_key after setting, the value you set will be returned.
map.set_key
Set a key to a specific value in a map. We represent maps as Arrays of Key/Value structs: ARRAY<STRUCT<key ANY TYPE, value ANY TYPE>>
.
The type of the key and value you are setting must match the types in the map itself.
"},{"location":"mozfun/map/#parameters_5","title":"Parameters","text":"INPUTS
map ANY TYPE, new_key ANY TYPE, new_value ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/#sum-udf","title":"sum (UDF)","text":"Return the sum of values by key in an array of map entries. The expected schema for entries is ARRAY>, where the type for value must be supported by SUM, which allows numeric data types INT64, NUMERIC, and FLOAT64."},{"location":"mozfun/map/#parameters_6","title":"Parameters","text":"
INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"mozfun/marketing/","title":"Marketing","text":""},{"location":"mozfun/marketing/#parse_ad_group_name-udf","title":"parse_ad_group_name (UDF)","text":"Please provide a description for the routine
"},{"location":"mozfun/marketing/#parse-ad-group-name-udf","title":"Parse Ad Group Name UDF","text":"This function takes a ad group name and parses out known segments. These segments are things like country, language, or audience; multiple ad groups can share segments.
We use versioned ad group names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the ad group name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
See also: marketing.parse_campaign_name
, which does the same, but for campaign names.
INPUTS
ad_group_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/marketing/#parse_campaign_name-udf","title":"parse_campaign_name (UDF)","text":"Parse a campaign name. Extracts things like region, country_code, and language.
"},{"location":"mozfun/marketing/#parse-campaign-name-udf","title":"Parse Campaign Name UDF","text":"This function takes a campaign name and parses out known segments. These segments are things like country, language, or audience; multiple campaigns can share segments.
We use versioned campaign names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the campaign name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
"},{"location":"mozfun/marketing/#parameters_1","title":"Parameters","text":"INPUTS
campaign_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/marketing/#parse_creative_name-udf","title":"parse_creative_name (UDF)","text":"Parse segments from a creative name.
"},{"location":"mozfun/marketing/#parse-creative-name-udf","title":"Parse Creative Name UDF","text":"This function takes a creative name and parses out known segments. These segments are things like country, language, or audience; multiple creatives can share segments.
We use versioned creative names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the creative name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
See also: marketing.parse_campaign_name
, which does the same, but for campaign names.
INPUTS
creative_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/mobile_search/","title":"Mobile search","text":""},{"location":"mozfun/mobile_search/#normalize_app_name-udf","title":"normalize_app_name (UDF)","text":"Returns normalized_app_name and normalized_app_name_os (for mobile search tables only).
"},{"location":"mozfun/mobile_search/#normalized-app-and-os-name-for-mobile-search-related-tables","title":"Normalized app and os name for mobile search related tables","text":"Takes app name and os as input : Returns a struct of normalized_app_name and normalized_app_name_os based on discussion provided here
"},{"location":"mozfun/mobile_search/#parameters","title":"Parameters","text":"INPUTS
app_name STRING, os STRING\n
OUTPUTS
STRUCT<normalized_app_name STRING, normalized_app_name_os STRING>\n
Source | Edit
"},{"location":"mozfun/norm/","title":"norm","text":"Functions for normalizing data.
"},{"location":"mozfun/norm/#browser_version_info-udf","title":"browser_version_info (UDF)","text":"Adds metadata related to the browser version in a struct.
This is a temporary solution that allows browser version analysis. It should eventually be replaced with one or more browser version tables that serves as a source of truth for version releases.
"},{"location":"mozfun/norm/#parameters","title":"Parameters","text":"INPUTS
version_string STRING\n
OUTPUTS
STRUCT<version STRING, major_version NUMERIC, minor_version NUMERIC, patch_revision NUMERIC, is_major_release BOOLEAN>\n
Source | Edit
"},{"location":"mozfun/norm/#diff_months-udf","title":"diff_months (UDF)","text":"Determine the number of whole months after grace period between start and end. Month is dependent on timezone, so start and end must both be datetimes, or both be dates, in the correct timezone. Grace period can be used to account for billing delay, usually 1 day, and is counted after months. When inclusive is FALSE, start and end are not included in whole months. For example, diff_months(start => '2021-01-01', end => '2021-03-01', grace_period => INTERVAL 0 day, inclusive => FALSE) returns 1, because start plus two months plus grace period is not less than end. Changing inclusive to TRUE returns 2, because start plus two months plus grace period is less than or equal to end. diff_months(start => '2021-01-01', end => '2021-03-02 00:00:00.000001', grace_period => INTERVAL 1 DAY, inclusive => FALSE) returns 2, because start plus two months plus grace period is less than end.
"},{"location":"mozfun/norm/#parameters_1","title":"Parameters","text":"INPUTS
start DATETIME, `end` DATETIME, grace_period INTERVAL, inclusive BOOLEAN\n
Source | Edit
"},{"location":"mozfun/norm/#extract_version-udf","title":"extract_version (UDF)","text":"Extracts numeric version data from a version string like <major>.<minor>.<patch>
.
Note: Non-zero minor and patch versions will be floating point Numeric
.
Usage:
SELECT\n mozfun.norm.extract_version(version_string, 'major') as major_version,\n mozfun.norm.extract_version(version_string, 'minor') as minor_version,\n mozfun.norm.extract_version(version_string, 'patch') as patch_version\n
Example using \"96.05.01\"
:
SELECT\n mozfun.norm.extract_version('96.05.01', 'major') as major_version, -- 96\n mozfun.norm.extract_version('96.05.01', 'minor') as minor_version, -- 5\n mozfun.norm.extract_version('96.05.01', 'patch') as patch_version -- 1\n
"},{"location":"mozfun/norm/#parameters_2","title":"Parameters","text":"INPUTS
version_string STRING, extraction_level STRING\n
OUTPUTS
NUMERIC\n
Source | Edit
"},{"location":"mozfun/norm/#fenix_app_info-udf","title":"fenix_app_info (UDF)","text":"Returns canonical, human-understandable identification info for Fenix sources.
The Glean telemetry library for Android by design routes pings based on the Play Store appId value of the published application. As of August 2020, there have been 5 separate Play Store appId values associated with different builds of Fenix, each corresponding to different datasets in BigQuery, and the mapping of appId to logical app names (Firefox vs. Firefox Preview) and channel names (nightly, beta, or release) has changed over time; see the spreadsheet of naming history for Mozilla's mobile browsers.
This function is intended as the source of truth for how to map a specific ping in BigQuery to a logical app name and channel. It should be expected that the output of this function may evolve over time. If we rename a product or channel, we may choose to update the values here so that analyses consistently get the new name.
The first argument (app_id
) can be fairly fuzzy; it is tolerant of actual Google Play Store appId values like 'org.mozilla.firefox_beta' (mix of periods and underscores) as well as BigQuery dataset names with suffixes like 'org_mozilla_firefox_beta_stable'.
The second argument (app_build_id
) should be the value in client_info.app_build.
The function returns a STRUCT
that contains the logical app_name
and channel
as well as the Play Store app_id
in the canonical form which would appear in Play Store URLs.
Note that the naming of Fenix applications changed on 2020-07-03, so to get a continuous view of the pings associated with a logical app channel, you may need to union together tables from multiple BigQuery datasets. To see data for all Fenix channels together, it is necessary to union together tables from all 5 datasets. For basic usage information, consider using telemetry.fenix_clients_last_seen
which already handles the union. Otherwise, see the example below as a template for how construct a custom union.
Mapping of channels to datasets:
org_mozilla_firefox
org_mozilla_firefox_beta
(current) and org_mozilla_fenix
org_mozilla_fenix
(current), org_mozilla_fennec_aurora
, and org_mozilla_fenix_nightly
-- Example of a query over all Fenix builds advertised as \"Firefox Beta\"\nCREATE TEMP FUNCTION extract_fields(app_id STRING, m ANY TYPE) AS (\n (\n SELECT AS STRUCT\n m.submission_timestamp,\n m.metrics.string.geckoview_version,\n mozfun.norm.fenix_app_info(app_id, m.client_info.app_build).*\n )\n);\n\nWITH base AS (\n SELECT\n extract_fields('org_mozilla_firefox_beta', m).*\n FROM\n `mozdata.org_mozilla_firefox_beta.metrics` AS m\n UNION ALL\n SELECT\n extract_fields('org_mozilla_fenix', m).*\n FROM\n `mozdata.org_mozilla_fenix.metrics` AS m\n)\nSELECT\n DATE(submission_timestamp) AS submission_date,\n geckoview_version,\n COUNT(*)\nFROM\n base\nWHERE\n app_name = 'Fenix' -- excludes 'Firefox Preview'\n AND channel = 'beta'\n AND DATE(submission_timestamp) = '2020-08-01'\nGROUP BY\n submission_date,\n geckoview_version\n
"},{"location":"mozfun/norm/#parameters_3","title":"Parameters","text":"INPUTS
app_id STRING, app_build_id STRING\n
OUTPUTS
STRUCT<app_name STRING, channel STRING, app_id STRING>\n
Source | Edit
"},{"location":"mozfun/norm/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.
Fenix originally used an 8-digit app_build format
In short it is yDDDHHmm
:
The last date seen with an 8-digit build ID is 2020-08-10.
Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.
This function tolerates both formats.
After using this you may wish to DATETIME_TRUNC(result, DAY)
for grouping by build date.
INPUTS
app_build STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/norm/#firefox_android_package_name_to_channel-udf","title":"firefox_android_package_name_to_channel (UDF)","text":"Map Fenix package name to the channel name
"},{"location":"mozfun/norm/#parameters_5","title":"Parameters","text":"INPUTS
package_name STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/norm/#get_earliest_value-udf","title":"get_earliest_value (UDF)","text":"This UDF returns the earliest not-null value pair and datetime from a list of values and their corresponding timestamp.
The function will return the first value pair in the input array, that is not null and has the earliest timestamp.
Because there may be more than one value on the same date e.g. more than one value reported by different pings on the same date, the dates must be given as TIMESTAMPS and the values as STRING.
Usage:
SELECT\n mozfun.norm.get_earliest_value(ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>) AS <alias>\n
"},{"location":"mozfun/norm/#parameters_6","title":"Parameters","text":"INPUTS
value_set ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>\n
OUTPUTS
STRUCT<earliest_value STRING, earliest_value_source STRING, earliest_date DATETIME>\n
Source | Edit
"},{"location":"mozfun/norm/#get_windows_info-udf","title":"get_windows_info (UDF)","text":"Exract the name, the version name, the version number, and the build number corresponding to a Microsoft Windows operating system version string in the form of .. or ... for most release versions of Windows after 2007."},{"location":"mozfun/norm/#windows-names-versions-and-builds","title":"Windows Names, Versions, and Builds","text":""},{"location":"mozfun/norm/#summary","title":"Summary","text":"
This function is primarily designed to parse the field os_version
in table mozdata.default_browser_agent.default_browser
. Given a Microsoft Windows OS version string, the function returns the name of the operating system, the version name, the version number, and the build number corresponding to the operating system. As of November 2022, the parser can handle 99.89% of the os_version
values collected in table mozdata.default_browser_agent.default_browser
.
As of November 2022, the expected valid values of os_version
are either x.y.z
or w.x.y.z
where w
, x
, y
, and z
are integers.
As of November 2022, the return values for Windows 10 and Windows 11 are based on Windows 10 release information and Windows 11 release information. For 3-number version strings, the parser assumes the valid values of z
in x.y.z
are at most 5 digits in length. For 4-number version strings, the parser assumes the valid values of z
in w.x.y.z
are at most 6 digits in length. The function makes an educated effort to handle Windows Vista, Windows 7, Windows 8, and Windows 8.1 information, but does not guarantee the return values are absolutely accurate. The function assumes the presence of undocumented non-release versions of Windows 10 and Windows 11, and will return an estimated name, version number, build number but not the version name. The function does not handle other versions of Windows.
As of November 2022, the parser currently handles just over 99.89% of data in the field os_version
in table mozdata.default_browser_agent.default_browser
.
Note: Microsoft convention for build numbers for Windows 10 and 11 include two numbers, such as build number 22621.900
for version 22621
. The first number repeats the version number and the second number uniquely identifies the build within the version. To simplify data processing and data analysis, this function returns the second unique identifier as an integer instead of returning the full build number as a string.
SELECT\n `os_version`,\n mozfun.norm.get_windows_info(`os_version`) AS windows_info\nFROM `mozdata.default_browser_agent.default_browser`\nWHERE `submission_timestamp` > (CURRENT_TIMESTAMP() - INTERVAL 7 DAY) AND LEFT(document_id, 2) = '00'\nLIMIT 1000\n
"},{"location":"mozfun/norm/#mapping","title":"Mapping","text":"os_version windows_name windows_version_name windows_version_number windows_build_number 6.0.z Windows Vista 6.0 6.0 z 6.1.z Windows 7 7.0 6.1 z 6.2.z Windows 8 8.0 6.2 z 6.3.z Windows 8.1 8.1 6.3 z 10.0.10240.z Windows 10 1507 10240 z 10.0.10586.z Windows 10 1511 10586 z 10.0.14393.z Windows 10 1607 14393 z 10.0.15063.z Windows 10 1703 15063 z 10.0.16299.z Windows 10 1709 16299 z 10.0.17134.z Windows 10 1803 17134 z 10.0.17763.z Windows 10 1809 17763 z 10.0.18362.z Windows 10 1903 18362 z 10.0.18363.z Windows 10 1909 18363 z 10.0.19041.z Windows 10 2004 19041 z 10.0.19042.z Windows 10 20H2 19042 z 10.0.19043.z Windows 10 21H1 19043 z 10.0.19044.z Windows 10 21H2 19044 z 10.0.19045.z Windows 10 22H2 19045 z 10.0.y.z Windows 10 UNKNOWN y z 10.0.22000.z Windows 11 21H2 22000 z 10.0.22621.z Windows 11 22H2 22621 z 10.0.y.z Windows 11 UNKNOWN y z all other values (null) (null) (null) (null)"},{"location":"mozfun/norm/#parameters_7","title":"Parameters","text":"INPUTS
os_version STRING\n
OUTPUTS
STRUCT<name STRING, version_name STRING, version_number DECIMAL, build_number INT64>\n
Source | Edit
"},{"location":"mozfun/norm/#glean_baseline_client_info-udf","title":"glean_baseline_client_info (UDF)","text":"Accepts a glean client_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
"},{"location":"mozfun/norm/#parameters_8","title":"Parameters","text":"INPUTS
client_info ANY TYPE, metrics ANY TYPE\n
OUTPUTS
string\n
Source | Edit
"},{"location":"mozfun/norm/#glean_ping_info-udf","title":"glean_ping_info (UDF)","text":"Accepts a glean ping_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
"},{"location":"mozfun/norm/#parameters_9","title":"Parameters","text":"INPUTS
ping_info ANY TYPE\n
Source | Edit
"},{"location":"mozfun/norm/#metadata-udf","title":"metadata (UDF)","text":"Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.
"},{"location":"mozfun/norm/#parameters_10","title":"Parameters","text":"INPUTS
metadata ANY TYPE\n
OUTPUTS
`date`, CAST(NULL\n
Source | Edit
"},{"location":"mozfun/norm/#os-udf","title":"os (UDF)","text":"Normalize an operating system string to one of the three major desktop platforms, one of the two major mobile platforms, or \"Other\".
This is a reimplementation of logic used in the data pipeline to populate normalized_os
.
INPUTS
os STRING\n
Source | Edit
"},{"location":"mozfun/norm/#product_info-udf","title":"product_info (UDF)","text":"Returns a normalized app_name
and canonical_app_name
for a product based on legacy_app_name
and normalized_os
values. Thus, this function serves as a bridge to get from legacy application identifiers to the consistent identifiers we are using for reporting in 2021.
As of 2021, most Mozilla products are sending telemetry via the Glean SDK, with Glean telemetry in active development for desktop Firefox as well. The probeinfo
API is the single source of truth for metadata about applications sending Glean telemetry; the values for app_name
and canonical_app_name
returned here correspond to the \"end-to-end identifier\" values documented in the v2 Glean app listings endpoint . For non-Glean telemetry, we provide values in the same style to provide continuity as we continue the migration to Glean.
For legacy telemetry pings like main
ping for desktop and core
ping for mobile products, the legacy_app_name
given as input to this function should come from the submission URI (stored as metadata.uri.app_name
in BigQuery ping tables). For Glean pings, we have invented product
values that can be passed in to this function as the legacy_app_name
parameter.
The returned app_name
values are intended to be readable and unambiguous, but short and easy to type. They are suitable for use as a key in derived tables. product
is a deprecated field that was similar in intent.
The returned canonical_app_name
is more verbose and is suited for displaying in visualizations. canonical_name
is a synonym that we provide for historical compatibility with previous versions of this function.
The returned struct also contains boolean contributes_to_2021_kpi
as the canonical reference for whether the given application is included in KPI reporting. Additional fields may be added for future years.
The normalized_os
value that's passed in should be the top-level normalized_os
value present in any ping table or you may want to wrap a raw value in mozfun.norm.os
like mozfun.norm.product_info(app_name, mozfun.norm.os(os))
.
This function also tolerates passing in a product
value as legacy_app_name
so that this function is still useful for derived tables which have thrown away the raw app_name
value from legacy pings.
The mappings are as follows:
| legacy_app_name | normalized_os | app_name | product | canonical_app_name | 2019 | 2020 | 2021 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Firefox | * | firefox_desktop | Firefox | Firefox for Desktop | true | true | true |
| Fenix | Android | fenix | Fenix | Firefox for Android (Fenix) | true | true | true |
| Fennec | Android | fennec | Fennec | Firefox for Android (Fennec) | true | true | true |
| Firefox Preview | Android | firefox_preview | Firefox Preview | Firefox Preview for Android | true | true | true |
| Fennec | iOS | firefox_ios | Firefox iOS | Firefox for iOS | true | true | true |
| FirefoxForFireTV | Android | firefox_fire_tv | Firefox Fire TV | Firefox for Fire TV | false | false | false |
| FirefoxConnect | Android | firefox_connect | Firefox Echo | Firefox for Echo Show | true | true | false |
| Zerda | Android | firefox_lite | Firefox Lite | Firefox Lite | true | true | false |
| Zerda_cn | Android | firefox_lite_cn | Firefox Lite CN | Firefox Lite (China) | false | false | false |
| Focus | Android | focus_android | Focus Android | Firefox Focus for Android | true | true | true |
| Focus | iOS | focus_ios | Focus iOS | Firefox Focus for iOS | true | true | true |
| Klar | Android | klar_android | Klar Android | Firefox Klar for Android | false | false | false |
| Klar | iOS | klar_ios | Klar iOS | Firefox Klar for iOS | false | false | false |
| Lockbox | Android | lockwise_android | Lockwise Android | Lockwise for Android | true | true | false |
| Lockbox | iOS | lockwise_ios | Lockwise iOS | Lockwise for iOS | true | true | false |
| FirefoxReality* | Android | firefox_reality | Firefox Reality | Firefox Reality | false | false | false |
"},{"location":"mozfun/norm/#parameters_12","title":"Parameters","text":"INPUTS
legacy_app_name STRING, normalized_os STRING\n
OUTPUTS
STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n
Source | Edit
"},{"location":"mozfun/norm/#result_type_to_product_name-udf","title":"result_type_to_product_name (UDF)","text":"Convert urlbar result types into product-friendly names
This UDF converts result types from urlbar events (engagement, impression, abandonment) into product-friendly names.
"},{"location":"mozfun/norm/#parameters_13","title":"Parameters","text":"INPUTS
res STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/norm/#truncate_version-udf","title":"truncate_version (UDF)","text":"Truncates a version string like <major>.<minor>.<patch>
to either the major or minor version. The return value is NUMERIC
, which means that you can sort the results without fear (e.g. 100 will be categorized as greater than 80, which isn't the case when sorting lexigraphically).
For example, \"5.1.0\" would be translated to 5.1
if the parameter is \"minor\" or 5
if the parameter is major.
If the version is only a major and/or minor version, then it will be left unchanged (for example \"10\" would stay as 10
when run through this function, no matter what the arguments).
This is useful for grouping Linux and Mac operating system versions inside aggregate datasets or queries where there may be many different patch releases in the field.
"},{"location":"mozfun/norm/#parameters_14","title":"Parameters","text":"INPUTS
os_version STRING, truncation_level STRING\n
OUTPUTS
NUMERIC\n
Source | Edit
"},{"location":"mozfun/norm/#vpn_attribution-udf","title":"vpn_attribution (UDF)","text":"Accepts vpn attribution fields as input and returns a struct of normalized fields.
"},{"location":"mozfun/norm/#parameters_15","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRUCT<normalized_acquisition_channel STRING, normalized_campaign STRING, normalized_content STRING, normalized_medium STRING, normalized_source STRING, website_channel_group STRING>\n
Source | Edit
"},{"location":"mozfun/norm/#windows_version_info-udf","title":"windows_version_info (UDF)","text":"Given an unnormalized set off Windows identifiers, return a friendly version of the operating system name.
Requires os, os_version and windows_build_number.
E.G. from windows_build_number >= 22000 return Windows 11
"},{"location":"mozfun/norm/#parameters_16","title":"Parameters","text":"INPUTS
os STRING, os_version STRING, windows_build_number INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/serp_events/","title":"serp_events","text":"Functions for working with Glean SERP events.
"},{"location":"mozfun/serp_events/#ad_blocker_inferred-udf","title":"ad_blocker_inferred (UDF)","text":"Determine whether an ad blocker is inferred to be in use on a SERP. True if all loaded ads are blocked.
"},{"location":"mozfun/serp_events/#parameters","title":"Parameters","text":"INPUTS
num_loaded INT, num_blocked INT\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/serp_events/#is_ad_component-udf","title":"is_ad_component (UDF)","text":"Determine whether a SERP display component referenced in the serp events contains monetizable ads
"},{"location":"mozfun/serp_events/#parameters_1","title":"Parameters","text":"INPUTS
component STRING\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/stats/","title":"stats","text":"Statistics functions.
"},{"location":"mozfun/stats/#mode_last-udf","title":"mode_last (UDF)","text":"Returns the most frequently occuring element in an array.
In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored. See also: stats.mode_last_retain_nulls
, which retains nulls.
INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/stats/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":"Returns the most frequently occuring element in an array. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are retained. See also: `stats.mode_last, which ignores nulls.
"},{"location":"mozfun/stats/#parameters_1","title":"Parameters","text":"INPUTS
list ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/utils/","title":"Utils","text":""},{"location":"mozfun/utils/#diff_query_schemas-stored-procedure","title":"diff_query_schemas (Stored Procedure)","text":"Diff the schemas of two queries. Especially useful when the BigQuery error is truncated, and the schemas of e.g. a UNION don't match.
Diff the schemas of two queries. Especially useful when the BigQuery error is truncated, and the schemas of e.g. a UNION don't match.
Use it like:
DECLARE res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>;\nCALL mozfun.utils.diff_query_schemas(\"\"\"SELECT * FROM a\"\"\", \"\"\"SELECT * FROM b\"\"\", res);\n-- See entire schema entries, if you need context\nSELECT res;\n-- See just the elements that differ\nSELECT * FROM UNNEST(res) WHERE differs;\n
You'll be able to view the results of \"res\" to compare the schemas of the two queries, and hopefully find what doesn't match.
"},{"location":"mozfun/utils/#parameters","title":"Parameters","text":"INPUTS
query_a STRING, query_b STRING\n
OUTPUTS
res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>\n
Source | Edit
"},{"location":"mozfun/utils/#extract_utm_from_url-udf","title":"extract_utm_from_url (UDF)","text":"Extract UTM parameters from URL. Returns a STRUCT UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
This UDF extracts UTM parameters from a URL string.
UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
"},{"location":"mozfun/utils/#parameters_1","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRUCT<utm_source STRING, utm_medium STRING, utm_campaign STRING, utm_content STRING, utm_term STRING>\n
Source | Edit
"},{"location":"mozfun/utils/#get_url_path-udf","title":"get_url_path (UDF)","text":"Extract the Path from a URL
This UDF extracts path from a URL string.
The path is everything after the host and before parameters. This function returns \"/\" if there is no path.
"},{"location":"mozfun/utils/#parameters_2","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/","title":"vpn","text":"Functions for processing VPN data.
"},{"location":"mozfun/vpn/#acquisition_channel-udf","title":"acquisition_channel (UDF)","text":"Assign an acquisition channel based on utm parameters
"},{"location":"mozfun/vpn/#parameters","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/#channel_group-udf","title":"channel_group (UDF)","text":"Assign a channel group based on utm parameters
"},{"location":"mozfun/vpn/#parameters_1","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/#normalize_utm_parameters-udf","title":"normalize_utm_parameters (UDF)","text":"Normalize utm parameters to use the same NULL placeholders as Google Analytics
"},{"location":"mozfun/vpn/#parameters_2","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRUCT<utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING>\n
Source | Edit
"},{"location":"mozfun/vpn/#pricing_plan-udf","title":"pricing_plan (UDF)","text":"Combine the pricing and interval for a subscription plan into a single field
"},{"location":"mozfun/vpn/#parameters_3","title":"Parameters","text":"INPUTS
provider STRING, amount INTEGER, currency STRING, `interval` STRING, interval_count INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"reference/airflow_tags/","title":"Airflow Tags","text":""},{"location":"reference/airflow_tags/#why","title":"Why","text":"Airflow tags enable DAGs to be filtered in the web ui view to reduce the number of DAGs shown to just those that you are interested in.
Additionally, their objective is to provide a little bit more information such as their impact to make it easier to understand the DAG and impact of failures when doing Airflow triage.
More information and the discussions can be found the the original Airflow Tags Proposal (can be found within data org proposals/
folder).
We borrow the tiering system used by our integration and testing sheriffs. This is to maintain a level of consistency across different systems to ensure common language and understanding across teams. Valid tier tags include:
This tag is meant to provide guidance to a triage engineer on how to respond to a specific DAG failure when the job owner does not want the standard process to be followed.
The behaviour of bqetl
can be configured via the bqetl_project.yaml
file. This file, for example, specifies the queries that should be skipped during dryrun, views that should not be published and contains various other configurations.
The general structure of bqetl_project.yaml
is as follows:
dry_run:\n function: https://us-central1-moz-fx-data-shared-prod.cloudfunctions.net/bigquery-etl-dryrun\n test_project: bigquery-etl-integration-test\n skip:\n - sql/moz-fx-data-shared-prod/account_ecosystem_derived/desktop_clients_daily_v1/query.sql\n - sql/**/apple_ads_external*/**/query.sql\n # - ...\n\nviews:\n skip_validation:\n - sql/moz-fx-data-test-project/test/simple_view/view.sql\n - sql/moz-fx-data-shared-prod/mlhackweek_search/events/view.sql\n - sql/moz-fx-data-shared-prod/**/client_deduplication/view.sql\n # - ...\n skip_publishing:\n - activity_stream/tile_id_types/view.sql\n - pocket/pocket_reach_mau/view.sql\n # - ...\n non_user_facing_suffixes:\n - _derived\n - _external\n # - ...\n\nschema:\n skip_update:\n - sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/schema.yaml\n # - ...\n skip_prefixes:\n - pioneer\n - rally\n\nroutines:\n skip_publishing:\n - sql/moz-fx-data-shared-prod/udf/main_summary_scalars/udf.sql\n\nformatting:\n skip:\n - bigquery_etl/glam/templates/*.sql\n - sql/moz-fx-data-shared-prod/telemetry/fenix_events_v1/view.sql\n - stored_procedures/safe_crc32_uuid.sql\n # - ...\n
"},{"location":"reference/configuration/#accessing-configurations","title":"Accessing configurations","text":"ConfigLoader
can be used in the bigquery_etl tooling codebase to access configuration parameters. bqetl_project.yaml
is automatically loaded in ConfigLoader
and parameters can be accessed via a get()
method:
from bigquery_etl.config import ConfigLoader\n\nskipped_formatting = cfg.get(\"formatting\", \"skip\", fallback=[])\ndry_run_function = cfg.get(\"dry_run\", \"function\", fallback=None)\nschema_config_dict = cfg.get(\"schema\")\n
The ConfigLoader.get()
method allows multiple string parameters to reference a configuration value that is stored in a nested structure. A fallback
value can be optionally provided in case the configuration parameter is not set.
New configuration parameters can simply be added to bqetl_project.yaml
. ConfigLoader.get()
allows for these new parameters simply to be referenced without needing to be changed or updated.
Instructions on how to add data checks can be found in the Adding data checks section below.
"},{"location":"reference/data_checks/#background","title":"Background","text":"To create more confidence and trust in our data is crucial to provide some form of data checks. These checks should uncover problems as soon as possible, ideally as part of the data process creating the data. This includes checking that the data produced follows certain assumptions determined by the dataset owner. These assumptions need to be easy to define, but at the same time flexible enough to encode more complex business logic. For example, checks for null columns, for range/size properties, duplicates, table grain etc.
"},{"location":"reference/data_checks/#bqetl-data-checks-to-the-rescue","title":"bqetl Data Checks to the Rescue","text":"bqetl data checks aim to provide this ability by providing a simple interface for specifying our \"assumptions\" about the data the query should produce and checking them against the actual result.
This easy interface is achieved by providing a number of jinja templates providing \"out-of-the-box\" logic for performing a number of common checks without having to rewrite the logic. For example, checking if any nulls are present in a specific column. These templates can be found here and are available as jinja macros inside the checks.sql
files. This allows to \"configure\" the logic by passing some details relevant to our specific dataset. Check templates will get rendered as raw SQL expressions. Take a look at the examples below for practical examples.
It is also possible to write checks using raw SQL by using assertions. This is, for example, useful when writing checks for custom business logic.
"},{"location":"reference/data_checks/#two-categories-of-checks","title":"Two categories of checks","text":"Each check needs to be categorised with a marker, currently following markers are available:
#fail
indicates that the ETL pipeline should stop if this check fails (circuit-breaker pattern) and a notification is sent out. This marker should be used for checks that indicate a serious data issue.#warn
indicates that the ETL pipeline should continue even if this check fails. These types of checks can be used to indicate potential issues that might require more manual investigation.
"},{"location":"reference/data_checks/#adding-data-checks","title":"Adding Data Checks","text":""},{"location":"reference/data_checks/#create-checkssql","title":"Create checks.sql","text":"Inside the query directory, which usually contains query.sql
or query.py
, metadata.yaml
and schema.yaml
, create a new file called checks.sql
(unless already exists).
Please make sure each check you add contains a marker (see: the Two categories of checks section above).
Once checks have been added, we need to regenerate the DAG
responsible for scheduling the query.
If checks.sql
already exists for the query, you can always add additional checks to the file by appending it to the list of already defined checks.
When adding additional checks there should be no need to have to regenerate the DAG responsible for scheduling the query as all checks are executed using a single Airflow task.
"},{"location":"reference/data_checks/#removing-checkssql","title":"Removing checks.sql","text":"All checks can be removed by deleting the checks.sql
file and regenerating the DAG responsible for scheduling the query.
Alternatively, specific checks can be removed by deleting them from the checks.sql
file.
Checks can either be written as raw SQL, or by referencing existing Jinja macros defined in tests/checks
which may take different parameters used to generate the SQL check expression.
Example of what a checks.sql
may look like:
-- raw SQL checks\n#fail\nASSERT (\n SELECT\n COUNTIF(ISNULL(country)) / COUNT(*)\n FROM telemetry.table_v1\n WHERE submission_date = @submission_date\n ) > 0.2\n) AS \"More than 20% of clients have country set to NULL\";\n\n-- macro checks\n#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n\n#warn\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n\n#fail\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n\n#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#data-checks-available-with-examples","title":"Data Checks Available with Examples","text":""},{"location":"reference/data_checks/#accepted_values-source","title":"accepted_values (source)","text":"Usage:
Arguments:\n\ncolumn: str - name of the column to check\nvalues: List[str] - list of accepted values\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ accepted_values(\"column_1\", [\"value_1\", \"value_2\"],\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#in_range-source","title":"in_range (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which we want to check the values of.\nmin: Optional[int] - Minimum value we should observe in the specified columns.\nmax: Optional[int] - Maximum value we should observe in the specified columns.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#is_unique-source","title":"is_unique (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which should produce a unique record.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n
"},{"location":"reference/data_checks/#min_row_countsource","title":"min_row_count(source)","text":"Usage:
Arguments:\n\nthreshold: Optional[int] - The minimum number of rows we expect (default: 1)\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#fail\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#not_null-source","title":"not_null (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which should not contain a null value.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n
Please keep in mind that the checks described above can be combined and specified in the same checks.sql
file. For example:
#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n\n#fail\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n\n#fail\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\") }}\n\n#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#row_count_within_past_partitions_avgsource","title":"row_count_within_past_partitions_avg(source)","text":"Compares the row count of the current partition to the average of number_of_days
past partitions and checks if the row count is within the average +- threshold_percentage
%
Usage:
Arguments:\n\nnumber_of_days: int - Number of days we are comparing the row count to\nthreshold_percentage: int - How many percent above or below the average row count is ok.\npartition_field: Optional[str] - What column is the partition_field (default = \"submission_date\")\n
Example:
#fail\n{{ row_count_within_past_partitions_avg(7, 5, \"submission_date\") }}\n
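To make the comparison concrete, the example above (7 past partitions, 5% threshold) amounts to an assertion roughly of the following shape; this is a hedged sketch against the hypothetical telemetry.table_v1 table, not the SQL the macro actually generates:
#fail\nASSERT (\n  (SELECT COUNT(*) FROM telemetry.table_v1 WHERE submission_date = @submission_date)\n  BETWEEN\n    (SELECT AVG(cnt) * 0.95 FROM (\n      SELECT COUNT(*) AS cnt\n      FROM telemetry.table_v1\n      WHERE submission_date BETWEEN DATE_SUB(@submission_date, INTERVAL 7 DAY)\n        AND DATE_SUB(@submission_date, INTERVAL 1 DAY)\n      GROUP BY submission_date))\n  AND\n    (SELECT AVG(cnt) * 1.05 FROM (\n      SELECT COUNT(*) AS cnt\n      FROM telemetry.table_v1\n      WHERE submission_date BETWEEN DATE_SUB(@submission_date, INTERVAL 7 DAY)\n        AND DATE_SUB(@submission_date, INTERVAL 1 DAY)\n      GROUP BY submission_date))\n) AS \"Row count is more than 5% above or below the 7-day average\";\n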
"},{"location":"reference/data_checks/#value_lengthsource","title":"value_length(source)","text":"Checks that the column has values of specific character length.
Usage:
Arguments:\n\ncolumn: str - Column which will be checked against the `expected_length`.\nexpected_length: int - Describes the expected character length of the value inside the specified columns.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\n
Example:
#warn\n{{ value_length(column=\"country\", expected_length=2, where=\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#matches_patternsource","title":"matches_pattern(source)","text":"Checks that the column values adhere to a pattern based on a regex expression.
Usage:
Arguments:\n\ncolumn: str - Column whose values will be checked against the regex.\npattern: str - Regex pattern specifying the expected shape / pattern of the values inside the column.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\nthreshold_fail_percentage: Optional[int] - Percentage of how many rows can fail the check before causing it to fail.\nmessage: Optional[str]: Custom error message.\n
Example:
#warn\n{{ matches_pattern(column=\"country\", pattern=\"^[A-Z]{2}$\", where=\"submission_date = @submission_date\", threshold_fail_percentage=10, message=\"Oops\") }}\n
"},{"location":"reference/data_checks/#running-checks-locally-commands","title":"Running checks locally / Commands","text":"To list all available commands in the bqetl data checks CLI:
$ ./bqetl check\n\nUsage: bqetl check [OPTIONS] COMMAND [ARGS]...\n\n Commands for managing and running bqetl data checks.\n\n \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\n IN ACTIVE DEVELOPMENT\n\n The current progress can be found under:\n\n https://mozilla-hub.atlassian.net/browse/DENG-919\n\n \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\nOptions:\n --help Show this message and exit.\n\nCommands:\n render Renders data check query using parameters provided (OPTIONAL).\n run Runs data checks defined for the dataset (checks.sql).\n
To see how to use a specific command, use:
$ ./bqetl check [command] --help\n
render
$ ./bqetl check render [OPTIONS] DATASET [ARGS]\n\nRenders data check query using parameters provided (OPTIONAL). The result\nis what would be used to run a check to ensure that the specified dataset\nadheres to the assumptions defined in the corresponding checks.sql file\n\nOptions:\n --project-id, --project_id TEXT\n GCP project ID\n --sql_dir, --sql-dir DIRECTORY Path to directory which contains queries.\n --help Show this message and exit.\n
"},{"location":"reference/data_checks/#example","title":"Example","text":"./bqetl check render --project_id=moz-fx-data-marketing-prod ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01\n
run
$ ./bqetl check run [OPTIONS] DATASET\n\nRuns data checks defined for the dataset (checks.sql).\n\nChecks can be validated using the `--dry_run` flag without executing them:\n\nOptions:\n --project-id, --project_id TEXT\n GCP project ID\n --sql_dir, --sql-dir DIRECTORY Path to directory which contains queries.\n --dry_run, --dry-run To dry run the query to make sure it is\n valid\n --marker TEXT Marker to filter checks.\n --help Show this message and exit.\n
"},{"location":"reference/data_checks/#examples","title":"Examples","text":"# to run checks for a specific dataset\n$ ./bqetl check run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail --marker=warn\n\n# to only dry_run the checks\n$ ./bqetl check run --dry_run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail\n
"},{"location":"reference/incremental/","title":"Incremental Queries","text":""},{"location":"reference/incremental/#benefits","title":"Benefits","text":"WRITE_TRUNCATE
mode or bq query --replace
to replace partitions atomically to prevent duplicate data@submission_date
query parametersubmission_date
matching the query parametersql/moz-fx-data-shared-prod/clients_last_seen_v1.sql
can be run serially on any 28 day period and the last day will be the same whether or not the partition preceding the first day was missing because values are only impacted by 27 preceding daysFor background, see Accessing Public Data on docs.telemetry.mozilla.org
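To illustrate the shape of such queries, here is a minimal, hedged sketch of an incremental query.sql (the table and column names are hypothetical): it reads exactly one day controlled by the @submission_date parameter, so each scheduled run can atomically overwrite a single date partition.
SELECT\n  submission_date,\n  country,\n  COUNT(*) AS row_count\nFROM\n  telemetry.example_source_v1  -- hypothetical source table\nWHERE\n  submission_date = @submission_date\nGROUP BY\n  submission_date,\n  country\n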
.
public_bigquery
flag must be set in metadata.yaml
mozilla-public-data
GCP project which is accessible by everyone, also external userspublic_json
flag must be set in metadata.yaml
000000000000.json
, 000000000001.json
, ...)incremental_export
controls how data should be exported as JSON:false
: all data of the source table gets exported to a single locationtrue
: only data that matches the submission_date
parameter is exported as JSON to a separate directory for this datemetadata.json
gets published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.jsonlast_updated
, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updatedsql/<project>/<dataset>/<table>_<version>/query.sql
e.g.<project>
defines both where the destination table resides and in which project the query job runs sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql
sql/<project>/<dataset>/<table>_<version>/query.sql
as abovesql/<project>/query_type.sql.py
e.g. sql/moz-fx-data-shared-prod/clients_daily.sql.py
--source telemetry_core_parquet_v3
to generate sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql
and using --source main_summary_v4
to generate sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql
-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet\n
moz-fx-data-shared-prod
the project prefix should be omitted to simplify testing. (Other projects do need the project prefix)_
prefix in generated column names not meant for output_bits
suffix for any integer column that represents a bit patternDATETIME
type, due to incompatibility with spark-bigquery-connector*_stable
tables instead of including custom deduplicationdocument_id
by submission_timestamp
where filtering duplicates is necessarymozdata
project which are duplicates of views in another project (commonly moz-fx-data-shared-prod
). Refer to the original view instead.{{ metrics.calculate() }}
: SELECT\n *\nFROM\n {{ metrics.calculate(\n metrics=['days_of_use', 'active_hours'],\n platform='firefox_desktop',\n group_by={'sample_id': 'sample_id', 'channel': 'application.channel'},\n where='submission_date = \"2023-01-01\"'\n ) }}\n\n-- this translates to\nSELECT\n *\nFROM\n (\n WITH clients_daily AS (\n SELECT\n client_id AS client_id,\n submission_date AS submission_date,\n COALESCE(SUM(active_hours_sum), 0) AS active_hours,\n COUNT(submission_date) AS days_of_use,\n FROM\n mozdata.telemetry.clients_daily\n GROUP BY\n client_id,\n submission_date\n )\n SELECT\n clients_daily.client_id,\n clients_daily.submission_date,\n active_hours,\n days_of_use,\n FROM\n clients_daily\n )\n
metrics
: unique reference(s) to metric definition, all metric definitions are aggregations (e.g. SUM, AVG, ...)platform
: platform to compute metrics for (e.g. firefox_desktop
, firefox_ios
, fenix
, ...)group_by
: fields used in the GROUP BY statement; this is a dictionary where the key represents the alias, the value is the field path; GROUP BY
always includes the configured client_id
and submission_date
fieldswhere
: SQL filter clausegroup_by_client_id
: Whether the field configured as client_id
(defined as part of the data source specification in metric-hub) should be part of the GROUP BY
. True
by defaultgroup_by_submission_date
: Whether the field configured as submission_date
(defined as part of the data source specification in metric-hub) should be part of the GROUP BY
. True
by default{{ metrics.data_source() }}
: SELECT\n *\nFROM\n {{ metrics.data_source(\n data_source='main',\n platform='firefox_desktop',\n where='submission_date = \"2023-01-01\"'\n ) }}\n\n-- this translates to\nSELECT\n *\nFROM\n (\n SELECT *\n FROM `mozdata.telemetry.main`\n WHERE submission_date = \"2023-01-01\"\n )\n
./bqetl query render path/to/query.sql
generated-sql
branch has rendered queries/views/UDFs./bqetl query run
does support running Jinja queriesmetadata.yaml
file should be created in the same directoryfriendly_name: SSL Ratios\ndescription: >\n Percentages of page loads Firefox users have performed that were\n conducted over SSL broken down by country.\nowners:\n - example@mozilla.com\nlabels:\n application: firefox\n incremental: true # incremental queries add data to existing tables\n schedule: daily # scheduled in Airflow to run daily\n public_json: true\n public_bigquery: true\n review_bugs:\n - 1414839 # Bugzilla bug ID of data review\n incremental_export: false # non-incremental JSON export writes all data to a single location\n
sql/<project>/<dataset>/<table>/view.sql
e.g. sql/moz-fx-data-shared-prod/telemetry/core/view.sql
fx-data-dev@mozilla.org
moz-fx-data-shared-prod
project; the scripts/publish_views
tooling can handle parsing the definitions to publish to other projects such as derived-datasets
mozdata
project which are duplicates of views in another project (commonly moz-fx-data-shared-prod
). Refer to the original view instead.BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
.sql
e.g. mode_last.sql
udf/
directory and JS UDFs must be defined in the udf_js
directoryudf_legacy/
directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto.CREATE OR REPLACE FUNCTION
syntax<dir_name>.
so, for example, all functions in udf/*.sql
are part of the udf
datasetCREATE OR REPLACE FUNCTION <dir_name>.<file_name>
scripts/publish_persistent_udfs
for publishing these UDFs to BigQuerySQL
over js
for performanceNULL
for new data and EXCEPT
to exclude from views until droppedSELECT\n job_type,\n state,\n submission_date,\n destination_dataset_id,\n destination_table_id,\n total_terabytes_billed,\n total_slot_ms,\n error_location,\n error_reason,\n error_message\nFROM\n moz-fx-data-shared-prod.monitoring.bigquery_usage\nWHERE\n submission_date <= CURRENT_DATE()\n AND destination_dataset_id LIKE \"%backfills_staging_derived%\"\n AND destination_table_id LIKE \"%{{ your table name }}%\"\nORDER BY\n submission_date DESC\n
dags.yaml
dags.yaml
, e.g., by adding the following: bqetl_ssl_ratios: # name of the DAG; must start with bqetl_\n schedule_interval: 0 2 * * * # query schedule\n description: The DAG schedules SSL ratios queries.\n default_args:\n owner: example@mozilla.com\n start_date: \"2020-04-05\" # YYYY-MM-DD\n email: [\"example@mozilla.com\"]\n retries: 2 # number of retries if the query execution fails\n retry_delay: 30m\n
bqetl_
as prefix.schedule_interval
is either defined as a CRON expression or alternatively as one of the following CRON presets: once
, hourly
, daily
, weekly
, monthly
start_date
defines the first date for which the query should be executedstart_date
is set in the past, backfilling can be done via the Airflow web interfaceemail
lists email addresses alerts should be sent to in case of failures when running the querybqetl
CLI by running bqetl dag create bqetl_ssl_ratios --schedule_interval='0 2 * * *' --owner=\"example@mozilla.com\" --start_date=\"2020-04-05\" --description=\"This DAG generates SSL ratios.\"
metadata.yaml
file that includes a scheduling
section, for example: friendly_name: SSL ratios\n# ... more metadata, see Query Metadata section above\nscheduling:\n dag_name: bqetl_ssl_ratios\n
depends_on_past
keeps query from getting executed if the previous schedule for the query hasn't succeededdate_partition_parameter
- by default set to submission_date
; can be set to null
if query doesn't write to a partitioned tableparameters
specifies a list of query parameters, e.g. [\"n_clients:INT64:500\"]
arguments
- a list of arguments passed when running the query, for example: [\"--append_table\"]
referenced_tables
- manually curated list of tables a Python or BigQuery script depends on; for query.sql
files dependencies will get determined automatically and should only be overwritten manually if really necessarymultipart
indicates whether a query is split over multiple files part1.sql
, part2.sql
, ...depends_on
defines external dependencies in telemetry-airflow that are not detected automatically: depends_on:\n - task_id: external_task\n dag_name: external_dag\n execution_delta: 1h\n
task_id
: name of task query depends ondag_name
: name of the DAG the external task is part ofexecution_delta
: time difference between the schedule_intervals
of the external DAG and the DAG the query is part ofdepends_on_tables_existing
defines tables that the ETL will await the existence of via an Airflow sensor before running: depends_on_tables_existing:\n - task_id: wait_for_foo_bar_baz\n table_id: 'foo.bar.baz_{{ ds_nodash }}'\n poke_interval: 30m\n timeout: 12h\n retries: 1\n retry_delay: 10m\n
task_id
: ID to use for the generated Airflow sensor task.table_id
: Fully qualified ID of the table to wait for, including the project and dataset.poke_interval
: Time that the sensor should wait in between each check, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default poke interval is 5 minutes).timeout
: Time allowed before the sensor times out and fails, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default timeout is 8 hours).retries
: The number of retries that should be performed if the sensor times out or otherwise fails. This parameter is optional (the default depends on how the DAG is configured).retry_delay
: Time delay between retries, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default depends on how the DAG is configured).depends_on_table_partitions_existing
defines table partitions that the ETL will await the existence of via an Airflow sensor before running: depends_on_table_partitions_existing:\n - task_id: wait_for_foo_bar_baz\n table_id: foo.bar.baz\n partition_id: '{{ ds_nodash }}'\n poke_interval: 30m\n timeout: 12h\n retries: 1\n retry_delay: 10m\n
task_id
: ID to use for the generated Airflow sensor task.table_id
: Fully qualified ID of the table to check, including the project and dataset. Note that the service account airflow-access@moz-fx-data-shared-prod.iam.gserviceaccount.com
will need to have the BigQuery Job User role on the project and read access to the dataset.partition_id
: ID of the partition to wait for.poke_interval
: Time that the sensor should wait in between each check, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default poke interval is 5 minutes).timeout
: Time allowed before the sensor times out and fails, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default timeout is 8 hours).retries
: The number of retries that should be performed if the sensor times out or otherwise fails. This parameter is optional (the default depends on how the DAG is configured).retry_delay
: Time delay between retries, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default depends on how the DAG is configured).trigger_rule
: The rule that determines when the airflow task that runs this query should run. The default is all_success
(\"trigger this task when all directly upstream tasks have succeeded\"); other rules can allow a task to run even if not all preceding tasks have succeeded. See the Airflow docs for the list of trigger rule options.destination_table
: The table to write to. If unspecified, defaults to the query destination; if None, no destination table is used (the query is simply run as-is). Note that if no destination table is specified, you will need to specify the submission_date
parameter manuallyexternal_downstream_tasks
defines external downstream dependencies for which ExternalTaskMarker
s will be added to the generated DAG. These task markers ensure that when the task is cleared for triggering a rerun, all downstream tasks are automatically cleared as well. external_downstream_tasks:\n - task_id: external_downstream_task\n dag_name: external_dag\n execution_delta: 1h\n
bqetl
CLI: ./bqetl query schedule path/to/query_v1 --dag bqetl_ssl_ratios
./bqetl dag generate
dags/
directory./bqetl dag generate bqetl_ssl_ratios
main
. CI automatically generates DAGs and writes them to the telemetry-airflow-dags repo from where Airflow will pick them updepends_on_fivetran:\n - task_id: fivetran_import_1\n - task_id: another_fivetran_import\n
<task_id>_connector_id
in the Airflow admin interface for each import taskBefore changes, such as adding new fields to existing datasets or adding new datasets, can be deployed to production, bigquery-etl's CI (continuous integration) deploys these changes to a stage environment and uses these stage artifacts to run its various checks.
Currently, the bigquery-etl-integration-test
project serves as the stage environment. CI does have read and write access, but does at no point publish actual data to this project. Only UDFs, table schemas and views are published. The project itself does not have access to any production project, like mozdata
, so stage artifacts cannot reference any other artifacts that live in production.
Deploying artifacts to stage follows the following steps: 1. Once a new pull-request gets created in bigquery-etl, CI will pull in the generated-sql
branch to determine all files that show any changes compared to what is deployed in production (it is assumed that the generated-sql
branch reflects the artifacts currently deployed in production). All of these changed artifacts (UDFs, tables and views) will be deployed to the stage environment. * This CI step runs after the generate-sql
CI step to ensure that checks will also be executed on generated queries and to ensure schema.yaml
files have been automatically created for queries. 2. The bqetl
CLI has a command to run stage deploys, which is called in the CI: ./bqetl stage deploy --dataset-suffix=$CIRCLE_SHA1 $FILE_PATHS
* --dataset-suffix
will result in the artifacts being deployed to datasets that are suffixed by the current commit hash. This is to prevent any conflicts when deploying changes for the same artifacts in parallel and helps with debugging deployed artifacts. 3. For every artifacts that gets deployed to stage all dependencies need to be determined and deployed to the stage environment as well since the stage environment doesn't have access to production. Before these artifacts get actually deployed, they need to be determined first by traversing artifact definitions. * Determining dependencies is only relevant for UDFs and views. For queries, available schema.yaml
files will simply be deployed. * For UDFs, if a UDF does call another UDF then this UDF needs to be deployed to stage as well. * For views, if a view references another view, table or UDF then each of these referenced artifacts needs to be available on stage as well, otherwise the view cannot even be deployed to stage. * If artifacts are referenced that are not defined as part of the bigquery-etl repo (like stable or live tables) then their schema will get determined and a placeholder query.sql
file will be created * Also dependencies of dependencies need to be deployed, and so on 4. Once all artifacts that need to be deployed have been determined, all references to these artifacts in existing SQL files need to be updated. These references will need to point to the stage project and the temporary datasets that artifacts will be published to. * Artifacts that get deployed are determined from the files that got changed and any artifacts that are referenced in the SQL definitions of these files, as well as their references and so on. 5. To run the deploy, all artifacts will be copied to sql/bigquery-etl-integration-test
into their corresponding temporary datasets. * Also if any existing SQL tests the are related to changed artifacts will have their referenced artifacts updated and will get copied to a bigquery-etl-integration-test
folder * The deploy is executed in the order of: UDFs, tables, views * UDFs and views get deployed in a way that ensures that the right order of deployments (e.g. dependencies need to be deployed before the views referencing them) 6. Once the deploy has been completed, the CI will use these staged artifacts to run its tests 7. After checks have succeeded, the deployed artifacts will be removed from stage * By default the table expiration is set to 1 hour * This step will also automatically remove any tables and datasets that got previously deployed, are older than an hour but haven't been removed (for example due to some CI check failing)
After CI checks have passed and the pull-request has been approved, changes can be merged to main
. Once a new version of bigquery-etl has been published the changes can be deployed to production through the bqetl_artifact_deployment
Airflow DAG. For more information on artifact deployments to production see: https://docs.telemetry.mozilla.org/concepts/pipeline/artifact_deployment.html
Local changes can be deployed to stage using the ./bqetl stage deploy
command:
./bqetl stage deploy \\\n --dataset-suffix=test \\\n --copy-sql-to-tmp-dir \\\n sql/moz-fx-data-shared-prod/firefox_ios/new_profile_activation/view.sql \\\n sql/mozfun/map/sum/udf.sql\n
Files (for example ones with changes) that should be deployed to stage need to be specified. The stage deploy
accepts the following parameters: * --dataset-suffix
is an optional suffix that will be added to the datasets deployed to stage * --copy-sql-to-tmp-dir
copies SQL stored in sql/
to a temporary folder. Reference updates and any other modifications required to run the stage deploy will be performed in this temporary directory. This is an optional parameter. If not specified, changes get applied to the files directly and can be reverted, for example, by running git checkout -- sql/
* (optional) --remove-updated-artifacts
removes artifact files that have been deployed from the \"prod\" folders. This ensures that tests don't run on outdated or undeployed artifacts.
Deployed stage artifacts can be deleted from bigquery-etl-integration-test
by running:
./bqetl stage clean --delete-expired --dataset-suffix=test\n
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"bqetl/","title":"bqetl CLI","text":"The bqetl
command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.
Running some commands, for example to create or query tables, will require Mozilla GCP access.
"},{"location":"bqetl/#installation","title":"Installation","text":"Follow the Quick Start to set up bigquery-etl and the bqetl CLI.
"},{"location":"bqetl/#configuration","title":"Configuration","text":"bqetl
can be configured via the bqetl_project.yaml
file. See Configuration to find available configuration options.
To list all available commands in the bqetl CLI:
$ ./bqetl\n\nUsage: bqetl [OPTIONS] COMMAND [ARGS]...\n\n CLI tools for working with bigquery-etl.\n\nOptions:\n --version Show the version and exit.\n --help Show this message and exit.\n\nCommands:\n alchemer Commands for importing alchemer data.\n dag Commands for managing DAGs.\n dependency Build and use query dependency graphs.\n dryrun Dry run SQL.\n format Format SQL.\n glam Tools for GLAM ETL.\n mozfun Commands for managing mozfun routines.\n query Commands for managing queries.\n routine Commands for managing routines.\n stripe Commands for Stripe ETL.\n view Commands for managing views.\n backfill Commands for managing backfills.\n
See help for any command:
$ ./bqetl [command] --help\n
"},{"location":"bqetl/#autocomplete","title":"Autocomplete","text":"CLI autocomplete for bqetl
can be enabled for bash and zsh shells using the script/bqetl_complete
script:
source script/bqetl_complete\n
Then pressing tab after bqetl
commands should print possible commands, e.g. for zsh:
% bqetl query<TAB><TAB>\nbackfill -- Run a backfill for a query.\ncreate -- Create a new query with name...\ninfo -- Get information about all or specific...\ninitialize -- Run a full backfill on the destination...\nrender -- Render a query Jinja template.\nrun -- Run a query.\n...\n
source script/bqetl_complete
can also be added to ~/.bashrc
or ~/.zshrc
to persist settings across shell instances.
For more details on shell completion, see the click documentation.
"},{"location":"bqetl/#query","title":"query
","text":"Commands for managing queries.
"},{"location":"bqetl/#create","title":"create
","text":"Create a new query with name ., for example: telemetry_derived.active_profiles. Use the --project_id
option to change the project the query is added to; default is moz-fx-data-shared-prod
. Views are automatically generated in the publicly facing dataset.
Usage
$ ./bqetl query create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--owner: Owner of the query (email address)\n--dag: Name of the DAG the query should be scheduled under.If there is no DAG name specified, the query isscheduled by default in DAG bqetl_default.To skip the automated scheduling use --no_schedule.To see available DAGs run `bqetl dag info`.To create a new DAG run `bqetl dag create`.\n--no_schedule: Using this option creates the query without scheduling information. Use `bqetl query schedule` to add it manually if required.\n
Examples
./bqetl query create telemetry_derived.deviations_v1 \\\n --owner=example@mozilla.com\n\n\n# The query version gets autocompleted to v1. Queries are created in the\n# _derived dataset and accompanying views in the public dataset.\n./bqetl query create telemetry.deviations --owner=example@mozilla.com\n
"},{"location":"bqetl/#schedule","title":"schedule
","text":"Schedule an existing query
Usage
$ ./bqetl query schedule [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--dag: Name of the DAG the query should be scheduled under. To see available DAGs run `bqetl dag info`. To create a new DAG run `bqetl dag create`.\n--depends_on_past: Only execute query if previous scheduled run succeeded.\n--task_name: Custom name for the Airflow task. By default the task name is a combination of the dataset and table name.\n
Examples
./bqetl query schedule telemetry_derived.deviations_v1 \\\n --dag=bqetl_deviations\n\n\n# Set a specific name for the task\n./bqetl query schedule telemetry_derived.deviations_v1 \\\n --dag=bqetl_deviations \\\n --task-name=deviations\n
"},{"location":"bqetl/#info","title":"info
","text":"Get information about all or specific queries.
Usage
$ ./bqetl query info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Get info for specific queries\n./bqetl query info telemetry_derived.*\n\n\n# Get cost and last update timestamp information\n./bqetl query info telemetry_derived.clients_daily_v6 \\\n --cost --last_updated\n
"},{"location":"bqetl/#backfill","title":"backfill
","text":"Run a backfill for a query. Additional parameters will get passed to bq.
Usage
$ ./bqetl query backfill [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--start_date: First date to be backfilled\n--end_date: Last date to be backfilled\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--dry_run: Dry run the backfill\n--max_rows: How many rows to return in the result\n--parallelism: How many threads to run backfill in parallel\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query.\n--checks: Whether to run checks during backfill\n--custom_query_path: Name of a custom query to run the backfill. If not given, the proces runs as usual.\n--checks_file_name: Name of a custom data checks file to run after each partition backfill. E.g. custom_checks.sql. Optional.\n--scheduling_overrides: Pass overrides as a JSON string for scheduling sections: parameters and/or date_partition_parameter as needed.\n
Examples
# Backfill for specific date range\n# second comment line\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31\n\n\n# Dryrun backfill for specific date range and exclude date\n./bqetl query backfill telemetry_derived.ssl_ratios_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31 \\\n --exclude=2021-03-03 \\\n --dry_run\n
"},{"location":"bqetl/#run","title":"run
","text":"Run a query. Additional parameters will get passed to bq. If a destination_table is set, the query result will be written to BigQuery. Without a destination_table specified, the results are not stored. If the name
is not found within the sql/
folder bqetl assumes it hasn't been generated yet and will start the generating process for all sql_generators/
files. This generation process will take some time and run dryrun calls against BigQuery but this is expected. Additional parameters (all parameters that are not specified in the Options) must come after the query-name. Otherwise the first parameter that is not an option is interpreted as the query-name and since it can't be found the generation process will start.
Usage
$ ./bqetl query run [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--public_project_id: Project with publicly accessible data\n--destination_table: Destination table name results are written to. If not set, the query result will not be written to BigQuery.\n--dataset_id: Destination dataset results are written to. If not set, determines destination dataset based on query.\n
Examples
# Run a query by name\n./bqetl query run telemetry_derived.ssl_ratios_v1\n\n\n# Run a query file\n./bqetl query run /path/to/query.sql\n\n\n# Run a query and save the result to BigQuery\n./bqetl query run telemetry_derived.ssl_ratios_v1 --project_id=moz-fx-data-shared-prod --dataset_id=telemetry_derived --destination_table=ssl_ratios_v1\n
"},{"location":"bqetl/#run-multipart","title":"run-multipart
","text":"Run a multipart query.
Usage
$ ./bqetl query run-multipart [OPTIONS] [query_dir]\n\nOptions:\n\n--using: comma separated list of join columns to use when combining results\n--parallelism: Maximum number of queries to execute concurrently\n--dataset_id: Default dataset, if not specified all tables must be qualified with dataset\n--project_id: GCP project ID\n--temp_dataset: Dataset where intermediate query results will be temporarily stored, formatted as PROJECT_ID.DATASET_ID\n--destination_table: table where combined results will be written\n--time_partitioning_field: time partition field on the destination table\n--clustering_fields: comma separated list of clustering fields on the destination table\n--dry_run: Print bytes that would be processed for each part and don't run queries\n--parameters: query parameter(s) to pass when running parts\n--priority: Priority for BigQuery query jobs; BATCH priority will significantly slow down queries if reserved slots are not enabled for the billing project; defaults to INTERACTIVE\n--schema_update_options: Optional options for updating the schema.\n
Examples
# Run a multipart query\n./bqetl query run_multipart /path/to/query.sql\n
"},{"location":"bqetl/#validate","title":"validate
","text":"Validate a query. Checks formatting, scheduling information and dry runs the query.
Usage
$ ./bqetl query validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --ignore-dryrun-skip.\n--no_dryrun: Skip running dryrun. Default is False.\n
Examples
./bqetl query validate telemetry_derived.clients_daily_v6\n\n\n# Validate query not in shared-prod\n./bqetl query validate \\\n --use_cloud_function=false \\\n --project_id=moz-fx-data-marketing-prod \\\n ga_derived.blogs_goals_v1\n
"},{"location":"bqetl/#initialize","title":"initialize
","text":"Run a full backfill on the destination table for the query. Using this command will: - Create the table if it doesn't exist and run a full backfill. - Run a full backfill if the table exists and is empty. - Raise an exception if the table exists and has data, or if the table exists and the schema doesn't match the query. It supports query.sql
files that use the is_init() pattern. To run in parallel per sample_id, include a @sample_id parameter in the query.
Usage
$ ./bqetl query initialize [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n--dry_run: Dry run the initialization\n--parallelism: Number of threads for parallel processing\n--skip_existing: Skip initialization for existing artifacts, otherwise initialization is run for empty tables.\n--force: Run the initialization even if the destination table contains data.\n
Examples
Examples:\n - For init.sql files: ./bqetl query initialize telemetry_derived.ssl_ratios_v1\n - For query.sql files and parallel run: ./bqetl query initialize sql/moz-fx-data-shared-prod/telemetry_derived/clients_first_seen_v2/query.sql\n
"},{"location":"bqetl/#render","title":"render
","text":"Render a query Jinja template.
Usage
$ ./bqetl query render [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--output_dir: Output directory generated SQL is written to. If not specified, rendered queries are printed to console.\n--parallelism: Number of threads for parallel processing\n
Examples
./bqetl query render telemetry_derived.ssl_ratios_v1 \\\n --output-dir=/tmp\n
"},{"location":"bqetl/#schema","title":"schema
","text":"Commands for managing query schemas.
"},{"location":"bqetl/#update","title":"update
","text":"Update the query schema based on the destination table schema and the query schema. If no schema.yaml file exists for a query, one will be created.
Usage
$ ./bqetl query schema update [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--update_downstream: Update downstream dependencies. GCP authentication required.\n--tmp_dataset: GCP datasets for creating updated tables temporarily.\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--parallelism: Number of threads for parallel processing\n--is_init: Indicates whether the `is_init()` condition should be set to true of false.\n
Examples
./bqetl query schema update telemetry_derived.clients_daily_v6\n\n# Update schema including downstream dependencies (requires GCP)\n./bqetl query schema update telemetry_derived.clients_daily_v6 --update-downstream\n
"},{"location":"bqetl/#deploy","title":"deploy
","text":"Deploy the query schema.
Usage
$ ./bqetl query schema deploy [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--force: Deploy the schema file without validating that it matches the query\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n--skip_existing: Skip updating existing tables. This option ensures that only new tables get deployed.\n--skip_external_data: Skip publishing external data, such as Google Sheets.\n--destination_table: Destination table name results are written to. If not set, determines destination table based on query. Must be fully qualified (project.dataset.table).\n--parallelism: Number of threads for parallel processing\n
Examples
./bqetl query schema deploy telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#validate_1","title":"validate
","text":"Validate the query schema
Usage
$ ./bqetl query schema validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--respect_dryrun_skip: Respect or ignore dry run skip configuration. Default is --respect-dryrun-skip.\n
Examples
./bqetl query schema validate telemetry_derived.clients_daily_v6\n
"},{"location":"bqetl/#dag","title":"dag
","text":"Commands for managing DAGs.
"},{"location":"bqetl/#info_1","title":"info
","text":"Get information about available DAGs.
Usage
$ ./bqetl dag info [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--with_tasks: Include scheduled tasks\n
Examples
# Get information about all available DAGs\n./bqetl dag info\n\n# Get information about a specific DAG\n./bqetl dag info bqetl_ssl_ratios\n\n# Get information about a specific DAG including scheduled tasks\n./bqetl dag info --with_tasks bqetl_ssl_ratios\n
"},{"location":"bqetl/#create_1","title":"create
","text":"Create a new DAG with name bqetl_, for example: bqetl_search When creating new DAGs, the DAG name must have a bqetl_
prefix. Created DAGs are added to the dags.yaml
file.
Usage
$ ./bqetl dag create [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--schedule_interval: Schedule interval of the new DAG. Schedule intervals can be either in CRON format or one of: once, hourly, daily, weekly, monthly, yearly or a timedelta []d[]h[]m\n--owner: Email address of the DAG owner\n--description: Description for DAG\n--tag: Tag to apply to the DAG\n--start_date: First date for which scheduled queries should be executed\n--email: Email addresses that Airflow will send alerts to\n--retries: Number of retries Airflow will attempt in case of failures\n--retry_delay: Time period Airflow will wait after failures before running failed tasks again\n
Examples
./bqetl dag create bqetl_core \\\n--schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"Tables derived from `core` pings sent by mobile applications.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-25\n\n\n# Create DAG and overwrite default settings\n./bqetl dag create bqetl_ssl_ratios --schedule-interval=\"0 2 * * *\" \\\n--owner=example@mozilla.com \\\n--description=\"The DAG schedules SSL ratios queries.\" \\\n--tag=impact/tier_1 \\\n--start-date=2019-07-20 \\\n--email=example2@mozilla.com \\\n--email=example3@mozilla.com \\\n--retries=2 \\\n--retry_delay=30m\n
"},{"location":"bqetl/#generate","title":"generate
","text":"Generate Airflow DAGs from DAG definitions.
Usage
$ ./bqetl dag generate [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n
Examples
# Generate all DAGs\n./bqetl dag generate\n\n# Generate a specific DAG\n./bqetl dag generate bqetl_ssl_ratios\n
"},{"location":"bqetl/#remove","title":"remove
","text":"Remove a DAG. This will also remove the scheduling information from the queries that were scheduled as part of the DAG.
Usage
$ ./bqetl dag remove [OPTIONS] [name]\n\nOptions:\n\n--dags_config: Path to dags.yaml config file\n--sql_dir: Path to directory which contains queries.\n--output_dir: Path directory with generated DAGs\n
Examples
# Remove a specific DAG\n./bqetl dag remove bqetl_vrbrowser\n
"},{"location":"bqetl/#dependency","title":"dependency
","text":"Build and use query dependency graphs.
"},{"location":"bqetl/#show","title":"show
","text":"Show table references in sql files.
Usage
$ ./bqetl dependency show [OPTIONS] [paths]\n
"},{"location":"bqetl/#record","title":"record
","text":"Record table references in metadata. Fails if metadata already contains references section.
Usage
$ ./bqetl dependency record [OPTIONS] [paths]\n
"},{"location":"bqetl/#dryrun","title":"dryrun
","text":"Dry run SQL. Uses the dryrun Cloud Function by default which only has access to shared-prod. To dryrun queries accessing tables in another project use set --use-cloud-function=false
and ensure that the command line has access to a GCP service account.
Usage
$ ./bqetl dryrun [OPTIONS] [paths]\n\nOptions:\n\n--use_cloud_function: Use the Cloud Function for dry running SQL, if set to `True`. The Cloud Function can only access tables in shared-prod. If set to `False`, use active GCP credentials for the dry run.\n--validate_schemas: Require dry run schema to match destination table and file if present.\n--respect_skip: Respect or ignore query skip configuration. Default is --respect-skip.\n--project: GCP project to perform dry run in when --use_cloud_function=False\n
Examples
Examples:\n./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/\n\n# Dry run SQL with tables that are not in shared prod\n./bqetl dryrun --use-cloud-function=false sql/moz-fx-data-marketing-prod/\n
"},{"location":"bqetl/#format","title":"format
","text":"Format SQL files.
Usage
$ ./bqetl format [OPTIONS] [paths]\n\nOptions:\n\n--check: do not write changes, just return status; return code 0 indicates nothing would change; return code 1 indicates some files would be reformatted\n--parallelism: Number of threads for parallel processing\n
Examples
# Format a specific file\n./bqetl format sql/moz-fx-data-shared-prod/telemetry/core/view.sql\n\n# Format all SQL files in `sql/`\n./bqetl format sql\n\n# Format standard in (will write to standard out)\necho 'SELECT 1,2,3' | ./bqetl format\n
"},{"location":"bqetl/#routine","title":"routine
","text":"Commands for managing routines for internal use.
"},{"location":"bqetl/#create_2","title":"create
","text":"Create a new routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_prodecure flag.
Usage
$ ./bqetl routine create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n
Examples
# Create a UDF\n./bqetl routine create --udf udf.array_slice\n\n\n# Create a stored procedure\n./bqetl routine create --stored_procedure udf.events_daily\n\n\n# Create a UDF in a project other than shared-prod\n./bqetl routine create --udf udf.active_last_week --project=moz-fx-data-marketing-prod\n
"},{"location":"bqetl/#info_2","title":"info
","text":"Get routine information.
Usage
$ ./bqetl routine info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n
Examples
# Get information about all internal routines in a specific dataset\n./bqetl routine info udf.*\n\n\n# Get usage information of specific routine\n./bqetl routine info --usages udf.get_key\n
"},{"location":"bqetl/#validate_2","title":"validate
","text":"Validate formatting of routines and run tests.
Usage
$ ./bqetl routine validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n
Examples
# Validate all routines\n./bqetl routine validate\n\n\n# Validate selected routines\n./bqetl routine validate udf.*\n
"},{"location":"bqetl/#publish","title":"publish
","text":"Publish routines to BigQuery. Requires service account access.
Usage
$ ./bqetl routine publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing udfs.\n
Examples
# Publish all routines\n./bqetl routine publish\n\n\n# Publish selected routines\n./bqetl routine validate udf.*\n
"},{"location":"bqetl/#rename","title":"rename
","text":"Rename routine or routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl routine rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Rename routine\n./bqetl routine rename udf.array_slice udf.list_slice\n\n\n# Rename routine matching a specific pattern\n./bqetl routine rename udf.array_* udf.list_*\n
"},{"location":"bqetl/#mozfun","title":"mozfun
","text":"Commands for managing public mozfun routines.
"},{"location":"bqetl/#create_3","title":"create
","text":"Create a new mozfun routine. Specify whether the routine is a UDF or stored procedure by adding a --udf or --stored_prodecure flag. UDFs are added to the mozfun
project.
Usage
$ ./bqetl mozfun create [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--udf: Create a new UDF\n--stored_procedure: Create a new stored procedure\n
Examples
# Create a UDF\n./bqetl mozfun create --udf bytes.zero_right\n\n\n# Create a stored procedure\n./bqetl mozfun create --stored_procedure event_analysis.events_daily\n
"},{"location":"bqetl/#info_3","title":"info
","text":"Get mozfun routine information.
Usage
$ ./bqetl mozfun info [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--usages: Show routine usages\n
Examples
# Get information about all internal routines in a specific dataset\n./bqetl mozfun info hist.*\n\n\n# Get usage information of specific routine\n./bqetl mozfun info --usages hist.mean\n
"},{"location":"bqetl/#validate_3","title":"validate
","text":"Validate formatting of mozfun routines and run tests.
Usage
$ ./bqetl mozfun validate [OPTIONS] [name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--docs_only: Only validate docs.\n
Examples
# Validate all routines\n./bqetl mozfun validate\n\n\n# Validate selected routines\n./bqetl mozfun validate hist.*\n
"},{"location":"bqetl/#publish_1","title":"publish
","text":"Publish mozfun routines. This command is used by Airflow only.
Usage
$ ./bqetl mozfun publish [OPTIONS] [name]\n\nOptions:\n\n--project_id: GCP project ID\n--dependency_dir: The directory JavaScript dependency files for UDFs are stored.\n--gcs_bucket: The GCS bucket where dependency files are uploaded to.\n--gcs_path: The GCS path in the bucket where dependency files are uploaded to.\n--dry_run: Dry run publishing udfs.\n
"},{"location":"bqetl/#rename_1","title":"rename
","text":"Rename mozfun routine or mozfun routine dataset. Replaces all usages in queries with the new name.
Usage
$ ./bqetl mozfun rename [OPTIONS] [name] [new_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Rename routine\n./bqetl mozfun rename hist.extract hist.ext\n\n\n# Rename routine matching a specific pattern\n./bqetl mozfun rename *.array_* *.list_*\n\n\n# Rename routine dataset\n./bqetl mozfun rename hist.* histogram.*\n
"},{"location":"bqetl/#backfill_1","title":"backfill
","text":"Commands for managing backfills.
"},{"location":"bqetl/#create_4","title":"create
","text":"Create a new backfill entry in the backfill.yaml file. Create a backfill.yaml file if it does not already exist.
Usage
$ ./bqetl backfill create [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--start_date: First date to be backfilled. Date format: yyyy-mm-dd\n--end_date: Last date to be backfilled. Date format: yyyy-mm-dd\n--exclude: Dates excluded from backfill. Date format: yyyy-mm-dd\n--watcher: Watcher of the backfill (email address)\n--custom_query_path: Path of the custom query to run the backfill. Optional.\n--shredder_mitigation: Wether to run a backfill using an auto-generated query that mitigates shredder effect.\n--billing_project: GCP project ID to run the query in. This can be used to run a query using a different slot reservation than the one used by the query's default project.\n
Examples
./bqetl backfill create moz-fx-data-shared-prod.telemetry_derived.deviations_v1 \\\n --start_date=2021-03-01 \\\n --end_date=2021-03-31 \\\n --exclude=2021-03-03 \\\n
"},{"location":"bqetl/#validate_4","title":"validate
","text":"Validate backfill.yaml file format and content.
Usage
$ ./bqetl backfill validate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
./bqetl backfill validate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# validate all backfill.yaml files if table is not specified\nUse the `--project_id` option to change the project to be validated;\ndefault is `moz-fx-data-shared-prod`.\n\n ./bqetl backfill validate\n
"},{"location":"bqetl/#info_4","title":"info
","text":"Get backfill(s) information from all or specific table(s).
Usage
$ ./bqetl backfill info [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Filter backfills with this status.\n
Examples
# Get info for specific table.\n./bqetl backfill info moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill info\n\n\n# Get info from all tables with specific status.\n./bqetl backfill info --status=Initiate\n
"},{"location":"bqetl/#scheduled","title":"scheduled
","text":"Get information on backfill(s) that require processing.
Usage
$ ./bqetl backfill scheduled [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n--status: Whether to get backfills to process or to complete.\n--json_path: None\n
Examples
# Get info for specific table.\n./bqetl backfill scheduled moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\n\n# Get info for all tables.\n./bqetl backfill scheduled\n
"},{"location":"bqetl/#initiate","title":"initiate
","text":"Process entry in backfill.yaml with Initiate status that has not yet been processed.
Usage
$ ./bqetl backfill initiate [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--parallelism: Maximum number of queries to execute concurrently\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Initiate backfill entry for specific table\n./bqetl backfill initiate moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"bqetl/#complete","title":"complete
","text":"Complete entry in backfill.yaml with Complete status that has not yet been processed..
Usage
$ ./bqetl backfill complete [OPTIONS] [qualified_table_name]\n\nOptions:\n\n--sql_dir: Path to directory which contains queries.\n--project_id: GCP project ID\n
Examples
# Complete backfill entry for specific table\n./bqetl backfill complete moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6\n\nUse the `--project_id` option to change the project;\ndefault project_id is `moz-fx-data-shared-prod`.\n
"},{"location":"cookbooks/common_workflows/","title":"Common bigquery-etl workflows","text":"This is a quick guide of how to perform common workflows in bigquery-etl using the bqetl
CLI.
For any workflow, the bigquery-etl repository needs to be locally available, for example by cloning the repository, and the bqetl
CLI needs to be installed by running ./bqetl bootstrap
.
The Creating derived datasets tutorial provides a more detailed guide on creating scheduled queries.
./bqetl query create <dataset>.<table>_<version>
<dataset>.<table>_<version>
query.sql
file that has been created in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
to write the query./bqetl query schema update <dataset>.<table>_<version>
to generate the schema.yaml
fileschema.yaml
metadata.yaml
file in sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
./bqetl query validate <dataset>.<table>_<version>
to dry run and format the query./bqetl dag info
list or create a new DAG ./bqetl dag create <bqetl_new_dag>
./bqetl query schedule <dataset>.<table>_<version> --dag <bqetl_dag>
to schedule the querybqetl_artifact_deployment
Airflow DAG./bqetl query backfill --project-id <project id> <dataset>.<table>_<version>
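Putting the steps above together, a session for a brand-new scheduled query might look like the following sketch (the dataset, table and DAG names are purely hypothetical):
./bqetl query create telemetry_derived.example_counts_v1 --dag bqetl_default\n# edit sql/moz-fx-data-shared-prod/telemetry_derived/example_counts_v1/query.sql\n./bqetl query schema update telemetry_derived.example_counts_v1\n./bqetl query validate telemetry_derived.example_counts_v1\n./bqetl query schedule telemetry_derived.example_counts_v1 --dag bqetl_default\n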
query.sql
file of the query to be updated and make changes./bqetl query validate <dataset>.<table>_<version>
to dry run and format the query./bqetl dag generate <bqetl_dag>
to update the DAG file./bqetl query schema update <dataset>.<table>_<version>
to make local schema.yaml
updatesbqetl_artifact_deployment
Airflow DAG. We enforce consistent SQL formatting as part of CI. After adding or changing a query, use ./bqetl format
to apply formatting rules.
Directories and files passed as arguments to ./bqetl format
will be formatted in place, with directories recursively searched for files with a .sql
extension, e.g.:
$ echo 'SELECT 1,2,3' > test.sql\n$ ./bqetl format test.sql\nmodified test.sql\n1 file(s) modified\n$ cat test.sql\nSELECT\n 1,\n 2,\n 3\n
If no arguments are specified the script will read from stdin and write to stdout, e.g.:
$ echo 'SELECT 1,2,3' | ./bqetl format\nSELECT\n 1,\n 2,\n 3\n
To turn off SQL formatting for a block of SQL, wrap it in format:off
and format:on
comments, like this:
SELECT\n -- format:off\n submission_date, sample_id, client_id\n -- format:on\n
"},{"location":"cookbooks/common_workflows/#add-a-new-field-to-a-table-schema","title":"Add a new field to a table schema","text":"Adding a new field to a table schema also means that the field has to propagate to several downstream tables, which makes it a more complex case.
query.sql
file inside the <dataset>.<table>
location and add the new definitions for the field../bqetl format <path to the query>
to format the query. Alternatively, run ./bqetl format $(git ls-tree -d HEAD --name-only)
validate the format of all queries that have been modified../bqetl query validate <dataset>.<table>
to dry run the query.jobs.create
permissions in moz-fx-data-shared-prod
), run:gcloud auth login --update-adc # to authenticate to GCP
gcloud config set project mozdata # to set the project
./bqetl query validate --use-cloud-function=false --project-id=mozdata <full path to the query file>
./bqetl query schema update <dataset>.<table> --update_downstream
to make local schema.yaml updates and update schemas of downstream dependencies.--update_downstream
is optional as it takes longer. It is recommended when you know that there are downstream dependencies whose schema.yaml
need to be updated, in which case, the update will happen automatically.--force
should only be used in very specific cases, particularly the clients_last_seen
tables. It skips some checks that would otherwise catch some error scenarios.bqetl_artifact_deployment
Airflow DAGThe following is an example to update a new field in telemetry_derived.clients_daily_v6
clients_daily_v6
query.sql
file and add new field definitions../bqetl format sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/query.sql
./bqetl query validate telemetry_derived.clients_daily_v6
.gcloud auth login --update-adc
./bqetl query schema update telemetry_derived.clients_daily_v6 --update_downstream --ignore-dryrun-skip --use-cloud-function=false
.schema.yaml
files of downstream dependencies, like clients_last_seen_v1
are updated.--use-cloud-function=false
is necessary when updating tables related to clients_daily
but optional for other tables. The dry run cloud function times out when fetching the deployed table schema for some of clients_daily
s downstream dependencies. Using GCP credentials instead works, however this means users need to have permissions to run queries in moz-fx-data-shared-prod
.bqetl_artifact_deployment
Airflow DAGDeleting a field from an existing table schema should be done only when is totally neccessary. If you decide to delete it: 1. Validate if there is data in the column and make sure data it is either backed up or it can be reprocessed. 1. Follow Big Query docs recommendations for deleting. 1. If the column size exceeds the allowed limit, consider setting the field as NULL. See this search_clients_daily_v8 PR for an example.
"},{"location":"cookbooks/common_workflows/#adding-a-new-mozfun-udf","title":"Adding a new mozfun UDF","text":"./bqetl mozfun create <dataset>.<name> --udf
.udf.sql
file in sql/mozfun/<dataset>/<name>/
and add UDF the definition and tests../bqetl mozfun validate <dataset>.<name>
for formatting and running tests.mozfun
DAG and clear latest run.Internal UDFs are usually only used by specific queries. If your UDF might be useful to others consider publishing it as a mozfun
UDF.
./bqetl routine create <dataset>.<name> --udf
udf.sql
in sql/moz-fx-data-shared-prod/<dataset>/<name>/
file and add UDF definition and tests./bqetl routine validate <dataset>.<name>
for formatting and running testsbqetl_artifact_deployment
Airflow DAGThe same steps as creating a new UDF apply for creating stored procedures, except when initially creating the procedure execute ./bqetl mozfun create <dataset>.<name> --stored_procedure
or ./bqetl routine create <dataset>.<name> --stored_procedure
for internal stored procedures.
udf.sql
file and make updates./bqetl mozfun validate <dataset>.<name>
or ./bqetl routine validate <dataset>.<name>
for formatting and running tests./bqetl mozfun rename <dataset>.<name> <new_dataset>.<new_name>
To provision a new BigQuery dataset for holding tables, you'll need to create a dataset_metadata.yaml
which will cause the dataset to be automatically deployed after merging. Changes to existing datasets may trigger manual operator approval (such as changing access policies). For more on access controls, see Data Access Workgroups in Mana.
The bqetl query create
command will automatically generate a skeleton dataset_metadata.yaml
file if the query name contains a dataset that is not yet defined.
See example with commentary for telemetry_derived
:
friendly_name: Telemetry Derived\ndescription: |-\n Derived data based on pings from legacy Firefox telemetry, plus many other\n general-purpose derived tables\nlabels: {}\n\n# Base ACL should can be:\n# \"derived\" for `_derived` datasets that contain concrete tables\n# \"view\" for user-facing datasets containing virtual views\ndataset_base_acl: derived\n\n# Datasets with user-facing set to true will be created both in shared-prod\n# and in mozdata; this should be false for all `_derived` datasets\nuser_facing: false\n\n# Most datasets can have mozilla-confidential access like below, but some\n# datasets will be defined with more restricted access or with additional\n# access for services; see \"Data Access Workgroups\" link above.\nworkgroup_access:\n- role: roles/bigquery.dataViewer\n members:\n - workgroup:mozilla-confidential\n
"},{"location":"cookbooks/common_workflows/#publishing-data","title":"Publishing data","text":"See also the reference for Public Data.
metadata.yaml
file of the query to be publishedpublic_bigquery: true
and optionally public_json: true
review_bugs
mozilla-public-data
init.sql
file exists for the query, change the destination project for the created table to mozilla-public-data
moz-fx-data-shared-prod
referencing the public datasetWhen adding a new library to the Python requirements, first add the library to the requirements and then add any meta-dependencies into constraints. Constraints are discovered by installing requirements into a fresh virtual environment. A dependency should be added to either requirements.txt
or constraints.txt
, but not both.
# Create a python virtual environment (not necessary if you have already\n# run `./bqetl bootstrap`)\npython3 -m venv venv/\n\n# Activate the virtual environment\nsource venv/bin/activate\n\n# If not installed:\npip install pip-tools --constraint requirements.in\n\n# Add the dependency to requirements.in e.g. Jinja2.\necho Jinja2==2.11.1 >> requirements.in\n\n# Compile hashes for new dependencies.\npip-compile --generate-hashes requirements.in\n\n# Deactivate the python virtual environment.\ndeactivate\n
"},{"location":"cookbooks/common_workflows/#making-a-pull-request-from-a-fork","title":"Making a pull request from a fork","text":"When opening a pull-request to merge a fork, the manual-trigger-required-for-fork
CI task will fail and some integration test tasks will be skipped. A user with repository write permissions will have to run the Push to upstream workflow and provide the <username>:<branch>
of the fork as parameter. The parameter will also show up in the logs of the manual-trigger-required-for-fork
CI task together with more detailed instructions. Once the workflow has been executed, the CI tasks, including the integration tests, of the PR will be executed.
The repository documentation is built using MkDocs. To generate and check the docs locally:
./bqetl docs generate --output_dir generated_docs
generated_docs
directorymkdocs serve
to start a local mkdocs
server.Each code files in the bigquery-etl repository can have a set of owners who are responsible to review and approve changes, and are automatically assigned as PR reviewers. The query files in the repo also benefit from the metadata labels to be able to validate and identify the data that is change controlled.
Here is a sample PR with the implementation of change control for contextual services data.
mozilla > telemetry
.metadata.yaml
for the query where you want to apply change control:owners
, add the selected GitHub identity, along with the list of owners' emails.labels
, add change_controlled: true
. This enables identifying change controlled data in the BigQuery console and in the Data Catalog.CODEOWNERS
:CODEOWNERS
file located in the root of the repo./sql_generators/active_users/templates/ @mozilla/kpi_table_reviewers
.script/bqetl query validate <query_path>
./sql-generators
, first run ./script/bqetl generate <path>
and then run script/bqetl query validate <query_path>
.This guide takes you through the creation of a simple derived dataset using bigquery-etl and scheduling it using Airflow, to be updated on a daily basis. It applies to the products we ship to customers, that use (or will use) the Glean SDK.
This guide also includes the specific instructions to set it as a public dataset. Make sure you only set the dataset public if you expect the data to be available outside Mozilla. Read our public datasets reference for context.
To illustrate the overall process, we will use a simple test case and a small Glean application for which we want to generate an aggregated dataset based on the raw ping data.
If you are interested in looking at the end result, you can view the pull request at mozilla/bigquery-etl#1760.
"},{"location":"cookbooks/creating_a_derived_dataset/#background","title":"Background","text":"Mozregression is a developer tool used to help developers and community members bisect builds of Firefox to find a regression range in which a bug was introduced. It forms a key part of our quality assurance process.
In this example, we will create a table of aggregated metrics related to mozregression
, that will be used in dashboards to help prioritize feature development inside Mozilla.
Set up bigquery-etl on your system per the instructions in the README.md.
"},{"location":"cookbooks/creating_a_derived_dataset/#create-the-query","title":"Create the Query","text":"The first step is to create a query file and decide on the name of your derived dataset. In this case, we'll name it org_mozilla_mozregression_derived.mozregression_aggregates
.
The org_mozilla_mozregression_derived
part represents a BigQuery dataset, which is essentially a container of tables. By convention, we use the _derived
postfix to hold derived tables like this one.
Run:
./bqetl query create <dataset>.<table_name>\n
In our example: ./bqetl query create org_mozilla_mozregression_derived.mozregression_aggregates --dag bqetl_internal_tooling\n
This command does three things:
metadata.yaml
and query.sql
representing the query to build the dataset in sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1
sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates
.bqetl_internal_tooling
.bqetl_default
.--no-schedule
is used, queries are not schedule. This option is available for queries that run once or should be scheduled at a later time. The query can be manually scheduled at a later time.We generate the view to have a stable interface, while allowing the dataset backend to evolve over time. Views are automatically published to the mozdata
project.
The next step is to modify the generated metadata.yaml
and query.sql
sections with specific information.
Let's look at what the metadata.yaml
file for our example looks like. Make sure to adapt this file for your own dataset.
friendly_name: mozregression aggregates\ndescription:\n Aggregated metrics of mozregression usage\nlabels:\n incremental: true\nowners:\n - wlachance@mozilla.com\nbigquery:\n time_partitioning:\n type: day\n field: date\n require_partition_filter: true\n expiration_days: null\n clustering:\n fields:\n - app_used\n - os\n
Most of the fields are self-explanatory. incremental
means that the table is updated incrementally, e.g. a new partition gets added/updated to the destination table whenever the query is run. For non-incremental queries the entire destination is overwritten when the query is executed.
For big datasets make sure to include optimization strategies. Our aggregation is small so it is only for illustration purposes that we are including a partition by the date
field and a clustering on app_used
and os
.
Setting the dataset as public means that it will be both in Mozilla's public BigQuery project and a world-accessible JSON endpoint, and is a process that requires a data review. The required labels are: public_json
, public_bigquery
and review_bugs
which refers to the Bugzilla bug where opening this data set up to the public was approved: we'll get to that in a subsequent section.
friendly_name: mozregression aggregates\ndescription:\n Aggregated metrics of mozregression usage\nlabels:\n incremental: true\n public_json: true\n public_bigquery: true\n review_bugs:\n - 1691105\nowners:\n - wlachance@mozilla.com\nbigquery:\n time_partitioning:\n type: day\n field: date\n require_partition_filter: true\n expiration_days: null\n clustering:\n fields:\n - app_used\n - os\n
"},{"location":"cookbooks/creating_a_derived_dataset/#fill-out-the-query","title":"Fill out the query","text":"Now that we've filled out the metadata, we can look into creating a query. In many ways, this is similar to creating a SQL query to run on BigQuery in other contexts (e.g. on sql.telemetry.mozilla.org or the BigQuery console)-- the key difference is that we use a @submission_date
parameter so that the query can be run on a day's worth of data to update the underlying table incrementally.
Test your query and add it to the query.sql
file.
In our example, the query is tested in sql.telemetry.mozilla.org
, and the query.sql
file looks like this:
SELECT\n DATE(submission_timestamp) AS date,\n client_info.app_display_version AS mozregression_version,\n metrics.string.usage_variant AS mozregression_variant,\n metrics.string.usage_app AS app_used,\n normalized_os AS os,\n mozfun.norm.truncate_version(normalized_os_version, \"minor\") AS os_version,\n count(DISTINCT(client_info.client_id)) AS distinct_clients,\n count(*) AS total_uses\nFROM\n `moz-fx-data-shared-prod`.org_mozilla_mozregression.usage\nWHERE\n DATE(submission_timestamp) = @submission_date\n AND client_info.app_display_version NOT LIKE '%.dev%'\nGROUP BY\n date,\n mozregression_version,\n mozregression_variant,\n app_used,\n os,\n os_version;\n
We use the truncate_version
UDF to omit the patch level for MacOS and Linux, which should both reduce the size of the dataset as well as make it more difficult to identify individual clients in an aggregated dataset.
We also have a short clause (client_info.app_display_version NOT LIKE '%.dev%'
) to omit developer versions from the aggregates: this makes sure we're not including people developing or testing mozregression itself in our results.
Now that we've written our query, we can format it and validate it. Once that's done, we run:
./bqetl query validate <dataset>.<table>\n
For our example: ./bqetl query validate org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
If there are no problems, you should see no output."},{"location":"cookbooks/creating_a_derived_dataset/#creating-the-table-schema","title":"Creating the table schema","text":"Use bqetl to set up the schema that will be used to create the table.
Review the schema.YAML generated as an output of the following command, and make sure all data types are set correctly and according to the data expected from the query.
./bqetl query schema update <dataset>.<table>\n
For our example:
./bqetl query schema update org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
"},{"location":"cookbooks/creating_a_derived_dataset/#creating-a-dag","title":"Creating a DAG","text":"BigQuery-ETL has some facilities in it to automatically add your query to telemetry-airflow (our instance of Airflow).
Before scheduling your query, you'll need to find an Airflow DAG to run it off of. In some cases, one may already exist that makes sense to use for your dataset -- look in dags.yaml
at the root or run ./bqetl dag info
. In this particular case, there's no DAG that really makes sense -- so we'll create a new one:
./bqetl dag create <dag_name> --schedule-interval \"0 4 * * *\" --owner <email_for_notifications> --description \"Add a clear description of the DAG here\" --start-date <YYYY-MM-DD> --tag impact/<tier>\n
For our example, the starting date is 2020-06-01
and we use a schedule interval of 0 4 \\* \\* \\*
(4am UTC daily) instead of \"daily\" (12am UTC daily) to make sure this isn't competing for slots with desktop and mobile product ETL.
The --tag impact/tier3
parameter specifies that this DAG is considered \"tier 3\". For a list of valid tags and their descriptions see Airflow Tags.
When creating a new DAG, while it is still under active development and assumed to fail during this phase, the DAG can be tagged as --tag triage/no_triage
. That way it will be ignored by the person on Airflow Triage. Once the active development is done, the triage/no_triage
tag can be removed and problems will addressed during the Airflow Triage process.
./bqetl dag create bqetl_internal_tooling --schedule-interval \"0 4 * * *\" --owner wlachance@mozilla.com --description \"This DAG schedules queries for populating queries related to Mozilla's internal developer tooling (e.g. mozregression).\" --start-date 2020-06-01 --tag impact/tier_3\n
"},{"location":"cookbooks/creating_a_derived_dataset/#scheduling-your-query","title":"Scheduling your query","text":"Queries are automatically scheduled during creation in the DAG set using the option --dag
, or in the default DAG bqetl_default
when this option is not used.
If the query was created with --no-schedule
, it is possible to manually schedule the query via the bqetl
tool:
./bqetl query schedule <dataset>.<table> --dag <dag_name> --task-name <task_name>\n
Here is the command for our example. Notice the name of the table as created with the suffix _v1.
./bqetl query schedule org_mozilla_mozregression_derived.mozregression_aggregates_v1 --dag bqetl_internal_tooling --task-name mozregression_aggregates__v1\n
Note that we are scheduling the generation of the underlying table which is org_mozilla_mozregression_derived.mozregression_aggregates_v1
rather than the view.
This is for public datasets only! You can skip this step if you're only creating a dataset for Mozilla-internal use.
Before a dataset can be made public, it needs to go through data review according to our data publishing process. This means filing a bug, answering a few questions, and then finding a data steward to review your proposal.
The dataset we're using in this example is very simple and straightforward and does not have any particularly sensitive data, so the data review is very simple. You can see the full details in bug 1691105.
"},{"location":"cookbooks/creating_a_derived_dataset/#create-a-pull-request","title":"Create a Pull Request","text":"Now is a good time to create a pull request with your changes to GitHub. This is the usual git workflow:
git checkout -b <new_branch_name>\ngit add dags.yaml dags/<dag_name>.py sql/moz-fx-data-shared-prod/telemetry/<view> sql/moz-fx-data-shared-prod/<dataset>/<table>\ngit commit\ngit push origin <new_branch_name>\n
And next is the workflow for our specific example:
git checkout -b mozregression-aggregates\ngit add dags.yaml dags/bqetl_internal_tooling.py sql/moz-fx-data-shared-prod/org_mozilla_mozregression/mozregression_aggregates sql/moz-fx-data-shared-prod/org_mozilla_mozregression_derived/mozregression_aggregates_v1\ngit commit\ngit push origin mozregression-aggregates\n
Then create your pull request, either from the GitHub web interface or the command line, per your preference.
Note At this point, the CI is expected to fail because the schema does not exist yet in BigQuery. This will be handled in the next step.
This example assumes that origin
points to your fork. Adjust the last push invocation appropriately if you have a different remote set.
Speaking of forks, note that if you're making this pull request from a fork, many jobs will currently fail due to lack of credentials. In fact, even if you're pushing to the origin, you'll get failures because the table is not yet created. That brings us to the next step, but before going further it's generally best to get someone to review your work: at this point we have more than enough for people to provide good feedback on.
"},{"location":"cookbooks/creating_a_derived_dataset/#creating-an-initial-table","title":"Creating an initial table","text":"Once the PR has been approved, deploy the schema to bqetl using this command:
./bqetl query schema deploy <schema>.<table>\n
For our example:
./bqetl query schema deploy org_mozilla_mozregression_derived.mozregression_aggregates_v1\n
"},{"location":"cookbooks/creating_a_derived_dataset/#backfilling-a-table","title":"Backfilling a table","text":"Note For large sets of data, follow the recommended practices for backfills.
"},{"location":"cookbooks/creating_a_derived_dataset/#initiating-the-backfill","title":"Initiating the backfill:","text":"Create a backfill schedule entry to (re)-process data in your table:
bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD>\n
--shredder_mitigation
parameter in the backfill command:bqetl backfill create <project>.<dataset>.<table> --start_date=<YYYY-MM-DD> --end_date=<YYYY-MM-DD> --shredder_mitigation\n
Fill out the missing details:
Open a Pull Request with the backfill entry, see this example. Once merged, you should receive a notification in around an hour that processing has started. Your backfill data will be temporarily placed in a staging location.
Watchers need to join the #dataops-alerts Slack channel. They will be notified via Slack when processing is complete, and you can validate your backfill data.
Validate that the backfill data looks like what you expect (calculate important metrics, look for nulls, etc.)
If the data is valid, open a Pull Request, setting the backfill status to Complete, see this example. Once merged, you should receive a notification in around an hour that swapping has started. Current production data will be backed up and the staging backfill data will be swapped into production.
You will be notified when swapping is complete.
Note. If your backfill is complex (backfill validation fails for e.g.), it is recommended to talk to someone in Data Engineering or Data SRE (#data-help) to process the backfill via the backfill DAG.
"},{"location":"cookbooks/creating_a_derived_dataset/#completing-the-pull-request","title":"Completing the Pull Request","text":"At this point, the table exists in Bigquery so you are able to: - Find and re-run the CI of your PR and make sure that all tests pass - Merge your PR.
"},{"location":"cookbooks/testing/","title":"How to Run Tests","text":"This repository uses pytest
:
# create a venv\npython3.11 -m venv venv/\n\n# install pip-tools for managing dependencies\n./venv/bin/pip install pip-tools -c requirements.in\n\n# install python dependencies with pip-sync (provided by pip-tools)\n./venv/bin/pip-sync --pip-args=--no-deps requirements.txt\n\n# run pytest with all linters and 8 workers in parallel\n./venv/bin/pytest --black --flake8 --isort --mypy-ignore-missing-imports --pydocstyle -n 8\n\n# use -k to selectively run a set of tests that matches the expression `udf`\n./venv/bin/pytest -k udf\n\n# narrow down testpaths for quicker turnaround when selecting a single test\n./venv/bin/pytest -o \"testpaths=tests/sql\" -k mobile_search_aggregates_v1\n\n# run integration tests with 4 workers in parallel\ngcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS\nexport GOOGLE_PROJECT_ID=bigquery-etl-integration-test\ngcloud config set project $GOOGLE_PROJECT_ID\n./venv/bin/pytest -m integration -n 4\n
To provide authentication credentials for the Google Cloud API the GOOGLE_APPLICATION_CREDENTIALS
environment variable must be set to the file path of the JSON file that contains the service account key. See Mozilla BigQuery API Access instructions to request credentials if you don't already have them.
Include a comment like -- Tests
followed by one or more query statements after the UDF in the SQL file where it is defined. Each statement in a SQL file that defines a UDF that does not define a temporary function is collected as a test and executed independently of other tests in the file.
Each test must use the UDF and throw an error to fail. Assert functions defined in sql/mozfun/assert/
may be used to evaluate outputs. Tests must not use any query parameters and should not reference any tables. Each test that is expected to fail must be preceded by a comment like #xfail
, similar to a SQL dialect prefix in the BigQuery Cloud Console.
For example:
CREATE TEMP FUNCTION udf_example(option INT64) AS (\n CASE\n WHEN option > 0 then TRUE\n WHEN option = 0 then FALSE\n ELSE ERROR(\"invalid option\")\n END\n);\n-- Tests\nSELECT\n mozfun.assert.true(udf_example(1)),\n mozfun.assert.false(udf_example(0));\n#xfail\nSELECT\n udf_example(-1);\n#xfail\nSELECT\n udf_example(NULL);\n
"},{"location":"cookbooks/testing/#how-to-configure-a-generated-test","title":"How to Configure a Generated Test","text":"Queries are tested by running the query.sql
with test-input tables and comparing the result to an expected table. 1. Make a directory for test resources named tests/sql/{project}/{dataset}/{table}/{test_name}/
, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/test_single_day
- table
must match a directory named like {dataset}/{table}
, e.g. telemetry_derived/clients_last_seen_v1
- test_name
should start with test_
, e.g. test_single_day
- If test_name
is test_init
or test_script
, then the query with is_init()
set to true
or script.sql
respectively; otherwise, the test will run query.sql
1. Add .yaml
files for input tables, e.g. clients_daily_v6.yaml
- Include the dataset prefix if it's set in the tested query, e.g. analysis.clients_last_seen_v1.yaml
- Include the project prefix if it's set in the tested query, e.g. moz-fx-other-data.new_dataset.table_1.yaml
- This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"analysis.clients_last_seen_v1\", \"clients_last_seen_v1\")
1. Add .sql
files for input view queries, e.g. main_summary_v4.sql
- Don't include a CREATE ... AS
clause - Fully qualify table names as `{project}.{dataset}.table`
- Include the dataset prefix if it's set in the tested query, e.g. telemetry.main_summary_v4.sql
- This will result in the dataset prefix being removed from the query, e.g. query = query.replace(\"telemetry.main_summary_v4\", \"main_summary_v4\")
1. Add expect.yaml
to validate the result - DATE
and DATETIME
type columns in the result are coerced to strings using .isoformat()
- Columns named generated_time
are removed from the result before comparing to expect
because they should not be static - NULL
values should be omitted in expect.yaml
. If a column is expected to be NULL
don't add it to expect.yaml
. (Be careful with spreading previous rows (-<<: *base
) here) 1. Optionally add .schema.json
files for input table schemas to the table directory, e.g. tests/sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_raw_v1/clients_daily_v6.schema.json
. These tables will be available for every test in the suite. The schema.json
file need to match the table name in the query.sql
file. If it has project and dataset listed there, the schema file also needs project and dataset. 1. Optionally add query_params.yaml
to define query parameters - query_params
must be a list
Tests of is_init()
statements are supported, similarly to other generated tests. Simply name the test test_init
. The other guidelines still apply.
generated_time
should be a required DATETIME
field to ensure minimal validationbq load
are supportedyaml
and json
format are supported and must contain an array of rows which are converted in memory to ndjson
before loadingyaml
for readability or ndjson
for compatiblity with bq load
expect.yaml
yaml
, json
and ndjson
are supportedyaml
for readability or ndjson
for compatiblity with bq load
time_partitioning_field
will cause the table to use it for time partitioningyaml
, json
and ndjson
are supportedyaml
for readability or json
for compatiblity with bq load
name
, type
or type_
, and value
query_parameters.yaml
may be used instead of query_params.yaml
, but they are mutually exclusiveyaml
, json
and ndjson
are supportedyaml
for readabilitycircleci
service account in the biguqery-etl-integration-test
projectcircleci build
and set required environment variables GOOGLE_PROJECT_ID
and GCLOUD_SERVICE_KEY
:gcloud_service_key=`cat /path/to/key_file.json`\n\n# to run a specific job, e.g. integration:\ncircleci build --job integration \\\n --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n\n# to run all jobs\ncircleci build \\\n --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \\\n --env GCLOUD_SERVICE_KEY=$gcloud_service_key\n
"},{"location":"moz-fx-data-shared-prod/udf/","title":"Udf","text":""},{"location":"moz-fx-data-shared-prod/udf/#active_n_weeks_ago-udf","title":"active_n_weeks_ago (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters","title":"Parameters","text":"INPUTS
x INT64, n INT64\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#active_values_from_days_seen_map-udf","title":"active_values_from_days_seen_map (UDF)","text":"Given a map of representing activity for STRING key
s, this function returns an array of which key
s were active for the time period in question. start_offset should be at most 0. n_bits should be at most the remaining bits.
INPUTS
days_seen_bits_map ARRAY<STRUCT<key STRING, value INT64>>, start_offset INT64, n_bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_engine_searches-udf","title":"add_monthly_engine_searches (UDF)","text":"This function specifically windows searches into calendar-month windows. This means groups are not necessarily directly comparable, since different months have different numbers of days. On the first of each month, a new month is appended, and the first month is dropped. If the date is not the first of the month, the new entry is added to the last element in the array. For example, if we were adding 12 to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]: On the first of the month, the result would be [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12] On any other day of the month, the result would be [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 24] This happens for every aggregate (searches, ad clicks, etc.)
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_2","title":"Parameters","text":"INPUTS
prev STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, curr STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_monthly_searches-udf","title":"add_monthly_searches (UDF)","text":"Adds together two engine searches structs. Each engine searches struct has a MAP[engine -> search_counts_struct]. We want to add add together the prev and curr's values for a certain engine. This allows us to be flexible with the number of engines we're using.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_3","title":"Parameters","text":"INPUTS
prev ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, curr ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>, submission_date DATE\n
OUTPUTS
value\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#add_searches_by_index-udf","title":"add_searches_by_index (UDF)","text":"Return sums of each search type grouped by the index. Results are ordered by index.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_4","title":"Parameters","text":"INPUTS
searches ARRAY<STRUCT<total_searches INT64, tagged_searches INT64, search_with_ads INT64, ad_click INT64, index INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_active_addons-udf","title":"aggregate_active_addons (UDF)","text":"This function selects most frequently occuring value for each addon_id, using the latest value in the input among ties. The type for active_addons is ARRAY>, i.e. the output of SELECT ARRAY_CONCAT_AGG(active_addons) FROM telemetry.main_summary_v4
, and is left unspecified to allow changes to the fields of the STRUCT."},{"location":"moz-fx-data-shared-prod/udf/#parameters_5","title":"Parameters","text":"
INPUTS
active_addons ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_map_first-udf","title":"aggregate_map_first (UDF)","text":"Returns an aggregated map with all the keys and the first corresponding value from the given maps
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_6","title":"Parameters","text":"INPUTS
maps ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_counts-udf","title":"aggregate_search_counts (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_7","title":"Parameters","text":"INPUTS
search_counts ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#aggregate_search_map-udf","title":"aggregate_search_map (UDF)","text":"Aggregates the total counts of the given search counters
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_8","title":"Parameters","text":"INPUTS
engine_searches_list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_11_zeroes_then-udf","title":"array_11_zeroes_then (UDF)","text":"An array of 11 zeroes, followed by a supplied value
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_9","title":"Parameters","text":"INPUTS
val INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_drop_first_and_append-udf","title":"array_drop_first_and_append (UDF)","text":"Drop the first element of an array, and append the given element. Result is an array with the same length as the input.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_10","title":"Parameters","text":"INPUTS
arr ANY TYPE, append ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_of_12_zeroes-udf","title":"array_of_12_zeroes (UDF)","text":"An array of 12 zeroes
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_11","title":"Parameters","text":"INPUTS
) AS ( [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#array_slice-udf","title":"array_slice (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_12","title":"Parameters","text":"INPUTS
arr ANY TYPE, start_index INT64, end_index INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitcount_lowest_7-udf","title":"bitcount_lowest_7 (UDF)","text":"This function counts the 1s in lowest 7 bits of an INT64
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_13","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_365-udf","title":"bitmask_365 (UDF)","text":"A bitmask for 365 bits
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_14","title":"Parameters","text":"INPUTS
) AS ( CONCAT(b'\\x1F', REPEAT(b'\\xFF', 45\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_28-udf","title":"bitmask_lowest_28 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_15","title":"Parameters","text":"INPUTS
) AS ( 0x0FFFFFFF\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_lowest_7-udf","title":"bitmask_lowest_7 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_16","title":"Parameters","text":"INPUTS
) AS ( 0x7F\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bitmask_range-udf","title":"bitmask_range (UDF)","text":"Returns a bitmask that can be used to return a subset of an integer representing a bit array. The start_ordinal argument is an integer specifying the starting position of the slice, with start_ordinal = 1 indicating the first bit. The length argument is the number of bits to include in the mask. The arguments were chosen to match the semantics of the SUBSTR function; see https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#substr
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_17","title":"Parameters","text":"INPUTS
start_ordinal INT64, _length INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_active_in_range-udf","title":"bits28_active_in_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_18","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_days_since_seen-udf","title":"bits28_days_since_seen (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_19","title":"Parameters","text":"INPUTS
bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_from_string-udf","title":"bits28_from_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_20","title":"Parameters","text":"INPUTS
s STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_range-udf","title":"bits28_range (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_21","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_retention-udf","title":"bits28_retention (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_22","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_dates-udf","title":"bits28_to_dates (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_23","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
OUTPUTS
ARRAY<DATE>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits28_to_string-udf","title":"bits28_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_24","title":"Parameters","text":"INPUTS
bits INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_from_offsets-udf","title":"bits_from_offsets (UDF)","text":"Returns a bit pattern of type BYTES compactly encoding the given array of positive integer offsets. This is primarily useful to generate a compact encoding of dates on which a feature was used, with arbitrarily long history. Example aggregation: sql bits_from_offsets( ARRAY_AGG(IF(foo, DATE_DIFF(anchor_date, submission_date, DAY), NULL) IGNORE NULLS) )
The resulting value can be cast to an INT64 representing the most recent 64 days via: sql CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64)
Or representing the most recent 28 days (compatible with bits28 functions) via: sql CAST(CONCAT('0x', TO_HEX(RIGHT(bits >> i, 4))) AS INT64) << 36 >> 36
INPUTS
offsets ARRAY<INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_active_n_weeks_ago-udf","title":"bits_to_active_n_weeks_ago (UDF)","text":"Given a BYTE and an INT64, return whether the user was active that many weeks ago. NULL input returns NULL output.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_26","title":"Parameters","text":"INPUTS
b BYTES, n INT64\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_seen-udf","title":"bits_to_days_seen (UDF)","text":"Given a BYTE, get the number of days the user was seen. NULL input returns NULL output.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_27","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_first_seen-udf","title":"bits_to_days_since_first_seen (UDF)","text":"Given a BYTES, return the number of days since the client was first seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the result is 0-indexed, meaning that for \\x01, it will return 0. Results showed this being between 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_first_seen(b BYTES) AS (( SELECT MAX(n) FROM UNNEST(GENERATE_ARRAY( 0, 8 * BYTE_LENGTH(b))) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_seen.sql
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_28","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bits_to_days_since_seen-udf","title":"bits_to_days_since_seen (UDF)","text":"Given a BYTES, return the number of days since the client was last seen. If no bits are set, returns NULL, indicating we don't know. Otherwise the results are 0-indexed, meaning \\x01 will return 0. Tests showed this being 5-10x faster than the simpler alternative: CREATE OR REPLACE FUNCTION udf.bits_to_days_since_seen(b BYTES) AS (( SELECT MIN(n) FROM UNNEST(GENERATE_ARRAY(0, 364)) AS n WHERE BIT_COUNT(SUBSTR(b >> n, -1) & b'\\x01') > 0)); See also: bits_to_days_since_first_seen.sql
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_29","title":"Parameters","text":"INPUTS
b BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#bool_to_365_bits-udf","title":"bool_to_365_bits (UDF)","text":"Convert a boolean to 365 bit byte array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_30","title":"Parameters","text":"INPUTS
val BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#boolean_histogram_to_boolean-udf","title":"boolean_histogram_to_boolean (UDF)","text":"Given histogram h, return TRUE if it has a value in the \"true\" bucket, or FALSE if it has a value in the \"false\" bucket, or NULL otherwise. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L309-L317
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_31","title":"Parameters","text":"INPUTS
histogram STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_28_bits-udf","title":"coalesce_adjacent_days_28_bits (UDF)","text":"We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bits representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_32","title":"Parameters","text":"INPUTS
prev INT64, curr INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#coalesce_adjacent_days_365_bits-udf","title":"coalesce_adjacent_days_365_bits (UDF)","text":"Coalesce previous data's PCD with the new data's PCD. We generally want to believe only the first reasonable profile creation date that we receive from a client. Given bytes representing usage from the previous day and the current day, this function shifts the first argument by one day and returns either that value if non-zero and non-null, the current day value if non-zero and non-null, or else 0.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_33","title":"Parameters","text":"INPUTS
prev BYTES, curr BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_28_bits-udf","title":"combine_adjacent_days_28_bits (UDF)","text":"Combines two bit patterns. The first pattern represents activity over a 28-day period ending \"yesterday\". The second pattern represents activity as observed today (usually just 0 or 1). We shift the bits in the first pattern by one to set the new baseline as \"today\", then perform a bitwise OR of the two patterns.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_34","title":"Parameters","text":"INPUTS
prev INT64, curr INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_adjacent_days_365_bits-udf","title":"combine_adjacent_days_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_35","title":"Parameters","text":"INPUTS
prev BYTES, curr BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_days_seen_maps-udf","title":"combine_days_seen_maps (UDF)","text":"The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs (aka maps) where each entry gives the bit pattern for days in which we saw a ping for a given user in a given key. We combine the bit patterns for the previous day and the current day, returning a single map. See udf.combine_experiment_days
for a more specific example of this approach.
INPUTS
-- prev ARRAY<STRUCT<key STRING, value INT64>>, -- curr ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#combine_experiment_days-udf","title":"combine_experiment_days (UDF)","text":"The \"clients_last_seen\" class of tables represent various types of client activity within a 28-day window as bit patterns. This function takes in two arrays of structs where each entry gives the bit pattern for days in which we saw a ping for a given user in a given experiment. We combine the bit patterns for the previous day and the current day, returning a single array of experiment structs.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_37","title":"Parameters","text":"INPUTS
-- prev ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>, -- curr ARRAY<STRUCT<experiment STRING, branch STRING, bits INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#country_code_to_flag-udf","title":"country_code_to_flag (UDF)","text":"For a given two-letter ISO 3166-1 alpha-2 country code, returns a string consisting of two Unicode regional indicator symbols, which is rendered in supporting fonts (such as in the BigQuery console or STMO) as flag emoji. This is just for fun. See: - https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 - https://en.wikipedia.org/wiki/Regional_Indicator_Symbol
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_38","title":"Parameters","text":"INPUTS
country_code string\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#days_seen_bytes_to_rfm-udf","title":"days_seen_bytes_to_rfm (UDF)","text":"Return the frequency, recency, and T from a BYTE array, as defined in https://lifetimes.readthedocs.io/en/latest/Quickstart.html#the-shape-of-your-data RFM refers to Recency, Frequency, and Monetary value.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_39","title":"Parameters","text":"INPUTS
days_seen_bytes BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#days_since_created_profile_as_28_bits-udf","title":"days_since_created_profile_as_28_bits (UDF)","text":"Takes in a difference between submission date and profile creation date and returns a bit pattern representing the profile creation date IFF the profile date is the same as the submission date or no more than 6 days earlier. Analysis has shown that client-reported profile creation dates are much less reliable outside of this range and cannot be used as reliable indicators of new profile creation.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_40","title":"Parameters","text":"INPUTS
days_since_created_profile INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#deanonymize_event-udf","title":"deanonymize_event (UDF)","text":"Rename struct fields in anonymous event tuples to meaningful names.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_41","title":"Parameters","text":"INPUTS
tuple STRUCT<f0_ INT64, f1_ STRING, f2_ STRING, f3_ STRING, f4_ STRING, f5_ ARRAY<STRUCT<key STRING, value STRING>>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#decode_int64-udf","title":"decode_int64 (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_42","title":"Parameters","text":"INPUTS
raw BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#dedupe_array-udf","title":"dedupe_array (UDF)","text":"Return an array containing only distinct values of the given array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_43","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_clients-udf","title":"distribution_model_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_44","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_ga_metrics-udf","title":"distribution_model_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_45","title":"Parameters","text":"INPUTS
) RETURNS STRING AS ( 'helloworld'\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#distribution_model_installs-udf","title":"distribution_model_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_46","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#event_code_points_to_string-udf","title":"event_code_points_to_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_47","title":"Parameters","text":"INPUTS
code_points ANY TYPE\n
OUTPUTS
ARRAY<INT64>\n
"},{"location":"moz-fx-data-shared-prod/udf/#experiment_search_metric_to_array-udf","title":"experiment_search_metric_to_array (UDF)","text":"Used for testing only. Reproduces the string transformations done in experiment_search_events_live_v1 materialized views.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_48","title":"Parameters","text":"INPUTS
metric ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_count_histogram_value-udf","title":"extract_count_histogram_value (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_49","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_type-udf","title":"extract_document_type (UDF)","text":"Extract the document type from a table name e.g. _TABLE_SUFFIX.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_50","title":"Parameters","text":"INPUTS
table_name STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_document_version-udf","title":"extract_document_version (UDF)","text":"Extract the document version from a table name e.g. _TABLE_SUFFIX.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_51","title":"Parameters","text":"INPUTS
table_name STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"This is a performance optimization compared to the more general mozfun.hist.extract for cases where only the histogram sum is needed. It must support all the same format variants as mozfun.hist.extract but this simplification is necessary to keep the main_summary query complexity in check.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_52","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#extract_schema_validation_path-udf","title":"extract_schema_validation_path (UDF)","text":"Return a path derived from an error message in payload_bytes_error
INPUTS
error_message STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.
Fenix originally used an 8-digit app_build format>
In short it is yDDDHHmm
:
The last date seen with an 8-digit build ID is 2020-08-10.
Newer builds use a 10-digit format> where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.
This function tolerates both formats.
After using this you may wish to DATETIME_TRUNC(result, DAY)
for grouping by build date.
INPUTS
app_build STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_clients-udf","title":"funnel_derived_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_55","title":"Parameters","text":"INPUTS
os STRING, first_seen_date DATE, build_id STRING, attribution_source STRING, attribution_ua STRING, startup_profile_selection_reason STRING, distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_ga_metrics-udf","title":"funnel_derived_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_56","title":"Parameters","text":"INPUTS
device_category STRING, browser STRING, operating_system STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#funnel_derived_installs-udf","title":"funnel_derived_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_57","title":"Parameters","text":"INPUTS
silent BOOLEAN, submission_timestamp TIMESTAMP, build_id STRING, attribution STRING, distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#ga_is_mozilla_browser-udf","title":"ga_is_mozilla_browser (UDF)","text":"Determine if a browser in a Google Analytics data is produced by Mozilla
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_58","title":"Parameters","text":"INPUTS
browser STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct-udf","title":"geo_struct (UDF)","text":"Convert geoip lookup fields to a struct, replacing '??' with NULL. Returns NULL if if required field country would be NULL. Replaces '??' with NULL because '??' is a placeholder that may be used if there was an issue during geoip lookup in hindsight.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_59","title":"Parameters","text":"INPUTS
country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#geo_struct_set_defaults-udf","title":"geo_struct_set_defaults (UDF)","text":"Convert geoip lookup fields to a struct, replacing NULLs with \"??\". This allows for better joins on those fields, but needs to be changed back to NULL at the end of the query.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_60","title":"Parameters","text":"INPUTS
country STRING, city STRING, geo_subdivision1 STRING, geo_subdivision2 STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#get_key-udf","title":"get_key (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_61","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_62","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_nanos-udf","title":"glean_timespan_nanos (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_63","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#glean_timespan_seconds-udf","title":"glean_timespan_seconds (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_64","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#gzip_length_footer-udf","title":"gzip_length_footer (UDF)","text":"Given a gzip compressed byte string, extract the uncompressed size from the footer. WARNING: THIS FUNCTION IS NOT RELIABLE FOR ARBITRARY GZIP STREAMS. It should, however, be safe to use for checking the decompressed size of payload in payload_bytes_decoded (and NOT payload_bytes_raw) because that payload is produced by the decoder and limited to conditions where the footer is accurate. From https://stackoverflow.com/a/9213826 First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 232. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 232 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.) Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.) Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_65","title":"Parameters","text":"INPUTS
compressed BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_max_key_with_nonzero_value-udf","title":"histogram_max_key_with_nonzero_value (UDF)","text":"Find the largest numeric bucket that contains a value greater than zero. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L253-L266
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_66","title":"Parameters","text":"INPUTS
histogram STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_merge-udf","title":"histogram_merge (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_67","title":"Parameters","text":"INPUTS
histogram_list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_normalize-udf","title":"histogram_normalize (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_68","title":"Parameters","text":"INPUTS
histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n
OUTPUTS
STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_percentiles-udf","title":"histogram_percentiles (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_69","title":"Parameters","text":"INPUTS
histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_mean-udf","title":"histogram_to_mean (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_70","title":"Parameters","text":"INPUTS
histogram ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#histogram_to_threshold_count-udf","title":"histogram_to_threshold_count (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_71","title":"Parameters","text":"INPUTS
histogram STRING, threshold INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#hmac_sha256-udf","title":"hmac_sha256 (UDF)","text":"Given a key and message, return the HMAC-SHA256 hash. This algorithm can be found in Wikipedia: https://en.wikipedia.org/wiki/HMAC#Implementation This implentation is validated against the NIST test vectors. See test/validation/hmac_sha256.py for more information.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_72","title":"Parameters","text":"INPUTS
key BYTES, message BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#int_to_365_bits-udf","title":"int_to_365_bits (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_73","title":"Parameters","text":"INPUTS
value INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#int_to_hex_string-udf","title":"int_to_hex_string (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_74","title":"Parameters","text":"INPUTS
value INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_75","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_extract_int_map-udf","title":"json_extract_int_map (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_76","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#json_mode_last-udf","title":"json_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_77","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#keyed_histogram_get_sum-udf","title":"keyed_histogram_get_sum (UDF)","text":"Take a keyed histogram of type STRUCT, extract the histogram of the given key, and return the sum value"},{"location":"moz-fx-data-shared-prod/udf/#parameters_78","title":"Parameters","text":"
INPUTS
keyed_histogram ANY TYPE, target_key STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_append_to_json_string-udf","title":"kv_array_append_to_json_string (UDF)","text":"Returns a JSON string which has the pair
appended to the provided input
JSON string. NULL is also valid for input
. Examples: udf.kv_array_append_to_json_string('{\"foo\":\"bar\"}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"foo\":\"bar\",\"baz\":\"boo\"}' udf.kv_array_append_to_json_string('{}', [STRUCT(\"baz\" AS key, \"boo\" AS value)]) '{\"baz\": \"boo\"}'
INPUTS
input STRING, arr ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#kv_array_to_json_string-udf","title":"kv_array_to_json_string (UDF)","text":"Returns a JSON string representing the input key-value array. Value type must be able to be represented as a string - this function will cast to a string. At Mozilla, the schema for a map is STRUCT>>. To use this with that representation, it should be as udf.kv_array_to_json_string(struct.key_value)
."},{"location":"moz-fx-data-shared-prod/udf/#parameters_80","title":"Parameters","text":"
INPUTS
kv_arr ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#main_summary_scalars-udf","title":"main_summary_scalars (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_81","title":"Parameters","text":"INPUTS
processes ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_bing_revenue_country_to_country_code-udf","title":"map_bing_revenue_country_to_country_code (UDF)","text":"For use by LTV revenue join only. Maps the Bing country to a country code. Only keeps the country codes we want to aggregate on.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_82","title":"Parameters","text":"INPUTS
country STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_mode_last-udf","title":"map_mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_83","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_revenue_country-udf","title":"map_revenue_country (UDF)","text":"Only for use by the LTV Revenue join. Maps country codes to the codes we have in the revenue dataset. Buckets small Bing countries into \"other\".
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_84","title":"Parameters","text":"INPUTS
engine STRING, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#map_sum-udf","title":"map_sum (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_85","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#marketing_attributable_desktop-udf","title":"marketing_attributable_desktop (UDF)","text":"This is a UDF to help distinguish if acquired desktop clients are attributable to marketing efforts or not
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_86","title":"Parameters","text":"INPUTS
medium STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#merge_scalar_user_data-udf","title":"merge_scalar_user_data (UDF)","text":"Given an array of scalar metric data that might have duplicate values for a metric, merge them into one value.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_87","title":"Parameters","text":"INPUTS
aggs ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n
OUTPUTS
ARRAY<STRUCT<metric STRING, metric_type STRING, key STRING, process STRING, agg_type STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mod_uint128-udf","title":"mod_uint128 (UDF)","text":"This function returns \"dividend mod divisor\" where the dividend and the result is encoded in bytes, and divisor is an integer.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_88","title":"Parameters","text":"INPUTS
dividend BYTES, divisor INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mode_last-udf","title":"mode_last (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_89","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_90","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#monetized_search-udf","title":"monetized_search (UDF)","text":"Stub monetized_search UDF for tests
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_91","title":"Parameters","text":"INPUTS
engine STRING, country STRING, distribution_id STRING, submission_date DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#new_monthly_engine_searches_struct-udf","title":"new_monthly_engine_searches_struct (UDF)","text":"This struct represents the past year's worth of searches. Each month has its own entry, hence 12.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_92","title":"Parameters","text":"INPUTS
) AS ( STRUCT( udf.array_of_12_zeroes(\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_fenix_metrics-udf","title":"normalize_fenix_metrics (UDF)","text":"Accepts a glean metrics struct as input and returns a modified struct that nulls out histograms for older versions of the Glean SDK that reported pathological binning; see https://bugzilla.mozilla.org/show_bug.cgi?id=1592930
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_93","title":"Parameters","text":"INPUTS
telemetry_sdk_build STRING, metrics ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_baseline_client_info-udf","title":"normalize_glean_baseline_client_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_94","title":"Parameters","text":"INPUTS
client_info ANY TYPE, metrics ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_glean_ping_info-udf","title":"normalize_glean_ping_info (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_95","title":"Parameters","text":"INPUTS
ping_info ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_main_payload-udf","title":"normalize_main_payload (UDF)","text":"Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_96","title":"Parameters","text":"INPUTS
payload ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_metadata-udf","title":"normalize_metadata (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_97","title":"Parameters","text":"INPUTS
metadata ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_monthly_searches-udf","title":"normalize_monthly_searches (UDF)","text":"Sum up the monthy search count arrays by normalized engine
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_98","title":"Parameters","text":"INPUTS
engine_searches ARRAY<STRUCT<key STRING, value STRUCT<total_searches ARRAY<INT64>, tagged_searches ARRAY<INT64>, search_with_ads ARRAY<INT64>, ad_click ARRAY<INT64>>>>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_os-udf","title":"normalize_os (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_99","title":"Parameters","text":"INPUTS
os STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#normalize_search_engine-udf","title":"normalize_search_engine (UDF)","text":"Return normalized engine name for recognized engines This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_100","title":"Parameters","text":"INPUTS
engine STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#null_if_empty_list-udf","title":"null_if_empty_list (UDF)","text":"Return NULL if list is empty, otherwise return list. This cannot be done with NULLIF because NULLIF does not support arrays.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_101","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#one_as_365_bits-udf","title":"one_as_365_bits (UDF)","text":"One represented as a byte array of 365 bits
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_102","title":"Parameters","text":"INPUTS
) AS ( CONCAT(REPEAT(b'\\x00', 45\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_desktop-udf","title":"organic_vs_paid_desktop (UDF)","text":"This is a UDF to help distinguish desktop client attribution as being organic or paid
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_103","title":"Parameters","text":"INPUTS
medium STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#organic_vs_paid_mobile-udf","title":"organic_vs_paid_mobile (UDF)","text":"This is a UDF to help distinguish mobile client attribution as being organic or paid
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_104","title":"Parameters","text":"INPUTS
adjust_network STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pack_event_properties-udf","title":"pack_event_properties (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf/#parameters_105","title":"Parameters","text":"INPUTS
event_properties ANY TYPE, indices ANY TYPE\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
"},{"location":"moz-fx-data-shared-prod/udf/#parquet_array_sum-udf","title":"parquet_array_sum (UDF)","text":"Sum an array from a parquet-derived field. These are lists of an element
that contain the field value.
INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#parse_desktop_telemetry_uri-udf","title":"parse_desktop_telemetry_uri (UDF)","text":"Parses and labels the components of a telemetry desktop ping submission uri Per https://docs.telemetry.mozilla.org/concepts/pipeline/http_edge_spec.html#special-handling-for-firefox-desktop-telemetry the format is /submit/telemetry/docId/docType/appName/appVersion/appUpdateChannel/appBuildID e.g. /submit/telemetry/ce39b608-f595-4c69-b6a6-f7a436604648/main/Firefox/61.0a1/nightly/20180328030202
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_107","title":"Parameters","text":"INPUTS
uri STRING\n
OUTPUTS
STRUCT<namespace STRING, document_id STRING, document_type STRING, app_name STRING, app_version STRING, app_update_channel STRING, app_build_id STRING>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#parse_iso8601_date-udf","title":"parse_iso8601_date (UDF)","text":"Take a ISO 8601 date or date and time string and return a DATE. Return null if parse fails. Possible formats: 2019-11-04, 2019-11-04T21:15:00+00:00, 2019-11-04T21:15:00Z, 20191104T211500Z
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_108","title":"Parameters","text":"INPUTS
date_str STRING\n
OUTPUTS
DATE\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_clients-udf","title":"partner_org_clients (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_109","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_ga_metrics-udf","title":"partner_org_ga_metrics (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_110","title":"Parameters","text":"INPUTS
) RETURNS STRING AS ( (SELECT 'hola_world' AS partner_org\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#partner_org_installs-udf","title":"partner_org_installs (UDF)","text":"This is a stub implementation for use with tests; real implementation is in private-bigquery-etl
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_111","title":"Parameters","text":"INPUTS
distribution_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_leading_set_bit-udf","title":"pos_of_leading_set_bit (UDF)","text":"Returns the 0-based index of the first set bit. No set bits returns NULL.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_112","title":"Parameters","text":"INPUTS
i INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pos_of_trailing_set_bit-udf","title":"pos_of_trailing_set_bit (UDF)","text":"Identical to bits28_days_since_seen. Returns a 0-based index of the rightmost set bit in the passed bit pattern or null if no bits are set (bits = 0). To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_113","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#product_info_with_baseline-udf","title":"product_info_with_baseline (UDF)","text":"Similar to mozfun.norm.product_info(), but this UDF also handles \"baseline\" apps that were introduced differentiate for certain apps whether data is sent through Glean or core pings. This UDF has been temporarily introduced as part of https://bugzilla.mozilla.org/show_bug.cgi?id=1775216
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_114","title":"Parameters","text":"INPUTS
legacy_app_name STRING, normalized_os STRING\n
OUTPUTS
STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#pseudonymize_ad_id-udf","title":"pseudonymize_ad_id (UDF)","text":"Pseudonymize Ad IDs, handling opt-outs.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_115","title":"Parameters","text":"INPUTS
hashed_ad_id STRING, key BYTES\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#quantile_search_metric_contribution-udf","title":"quantile_search_metric_contribution (UDF)","text":"This function returns how much of one metric is contributed by the quantile of another metric. Quantile variable should add an offset to get the requried percentile value. Example: udf.quantile_search_metric_contribution(sap, search_with_ads, sap_percentiles[OFFSET(9)]) It returns search_with_ads if sap value in top 10% volumn else null.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_116","title":"Parameters","text":"INPUTS
metric1 FLOAT64, metric2 FLOAT64, quantile FLOAT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#round_timestamp_to_minute-udf","title":"round_timestamp_to_minute (UDF)","text":"Floor a timestamp object to the given minute interval.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_117","title":"Parameters","text":"INPUTS
timestamp_expression TIMESTAMP, minute INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#safe_crc32_uuid-udf","title":"safe_crc32_uuid (UDF)","text":"Calculate the CRC-32 hash of a 36-byte UUID, or NULL if the value isn't 36 bytes. This implementation is limited to an exact length because recursion does not work. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_118","title":"Parameters","text":"INPUTS
) AS ( [ 0, 1996959894, 3993919788, 2567524794, 124634137, 1886057615, 3915621685, 2657392035, 249268274, 2044508324, 3772115230, 2547177864, 162941995, 2125561021, 3887607047, 2428444049, 498536548, 1789927666, 4089016648, 2227061214, 450548861, 1843258603, 4107580753, 2211677639, 325883990, 1684777152, 4251122042, 2321926636, 335633487, 1661365465, 4195302755, 2366115317, 997073096, 1281953886, 3579855332, 2724688242, 1006888145, 1258607687, 3524101629, 2768942443, 901097722, 1119000684, 3686517206, 2898065728, 853044451, 1172266101, 3705015759, 2882616665, 651767980, 1373503546, 3369554304, 3218104598, 565507253, 1454621731, 3485111705, 3099436303, 671266974, 1594198024, 3322730930, 2970347812, 795835527, 1483230225, 3244367275, 3060149565, 1994146192, 31158534, 2563907772, 4023717930, 1907459465, 112637215, 2680153253, 3904427059, 2013776290, 251722036, 2517215374, 3775830040, 2137656763, 141376813, 2439277719, 3865271297, 1802195444, 476864866, 2238001368, 4066508878, 1812370925, 453092731, 2181625025, 4111451223, 1706088902, 314042704, 2344532202, 4240017532, 1658658271, 366619977, 2362670323, 4224994405, 1303535960, 984961486, 2747007092, 3569037538, 1256170817, 1037604311, 2765210733, 3554079995, 1131014506, 879679996, 2909243462, 3663771856, 1141124467, 855842277, 2852801631, 3708648649, 1342533948, 654459306, 3188396048, 3373015174, 1466479909, 544179635, 3110523913, 3462522015, 1591671054, 702138776, 2966460450, 3352799412, 1504918807, 783551873, 3082640443, 3233442989, 3988292384, 2596254646, 62317068, 1957810842, 3939845945, 2647816111, 81470997, 1943803523, 3814918930, 2489596804, 225274430, 2053790376, 3826175755, 2466906013, 167816743, 2097651377, 4027552580, 2265490386, 503444072, 1762050814, 4150417245, 2154129355, 426522225, 1852507879, 4275313526, 2312317920, 282753626, 1742555852, 4189708143, 2394877945, 397917763, 1622183637, 3604390888, 2714866558, 953729732, 1340076626, 3518719985, 2797360999, 1068828381, 1219638859, 3624741850, 2936675148, 906185462, 1090812512, 3747672003, 2825379669, 829329135, 1181335161, 3412177804, 3160834842, 628085408, 1382605366, 3423369109, 3138078467, 570562233, 1426400815, 3317316542, 2998733608, 733239954, 1555261956, 3268935591, 3050360625, 752459403, 1541320221, 2607071920, 3965973030, 1969922972, 40735498, 2617837225, 3943577151, 1913087877, 83908371, 2512341634, 3803740692, 2075208622, 213261112, 2463272603, 3855990285, 2094854071, 198958881, 2262029012, 4057260610, 1759359992, 534414190, 2176718541, 4139329115, 1873836001, 414664567, 2282248934, 4279200368, 1711684554, 285281116, 2405801727, 4167216745, 1634467795, 376229701, 2685067896, 3608007406, 1308918612, 956543938, 2808555105, 3495958263, 1231636301, 1047427035, 2932959818, 3654703836, 1088359270, 936918000, 2847714899, 3736837829, 1202900863, 817233897, 3183342108, 3401237130, 1404277552, 615818150, 3134207493, 3453421203, 1423857449, 601450431, 3009837614, 3294710456, 1567103746, 711928724, 3020668471, 3272380065, 1510334235, 755167117 ]\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#safe_sample_id-udf","title":"safe_sample_id (UDF)","text":"Stably hash a client_id to an integer between 0 and 99, or NULL if client_id isn't 36 bytes
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_119","title":"Parameters","text":"INPUTS
client_id STRING\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#search_counts_map_sum-udf","title":"search_counts_map_sum (UDF)","text":"Calculate the sums of search counts per source and engine
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_120","title":"Parameters","text":"INPUTS
entries ARRAY<STRUCT<engine STRING, source STRING, count INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_28_bits_one_day-udf","title":"shift_28_bits_one_day (UDF)","text":"Shift input bits one day left and drop any bits beyond 28 days.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_121","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_365_bits_one_day-udf","title":"shift_365_bits_one_day (UDF)","text":"Shift input bits one day left and drop any bits beyond 365 days.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_122","title":"Parameters","text":"INPUTS
x BYTES\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#shift_one_day-udf","title":"shift_one_day (UDF)","text":"Returns the bitfield shifted by one day, 0 for NULL
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_123","title":"Parameters","text":"INPUTS
x INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#smoot_usage_from_28_bits-udf","title":"smoot_usage_from_28_bits (UDF)","text":"Calculates a variety of metrics based on bit patterns of daily usage for the smoot_usage_* tables.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_124","title":"Parameters","text":"INPUTS
bit_arrays ARRAY<STRUCT<days_created_profile_bits INT64, days_active_bits INT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#vector_add-udf","title":"vector_add (UDF)","text":"This function adds two vectors. The two vectors can have different length. If one vector is null, the other vector will be returned directly.
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_125","title":"Parameters","text":"INPUTS
a ARRAY<INT64>, b ARRAY<INT64>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#zero_as_365_bits-udf","title":"zero_as_365_bits (UDF)","text":"Zero represented as a 365-bit byte array
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_126","title":"Parameters","text":"INPUTS
) AS ( REPEAT(b'\\x00', 46\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf/#zeroed_array-udf","title":"zeroed_array (UDF)","text":"Generates an array if all zeroes, of arbitrary length
"},{"location":"moz-fx-data-shared-prod/udf/#parameters_127","title":"Parameters","text":"INPUTS
len INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/","title":"Udf js","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#bootstrap_percentile_ci-udf","title":"bootstrap_percentile_ci (UDF)","text":"Calculate a confidence interval using an efficient bootstrap sampling technique for a given percentile of a histogram. This implementation relies on the stdlib.js library and the binomial quantile function (https://github.com/stdlib-js/stats-base-dists-binomial-quantile/) for randomly sampling from a binomial distribution.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters","title":"Parameters","text":"INPUTS
percentiles ARRAY<INT64>, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>, metric STRING\n
OUTPUTS
ARRAY<STRUCT<metric STRING, statistic STRING, point FLOAT64, lower FLOAT64, upper FLOAT64, parameter STRING>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#crc32-udf","title":"crc32 (UDF)","text":"Calculate the CRC-32 hash of an input string. The implementation here could be optimized. In particular, it calculates a lookup table on every invocation which could be cached and reused. In practice, though, this implementation appears to be fast enough that further optimization is not yet warranted. Based on https://stackoverflow.com/a/18639999/1260237 See https://en.wikipedia.org/wiki/Cyclic_redundancy_check
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_1","title":"Parameters","text":"INPUTS
data STRING\n
OUTPUTS
INT64 DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#decode_uri_attribution-udf","title":"decode_uri_attribution (UDF)","text":"URL decodes the raw firefox_installer.install.attribution string to a STRUCT. The fields campaign, content, dlsource, dltoken, experiment, medium, source, ua, variation the string are extracted. If any value is (not+set) it is converted to (not set) to match the text from GA when the fields are not set.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_2","title":"Parameters","text":"INPUTS
attribution STRING\n
OUTPUTS
STRUCT<campaign STRING, content STRING, dlsource STRING, dltoken STRING, experiment STRING, medium STRING, source STRING, ua STRING, variation STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#extract_string_from_bytes-udf","title":"extract_string_from_bytes (UDF)","text":"Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function extracts string data from payload
which is in bytes.
INPUTS
payload BYTES\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#gunzip-udf","title":"gunzip (UDF)","text":"Unzips a GZIP string. This implementation relies on the zlib.js library (https://github.com/imaya/zlib.js) and the atob function for decoding base64.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_4","title":"Parameters","text":"INPUTS
input BYTES\n
OUTPUTS
STRING DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_mean_ci-udf","title":"jackknife_mean_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the mean of an array of values for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_5","title":"Parameters","text":"INPUTS
n_buckets INT64, values_per_bucket ARRAY<FLOAT64>\n
OUTPUTS
STRUCT<low FLOAT64, high FLOAT64, pm FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_percentile_ci-udf","title":"jackknife_percentile_ci (UDF)","text":"Calculate a confidence interval using a jackknife resampling technique for a given percentile of a histogram.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_6","title":"Parameters","text":"INPUTS
percentile FLOAT64, histogram STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>\n
OUTPUTS
STRUCT<low FLOAT64, high FLOAT64, percentile FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_ratio_ci-udf","title":"jackknife_ratio_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the weighted mean of an array of ratios for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(active_days_in_week) AS active_days_in_week, SUM(wau) AS wau FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_ratio_ci(20, ARRAY_AGG(STRUCT(CAST(active_days_in_week AS float64), CAST(wau as FLOAT64)))) AS intensity FROM bucketed GROUP BY submission_date
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_7","title":"Parameters","text":"INPUTS
n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n
OUTPUTS
intensity FROM bucketed GROUP BY submission_date */ CREATE OR REPLACE FUNCTION udf_js.jackknife_ratio_ci( n_buckets INT64, values_per_bucket ARRAY<STRUCT<numerator FLOAT64, denominator FLOAT64>>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#jackknife_sum_ci-udf","title":"jackknife_sum_ci (UDF)","text":"Calculates a confidence interval using a jackknife resampling technique for the sum of an array of counts for various buckets; see https://en.wikipedia.org/wiki/Jackknife_resampling Users must specify the number of expected buckets as the first parameter to guard against the case where empty buckets lead to an array with missing elements. Usage generally involves first calculating an aggregate count per bucket, then aggregating over buckets, passing ARRAY_AGG(metric) to this function. Example: WITH bucketed AS ( SELECT submission_date, SUM(dau) AS dau_sum FROM mytable GROUP BY submission_date, bucket_id ) SELECT submission_date, udf_js.jackknife_sum_ci(ARRAY_AGG(dau_sum)).* FROM bucketed GROUP BY submission_date
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_8","title":"Parameters","text":"INPUTS
n_buckets INT64, counts_per_bucket ARRAY<INT64>\n
OUTPUTS
STRUCT<total INT64, low INT64, high INT64, pm INT64>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_events-udf","title":"json_extract_events (UDF)","text":""},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_9","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<event_process STRING, event_timestamp INT64, event_category STRING, event_object STRING, event_method STRING, event_string_value STRING, event_map_values ARRAY<STRUCT<key STRING, value STRING>>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_histogram-udf","title":"json_extract_histogram (UDF)","text":"Returns a parsed struct from a JSON string representing a histogram. This implementation uses JavaScript and is provided for performance comparison; see udf/udf_json_extract_histogram for a pure SQL implementation that will likely be more usable in practice.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_10","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
STRUCT<bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_keyed_histogram-udf","title":"json_extract_keyed_histogram (UDF)","text":"Returns an array of parsed structs from a JSON string representing a keyed histogram. This is likely only useful for histograms that weren't properly parsed to fields, so ended up embedded in an additional_properties JSON blob. Normally, keyed histograms will be modeled as a key/value struct where the values are JSON representations of single histograms. There is no pure SQL equivalent to this function, since BigQuery does not provide any functions for listing or iterating over keysn in a JSON map.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_11","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, bucket_count INT64, histogram_type INT64, `sum` INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#json_extract_missing_cols-udf","title":"json_extract_missing_cols (UDF)","text":"Extract missing columns from additional properties. More generally, get a list of nodes from a JSON blob. Array elements are indicated as [...]. param input: The JSON blob to explode param indicates_node: An array of strings. If a key's value is an object, and contains one of these values, that key is returned as a node. param known_nodes: An array of strings. If a key is in this array, it is returned as a node. Notes: - Use indicates_node for things like histograms. For example ['histogram_type'] will ensure that each histogram will be returned as a missing node, rather than the subvalues within the histogram (e.g. values, sum, etc.) - Use known_nodes if you're aware of a missing section, like ['simpleMeasurements'] See here for an example usage https://sql.telemetry.mozilla.org/queries/64460/source
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_12","title":"Parameters","text":"INPUTS
input STRING, indicates_node ARRAY<STRING>, known_nodes ARRAY<STRING>\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_active_addons-udf","title":"main_summary_active_addons (UDF)","text":"Add fields from additional_attributes to active_addons in main pings. Return an array instead of a \"map\" for backwards compatibility. The INT64 columns from BigQuery may be passed as strings, so parseInt before returning them if they will be coerced to BOOL. The fields from additional_attributes due to union types: integer or boolean for foreignInstall and userDisabled; string or number for version. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L422-L449
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_13","title":"Parameters","text":"INPUTS
active_addons ARRAY<STRUCT<key STRING, value STRUCT<app_disabled BOOL, blocklisted BOOL, description STRING, foreign_install INT64, has_binary_components BOOL, install_day INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL, name STRING, scope INT64, signed_state INT64, type STRING, update_day INT64, user_disabled INT64, version STRING>>>, active_addons_json STRING\n
OUTPUTS
ARRAY<STRUCT<addon_id STRING, blocklisted BOOL, name STRING, user_disabled BOOL, app_disabled BOOL, version STRING, scope INT64, type STRING, foreign_install BOOL, has_binary_components BOOL, install_day INT64, update_day INT64, signed_state INT64, is_system BOOL, is_web_extension BOOL, multiprocess_compatible BOOL>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_addon_scalars-udf","title":"main_summary_addon_scalars (UDF)","text":"Parse scalars from payload.processes.dynamic into map columns for each value type. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L385-L399
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_14","title":"Parameters","text":"INPUTS
dynamic_scalars_json STRING, dynamic_keyed_scalars_json STRING\n
OUTPUTS
STRUCT<keyed_boolean_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value BOOL>>>>, keyed_uint_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value INT64>>>>, string_addon_scalars ARRAY<STRUCT<key STRING, value STRING>>, keyed_string_addon_scalars ARRAY<STRUCT<key STRING, value ARRAY<STRUCT<key STRING, value STRING>>>>, uint_addon_scalars ARRAY<STRUCT<key STRING, value INT64>>, boolean_addon_scalars ARRAY<STRUCT<key STRING, value BOOL>>>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#main_summary_disabled_addons-udf","title":"main_summary_disabled_addons (UDF)","text":"Report the ids of the addons which are in the addonDetails but not in the activeAddons. They are the disabled addons (possibly because they are legacy). We need this as addonDetails may contain both disabled and active addons. https://github.com/mozilla/telemetry-batch-view/blob/ea0733c00df191501b39d2c4e2ece3fe703a0ef3/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L451-L464
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_15","title":"Parameters","text":"INPUTS
active_addon_ids ARRAY<STRING>, addon_details_json STRING\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#parse_sponsored_interaction-udf","title":"parse_sponsored_interaction (UDF)","text":"Related to https://mozilla-hub.atlassian.net/browse/RS-682. The function parses the sponsored interaction column from payload_error_bytes.contextual_services table.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_16","title":"Parameters","text":"INPUTS
params STRING\n
OUTPUTS
STRUCT<`source` STRING, formFactor STRING, scenario STRING, interactionType STRING, contextId STRING, reportingUrl STRING, requestId STRING, submissionTimestamp TIMESTAMP, parsedReportingUrl JSON, originalDocType STRING, originalNamespace STRING, interactionCount INTEGER, flaggedFraud BOOLEAN>\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#sample_id-udf","title":"sample_id (UDF)","text":"Stably hash a client_id to an integer between 0 and 99. This function is technically defined in SQL, but it calls a JS UDF implementation of a CRC-32 hash, so we defined it here to make it clear that its performance may be limited by BigQuery's JavaScript UDF environment.
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_17","title":"Parameters","text":"INPUTS
client_id STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"moz-fx-data-shared-prod/udf_js/#snake_case_columns-udf","title":"snake_case_columns (UDF)","text":"This UDF takes a list of column names to snake case and transform them to be compatible with the BigQuery column naming format. Based on the existing ingestion logic https://github.com/mozilla/gcp-ingestion/blob/dad29698271e543018eddbb3b771ad7942bf4ce5/ ingestion-core/src/main/java/com/mozilla/telemetry/ingestion/core/transform/PubsubMessageToObjectNode.java#L824
"},{"location":"moz-fx-data-shared-prod/udf_js/#parameters_18","title":"Parameters","text":"INPUTS
input ARRAY<STRING>\n
OUTPUTS
ARRAY<STRING>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/about/","title":"mozfun","text":"mozfun
is a public GCP project provisioning publicly accessible user-defined functions (UDFs) and other function-like resources.
Returns whether a given Addon ID is an adblocker.
Determine if a given Addon ID is for an adblocker.
As an example, this query will give the number of users who have an adblocker installed.
SELECT\n submission_date,\n COUNT(DISTINCT client_id) AS dau,\nFROM\n mozdata.telemetry.addons\nWHERE\n mozfun.addons.is_adblocker(addon_id)\n AND submission_date >= \"2023-01-01\"\nGROUP BY\n submission_date\n
"},{"location":"mozfun/addons/#parameters","title":"Parameters","text":"INPUTS
addon_id STRING\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/","title":"Assert","text":""},{"location":"mozfun/assert/#all_fields_null-udf","title":"all_fields_null (UDF)","text":""},{"location":"mozfun/assert/#parameters","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#approx_equals-udf","title":"approx_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_1","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE, tolerance FLOAT64\n
Source | Edit
"},{"location":"mozfun/assert/#array_empty-udf","title":"array_empty (UDF)","text":""},{"location":"mozfun/assert/#parameters_2","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#array_equals-udf","title":"array_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_3","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#array_equals_any_order-udf","title":"array_equals_any_order (UDF)","text":""},{"location":"mozfun/assert/#parameters_4","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#equals-udf","title":"equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_5","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#error-udf","title":"error (UDF)","text":""},{"location":"mozfun/assert/#parameters_6","title":"Parameters","text":"INPUTS
name STRING, expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#false-udf","title":"false (UDF)","text":""},{"location":"mozfun/assert/#parameters_7","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/assert/#histogram_equals-udf","title":"histogram_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_8","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#json_equals-udf","title":"json_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_9","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#map_entries_equals-udf","title":"map_entries_equals (UDF)","text":"Like map_equals but error message contains only the offending entry
"},{"location":"mozfun/assert/#parameters_10","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#map_equals-udf","title":"map_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_11","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/assert/#not_null-udf","title":"not_null (UDF)","text":""},{"location":"mozfun/assert/#parameters_12","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
"},{"location":"mozfun/assert/#null-udf","title":"null (UDF)","text":""},{"location":"mozfun/assert/#parameters_13","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#sql_equals-udf","title":"sql_equals (UDF)","text":"Compare SQL Strings for equality
"},{"location":"mozfun/assert/#parameters_14","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#struct_equals-udf","title":"struct_equals (UDF)","text":""},{"location":"mozfun/assert/#parameters_15","title":"Parameters","text":"INPUTS
expected ANY TYPE, actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/assert/#true-udf","title":"true (UDF)","text":""},{"location":"mozfun/assert/#parameters_16","title":"Parameters","text":"INPUTS
actual ANY TYPE\n
Source | Edit
"},{"location":"mozfun/bits28/","title":"bits28","text":"The bits28
functions provide an API for working with \"bit pattern\" INT64 fields, as used in the clients_last_seen
dataset for desktop Firefox and similar datasets for other applications.
A powerful feature of the clients_last_seen
methodology is that it doesn't record specific metrics like MAU and WAU directly, but rather each row stores a history of the discrete days on which a client was active in the past 28 days. We could calculate active users in a 10 day or 25 day window just as efficiently as a 7 day (WAU) or 28 day (MAU) window. But we can also define completely new metrics based on these usage histories, such as various retention definitions.
The usage history is encoded as a \"bit pattern\" where the physical type of the field is a BigQuery INT64, but logically the integer represents an array of bits, with each 1 indicating a day where the given client was active and each 0 indicating a day where the client was inactive.
"},{"location":"mozfun/bits28/#active_in_range-udf","title":"active_in_range (UDF)","text":"Return a boolean indicating if any bits are set in the specified range of a bit pattern. The start_offset
must be zero or a negative number indicating an offset from the rightmost bit in the pattern. n_bits is the number of bits to consider, counting right from the bit at start_offset.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
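For example, to check whether a client was active at all during the most recent 7 days of the pattern:
SELECT\n mozfun.bits28.active_in_range(days_seen_bits, -6, 7) AS active_last_7_days\nFROM\n `mozdata.telemetry.clients_last_seen`\nWHERE\n submission_date = '2020-01-28'\n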
"},{"location":"mozfun/bits28/#parameters","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
BOOLEAN\n
Source | Edit
"},{"location":"mozfun/bits28/#days_since_seen-udf","title":"days_since_seen (UDF)","text":"Return the position of the rightmost set bit in an INT64 bit pattern.
To determine this position, we take a bitwise AND of the bit pattern and its complement, then we determine the position of the bit via base-2 logarithm; see https://stackoverflow.com/a/42747608/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n mozfun.bits28.days_since_seen(18)\n-- >> 1\n
"},{"location":"mozfun/bits28/#parameters_1","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#from_string-udf","title":"from_string (UDF)","text":"Convert a string representing individual bits into an INT64.
Implementation based on https://stackoverflow.com/a/51600210/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
"},{"location":"mozfun/bits28/#parameters_2","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#range-udf","title":"range (UDF)","text":"Return an INT64 representing a range of bits from a source bit pattern.
The start_offset must be zero or a negative number indicating an offset from the rightmost bit in the pattern.
n_bits is the number of bits to consider, counting right from the bit at start_offset.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n -- Signature is bits28.range(offset_to_day_0, start_bit, number_of_bits)\n mozfun.bits28.range(days_seen_bits, -13 + 0, 7) AS week_0_bits,\n mozfun.bits28.range(days_seen_bits, -13 + 7, 7) AS week_1_bits\nFROM\n `mozdata.telemetry.clients_last_seen`\nWHERE\n submission_date > '2020-01-01'\n
"},{"location":"mozfun/bits28/#parameters_3","title":"Parameters","text":"INPUTS
bits INT64, start_offset INT64, n_bits INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bits28/#retention-udf","title":"retention (UDF)","text":"Return a nested struct providing booleans indicating whether a given client was active various time periods based on the passed bit pattern.
"},{"location":"mozfun/bits28/#parameters_4","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
Source | Edit
"},{"location":"mozfun/bits28/#to_dates-udf","title":"to_dates (UDF)","text":"Convert a bit pattern into an array of the dates is represents.
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
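For example, binary 11 means the client was seen on the submission date and the day before:
SELECT\n mozfun.bits28.to_dates(3, DATE '2020-01-28')\n-- >> [2020-01-27, 2020-01-28]\n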
"},{"location":"mozfun/bits28/#parameters_5","title":"Parameters","text":"INPUTS
bits INT64, submission_date DATE\n
OUTPUTS
ARRAY<DATE>\n
Source | Edit
"},{"location":"mozfun/bits28/#to_string-udf","title":"to_string (UDF)","text":"Convert an INT64 field into a 28-character string representing the individual bits.
Implementation based on https://stackoverflow.com/a/51600210/1260237
See detailed docs for the bits28 suite of functions: https://docs.telemetry.mozilla.org/cookbooks/clients_last_seen_bits.html#udf-reference
SELECT\n [mozfun.bits28.to_string(1), mozfun.bits28.to_string(2), mozfun.bits28.to_string(3)]\n-- >>> ['0000000000000000000000000001',\n-- '0000000000000000000000000010',\n-- '0000000000000000000000000011']\n
"},{"location":"mozfun/bits28/#parameters_6","title":"Parameters","text":"INPUTS
bits INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/bytes/","title":"bytes","text":""},{"location":"mozfun/bytes/#bit_pos_to_byte_pos-udf","title":"bit_pos_to_byte_pos (UDF)","text":"Given a bit position, get the byte that bit appears in. 1-indexed (to match substr), and accepts negative values.
"},{"location":"mozfun/bytes/#parameters","title":"Parameters","text":"INPUTS
bit_pos INT64\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/bytes/#extract_bits-udf","title":"extract_bits (UDF)","text":"Extract bits from a byte array. Roughly matches substr with three arguments: b: bytes - The byte string we need to extract from start: int - The position of the first bit we want to extract. Can be negative to start from the end of the byte array. One-indexed, like substring. length: int - The number of bits we want to extract
The return byte array will have CEIL(length/8) bytes. The bits of interest will start at the beginning of the byte string. In other words, the byte array will have trailing 0s for any non-relevant fields.
Examples: bytes.extract_bits(b'\\x0F\\xF0', 5, 8) = b'\\xFF' bytes.extract_bits(b'\\x0C\\xC0', -12, 8) = b'\\xCC'
"},{"location":"mozfun/bytes/#parameters_1","title":"Parameters","text":"INPUTS
b BYTES, `begin` INT64, length INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"mozfun/bytes/#zero_right-udf","title":"zero_right (UDF)","text":"Zero bits on the right of byte
"},{"location":"mozfun/bytes/#parameters_2","title":"Parameters","text":"INPUTS
b BYTES, length INT64\n
OUTPUTS
BYTES\n
Source | Edit
"},{"location":"mozfun/event_analysis/","title":"event_analysis","text":"These functions are specific for use with the events_daily
and event_types
tables. By themselves, these two tables are nearly impossible to use since the event history is compressed; however, these stored procedures should make the data accessible.
The events_daily
table is created as a result of two steps: 1. Map each event to a single UTF8 char which will represent it 2. Group each client-day and store a string that records, using the compressed format, that clients' event history for that day. The characters are ordered by the timestamp which they appeared that day.
The best way to access this data is to create a view to do the heavy lifting. For example, to see which clients completed a certain action, you can create a view using these functions that knows what that action's representation is (using the compressed mapping from step 1) and create a regex string that checks for the presence of that event. The view makes this transparent, and allows users to simply query a boolean field representing the presence of that event on that day.
"},{"location":"mozfun/event_analysis/#aggregate_match_strings-udf","title":"aggregate_match_strings (UDF)","text":"Given an array of strings that each match a single event, aggregate those into a single regex string that will match any of the events.
"},{"location":"mozfun/event_analysis/#parameters","title":"Parameters","text":"INPUTS
match_strings ARRAY<STRING>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_count_steps_query-stored-procedure","title":"create_count_steps_query (Stored Procedure)","text":"Generate the SQL statement that can be used to create an easily queryable view on events data.
"},{"location":"mozfun/event_analysis/#parameters_1","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n
OUTPUTS
sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_events_view-stored-procedure","title":"create_events_view (Stored Procedure)","text":"Create a view that queries the events_daily
table. This view currently supports both funnels and event counts. Funnels are created as a struct, with each step in the funnel as a boolean column in the struct, indicating whether the user completed that step on that day. Event counts are simply integers.
create_events_view(\n view_name STRING,\n project STRING,\n dataset STRING,\n funnels ARRAY<STRUCT<\n funnel_name STRING,\n funnel ARRAY<STRUCT<\n step_name STRING,\n events ARRAY<STRUCT<\n category STRING,\n event_name STRING>>>>>>,\n counts ARRAY<STRUCT<\n count_name STRING,\n events ARRAY<STRUCT<\n category STRING,\n event_name STRING>>>>\n )\n
view_name
: The name of the view that will be created. This view will be in the shared-prod project, in the analysis bucket, and so will be queryable at: `moz-fx-data-shared-prod`.analysis.{view_name}\n
project
: The project where the dataset
is located.dataset
: The dataset that must contain both the events_daily
and event_types
tables.funnels
: An array of funnels that will be created. Each funnel has two parts: 1. funnel_name
: The name of the funnel is what the column representing the funnel will be named in the view. For example, with the value \"onboarding\"
, the view can be selected as follows: SELECT onboarding\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. funnel
: The ordered series of steps that make up a funnel. Each step also has: 1. step_name
: Used to name the column within the funnel and represents whether the user completed that step on that day. For example, within onboarding
a user may have completed_first_card
as a step; this can be queried at SELECT onboarding.completed_first_card\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. events
: The set of events which indicate the user completed that step of the funnel. Most of the time this is a single event. Each event has a category
and event_name
.counts
: An array of counts. Each count has two parts, similar to funnel steps: 1. count_name
: Used to name the column representing the event count. E.g. \"clicked_settings_count\"
would be queried at SELECT clicked_settings_count\nFROM `moz-fx-data-shared-prod`.analysis.{view_name}\n
2. events
: The set of events you want to count. Each event has a category
and event_name
.Because the view definitions themselves are not informative about the contents of the events fields, it is best to put your query immediately after the procedure invocation, rather than invoking the procedure and running a separate query.
This STMO query is an example of doing so. This allows viewers of the query to easily interpret what the funnel and count columns represent.
"},{"location":"mozfun/event_analysis/#structure-of-the-resulting-view","title":"Structure of the Resulting View","text":"The view will be created at
`moz-fx-data-shared-prod`.analysis.{view_name}.\n
The view will have a schema roughly matching the following:
root\n |-- submission_date: date\n |-- client_id: string\n |-- {funnel_1_name}: record\n | |-- {funnel_step_1_name} boolean\n | |-- {funnel_step_2_name} boolean\n ...\n |-- {funnel_N_name}: record\n | |-- {funnel_step_M_name}: boolean\n |-- {count_1_name}: integer\n ...\n |-- {count_N_name}: integer\n ...dimensions...\n
"},{"location":"mozfun/event_analysis/#funnels","title":"Funnels","text":"Each funnel will be a STRUCT
with nested columns representing completion of each step. The types of those columns are boolean, and represent whether the user completed that step on that day.
STRUCT(\n completed_step_1 BOOLEAN,\n completed_step_2 BOOLEAN,\n ...\n) AS funnel_name\n
With one row per-user per-day, you can use COUNTIF(funnel_name.completed_step_N)
to query these fields. See below for an example.
Each event count is simply an INT64
representing the number of times the user completed those events on that day. If there are multiple events represented within one count, the values are summed. For example, if you wanted to know the number of times a user opened or closed the app, you could create a single event count with those two events.
event_count_name INT64\n
"},{"location":"mozfun/event_analysis/#examples","title":"Examples","text":"The following creates a few fields: - collection_flow
is a funnel for those that started creating a collection within Fenix, and then finished, either by adding those tabs to an existing collection or saving it as a new collection. - collection_flow_saved
represents users who started the collection flow then saved it as a new collection. - number_of_collections_created
is the number of collections created - number_of_collections_deleted
is the number of collections deleted
CALL mozfun.event_analysis.create_events_view(\n 'fenix_collection_funnels',\n 'moz-fx-data-shared-prod',\n 'org_mozilla_firefox',\n\n -- Funnels\n [\n STRUCT(\n \"collection_flow\" AS funnel_name,\n [STRUCT(\n \"started_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n STRUCT(\n \"completed_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'saved' AS event_name),\n STRUCT('collections' AS category, 'tabs_added' AS event_name)] AS events)\n ] AS funnel),\n\n STRUCT(\n \"collection_flow_saved\" AS funnel_name,\n [STRUCT(\n \"started_collection_creation\" AS step_name,\n [STRUCT('collections' AS category, 'tab_select_opened' AS event_name)] AS events),\n STRUCT(\n \"saved_collection\" AS step_name,\n [STRUCT('collections' AS category, 'saved' AS event_name)] AS events)\n ] AS funnel)\n ],\n\n -- Event Counts\n [\n STRUCT(\n \"number_of_collections_created\" AS count_name,\n [STRUCT('collections' AS category, 'saved' AS event_name)] AS events\n ),\n STRUCT(\n \"number_of_collections_deleted\" AS count_name,\n [STRUCT('collections' AS category, 'removed' AS event_name)] AS events\n )\n ]\n);\n
From there, you can query a few things. For example, the fraction of users who completed each step of the collection flow over time:
SELECT\n submission_date,\n COUNTIF(collection_flow.started_collection_creation) / COUNT(*) AS started_collection_creation,\n COUNTIF(collection_flow.completed_collection_creation) / COUNT(*) AS completed_collection_creation,\nFROM\n `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n submission_date\n
Or you can see the number of collections created and deleted:
SELECT\n submission_date,\n SUM(number_of_collections_created) AS number_of_collections_created,\n SUM(number_of_collections_deleted) AS number_of_collections_deleted,\nFROM\n `moz-fx-data-shared-prod`.analysis.fenix_collection_funnels\nWHERE\n submission_date >= DATE_SUB(current_date, INTERVAL 28 DAY)\nGROUP BY\n submission_date\n
"},{"location":"mozfun/event_analysis/#parameters_2","title":"Parameters","text":"INPUTS
view_name STRING, project STRING, dataset STRING, funnels ARRAY<STRUCT<funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>>>, counts ARRAY<STRUCT<count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_funnel_regex-udf","title":"create_funnel_regex (UDF)","text":"Given an array of match strings, each representing a single funnel step, aggregate them into a regex string that will match only against the entire funnel. If intermediate_steps is TRUE, this allows for there to be events that occur between the funnel steps.
"},{"location":"mozfun/event_analysis/#parameters_3","title":"Parameters","text":"INPUTS
step_regexes ARRAY<STRING>, intermediate_steps BOOLEAN\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#create_funnel_steps_query-stored-procedure","title":"create_funnel_steps_query (Stored Procedure)","text":"Generate the SQL statement that can be used to create an easily queryable view on events data.
"},{"location":"mozfun/event_analysis/#parameters_4","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, funnel ARRAY<STRUCT<list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
OUTPUTS
sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#escape_metachars-udf","title":"escape_metachars (UDF)","text":"Escape all metachars from a regex string. This will make the string an exact match, no matter what it contains.
"},{"location":"mozfun/event_analysis/#parameters_5","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_index_to_match_string-udf","title":"event_index_to_match_string (UDF)","text":"Given an event index string, create a match string that is an exact match in the events_daily table.
"},{"location":"mozfun/event_analysis/#parameters_6","title":"Parameters","text":"INPUTS
index STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_property_index_to_match_string-udf","title":"event_property_index_to_match_string (UDF)","text":"Given an event index and property index from an event_types
table, returns a regular expression to match corresponding events within an events_daily
table's events
string that aren't missing the specified property.
INPUTS
event_index STRING, property_index INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#event_property_value_to_match_string-udf","title":"event_property_value_to_match_string (UDF)","text":"Given an event index, property index, and property value from an event_types
table, returns a regular expression to match corresponding events within an events_daily
table's events
string.
INPUTS
event_index STRING, property_index INTEGER, property_value STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#extract_event_counts-udf","title":"extract_event_counts (UDF)","text":"Extract the events and their counts from an events string. This function explicitly ignores event properties, and retrieves just the counts of the top-level events.
"},{"location":"mozfun/event_analysis/#usage_1","title":"Usage","text":"extract_event_counts(\n events STRING\n)\n
events
- A comma-separated events string, where each event is represented as a string of unicode chars.
See this dashboard for example usage.
"},{"location":"mozfun/event_analysis/#parameters_9","title":"Parameters","text":"INPUTS
events STRING\n
OUTPUTS
ARRAY<STRUCT<index STRING, count INT64>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#extract_event_counts_with_properties-udf","title":"extract_event_counts_with_properties (UDF)","text":"Extract events with event properties and their associated counts. Also extracts raw events and their counts. This allows for querying with and without properties in the same dashboard.
"},{"location":"mozfun/event_analysis/#usage_2","title":"Usage","text":"extract_event_counts_with_properties(\n events STRING\n)\n
events
- A comma-separated events string, where each event is represented as a string of unicode chars.
See this query for example usage.
"},{"location":"mozfun/event_analysis/#caveats","title":"Caveats","text":"This function extracts both counts for events with each property, and for all events without their properties.
This allows us to include both total counts for an event (with any property value), and events that don't have properties.
"},{"location":"mozfun/event_analysis/#parameters_10","title":"Parameters","text":"INPUTS
events STRING\n
OUTPUTS
ARRAY<STRUCT<event_index STRING, property_index INT64, property_value_index STRING, count INT64>>\n
Source | Edit
"},{"location":"mozfun/event_analysis/#get_count_sql-stored-procedure","title":"get_count_sql (Stored Procedure)","text":"For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.
"},{"location":"mozfun/event_analysis/#parameters_11","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, count_name STRING, events ARRAY<STRUCT<category STRING, event_name STRING>>\n
OUTPUTS
count_sql STRING\n
Source | Edit
"},{"location":"mozfun/event_analysis/#get_funnel_steps_sql-stored-procedure","title":"get_funnel_steps_sql (Stored Procedure)","text":"For a given funnel, get a SQL statement that can be used to determine if an events string contains that funnel.
"},{"location":"mozfun/event_analysis/#parameters_12","title":"Parameters","text":"INPUTS
project STRING, dataset STRING, funnel_name STRING, funnel ARRAY<STRUCT<step_name STRING, list ARRAY<STRUCT<category STRING, event_name STRING>>>>\n
OUTPUTS
funnel_sql STRING\n
Source | Edit
"},{"location":"mozfun/ga/","title":"Ga","text":""},{"location":"mozfun/ga/#nullify_string-udf","title":"nullify_string (UDF)","text":"Nullify a GA string, which sometimes come in \"(not set)\" or simply \"\"
UDF for handling empty Google Analytics data.
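Based on that behaviour, a call would be expected to return NULL for the placeholder values and pass real values through:
SELECT\n  mozfun.ga.nullify_string('(not set)') AS not_set,  -- NULL\n  mozfun.ga.nullify_string('') AS empty,             -- NULL\n  mozfun.ga.nullify_string('organic') AS value       -- 'organic'\n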
"},{"location":"mozfun/ga/#parameters","title":"Parameters","text":"INPUTS
s STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/","title":"Glam","text":""},{"location":"mozfun/glam/#build_hour_to_datetime-udf","title":"build_hour_to_datetime (UDF)","text":"Parses the custom build id used for Fenix builds in GLAM to a datetime.
"},{"location":"mozfun/glam/#parameters","title":"Parameters","text":"INPUTS
build_hour STRING\n
OUTPUTS
DATETIME\n
Source | Edit
"},{"location":"mozfun/glam/#build_seconds_to_hour-udf","title":"build_seconds_to_hour (UDF)","text":"Returns a custom build id generated from the build seconds of a FOG build.
"},{"location":"mozfun/glam/#parameters_1","title":"Parameters","text":"INPUTS
build_hour STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#fenix_build_to_build_hour-udf","title":"fenix_build_to_build_hour (UDF)","text":"Returns a custom build id generated from the build hour of a Fenix build.
"},{"location":"mozfun/glam/#parameters_2","title":"Parameters","text":"INPUTS
app_build_id STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_bucket_from_value-udf","title":"histogram_bucket_from_value (UDF)","text":""},{"location":"mozfun/glam/#parameters_3","title":"Parameters","text":"INPUTS
buckets ARRAY<STRING>, val FLOAT64\n
OUTPUTS
FLOAT64\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_buckets_cast_string_array-udf","title":"histogram_buckets_cast_string_array (UDF)","text":"Cast histogram buckets into a string array.
"},{"location":"mozfun/glam/#parameters_4","title":"Parameters","text":"INPUTS
buckets ARRAY<INT64>\n
OUTPUTS
ARRAY<STRING>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_cast_json-udf","title":"histogram_cast_json (UDF)","text":"Cast a histogram into a JSON blob.
"},{"location":"mozfun/glam/#parameters_5","title":"Parameters","text":"INPUTS
histogram ARRAY<STRUCT<key STRING, value FLOAT64>>\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_cast_struct-udf","title":"histogram_cast_struct (UDF)","text":"Cast a String-based JSON histogram to an Array of Structs
"},{"location":"mozfun/glam/#parameters_6","title":"Parameters","text":"INPUTS
json_str STRING\n
OUTPUTS
ARRAY<STRUCT<KEY STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_fill_buckets-udf","title":"histogram_fill_buckets (UDF)","text":"Interpolate missing histogram buckets with empty buckets.
"},{"location":"mozfun/glam/#parameters_7","title":"Parameters","text":"INPUTS
input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_fill_buckets_dirichlet-udf","title":"histogram_fill_buckets_dirichlet (UDF)","text":"Interpolate missing histogram buckets with empty buckets so it becomes a valid estimator for the dirichlet distribution.
See: https://docs.google.com/document/d/1ipy1oFIKDvHr3R6Ku0goRjS11R1ZH1z2gygOGkSdqUg
To use this, you must first: Aggregate the histograms to the client level, to get a histogram {k1: p1, k2: p2, ..., kK: pK} where the p's are proportions (p1, p2, ... sum to 1) and K is the number of buckets.
This is then the client's estimated density, and every client has been reduced to one row (i.e. the client's histograms are reduced to this single one and normalized).
Then add all of these across clients to get {k1: P1, k2: P2, ..., kK: PK} where P1 = sum(p1 across N clients) and P2 = sum(p2 across N clients).
Calculate the total number of buckets K, as well as the total number of profiles N reporting.
Then our estimate for the final density is {k1: (P1 + 1/K) / (N + 1), k2: (P2 + 1/K) / (N + 1), ...}.
"},{"location":"mozfun/glam/#parameters_8","title":"Parameters","text":"INPUTS
input_map ARRAY<STRUCT<key STRING, value FLOAT64>>, buckets ARRAY<STRING>, total_users INT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_filter_high_values-udf","title":"histogram_filter_high_values (UDF)","text":"Prevent overflows by only keeping buckets where value is less than 2^40 allowing 2^24 entries. This value was chosen somewhat abitrarily, typically the max histogram value is somewhere on the order of ~20 bits. Negative values are incorrect and should not happen but were observed, probably due to some bit flips.
"},{"location":"mozfun/glam/#parameters_9","title":"Parameters","text":"INPUTS
aggs ARRAY<STRUCT<key STRING, value INT64>>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value INT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_from_buckets_uniform-udf","title":"histogram_from_buckets_uniform (UDF)","text":"Create an empty histogram from an array of buckets.
"},{"location":"mozfun/glam/#parameters_10","title":"Parameters","text":"INPUTS
buckets ARRAY<STRING>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_exponential_buckets-udf","title":"histogram_generate_exponential_buckets (UDF)","text":"Generate exponential buckets for a histogram.
"},{"location":"mozfun/glam/#parameters_11","title":"Parameters","text":"INPUTS
min FLOAT64, max FLOAT64, nBuckets FLOAT64\n
OUTPUTS
ARRAY<FLOAT64>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_functional_buckets-udf","title":"histogram_generate_functional_buckets (UDF)","text":"Generate functional buckets for a histogram. This is specific to Glean.
See: https://github.com/mozilla/glean/blob/main/glean-core/src/histogram/functional.rs
A functional bucketing algorithm. The bucket index of a given sample is determined with the following function:
$$ i = \\lfloor n \\log_{\\text{base}}{(x)} \\rfloor $$
In other words, there are n buckets for each power of base
magnitude.
INPUTS
log_base INT64, buckets_per_magnitude INT64, range_max INT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_linear_buckets-udf","title":"histogram_generate_linear_buckets (UDF)","text":"Generate linear buckets for a histogram.
"},{"location":"mozfun/glam/#parameters_13","title":"Parameters","text":"INPUTS
min FLOAT64, max FLOAT64, nBuckets FLOAT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_generate_scalar_buckets-udf","title":"histogram_generate_scalar_buckets (UDF)","text":"Generate scalar buckets for a histogram using a fixed number of buckets.
"},{"location":"mozfun/glam/#parameters_14","title":"Parameters","text":"INPUTS
min_bucket FLOAT64, max_bucket FLOAT64, num_buckets INT64\n
OUTPUTS
ARRAY<FLOAT64>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_normalized_sum-udf","title":"histogram_normalized_sum (UDF)","text":"Compute the normalized sum of an array of histograms.
"},{"location":"mozfun/glam/#parameters_15","title":"Parameters","text":"INPUTS
arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#histogram_normalized_sum_with_original-udf","title":"histogram_normalized_sum_with_original (UDF)","text":"Compute the normalized and the non-normalized sum of an array of histograms.
"},{"location":"mozfun/glam/#parameters_16","title":"Parameters","text":"INPUTS
arrs ARRAY<STRUCT<key STRING, value INT64>>, weight FLOAT64\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64, non_norm_value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#map_from_array_offsets-udf","title":"map_from_array_offsets (UDF)","text":""},{"location":"mozfun/glam/#parameters_17","title":"Parameters","text":"INPUTS
required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#map_from_array_offsets_precise-udf","title":"map_from_array_offsets_precise (UDF)","text":""},{"location":"mozfun/glam/#parameters_18","title":"Parameters","text":"INPUTS
required ARRAY<FLOAT64>, `values` ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value FLOAT64>>\n
Source | Edit
"},{"location":"mozfun/glam/#percentile-udf","title":"percentile (UDF)","text":"Get the value of the approximate CDF at the given percentile.
"},{"location":"mozfun/glam/#parameters_19","title":"Parameters","text":"INPUTS
pct FLOAT64, histogram ARRAY<STRUCT<key STRING, value FLOAT64>>, type STRING\n
OUTPUTS
FLOAT64\n
Source | Edit
"},{"location":"mozfun/glean/","title":"glean","text":"Functions for working with Glean data.
"},{"location":"mozfun/glean/#legacy_compatible_experiments-udf","title":"legacy_compatible_experiments (UDF)","text":"Formats a Glean experiments field into a Legacy Telemetry experiments field by dropping the extra information that Glean collects
This UDF transforms the ping_info.experiments
field from Glean pings into the format for experiments
used by Legacy Telemetry pings. In particular, it drops the extra information that Glean pings collect.
If you need to combine Glean data with Legacy Telemetry data, then you can use this UDF to transform a Glean experiments field into the structure of a Legacy Telemetry one.
"},{"location":"mozfun/glean/#parameters","title":"Parameters","text":"INPUTS
ping_info__experiments ARRAY<STRUCT<key STRING, value STRUCT<branch STRING, extra STRUCT<type STRING, enrollment_id STRING>>>>\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/glean/#parse_datetime-udf","title":"parse_datetime (UDF)","text":"Parses a Glean datetime metric string value as a BigQuery timestamp.
See https://mozilla.github.io/glean/book/reference/metrics/datetime.html
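A hedged example, assuming the input is an ISO 8601 string with a timezone offset as recorded by Glean:
SELECT\n  mozfun.glean.parse_datetime('2022-06-01T12:34:56+00:00') AS parsed\n-- expected: TIMESTAMP '2022-06-01 12:34:56 UTC'\n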
"},{"location":"mozfun/glean/#parameters_1","title":"Parameters","text":"INPUTS
datetime_string STRING\n
OUTPUTS
TIMESTAMP\n
Source | Edit
"},{"location":"mozfun/glean/#timespan_nanos-udf","title":"timespan_nanos (UDF)","text":"Returns the number of nanoseconds represented by a Glean timespan struct.
See https://mozilla.github.io/glean/book/user/metrics/timespan.html
"},{"location":"mozfun/glean/#parameters_2","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/glean/#timespan_seconds-udf","title":"timespan_seconds (UDF)","text":"Returns the number of seconds represented by a Glean timespan struct, rounded down to full seconds.
See https://mozilla.github.io/glean/book/user/metrics/timespan.html
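A sketch assuming 'millisecond' is among the supported time_unit values; the result is rounded down to whole seconds:
SELECT\n  mozfun.glean.timespan_seconds(STRUCT('millisecond' AS time_unit, 2500 AS value)) AS seconds\n-- expected: 2\n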
"},{"location":"mozfun/glean/#parameters_3","title":"Parameters","text":"INPUTS
timespan STRUCT<time_unit STRING, value INT64>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/google_ads/","title":"Google ads","text":""},{"location":"mozfun/google_ads/#extract_segments_from_campaign_name-udf","title":"extract_segments_from_campaign_name (UDF)","text":"Extract Segments from a campaign name. Includes region, country_code, and language.
"},{"location":"mozfun/google_ads/#parameters","title":"Parameters","text":"INPUTS
campaign_name STRING\n
OUTPUTS
STRUCT<campaign_region STRING, campaign_country_code STRING, campaign_language STRING>\n
Source | Edit
"},{"location":"mozfun/google_search_console/","title":"google_search_console","text":"Functions for use with Google Search Console data.
"},{"location":"mozfun/google_search_console/#classify_site_query-udf","title":"classify_site_query (UDF)","text":"Classify a Google search query for a site as \"Anonymized\", \"Firefox Brand\", \"Pocket Brand\", \"Mozilla Brand\", or \"Non-Brand\".
"},{"location":"mozfun/google_search_console/#parameters","title":"Parameters","text":"INPUTS
site_domain_name STRING, query STRING, search_type STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_country_code-udf","title":"extract_url_country_code (UDF)","text":"Extract the country code from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_1","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_domain_name-udf","title":"extract_url_domain_name (UDF)","text":"Extract the domain name from a URL.
"},{"location":"mozfun/google_search_console/#parameters_2","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_language_code-udf","title":"extract_url_language_code (UDF)","text":"Extract the language code from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_3","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_locale-udf","title":"extract_url_locale (UDF)","text":"Extract the locale from a URL if it's present.
"},{"location":"mozfun/google_search_console/#parameters_4","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_path-udf","title":"extract_url_path (UDF)","text":"Extract the path from a URL.
"},{"location":"mozfun/google_search_console/#parameters_5","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/google_search_console/#extract_url_path_segment-udf","title":"extract_url_path_segment (UDF)","text":"Extract a particular path segment from a URL.
"},{"location":"mozfun/google_search_console/#parameters_6","title":"Parameters","text":"INPUTS
url STRING, segment_number INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/hist/","title":"hist","text":"Functions for working with string encodings of histograms from desktop telemetry.
"},{"location":"mozfun/hist/#count-udf","title":"count (UDF)","text":"Given histogram h, return the count of all measurements across all buckets.
Extracts the values from the histogram and sums them, returning the total_count.
"},{"location":"mozfun/hist/#parameters","title":"Parameters","text":"INPUTS
histogram STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract-udf","title":"extract (UDF)","text":"Return a parsed struct from a string-encoded histogram.
We support a variety of compact encodings as well as the classic JSON representation as sent in main pings.
The built-in BigQuery JSON parsing functions are not powerful enough to handle all the logic here, so we resort to some string processing. This function could behave unexpectedly on poorly-formatted histogram JSON, but we expect that payload validation in the data pipeline should ensure that histograms are well formed, which gives us some flexibility.
For more on desktop telemetry histogram structure, see:
The compact encodings were originally proposed in:
SELECT\n mozfun.hist.extract(\n '{\"bucket_count\":3,\"histogram_type\":4,\"sum\":1,\"range\":[1,2],\"values\":{\"0\":1,\"1\":0}}'\n ).sum\n-- 1\n
SELECT\n mozfun.hist.extract('5').sum\n-- 5\n
"},{"location":"mozfun/hist/#parameters_1","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract_histogram_sum-udf","title":"extract_histogram_sum (UDF)","text":"Extract a histogram sum from a JSON str representation
"},{"location":"mozfun/hist/#parameters_2","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#extract_keyed_hist_sum-udf","title":"extract_keyed_hist_sum (UDF)","text":"Sum of a keyed histogram, across all keys it contains.
"},{"location":"mozfun/hist/#extract-keyed-histogram-sum","title":"Extract Keyed Histogram Sum","text":"Takes a keyed histogram and returns a single number: the sum of all keys it contains. The expected input type is ARRAY<STRUCT<key STRING, value STRING>>
The return type is INT64
.
The key
field will be ignored, and the value is expected to be the compact histogram representation.
INPUTS
keyed_histogram ARRAY<STRUCT<key STRING, value STRING>>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#mean-udf","title":"mean (UDF)","text":"Given histogram h, return floor(mean) of the measurements in the bucket. That is, the histogram sum divided by the number of measurements taken.
https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L292-L307
"},{"location":"mozfun/hist/#parameters_4","title":"Parameters","text":"INPUTS
histogram ANY TYPE\n
OUTPUTS
STRUCT<sum INT64, VALUES ARRAY<STRUCT<value INT64>>>\n
Source | Edit
"},{"location":"mozfun/hist/#merge-udf","title":"merge (UDF)","text":"Merge an array of histograms into a single histogram.
INPUTS
histogram_list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/hist/#normalize-udf","title":"normalize (UDF)","text":"Normalize a histogram. Set sum to 1, and normalize to 1 the histogram bucket counts.
"},{"location":"mozfun/hist/#parameters_6","title":"Parameters","text":"INPUTS
histogram STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value INT64>>>\n
OUTPUTS
STRUCT<bucket_count INT64, `sum` INT64, histogram_type INT64, `range` ARRAY<INT64>, `values` ARRAY<STRUCT<key INT64, value FLOAT64>>>\n
Source | Edit
"},{"location":"mozfun/hist/#percentiles-udf","title":"percentiles (UDF)","text":"Given histogram and list of percentiles,calculate what those percentiles are for the histogram. If the histogram is empty, returns NULL.
"},{"location":"mozfun/hist/#parameters_7","title":"Parameters","text":"INPUTS
histogram ANY TYPE, percentiles ARRAY<FLOAT64>\n
OUTPUTS
ARRAY<STRUCT<percentile FLOAT64, value INT64>>\n
Source | Edit
"},{"location":"mozfun/hist/#string_to_json-udf","title":"string_to_json (UDF)","text":"Convert a histogram string (in JSON or compact format) to a full histogram JSON blob.
"},{"location":"mozfun/hist/#parameters_8","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/hist/#threshold_count-udf","title":"threshold_count (UDF)","text":"Return the number of recorded observations greater than threshold for the histogram. CAUTION: Does not count any buckets that have any values less than the threshold. For example, a bucket with range (1, 10) will not be counted for a threshold of 2. Use threshold that are not bucket boundaries with caution.
https://github.com/mozilla/telemetry-batch-view/blob/ea0733c/src/main/scala/com/mozilla/telemetry/utils/MainPing.scala#L213-L239
"},{"location":"mozfun/hist/#parameters_9","title":"Parameters","text":"INPUTS
histogram STRING, threshold INT64\n
Source | Edit
"},{"location":"mozfun/iap/","title":"iap","text":""},{"location":"mozfun/iap/#derive_apple_subscription_interval-udf","title":"derive_apple_subscription_interval (UDF)","text":"Take output purchase_date and expires_date from mozfun.iap.parse_apple_receipt and return the subscription interval to use for accounting. Values must be DATETIME in America/Los_Angeles to get correct results because of how timezone and daylight savings impact the time of day and the length of a month.
"},{"location":"mozfun/iap/#parameters","title":"Parameters","text":"INPUTS
start DATETIME, `end` DATETIME\n
OUTPUTS
STRUCT<`interval` STRING, interval_count INT64>\n
Source | Edit
"},{"location":"mozfun/iap/#parse_android_receipt-udf","title":"parse_android_receipt (UDF)","text":"Used to parse data
field from firestore export of fxa dataset iap_google_raw. The content is documented at https://developer.android.com/google/play/billing/subscriptions and https://developers.google.com/android-publisher/api-ref/rest/v3/purchases.subscriptions
INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/iap/#parse_apple_event-udf","title":"parse_apple_event (UDF)","text":"Used to parse data
field from firestore export of fxa dataset iap_app_store_purchases_raw. The content is documented at https://developer.apple.com/documentation/appstoreservernotifications/responsebodyv2decodedpayload and https://github.com/mozilla/fxa/blob/700ed771860da450add97d62f7e6faf2ead0c6ba/packages/fxa-shared/payments/iap/apple-app-store/subscription-purchase.ts#L115-L171
INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/iap/#parse_apple_receipt-udf","title":"parse_apple_receipt (UDF)","text":"Used to parse provider_receipt_json in mozilla vpn subscriptions where provider is \"APPLE\". The content is documented at https://developer.apple.com/documentation/appstorereceipts/responsebody
"},{"location":"mozfun/iap/#parameters_3","title":"Parameters","text":"INPUTS
provider_receipt_json STRING\n
OUTPUTS
STRUCT<environment STRING, latest_receipt BYTES, latest_receipt_info ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, pending_renewal_info ARRAY<STRUCT<auto_renew_product_id STRING, auto_renew_status INT64, expiration_intent INT64, is_in_billing_retry_period INT64, original_transaction_id STRING, product_id STRING>>, receipt STRUCT<adam_id INT64, app_item_id INT64, application_version STRING, bundle_id STRING, download_id INT64, in_app ARRAY<STRUCT<cancellation_date STRING, cancellation_date_ms INT64, cancellation_date_pst STRING, cancellation_reason STRING, expires_date STRING, expires_date_ms INT64, expires_date_pst STRING, in_app_ownership_type STRING, is_in_intro_offer_period STRING, is_trial_period STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, original_transaction_id STRING, product_id STRING, promotional_offer_id STRING, purchase_date STRING, purchase_date_ms INT64, purchase_date_pst STRING, quantity INT64, subscription_group_identifier INT64, transaction_id INT64, web_order_line_item_id INT64>>, original_application_version STRING, original_purchase_date STRING, original_purchase_date_ms INT64, original_purchase_date_pst STRING, receipt_creation_date STRING, receipt_creation_date_ms INT64, receipt_creation_date_pst STRING, receipt_type STRING, request_date STRING, request_date_ms INT64, request_date_pst STRING, version_external_identifier INT64>, status INT64>DETERMINISTIC\n
Source | Edit
"},{"location":"mozfun/iap/#scrub_apple_receipt-udf","title":"scrub_apple_receipt (UDF)","text":"Take output from mozfun.iap.parse_apple_receipt and remove fields or reduce their granularity so that the returned value can be exposed to all employees via redash.
"},{"location":"mozfun/iap/#parameters_4","title":"Parameters","text":"INPUTS
apple_receipt ANY TYPE\n
OUTPUTS
STRUCT<environment STRING, active_period STRUCT<start_date DATE, end_date DATE, start_time TIMESTAMP, end_time TIMESTAMP, `interval` STRING, interval_count INT64>, trial_period STRUCT<start_time TIMESTAMP, end_time TIMESTAMP>>\n
Source | Edit
"},{"location":"mozfun/json/","title":"json","text":"Functions for parsing Mozilla-specific JSON data types.
"},{"location":"mozfun/json/#extract_int_map-udf","title":"extract_int_map (UDF)","text":"Returns an array of key/value structs from a string representing a JSON map. Both keys and values are cast to integers.
This is the format for the \"values\" field in the desktop telemetry histogram JSON representation.
"},{"location":"mozfun/json/#parameters","title":"Parameters","text":"INPUTS
input STRING\n
Source | Edit
"},{"location":"mozfun/json/#from_map-udf","title":"from_map (UDF)","text":"Converts a standard \"map\" like datastructure array<struct<key, value>>
into a JSON value.
Convert the standard Array<Struct<key, value>>
style maps to JSON
values.
INPUTS
input JSON\n
OUTPUTS
json\n
Source | Edit
"},{"location":"mozfun/json/#from_nested_map-udf","title":"from_nested_map (UDF)","text":"Converts a nested JSON object with repeated key/value pairs into a nested JSON object.
Convert a JSON object like { \"metric\": [ {\"key\": \"extra\", \"value\": 2 } ] }
to a JSON
object like { \"metric\": { \"key\": 2 } }
.
This only works on JSON types.
"},{"location":"mozfun/json/#parameters_2","title":"Parameters","text":"OUTPUTS
json\n
Source | Edit
"},{"location":"mozfun/json/#js_extract_string_map-udf","title":"js_extract_string_map (UDF)","text":"Returns an array of key/value structs from a string representing a JSON map.
BigQuery Standard SQL JSON functions are insufficient to implement this function, so JS is being used and it may not perform well with large or numerous inputs.
Non-string non-null values are encoded as json.
"},{"location":"mozfun/json/#parameters_3","title":"Parameters","text":"INPUTS
input STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/json/#mode_last-udf","title":"mode_last (UDF)","text":"Returns the most frequently occuring element in an array of json-compatible elements. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored.
"},{"location":"mozfun/json/#parameters_4","title":"Parameters","text":"INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/ltv/","title":"Ltv","text":""},{"location":"mozfun/ltv/#android_states_v1-udf","title":"android_states_v1 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"
"},{"location":"mozfun/ltv/#parameters","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_v2-udf","title":"android_states_v2 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_2_1\" and \"0_dow1_1_1\"
"},{"location":"mozfun/ltv/#parameters_1","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_with_paid_v1-udf","title":"android_states_with_paid_v1 (UDF)","text":"LTV states for Android. Results in strings like: \"1_dow3_organic_2_1\" and \"0_dow1_paid_1_1\"
These states include whether a client was paid or organic.
"},{"location":"mozfun/ltv/#parameters_2","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#android_states_with_paid_v2-udf","title":"android_states_with_paid_v2 (UDF)","text":"Get the state of a user on a day, with paid/organic cohorts included. Compared to V1, these states have a \"dead\" state, determined by \"dead_time\". The model can use this state as a sink, where the client will never return if they are dead.
"},{"location":"mozfun/ltv/#parameters_3","title":"Parameters","text":"INPUTS
adjust_network STRING, days_since_first_seen INT64, days_since_seen INT64, death_time INT64, submission_date DATE, first_seen_date DATE, pattern INT64, active INT64, max_weeks INT64, country STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#desktop_states_v1-udf","title":"desktop_states_v1 (UDF)","text":"LTV states for Desktop. Results in strings like: \"0_1_1_1_1\" Where each component is 1. the age in days of the client 2. the day of week of first_seen_date 3. the day of week of submission_date 4. the activity level, possible values are 0-3, plus \"00\" for \"dead\" 5. whether the client is active on submission_date
"},{"location":"mozfun/ltv/#parameters_4","title":"Parameters","text":"INPUTS
days_since_first_seen INT64, days_since_active INT64, submission_date DATE, first_seen_date DATE, death_time INT64, pattern INT64, active INT64, max_days INT64, lookback INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/ltv/#get_state_ios_v2-udf","title":"get_state_ios_v2 (UDF)","text":"LTV states for iOS.
"},{"location":"mozfun/ltv/#parameters_5","title":"Parameters","text":"INPUTS
days_since_first_seen INT64, days_since_seen INT64, submission_date DATE, death_time INT64, pattern INT64, active INT64, max_weeks INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/","title":"map","text":"Functions for working with arrays of key/value structs.
"},{"location":"mozfun/map/#extract_keyed_scalar_sum-udf","title":"extract_keyed_scalar_sum (UDF)","text":"Sums all values in a keyed scalar.
"},{"location":"mozfun/map/#extract-keyed-scalar-sum","title":"Extract Keyed Scalar Sum","text":"Takes a keyed scalar and returns a single number: the sum of all values it contains. The expected input type is ARRAY<STRUCT<key STRING, value INT64>>
The return type is INT64
.
The key
field will be ignored.
INPUTS
keyed_scalar ARRAY<STRUCT<key STRING, value INT64>>\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/map/#from_lists-udf","title":"from_lists (UDF)","text":"Create a map from two arrays (like zipping)
"},{"location":"mozfun/map/#parameters_1","title":"Parameters","text":"INPUTS
keys ANY TYPE, `values` ANY TYPE\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/map/#get_key-udf","title":"get_key (UDF)","text":"Fetch the value associated with a given key from an array of key/value structs.
Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields.
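For example, fetching a value by key from a map-style array:
SELECT\n  mozfun.map.get_key([STRUCT('foo' AS key, 42 AS value)], 'foo') AS foo_value\n-- 42\n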
"},{"location":"mozfun/map/#parameters_2","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
Source | Edit
"},{"location":"mozfun/map/#get_key_with_null-udf","title":"get_key_with_null (UDF)","text":"Fetch the value associated with a given key from an array of key/value structs.
Because map types aren't available in BigQuery, we model maps as arrays of structs instead, and this function provides map-like access to such fields. This version matches NULL keys as well.
"},{"location":"mozfun/map/#parameters_3","title":"Parameters","text":"INPUTS
map ANY TYPE, k ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/#mode_last-udf","title":"mode_last (UDF)","text":"Combine entries from multiple maps, determine the value for each key using mozfun.stats.mode_last.
"},{"location":"mozfun/map/#parameters_4","title":"Parameters","text":"INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"mozfun/map/#set_key-udf","title":"set_key (UDF)","text":"Set a key to a value in a map. If you call map.get_key after setting, the value you set will be returned.
map.set_key
Set a key to a specific value in a map. We represent maps as Arrays of Key/Value structs: ARRAY<STRUCT<key ANY TYPE, value ANY TYPE>>
.
The type of the key and value you are setting must match the types in the map itself.
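A small sketch illustrating the set-then-get behaviour described above:
SELECT\n  mozfun.map.get_key(\n    mozfun.map.set_key([STRUCT('a' AS key, 1 AS value)], 'b', 2),\n    'b'\n  ) AS b_value\n-- 2\n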
"},{"location":"mozfun/map/#parameters_5","title":"Parameters","text":"INPUTS
map ANY TYPE, new_key ANY TYPE, new_value ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/map/#sum-udf","title":"sum (UDF)","text":"Return the sum of values by key in an array of map entries. The expected schema for entries is ARRAY>, where the type for value must be supported by SUM, which allows numeric data types INT64, NUMERIC, and FLOAT64."},{"location":"mozfun/map/#parameters_6","title":"Parameters","text":"
INPUTS
entries ANY TYPE\n
Source | Edit
"},{"location":"mozfun/marketing/","title":"Marketing","text":""},{"location":"mozfun/marketing/#parse_ad_group_name-udf","title":"parse_ad_group_name (UDF)","text":"Please provide a description for the routine
"},{"location":"mozfun/marketing/#parse-ad-group-name-udf","title":"Parse Ad Group Name UDF","text":"This function takes a ad group name and parses out known segments. These segments are things like country, language, or audience; multiple ad groups can share segments.
We use versioned ad group names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the ad group name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
See also: marketing.parse_campaign_name
, which does the same, but for campaign names.
INPUTS
ad_group_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/marketing/#parse_campaign_name-udf","title":"parse_campaign_name (UDF)","text":"Parse a campaign name. Extracts things like region, country_code, and language.
"},{"location":"mozfun/marketing/#parse-campaign-name-udf","title":"Parse Campaign Name UDF","text":"This function takes a campaign name and parses out known segments. These segments are things like country, language, or audience; multiple campaigns can share segments.
We use versioned campaign names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the campaign name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
"},{"location":"mozfun/marketing/#parameters_1","title":"Parameters","text":"INPUTS
campaign_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/marketing/#parse_creative_name-udf","title":"parse_creative_name (UDF)","text":"Parse segments from a creative name.
"},{"location":"mozfun/marketing/#parse-creative-name-udf","title":"Parse Creative Name UDF","text":"This function takes a creative name and parses out known segments. These segments are things like country, language, or audience; multiple creatives can share segments.
We use versioned creative names to define segments, where the ad network (e.g. gads) and the version (e.g. v1, v2) correspond to certain available segments in the creative name. We track the versions in this spreadsheet.
For a history of this naming scheme, see the original proposal.
See also: marketing.parse_campaign_name
, which does the same, but for campaign names.
INPUTS
creative_name STRING\n
OUTPUTS
ARRAY<STRUCT<key STRING, value STRING>>\n
Source | Edit
"},{"location":"mozfun/mobile_search/","title":"Mobile search","text":""},{"location":"mozfun/mobile_search/#normalize_app_name-udf","title":"normalize_app_name (UDF)","text":"Returns normalized_app_name and normalized_app_name_os (for mobile search tables only).
"},{"location":"mozfun/mobile_search/#normalized-app-and-os-name-for-mobile-search-related-tables","title":"Normalized app and os name for mobile search related tables","text":"Takes app name and os as input : Returns a struct of normalized_app_name and normalized_app_name_os based on discussion provided here
"},{"location":"mozfun/mobile_search/#parameters","title":"Parameters","text":"INPUTS
app_name STRING, os STRING\n
OUTPUTS
STRUCT<normalized_app_name STRING, normalized_app_name_os STRING>\n
Source | Edit
"},{"location":"mozfun/norm/","title":"norm","text":"Functions for normalizing data.
"},{"location":"mozfun/norm/#browser_version_info-udf","title":"browser_version_info (UDF)","text":"Adds metadata related to the browser version in a struct.
This is a temporary solution that allows browser version analysis. It should eventually be replaced with one or more browser version tables that serves as a source of truth for version releases.
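For example, extracting the major version from a version string (field names taken from the output struct below):
SELECT\n  mozfun.norm.browser_version_info('100.0.1').major_version AS major_version\n-- 100\n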
"},{"location":"mozfun/norm/#parameters","title":"Parameters","text":"INPUTS
version_string STRING\n
OUTPUTS
STRUCT<version STRING, major_version NUMERIC, minor_version NUMERIC, patch_revision NUMERIC, is_major_release BOOLEAN>\n
Source | Edit
"},{"location":"mozfun/norm/#diff_months-udf","title":"diff_months (UDF)","text":"Determine the number of whole months after grace period between start and end. Month is dependent on timezone, so start and end must both be datetimes, or both be dates, in the correct timezone. Grace period can be used to account for billing delay, usually 1 day, and is counted after months. When inclusive is FALSE, start and end are not included in whole months. For example, diff_months(start => '2021-01-01', end => '2021-03-01', grace_period => INTERVAL 0 day, inclusive => FALSE) returns 1, because start plus two months plus grace period is not less than end. Changing inclusive to TRUE returns 2, because start plus two months plus grace period is less than or equal to end. diff_months(start => '2021-01-01', end => '2021-03-02 00:00:00.000001', grace_period => INTERVAL 1 DAY, inclusive => FALSE) returns 2, because start plus two months plus grace period is less than end.
"},{"location":"mozfun/norm/#parameters_1","title":"Parameters","text":"INPUTS
start DATETIME, `end` DATETIME, grace_period INTERVAL, inclusive BOOLEAN\n
Source | Edit
"},{"location":"mozfun/norm/#extract_version-udf","title":"extract_version (UDF)","text":"Extracts numeric version data from a version string like <major>.<minor>.<patch>
.
Note: Non-zero minor and patch versions will be floating point Numeric
.
Usage:
SELECT\n mozfun.norm.extract_version(version_string, 'major') as major_version,\n mozfun.norm.extract_version(version_string, 'minor') as minor_version,\n mozfun.norm.extract_version(version_string, 'patch') as patch_version\n
Example using \"96.05.01\"
:
SELECT\n mozfun.norm.extract_version('96.05.01', 'major') as major_version, -- 96\n mozfun.norm.extract_version('96.05.01', 'minor') as minor_version, -- 5\n mozfun.norm.extract_version('96.05.01', 'patch') as patch_version -- 1\n
"},{"location":"mozfun/norm/#parameters_2","title":"Parameters","text":"INPUTS
version_string STRING, extraction_level STRING\n
OUTPUTS
NUMERIC\n
Source | Edit
"},{"location":"mozfun/norm/#fenix_app_info-udf","title":"fenix_app_info (UDF)","text":"Returns canonical, human-understandable identification info for Fenix sources.
The Glean telemetry library for Android by design routes pings based on the Play Store appId value of the published application. As of August 2020, there have been 5 separate Play Store appId values associated with different builds of Fenix, each corresponding to different datasets in BigQuery, and the mapping of appId to logical app names (Firefox vs. Firefox Preview) and channel names (nightly, beta, or release) has changed over time; see the spreadsheet of naming history for Mozilla's mobile browsers.
This function is intended as the source of truth for how to map a specific ping in BigQuery to a logical app names and channel. It should be expected that the output of this function may evolve over time. If we rename a product or channel, we may choose to update the values here so that analyses consistently get the new name.
The first argument (app_id
) can be fairly fuzzy; it is tolerant of actual Google Play Store appId values like 'org.mozilla.firefox_beta' (mix of periods and underscores) as well as BigQuery dataset names with suffixes like 'org_mozilla_firefox_beta_stable'.
The second argument (app_build_id
) should be the value in client_info.app_build.
The function returns a STRUCT
that contains the logical app_name
and channel
as well as the Play Store app_id
in the canonical form which would appear in Play Store URLs.
Note that the naming of Fenix applications changed on 2020-07-03, so to get a continuous view of the pings associated with a logical app channel, you may need to union together tables from multiple BigQuery datasets. To see data for all Fenix channels together, it is necessary to union together tables from all 5 datasets. For basic usage information, consider using telemetry.fenix_clients_last_seen
which already handles the union. Otherwise, see the example below as a template for how to construct a custom union.
Mapping of channels to datasets:
org_mozilla_firefox
org_mozilla_firefox_beta
(current) and org_mozilla_fenix
org_mozilla_fenix
(current), org_mozilla_fennec_aurora
, and org_mozilla_fenix_nightly
-- Example of a query over all Fenix builds advertised as \"Firefox Beta\"\nCREATE TEMP FUNCTION extract_fields(app_id STRING, m ANY TYPE) AS (\n (\n SELECT AS STRUCT\n m.submission_timestamp,\n m.metrics.string.geckoview_version,\n mozfun.norm.fenix_app_info(app_id, m.client_info.app_build).*\n )\n);\n\nWITH base AS (\n SELECT\n extract_fields('org_mozilla_firefox_beta', m).*\n FROM\n `mozdata.org_mozilla_firefox_beta.metrics` AS m\n UNION ALL\n SELECT\n extract_fields('org_mozilla_fenix', m).*\n FROM\n `mozdata.org_mozilla_fenix.metrics` AS m\n)\nSELECT\n DATE(submission_timestamp) AS submission_date,\n geckoview_version,\n COUNT(*)\nFROM\n base\nWHERE\n app_name = 'Fenix' -- excludes 'Firefox Preview'\n AND channel = 'beta'\n AND DATE(submission_timestamp) = '2020-08-01'\nGROUP BY\n submission_date,\n geckoview_version\n
"},{"location":"mozfun/norm/#parameters_3","title":"Parameters","text":"INPUTS
app_id STRING, app_build_id STRING\n
OUTPUTS
STRUCT<app_name STRING, channel STRING, app_id STRING>\n
Source | Edit
"},{"location":"mozfun/norm/#fenix_build_to_datetime-udf","title":"fenix_build_to_datetime (UDF)","text":"Convert the Fenix client_info.app_build-format string to a DATETIME. May return NULL on failure.
Fenix originally used an 8-digit app_build format
In short it is yDDDHHmm
:
The last date seen with an 8-digit build ID is 2020-08-10.
Newer builds use a 10-digit format where the integer represents a pattern consisting of 32 bits. The 17 bits starting 13 bits from the left represent a number of hours since UTC midnight beginning 2014-12-28.
This function tolerates both formats.
After using this you may wish to DATETIME_TRUNC(result, DAY)
for grouping by build date.
INPUTS
app_build STRING\n
OUTPUTS
INT64\n
Source | Edit
"},{"location":"mozfun/norm/#firefox_android_package_name_to_channel-udf","title":"firefox_android_package_name_to_channel (UDF)","text":"Map Fenix package name to the channel name
"},{"location":"mozfun/norm/#parameters_5","title":"Parameters","text":"INPUTS
package_name STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/norm/#get_earliest_value-udf","title":"get_earliest_value (UDF)","text":"This UDF returns the earliest not-null value pair and datetime from a list of values and their corresponding timestamp.
The function will return the first value pair in the input array that is not null and has the earliest timestamp.
Because there may be more than one value on the same date (e.g. more than one value reported by different pings on the same date), the dates must be given as DATETIME values and the values as STRING.
Usage:
SELECT\n mozfun.norm.get_earliest_value(ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>) AS <alias>\n
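A worked sketch with inline literals (the values, sources and dates are made up for illustration); per the description, the earliest non-null entry should win:
SELECT\n  mozfun.norm.get_earliest_value(\n    [\n      STRUCT('B' AS value, 'crash' AS value_source, DATETIME '2021-01-02' AS value_date),\n      STRUCT('A' AS value, 'main' AS value_source, DATETIME '2021-01-01' AS value_date),\n      STRUCT(CAST(NULL AS STRING) AS value, 'other' AS value_source, DATETIME '2020-12-31' AS value_date)\n    ]\n  ) AS earliest\n-- expected per the description: earliest_value = 'A', earliest_value_source = 'main', earliest_date = 2021-01-01\n-- (the null entry is skipped even though it has the earliest date)\n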
"},{"location":"mozfun/norm/#parameters_6","title":"Parameters","text":"INPUTS
value_set ARRAY<STRUCT<value STRING, value_source STRING, value_date DATETIME>>\n
OUTPUTS
STRUCT<earliest_value STRING, earliest_value_source STRING, earliest_date DATETIME>\n
Source | Edit
"},{"location":"mozfun/norm/#get_windows_info-udf","title":"get_windows_info (UDF)","text":"Exract the name, the version name, the version number, and the build number corresponding to a Microsoft Windows operating system version string in the form of .. or ... for most release versions of Windows after 2007."},{"location":"mozfun/norm/#windows-names-versions-and-builds","title":"Windows Names, Versions, and Builds","text":""},{"location":"mozfun/norm/#summary","title":"Summary","text":"
This function is primarily designed to parse the field os_version
in table mozdata.default_browser_agent.default_browser
. Given a Microsoft Windows OS version string, the function returns the name of the operating system, the version name, the version number, and the build number corresponding to the operating system. As of November 2022, the parser can handle 99.89% of the os_version
values collected in table mozdata.default_browser_agent.default_browser
.
As of November 2022, the expected valid values of os_version
are either x.y.z
or w.x.y.z
where w
, x
, y
, and z
are integers.
As of November 2022, the return values for Windows 10 and Windows 11 are based on Windows 10 release information and Windows 11 release information. For 3-number version strings, the parser assumes the valid values of z
in x.y.z
are at most 5 digits in length. For 4-number version strings, the parser assumes the valid values of z
in w.x.y.z
are at most 6 digits in length. The function makes an educated effort to handle Windows Vista, Windows 7, Windows 8, and Windows 8.1 information, but does not guarantee the return values are absolutely accurate. The function assumes the presence of undocumented non-release versions of Windows 10 and Windows 11, and will return an estimated name, version number, build number but not the version name. The function does not handle other versions of Windows.
As of November 2022, the parser currently handles just over 99.89% of data in the field os_version
in table mozdata.default_browser_agent.default_browser
.
Note: Microsoft's convention for build numbers on Windows 10 and 11 includes two numbers, such as build number 22621.900
for version 22621
. The first number repeats the version number and the second number uniquely identifies the build within the version. To simplify data processing and data analysis, this function returns the second unique identifier as an integer instead of returning the full build number as a string.
SELECT\n `os_version`,\n mozfun.norm.get_windows_info(`os_version`) AS windows_info\nFROM `mozdata.default_browser_agent.default_browser`\nWHERE `submission_timestamp` > (CURRENT_TIMESTAMP() - INTERVAL 7 DAY) AND LEFT(document_id, 2) = '00'\nLIMIT 1000\n
"},{"location":"mozfun/norm/#mapping","title":"Mapping","text":"os_version windows_name windows_version_name windows_version_number windows_build_number 6.0.z Windows Vista 6.0 6.0 z 6.1.z Windows 7 7.0 6.1 z 6.2.z Windows 8 8.0 6.2 z 6.3.z Windows 8.1 8.1 6.3 z 10.0.10240.z Windows 10 1507 10240 z 10.0.10586.z Windows 10 1511 10586 z 10.0.14393.z Windows 10 1607 14393 z 10.0.15063.z Windows 10 1703 15063 z 10.0.16299.z Windows 10 1709 16299 z 10.0.17134.z Windows 10 1803 17134 z 10.0.17763.z Windows 10 1809 17763 z 10.0.18362.z Windows 10 1903 18362 z 10.0.18363.z Windows 10 1909 18363 z 10.0.19041.z Windows 10 2004 19041 z 10.0.19042.z Windows 10 20H2 19042 z 10.0.19043.z Windows 10 21H1 19043 z 10.0.19044.z Windows 10 21H2 19044 z 10.0.19045.z Windows 10 22H2 19045 z 10.0.y.z Windows 10 UNKNOWN y z 10.0.22000.z Windows 11 21H2 22000 z 10.0.22621.z Windows 11 22H2 22621 z 10.0.y.z Windows 11 UNKNOWN y z all other values (null) (null) (null) (null)"},{"location":"mozfun/norm/#parameters_7","title":"Parameters","text":"INPUTS
os_version STRING\n
OUTPUTS
STRUCT<name STRING, version_name STRING, version_number DECIMAL, build_number INT64>\n
Source | Edit
"},{"location":"mozfun/norm/#glean_baseline_client_info-udf","title":"glean_baseline_client_info (UDF)","text":"Accepts a glean client_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
"},{"location":"mozfun/norm/#parameters_8","title":"Parameters","text":"INPUTS
client_info ANY TYPE, metrics ANY TYPE\n
OUTPUTS
string\n
Source | Edit
"},{"location":"mozfun/norm/#glean_ping_info-udf","title":"glean_ping_info (UDF)","text":"Accepts a glean ping_info struct as input and returns a modified struct that includes a few parsed or normalized variants of the input fields.
"},{"location":"mozfun/norm/#parameters_9","title":"Parameters","text":"INPUTS
ping_info ANY TYPE\n
Source | Edit
"},{"location":"mozfun/norm/#metadata-udf","title":"metadata (UDF)","text":"Accepts a pipeline metadata struct as input and returns a modified struct that includes a few parsed or normalized variants of the input metadata fields.
"},{"location":"mozfun/norm/#parameters_10","title":"Parameters","text":"INPUTS
metadata ANY TYPE\n
OUTPUTS
`date`, CAST(NULL\n
Source | Edit
"},{"location":"mozfun/norm/#os-udf","title":"os (UDF)","text":"Normalize an operating system string to one of the three major desktop platforms, one of the two major mobile platforms, or \"Other\".
This is a reimplementation of logic used in the data pipeline to populate normalized_os
.
INPUTS
os STRING\n
Source | Edit
"},{"location":"mozfun/norm/#product_info-udf","title":"product_info (UDF)","text":"Returns a normalized app_name
and canonical_app_name
for a product based on legacy_app_name
and normalized_os
values. Thus, this function serves as a bridge to get from legacy application identifiers to the consistent identifiers we are using for reporting in 2021.
As of 2021, most Mozilla products are sending telemetry via the Glean SDK, with Glean telemetry in active development for desktop Firefox as well. The probeinfo
API is the single source of truth for metadata about applications sending Glean telemetry; the values for app_name
and canonical_app_name
returned here correspond to the \"end-to-end identifier\" values documented in the v2 Glean app listings endpoint . For non-Glean telemetry, we provide values in the same style to provide continuity as we continue the migration to Glean.
For legacy telemetry pings like main
ping for desktop and core
ping for mobile products, the legacy_app_name
given as input to this function should come from the submission URI (stored as metadata.uri.app_name
in BigQuery ping tables). For Glean pings, we have invented product
values that can be passed in to this function as the legacy_app_name
parameter.
The returned app_name
values are intended to be readable and unambiguous, but short and easy to type. They are suitable for use as a key in derived tables. product
is a deprecated field that was similar in intent.
The returned canonical_app_name
is more verbose and is suited for displaying in visualizations. canonical_name
is a synonym that we provide for historical compatibility with previous versions of this function.
The returned struct also contains boolean contributes_to_2021_kpi
as the canonical reference for whether the given application is included in KPI reporting. Additional fields may be added for future years.
The normalized_os
value that's passed in should be the top-level normalized_os
value present in any ping table or you may want to wrap a raw value in mozfun.norm.os
like mozfun.norm.product_info(app_name, mozfun.norm.os(os))
.
This function also tolerates passing in a product
value as legacy_app_name
so that this function is still useful for derived tables which have thrown away the raw app_name
value from legacy pings.
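A minimal usage sketch; the expected output in the comments simply follows the mapping table below and is shown for illustration:
SELECT\n  mozfun.norm.product_info('Fenix', 'Android').*\n-- per the mapping below this is expected to yield:\n--   app_name = 'fenix', product = 'Fenix', canonical_app_name = 'Firefox for Android (Fenix)'\n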
The mappings are as follows:
legacy_app_name | normalized_os | app_name | product | canonical_app_name | 2019 | 2020 | 2021
Firefox | * | firefox_desktop | Firefox | Firefox for Desktop | true | true | true
Fenix | Android | fenix | Fenix | Firefox for Android (Fenix) | true | true | true
Fennec | Android | fennec | Fennec | Firefox for Android (Fennec) | true | true | true
Firefox Preview | Android | firefox_preview | Firefox Preview | Firefox Preview for Android | true | true | true
Fennec | iOS | firefox_ios | Firefox iOS | Firefox for iOS | true | true | true
FirefoxForFireTV | Android | firefox_fire_tv | Firefox Fire TV | Firefox for Fire TV | false | false | false
FirefoxConnect | Android | firefox_connect | Firefox Echo | Firefox for Echo Show | true | true | false
Zerda | Android | firefox_lite | Firefox Lite | Firefox Lite | true | true | false
Zerda_cn | Android | firefox_lite_cn | Firefox Lite CN | Firefox Lite (China) | false | false | false
Focus | Android | focus_android | Focus Android | Firefox Focus for Android | true | true | true
Focus | iOS | focus_ios | Focus iOS | Firefox Focus for iOS | true | true | true
Klar | Android | klar_android | Klar Android | Firefox Klar for Android | false | false | false
Klar | iOS | klar_ios | Klar iOS | Firefox Klar for iOS | false | false | false
Lockbox | Android | lockwise_android | Lockwise Android | Lockwise for Android | true | true | false
Lockbox | iOS | lockwise_ios | Lockwise iOS | Lockwise for iOS | true | true | false
FirefoxReality* | Android | firefox_reality | Firefox Reality | Firefox Reality | false | false | false"},{"location":"mozfun/norm/#parameters_12","title":"Parameters","text":"INPUTS
legacy_app_name STRING, normalized_os STRING\n
OUTPUTS
STRUCT<app_name STRING, product STRING, canonical_app_name STRING, canonical_name STRING, contributes_to_2019_kpi BOOLEAN, contributes_to_2020_kpi BOOLEAN, contributes_to_2021_kpi BOOLEAN>\n
Source | Edit
"},{"location":"mozfun/norm/#result_type_to_product_name-udf","title":"result_type_to_product_name (UDF)","text":"Convert urlbar result types into product-friendly names
This UDF converts result types from urlbar events (engagement, impression, abandonment) into product-friendly names.
"},{"location":"mozfun/norm/#parameters_13","title":"Parameters","text":"INPUTS
res STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/norm/#truncate_version-udf","title":"truncate_version (UDF)","text":"Truncates a version string like <major>.<minor>.<patch>
to either the major or minor version. The return value is NUMERIC
, which means that you can sort the results without fear (e.g. 100 will be categorized as greater than 80, which isn't the case when sorting lexicographically).
For example, \"5.1.0\" would be translated to 5.1
if the parameter is \"minor\" or 5
if the parameter is \"major\".
If the version is only a major and/or minor version, then it will be left unchanged (for example \"10\" would stay as 10
when run through this function, no matter what the arguments).
This is useful for grouping Linux and Mac operating system versions inside aggregate datasets or queries where there may be many different patch releases in the field.
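A minimal sketch of the behaviour described above (the commented results follow directly from the description):
SELECT\n  mozfun.norm.truncate_version('5.1.0', 'minor') AS minor_version,  -- 5.1\n  mozfun.norm.truncate_version('5.1.0', 'major') AS major_version,  -- 5\n  mozfun.norm.truncate_version('10', 'minor') AS unchanged          -- 10\n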
"},{"location":"mozfun/norm/#parameters_14","title":"Parameters","text":"INPUTS
os_version STRING, truncation_level STRING\n
OUTPUTS
NUMERIC\n
Source | Edit
"},{"location":"mozfun/norm/#vpn_attribution-udf","title":"vpn_attribution (UDF)","text":"Accepts vpn attribution fields as input and returns a struct of normalized fields.
"},{"location":"mozfun/norm/#parameters_15","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRUCT<normalized_acquisition_channel STRING, normalized_campaign STRING, normalized_content STRING, normalized_medium STRING, normalized_source STRING, website_channel_group STRING>\n
Source | Edit
"},{"location":"mozfun/norm/#windows_version_info-udf","title":"windows_version_info (UDF)","text":"Given an unnormalized set off Windows identifiers, return a friendly version of the operating system name.
Requires os, os_version and windows_build_number.
E.g. a windows_build_number >= 22000 returns Windows 11.
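A minimal sketch; the os and os_version argument values are assumptions chosen only for illustration:
SELECT\n  mozfun.norm.windows_version_info('Windows_NT', '10.0', 22621) AS windows_version\n-- per the note above, a build number >= 22000 is expected to map to 'Windows 11'\n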
"},{"location":"mozfun/norm/#parameters_16","title":"Parameters","text":"INPUTS
os STRING, os_version STRING, windows_build_number INT64\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/serp_events/","title":"serp_events","text":"Functions for working with Glean SERP events.
"},{"location":"mozfun/serp_events/#ad_blocker_inferred-udf","title":"ad_blocker_inferred (UDF)","text":"Determine whether an ad blocker is inferred to be in use on a SERP. True if all loaded ads are blocked.
"},{"location":"mozfun/serp_events/#parameters","title":"Parameters","text":"INPUTS
num_loaded INT, num_blocked INT\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/serp_events/#is_ad_component-udf","title":"is_ad_component (UDF)","text":"Determine whether a SERP display component referenced in the serp events contains monetizable ads
"},{"location":"mozfun/serp_events/#parameters_1","title":"Parameters","text":"INPUTS
component STRING\n
OUTPUTS
BOOL\n
Source | Edit
"},{"location":"mozfun/stats/","title":"stats","text":"Statistics functions.
"},{"location":"mozfun/stats/#mode_last-udf","title":"mode_last (UDF)","text":"Returns the most frequently occuring element in an array.
In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are ignored. See also: stats.mode_last_retain_nulls
, which retains nulls.
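A small sketch of the tie-breaking behaviour described above (the commented result follows from the description):
SELECT\n  mozfun.stats.mode_last(['a', 'b', 'b', 'a']) AS mode\n-- both values occur twice; 'a' appears latest in the array, so the expected result is 'a'\n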
INPUTS
list ANY TYPE\n
Source | Edit
"},{"location":"mozfun/stats/#mode_last_retain_nulls-udf","title":"mode_last_retain_nulls (UDF)","text":"Returns the most frequently occuring element in an array. In the case of multiple values tied for the highest count, it returns the value that appears latest in the array. Nulls are retained. See also: `stats.mode_last, which ignores nulls.
"},{"location":"mozfun/stats/#parameters_1","title":"Parameters","text":"INPUTS
list ANY TYPE\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/utils/","title":"Utils","text":""},{"location":"mozfun/utils/#diff_query_schemas-stored-procedure","title":"diff_query_schemas (Stored Procedure)","text":"Diff the schemas of two queries. Especially useful when the BigQuery error is truncated, and the schemas of e.g. a UNION don't match.
Use it like:
DECLARE res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>;\nCALL mozfun.utils.diff_query_schemas(\"\"\"SELECT * FROM a\"\"\", \"\"\"SELECT * FROM b\"\"\", res);\n-- See entire schema entries, if you need context\nSELECT res;\n-- See just the elements that differ\nSELECT * FROM UNNEST(res) WHERE differs;\n
You'll be able to view the results of \"res\" to compare the schemas of the two queries, and hopefully find what doesn't match.
"},{"location":"mozfun/utils/#parameters","title":"Parameters","text":"INPUTS
query_a STRING, query_b STRING\n
OUTPUTS
res ARRAY<STRUCT<i INT64, differs BOOL, a_col STRING, a_data_type STRING, b_col STRING, b_data_type STRING>>\n
Source | Edit
"},{"location":"mozfun/utils/#extract_utm_from_url-udf","title":"extract_utm_from_url (UDF)","text":"Extract UTM parameters from URL. Returns a STRUCT UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
This UDF extracts UTM parameters from a URL string.
UTM (Urchin Tracking Module) parameters are URL parameters used by marketing to track the effectiveness of online marketing campaigns.
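A minimal sketch (the URL is made up; the expected values in the comments are assumptions based on the OUTPUTS definition below):
SELECT\n  mozfun.utils.extract_utm_from_url(\n    'https://www.mozilla.org/?utm_source=newsletter&utm_medium=email&utm_campaign=vpn-launch'\n  ).*\n-- expected to populate utm_source, utm_medium and utm_campaign from the URL;\n-- utm_content and utm_term are expected to be NULL since they are absent (assumed behaviour)\n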
"},{"location":"mozfun/utils/#parameters_1","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRUCT<utm_source STRING, utm_medium STRING, utm_campaign STRING, utm_content STRING, utm_term STRING>\n
Source | Edit
"},{"location":"mozfun/utils/#get_url_path-udf","title":"get_url_path (UDF)","text":"Extract the Path from a URL
This UDF extracts the path from a URL string.
The path is everything after the host and before parameters. This function returns \"/\" if there is no path.
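A minimal sketch based on the description above; exact trailing-slash handling is not specified, so treat the commented outputs as assumptions:
SELECT\n  mozfun.utils.get_url_path('https://www.mozilla.org/firefox/new/?utm_source=test') AS path,  -- expected '/firefox/new/'\n  mozfun.utils.get_url_path('https://www.mozilla.org') AS root_path                           -- expected '/' per the description\n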
"},{"location":"mozfun/utils/#parameters_2","title":"Parameters","text":"INPUTS
url STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/","title":"vpn","text":"Functions for processing VPN data.
"},{"location":"mozfun/vpn/#acquisition_channel-udf","title":"acquisition_channel (UDF)","text":"Assign an acquisition channel based on utm parameters
"},{"location":"mozfun/vpn/#parameters","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/#channel_group-udf","title":"channel_group (UDF)","text":"Assign a channel group based on utm parameters
"},{"location":"mozfun/vpn/#parameters_1","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"mozfun/vpn/#normalize_utm_parameters-udf","title":"normalize_utm_parameters (UDF)","text":"Normalize utm parameters to use the same NULL placeholders as Google Analytics
"},{"location":"mozfun/vpn/#parameters_2","title":"Parameters","text":"INPUTS
utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING\n
OUTPUTS
STRUCT<utm_campaign STRING, utm_content STRING, utm_medium STRING, utm_source STRING>\n
Source | Edit
"},{"location":"mozfun/vpn/#pricing_plan-udf","title":"pricing_plan (UDF)","text":"Combine the pricing and interval for a subscription plan into a single field
"},{"location":"mozfun/vpn/#parameters_3","title":"Parameters","text":"INPUTS
provider STRING, amount INTEGER, currency STRING, `interval` STRING, interval_count INTEGER\n
OUTPUTS
STRING\n
Source | Edit
"},{"location":"reference/airflow_tags/","title":"Airflow Tags","text":""},{"location":"reference/airflow_tags/#why","title":"Why","text":"Airflow tags enable DAGs to be filtered in the web ui view to reduce the number of DAGs shown to just those that you are interested in.
Additionally, they are intended to provide a little more information, such as a DAG's impact, to make it easier to understand the DAG and the impact of failures when doing Airflow triage.
More information and the discussions can be found in the original Airflow Tags Proposal (located within the data org proposals/
folder).
We borrow the tiering system used by our integration and testing sheriffs. This is to maintain a level of consistency across different systems to ensure common language and understanding across teams. Valid tier tags include:
This tag is meant to provide guidance to a triage engineer on how to respond to a specific DAG failure when the job owner does not want the standard process to be followed.
The behaviour of bqetl
can be configured via the bqetl_project.yaml
file. This file specifies, for example, the queries that should be skipped during dry run and the views that should not be published, and contains various other configuration options.
The general structure of bqetl_project.yaml
is as follows:
dry_run:\n function: https://us-central1-moz-fx-data-shared-prod.cloudfunctions.net/bigquery-etl-dryrun\n test_project: bigquery-etl-integration-test\n skip:\n - sql/moz-fx-data-shared-prod/account_ecosystem_derived/desktop_clients_daily_v1/query.sql\n - sql/**/apple_ads_external*/**/query.sql\n # - ...\n\nviews:\n skip_validation:\n - sql/moz-fx-data-test-project/test/simple_view/view.sql\n - sql/moz-fx-data-shared-prod/mlhackweek_search/events/view.sql\n - sql/moz-fx-data-shared-prod/**/client_deduplication/view.sql\n # - ...\n skip_publishing:\n - activity_stream/tile_id_types/view.sql\n - pocket/pocket_reach_mau/view.sql\n # - ...\n non_user_facing_suffixes:\n - _derived\n - _external\n # - ...\n\nschema:\n skip_update:\n - sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/schema.yaml\n # - ...\n skip_prefixes:\n - pioneer\n - rally\n\nroutines:\n skip_publishing:\n - sql/moz-fx-data-shared-prod/udf/main_summary_scalars/udf.sql\n\nformatting:\n skip:\n - bigquery_etl/glam/templates/*.sql\n - sql/moz-fx-data-shared-prod/telemetry/fenix_events_v1/view.sql\n - stored_procedures/safe_crc32_uuid.sql\n # - ...\n
"},{"location":"reference/configuration/#accessing-configurations","title":"Accessing configurations","text":"ConfigLoader
can be used in the bigquery_etl tooling codebase to access configuration parameters. bqetl_project.yaml
is automatically loaded in ConfigLoader
and parameters can be accessed via a get()
method:
from bigquery_etl.config import ConfigLoader\n\n# ConfigLoader exposes bqetl_project.yaml values via its get() method\nskipped_formatting = ConfigLoader.get(\"formatting\", \"skip\", fallback=[])\ndry_run_function = ConfigLoader.get(\"dry_run\", \"function\", fallback=None)\nschema_config_dict = ConfigLoader.get(\"schema\")\n
The ConfigLoader.get()
method allows multiple string parameters to reference a configuration value that is stored in a nested structure. A fallback
value can be optionally provided in case the configuration parameter is not set.
New configuration parameters can simply be added to bqetl_project.yaml
. ConfigLoader.get()
allows for these new parameters simply to be referenced without needing to be changed or updated.
Instructions on how to add data checks can be found in the Adding data checks section below.
"},{"location":"reference/data_checks/#background","title":"Background","text":"To create more confidence and trust in our data is crucial to provide some form of data checks. These checks should uncover problems as soon as possible, ideally as part of the data process creating the data. This includes checking that the data produced follows certain assumptions determined by the dataset owner. These assumptions need to be easy to define, but at the same time flexible enough to encode more complex business logic. For example, checks for null columns, for range/size properties, duplicates, table grain etc.
"},{"location":"reference/data_checks/#bqetl-data-checks-to-the-rescue","title":"bqetl Data Checks to the Rescue","text":"bqetl data checks aim to provide this ability by providing a simple interface for specifying our \"assumptions\" about the data the query should produce and checking them against the actual result.
This easy interface is achieved by providing a number of jinja templates providing \"out-of-the-box\" logic for performing a number of common checks without having to rewrite the logic. For example, checking if any nulls are present in a specific column. These templates can be found here and are available as jinja macros inside the checks.sql
files. This makes it possible to \"configure\" the logic by passing some details relevant to our specific dataset. Check templates will get rendered as raw SQL expressions. See the examples below for practical usage.
It is also possible to write checks using raw SQL by using assertions. This is, for example, useful when writing checks for custom business logic.
"},{"location":"reference/data_checks/#two-categories-of-checks","title":"Two categories of checks","text":"Each check needs to be categorised with a marker, currently following markers are available:
#fail
indicates that the ETL pipeline should stop if this check fails (circuit-breaker pattern) and a notification is sent out. This marker should be used for checks that indicate a serious data issue.#warn
indicates that the ETL pipeline should continue even if this check fails. These types of checks can be used to indicate potential issues that might require more manual investigation.
"},{"location":"reference/data_checks/#adding-data-checks","title":"Adding Data Checks","text":""},{"location":"reference/data_checks/#create-checkssql","title":"Create checks.sql","text":"Inside the query directory, which usually contains query.sql
or query.py
, metadata.yaml
and schema.yaml
, create a new file called checks.sql
(unless already exists).
Please make sure each check you add contains a marker (see: the Two categories of checks section above).
Once checks have been added, we need to regenerate the DAG
responsible for scheduling the query.
If checks.sql
already exists for the query, you can always add additional checks to the file by appending it to the list of already defined checks.
When adding additional checks there should be no need to have to regenerate the DAG responsible for scheduling the query as all checks are executed using a single Airflow task.
"},{"location":"reference/data_checks/#removing-checkssql","title":"Removing checks.sql","text":"All checks can be removed by deleting the checks.sql
file and regenerating the DAG responsible for scheduling the query.
Alternatively, specific checks can be removed by deleting them from the checks.sql
file.
Checks can either be written as raw SQL, or by referencing existing Jinja macros defined in tests/checks
which may take different parameters used to generate the SQL check expression.
Example of what a checks.sql
may look like:
-- raw SQL checks\n#fail\nASSERT (\n SELECT\n COUNTIF(ISNULL(country)) / COUNT(*)\n FROM telemetry.table_v1\n WHERE submission_date = @submission_date\n ) > 0.2\n) AS \"More than 20% of clients have country set to NULL\";\n\n-- macro checks\n#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n\n#warn\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n\n#fail\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n\n#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#data-checks-available-with-examples","title":"Data Checks Available with Examples","text":""},{"location":"reference/data_checks/#accepted_values-source","title":"accepted_values (source)","text":"Usage:
Arguments:\n\ncolumn: str - name of the column to check\nvalues: List[str] - list of accepted values\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ accepted_values(\"column_1\", [\"value_1\", \"value_2\"],\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#in_range-source","title":"in_range (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which we want to check the values of.\nmin: Optional[int] - Minimum value we should observe in the specified columns.\nmax: Optional[int] - Maximum value we should observe in the specified columns.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#is_unique-source","title":"is_unique (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which should produce a unique record.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#warn\n{{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n
"},{"location":"reference/data_checks/#min_row_countsource","title":"min_row_count(source)","text":"Usage:
Arguments:\n\nthreshold: Optional[int] - What is the minimum number of rows we expect (default: 1)\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#fail\n{{ min_row_count(1, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#not_null-source","title":"not_null (source)","text":"Usage:
Arguments:\n\ncolumns: List[str] - A list of columns which should not contain a null value.\nwhere: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, \"submission_date = @submission_date\" so that the check is only executed against a specific partition.\n
Example:
#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n
Please keep in mind the below checks can be combined and specified in the same checks.sql
file. For example:
#fail\n{{ not_null([\"submission_date\", \"os\"], \"submission_date = @submission_date\") }}\n #fail\n {{ min_row_count(1, \"submission_date = @submission_date\") }}\n #fail\n {{ is_unique([\"submission_date\", \"os\", \"country\"], \"submission_date = @submission_date\")}}\n #warn\n {{ in_range([\"non_ssl_loads\", \"ssl_loads\", \"reporting_ratio\"], 0, none, \"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#row_count_within_past_partitions_avgsource","title":"row_count_within_past_partitions_avg(source)","text":"Compares the row count of the current partition to the average of number_of_days
past partitions and checks if the row count is within the average +- threshold_percentage
%
Usage:
Arguments:\n\nnumber_of_days: int - Number of days we are comparing the row count to\nthreshold_percentage: int - How many percent above or below the average row count is ok.\npartition_field: Optional[str] - What column is the partition_field (default = \"submission_date\")\n
Example:
#fail\n{{ row_count_within_past_partitions_avg(7, 5, \"submission_date\") }}\n
"},{"location":"reference/data_checks/#value_lengthsource","title":"value_length(source)","text":"Checks that the column has values of specific character length.
Usage:
Arguments:\n\ncolumn: str - Column which will be checked against the `expected_length`.\nexpected_length: int - Describes the expected character length of the value inside the specified columns.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\n
Example:
#warn\n{{ value_length(column=\"country\", expected_length=2, where=\"submission_date = @submission_date\") }}\n
"},{"location":"reference/data_checks/#matches_patternsource","title":"matches_pattern(source)","text":"Checks that the column values adhere to a pattern based on a regex expression.
Usage:
Arguments:\n\ncolumn: str - Column which values will be checked against the regex.\npattern: str - Regex pattern specifying the expected shape / pattern of the values inside the column.\nwhere: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.\nthreshold_fail_percentage: Optional[int] - Percentage of how many rows can fail the check before causing it to fail.\nmessage: Optional[str]: Custom error message.\n
Example:
#warn\n{{ matches_pattern(column=\"country\", pattern=\"^[A-Z]{2}$\", where=\"submission_date = @submission_date\", threshold_fail_percentage=10, message=\"Oops\") }}\n
"},{"location":"reference/data_checks/#running-checks-locally-commands","title":"Running checks locally / Commands","text":"To list all available commands in the bqetl data checks CLI:
$ ./bqetl check\n\nUsage: bqetl check [OPTIONS] COMMAND [ARGS]...\n\n Commands for managing and running bqetl data checks.\n\n \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\n IN ACTIVE DEVELOPMENT\n\n The current progress can be found under:\n\n https://mozilla-hub.atlassian.net/browse/DENG-919\n\n \u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\u2013\n\nOptions:\n --help Show this message and exit.\n\nCommands:\n render Renders data check query using parameters provided (OPTIONAL).\n run Runs data checks defined for the dataset (checks.sql).\n
To see see how to use a specific command use:
$ ./bqetl check [command] --help\n
render
$ ./bqetl check render [OPTIONS] DATASET [ARGS]\n\nRenders data check query using parameters provided (OPTIONAL). The result\nis what would be used to run a check to ensure that the specified dataset\nadheres to the assumptions defined in the corresponding checks.sql file\n\nOptions:\n --project-id, --project_id TEXT\n GCP project ID\n --sql_dir, --sql-dir DIRECTORY Path to directory which contains queries.\n --help Show this message and exit.\n
"},{"location":"reference/data_checks/#example","title":"Example","text":"./bqetl check render --project_id=moz-fx-data-marketing-prod ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01\n
run
$ ./bqetl check run [OPTIONS] DATASET\n\nRuns data checks defined for the dataset (checks.sql).\n\nChecks can be validated using the `--dry_run` flag without executing them:\n\nOptions:\n --project-id, --project_id TEXT\n GCP project ID\n --sql_dir, --sql-dir DIRECTORY Path to directory which contains queries.\n --dry_run, --dry-run To dry run the query to make sure it is\n valid\n --marker TEXT Marker to filter checks.\n --help Show this message and exit.\n
"},{"location":"reference/data_checks/#examples","title":"Examples","text":"# to run checks for a specific dataset\n$ ./bqetl check run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail --marker=warn\n\n# to only dry_run the checks\n$ ./bqetl check run --dry_run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail\n
"},{"location":"reference/incremental/","title":"Incremental Queries","text":""},{"location":"reference/incremental/#benefits","title":"Benefits","text":"WRITE_TRUNCATE
mode or bq query --replace
to replace partitions atomically to prevent duplicate data@submission_date
query parametersubmission_date
matching the query parametersql/moz-fx-data-shared-prod/clients_last_seen_v1.sql
can be run serially on any 28 day period and the last day will be the same whether or not the partition preceding the first day was missing because values are only impacted by 27 preceding daysFor background, see Accessing Public Data on docs.telemetry.mozilla.org
.
public_bigquery
flag must be set in metadata.yaml
mozilla-public-data
GCP project which is accessible by everyone, also external userspublic_json
flag must be set in metadata.yaml
000000000000.json
, 000000000001.json
, ...)incremental_export
controls how data should be exported as JSON:false
: all data of the source table gets exported to a single locationtrue
: only data that matches the submission_date
parameter is exported as JSON to a separate directory for this datemetadata.json
gets published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.jsonlast_updated
, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updatedsql/<project>/<dataset>/<table>_<version>/query.sql
e.g.<project>
defines both where the destination table resides and in which project the query job runs sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql
sql/<project>/<dataset>/<table>_<version>/query.sql
as abovesql/<project>/query_type.sql.py
e.g. sql/moz-fx-data-shared-prod/clients_daily.sql.py
--source telemetry_core_parquet_v3
to generate sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql
and using --source main_summary_v4
to generate sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql
-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet\n
moz-fx-data-shared-prod
the project prefix should be omitted to simplify testing. (Other projects do need the project prefix)_
prefix in generated column names not meant for output_bits
suffix for any integer column that represents a bit patternDATETIME
type, due to incompatibility with spark-bigquery-connector*_stable
tables instead of including custom deduplicationdocument_id
by submission_timestamp
where filtering duplicates is necessarymozdata
project which are duplicates of views in another project (commonly moz-fx-data-shared-prod
). Refer to the original view instead.{{ metrics.calculate() }}
: SELECT\n *\nFROM\n {{ metrics.calculate(\n metrics=['days_of_use', 'active_hours'],\n platform='firefox_desktop',\n group_by={'sample_id': 'sample_id', 'channel': 'application.channel'},\n where='submission_date = \"2023-01-01\"'\n ) }}\n\n-- this translates to\nSELECT\n *\nFROM\n (\n WITH clients_daily AS (\n SELECT\n client_id AS client_id,\n submission_date AS submission_date,\n COALESCE(SUM(active_hours_sum), 0) AS active_hours,\n COUNT(submission_date) AS days_of_use,\n FROM\n mozdata.telemetry.clients_daily\n GROUP BY\n client_id,\n submission_date\n )\n SELECT\n clients_daily.client_id,\n clients_daily.submission_date,\n active_hours,\n days_of_use,\n FROM\n clients_daily\n )\n
metrics
: unique reference(s) to metric definition, all metric definitions are aggregations (e.g. SUM, AVG, ...)platform
: platform to compute metrics for (e.g. firefox_desktop
, firefox_ios
, fenix
, ...)group_by
: fields used in the GROUP BY statement; this is a dictionary where the key represents the alias, the value is the field path; GROUP BY
always includes the configured client_id
and submission_date
fieldswhere
: SQL filter clausegroup_by_client_id
: Whether the field configured as client_id
(defined as part of the data source specification in metric-hub) should be part of the GROUP BY
. True
by defaultgroup_by_submission_date
: Whether the field configured as submission_date
(defined as part of the data source specification in metric-hub) should be part of the GROUP BY
. True
by default{{ metrics.data_source() }}
: SELECT\n *\nFROM\n {{ metrics.data_source(\n data_source='main',\n platform='firefox_desktop',\n where='submission_date = \"2023-01-01\"'\n ) }}\n\n-- this translates to\nSELECT\n *\nFROM\n (\n SELECT *\n FROM `mozdata.telemetry.main`\n WHERE submission_date = \"2023-01-01\"\n )\n
./bqetl query render path/to/query.sql
generated-sql
branch has rendered queries/views/UDFs./bqetl query run
does support running Jinja queriesmetadata.yaml
file should be created in the same directoryfriendly_name: SSL Ratios\ndescription: >\n Percentages of page loads Firefox users have performed that were\n conducted over SSL broken down by country.\nowners:\n - example@mozilla.com\nlabels:\n application: firefox\n incremental: true # incremental queries add data to existing tables\n schedule: daily # scheduled in Airflow to run daily\n public_json: true\n public_bigquery: true\n review_bugs:\n - 1414839 # Bugzilla bug ID of data review\n incremental_export: false # non-incremental JSON export writes all data to a single location\n
sql/<project>/<dataset>/<table>/view.sql
e.g. sql/moz-fx-data-shared-prod/telemetry/core/view.sql
fx-data-dev@mozilla.org
moz-fx-data-shared-prod
project; the scripts/publish_views
tooling can handle parsing the definitions to publish to other projects such as derived-datasets
mozdata
project which are duplicates of views in another project (commonly moz-fx-data-shared-prod
). Refer to the original view instead.BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
.sql
e.g. mode_last.sql
udf/
directory and JS UDFs must be defined in the udf_js
directoryudf_legacy/
directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto.CREATE OR REPLACE FUNCTION
syntax<dir_name>.
so, for example, all functions in udf/*.sql
are part of the udf
datasetCREATE OR REPLACE FUNCTION <dir_name>.<file_name>
scripts/publish_persistent_udfs
for publishing these UDFs to BigQuerySQL
over js
for performanceNULL
for new data and EXCEPT
to exclude from views until droppedSELECT\n job_type,\n state,\n submission_date,\n destination_dataset_id,\n destination_table_id,\n total_terabytes_billed,\n total_slot_ms,\n error_location,\n error_reason,\n error_message\nFROM\n moz-fx-data-shared-prod.monitoring.bigquery_usage\nWHERE\n submission_date <= CURRENT_DATE()\n AND destination_dataset_id LIKE \"%backfills_staging_derived%\"\n AND destination_table_id LIKE \"%{{ your table name }}%\"\nORDER BY\n submission_date DESC\n
dags.yaml
dags.yaml
, e.g., by adding the following: bqetl_ssl_ratios: # name of the DAG; must start with bqetl_\n schedule_interval: 0 2 * * * # query schedule\n description: The DAG schedules SSL ratios queries.\n default_args:\n owner: example@mozilla.com\n start_date: \"2020-04-05\" # YYYY-MM-DD\n email: [\"example@mozilla.com\"]\n retries: 2 # number of retries if the query execution fails\n retry_delay: 30m\n
bqetl_
as prefix.schedule_interval
is either defined as a CRON expression or alternatively as one of the following CRON presets: once
, hourly
, daily
, weekly
, monthly
start_date
defines the first date for which the query should be executedstart_date
is set in the past, backfilling can be done via the Airflow web interfaceemail
lists email addresses alerts should be sent to in case of failures when running the querybqetl
CLI by running bqetl dag create bqetl_ssl_ratios --schedule_interval='0 2 * * *' --owner=\"example@mozilla.com\" --start_date=\"2020-04-05\" --description=\"This DAG generates SSL ratios.\"
metadata.yaml
file that includes a scheduling
section, for example: friendly_name: SSL ratios\n# ... more metadata, see Query Metadata section above\nscheduling:\n dag_name: bqetl_ssl_ratios\n
depends_on_past
keeps query from getting executed if the previous schedule for the query hasn't succeededdate_partition_parameter
- by default set to submission_date
; can be set to null
if query doesn't write to a partitioned tableparameters
specifies a list of query parameters, e.g. [\"n_clients:INT64:500\"]
arguments
- a list of arguments passed when running the query, for example: [\"--append_table\"]
referenced_tables
- manually curated list of tables a Python or BigQuery script depends on; for query.sql
files dependencies will get determined automatically and should only be overwritten manually if really necessarymultipart
indicates whether a query is split over multiple files part1.sql
, part2.sql
, ...depends_on
defines external dependencies in telemetry-airflow that are not detected automatically: depends_on:\n - task_id: external_task\n dag_name: external_dag\n execution_delta: 1h\n
task_id
: name of task query depends ondag_name
: name of the DAG the external task is part ofexecution_delta
: time difference between the schedule_intervals
of the external DAG and the DAG the query is part ofdepends_on_tables_existing
defines tables that the ETL will await the existence of via an Airflow sensor before running: depends_on_tables_existing:\n - task_id: wait_for_foo_bar_baz\n table_id: 'foo.bar.baz_{{ ds_nodash }}'\n poke_interval: 30m\n timeout: 12h\n retries: 1\n retry_delay: 10m\n
task_id
: ID to use for the generated Airflow sensor task.table_id
: Fully qualified ID of the table to wait for, including the project and dataset.poke_interval
: Time that the sensor should wait in between each check, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default poke interval is 5 minutes).timeout
: Time allowed before the sensor times out and fails, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default timeout is 8 hours).retries
: The number of retries that should be performed if the sensor times out or otherwise fails. This parameter is optional (the default depends on how the DAG is configured).retry_delay
: Time delay between retries, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default depends on how the DAG is configured).depends_on_table_partitions_existing
defines table partitions that the ETL will await the existence of via an Airflow sensor before running: depends_on_table_partitions_existing:\n - task_id: wait_for_foo_bar_baz\n table_id: foo.bar.baz\n partition_id: '{{ ds_nodash }}'\n poke_interval: 30m\n timeout: 12h\n retries: 1\n retry_delay: 10m\n
task_id
: ID to use for the generated Airflow sensor task.table_id
: Fully qualified ID of the table to check, including the project and dataset. Note that the service account airflow-access@moz-fx-data-shared-prod.iam.gserviceaccount.com
will need to have the BigQuery Job User role on the project and read access to the dataset.partition_id
: ID of the partition to wait for.poke_interval
: Time that the sensor should wait in between each check, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default poke interval is 5 minutes).timeout
: Time allowed before the sensor times out and fails, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default timeout is 8 hours).retries
: The number of retries that should be performed if the sensor times out or otherwise fails. This parameter is optional (the default depends on how the DAG is configured).retry_delay
: Time delay between retries, formatted as a timedelta string like \"2h\" or \"30m\". This parameter is optional (the default depends on how the DAG is configured).trigger_rule
: The rule that determines when the airflow task that runs this query should run. The default is all_success
(\"trigger this task when all directly upstream tasks have succeeded\"); other rules can allow a task to run even if not all preceding tasks have succeeded. See the Airflow docs for the list of trigger rule options.destination_table
: The table to write to. If unspecified, defaults to the query destination; if None, no destination table is used (the query is simply run as-is). Note that if no destination table is specified, you will need to specify the submission_date
parameter manuallyexternal_downstream_tasks
defines external downstream dependencies for which ExternalTaskMarker
s will be added to the generated DAG. These task markers ensure that when the task is cleared for triggering a rerun, all downstream tasks are automatically cleared as well. external_downstream_tasks:\n - task_id: external_downstream_task\n dag_name: external_dag\n execution_delta: 1h\n
bqetl
CLI: ./bqetl query schedule path/to/query_v1 --dag bqetl_ssl_ratios
./bqetl dag generate
dags/
directory./bqetl dag generate bqetl_ssl_ratios
main
. CI automatically generates DAGs and writes them to the telemetry-airflow-dags repo from where Airflow will pick them updepends_on_fivetran:\n - task_id: fivetran_import_1\n - task_id: another_fivetran_import\n
<task_id>_connector_id
in the Airflow admin interface for each import taskBefore changes, such as adding new fields to existing datasets or adding new datasets, can be deployed to production, bigquery-etl's CI (continuous integration) deploys these changes to a stage environment and uses these stage artifacts to run its various checks.
Currently, the bigquery-etl-integration-test
project serves as the stage environment. CI does have read and write access, but does at no point publish actual data to this project. Only UDFs, table schemas and views are published. The project itself does not have access to any production project, like mozdata
, so stage artifacts cannot reference any other artifacts that live in production.
Deploying artifacts to stage follows the following steps: 1. Once a new pull-request gets created in bigquery-etl, CI will pull in the generated-sql
branch to determine all files that show any changes compared to what is deployed in production (it is assumed that the generated-sql
branch reflects the artifacts currently deployed in production). All of these changed artifacts (UDFs, tables and views) will be deployed to the stage environment. * This CI step runs after the generate-sql
CI step to ensure that checks will also be executed on generated queries and to ensure schema.yaml
files have been automatically created for queries. 2. The bqetl
CLI has a command to run stage deploys, which is called in the CI: ./bqetl stage deploy --dataset-suffix=$CIRCLE_SHA1 $FILE_PATHS
* --dataset-suffix
will result in the artifacts being deployed to datasets that are suffixed by the current commit hash. This is to prevent any conflicts when deploying changes for the same artifacts in parallel and helps with debugging deployed artifacts. 3. For every artifacts that gets deployed to stage all dependencies need to be determined and deployed to the stage environment as well since the stage environment doesn't have access to production. Before these artifacts get actually deployed, they need to be determined first by traversing artifact definitions. * Determining dependencies is only relevant for UDFs and views. For queries, available schema.yaml
files will simply be deployed. * For UDFs, if a UDF does call another UDF then this UDF needs to be deployed to stage as well. * For views, if a view references another view, table or UDF then each of these referenced artifacts needs to be available on stage as well, otherwise the view cannot even be deployed to stage. * If artifacts are referenced that are not defined as part of the bigquery-etl repo (like stable or live tables) then their schema will get determined and a placeholder query.sql
file will be created * Also dependencies of dependencies need to be deployed, and so on 4. Once all artifacts that need to be deployed have been determined, all references to these artifacts in existing SQL files need to be updated. These references will need to point to the stage project and the temporary datasets that artifacts will be published to. * Artifacts that get deployed are determined from the files that got changed and any artifacts that are referenced in the SQL definitions of these files, as well as their references and so on. 5. To run the deploy, all artifacts will be copied to sql/bigquery-etl-integration-test
into their corresponding temporary datasets. * Also if any existing SQL tests the are related to changed artifacts will have their referenced artifacts updated and will get copied to a bigquery-etl-integration-test
folder * The deploy is executed in the order of: UDFs, tables, views * UDFs and views get deployed in a way that ensures that the right order of deployments (e.g. dependencies need to be deployed before the views referencing them) 6. Once the deploy has been completed, the CI will use these staged artifacts to run its tests 7. After checks have succeeded, the deployed artifacts will be removed from stage * By default the table expiration is set to 1 hour * This step will also automatically remove any tables and datasets that got previously deployed, are older than an hour but haven't been removed (for example due to some CI check failing)
After CI checks have passed and the pull-request has been approved, changes can be merged to main
. Once a new version of bigquery-etl has been published the changes can be deployed to production through the bqetl_artifact_deployment
Airflow DAG. For more information on artifact deployments to production see: https://docs.telemetry.mozilla.org/concepts/pipeline/artifact_deployment.html
Local changes can be deployed to stage using the ./bqetl stage deploy
command:
./bqetl stage deploy \\\n --dataset-suffix=test \\\n --copy-sql-to-tmp-dir \\\n sql/moz-fx-data-shared-prod/firefox_ios/new_profile_activation/view.sql \\\n sql/mozfun/map/sum/udf.sql\n
Files (for example ones with changes) that should be deployed to stage need to be specified. The stage deploy
accepts the following parameters: * --dataset-suffix
is an optional suffix that will be added to the datasets deployed to stage * --copy-sql-to-tmp-dir
copies SQL stored in sql/
to a temporary folder. Reference updates and any other modifications required to run the stage deploy will be performed in this temporary directory. This is an optional parameter. If not specified, changes get applied to the files directly and can be reverted, for example, by running git checkout -- sql/
* (optional) --remove-updated-artifacts
removes artifact files that have been deployed from the \"prod\" folders. This ensures that tests don't run on outdated or undeployed artifacts.
Deployed stage artifacts can be deleted from bigquery-etl-integration-test
by running:
./bqetl stage clean --delete-expired --dataset-suffix=test\n
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index 9e000f5347f..7607e6e6328 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ