Merge pull request #1 from bxparks/develop
Initial merge of develop to master, mostly to export the README.md
bxparks authored Jan 1, 2018
2 parents 0b2bc7b + e457834 commit 409bebd
Showing 6 changed files with 1,448 additions and 2 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
241 changes: 239 additions & 2 deletions README.md
@@ -1,2 +1,239 @@
# bigquery-schema-generator
BigQuery Schema Generator
# BigQuery Schema Generator

## Summary

This script generates the BigQuery schema from the data records on the STDIN.
The BigQuery data importer uses only the first 100 lines when the schema
auto-detection feature is enabled. In contrast, this script uses all data
records to generate the schema.

Usage:
```
$ generate_schema.py < file.data.json > file.schema.json
```

## Background

Data can be imported into [BigQuery](https://cloud.google.com/bigquery/) using
the [bq](https://cloud.google.com/bigquery/bq-command-line-tool) command line
tool. It accepts a number of data formats including CSV or newline-delimited
JSON. The data can be loaded into an existing table or a new table can be
created during the loading process. The structure of the table is defined by
its [schema](https://cloud.google.com/bigquery/docs/schemas). The table's
schema can be defined manually or the schema can be
[auto-detected](https://cloud.google.com/bigquery/docs/schema-detect#auto-detect).

When the auto-detect feature is used, the BigQuery data importer examines only
the first 100 records of the input data. In many cases, this is sufficient
because the data records were dumped from another database and the exact schema
of the source table was known. However, for data extracted from a service
(e.g. using a REST API), the record fields may have been added organically at
later dates. In that case, the first 100 records may not contain fields which
appear only in later records, so the **bq load** auto-detection fails and the
data fails to load.

The **bq load** tool does not provide a way to process the entire dataset
to determine a more accurate schema. This script fills that gap. It
processes the entire dataset given in the STDIN and outputs the BigQuery schema
in JSON format on the STDOUT. This schema file can be fed back into the **bq
load** tool to create a table that is more compatible with the data fields in
the input dataset.
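
To make the approach concrete, here is a highly simplified sketch of
whole-dataset inference. This is illustrative only, not the actual
implementation of `generate_schema.py`; it ignores nested RECORD fields,
TIMESTAMP detection, null handling, and the type-conflict resolution that the
real script performs:

```
#!/usr/bin/env python3
"""Simplified sketch of whole-dataset schema inference (illustration only)."""
import json
import sys

def infer_type(value):
    """Map a JSON value to a (simplified) BigQuery type."""
    if isinstance(value, bool):   # check bool before int: True is an int
        return 'BOOLEAN'
    if isinstance(value, int):
        return 'INTEGER'
    if isinstance(value, float):
        return 'FLOAT'
    return 'STRING'

schema = {}
for line in sys.stdin:
    if not line.strip():
        continue
    record = json.loads(line)
    for name, value in record.items():
        # Arrays become REPEATED fields; everything else is NULLABLE.
        mode = 'REPEATED' if isinstance(value, list) else 'NULLABLE'
        sample = value[0] if mode == 'REPEATED' and value else value
        schema[name] = {'mode': mode, 'name': name,
                        'type': infer_type(sample)}

# Print the schema sorted by field name, as in the examples below.
json.dump(sorted(schema.values(), key=lambda f: f['name']),
          sys.stdout, indent=2)
```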

## Usage

The `generate_schema.py` script accepts a newline-delimited JSON data file on
the STDIN. (CSV is not currently supported.) It scans every record in the
input data file to deduce the table's schema. It prints the JSON formatted
schema file on the STDOUT:
```
$ generate_schema.py < file.data.json > file.schema.json
```

The schema file can be used in the **bq** command using:
```
$ bq load --schema file.schema.json mydataset.mytable file.data.json
```

where `mydataset.mytable` is the target table in BigQuery.

A useful flag for **bq load** is `--ignore_unknown_values`, which causes `bq load`
to ignore fields in the input data which are not defined in the schema. When
`generate_schema.py` detects an inconsistency in the definition of a particular
field in the input data, it removes the field from the schema definition.
Without the `--ignore_unknown_values` flag, **bq load** fails when the
inconsistent data record is read.
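
The removal of an inconsistent field can be pictured with a small sketch
(illustrative only, not the script's actual code; a fuller implementation
could first try to widen compatible types, e.g. INTEGER to FLOAT, before
giving up on a field):

```
def merge_field(existing, candidate):
    """Merge two inferred entries for the same field (sketch only)."""
    if existing is None or existing == candidate:
        return candidate
    # Same field seen with a different type or mode: no single BigQuery
    # column definition fits, so drop the field from the schema entirely.
    return None
```

When the merge returns `None` the field disappears from the schema, and
`bq load --ignore_unknown_values` then skips the corresponding values in the
data instead of failing.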

After the BigQuery table is loaded, the schema can be retrieved using:
```
$ bq show --schema mydataset.mytable | python -m json.tool
```
(The `python -m json.tool` command will pretty-print the JSON formatted schema
file.) This schema file should be identical to `file.schema.json`.
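
To check the round trip programmatically rather than by eye, a short snippet
like the following works (the name `bq.schema.json` is just a placeholder for
wherever you saved the `bq show --schema` output):

```
import json

# Field order may differ between the two files, so sort by name first.
with open('file.schema.json') as f:
    generated = sorted(json.load(f), key=lambda field: field['name'])
with open('bq.schema.json') as f:
    reported = sorted(json.load(f), key=lambda field: field['name'])

print('identical' if generated == reported else 'different')
```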

### Options

The `generate_schema.py` script supports a handful of command line flags:

* `--keep_nulls` Print the schema for null values, empty arrays or empty records.
* `--debugging_interval lines` Number of lines between heartbeat debugging messages. Default 1000.
* `--debugging_map` Print the metadata schema map for debugging purposes.

#### Null Values

Normally, when a field in the input data file has a null, an empty array, or
an empty record as its value, the field is suppressed in the schema file. The
`--keep_nulls` flag causes such fields to be included in the schema file. In
other words, for the data file:
```
{ "s": null, "a": [], "m": {} }
```
the schema would normally be:
```
[]
```
With the `--keep_nulls` flag, the resulting schema file will be:
```
[
{
"mode": "REPEATED",
"type": "STRING",
"name": "a"
},
{
"mode": "NULLABLE",
"fields": [
{
"mode": "NULLABLE",
"type": "STRING",
"name": "__unknown__"
}
],
"type": "RECORD",
"name": "d"
},
{
"mode": "NULLABLE",
"type": "STRING",
"name": "s"
}
]
```
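
The mapping from empty values to placeholder entries shown above can be
summarized in a short sketch (illustrative only, derived from the example
output; not the script's actual code):

```
def placeholder_entry(name, value):
    """Placeholder schema entry for a null/empty value (sketch only)."""
    if value == []:    # empty array -> repeated string
        return {'mode': 'REPEATED', 'type': 'STRING', 'name': name}
    if value == {}:    # empty record -> record with an unknown field
        return {'mode': 'NULLABLE', 'type': 'RECORD', 'name': name,
                'fields': [{'mode': 'NULLABLE', 'type': 'STRING',
                            'name': '__unknown__'}]}
    # null -> nullable string
    return {'mode': 'NULLABLE', 'type': 'STRING', 'name': name}
```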

#### Debugging Interval

By default, the `generate_schema.py` script prints a short progress message
every 1000 lines of input data. This interval can be changed using the
`--debugging_interval` flag.
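
The heartbeat is just a line counter; conceptually it behaves like this
sketch (not the script's actual code):

```
import logging
import sys

logging.basicConfig(level=logging.INFO)
debugging_interval = 1000  # value of the --debugging_interval flag

line_number = 0
for line_number, line in enumerate(sys.stdin, start=1):
    # ... process the record here ...
    if line_number % debugging_interval == 0:
        logging.info('Processed %s lines', line_number)
logging.info('Processed %s lines', line_number)  # final summary message
```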

#### Debugging Map

Instead of printing out the BigQuery schema, the `--debugging_map` flag prints
out the bookkeeping metadata map which is used internally to keep track of the
various fields and their types inferred from the data file. This flag is
intended for debugging.

## Examples

Here is an example of a single JSON data record on the STDIN:

```
$ ./generate_schema.py
{ "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
^D
INFO:root:Processed 1 lines
[
{
"mode": "NULLABLE",
"name": "b",
"type": "BOOLEAN"
},
{
"mode": "NULLABLE",
"name": "i",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "s",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "t",
"type": "TIMESTAMP"
},
{
"mode": "NULLABLE",
"name": "x",
"type": "FLOAT"
}
]
```

In most cases, the data records will be stored in a file:
```
$ cat > file.data.json
{ "a": [1, 2] }
{ "i": 3 }
^D
$ ./generate_schema.py < file.data.json > file.schema.json
INFO:root:Processed 2 lines
$ cat file.schema.json
[
{
"mode": "REPEATED",
"name": "a",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "i",
"type": "INTEGER"
}
]
```

## Unit Tests

Instead of embedding the input data records and the expected schema file into
the `test_generate_schema.py` file, we placed them into the `testdata.txt`
file. This has two advantages:

* we can more easily update the input and output data records, and
* the `testdata.txt` data could be reused for versions written in other languages

The output of `test_generate_schema.py` should look something like this:
```
----------------------------------------------------------------------
Ran 4 tests in 0.002s
OK
Test chunk 1: First record: { "s": null, "a": [], "m": {} }
Test chunk 2: First record: { "s": null, "a": [], "m": {} }
Test chunk 3: First record: { "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
Test chunk 4: First record: { "a": [1, 2], "r": { "r0": "r0", "r1": "r1" } }
Test chunk 5: First record: { "s": "string", "x": 3.2, "i": 3, "b": true, "a": [ "a", 1] }
Test chunk 6: First record: { "a": [1, 2] }
Test chunk 7: First record: { "r" : { "a": [1, 2] } }
Test chunk 8: First record: { "i": 1 }
Test chunk 9: First record: { "i": null }
Test chunk 10: First record: { "i": 3 }
Test chunk 11: First record: { "i": [1, 2] }
Test chunk 12: First record: { "r" : { "i": 3 } }
Test chunk 13: First record: { "r" : [{ "i": 4 }] }
```

## System Requirements

This project was developed on Ubuntu 17.04 using Python 3.5. It is likely
compatible with other Python environments, but I have not yet verified them.

## Author

Created by Brian T. Park (brian@xparks.net).

## License

Apache License 2.0