Merge pull request #1 from bxparks/develop
Initial merge of develop to master, mostly to export the README.md
Showing 6 changed files with 1,448 additions and 2 deletions.

.gitignore
```
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
```

README.md

# BigQuery Schema Generator

## Summary

This script generates the BigQuery schema from the data records on STDIN.
The BigQuery data importer uses only the first 100 lines when the schema
auto-detection feature is enabled. In contrast, this script uses all data
records to generate the schema.

Usage:
```
$ generate_schema.py < file.data.json > file.schema.json
```

## Background

Data can be imported into [BigQuery](https://cloud.google.com/bigquery/) using
the [bq](https://cloud.google.com/bigquery/bq-command-line-tool) command line
tool. It accepts a number of data formats, including CSV and newline-delimited
JSON. The data can be loaded into an existing table, or a new table can be
created during the loading process. The structure of the table is defined by
its [schema](https://cloud.google.com/bigquery/docs/schemas). The table's
schema can be defined manually or it can be
[auto-detected](https://cloud.google.com/bigquery/docs/schema-detect#auto-detect).
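
As a quick illustration (this record and schema are invented for this
document, not taken from the repository), a newline-delimited JSON record
such as:
```
{ "name": "alice", "age": 30 }
```
might be described by a schema file along these lines:
```
[
  {
    "mode": "NULLABLE",
    "name": "age",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "STRING"
  }
]
```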

When the auto-detect feature is used, the BigQuery data importer examines only
the first 100 records of the input data. In many cases, this is sufficient
because the data records were dumped from another database and the exact schema
of the source table was known. However, for data extracted from a service
(e.g. using a REST API), the record fields may have been added organically at
later dates. In this case, the first 100 records may not contain all the fields
which appear in later records, so the **bq load** auto-detection fails and the
data fails to load.

The **bq load** tool does not support the ability to process the entire dataset
to determine a more accurate schema. This script fills in that gap. It
processes the entire dataset given on STDIN and outputs the BigQuery schema in
JSON format on STDOUT. This schema file can be fed back into the **bq load**
tool to create a table that is more compatible with the data fields in the
input dataset.

## Usage

The `generate_schema.py` script accepts a newline-delimited JSON data file on
STDIN. (CSV is not currently supported.) It scans every record in the input
data file to deduce the table's schema, then prints the JSON-formatted schema
file on STDOUT:
```
$ generate_schema.py < file.data.json > file.schema.json
```

The schema file can be used with the **bq** command as follows:
```
$ bq load --schema file.schema.json mydataset.mytable file.data.json
```

where `mydataset.mytable` is the target table in BigQuery.

A useful flag for **bq load** is `--ignore_unknown_values`, which causes `bq
load` to ignore fields in the input data which are not defined in the schema.
When `generate_schema.py` detects an inconsistency in the definition of a
particular field in the input data, it removes the field from the schema
definition. Without the `--ignore_unknown_values` flag, **bq load** fails when
the inconsistent data record is read.
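
For example, reusing the illustrative `mydataset.mytable` target from above,
the generated schema and this flag can be combined as follows:
```
$ bq load --ignore_unknown_values --schema file.schema.json \
    mydataset.mytable file.data.json
```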

After the BigQuery table is loaded, the schema can be retrieved using:
```
$ bq show --schema mydataset.mytable | python -m json.tool
```

(The `python -m json.tool` command pretty-prints the JSON-formatted schema
file.) This schema file should be identical to `file.schema.json`.

### Options

The `generate_schema.py` script supports a handful of command line flags
(a combined example follows this list):

* `--keep_nulls` Print the schema for null values, empty arrays, or empty records.
* `--debugging_interval lines` Number of lines between heartbeat debugging messages. Default 1000.
* `--debugging_map` Print the metadata schema map for debugging purposes.
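
For example, a run that keeps null fields and prints a progress message every
500 lines (using the same illustrative file names as above) might look like:
```
$ generate_schema.py --keep_nulls --debugging_interval 500 \
    < file.data.json > file.schema.json
```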

#### Null Values

Normally when the input data file contains a field whose value is null, an
empty array, or an empty record, the field is suppressed in the schema file.
The `--keep_nulls` flag causes such fields to be included in the schema file.
In other words, for the data file:
```
{ "s": null, "a": [], "m": {} }
```
the schema would normally be:
```
[]
```
With `--keep_nulls`, the resulting schema file will be:
``` | ||
[ | ||
{ | ||
"mode": "REPEATED", | ||
"type": "STRING", | ||
"name": "a" | ||
}, | ||
{ | ||
"mode": "NULLABLE", | ||
"fields": [ | ||
{ | ||
"mode": "NULLABLE", | ||
"type": "STRING", | ||
"name": "__unknown__" | ||
} | ||
], | ||
"type": "RECORD", | ||
"name": "d" | ||
}, | ||
{ | ||
"mode": "NULLABLE", | ||
"type": "STRING", | ||
"name": "s" | ||
} | ||
] | ||
``` | ||

#### Debugging Interval

By default, the `generate_schema.py` script prints a short progress message
every 1000 lines of input data. This interval can be changed using the
`--debugging_interval` flag.

#### Debugging Map

Instead of printing out the BigQuery schema, the `--debugging_map` flag prints
out the bookkeeping metadata map which is used internally to keep track of the
various fields and their types as they are inferred from the data file. This
flag is intended to be used for debugging.
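
A debugging run might look like this (the map is an internal data structure,
so its exact output is not shown here):
```
$ generate_schema.py --debugging_map < file.data.json
```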

## Examples

Here is an example of a single JSON data record on STDIN:

```
$ ./generate_schema.py
{ "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
^D
INFO:root:Processed 1 lines
[
  {
    "mode": "NULLABLE",
    "name": "b",
    "type": "BOOLEAN"
  },
  {
    "mode": "NULLABLE",
    "name": "i",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "s",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "t",
    "type": "TIMESTAMP"
  },
  {
    "mode": "NULLABLE",
    "name": "x",
    "type": "FLOAT"
  }
]
```

In most cases, the data records will be stored in a file:
```
$ cat > file.data.json
{ "a": [1, 2] }
{ "i": 3 }
^D
$ ./generate_schema.py < file.data.json > file.schema.json
INFO:root:Processed 2 lines
$ cat file.schema.json
[
  {
    "mode": "REPEATED",
    "name": "a",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "i",
    "type": "INTEGER"
  }
]
```

## Unit Tests

Instead of embedding the input data records and the expected schema file into
the `test_generate_schema.py` file, we placed them into the `testdata.txt`
file. This has two advantages:

* we can more easily update the input and output data records, and
* the `testdata.txt` data could be reused for versions written in other
  languages.
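
Assuming the standard Python `unittest` conventions (an assumption, since the
test runner invocation is not documented here), the tests can presumably be
run directly:
```
$ python3 test_generate_schema.py
```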

The output of `test_generate_schema.py` should look something like this:
```
----------------------------------------------------------------------
Ran 4 tests in 0.002s
OK
Test chunk 1: First record: { "s": null, "a": [], "m": {} }
Test chunk 2: First record: { "s": null, "a": [], "m": {} }
Test chunk 3: First record: { "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
Test chunk 4: First record: { "a": [1, 2], "r": { "r0": "r0", "r1": "r1" } }
Test chunk 5: First record: { "s": "string", "x": 3.2, "i": 3, "b": true, "a": [ "a", 1] }
Test chunk 6: First record: { "a": [1, 2] }
Test chunk 7: First record: { "r" : { "a": [1, 2] } }
Test chunk 8: First record: { "i": 1 }
Test chunk 9: First record: { "i": null }
Test chunk 10: First record: { "i": 3 }
Test chunk 11: First record: { "i": [1, 2] }
Test chunk 12: First record: { "r" : { "i": 3 } }
Test chunk 13: First record: { "r" : [{ "i": 4 }] }
```

## System Requirements

This project was developed on Ubuntu 17.04 using Python 3.5. It is likely
compatible with other Python environments, but I have not yet verified them.

## Author

Created by Brian T. Park (brian@xparks.net).

## License

Apache License 2.0