Skip to content

Commit

Permalink
Merge pull request #99 from bxparks/develop
Browse files Browse the repository at this point in the history
merge 1.6.1 into master
  • Loading branch information
bxparks authored Jan 12, 2024
2 parents d8fb050 + a016ce4 commit 6405d35
Show file tree
Hide file tree
Showing 13 changed files with 252 additions and 14 deletions.
19 changes: 19 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,25 @@
# Changelog

* Unreleased
* 1.6.1 (2024-01-12)
* **Bug Fix**: Prevent amnesia that causes multiple type mismatches warnings
* If a data set contains multiple records with a column which do not
match each other, then the old code would *remove* the corresponding
internal `schema_entry` for that column, and print a warning message.
* This means that subsequent records would recreate the `schema_entry`,
and a subsequent mismatch would print another warning message.
* This also meant that if there was a second record after the most
recent mismatch, the script would output a schema entry for the
mismatching column, corresponding to the type of the last record which
was not marked as a mismatch.
* The fix is to use a tombstone entry for the offending column, instead
of deleting the `schema_entry` completely. Only a single warning
message is printed, and the column is ignored for all subsequent
records in the input data set.
* See
[Issue#98](https://github.com/bxparks/bigquery-schema-generator/issues/98]
which identified this problem which seems to have existed from the
very beginning.
* 1.6.0 (2023-04-01)
* Allow `null` fields to convert to `REPEATED` because `bq load` seems
to interpret null fields to be equivalent to an empty array `[]`.
Expand Down
14 changes: 13 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ $ generate-schema < file.data.json > file.schema.json
$ generate-schema --input_format csv < file.data.csv > file.schema.json
```

**Version**: 1.6.0 (2023-04-01)
**Version**: 1.6.1 (2024-01-12)

**Changelog**: [CHANGELOG.md](CHANGELOG.md)

Expand Down Expand Up @@ -267,6 +267,18 @@ csv` flag. The support is not as robust as JSON file. For example, CSV format
supports only the comma-separator, and does not support the pipe (`|`) or tab
(`\t`) character.

**Side Note**: The `input_format` parameter now supports (v1.6.0) the
`csvdictreader` option which allows using the
[csv.DictReader](https://docs.python.org/3/library/csv.html) class that can be
customized to handle different delimiters such as tabs. But this requires
creating a custom Python script using `bigquery_schema_generator` as a library.
See [SchemaGenerator.deduce_schema() from
csv.DictReader](#SchemaGeneratorDeduceSchemaFromCsvDictReader) section below. It
is probably possible to enable this functionality through the command line
script, but it was not obvious how to expose the various options of
`csv.DictReader` through the command line flags. I didn't spend any time on this
problem because this is not a feature that I use personally.)

Unlike `bq load`, the `generate_schema.py` script reads every record in the
input data file to deduce the table's schema. It prints the JSON formatted
schema file on the STDOUT.
Expand Down
53 changes: 41 additions & 12 deletions bigquery_schema_generator/generate_schema.py
Original file line number Diff line number Diff line change
Expand Up @@ -136,13 +136,15 @@ def deduce_schema(self, input_data, *, schema_map=None):
key: schema_entry
}
The 'key' is the name of the table column.
The 'key' is the canonical column name, which is set to be the
lower-cased version of the sanitized key because BigQuery is
case-insensitive to its column name.
schema_entry := {
'status': 'hard | soft',
'status': 'hard | soft | ignore',
'filled': True | False,
'info': {
'name': key,
'name': column_name,
'type': 'STRING | TIMESTAMP | DATE | TIME
| FLOAT | INTEGER | BOOLEAN | RECORD'
'mode': 'NULLABLE | REQUIRED | REPEATED',
Expand All @@ -160,6 +162,13 @@ def deduce_schema(self, input_data, *, schema_map=None):
'hard'. The status can transition from 'soft' to 'hard' but not the
reverse.
The status of 'ignore' identifies a column where the type of one record
conflicts with the type of another record. The column will be ignored
in the final JSON schema. (Early versions of this script *removed* the
offending column entry completely upon the first mismatch. But that
caused subsequent records to recreate the schema entry, which would be
incorrect.)
The 'filled' entry indicates whether all input data records contained
the given field. If the --infer_mode flag is given, the 'filled' entry
is used to convert a NULLABLE schema entry to a REQUIRED schema entry or
Expand Down Expand Up @@ -277,8 +286,7 @@ def merge_schema_entry(
Returns the merged schema_entry. This method assumes that both
'old_schema_entry' and 'new_schema_entry' can be modified in place and
returned as the new schema_entry. Returns None if the field should
be removed from the schema due to internal consistency errors.
returned as the new schema_entry.
'base_path' is the string representing the current path within the
nested record that leads to this specific entry. This is used during
Expand All @@ -302,13 +310,19 @@ def merge_schema_entry(
old_status = old_schema_entry['status']
new_status = new_schema_entry['status']

# new 'soft' does not clobber old 'hard'
# If the field was previously determined to be inconsistent, hence set
# to 'ignore', do nothing and return immediately.
if old_status == 'ignore':
return old_schema_entry

# new 'soft' retains the old 'hard'
if old_status == 'hard' and new_status == 'soft':
mode = self.merge_mode(old_schema_entry,
new_schema_entry,
base_path)
if mode is None:
return None
old_schema_entry['status'] = 'ignore'
return old_schema_entry
old_schema_entry['info']['mode'] = mode
return old_schema_entry

Expand All @@ -318,7 +332,8 @@ def merge_schema_entry(
new_schema_entry,
base_path)
if mode is None:
return None
old_schema_entry['status'] = 'ignore'
return old_schema_entry
new_schema_entry['info']['mode'] = mode
return new_schema_entry

Expand Down Expand Up @@ -389,7 +404,8 @@ def merge_schema_entry(
new_schema_entry,
base_path)
if new_mode is None:
return None
old_schema_entry['status'] = 'ignore'
return old_schema_entry
new_schema_entry['info']['mode'] = new_mode

# For all other types...
Expand All @@ -402,7 +418,8 @@ def merge_schema_entry(
f'old=({old_status},{full_old_name},{old_mode},{old_type});'
f' new=({new_status},{full_new_name},{new_mode},{new_type})'
)
return None
old_schema_entry['status'] = 'ignore'
return old_schema_entry

new_info['type'] = candidate_type
return new_schema_entry
Expand All @@ -414,6 +431,11 @@ def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
flag), because REQUIRED is created only in the flatten_schema()
method. Therefore, a NULLABLE->REQUIRED transition cannot occur.
Returns the merged mode.
Returning None means that the modes of the old_schema and new_schema are
not compatible.
We have the following sub cases for the REQUIRED -> NULLABLE
transition:
Expand All @@ -425,8 +447,6 @@ def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
REQUIRED -> NULLABLE transition.
b) If --infer_mode is not given, then we log an error and ignore
this field from the schema.
Returning a 'None' causes the field to be dropped from the schema.
"""
old_info = old_schema_entry['info']
new_info = new_schema_entry['info']
Expand Down Expand Up @@ -778,6 +798,10 @@ def convert_type(atype, btype):
* [Q]FLOAT + [Q]INTEGER => FLOAT (except QFLOAT + QINTEGER)
* (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) +
(DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) => STRING
The "Q" refers to the quoted (i.e. string) versions of the various types,
which are needed to emulate the type inference inside quoted strings
performed by BigQuery.
"""
# type + type => type
if atype == btype:
Expand Down Expand Up @@ -884,6 +908,11 @@ def flatten_schema_map(
filled = meta['filled']
info = meta['info']

# An 'ignore' status means different records had different types for
# this field, so should be ignored.
if status == 'ignore':
continue

# Schema entries with a status of 'soft' are caused by 'null' or
# empty fields. Don't print those out if the 'keep_nulls' flag is
# False.
Expand Down
2 changes: 1 addition & 1 deletion bigquery_schema_generator/version.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = '1.6.0'
__version__ = '1.6.1'
20 changes: 20 additions & 0 deletions tests/testdata.txt
Original file line number Diff line number Diff line change
Expand Up @@ -167,6 +167,26 @@ SCHEMA
[]
END

# If there are multiple records with a column with a mismatched type, only the
# first one mismatch should trigger a warning message. All subsequent records
# should have that column ignored, even if the mismatch occurs multiple times.
# See Issue 98.
#
# Before that fix, the following would generate a warning each time there is a
# transition between matching and mismatching type for the problematic column.
# So for the following 4 records, the previous version would print 2 warnings,
# one for line #2, and one for line #4.
DATA
{ "ts": "2017-05-22T17:10:00-07:00" }
{ "ts": 1.0 }
{ "ts": 2.0 }
{ "ts": "2017-05-22T17:10:00-07:00" }
ERRORS
2: Ignoring field with mismatched type: old=(hard,ts,NULLABLE,TIMESTAMP); new=(hard,ts,NULLABLE,FLOAT)
SCHEMA
[]
END

# DATE cannot change into a non-String type.
DATA
{ "d": "2017-01-01" }
Expand Down
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions tests/testdata/issue94.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
{ "model": {"data": {"Inventory": {"Observations": [] }}}}
{ "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}
150 changes: 150 additions & 0 deletions tests/testdata/issue98.fmt.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,150 @@
{
"landingPageClicks": 2,
"costInLocalCurrency": 3.211797061516033,
"impressions": 331,
"clicks": 2,
"totalEngagements": 3,
"account_id": 0,
"account_name": "dummy",
"date": "2021-07-10",
"dimensionValue": [
{
"servingHoldReasons": [
"STOPPED",
"CAMPAIGN_STOPPED"
],
"lastModifiedAt": 1656605739000,
"content": {
"reference": "urn:<snip>"
},
"createdAt": 1622099537000,
"review": {
"status": "APPROVED"
},
"id": "urn:<snip>",
"lastModifiedBy": "urn:<snip>",
"createdBy": "urn:<snip>",
"isTest": false,
"isServing": false,
"campaign": "urn:<snip>",
"intendedStatus": "PAUSED",
"account": "urn:<snip>"
}
],
"dimension": "creative"
}
{
"landingPageClicks": 17,
"pivotValues_": [
{
"message": "<snip>",
"status": 400
}
],
"costInLocalCurrency": 25.41813751523781,
"impressions": 2300,
"clicks": 17,
"totalEngagements": 45,
"account_id": 0,
"account_name": "dummy",
"date": "2021-06-03",
"dimensionValue": [
"urn:<snip>"
],
"dimension": "creative"
}
{
"landingPageClicks": 11,
"pivotValues_": [
{
"message": "<snip>",
"status": 400
}
],
"costInLocalCurrency": 8.094764692519716,
"impressions": 602,
"clicks": 11,
"totalEngagements": 15,
"account_id": 0,
"account_name": "dummy",
"date": "2021-06-27",
"dimensionValue": [
"urn:<snip>"
],
"dimension": "creative"
}
{
"landingPageClicks": 10,
"pivotValues_": [
{
"message": "<snip>",
"status": 400
}
],
"costInLocalCurrency": 15.095423445027421,
"impressions": 999,
"clicks": 10,
"totalEngagements": 19,
"account_id": 0,
"account_name": "dummy",
"date": "2021-06-06",
"dimensionValue": [
"urn:<snip>"
],
"dimension": "creative"
}
{
"landingPageClicks": 19,
"pivotValues_": [
{
"message": "<snip>",
"status": 400
}
],
"costInLocalCurrency": 26.39521675040559,
"impressions": 2982,
"clicks": 19,
"totalEngagements": 45,
"account_id": 0,
"account_name": "dummy",
"date": "2021-07-19",
"dimensionValue": [
"urn:<snip>"
],
"dimension": "creative"
}
{
"landingPageClicks": 5,
"costInLocalCurrency": 7.54,
"impressions": 430,
"clicks": 5,
"totalEngagements": 12,
"account_id": 0,
"account_name": "dummy",
"date": "2021-12-25",
"dimensionValue": [
{
"servingHoldReasons": [
"CAMPAIGN_STOPPED",
"CAMPAIGN_TOTAL_BUDGET_HOLD"
],
"lastModifiedAt": 1656583173000,
"content": {
"reference": "urn:<snip>"
},
"createdAt": 1640163564000,
"review": {
"status": "APPROVED"
},
"id": "urn:<snip>",
"lastModifiedBy": "urn:<snip>",
"createdBy": "urn:<snip>",
"isTest": false,
"isServing": false,
"campaign": "urn:<snip>",
"intendedStatus": "ACTIVE",
"account": "urn:<snip>"
}
],
"dimension": "creative"
}
6 changes: 6 additions & 0 deletions tests/testdata/issue98.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
{"landingPageClicks":2.0,"costInLocalCurrency":3.211797061516033,"impressions":331.0,"clicks":2.0,"totalEngagements":3.0,"account_id":0,"account_name":"dummy","date":"2021-07-10","dimensionValue":[{"servingHoldReasons":["STOPPED","CAMPAIGN_STOPPED"],"lastModifiedAt":1656605739000.0,"content":{"reference":"urn:<snip>"},"createdAt":1622099537000.0,"review":{"status":"APPROVED"},"id":"urn:<snip>","lastModifiedBy":"urn:<snip>","createdBy":"urn:<snip>","isTest":false,"isServing":false,"campaign":"urn:<snip>","intendedStatus":"PAUSED","account":"urn:<snip>"}],"dimension":"creative"}
{"landingPageClicks":17.0,"pivotValues_":[{"message":"<snip>","status":400.0}],"costInLocalCurrency":25.41813751523781,"impressions":2300.0,"clicks":17.0,"totalEngagements":45.0,"account_id":0,"account_name":"dummy","date":"2021-06-03","dimensionValue":["urn:<snip>"],"dimension":"creative"}
{"landingPageClicks":11.0,"pivotValues_":[{"message":"<snip>","status":400.0}],"costInLocalCurrency":8.094764692519716,"impressions":602.0,"clicks":11.0,"totalEngagements":15.0,"account_id":0,"account_name":"dummy","date":"2021-06-27","dimensionValue":["urn:<snip>"],"dimension":"creative"}
{"landingPageClicks":10.0,"pivotValues_":[{"message":"<snip>","status":400.0}],"costInLocalCurrency":15.095423445027421,"impressions":999.0,"clicks":10.0,"totalEngagements":19.0,"account_id":0,"account_name":"dummy","date":"2021-06-06","dimensionValue":["urn:<snip>"],"dimension":"creative"}
{"landingPageClicks":19.0,"pivotValues_":[{"message":"<snip>","status":400.0}],"costInLocalCurrency":26.39521675040559,"impressions":2982.0,"clicks":19.0,"totalEngagements":45.0,"account_id":0,"account_name":"dummy","date":"2021-07-19","dimensionValue":["urn:<snip>"],"dimension":"creative"}
{"landingPageClicks":5.0,"costInLocalCurrency":7.54,"impressions":430.0,"clicks":5.0,"totalEngagements":12.0,"account_id":0,"account_name":"dummy","date":"2021-12-25","dimensionValue":[{"servingHoldReasons":["CAMPAIGN_STOPPED","CAMPAIGN_TOTAL_BUDGET_HOLD"],"lastModifiedAt":1656583173000.0,"content":{"reference":"urn:<snip>"},"createdAt":1640163564000.0,"review":{"status":"APPROVED"},"id":"urn:<snip>","lastModifiedBy":"urn:<snip>","createdBy":"urn:<snip>","isTest":false,"isServing":false,"campaign":"urn:<snip>","intendedStatus":"ACTIVE","account":"urn:<snip>"}],"dimension":"creative"}

0 comments on commit 6405d35

Please sign in to comment.