Commit

Support undocumented 'UTC' suffix (same as 'Z') in TIMESTAMP field; fix for #19
bxparks committed Jan 17, 2019
1 parent 3a601a4 commit 1e7045a
Showing 4 changed files with 49 additions and 2 deletions.
24 changes: 23 additions & 1 deletion README.md
@@ -278,7 +278,29 @@ The supported types are:
* `RECORD`

The `generate-schema` script supports both `NULLABLE` and `REPEATED` modes of
all of the above types. The following types are _not_ supported:
all of the above types.

The supported format of `TIMESTAMP` is as close as practical to the
[bq load format](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#timestamp-type):
```
YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]][time zone]
```
which appears to be an extension of the
[ISO 8601 format](https://en.wikipedia.org/wiki/ISO_8601).
The difference from `bq load` is that the `[time zone]` component can be only
* `Z`
* `UTC` (same as `Z`)
* `(+|-)H[H][:M[M]]`

The suffix `UTC` is neither standard ISO 8601 nor
[documented by Google](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time-zones),
but it is used by `bq extract` and the web interface. (See
[Issue 19](https://github.com/bxparks/bigquery-schema-generator/issues/19).)

Timezone names from the [tz database](http://www.iana.org/time-zones) (e.g.
"America/Los_Angeles") are _not_ supported by `generate-schema`.

The following types are _not_ supported at all:

* `BYTES`
* `DATETIME` (unable to distinguish from `TIMESTAMP`)
2 changes: 1 addition & 1 deletion bigquery_schema_generator/generate_schema.py
@@ -47,7 +47,7 @@ class SchemaGenerator:
    # YYYY-[M]M-[D]D[( |T)[H]H:[M]M:[S]S[.DDDDDD]][time zone]
    TIMESTAMP_MATCHER = re.compile(
        r'^\d{4}-\d{1,2}-\d{1,2}[T ]\d{1,2}:\d{1,2}:\d{1,2}(\.\d{1,6})?'
        r'(([+-]\d{1,2}(:\d{1,2})?)|Z)?$')
        r' *(([+-]\d{1,2}(:\d{1,2})?)|Z|UTC)?$')

    # Detect a DATE field of the form YYYY-[M]M-[D]D.
    DATE_MATCHER = re.compile(
8 changes: 8 additions & 0 deletions tests/test_generate_schema.py
@@ -38,6 +38,12 @@ def test_timestamp_matcher_valid(self):
                '2017-05-22 12:33:01.123456'))
        self.assertTrue(
            SchemaGenerator.TIMESTAMP_MATCHER.match('2017-05-22T12:33:01Z'))
        self.assertTrue(
            SchemaGenerator.TIMESTAMP_MATCHER.match('2017-05-22T12:33:01 Z'))
        self.assertTrue(
            SchemaGenerator.TIMESTAMP_MATCHER.match('2017-05-22T12:33:01UTC'))
        self.assertTrue(
            SchemaGenerator.TIMESTAMP_MATCHER.match('2017-05-22 12:33:01 UTC'))
        self.assertTrue(
            SchemaGenerator.TIMESTAMP_MATCHER.match(
                '2017-05-22 12:33:01-7:00'))
@@ -73,6 +79,8 @@ def test_timestamp_matcher_invalid(self):
            SchemaGenerator.TIMESTAMP_MATCHER.match('2017-5-2A2:3:0'))
        self.assertFalse(
            SchemaGenerator.TIMESTAMP_MATCHER.match('17-05-22T12:33:01'))
        self.assertFalse(
            SchemaGenerator.TIMESTAMP_MATCHER.match('2017-05-22T12:33:01 UT'))

    def test_date_matcher_valid(self):
        self.assertTrue(SchemaGenerator.DATE_MATCHER.match('2017-05-22'))
17 changes: 17 additions & 0 deletions tests/testdata.txt
@@ -678,6 +678,7 @@ SCHEMA
END

# (Overflowing integer inside quotes) + STRING -> STRING
# See https://github.com/bxparks/bigquery-schema-generator/issues/18.
DATA
{"name": "9223372036854775808"}
{"name": "hello"}
@@ -690,3 +691,19 @@ SCHEMA
}
]
END

# TIMESTAMP recognizes Z, UTC, +/-offset suffixes.
# See https://github.com/bxparks/bigquery-schema-generator/issues/19
DATA
{"date": "2019-01-16T12:46:02Z"}
{"date": "2019-01-16T12:46:03 -05:00"}
{"date": "2019-01-16 12:46:01 UTC"}
SCHEMA
[
  {
    "mode": "NULLABLE",
    "name": "date",
    "type": "TIMESTAMP"
  }
]
END
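
As a rough check of the new test case, the sketch below copies the updated `TIMESTAMP_MATCHER` pattern from this commit and applies it to the three `DATA` records above. The helper names `looks_like_timestamp` and `all_timestamps` are hypothetical and are not part of `generate_schema.py`; the real `SchemaGenerator` does more than this single pattern match when deducing a type.

```python
import json
import re

# Pattern copied from the updated TIMESTAMP_MATCHER in this commit.
TIMESTAMP_MATCHER = re.compile(
    r'^\d{4}-\d{1,2}-\d{1,2}[T ]\d{1,2}:\d{1,2}:\d{1,2}(\.\d{1,6})?'
    r' *(([+-]\d{1,2}(:\d{1,2})?)|Z|UTC)?$')

# DATA records from the new test case in tests/testdata.txt.
RECORDS = [
    '{"date": "2019-01-16T12:46:02Z"}',
    '{"date": "2019-01-16T12:46:03 -05:00"}',
    '{"date": "2019-01-16 12:46:01 UTC"}',
]


def looks_like_timestamp(value):
    """Hypothetical helper: True if value matches the TIMESTAMP pattern."""
    return isinstance(value, str) and TIMESTAMP_MATCHER.match(value) is not None


def all_timestamps(lines, key):
    """Hypothetical helper: True if every record's `key` matches the pattern."""
    return all(looks_like_timestamp(json.loads(line)[key]) for line in lines)


if __name__ == '__main__':
    print(all_timestamps(RECORDS, 'date'))
```

A truncated suffix such as `UT` or a tz database name such as `America/Los_Angeles` does not match this pattern, which is consistent with the README change and the new invalid-case test.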
