
fix(bigquery): use pyarrow fallback for improved schema detection #9321

Merged (6 commits) on Nov 4, 2019

Conversation

plamut (Contributor) commented on Sep 26, 2019:

Closes #9206.

This PR adds additional logic to schema autodetection: if pyarrow is available and the types could not be detected for all columns, the remaining column types are inferred with pyarrow.

How to test

  • Make sure that pyarrow is installed
  • Try loading data from a dataframe into a new table when the dataframe contains columns of type string, date, etc. (those that end up with the pandas dtype "object"). Do not provide an explicit schema to the load job (see the sketch below).

Actual result (before the fix):
The backend responds with an error (incompatible types).

Expected result (after the fix):
The load job completes successfully, and column types are correctly detected (strings, dates...).
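
A minimal reproduction sketch of the steps above, assuming the pandas and google-cloud-bigquery clients; the table ID is a placeholder:

```python
import datetime

import pandas
from google.cloud import bigquery

client = bigquery.Client()

df = pandas.DataFrame(
    {
        "name": ["foo", "bar"],  # pandas dtype "object" (Python str)
        "birthday": [  # pandas dtype "object" (datetime.date)
            datetime.date(2010, 1, 1),
            datetime.date(2012, 5, 7),
        ],
    }
)

# No explicit schema on the job config, so the client relies on autodetection.
load_job = client.load_table_from_dataframe(df, "your-project.your_dataset.new_table")
load_job.result()
# Before the fix: the backend rejects the load with an incompatible types error.
# After the fix: the table is created with STRING and DATE columns.
```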

Misc

The code might not work for all scalar types, depending on how good pyarrow's schema detection logic is. Which types do we care about the most?

We should update the test case to include all of them (we currently only test strings and dates).

plamut added the "api: bigquery" label (Issues related to the BigQuery API) on Sep 26, 2019
plamut requested a review from a team on Sep 26, 2019, 17:00
googlebot added the "cla: yes" label (This human has signed the Contributor License Agreement) on Sep 26, 2019
@@ -110,8 +110,13 @@ def pyarrow_timestamp():
"TIME": pyarrow_time,
"TIMESTAMP": pyarrow_timestamp,
}
ARROW_SCALARS_TO_BQ = {
    arrow_type(): bq_type  # TODO: explain why calling arrow_type()
A reviewer (Contributor) commented on this diff:
I don't think we're supposed to use Arrow type objects as dictionary keys. The factory functions all create instances of data types (https://arrow.apache.org/docs/python/api/datatypes.html#factory-functions), so I don't think we're guaranteed equality or a consistent hash between two Arrow data type instances.

That said, I'm not sure what alternatives we have. We could have a function with a whole bunch of if statements calling the type checking functions (https://arrow.apache.org/docs/python/api/datatypes.html#type-checking), but that sounds dirty and slow (though more likely to be correct). For pandas, I use the string name of the dtype, but I'm not seeing an equivalent for Arrow.

plamut (Contributor, Author) replied on Sep 27, 2019:

Each pyarrow type has a numeric ID that we can use instead. It's true that we don't know if these IDs are going to stay the same forever, but the map is built dynamically, and the IDs are consistent within a particular pyarrow version.

What do you think?

Edit: Well, for integers, for example, we assume pyarrow.int64, but an integer column in Arrow could also be of another type such as pyarrow.int32, and we need to map that to BigQuery's INT64, too.
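
A rough sketch of the ID-keyed mapping described above; the dictionary names are illustrative, not necessarily the ones used in the PR:

```python
import pyarrow

# Map BigQuery scalar types to pyarrow type factory functions.
BQ_TO_ARROW_SCALARS = {
    "BOOL": pyarrow.bool_,
    "FLOAT64": pyarrow.float64,
    "INT64": pyarrow.int64,
    "STRING": pyarrow.string,
    "DATE": pyarrow.date32,
}

# Key the reverse map on the numeric type ID, which is hashable and stable
# within a given pyarrow version (the map is rebuilt on import anyway).
ARROW_SCALAR_IDS_TO_BQ = {
    arrow_factory().id: bq_type
    for bq_type, arrow_factory in BQ_TO_ARROW_SCALARS.items()
}

# Several Arrow types can map to the same BigQuery type, e.g. narrower
# integer widths should also resolve to INT64.
for int_factory in (pyarrow.int8, pyarrow.int16, pyarrow.int32):
    ARROW_SCALAR_IDS_TO_BQ[int_factory().id] = "INT64"

print(ARROW_SCALAR_IDS_TO_BQ.get(pyarrow.int32().id))  # INT64
```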

plamut (Contributor, Author) added:

I extended the types map in a provisional new commit, feel free to scrutinize. :)

On bigquery/google/cloud/bigquery/_pandas_helpers.py (comment on an outdated diff, resolved):
currated_schema.append(schema_field)
continue

detected_type = ARROW_SCALARS_TO_BQ.get(
A reviewer (Contributor) commented on this diff:

Probably best if we make a function for Arrow type + field name -> BQ field now, since we know we'll want to handle repeated / nested fields at some point.
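
A sketch of what such a helper could look like; the function name is hypothetical, and the lookup reuses the ID-keyed map sketched earlier. It only handles scalar types here, but it gives repeated/nested handling a single place to live later:

```python
import pyarrow
from google.cloud import bigquery


def bq_field_from_arrow(field_name, arrow_type):
    """Map a detected Arrow type to a BigQuery SchemaField (hypothetical helper)."""
    if pyarrow.types.is_list(arrow_type):
        # Future work: REPEATED fields would recurse on arrow_type.value_type here.
        raise NotImplementedError("repeated fields are not handled yet")
    if pyarrow.types.is_struct(arrow_type):
        # Future work: RECORD fields would recurse over the struct's children here.
        raise NotImplementedError("nested fields are not handled yet")

    bq_type = ARROW_SCALAR_IDS_TO_BQ.get(arrow_type.id)
    if bq_type is None:
        return None  # let the caller warn and give up on autodetection
    return bigquery.SchemaField(field_name, bq_type, mode="NULLABLE")
```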

plamut added the "do not merge" label (Indicates a pull request not ready for merge, due to either quality or timing) on Oct 1, 2019
Commit: Add more pyarrow types, convert to pyarrow only the columns the schema could not be detected for, etc.
plamut (Contributor, Author) commented on Oct 19, 2019:

@tswast Updated the code and extended the tests.

Also, I wonder if we should amend our return type to include the converted arrays?

Looking at the code, that would require changing the signatures of the following Pandas helpers:

  • currate_schema() - return type,
  • dataframe_to_bq_schema() - return type,
  • dataframe_to_parquet() and dataframe_to_arrow() - an additional parameter holding any dataframe columns already converted to pyarrow arrays.

Returning (schema, Optional[arrow_arrays]) does not feel like a clean design, and these changes would be added for a use case that we discourage (and will likely deprecate in the future), i.e. relying on implicit schema detection. And even then, only when loading data into a new table with mode != WRITE_TRUNCATE.

Unless serious performance problems are reported, we are probably better off promoting explicit schema usage and discouraging implicit schemas, IMO.
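
For reference, the explicit-schema usage being encouraged here looks roughly like this (reusing the client and dataframe from the earlier sketch; the table ID is again a placeholder):

```python
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("birthday", "DATE"),
    ]
)
load_job = client.load_table_from_dataframe(
    df, "your-project.your_dataset.new_table", job_config=job_config
)
load_job.result()  # no reliance on implicit schema detection
```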

plamut removed the "do not merge" label on Oct 19, 2019
plamut requested a review from tswast on Oct 19, 2019, 09:39
plamut requested a review from tswast on Oct 24, 2019, 08:10
Merging this pull request may close: BigQuery: load_table_from_dataframe fails on datetime64 column used for partitioning, saying it's an INTEGER