
fix(bigquery): use pyarrow fallback for improved schema detection #9321

Merged (6 commits) on Nov 4, 2019

Conversation

plamut (Contributor) commented on Sep 26, 2019:

Closes #9206.

This PR adds additional logic to schema autodetection: if pyarrow is available and the types could not be detected for all columns, the remaining column types are inferred with pyarrow.

How to test

  • Make sure that pyarrow is installed
  • Try loading data from a dataframe into a new table when the dataframe contains columns of type string, date, etc. (those that end up with the pandas dtype "object"). Do not provide an explicit schema to the load job (see the sketch below).

Actual result (before the fix):
The backend responds with an error (incompatible types).

Expected result (after the fix):
The load job completes successfully, and column types are correctly detected (strings, dates...).
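
A minimal reproduction sketch of the steps above, assuming the pandas and google-cloud-bigquery clients; the table ID is a placeholder:

```python
import datetime

import pandas
from google.cloud import bigquery

client = bigquery.Client()

df = pandas.DataFrame(
    {
        "name": ["foo", "bar"],  # pandas dtype "object" (Python str)
        "birthday": [  # pandas dtype "object" (datetime.date)
            datetime.date(2010, 1, 1),
            datetime.date(2012, 5, 7),
        ],
    }
)

# No explicit schema on the job config, so the client relies on autodetection.
load_job = client.load_table_from_dataframe(df, "your-project.your_dataset.new_table")
load_job.result()
# Before the fix: the backend rejects the load with an incompatible types error.
# After the fix: the table is created with STRING and DATE columns.
```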

Misc

The code might not work for all scalar types, depending on how good pyarrow's schema detection logic is. Which types do we care about the most?

We should update the test case to include all of them (we currently only test strings and dates).

plamut added the "api: bigquery" label (Issues related to the BigQuery API) on Sep 26, 2019
plamut requested a review from a team on Sep 26, 2019, 17:00
googlebot added the "cla: yes" label (This human has signed the Contributor License Agreement) on Sep 26, 2019
@@ -110,8 +110,13 @@ def pyarrow_timestamp():
"TIME": pyarrow_time,
"TIMESTAMP": pyarrow_timestamp,
}
ARROW_SCALARS_TO_BQ = {
    arrow_type(): bq_type  # TODO: explain why calling arrow_type()
A reviewer (Contributor) commented on this diff:
I don't think we're supposed to use Arrow type objects as dictionary keys. The factory functions all create instances of data types (https://arrow.apache.org/docs/python/api/datatypes.html#factory-functions), so I don't think we're guaranteed equality or a consistent hash between two Arrow data type instances.

That said, I'm not sure what alternatives we have. We could have a function with a whole bunch of if statements calling the type checking functions (https://arrow.apache.org/docs/python/api/datatypes.html#type-checking), but that sounds dirty and slow (though more likely to be correct). For pandas, I use the string name of the dtype, but I'm not seeing an equivalent for Arrow.

plamut (Contributor, Author) replied on Sep 27, 2019:

Each pyarrow type has a numeric ID that we can use instead. It's true that we don't know if these IDs are going to stay the same forever, but the map is built dynamically, and the IDs are consistent within a particular pyarrow version.

What do you think?

Edit: Well, for integers, for example, we assume pyarrow.int64, but an integer column in Arrow could also be of another type such as pyarrow.int32, and we need to map that to BigQuery's INT64, too.
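
A rough sketch of the ID-keyed mapping described above; the dictionary names are illustrative, not necessarily the ones used in the PR:

```python
import pyarrow

# Map BigQuery scalar types to pyarrow type factory functions.
BQ_TO_ARROW_SCALARS = {
    "BOOL": pyarrow.bool_,
    "FLOAT64": pyarrow.float64,
    "INT64": pyarrow.int64,
    "STRING": pyarrow.string,
    "DATE": pyarrow.date32,
}

# Key the reverse map on the numeric type ID, which is hashable and stable
# within a given pyarrow version (the map is rebuilt on import anyway).
ARROW_SCALAR_IDS_TO_BQ = {
    arrow_factory().id: bq_type
    for bq_type, arrow_factory in BQ_TO_ARROW_SCALARS.items()
}

# Several Arrow types can map to the same BigQuery type, e.g. narrower
# integer widths should also resolve to INT64.
for int_factory in (pyarrow.int8, pyarrow.int16, pyarrow.int32):
    ARROW_SCALAR_IDS_TO_BQ[int_factory().id] = "INT64"

print(ARROW_SCALAR_IDS_TO_BQ.get(pyarrow.int32().id))  # INT64
```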

plamut (Contributor, Author) added:

I extended the types map in a provisional new commit, feel free to scrutinize. :)

On bigquery/google/cloud/bigquery/_pandas_helpers.py (comment on an outdated diff, resolved):
currated_schema.append(schema_field)
continue

detected_type = ARROW_SCALARS_TO_BQ.get(
A reviewer (Contributor) commented on this diff:

Probably best if we make a function for Arrow type + field name -> BQ field now, since we know we'll want to handle repeated / nested fields at some point.
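
A sketch of what such a helper could look like; the function name is hypothetical, and the lookup reuses the ID-keyed map sketched earlier. It only handles scalar types here, but it gives repeated/nested handling a single place to live later:

```python
import pyarrow
from google.cloud import bigquery


def bq_field_from_arrow(field_name, arrow_type):
    """Map a detected Arrow type to a BigQuery SchemaField (hypothetical helper)."""
    if pyarrow.types.is_list(arrow_type):
        # Future work: REPEATED fields would recurse on arrow_type.value_type here.
        raise NotImplementedError("repeated fields are not handled yet")
    if pyarrow.types.is_struct(arrow_type):
        # Future work: RECORD fields would recurse over the struct's children here.
        raise NotImplementedError("nested fields are not handled yet")

    bq_type = ARROW_SCALAR_IDS_TO_BQ.get(arrow_type.id)
    if bq_type is None:
        return None  # let the caller warn and give up on autodetection
    return bigquery.SchemaField(field_name, bq_type, mode="NULLABLE")
```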

plamut added the "do not merge" label (Indicates a pull request not ready for merge, due to either quality or timing) on Oct 1, 2019
Commit: Add more pyarrow types, convert to pyarrow only the columns the schema could not be detected for, etc.
plamut (Contributor, Author) commented on Oct 19, 2019:

@tswast Updated the code and extended the tests.

Also, I wonder if we should amend our return type to include the converted arrays?

Looking at the code, that would require changing the signatures of the following Pandas helpers:

  • currate_schema() - return type,
  • dataframe_to_bq_schema() - return type,
  • dataframe_to_parquet() and dataframe_to_arrow() - an additional parameter holding any dataframe columns already converted to pyarrow arrays.

Returning (schema, Optional[arrow_arrays]) does not feel like a clean design, and these changes would be added for a use case that we discourage (and will likely deprecate in the future), i.e. relying on implicit schema detection. And even then, only when loading data into a new table with mode != WRITE_TRUNCATE.

Unless serious performance problems are reported, we are probably better off promoting explicit schema usage and discouraging implicit schemas, IMO.
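
For reference, the explicit-schema usage being encouraged here looks roughly like this (reusing the client and dataframe from the earlier sketch; the table ID is again a placeholder):

```python
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("birthday", "DATE"),
    ]
)
load_job = client.load_table_from_dataframe(
    df, "your-project.your_dataset.new_table", job_config=job_config
)
load_job.result()  # no reliance on implicit schema detection
```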

plamut removed the "do not merge" label on Oct 19, 2019
plamut requested a review from tswast on Oct 19, 2019, 09:39
plamut requested a review from tswast on Oct 24, 2019, 08:10
Merging this pull request may close: BigQuery: load_table_from_dataframe fails on datetime64 column used for partitioning, saying it's an INTEGER