BigQuery to_dataframe() ArrowNotImplementedError #63

Closed
jvschoen opened this issue Mar 17, 2020 · 11 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: question Request for information or clarification. Not an issue.

Comments

@jvschoen commented Mar 17, 2020

I'm working with the Google Vision API and trying to do some analysis in pandas. When I try to load the query output into a pandas DataFrame, to_dataframe() fails with the ArrowNotImplementedError below.

Environment details

  1. OS type and version: GCP AI Notebook (Python framework)
  2. Python version and virtual environment information: 3.7.6
  3. google-cloud-bigquery version: 1.24.0

Steps to reproduce

  1. Query Vision API output: the results contain nested structs (our table has RECORD columns of mode REPEATED)

Code example

from google.cloud import bigquery as bq

def get_data(_project_id, _table_id):
    client = bq.Client(project=_project_id)
    sql_ = '''
        SELECT
          *
        FROM `{}`
        '''.format(_table_id)
    df = client.query(sql_)  # returns a QueryJob
    print("Query Complete. Converting to Dataframe")

    df = df.to_dataframe(progress_bar_type='tqdm') # Converts to dataframe
    return df

df = get_data(project_id, table_id)

Stack trace

ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-17-15263a59b273> in <module>
----> 1 df = get_data(project_id, table_id)

<ipython-input-16-83704d37548f> in get_data(_project_id, _table_id, verbose)
     18     print("Query Complete. Converting to Dataframe")
     19 
---> 20     df = df.to_dataframe(progress_bar_type='tqdm') # Converts to dataframe
     21     return df

/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/job.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client)
   3372             dtypes=dtypes,
   3373             progress_bar_type=progress_bar_type,
-> 3374             create_bqstorage_client=create_bqstorage_client,
   3375         )
   3376 

/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/table.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client)
   1729                 create_bqstorage_client=create_bqstorage_client,
   1730             )
-> 1731             df = record_batch.to_pandas()
   1732             for column in dtypes:
   1733                 df[column] = pandas.Series(df[column], dtype=dtypes[column])

/opt/conda/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

/opt/conda/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    764     _check_data_column_metadata_consistency(all_columns)
    765     columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 766     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    767 
    768     axes = [columns, index]

/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
   1099     columns = block_table.column_names
   1100     result = pa.lib.table_to_blocks(options, block_table, categories,
-> 1101                                     list(extension_columns.keys()))
   1102     return [_reconstruct_block(item, columns, extension_columns)
   1103             for item in result]

/opt/conda/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: Not implemented type for Arrow list to pandas: struct<score: double, description: string>
@plamut plamut transferred this issue from googleapis/google-cloud-python Mar 18, 2020
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Mar 18, 2020
@plamut plamut added the type: question Request for information or clarification. Not an issue. label Mar 18, 2020
@plamut (Contributor) commented Mar 18, 2020

@jvschoen Thanks for the report (I moved the issue from the old repository to here).

Which pyarrow version do you use? Is it 0.16.0+ or something less recent? I found a possibly related issue in the pyarrow bug tracker, and upgrading pyarrow might get rid of it.

Also, could you share the schema of the source table and the query that fetches data from it? That could also be useful for diagnosing the issue, thanks!

@tswast (Contributor) commented Mar 19, 2020

We may want to consider using Fletcher for struct and array BigQuery data types, since pandas needs to use (slow) Python objects in these cases.

https://fletcher.readthedocs.io/en/latest/

@jvschoen (Author) commented Mar 19, 2020

Here's the pip show output for pyarrow on my AI notebook:

Name: pyarrow
Version: 0.16.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: None
Author-email: None
License: Apache License, Version 2.0
Location: /opt/conda/lib/python3.7/site-packages
Requires: numpy, six
Required-by: 

Here's the pip show output for google-cloud-bigquery:

Name: google-cloud-bigquery
Version: 1.24.0
Summary: Google BigQuery API client library
Home-page: https://github.com/GoogleCloudPlatform/google-cloud-python
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: /opt/conda/lib/python3.7/site-packages
Requires: six, google-api-core, protobuf, google-cloud-core, google-auth, google-resumable-media
Required-by: pandas-gbq

Here's the table schema, exported to a Google Sheet:
https://docs.google.com/spreadsheets/d/189O5x5C18pIj5PywpCOVPQeJEygB83Vt8BmRRGFY1cg/edit?usp=sharing

I think it has something to do with ARRAY<STRUCT<score FLOAT64, description STRING>>

@jvschoen (Author)

If you look at that sheet: when I try to query just the face_data repeated record, I get:

ArrowNotImplementedError: Not implemented type for Arrow list to pandas: struct<face_id: int64, joy_likelihood: int64, surprise_likelihood: int64, anger_likelihood: int64, blurred_likelihood: int64, sorrow_likelihood: int64, headwear_likelihood: int64, underexposed_likelihood: int64, is_happy: bool, is_surprised: bool, is_angry: bool, is_blurred: bool, is_sad: bool, has_headwear: bool, is_underexposed: bool, face_area: double, face_ratio: double, face_ninth: string>

@emkornfield

The underlying Arrow issue has been fixed on master and will be available in the next release (https://issues.apache.org/jira/browse/ARROW-7872)

@plamut (Contributor) commented Mar 29, 2020

@emkornfield This sounds good, thanks! Looking forward to trying it out.

@emkornfield

Also, would using Avro be a short-term workaround for this?
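Another way to sidestep the Arrow conversion path in the meantime (a hypothetical sketch, not proposed in the thread): build the DataFrame from plain row dicts. With the real client the rows would come from client.query(sql).result(); simulated rows stand in here, and the column names are made up for illustration.

```python
import pandas

# Simulated rows; in practice: rows = client.query(sql).result()
rows = [
    {"uri": "gs://bucket/a.jpg",
     "labels": [{"score": 0.98, "description": "cat"}]},
    {"uri": "gs://bucket/b.jpg",
     "labels": [{"score": 0.77, "description": "dog"}]},
]
df = pandas.DataFrame(dict(row) for row in rows)
```

This keeps the nested values as plain Python objects, which is slower than the Arrow path but avoids the failing list&lt;struct&gt; conversion entirely.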

@plamut (Contributor) commented Mar 31, 2020

@emkornfield If a release is indeed planned some time in the next few weeks, I think that's soon enough, considering that this limitation has already been around for a while.

I checked the test case from the fix commit, and while it fails in pyarrow 0.16.0, it indeed passes in version 0.16.1.dev383+g0facdc77b (I manually compiled it from source).

FWIW, I did encounter quite a few problems with dependencies and versions, but eventually managed to compile pyarrow for Python 3.6 and tested with that.


I did notice, however, that the following snippet from BigQuery internals still fails:

import pandas
import pyarrow

series = pandas.Series([1, 2, 3], name="foo")

inner_type = pyarrow.int64()
arrow_type = pyarrow.list_(inner_type)
pyarrow.ListArray.from_pandas(series, type=arrow_type) 
# ArrowNotImplementedError: NumPyConverter doesn't implement <list<item: int64>> conversion. 

This seems like a bug in BigQuery, as the type argument should actually be the series element type (DataType(int64)), not the type of the series itself (ListType(list<item: int64>)).

Update: Actually, the "bug" above was an error in my test; there was a mismatch between the schema and the test data. Never mind.

@emkornfield commented Mar 31, 2020

I think nightly builds are now being published. You should be able to use:

pip install -U --extra-index-url \
https://pypi.fury.io/arrow-nightlies/ --pre pyarrow

@plamut (Contributor) commented Apr 2, 2020

It indeed works, thanks. I was able to install and run a development version with Python 3.7.

@jvschoen (Author) commented May 6, 2020

I know this is delayed, but that pyarrow update worked for me. I'm closing this issue.
