BigQuery to_dataframe() ArrowNotImplementedError #63

Closed
jvschoen opened this issue Mar 17, 2020 · 11 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. type: question Request for information or clarification. Not an issue.

Comments

@jvschoen commented Mar 17, 2020

I'm working with the Google Vision API and trying to do some analysis in pandas. When I try to load the query output into a pandas DataFrame, to_dataframe() fails with the ArrowNotImplementedError below.

Environment details

  1. OS type and version: GCP AI Notebook (Python framework)
  2. Python version and virtual environment information: 3.7.6
  3. google-cloud-bigquery version: 1.24.0

Steps to reproduce

  1. Query Vision API output: the results contain nested structs (our table has RECORD columns of mode REPEATED)

Code example

from google.cloud import bigquery as bq

def get_data(_project_id, _table_id):
    client = bq.Client(project=_project_id)
    sql_ = '''
        SELECT
          *
        FROM `{}`
        '''.format(_table_id)
    df = client.query(sql_)  # returns a QueryJob
    print("Query Complete. Converting to Dataframe")

    df = df.to_dataframe(progress_bar_type='tqdm') # Converts to dataframe
    return df

df = get_data(project_id, table_id)

Stack trace

ArrowNotImplementedError                  Traceback (most recent call last)
<ipython-input-17-15263a59b273> in <module>
----> 1 df = get_data(project_id, table_id)

<ipython-input-16-83704d37548f> in get_data(_project_id, _table_id, verbose)
     18     print("Query Complete. Converting to Dataframe")
     19 
---> 20     df = df.to_dataframe(progress_bar_type='tqdm') # Converts to dataframe
     21     return df

/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/job.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client)
   3372             dtypes=dtypes,
   3373             progress_bar_type=progress_bar_type,
-> 3374             create_bqstorage_client=create_bqstorage_client,
   3375         )
   3376 

/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/table.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client)
   1729                 create_bqstorage_client=create_bqstorage_client,
   1730             )
-> 1731             df = record_batch.to_pandas()
   1732             for column in dtypes:
   1733                 df[column] = pandas.Series(df[column], dtype=dtypes[column])

/opt/conda/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

/opt/conda/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    764     _check_data_column_metadata_consistency(all_columns)
    765     columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 766     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    767 
    768     axes = [columns, index]

/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
   1099     columns = block_table.column_names
   1100     result = pa.lib.table_to_blocks(options, block_table, categories,
-> 1101                                     list(extension_columns.keys()))
   1102     return [_reconstruct_block(item, columns, extension_columns)
   1103             for item in result]

/opt/conda/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: Not implemented type for Arrow list to pandas: struct<score: double, description: string>
@plamut plamut transferred this issue from googleapis/google-cloud-python Mar 18, 2020
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Mar 18, 2020
@plamut plamut added the type: question Request for information or clarification. Not an issue. label Mar 18, 2020
@plamut (Contributor) commented Mar 18, 2020

@jvschoen Thanks for the report (I moved the issue from the old repository to here).

Which pyarrow version do you use? Is it 0.16.0+ or something less recent? I found a possibly related issue in the pyarrow bug tracker, and upgrading pyarrow might get rid of it.

Also, could you share the schema of the source table and the query that fetches data from it? That could also be useful for diagnosing the issue, thanks!

@tswast (Contributor) commented Mar 19, 2020

We may want to consider using Fletcher for struct and array BigQuery data types, since pandas needs to use (slow) Python objects in these cases.

https://fletcher.readthedocs.io/en/latest/

@jvschoen (Author) commented Mar 19, 2020

Here's the pip show output for pyarrow on my AI notebook:

Name: pyarrow
Version: 0.16.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: None
Author-email: None
License: Apache License, Version 2.0
Location: /opt/conda/lib/python3.7/site-packages
Requires: numpy, six
Required-by: 

Here's the pip show output for google-cloud-bigquery:

Name: google-cloud-bigquery
Version: 1.24.0
Summary: Google BigQuery API client library
Home-page: https://github.com/GoogleCloudPlatform/google-cloud-python
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: /opt/conda/lib/python3.7/site-packages
Requires: six, google-api-core, protobuf, google-cloud-core, google-auth, google-resumable-media
Required-by: pandas-gbq

Here's the table schema, exported to a Google Sheet:
https://docs.google.com/spreadsheets/d/189O5x5C18pIj5PywpCOVPQeJEygB83Vt8BmRRGFY1cg/edit?usp=sharing

I think it has something to do with ARRAY<STRUCT<score FLOAT64, description STRING>>

@jvschoen (Author)

If you look at that sheet: when I try to query just the face_data repeated record, I get:

ArrowNotImplementedError: Not implemented type for Arrow list to pandas: struct<face_id: int64, joy_likelihood: int64, surprise_likelihood: int64, anger_likelihood: int64, blurred_likelihood: int64, sorrow_likelihood: int64, headwear_likelihood: int64, underexposed_likelihood: int64, is_happy: bool, is_surprised: bool, is_angry: bool, is_blurred: bool, is_sad: bool, has_headwear: bool, is_underexposed: bool, face_area: double, face_ratio: double, face_ninth: string>

@emkornfield

The underlying Arrow issue has been fixed on master and will be available in the next release (https://issues.apache.org/jira/browse/ARROW-7872)

@plamut (Contributor) commented Mar 29, 2020

@emkornfield This sounds good, thanks! Looking forward to trying it out.

@emkornfield

Also, would using Avro be a short-term workaround for this?
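Another way to sidestep the Arrow conversion path in the meantime (a hypothetical sketch, not proposed in the thread): build the DataFrame from plain row dicts. With the real client the rows would come from client.query(sql).result(); simulated rows stand in here, and the column names are made up for illustration.

```python
import pandas

# Simulated rows; in practice: rows = client.query(sql).result()
rows = [
    {"uri": "gs://bucket/a.jpg",
     "labels": [{"score": 0.98, "description": "cat"}]},
    {"uri": "gs://bucket/b.jpg",
     "labels": [{"score": 0.77, "description": "dog"}]},
]
df = pandas.DataFrame(dict(row) for row in rows)
```

This keeps the nested values as plain Python objects, which is slower than the Arrow path but avoids the failing list&lt;struct&gt; conversion entirely.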

@plamut (Contributor) commented Mar 31, 2020

@emkornfield If a release is indeed planned some time in the next few weeks, I think that's soon enough, considering that this limitation has already been around for a while.

I checked the test case from the fix commit, and while it fails in pyarrow 0.16.0, it indeed passes in version 0.16.1.dev383+g0facdc77b (I manually compiled it from source).

FWIW, I did encounter quite a few problems with dependencies and versions, but eventually managed to compile pyarrow for Python 3.6 and tested with that.


I did notice, however, that the following snippet from BigQuery internals still fails:

import pandas
import pyarrow

series = pandas.Series([1, 2, 3], name="foo")

inner_type = pyarrow.int64()
arrow_type = pyarrow.list_(inner_type)
pyarrow.ListArray.from_pandas(series, type=arrow_type) 
# ArrowNotImplementedError: NumPyConverter doesn't implement <list<item: int64>> conversion. 

This seems like a bug in BigQuery, as the type argument should actually be the series element type (DataType(int64)), not the type of the series itself (ListType(list<item: int64>)).

Update: Actually, the "bug" above was an error in my test; there was a mismatch between the schema and the test data. Never mind.

@emkornfield commented Mar 31, 2020

I think nightly builds are now being published. You should be able to use:

pip install -U --extra-index-url \
https://pypi.fury.io/arrow-nightlies/ --pre pyarrow

@plamut (Contributor) commented Apr 2, 2020

It indeed works, thanks. I was able to install and run a development version with Python 3.7.

@jvschoen (Author) commented May 6, 2020

I know this is delayed, but that pyarrow update worked for me. I'm closing this issue.
