
BigQuery: Field 'bar' is specified as REPEATED in provided schema which does not match REQUIRED as specified in the file. #17

Closed
timocb opened this issue Sep 11, 2019 · 17 comments · Fixed by #35

timocb commented Sep 11, 2019

I have an issue where it is not possible to upload a Pandas DataFrame with a repeated field to BigQuery. It is very much related to an issue I've had earlier: googleapis/google-cloud-python#8093

Since that has been resolved (by being able to specify the schema), I've created a separate issue. I also couldn't find issues related to REPEATED fields.

Environment details

Mac OS X 10.14.5
Python 3.6.8

Packages:

google-api-core==1.14.2
google-auth==1.6.3
google-cloud-bigquery==1.19.0
google-cloud-core==1.0.3
google-cloud-iam==0.2.1
google-cloud-logging==1.12.1
google-resumable-media==0.3.3
googleapis-common-protos==1.6.0

Steps to reproduce

  1. Have a table with a REPEATED field
  2. Upload a Pandas DataFrame with a repeated field to this table
  3. Error

Also:

  • Getting the schema from BigQuery and using it in the JobConfig doesn't change the error (see the sketch below).
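
For reference, here is a sketch of that variant: fetch the live table schema and reuse it in the job config. This is a hypothetical snippet assuming the same project/dataset/table names as the code example below, and it still fails with the same error.

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client("MY-PROJECT")
table_ref = client.dataset("MY_DATASET").table("MY_TABLE")

# Fetch the schema of the existing table and pass it along unchanged.
table = client.get_table(table_ref)
job_config = bigquery.LoadJobConfig(schema=table.schema)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})

# Still raises the same 400 BadRequest described below.
client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()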

Code example

import pandas as pd
from google.cloud import bigquery


PROJECT = "MY-PROJECT"
DATASET = "MY_DATASET"
TABLE = "MY_TABLE"


# My table schema
schema = [
    bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("bar", "FLOAT", mode="REPEATED"),
]


# Set everything up
client = bigquery.Client(PROJECT)
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)


# Delete the table if exists
print("Deleting table if exists...")
client.delete_table(table_ref, not_found_ok=True)


# Create the table
print("Creating table...")
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
table = client.create_table(table, exists_ok=True)

print("Table schema:")
print(table.schema)

print("Table partitioning:")
print(table.time_partitioning)

# Upload data to partition
table_partition = TABLE + "$20190522"
table_ref = dataset_ref.table(table_partition)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})

job_config = bigquery.LoadJobConfig(schema=schema)
client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()

Stack trace

Traceback (most recent call last):
  File "test.py", line 51, in <module>
    client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()
  File "google/cloud/bigquery/job.py", line 734, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: 

Provided schema is not compatible with the file 'prod-scotty-******'. 

Field 'bar' is specified as REPEATED in provided schema 
which does not match REQUIRED as specified in the file.

plamut commented Sep 11, 2019

I can confirm; I managed to reproduce the issue as described (thanks for the good self-contained example!).

Update: FWIW, table partitioning does not seem to play a role here; the issue is also reproducible with a non-partitioned table.

plamut commented Sep 16, 2019

I suspect this is a backend issue.

The following archive contains a temp parquet file that is uploaded to the backend (GitHub does not allow attaching .parquet files directly): iss_9207_upload.parquet.zip

Inspecting the file reveals the following:

>>> import fastparquet
>>> FILE = "/path/to/iss_9207_upload.parquet"  # adjust accordingly
>>> pf = fastparquet.ParquetFile(FILE)
>>> print(pf.schema)
- schema: REQUIRED
| - foo: INT64, REQUIRED
  - bar: LIST, REQUIRED
    - list: REPEATED
      - item: DOUBLE, OPTIONAL

The schema in the file is correct, but I suspect that the backend simply compares the parquet type of the bar column with the bigquery.SchemaField instance for the same column and incorrectly concludes that the REPEATED mode in the provided schema does not match the REQUIRED mode in the file:

Field 'bar' is specified as REPEATED in provided schema which does not match REQUIRED as specified in the file.

Will forward this to the backend to have a closer look.

timocb commented Sep 17, 2019

@plamut Is this also the case when using pyarrow?

From this comment I got the idea that fastparquet and pyarrow provide similar functionality and can be used interchangeably.

plamut commented Sep 17, 2019

@timocb Yes, I reproduced the issue with pyarrow installed (the BigQuery client used pyarrow to create a parquet file).

It seems that the generated parquet file that gets uploaded to the backend is correct, but the backend incorrectly concludes that the schema provided does not match the schema in the uploaded file.
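
For anyone who wants to double-check with pyarrow instead of fastparquet, here is a minimal inspection sketch (the file path is illustrative):

import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/iss_9207_upload.parquet")  # adjust accordingly
print(pf.schema)  # parquet-level schema, including repetition levels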

Edit: As a workaround, would loading the data with load_table_from_json() help? It is similar, except that it accepts a list of row dictionaries instead of a dataframe.
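
A rough sketch of that workaround, assuming the client, table_ref, and job_config from the original example:

rows = [
    {"foo": 1, "bar": [2.0, 3.0]},
    {"foo": 2, "bar": [3.0, 4.0]},
    {"foo": 3, "bar": [4.0, 5.0]},
]
client.load_table_from_json(rows, table_ref, job_config=job_config).result()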

timocb commented Sep 17, 2019

@plamut Thanks for confirming. That workaround was indeed what I used!

tswast commented Sep 25, 2019

I filed https://issuetracker.google.com/133415569 a while ago but closed it because I wasn't able to reproduce. Thanks for investigating further.

It seems the problem occurs when a parquet file specifies required and repeated at the same time?

plamut commented Sep 25, 2019

It appears so, yes; everything in the client looks fine at the point when the parquet file is sent in the API request.

tswast commented Sep 25, 2019

Related: https://github.com/googleapis/google-cloud-python/issues/8544 We never actually closed the feature request for ARRAY type, but perhaps we can, since we support it when an explicit schema is provided.

I've re-opened my internal issue, since we're able to reproduce the issue with parquet files generated by fastparquet.

tswast commented Nov 14, 2019

I'm not able to reproduce with the test file provided.

$ bq load --source_format=PARQUET test_dataset.b133415569_parquet iss_9207_upload.parquet
Upload complete.
Waiting on bqjob_r670a6280dcb701f9_0000016e6ae31f56_1 ... (6s) Current status: DONE

Appending to that same table:

$ bq load --source_format=PARQUET test_dataset.b133415569_parquet iss_9207_upload.parquet
Upload complete.
Waiting on bqjob_r2f561503180ee012_0000016e6ae34f6f_1 ... (6s) Current status: DONE

[screenshot: schema of the loaded table]

tswast commented Nov 14, 2019

I see one difference between the two schemas: bar is mode "repeated" in the file Peter provided, but "required" once uploaded.

BigQuery does not support arrays of arrays, but the workaround is to have an array of records with an array field, which is the pattern being followed here.
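
A hedged illustration of that shape with hypothetical field names (ARRAY<STRUCT<inner ARRAY<FLOAT64>>> in SQL terms):

from google.cloud import bigquery

schema = [
    bigquery.SchemaField(
        "outer",
        "RECORD",
        mode="REPEATED",  # the array of records
        fields=[bigquery.SchemaField("inner", "FLOAT", mode="REPEATED")],  # the array field
    ),
]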

tswast commented Nov 14, 2019

Actually, I think this is just a weirdness in how fastparquet encodes lists.

Fastparquet also handles some parquet LIST and MAP types. For instance, the schema may include

| - tags: LIST, OPTIONAL
    - list: REPEATED
       - element: BYTE_ARRAY, UTF8, OPTIONAL

In this case, columns would include an entry “tags”, which evaluates to an object column containing lists of strings. Reading such columns will be relatively slow. If the ‘element’ type is anything other than a primitive type, i.e., a struct, map or list, then fastparquet will not be able to read it, and the resulting column will either not be contained in the output, or contain only None values.

https://fastparquet.readthedocs.io/en/latest/details.html#reading-nested-schema

It's probably worth comparing with how other parquet encoders deal with this case, as I suspect the schema can be less complicated than a list of structs with one item.

tswast commented Nov 14, 2019

Peter, can you try encoding the same data with some other tools (like pyarrow and maybe even the Java Parquet package https://github.com/apache/parquet-mr) and compare the generated schema?

If it's as I suspect and fastparquet is doing something weird, we should file a bug to fastparquet.
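
A minimal sketch of the pyarrow side of such a comparison (output path illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Encode the same dataframe with pyarrow and print the resulting parquet schema.
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})
pq.write_table(pa.Table.from_pandas(df), "/tmp/df_pyarrow.parquet")
print(pq.ParquetFile("/tmp/df_pyarrow.parquet").schema)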

plamut commented Nov 30, 2019

@tswast

I'm not able to reproduce with the test file provided.

Sanity check - was that when loading the data into a brand new table?
I can reproduce the "good" behavior by running "bq load" twice with a table name that initially does not exist. The data is uploaded fine, and the created table schema matches the one in the posted screenshot.

On the other hand, if using an existing table:

[screenshot: schema of the existing table]

... uploading the same file with "bq load" results in an error, albeit a different one:

$ bq load --source_format=PARQUET peter_test_dataset.table_9207 ../reproduce/iss_9207_upload.parquet
Upload complete.
Waiting on bqjob_**********************_1 ... (1s) Current status: DONE   
BigQuery error in load operation: Error processing job '**********:**********': Provided Schema does not match Table ********:peter_test_dataset.table_9207. Field bar has changed type from FLOAT to RECORD

If trying to load the data with the script provided by @timocb, I get the same error message as in the issue description.

@timocb Can you please just confirm that the repeated bar field is indeed defined in the same way as in your own table?


@tswast The sample dataframe was converted to a parquet file (the one that I attached) via pyarrow in dataframe_to_parquet(). This might be the reason that the resulting parquet schema is "too complicated" for what the backend expects (or is it?).

In comparison, here's how fastparquet would encode the same dataframe directly:

import fastparquet
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})
fastparquet.write("/tmp/df_fastparquet.parquet", df)
pf2 = fastparquet.ParquetFile("/tmp/df_fastparquet.parquet")
print(pf2.schema)

- schema:
| - foo: INT64, OPTIONAL
  - bar: BYTE_ARRAY, JSON, OPTIONAL

(loading this into an existing table like the one above would still fail, though)

Like you said, it seems like a weirdness in how REPEATED fields get encoded, which then fails to match the field definition on the backend.

tswast commented Dec 2, 2019

Sanity check - was that when loading the data into a brand new table?

It was.

We aren’t encountering this when using the pyarrow code path to encode the data frame as a parquet file, right? This leads me to believe that what fastparquet is doing with the extra record datatype is unnecessary and certainly unexpected.

plamut commented Dec 2, 2019

We aren’t encountering this when using the pyarrow code path to encode the data frame as a parquet file, right?

Actually, we are, and the resulting parquet file created in that code path is the one that I initially attached. The schema in it simply seems a bit too complex for an existing REPEATED FLOAT field on the backend.

timocb commented Dec 3, 2019

@timocb Can you please just confirm that the repeated bar field is indeed defined in the same way as in your own table?

@plamut I can confirm this is the case.

plamut transferred this issue from googleapis/google-cloud-python on Feb 4, 2020
product-auto-label bot added the "api: bigquery" label on Feb 4, 2020
plamut added the "priority: p2" and "type: bug" labels on Feb 4, 2020

plamut commented Feb 10, 2020

It really does seem that encoding REPEATED fields in parquet format is unusual, at least in the Python parquet implementations, resulting in a schema mismatch with existing tables. I added a note to load_table_from_dataframe()'s docstring in #35.
