
BigQuery: Field 'bar' is specified as REPEATED in provided schema which does not match REQUIRED as specified in the file. #17

Closed
timocb opened this issue Sep 11, 2019 · 17 comments · Fixed by #35

timocb commented Sep 11, 2019

I have an issue where it is not possible to upload a Pandas DataFrame with a repeated field to BigQuery. It is very much related to an issue I've had earlier: googleapis/google-cloud-python#8093

Since that has been resolved (by being able to specify the schema), I've created a separate issue. I also couldn't find issues related to REPEATED fields.

Environment details

Mac OS X 10.14.5
Python 3.6.8

Packages:

google-api-core==1.14.2
google-auth==1.6.3
google-cloud-bigquery==1.19.0
google-cloud-core==1.0.3
google-cloud-iam==0.2.1
google-cloud-logging==1.12.1
google-resumable-media==0.3.3
googleapis-common-protos==1.6.0

Steps to reproduce

  1. Have a table with a REPEATED field
  2. Upload a Pandas DataFrame with a repeated field to this table
  3. Error

Also:

  • Getting the schema from BigQuery and using it in the JobConfig doesn't change the error (see the sketch below).
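
For reference, here is a sketch of that variant: fetch the live table schema and reuse it in the job config. This is a hypothetical snippet assuming the same project/dataset/table names as the code example below, and it still fails with the same error.

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client("MY-PROJECT")
table_ref = client.dataset("MY_DATASET").table("MY_TABLE")

# Fetch the schema of the existing table and pass it along unchanged.
table = client.get_table(table_ref)
job_config = bigquery.LoadJobConfig(schema=table.schema)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})

# Still raises the same 400 BadRequest described below.
client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()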

Code example

import pandas as pd
from google.cloud import bigquery


PROJECT = "MY-PROJECT"
DATASET = "MY_DATASET"
TABLE = "MY_TABLE"


# My table schema
schema = [
    bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("bar", "FLOAT", mode="REPEATED"),
]


# Set everything up
client = bigquery.Client(PROJECT)
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)


# Delete the table if exists
print("Deleting table if exists...")
client.delete_table(table_ref, not_found_ok=True)


# Create the table
print("Creating table...")
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
table = client.create_table(table, exists_ok=True)

print("Table schema:")
print(table.schema)

print("Table partitioning:")
print(table.time_partitioning)

# Upload data to partition
table_partition = TABLE + "$20190522"
table_ref = dataset_ref.table(table_partition)

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})

job_config = bigquery.LoadJobConfig(schema=schema)
client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()

Stack trace

Traceback (most recent call last):
  File "test.py", line 51, in <module>
    client.load_table_from_dataframe(df, table_ref, job_config=job_config).result()
  File "google/cloud/bigquery/job.py", line 734, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "google/api_core/future/polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: 

Provided schema is not compatible with the file 'prod-scotty-******'. 

Field 'bar' is specified as REPEATED in provided schema 
which does not match REQUIRED as specified in the file.

plamut commented Sep 11, 2019

I can confirm; I managed to reproduce the issue as described (thanks for the good self-contained example!).

Update: FWIW, table partitioning does not seem to play a role here; the issue is also reproducible with a non-partitioned table.

plamut commented Sep 16, 2019

I suspect this is a backend issue.

The following archive contains a temp parquet file that is uploaded to the backend (GitHub does not allow attaching .parquet files directly): iss_9207_upload.parquet.zip

Inspecting the file reveals the following:

>>> import fastparquet
>>> FILE = "/path/to/iss_9207_upload.parquet"  # adjust accordingly
>>> pf = fastparquet.ParquetFile(FILE)
>>> print(pf.schema)
- schema: REQUIRED
| - foo: INT64, REQUIRED
  - bar: LIST, REQUIRED
    - list: REPEATED
      - item: DOUBLE, OPTIONAL

The schema in the file is correct, but I suspect that the backend simply compares the parquet type of the bar column with the bigquery.SchemaField instance for the same column and incorrectly concludes that the REPEATED mode in the provided schema does not match the REQUIRED mode in the file:

Field 'bar' is specified as REPEATED in provided schema which does not match REQUIRED as specified in the file.

Will forward this to the backend to have a closer look.

timocb commented Sep 17, 2019

@plamut Is this also the case when using pyarrow?

From this comment I got the idea that fastparquet and pyarrow provide similar functionality and can be used interchangeably.

plamut commented Sep 17, 2019

@timocb Yes, I reproduced the issue with pyarrow installed (the BigQuery client used pyarrow to create a parquet file).

It seems that the generated parquet file that gets uploaded to the backend is correct, but the backend incorrectly concludes that the schema provided does not match the schema in the uploaded file.
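
For anyone who wants to double-check with pyarrow instead of fastparquet, here is a minimal inspection sketch (the file path is illustrative):

import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/iss_9207_upload.parquet")  # adjust accordingly
print(pf.schema)  # parquet-level schema, including repetition levels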

Edit: As a workaround, would loading the data with load_table_from_json() help? It is similar, except that it accepts a list of row dictionaries instead of a dataframe.
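
A rough sketch of that workaround, assuming the client, table_ref, and job_config from the original example:

rows = [
    {"foo": 1, "bar": [2.0, 3.0]},
    {"foo": 2, "bar": [3.0, 4.0]},
    {"foo": 3, "bar": [4.0, 5.0]},
]
client.load_table_from_json(rows, table_ref, job_config=job_config).result()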

timocb commented Sep 17, 2019

@plamut Thanks for confirming. That workaround was indeed what I used!

tswast commented Sep 25, 2019

I filed https://issuetracker.google.com/133415569 a while ago but closed it because I wasn't able to reproduce. Thanks for investigating further.

It seems the problem occurs when a parquet file specifies required and repeated at the same time?

plamut commented Sep 25, 2019

It appears so, yes; everything in the client looks fine at the point when the parquet file is sent in the API request.

tswast commented Sep 25, 2019

Related: https://github.com/googleapis/google-cloud-python/issues/8544 We never actually closed the feature request for ARRAY type, but perhaps we can, since we support it when an explicit schema is provided.

I've re-opened my internal issue, since we're able to reproduce the issue with parquet files generated by fastparquet.

tswast commented Nov 14, 2019

I'm not able to reproduce with the test file provided.

$ bq load --source_format=PARQUET test_dataset.b133415569_parquet iss_9207_upload.parquet
Upload complete.
Waiting on bqjob_r670a6280dcb701f9_0000016e6ae31f56_1 ... (6s) Current status: DONE

Appending to that same table:

$ bq load --source_format=PARQUET test_dataset.b133415569_parquet iss_9207_upload.parquet
Upload complete.
Waiting on bqjob_r2f561503180ee012_0000016e6ae34f6f_1 ... (6s) Current status: DONE

[screenshot: schema of the loaded table]

tswast commented Nov 14, 2019

I see one difference between the two schemas: bar is mode "repeated" in the file Peter provided, but "required" once uploaded.

BigQuery does not support arrays of arrays, but the workaround is to have an array of records with an array field, which is the pattern being followed here.
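
A hedged illustration of that shape with hypothetical field names (ARRAY<STRUCT<inner ARRAY<FLOAT64>>> in SQL terms):

from google.cloud import bigquery

schema = [
    bigquery.SchemaField(
        "outer",
        "RECORD",
        mode="REPEATED",  # the array of records
        fields=[bigquery.SchemaField("inner", "FLOAT", mode="REPEATED")],  # the array field
    ),
]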

tswast commented Nov 14, 2019

Actually, I think this is just a weirdness in how fastparquet encodes lists.

Fastparquet also handles some parquet LIST and MAP types. For instance, the schema may include

| - tags: LIST, OPTIONAL
    - list: REPEATED
       - element: BYTE_ARRAY, UTF8, OPTIONAL

In this case, columns would include an entry “tags”, which evaluates to an object column containing lists of strings. Reading such columns will be relatively slow. If the ‘element’ type is anything other than a primitive type, i.e., a struct, map or list, then fastparquet will not be able to read it, and the resulting column will either not be contained in the output, or contain only None values.

https://fastparquet.readthedocs.io/en/latest/details.html#reading-nested-schema

It's probably worth comparing with how other parquet encoders deal with this case, as I suspect the schema can be less complicated than a list of structs with one item.

tswast commented Nov 14, 2019

Peter, can you try encoding the same data with some other tools (like pyarrow and maybe even the Java Parquet package https://github.com/apache/parquet-mr) and compare the generated schema?

If it's as I suspect and fastparquet is doing something weird, we should file a bug to fastparquet.
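
A minimal sketch of the pyarrow side of such a comparison (output path illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Encode the same dataframe with pyarrow and print the resulting parquet schema.
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})
pq.write_table(pa.Table.from_pandas(df), "/tmp/df_pyarrow.parquet")
print(pq.ParquetFile("/tmp/df_pyarrow.parquet").schema)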

plamut commented Nov 30, 2019

@tswast

I'm not able to reproduce with the test file provided.

Sanity check - was that when loading the data into a brand new table?
I can reproduce the "good" behavior by running "bq load" twice with a table name that initially does not exist. The data is uploaded fine, and the created table schema matches the one in the posted screenshot.

On the other hand, if using an existing table:

[screenshot: schema of the existing table]

... uploading the same file with "bq load" results in an error, albeit a different one:

$ bq load --source_format=PARQUET peter_test_dataset.table_9207 ../reproduce/iss_9207_upload.parquet
Upload complete.
Waiting on bqjob_**********************_1 ... (1s) Current status: DONE   
BigQuery error in load operation: Error processing job '**********:**********': Provided Schema does not match Table ********:peter_test_dataset.table_9207. Field bar has changed type from FLOAT to RECORD

If trying to load the data with the script provided by @timocb, I get the same error message as in the issue description.

@timocb Can you please just confirm that the repeated bar field is indeed defined in the same way as in your own table?


@tswast The sample dataframe was converted to a parquet file (the one that I attached) via pyarrow in dataframe_to_parquet(). This might be the reason that the resulting parquet schema is "too complicated" for what the backend expects (or is it?).

In comparison, here's how fastparquet would encode the same dataframe directly:

import fastparquet
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})
fastparquet.write("/tmp/df_fastparquet.parquet", df)
pf2 = fastparquet.ParquetFile("/tmp/df_fastparquet.parquet")
print(pf2.schema)

- schema:
| - foo: INT64, OPTIONAL
  - bar: BYTE_ARRAY, JSON, OPTIONAL

(loading this into an existing table like the one above would still fail, though)

Like you said, it seems like a weirdness in how REPEATED fields get encoded, which then fails to match the field definition on the backend.

tswast commented Dec 2, 2019

Sanity check - was that when loading the data into a brand new table?

It was.

We aren’t encountering this when using the pyarrow code path to encode the data frame as a parquet file, right? This leads me to believe that what fastparquet is doing with the extra record datatype is unnecessary and certainly unexpected.

plamut commented Dec 2, 2019

We aren’t encountering this when using the pyarrow code path to encode the data frame as a parquet file, right?

Actually, we are, and the resulting parquet file created in that code path is the one that I initially attached. The schema in it simply seems a bit too complex for an existing REPEATED FLOAT field on the backend.

timocb commented Dec 3, 2019

@timocb Can you please just confirm that the repeated bar field is indeed defined in the same way as in your own table?

@plamut I can confirm this is the case.

plamut transferred this issue from googleapis/google-cloud-python on Feb 4, 2020
product-auto-label bot added the "api: bigquery" label on Feb 4, 2020
plamut added the "priority: p2" and "type: bug" labels on Feb 4, 2020

plamut commented Feb 10, 2020

It really does seem that encoding REPEATED fields in parquet format is unusual, at least in the Python parquet implementations, resulting in a schema mismatch with existing tables. I added a note to load_table_from_dataframe()'s docstring in #35.
