BigQuery: Field 'bar' is specified as REPEATED in provided schema which does not match REQUIRED as specified in the file. #17
I can confirm: I managed to reproduce the issue as described (thanks for the good self-contained example!).

Update: FWIW, table partitioning does not seem to play a role here; the issue is also reproducible with a non-partitioned table.
I suspect this is a backend issue. The following archive contains the temp parquet file that is uploaded to the backend (GitHub does not allow attaching parquet files directly). Inspecting the file reveals the following:

```
>>> import fastparquet
>>> FILE = "/path/to/iss_9207_upload.parquet"  # adjust accordingly
>>> pf = fastparquet.ParquetFile(FILE)
>>> print(pf.schema)
- schema: REQUIRED
| - foo: INT64, REQUIRED
| - bar: LIST, REQUIRED
|   - list: REPEATED
|     - item: DOUBLE, OPTIONAL
```

The schema in the file is correct, but I suspect that the backend simply compares the parquet type of the outer `bar` group (REQUIRED in the file) against the REPEATED mode in the provided schema, instead of recognizing the nested LIST structure, hence the reported mismatch.
Will forward this to the backend to have a closer look.
@plamut Is this also the case when using …? From this comment I got the idea that ….
@timocb Yes, I reproduced the issue with …. It seems that the generated parquet file that gets uploaded to the backend is correct, but the backend incorrectly concludes that the provided schema does not match the schema in the uploaded file.

Edit: As a workaround, would loading the data with `load_table_from_json()` help? It is similar, except that it accepts a list of row dictionaries instead of a dataframe.
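For reference, a minimal sketch of that workaround. The table ID and schema below are placeholders (not from this thread), and the actual upload call is commented out since it needs credentials:

```python
import pandas as pd

# Sample dataframe with a repeated (list-valued) column, as in the issue.
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})

# load_table_from_json() accepts a list of row dicts instead of a dataframe,
# which sidesteps the parquet serialization entirely.
rows = df.to_dict(orient="records")

# The upload itself would look roughly like this (requires credentials;
# "my_dataset.my_table" is a placeholder):
# from google.cloud import bigquery
# client = bigquery.Client()
# job_config = bigquery.LoadJobConfig(schema=[
#     bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED"),
#     bigquery.SchemaField("bar", "FLOAT", mode="REPEATED"),
# ])
# client.load_table_from_json(rows, "my_dataset.my_table", job_config=job_config).result()
```

Since the rows are sent as JSON, the repeated field is just a plain Python list in each row dict.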
@plamut Thanks for confirming. That workaround was indeed what I used!
I filed https://issuetracker.google.com/133415569 a while ago but closed it because I wasn't able to reproduce. Thanks for investigating further. It seems the problem occurs when a parquet file specifies required and repeated at the same time?
It appears so, yes, as things on the client side look fine at the point when the parquet file is sent in the API request.
Related: https://github.com/googleapis/google-cloud-python/issues/8544

We never actually closed the feature request for the ARRAY type, but perhaps we can, since we support it when an explicit schema is provided. I've re-opened my internal issue, since we're able to reproduce the issue with parquet files generated by fastparquet.
I'm not able to reproduce with the test file provided. Appending to that same table: …
I see one difference in the two schemas: `bar` has mode "repeated" in the file Peter provided, but "required" once uploaded. BigQuery does not support arrays of arrays, but the workaround is to have an array of records with an array field, which is being followed here.
Actually, I think this is just a weirdness in how fastparquet encodes lists: https://fastparquet.readthedocs.io/en/latest/details.html#reading-nested-schema

It's probably worth comparing with how other parquet encoders deal with this case, as I suspect the schema can be less complicated than a list of structs with one item.
Peter, can you try encoding the same data with some other tools (like pyarrow, and maybe even the Java Parquet package https://github.com/apache/parquet-mr) and compare the generated schemas? If it's as I suspect and fastparquet is doing something weird, we should file a bug against fastparquet.
Sanity check: was that when loading the data into a brand new table? On the other hand, if using an existing table: … uploading the same file with `bq load` results in an error, although a different one:
If trying to load the data with the script provided by @timocb, I get the same error message as in the issue description.

@timocb Can you please just confirm that the repeated …?

@tswast The sample dataframe was converted to a parquet file (the one that I attached) via …. In comparison, here's how fastparquet serializes the same dataframe with default settings:

```python
import fastparquet
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})
fastparquet.write("/tmp/df_fastparquet.parquet", df)

pf2 = fastparquet.ParquetFile("/tmp/df_fastparquet.parquet")
print(pf2.schema)
```

```
- schema:
| - foo: INT64, OPTIONAL
| - bar: BYTE_ARRAY, JSON, OPTIONAL
```

(Loading this into an existing table like the one above would still fail, though.)

Like you said, it seems like a weirdness with how REPEATED fields get encoded, which then fails to match the definition on the backend.
It was. We aren’t encountering this when using the pyarrow code path to encode the data frame as a parquet file, right? This leads me to believe that what fastparquet is doing with the extra record datatype is unnecessary and certainly unexpected.
Actually, we are, and the resulting ….
It really does seem that encoding REPEATED fields in ….
I have an issue where it is not possible to upload a pandas DataFrame with a repeated field to BigQuery. It is very much related to an issue I had earlier: googleapis/google-cloud-python#8093
Since that has been resolved (by being able to specify the schema), I've created a separate issue. I also couldn't find existing issues related to REPEATED fields.
Environment details
Mac OS X 10.14.5
Python 3.6.8
Packages:
Steps to reproduce
Also: … in the JobConfig doesn't change the error.
Code example
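The original code example did not survive extraction; a minimal reproduction consistent with the description above might look like the following. The table ID is a placeholder, and the client calls are commented out because they require credentials:

```python
import pandas as pd

# Dataframe with a repeated (list-valued) field "bar".
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [[2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]})

# The upload that triggers the error (requires credentials;
# "my_dataset.my_table" is a placeholder):
# from google.cloud import bigquery
# client = bigquery.Client()
# job_config = bigquery.LoadJobConfig(schema=[
#     bigquery.SchemaField("foo", "INTEGER", mode="REQUIRED"),
#     bigquery.SchemaField("bar", "FLOAT", mode="REPEATED"),
# ])
# client.load_table_from_dataframe(
#     df, "my_dataset.my_table", job_config=job_config
# ).result()
```

With a REPEATED field in the schema, this path serializes the dataframe to parquet before upload, which is where the schema mismatch in the issue title appears.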
Stack trace