
A system test for load_table_from_dataframe() consistently fails on master branch #61

Closed
plamut opened this issue Mar 17, 2020 · 14 comments
Labels: api: bigquery · external · priority: p2 · testing · type: process

Comments

@plamut
Contributor

plamut commented Mar 17, 2020

A system test test_load_table_from_dataframe_w_explicit_schema() consistently fails on the latest master branch, both under Python 2.7 and Python 3.8 (example Kokoro run). It is also consistently reproducible locally.

...
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Invalid datetime value -62135596800000000 for field 'dt_col' of type 'INT64' (logical type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of -62135596800000000 microseconds relative to the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]

Ticket in the Google issue tracker: https://issuetracker.google.com/issues/151765076
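
For reference, a minimal sketch that reproduces the failure outside the test suite (not the actual system test; the destination table ID is illustrative and assumes a configured client and an existing dataset):

import datetime

import pandas
from google.cloud import bigquery

client = bigquery.Client()

# dtype="object" keeps the out-of-range datetimes from being coerced to datetime64[ns].
dataframe = pandas.DataFrame(
    {
        "dt_col": [
            datetime.datetime(1, 1, 1),
            None,
            datetime.datetime(9999, 12, 31, 23, 59, 59, 999999),
        ]
    },
    dtype="object",
)
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("dt_col", "DATETIME")],
)
load_job = client.load_table_from_dataframe(
    dataframe, "my_dataset.my_table", job_config=job_config
)
load_job.result()  # raises google.api_core.exceptions.BadRequest (400), as above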

@plamut plamut added type: process A process-related concern. May include testing, release, or the like. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. testing labels Mar 17, 2020
@plamut plamut self-assigned this Mar 17, 2020
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Mar 17, 2020
@plamut
Contributor Author

plamut commented Mar 17, 2020

@shollyman Have there been any changes on the backend in the last week or so? The system tests passed fine when #58 was initially submitted, but now fail on the latest master.

Specifically, have there been any changes to handling the TIMESTAMP_MICROS logical type?

@shollyman
Contributor

Let's grab the Avro file that's created indirectly by the test and use it to file an issue with the backend team, to help them reproduce it: https://issuetracker.google.com/issues/new?component=187149&template=0.

The Kokoro log:
BadRequest: 400 Error while reading data, error message: Invalid datetime value -62135596800000000 for field 'dt_col' of type 'INT64' (logical type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of -62135596800000000 microseconds relative to the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]

However, this is the value for the minimum allowed datetime, and the query engine agrees:
SELECT TIMESTAMP_MICROS(-62135596800000000) ==> 0001-01-01 00:00:00 UTC
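
A quick sanity check of the same value with Python's (proleptic Gregorian) datetime arithmetic gives the same answer:

>>> import datetime
>>> datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=-62135596800000000)
datetime.datetime(1, 1, 1, 0, 0)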

Peter, can you take care of this?

@plamut
Contributor Author

plamut commented Mar 17, 2020

OK, I'll intercept the file that the test generates and submits to the backend, and open a ticket in the issue tracker.

Edit: The ticket - https://issuetracker.google.com/issues/151765076
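
For anyone who wants to produce a similar file by hand, a rough sketch with pyarrow (roughly what the client uses under the hood to serialize the dataframe; the output path is a placeholder):

>>> import datetime
>>> import pyarrow
>>> import pyarrow.parquet
>>> dt_col = pyarrow.array(
...     [datetime.datetime(1, 1, 1), None, datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)],
...     type=pyarrow.timestamp("us"),
... )
>>> table = pyarrow.Table.from_arrays([dt_col], names=["dt_col"])
>>> pyarrow.parquet.write_table(table, "/path/to/iss_61.parquet")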

@plamut
Contributor Author

plamut commented Mar 25, 2020

This is interesting. When exploring the parquet file that is uploaded to the backend (the one attached to the issuetracker issue), I noticed the following:

>>> import fastparquet as fp
>>> filename = "/path/to/iss_61.parquet"
>>> pfile = fp.ParquetFile(filename)
>>> pfile.to_pandas()
                         dt_col
0 1754-08-30 22:43:41.128654848
1                           NaT
2 1816-03-30 05:56:08.066276376

The timestamps are incorrect; they should be 0001-01-01 00:00:00.000000 and 9999-12-31 23:59:59.999999 (I'm using the latest fastparquet, 0.3.3). Maybe the uploaded parquet file itself is incorrect and something broke it recently without us knowing, or my local conversion back to Pandas messes up the dates. 😕


Update:
I also tried reading the file with pyarrow.parquet, and converting it to a dict actually produces the correct result:

>>> import pyarrow.parquet
>>> filename = "/path/to/iss_61.parquet"
>>> pfile_pyarrow = pyarrow.parquet.ParquetFile(filename)
>>> pyarrow_table = pfile_pyarrow.read()
>>> pyarrow_table.to_pydict()
{'dt_col': [datetime.datetime(1, 1, 1, 0, 0),
  None,
  datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)]}

However, converting pyarrow.Table to a dataframe also produces weird datetime values:

>>> pyarrow_table.to_pandas()
                         dt_col
0 1754-08-30 22:43:41.128654848
1                           NaT
2 1816-03-30 05:56:08.066276376

Pyarrow docs mention that "it is not possible to convert all column types unmodified". I don't know how the parquet file is read on the backend, but if Pandas is used, circumventing it might be the solution.
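
For what it's worth, the bogus 1754 value looks exactly like an int64 wrap-around when the microsecond timestamps are converted to nanoseconds for pandas. A back-of-the-envelope check in plain Python (emulating the overflow, not the actual pandas/pyarrow code path):

>>> micros = -62135596800000000                # 0001-01-01 00:00:00 in microseconds since the epoch
>>> nanos = micros * 1000                      # -62135596800000000000, outside the int64 range
>>> wrapped = (nanos + 2**63) % 2**64 - 2**63  # emulate a silent int64 wrap-around
>>> wrapped
-6795364578871345152
>>> import datetime
>>> datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=wrapped // 1000)
datetime.datetime(1754, 8, 30, 22, 43, 41, 128654)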

@emkornfield

@plamut The weird values from Arrow to Pandas are because of https://issues.apache.org/jira/browse/ARROW-5359 (it seems like this is incorrect behavior in the short term, since I think by default this should raise an error).
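
With a pyarrow build that includes the fix, the lossy conversion can be avoided by keeping the values as plain Python datetimes (a sketch, assuming the timestamp_as_object option that issue introduces):

>>> df = pyarrow_table.to_pandas(timestamp_as_object=True)
>>> df["dt_col"][0], df["dt_col"][2]
(datetime.datetime(1, 1, 1, 0, 0), datetime.datetime(9999, 12, 31, 23, 59, 59, 999999))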

@plamut
Contributor Author

plamut commented Apr 2, 2020

@emkornfield I see, thanks for the info. This might spare some debugging time on our end.

@meredithslota
Contributor

https://issues.apache.org/jira/browse/ARROW-5359 is marked as "Fixed" now. The internal ticket https://issuetracker.google.com/issues/151765076 has not been addressed.

@emkornfield

@plamut plamut removed their assignment Oct 1, 2020
@HemangChothani
Contributor

@emkornfield https://jira.apache.org/jira/browse/ARROW-2587 is also marked as "Fixed" now. Any update on the internal ticket? The test is still failing.

@tswast
Contributor

tswast commented Oct 19, 2020

Internal issue 166476249 covers loading DATETIME in Parquet files (#56)

@HemangChothani It sounds like we should update this system test to avoid DATETIME columns until the backend can support them.
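
Something along these lines, presumably (a sketch of the workaround rather than an actual patch; the field names are illustrative):

# Leave the DATETIME column out of the test's explicit schema until the
# backend supports DATETIME in Parquet loads.
from google.cloud import bigquery

table_schema = (
    bigquery.SchemaField("bool_col", "BOOLEAN"),
    bigquery.SchemaField("ts_col", "TIMESTAMP"),
    # bigquery.SchemaField("dt_col", "DATETIME"),  # blocked on internal issue 166476249
)
job_config = bigquery.LoadJobConfig(schema=table_schema)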

@meredithslota
Contributor

Are we still blocked on this?

@plamut
Contributor Author

plamut commented Jan 15, 2021

I checked again today from the client's perspective (but I don't have insight into the status on the backend).

Update:
Uploading DATETIME fields does not seem to be supported yet; the system test still fails if I uncomment the DATETIME column in the schema.

What's the priority of this on the backend, anyway? P1 or lower than that? (to align this ticket's priority with it)

@tswast tswast added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Jan 22, 2021
@tswast
Contributor

tswast commented Jan 22, 2021

To work around this, we added CSV as a serialization format. But yes, we are blocked on the backend, as it doesn't support DATETIME for Parquet yet.

Issue 166476249 is marked as P1, but no one has touched it yet, so I suspect it's being treated as lower priority.
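
The CSV path looks roughly like this (a sketch, assuming a client library version that supports CSV serialization in load_table_from_dataframe; the table ID is illustrative):

import datetime

import pandas
from google.cloud import bigquery

client = bigquery.Client()
dataframe = pandas.DataFrame(
    {"dt_col": [datetime.datetime(1, 1, 1), None,
                datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)]},
    dtype="object",
)
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("dt_col", "DATETIME")],
    source_format=bigquery.SourceFormat.CSV,  # bypass the Parquet code path
)
client.load_table_from_dataframe(
    dataframe, "my_dataset.my_table", job_config=job_config
).result()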

@tswast tswast assigned tswast and unassigned shollyman Aug 23, 2021
@tswast
Contributor

tswast commented Aug 23, 2021

Closing as a duplicate of #56

@tswast tswast closed this as completed Aug 23, 2021