
A system test for load_table_from_dataframe() consistently fails on master branch #61

Closed
plamut opened this issue Mar 17, 2020 · 14 comments
Labels: api: bigquery · external · priority: p2 · testing · type: process

Comments

@plamut
Contributor

plamut commented Mar 17, 2020

A system test test_load_table_from_dataframe_w_explicit_schema() consistently fails on the latest master branch, both under Python 2.7 and Python 3.8 (example Kokoro run). It is also consistently reproducible locally.

...
google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Invalid datetime value -62135596800000000 for field 'dt_col' of type 'INT64' (logical type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of -62135596800000000 microseconds relative to the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]

Ticket in the Google issue tracker: https://issuetracker.google.com/issues/151765076
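
For reference, a minimal sketch that reproduces the failure outside the test suite (not the actual system test; the destination table ID is illustrative and assumes a configured client and an existing dataset):

import datetime

import pandas
from google.cloud import bigquery

client = bigquery.Client()

# dtype="object" keeps the out-of-range datetimes from being coerced to datetime64[ns].
dataframe = pandas.DataFrame(
    {
        "dt_col": [
            datetime.datetime(1, 1, 1),
            None,
            datetime.datetime(9999, 12, 31, 23, 59, 59, 999999),
        ]
    },
    dtype="object",
)
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("dt_col", "DATETIME")],
)
load_job = client.load_table_from_dataframe(
    dataframe, "my_dataset.my_table", job_config=job_config
)
load_job.result()  # raises google.api_core.exceptions.BadRequest (400), as above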

@plamut plamut added type: process A process-related concern. May include testing, release, or the like. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. testing labels Mar 17, 2020
@plamut plamut self-assigned this Mar 17, 2020
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Mar 17, 2020
@plamut
Contributor Author

plamut commented Mar 17, 2020

@shollyman Have there been any changes on the backend in the last week or so? The system tests passed fine when #58 was initially submitted, but now fail on the latest master.

Specifically, have there been any changes to handling the TIMESTAMP_MICROS logical type?

@shollyman
Contributor

Let's grab the Avro file that's created indirectly by the test and use it to file an issue with the backend team, to help them reproduce it: https://issuetracker.google.com/issues/new?component=187149&template=0.

The Kokoro log:
BadRequest: 400 Error while reading data, error message: Invalid datetime value -62135596800000000 for field 'dt_col' of type 'INT64' (logical type 'TIMESTAMP_MICROS'): generic::out_of_range: Cannot return an invalid datetime value of -62135596800000000 microseconds relative to the Unix epoch. The range of valid datetime values is [0001-01-1 00:00:00, 9999-12-31 23:59:59.999999]

However, this is the value for the minimum allowed datetime, and the query engine agrees:
SELECT TIMESTAMP_MICROS(-62135596800000000) ==> 0001-01-01 00:00:00 UTC
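
A quick sanity check of the same value with Python's (proleptic Gregorian) datetime arithmetic gives the same answer:

>>> import datetime
>>> datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=-62135596800000000)
datetime.datetime(1, 1, 1, 0, 0)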

Peter, can you take care of this?

@plamut
Contributor Author

plamut commented Mar 17, 2020

OK, I'll intercept the file that the test generates and submits to the backend, and open a ticket in the issue tracker.

Edit: The ticket - https://issuetracker.google.com/issues/151765076
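
For anyone who wants to produce a similar file by hand, a rough sketch with pyarrow (roughly what the client uses under the hood to serialize the dataframe; the output path is a placeholder):

>>> import datetime
>>> import pyarrow
>>> import pyarrow.parquet
>>> dt_col = pyarrow.array(
...     [datetime.datetime(1, 1, 1), None, datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)],
...     type=pyarrow.timestamp("us"),
... )
>>> table = pyarrow.Table.from_arrays([dt_col], names=["dt_col"])
>>> pyarrow.parquet.write_table(table, "/path/to/iss_61.parquet")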

@plamut
Contributor Author

plamut commented Mar 25, 2020

This is interesting. When exploring the parquet file that is uploaded to the backend (the one attached to the issuetracker issue), I noticed the following:

>>> import fastparquet as fp
>>> filename = "/path/to/iss_61.parquet"
>>> pfile = fp.ParquetFile(filename)
>>> pfile.to_pandas()
                         dt_col
0 1754-08-30 22:43:41.128654848
1                           NaT
2 1816-03-30 05:56:08.066276376

The timestamps are incorrect; they should be 0001-01-01 00:00:00.000000 and 9999-12-31 23:59:59.999999 (I'm using the latest fastparquet, 0.3.3). Maybe the uploaded parquet file itself is incorrect and something broke it recently without us knowing, or my local conversion back to Pandas messes up the dates. 😕


Update:
I also tried reading the file with pyarrow.parquet, and converting it to a dict actually produces the correct result:

>>> import pyarrow.parquet
>>> filename = "/path/to/iss_61.parquet"
>>> pfile_pyarrow = pyarrow.parquet.ParquetFile(filename)
>>> pyarrow_table = pfile_pyarrow.read()
>>> pyarrow_table.to_pydict()
{'dt_col': [datetime.datetime(1, 1, 1, 0, 0),
  None,
  datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)]}

However, converting pyarrow.Table to a dataframe also produces weird datetime values:

>>> pyarrow_table.to_pandas()
                         dt_col
0 1754-08-30 22:43:41.128654848
1                           NaT
2 1816-03-30 05:56:08.066276376

Pyarrow docs mention that "it is not possible to convert all column types unmodified". I don't know how the parquet file is read on the backend, but if Pandas is used, circumventing it might be the solution.
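
For what it's worth, the bogus 1754 value looks exactly like an int64 wrap-around when the microsecond timestamps are converted to nanoseconds for pandas. A back-of-the-envelope check in plain Python (emulating the overflow, not the actual pandas/pyarrow code path):

>>> micros = -62135596800000000                # 0001-01-01 00:00:00 in microseconds since the epoch
>>> nanos = micros * 1000                      # -62135596800000000000, outside the int64 range
>>> wrapped = (nanos + 2**63) % 2**64 - 2**63  # emulate a silent int64 wrap-around
>>> wrapped
-6795364578871345152
>>> import datetime
>>> datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=wrapped // 1000)
datetime.datetime(1754, 8, 30, 22, 43, 41, 128654)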

@emkornfield

@plamut The weird values from Arrow to Pandas are because of https://issues.apache.org/jira/browse/ARROW-5359 (it seems like this is incorrect behavior in the short term, since I think by default this should raise an error).
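
With a pyarrow build that includes the fix, the lossy conversion can be avoided by keeping the values as plain Python datetimes (a sketch, assuming the timestamp_as_object option that issue introduces):

>>> df = pyarrow_table.to_pandas(timestamp_as_object=True)
>>> df["dt_col"][0], df["dt_col"][2]
(datetime.datetime(1, 1, 1, 0, 0), datetime.datetime(9999, 12, 31, 23, 59, 59, 999999))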

@plamut
Contributor Author

plamut commented Apr 2, 2020

@emkornfield I see, thanks for the info. This might spare some debugging time on our end.

@meredithslota
Contributor

https://issues.apache.org/jira/browse/ARROW-5359 is marked as "Fixed" now. The internal ticket https://issuetracker.google.com/issues/151765076 has not been addressed.

@emkornfield

@plamut plamut removed their assignment Oct 1, 2020
@HemangChothani
Contributor

@emkornfield https://jira.apache.org/jira/browse/ARROW-2587 is also marked as "Fixed" now. Any update on the internal ticket? The test is still failing.

@tswast
Contributor

tswast commented Oct 19, 2020

Internal issue 166476249 covers loading DATETIME in Parquet files (#56)

@HemangChothani It sounds like we should update this system test to avoid DATETIME columns until the backend can support them.
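
Something along these lines, presumably (a sketch of the workaround rather than an actual patch; the field names are illustrative):

# Leave the DATETIME column out of the test's explicit schema until the
# backend supports DATETIME in Parquet loads.
from google.cloud import bigquery

table_schema = (
    bigquery.SchemaField("bool_col", "BOOLEAN"),
    bigquery.SchemaField("ts_col", "TIMESTAMP"),
    # bigquery.SchemaField("dt_col", "DATETIME"),  # blocked on internal issue 166476249
)
job_config = bigquery.LoadJobConfig(schema=table_schema)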

@meredithslota
Contributor

Are we still blocked on this?

@plamut
Contributor Author

plamut commented Jan 15, 2021

I checked again today from the client's perspective (but I don't have insight into the status on the backend).

Update:
Uploading DATETIME fields does not seem to be supported yet; the system test still fails if I uncomment the DATETIME column in the schema.

What's the priority of this on the backend, anyway? P1 or lower than that? (to align this ticket's priority with it)

@tswast tswast added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Jan 22, 2021
@tswast
Contributor

tswast commented Jan 22, 2021

To work around this, we added CSV as a serialization format. But yes, we are blocked on the backend, as it doesn't support DATETIME for Parquet yet.

Issue 166476249 is marked as P1, but no one has touched it yet, so I suspect it's being treated as lower priority.
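
The CSV path looks roughly like this (a sketch, assuming a client library version that supports CSV serialization in load_table_from_dataframe; the table ID is illustrative):

import datetime

import pandas
from google.cloud import bigquery

client = bigquery.Client()
dataframe = pandas.DataFrame(
    {"dt_col": [datetime.datetime(1, 1, 1), None,
                datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)]},
    dtype="object",
)
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("dt_col", "DATETIME")],
    source_format=bigquery.SourceFormat.CSV,  # bypass the Parquet code path
)
client.load_table_from_dataframe(
    dataframe, "my_dataset.my_table", job_config=job_config
).result()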

@tswast tswast assigned tswast and unassigned shollyman Aug 23, 2021
@tswast
Contributor

tswast commented Aug 23, 2021

Closing as a duplicate of #56

@tswast tswast closed this as completed Aug 23, 2021