Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fastavro version used in Feast to avoid Timestamp delta error #490

Merged
merged 2 commits into from
Feb 25, 2020

Conversation

davidheryanto
Copy link
Collaborator

@davidheryanto davidheryanto commented Feb 24, 2020

What this PR does / why we need it:
This fixes end to end batch test
#488 (comment)
http://prow.feast.ai/view/gcs/feast-templocation-kf-feast/pr-logs/pull/gojek_feast/488/test-end-to-end-batch/1231860946993942532

Which issue(s) this PR fixes:

NONE

Does this PR introduce a user-facing change?:

NONE

This PR also added debugging info of all Python packages version used during end to end test, when the test fails, to aid debugging failed tests.

NOTE
Probably Feast Python SDK should be updated so it can work with fastavro version 0.22.10 and newer rather than fixing it to a specific version of fastavro?

@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidheryanto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@scottbelden
Copy link

@davidheryanto This sounds like it could be a bug in fastavro caused by some new changes. As the maintainer of that project I'd like to see if we can fix that. Do you think you could provide some details on the problem? Thanks!

@davidheryanto
Copy link
Collaborator Author

davidheryanto commented Feb 25, 2020

Hi @scottbelden, thanks for checking. Actually I'm not exactly sure what is the root cause, but I know this Pandas cython module is where the exception is thrown pandas/_libs/tslibs/c_timestamp.pyx

To reproduce the error, assuming you have Docker installed:

docker run --rm -it continuumio/miniconda3:4.7.12 bash

# Once in the shell in the container
# pandavro depends on fastavro
pip install pandavro==1.5

python <<EOF
from pandavro import to_avro
from datetime import datetime
import pytz
import pandas as pd

df = pd.DataFrame({"datetime": [datetime.utcnow().replace(tzinfo=pytz.utc)]})
to_avro("/tmp/out.avro", df)
EOF

Will throw this error with stacktrace

Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "/opt/conda/lib/python3.7/site-packages/pandavro/__init__.py", line 151, in to_avro
    records=df.to_dict('records'), codec=codec)
  File "/opt/conda/lib/python3.7/site-packages/fastavro/_write_py.py", line 579, in writer
    output.write(record)
  File "/opt/conda/lib/python3.7/site-packages/fastavro/_write_py.py", line 421, in write
    write_data(self.io, record, self.schema)
  File "/opt/conda/lib/python3.7/site-packages/fastavro/_write_py.py", line 234, in write_data
    return fn(encoder, datum, schema)
  File "/opt/conda/lib/python3.7/site-packages/fastavro/_write_py.py", line 189, in write_record
    name, field.get('default')), field['type'])
  File "/opt/conda/lib/python3.7/site-packages/fastavro/_write_py.py", line 234, in write_data
    return fn(encoder, datum, schema)
  File "/opt/conda/lib/python3.7/site-packages/fastavro/_write_py.py", line 154, in write_union
    if validate(datum, candidate, raise_errors=False):
  File "/opt/conda/lib/python3.7/site-packages/fastavro/_validation_py.py", line 369, in validate
    datum = prepare(datum, schema)
  File "/opt/conda/lib/python3.7/site-packages/fastavro/_logical_writers_py.py", line 44, in prepare_timestamp_micros
    delta = (data - epoch)
  File "pandas/_libs/tslibs/c_timestamp.pyx", line 296, in pandas._libs.tslibs.c_timestamp._Timestamp.__sub__
TypeError: Timestamp subtraction must have the same timezones or no timezones

If i downgrade, to fastavro==0.22.9 the timestamp subtraction error is gone.

Strangely when I do not use Python from miniconda, say I use the Python official Docker image, it's ok. The timestamp subtraction error is not there.

docker run --rm -it python:3.7 bash

pip install pandavro==1.5

python <<EOF
from pandavro import to_avro
from datetime import datetime
import pytz
import pandas as pd

df = pd.DataFrame({"datetime": [datetime.utcnow().replace(tzinfo=pytz.utc)]})
to_avro("/tmp/out.avro", df)
EOF

I attach the output of pip list in these 2 occasions although I don't notice anything wrong.

From the miniconda that throws error

Package                Version  
---------------------- ---------
asn1crypto             1.0.1    
certifi                2019.9.11
cffi                   1.12.3   
chardet                3.0.4    
conda                  4.7.12   
conda-package-handling 1.6.0    
cryptography           2.7      
fastavro               0.22.10  
idna                   2.8      
numpy                  1.18.1   
pandas                 1.0.1    
pandavro               1.5.0    
pip                    19.2.3   
pycosat                0.6.3    
pycparser              2.19     
pyOpenSSL              19.0.0   
PySocks                1.7.1    
python-dateutil        2.8.1    
pytz                   2019.3   
requests               2.22.0   
ruamel-yaml            0.15.46  
setuptools             41.4.0   
six                    1.12.0   
tqdm                   4.36.1   
urllib3                1.24.2   
wheel                  0.33.6

From the official Python Docker image that does not throw error

Package         Version
--------------- -------
fastavro        0.22.10
numpy           1.18.1 
pandas          1.0.1  
pandavro        1.5.0  
pip             19.0.3 
python-dateutil 2.8.1  
pytz            2019.3 
setuptools      40.8.0 
six             1.14.0 
wheel           0.33.1

@woop
Copy link
Member

woop commented Feb 25, 2020

/lgtm

@scottbelden
Copy link

@davidheryanto The difference you are seeing is a slight difference in our python vs. cython implementation. However, it seems like the source of the error is more so because we create our own version of a UTC tzinfo. I have a PR that uses pytz instead for this and I believe it should resolve this issue but I'm still doing some testing.

@davidheryanto
Copy link
Collaborator Author

Thanks @scottbelden for the update.

davidheryanto added a commit to davidheryanto/feast that referenced this pull request Feb 26, 2020
…ast-dev#490)

* Fix fastavro version used in feast to 0.22.9

* Print python packages version used when e2e test fails

(cherry picked from commit 7586da6)
zhilingc pushed a commit that referenced this pull request Feb 26, 2020
* Fix fastavro version used in feast to 0.22.9

* Print python packages version used when e2e test fails

(cherry picked from commit 7586da6)
@scottbelden
Copy link

@davidheryanto I published version 0.22.11 which I believe should resolve this issue. I tried the docker test from above and does not produce the error with the new version.

@davidheryanto
Copy link
Collaborator Author

Thanks @scottbelden
The new update has fixed the datetime subtraction due to timezone error :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants