Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Dask zero division error if parquet dataset has only one partition #3236

Merged

Conversation

mzwiessele
Copy link
Contributor

@mzwiessele mzwiessele commented Sep 20, 2022

Signed-off-by: Max Zwiessele ibinbei@gmail.com

What this PR does / why we need it:
When loading data in parquet dataset format it can happen that the dataset only has one partition in the event_timestamp column. If that is the case, dask will fail to process the dataset, erroring with a ZeroDividionError similar to

# try-catch block is added to deal with this issue https://github.com/dask/dask/issues/8939.

This PR adds a try-catch block to gracefully circumvent the error and process the data in only one partition.

Which issue(s) this PR fixes:

N/A

@mzwiessele mzwiessele changed the title fix: dask zero division error if parquet dataset has only one partition fix: Dask zero division error if parquet dataset has only one partition Sep 20, 2022
@mzwiessele mzwiessele force-pushed the zero_division_dask_drop_duplicates branch 3 times, most recently from f1736d3 to 225910e Compare September 20, 2022 16:16
@mzwiessele
Copy link
Contributor Author

/assign @woop

Copy link
Member

@achals achals left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this case?

@mzwiessele
Copy link
Contributor Author

mzwiessele commented Sep 21, 2022

Could you add a test for this case?

Yes, I need help though. We need a test folder with a parquet dataset in the test S3 bucket. That parquet dataset must have only one partition in the event_timestamp column. The reason is simply that this code correction fixes exactly that edge case.

Related to #3235

@mzwiessele
Copy link
Contributor Author

Before merging this: Is it possible to update the Dask version feast relies on? Or in other words, why is the version restricted like this?

"dask>=2021.*,<2022.02.0"

@mzwiessele mzwiessele force-pushed the zero_division_dask_drop_duplicates branch from 225910e to f75f44a Compare September 23, 2022 09:09
@mzwiessele
Copy link
Contributor Author

@achals I'll need help with the tests. Happy to do the ground work. Please point me to the right testing suite (in the unit tests) for loading a local parquet file from a FileSource.

@mzwiessele mzwiessele force-pushed the zero_division_dask_drop_duplicates branch from 7e34e2f to 393bf5c Compare September 23, 2022 16:06
@codecov-commenter
Copy link

codecov-commenter commented Sep 23, 2022

Codecov Report

Base: 67.50% // Head: 58.12% // Decreases project coverage by -9.38% ⚠️

Coverage data is based on head (393bf5c) compared to base (b48d36b).
Patch coverage: 45.45% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3236      +/-   ##
==========================================
- Coverage   67.50%   58.12%   -9.39%     
==========================================
  Files         179      213      +34     
  Lines       16371    17832    +1461     
==========================================
- Hits        11051    10364     -687     
- Misses       5320     7468    +2148     
Flag Coverage Δ
integrationtests ?
unittests 58.12% <45.45%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdk/python/tests/utils/e2e_test_validation.py 55.44% <ø> (-33.67%) ⬇️
...ation/feature_repos/universal/data_sources/file.py 45.63% <40.00%> (-24.27%) ⬇️
sdk/python/feast/infra/offline_stores/file.py 72.90% <50.00%> (-22.57%) ⬇️
...sts/integration/registration/test_universal_cli.py 20.20% <0.00%> (-79.80%) ⬇️
...ts/integration/offline_store/test_offline_write.py 26.08% <0.00%> (-73.92%) ⬇️
...fline_store/test_universal_historical_retrieval.py 28.75% <0.00%> (-71.25%) ⬇️
...ests/integration/e2e/test_python_feature_server.py 29.50% <0.00%> (-70.50%) ⬇️
...dk/python/tests/integration/e2e/test_validation.py 27.55% <0.00%> (-69.30%) ⬇️
...s/integration/registration/test_universal_types.py 32.25% <0.00%> (-67.75%) ⬇️
sdk/python/feast/infra/online_stores/redis.py 28.39% <0.00%> (-66.67%) ⬇️
... and 169 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@mzwiessele
Copy link
Contributor Author

@achals I've added the test, please take a look and tell me if this does the trick!

@felixwang9817 felixwang9817 assigned felixwang9817 and unassigned woop Sep 24, 2022
@felixwang9817
Copy link
Collaborator

hey @mzwiessele thanks for the PR! a few responses to your comments above:

Yes, I need help though. We need a test folder with a parquet dataset in the test S3 bucket. That parquet dataset must have only one partition in the event_timestamp column. The reason is simply that this code correction fixes exactly that edge case.

unless I'm misunderstanding, there's no need for the parquet dataset to be in an S3 bucket, right? the only thing that matters is that the Parquet dataset doesn't have npartitions set

if that's correct, I think the best way to write a test would be to do it locally - you can check out our unit tests in sdk/python/feast/infra/tests/unit, all of which run locally. I would try to imitate test_e2e_local in test_e2e_local.py - it's a good example of setting up a dataset locally, constructing a feature repo + FeatureStore object, and then running some tests, in your case I think you would just want to check that get_historical_features works on a Parquet dataset with npartitions unset

Before merging this: Is it possible to update the Dask version feast relies on? Or in other words, why is the version restricted like this?

I forget exactly why we restricted that version; I'll go back and check, and if there's no strong reason I'm happy to bump up the upper bound restriction (although I think that can happen in a follow up PR)

@felixwang9817
Copy link
Collaborator

@mzwiessele also left some additional comments for you on #3235!

@mzwiessele
Copy link
Contributor Author

@felixwang9817 I've added the dataset source to the test suite as suggested at #3235 to test loading of parquet datasets.

Do you know if any of the created datasets feature only one partition in the event_timestamp column?

@mzwiessele
Copy link
Contributor Author

mzwiessele commented Oct 11, 2022

@felixwang9817 @achals kind ping :)

@achals
Copy link
Member

achals commented Oct 11, 2022

@felixwang9817 @achals kind ping :)

The DCO check still seems to be failing! Good to merge as soon as that's fixed. https://github.com/feast-dev/feast/pull/3236/checks?check_run_id=8744237672

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>
Signed-off-by: Max Zwiessle <ibinbei@gmail.com>
Signed-off-by: Max Zwiessle <ibinbei@gmail.com>
@mzwiessele mzwiessele force-pushed the zero_division_dask_drop_duplicates branch from 7343c26 to f60862b Compare October 12, 2022 15:31
@mzwiessele
Copy link
Contributor Author

@achals fixed :)

Copy link
Member

@achals achals left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achals, mzwiessele

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 69e4a7d into feast-dev:master Oct 12, 2022
franciscojavierarceo pushed a commit to franciscojavierarceo/feast that referenced this pull request Oct 18, 2022
…on (feast-dev#3236)

* fix: dask zero division error if parquet dataset has only one partition

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

* Update file.py

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

* Update file.py

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>
kevjumba pushed a commit that referenced this pull request Dec 5, 2022
# [0.27.0](v0.26.0...v0.27.0) (2022-12-05)

### Bug Fixes

* Changing Snowflake template code to avoid query not implemented … ([#3319](#3319)) ([1590d6b](1590d6b))
* Dask zero division error if parquet dataset has only one partition ([#3236](#3236)) ([69e4a7d](69e4a7d))
* Enable Spark materialization on Yarn ([#3370](#3370)) ([0c20a4e](0c20a4e))
* Ensure that Snowflake accounts for number columns that overspecify precision ([#3306](#3306)) ([0ad0ace](0ad0ace))
* Fix memory leak from usage.py not properly cleaning up call stack ([#3371](#3371)) ([a0c6fde](a0c6fde))
* Fix workflow to contain env vars ([#3379](#3379)) ([548bed9](548bed9))
* Update bytewax materialization ([#3368](#3368)) ([4ebe00f](4ebe00f))
* Update the version counts ([#3378](#3378)) ([8112db5](8112db5))
* Updated AWS Athena template ([#3322](#3322)) ([5956981](5956981))
* Wrong UI data source type display ([#3276](#3276)) ([8f28062](8f28062))

### Features

* Cassandra online store, concurrency in bulk write operations ([#3367](#3367)) ([eaf354c](eaf354c))
* Cassandra online store, concurrent fetching for multiple entities ([#3356](#3356)) ([00fa21f](00fa21f))
* Get Snowflake Query Output As Pyspark Dataframe ([#2504](#2504)) ([#3358](#3358)) ([2f18957](2f18957))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants