fix: Dask zero division error if parquet dataset has only one partition #3236

mzwiessele · 2022-09-20T16:02:57Z

Signed-off-by: Max Zwiessele ibinbei@gmail.com

What this PR does / why we need it:
When loading data in Parquet dataset format, the dataset may have only one partition in the event_timestamp column. In that case, Dask fails to process the dataset and raises a ZeroDivisionError similar to the one already handled here:

feast/sdk/python/feast/infra/offline_stores/file.py

Line 327 in 769c318

    
           # try-catch block is added to deal with this issue https://github.com/dask/dask/issues/8939.

This PR adds a try/except block that catches the error and falls back to processing the data in a single partition.
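A minimal sketch of the fallback pattern described above, not the actual change in file.py. It assumes the failing call is a Dask `sort_values` and that retrying with `npartitions=1` is an acceptable fallback. The helper name `sort_with_single_partition_fallback` and the `FakeFrame` stand-in are hypothetical and exist only so the example runs without Dask installed.

```python
def sort_with_single_partition_fallback(ddf, by):
    """Sort a Dask-like dataframe, retrying on one partition after a ZeroDivisionError."""
    try:
        return ddf.sort_values(by=by)
    except ZeroDivisionError:
        # Assumption: the error comes from computing partition boundaries when
        # the sort column yields a single partition, so sorting everything in
        # one partition avoids the division.
        return ddf.sort_values(by=by, npartitions=1)


class FakeFrame:
    """Hypothetical stand-in that mimics the failure; not part of Dask or Feast."""

    def __init__(self):
        self.calls = []

    def sort_values(self, by, npartitions=None):
        # Record each call so the fallback path can be inspected.
        self.calls.append((by, npartitions))
        if npartitions is None:
            raise ZeroDivisionError("division by zero")
        return self


frame = FakeFrame()
result = sort_with_single_partition_fallback(frame, "event_timestamp")
```

With a real `dask.dataframe.DataFrame`, the same helper could be called in place of a direct `sort_values`; datasets that sort without error take the first branch unchanged.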

Which issue(s) this PR fixes:

N/A

sdk/python/feast/infra/offline_stores/file.py

mzwiessele · 2022-09-20T16:22:05Z

/assign @woop

achals

Could you add a test for this case?

mzwiessele · 2022-09-21T08:21:11Z

> Could you add a test for this case?

Yes, I need help though. We need a test folder with a parquet dataset in the test S3 bucket. That parquet dataset must have only one partition in the event_timestamp column. The reason is simply that this code correction fixes exactly that edge case.

Related to #3235

mzwiessele · 2022-09-21T13:18:40Z

Before merging this: Is it possible to update the Dask version feast relies on? Or in other words, why is the version restricted like this?

"dask>=2021.*,<2022.02.0"

mzwiessele · 2022-09-23T09:22:21Z

@achals I'll need help with the tests. Happy to do the ground work. Please point me to the right testing suite (in the unit tests) for loading a local parquet file from a FileSource.

codecov-commenter · 2022-09-23T16:14:57Z

Codecov Report

Base: 67.50% // Head: 58.12% // Decreases project coverage by -9.38% ⚠️

Coverage data is based on head (393bf5c) compared to base (b48d36b).
Patch coverage: 45.45% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3236      +/-   ##
==========================================
- Coverage   67.50%   58.12%   -9.39%     
==========================================
  Files         179      213      +34     
  Lines       16371    17832    +1461     
==========================================
- Hits        11051    10364     -687     
- Misses       5320     7468    +2148

| Flag | Coverage Δ |
|---|---|
| integrationtests | `?` |
| unittests | `58.12% <45.45%> (?)` |

| Impacted Files | Coverage Δ | |
|---|---|---|
| sdk/python/tests/utils/e2e_test_validation.py | `55.44% <ø> (-33.67%)` | ⬇️ |
| ...ation/feature_repos/universal/data_sources/file.py | `45.63% <40.00%> (-24.27%)` | ⬇️ |
| sdk/python/feast/infra/offline_stores/file.py | `72.90% <50.00%> (-22.57%)` | ⬇️ |
| ...sts/integration/registration/test_universal_cli.py | `20.20% <0.00%> (-79.80%)` | ⬇️ |
| ...ts/integration/offline_store/test_offline_write.py | `26.08% <0.00%> (-73.92%)` | ⬇️ |
| ...fline_store/test_universal_historical_retrieval.py | `28.75% <0.00%> (-71.25%)` | ⬇️ |
| ...ests/integration/e2e/test_python_feature_server.py | `29.50% <0.00%> (-70.50%)` | ⬇️ |
| ...dk/python/tests/integration/e2e/test_validation.py | `27.55% <0.00%> (-69.30%)` | ⬇️ |
| ...s/integration/registration/test_universal_types.py | `32.25% <0.00%> (-67.75%)` | ⬇️ |
| sdk/python/feast/infra/online_stores/redis.py | `28.39% <0.00%> (-66.67%)` | ⬇️ |

... and 169 more


mzwiessele · 2022-09-23T16:34:07Z

@achals I've added the test, please take a look and tell me if this does the trick!

felixwang9817 · 2022-09-24T01:07:25Z

hey @mzwiessele thanks for the PR! a few responses to your comments above:

> Yes, I need help though. We need a test folder with a parquet dataset in the test S3 bucket. That parquet dataset must have only one partition in the event_timestamp column. The reason is simply that this code correction fixes exactly that edge case.

unless I'm misunderstanding, there's no need for the parquet dataset to be in an S3 bucket, right? the only thing that matters is that the Parquet dataset doesn't have npartitions set

if that's correct, I think the best way to write a test would be to do it locally - you can check out our unit tests in sdk/python/feast/infra/tests/unit, all of which run locally. I would try to imitate test_e2e_local in test_e2e_local.py - it's a good example of setting up a dataset locally, constructing a feature repo + FeatureStore object, and then running some tests, in your case I think you would just want to check that get_historical_features works on a Parquet dataset with npartitions unset

> Before merging this: Is it possible to update the Dask version feast relies on? Or in other words, why is the version restricted like this?

I forget exactly why we restricted that version; I'll go back and check, and if there's no strong reason I'm happy to bump up the upper bound restriction (although I think that can happen in a follow up PR)

felixwang9817 · 2022-09-24T01:10:17Z

@mzwiessele also left some additional comments for you on #3235!

mzwiessele · 2022-10-05T09:34:24Z

@felixwang9817 I've added the dataset source to the test suite as suggested at #3235 to test loading of parquet datasets.

Do you know if any of the created datasets feature only one partition in the event_timestamp column?

mzwiessele · 2022-10-11T16:21:14Z

@felixwang9817 @achals kind ping :)

achals · 2022-10-11T16:59:15Z

> @felixwang9817 @achals kind ping :)

The DCO check still seems to be failing! Good to merge as soon as that's fixed. https://github.com/feast-dev/feast/pull/3236/checks?check_run_id=8744237672

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

mzwiessele · 2022-10-12T15:45:12Z

@achals fixed :)

achals

/lgtm

feast-ci-bot · 2022-10-12T16:28:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achals, mzwiessele

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [achals]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…on (feast-dev#3236)

* fix: dask zero division error if parquet dataset has only one partition

  Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

* Update file.py

  Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

* Update file.py

  Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

# [0.27.0](v0.26.0...v0.27.0) (2022-12-05)

### Bug Fixes

* Changing Snowflake template code to avoid query not implemented … ([#3319](#3319)) ([1590d6b](1590d6b))
* Dask zero division error if parquet dataset has only one partition ([#3236](#3236)) ([69e4a7d](69e4a7d))
* Enable Spark materialization on Yarn ([#3370](#3370)) ([0c20a4e](0c20a4e))
* Ensure that Snowflake accounts for number columns that overspecify precision ([#3306](#3306)) ([0ad0ace](0ad0ace))
* Fix memory leak from usage.py not properly cleaning up call stack ([#3371](#3371)) ([a0c6fde](a0c6fde))
* Fix workflow to contain env vars ([#3379](#3379)) ([548bed9](548bed9))
* Update bytewax materialization ([#3368](#3368)) ([4ebe00f](4ebe00f))
* Update the version counts ([#3378](#3378)) ([8112db5](8112db5))
* Updated AWS Athena template ([#3322](#3322)) ([5956981](5956981))
* Wrong UI data source type display ([#3276](#3276)) ([8f28062](8f28062))

### Features

* Cassandra online store, concurrency in bulk write operations ([#3367](#3367)) ([eaf354c](eaf354c))
* Cassandra online store, concurrent fetching for multiple entities ([#3356](#3356)) ([00fa21f](00fa21f))
* Get Snowflake Query Output As Pyspark Dataframe ([#2504](#2504)) ([#3358](#3358)) ([2f18957](2f18957))

feast-ci-bot added the size/M label Sep 20, 2022

mzwiessele changed the title ~~fix: dask zero division error if parquet dataset has only one partition~~ fix: Dask zero division error if parquet dataset has only one partition Sep 20, 2022

mzwiessele force-pushed the zero_division_dask_drop_duplicates branch 3 times, most recently from f1736d3 to 225910e on September 20, 2022 16:16

feast-ci-bot assigned woop Sep 20, 2022

achals approved these changes Sep 20, 2022

feast-ci-bot added the approved label Sep 20, 2022

mzwiessele force-pushed the zero_division_dask_drop_duplicates branch from 225910e to f75f44a on September 23, 2022 09:09

mzwiessele force-pushed the zero_division_dask_drop_duplicates branch from 7e34e2f to 393bf5c on September 23, 2022 16:06

felixwang9817 assigned felixwang9817 and unassigned woop Sep 24, 2022

mzwiessele added 3 commits October 12, 2022 16:31

fix: dask zero division error if parquet dataset has only one partition

91835d1

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

Update file.py

69b3fae

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

Update file.py

f60862b

Signed-off-by: Max Zwiessle <ibinbei@gmail.com>

mzwiessele force-pushed the zero_division_dask_drop_duplicates branch from 7343c26 to f60862b on October 12, 2022 15:31

achals approved these changes Oct 12, 2022

feast-ci-bot assigned achals Oct 12, 2022

feast-ci-bot added the lgtm label Oct 12, 2022

achals added the ok-to-test label Oct 12, 2022

feast-ci-bot merged commit 69e4a7d into feast-dev:master Oct 12, 2022
