fix: Dask zero division error if parquet dataset has only one partition #3236
Conversation
Force-pushed from f1736d3 to 225910e
/assign @woop
Could you add a test for this case?
Yes, I need help though. We need a test folder with a parquet dataset in the test S3 bucket. That parquet dataset must have only one partition in the event_timestamp column. Related to #3235
Before merging this: Is it possible to update the Dask version feast relies on? Or in other words, why is the version restricted like this?
Force-pushed from 225910e to f75f44a
@achals I'll need help with the tests. Happy to do the groundwork. Please point me to the right testing suite (in the unit tests) for loading a local parquet file from a FileSource.
Force-pushed from 7e34e2f to 393bf5c
Codecov Report
Base: 67.50% // Head: 58.12% // Decreases project coverage by 9.39%.

@@            Coverage Diff             @@
##           master    #3236      +/-   ##
==========================================
- Coverage   67.50%   58.12%    -9.39%
==========================================
  Files         179      213       +34
  Lines       16371    17832     +1461
==========================================
- Hits        11051    10364      -687
- Misses       5320     7468     +2148

View full report at Codecov.
@achals I've added the test, please take a look and tell me if this does the trick!
hey @mzwiessele thanks for the PR! a few responses to your comments above:
unless I'm misunderstanding, there's no need for the parquet dataset to be in an S3 bucket, right? the only thing that matters is that the parquet dataset doesn't have more than one partition. if that's correct, I think the best way to write a test would be to do it locally - you can check out our unit tests in
I forget exactly why we restricted that version; I'll go back and check, and if there's no strong reason I'm happy to bump up the upper bound restriction (although I think that can happen in a follow-up PR)
@mzwiessele also left some additional comments for you on #3235!
@felixwang9817 I've added the dataset source to the test suite as suggested at #3235 to test loading of parquet datasets. Do you know if any of the created datasets feature only one partition in the event_timestamp column?
@felixwang9817 @achals kind ping :)
The DCO check still seems to be failing! Good to merge as soon as that's fixed. https://github.com/feast-dev/feast/pull/3236/checks?check_run_id=8744237672
Signed-off-by: Max Zwiessle <ibinbei@gmail.com>
Force-pushed from 7343c26 to f60862b
@achals fixed :)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: achals, mzwiessele. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
…on (feast-dev#3236)
* fix: dask zero division error if parquet dataset has only one partition
* Update file.py
* Update file.py
Signed-off-by: Max Zwiessle <ibinbei@gmail.com>
# [0.27.0](v0.26.0...v0.27.0) (2022-12-05)

### Bug Fixes
* Changing Snowflake template code to avoid query not implemented … ([#3319](#3319)) ([1590d6b](1590d6b))
* Dask zero division error if parquet dataset has only one partition ([#3236](#3236)) ([69e4a7d](69e4a7d))
* Enable Spark materialization on Yarn ([#3370](#3370)) ([0c20a4e](0c20a4e))
* Ensure that Snowflake accounts for number columns that overspecify precision ([#3306](#3306)) ([0ad0ace](0ad0ace))
* Fix memory leak from usage.py not properly cleaning up call stack ([#3371](#3371)) ([a0c6fde](a0c6fde))
* Fix workflow to contain env vars ([#3379](#3379)) ([548bed9](548bed9))
* Update bytewax materialization ([#3368](#3368)) ([4ebe00f](4ebe00f))
* Update the version counts ([#3378](#3378)) ([8112db5](8112db5))
* Updated AWS Athena template ([#3322](#3322)) ([5956981](5956981))
* Wrong UI data source type display ([#3276](#3276)) ([8f28062](8f28062))

### Features
* Cassandra online store, concurrency in bulk write operations ([#3367](#3367)) ([eaf354c](eaf354c))
* Cassandra online store, concurrent fetching for multiple entities ([#3356](#3356)) ([00fa21f](00fa21f))
* Get Snowflake Query Output As Pyspark Dataframe ([#2504](#2504)) ([#3358](#3358)) ([2f18957](2f18957))
Signed-off-by: Max Zwiessele <ibinbei@gmail.com>

What this PR does / why we need it:

When loading data in parquet dataset format, it can happen that the dataset has only one partition in the event_timestamp column. If that is the case, dask will fail to process the dataset, erroring with a ZeroDivisionError similar to feast/sdk/python/feast/infra/offline_stores/file.py, line 327 in 769c318.

This PR adds a try-catch block to gracefully circumvent the error and process the data in only one partition.

Which issue(s) this PR fixes:

N/A
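The guard described above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern (catch the division error and fall back to a single partition), not feast's actual file.py code; the function name `partition_count` and its sizing formula are invented for the example.

```python
def partition_count(total_rows: int, n_unique_timestamps: int) -> int:
    """Illustrative partition sizing: dividing by the number of
    timestamp gaps (n_unique_timestamps - 1) raises ZeroDivisionError
    when the event_timestamp column holds a single distinct value."""
    try:
        return max(1, total_rows // (n_unique_timestamps - 1))
    except ZeroDivisionError:
        # Only one distinct timestamp: gracefully fall back to
        # processing the data in a single partition, mirroring the
        # try-catch block added in this PR.
        return 1
```

With more than one distinct timestamp the formula runs normally; with exactly one, the except branch returns 1 instead of crashing.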