feat: Polars generic dataset (take 3) #170
404709d to b25a517
b25a517 to a7963e6
Lots of test failures that are completely unrelated to this PR. Will have a look later.
This dataset would be of great use! Happy to help where possible. The most frequent error is:
The kedro version compared to main shows changes in kedro-org/kedro@08d3a02...main#diff-236fdc9ebc4cf9a942f1044bec507416ca1ab60918d41177e04b18591fecb40d. This probably has to do with the renaming of kedro/io/core.py. I'm not familiar with the transition plans for the
a7963e6 to 44fc59a
Hi @sbrugman, that help would be really appreciated! This has been on my to-do list for too long. I just rebased and force pushed on top of current
About the move to
LGTM!
Looks like everything is green now 😄 Have you used it already @sbrugman? To be honest I haven't looked at the code nor tried to use it yet myself. But it looks like
I'm... marking this as ready for review 😬
As it stands, we will use it for parquet on S3 once this dataset is incorporated in
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: wmoreiraa <walber3@gmail.com>
Signed-off-by: wmoreiraa <walber3@gmail.com>
Signed-off-by: wmoreiraa <walber3@gmail.com>
44fc59a to 754a6ba
I tested this with https://github.com/astrojuanlu/workshop-jupyter-kedro successfully:

```diff
diff --git a/conf/base/catalog.yml b/conf/base/catalog.yml
index 854b5e8..f06dd7e 100644
--- a/conf/base/catalog.yml
+++ b/conf/base/catalog.yml
@@ -12,22 +12,14 @@ openrepair-0_3-categories:
   filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv
 
 openrepair-0_3-combined:
-  type: polars.CSVDataSet
-  filepath: data/02_intermediate/openrepairdata_v0.3_combined.csv
-  load_args:
-    dtypes:
-      product_age: ${pl:Float64}
-      group_identifier: ${pl:Utf8}
-    try_parse_dates: true
+  type: polars.GenericDataSet
+  file_format: parquet
+  filepath: data/02_intermediate/openrepairdata_v0.3_combined.pq
 
 openrepair-0_3:
-  type: polars.CSVDataSet
-  filepath: data/03_primary/openrepairdata_v0.3_clean.csv
-  load_args:
-    dtypes:
-      product_age: ${pl:Float64}
-      group_identifier: ${pl:Utf8}
-    try_parse_dates: true
+  type: polars.GenericDataSet
+  file_format: parquet
+  filepath: data/03_primary/openrepairdata_v0.3_clean.pq
 
 wordcloud-plot:
   type: matplotlib.MatplotlibWriter
```

and everything worked perfectly. I cannot approve my own pull request 😬 so, reviews welcome!
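For readers wondering what the `file_format: parquet` key in the catalog change buys you, here is a rough sketch of one way a generic dataset could pick its reader from that key. This is an illustration under assumptions, not the code in this PR: `resolve_reader` and `fake_library` are hypothetical names, and the stand-in library only mimics the `read_<format>` naming that dataframe libraries tend to use.

```python
from types import SimpleNamespace
from typing import Any, Callable


def resolve_reader(library: Any, file_format: str) -> Callable[..., Any]:
    """Look up `read_<file_format>` on `library`, failing loudly if absent."""
    reader = getattr(library, f"read_{file_format}", None)
    if reader is None:
        raise ValueError(f"No 'read_{file_format}' function for format {file_format!r}")
    return reader


# Hypothetical stand-in for a dataframe library (assumption, not polars itself).
# Each fake reader just echoes what it was called with.
fake_library = SimpleNamespace(
    read_parquet=lambda path, **kwargs: ("parquet", path, kwargs),
    read_csv=lambda path, **kwargs: ("csv", path, kwargs),
)

# Mirrors the keys of the new catalog entry; `load_args` is empty because
# parquet already carries the column types that the CSV entry had to spell out.
catalog_entry = {
    "file_format": "parquet",
    "filepath": "data/02_intermediate/openrepairdata_v0.3_combined.pq",
    "load_args": {},
}

reader = resolve_reader(fake_library, catalog_entry["file_format"])
result = reader(catalog_entry["filepath"], **catalog_entry["load_args"])
```

The point of the sketch is only that one dataset type plus a format string can replace a dedicated class per format; the real dataset may well do more (filesystem handling, versioning, save paths).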
I've copied over my comments from the closed Polars PRs which were left unanswered.
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
bd07ee7 to ba4b056
I tested it too and it works fine! :)
This looks great, just left some minor nitpicks but otherwise all good to go! 🌟
Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
a85cc5e to a0fa68b
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Polars 0.19 is out 😄 After this PR is merged I'll verify that nothing broke here: https://github.com/pola-rs/polars/releases/tag/py-0.19.0
This looks great! Let's get it merged 👍
🔥 Thanks a lot reviewers! Super happy to see this land 👏🏽
And thanks a lot to @wmoreiraa for starting the original effort!
Oh, I thought this was already available in 0.19.1. If I add a pandas<->polars conversion at the beginning and end of every node I could have a polars pipeline, right? And then switch over once a new version is released that reads and writes Polars?
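A minimal sketch of the "convert at the beginning and end of every node" idea, in case it helps someone. `convert_frames` is a hypothetical helper, not part of Kedro or Polars, and it assumes each node takes its frames as positional arguments and returns a single frame. The converters are passed in as parameters; the demo uses `tuple` and `list` as stand-ins so it runs without any dataframe library installed.

```python
import functools
from typing import Any, Callable


def convert_frames(to_inner: Callable[[Any], Any], to_outer: Callable[[Any], Any]):
    """Build a decorator that converts every positional input with `to_inner`
    and the single return value with `to_outer`."""

    def decorator(node_func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(node_func)
        def wrapper(*frames: Any) -> Any:
            converted = [to_inner(frame) for frame in frames]
            return to_outer(node_func(*converted))

        return wrapper

    return decorator


# Stand-in converters: lists play the role of pandas, tuples the role of polars.
as_tuple_node = convert_frames(tuple, list)


@as_tuple_node
def drop_first(rows):
    # Inside the node the data has the "inner" type.
    assert isinstance(rows, tuple)
    return rows[1:]


output = drop_first([1, 2, 3])

# With pandas, polars and pyarrow installed, the converters would be:
#   import polars as pl
#   polars_node = convert_frames(pl.from_pandas, pl.DataFrame.to_pandas)
```

Once the catalog reads and writes Polars directly, the decorator can simply be removed from each node, since the node bodies already work on the inner type.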
@grofte This is available in kedro-datasets! You have to
I'm running Kedro 0.19.2 and I've done [...]. I kept getting [...].

EDIT: Oooooooh. It's EagerPolarsDataset or LazyPolarsDataset to get a generic dataset with Polars.
Oh, yes, it changed its name I believe 🙏🏽 Glad you got it to work! If you have any trouble with it, feel free to open an issue.
Thank you! And update your blog post 😉 https://kedro.org/blog/a-polars-exploration-into-kedro It's the premier source for kedro+polars information 😁
For any future folks that come across this: there's a bug, I think in Polars but maybe in Kedro Datasets, where Kedro's EagerDataset opens the parquet file but Polars doesn't recognize it as bytes / io.BufferedIOBase / io.RawIOBase and therefore sends it to load_args:
  use_pyarrow: true
Ugh, could you open a separate issue about this, @grofte?
Description
Close gh-110.
Development notes
See past discussion in gh-153 and gh-116.
Not 100% sure I did the rebase right, will look into it later.
Checklist
RELEASE.md file