Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Polars generic dataset (take 3) #170

Merged
merged 12 commits into from
Aug 30, 2023
Merged

Conversation

astrojuanlu
Copy link
Member

@astrojuanlu astrojuanlu commented Apr 16, 2023

Description

Close gh-110.

Development notes

See past discussion in gh-153 and gh-116.

Not 100 % sure I did the rebase right, will look into it later.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

kedro-datasets/RELEASE.md Outdated Show resolved Hide resolved
@astrojuanlu astrojuanlu force-pushed the feat/polars-generic-dataset branch 3 times, most recently from 404709d to b25a517 Compare June 19, 2023 06:32
@astrojuanlu astrojuanlu changed the title Polars generic dataset (take 3) feat: Polars generic dataset (take 3) Jun 19, 2023
@astrojuanlu
Copy link
Member Author

Lots of test failures that are completely unrelated to this PR. Will have a look later.

@sbrugman
Copy link
Contributor

This dataset would be of great use! Happy to help where possible.

The error that is mostly present is:

TypeError: expected exception must be a BaseException type, not DataSetError

The kedro version compared to main shows changes in kedro.io.core:

kedro-org/kedro@08d3a02...main#diff-236fdc9ebc4cf9a942f1044bec507416ca1ab60918d41177e04b18591fecb40d

This probably has to do with the renaming of DataSetError to DatasetError:

kedro/io/core.py

kedro/io/core.py

I'm not well-known with the transition plans for the DataSet names and the parity of kedro and kedro-datasets. Do you think it's possible to move forward?

@astrojuanlu
Copy link
Member Author

Hi @sbrugman, that help would be really appreciated! This has been on my to-do list for too long.

I just rebased and force pushed on top of current main, let's see what the CI says.

@astrojuanlu astrojuanlu mentioned this pull request Jul 13, 2023
4 tasks
@astrojuanlu
Copy link
Member Author

About the move to *Dataset from *DataSet, #255 should take care of that I believe. But it's a bit stuck.

@sbrugman
Copy link
Contributor

LGTM!

@astrojuanlu
Copy link
Member Author

Looks like everything is green now 😄

Have you used it already @sbrugman? To be honest I haven't looked at the code nor tried to use it yet myself. But it looks like

  • kedro_datasets/polars/generic_dataset.py has 100 % coverage
  • Tests are parametrized by file type and location

I'm... marking this as ready for review 😬

@astrojuanlu astrojuanlu marked this pull request as ready for review July 13, 2023 07:30
@sbrugman
Copy link
Contributor

sbrugman commented Jul 13, 2023

As it stands, we will use it for parquet on S3 once this dataset is incorporated in kedro-datasets. Will report issues if we run into them.

astrojuanlu and others added 4 commits July 14, 2023 13:47
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: wmoreiraa <walber3@gmail.com>
Signed-off-by: wmoreiraa <walber3@gmail.com>
Signed-off-by: wmoreiraa <walber3@gmail.com>
@astrojuanlu
Copy link
Member Author

I tested this with https://github.com/astrojuanlu/workshop-jupyter-kedro successfully:

diff --git a/conf/base/catalog.yml b/conf/base/catalog.yml
index 854b5e8..f06dd7e 100644
--- a/conf/base/catalog.yml
+++ b/conf/base/catalog.yml
@@ -12,22 +12,14 @@ openrepair-0_3-categories:
   filepath: data/01_raw/OpenRepairData_v0.3_Product_Categories.csv
 
 openrepair-0_3-combined:
-  type: polars.CSVDataSet
-  filepath: data/02_intermediate/openrepairdata_v0.3_combined.csv
-  load_args:
-    dtypes:
-      product_age: ${pl:Float64}
-      group_identifier: ${pl:Utf8}
-    try_parse_dates: true
+  type: polars.GenericDataSet
+  file_format: parquet
+  filepath: data/02_intermediate/openrepairdata_v0.3_combined.pq
 
 openrepair-0_3:
-  type: polars.CSVDataSet
-  filepath: data/03_primary/openrepairdata_v0.3_clean.csv
-  load_args:
-    dtypes:
-      product_age: ${pl:Float64}
-      group_identifier: ${pl:Utf8}
-    try_parse_dates: true
+  type: polars.GenericDataSet
+  file_format: parquet
+  filepath: data/03_primary/openrepairdata_v0.3_clean.pq
 
 wordcloud-plot:
   type: matplotlib.MatplotlibWriter

and everything worked perfectly. I cannot approve my own pull request 😬 so, reviews welcome!

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've copied over my comments from the closed Polars PRs which were left unanswered.

kedro-datasets/kedro_datasets/polars/generic_dataset.py Outdated Show resolved Hide resolved
kedro-datasets/kedro_datasets/polars/generic_dataset.py Outdated Show resolved Hide resolved
kedro-datasets/tests/polars/test_generic_dataset.py Outdated Show resolved Hide resolved
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Copy link
Contributor

@ankatiyar ankatiyar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it too and it works fine! :)

Copy link
Contributor

@SajidAlamQB SajidAlamQB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, just left some minor nitpicks but otherwise all good to go! 🌟

kedro-datasets/kedro_datasets/polars/generic_dataset.py Outdated Show resolved Hide resolved
kedro-datasets/kedro_datasets/polars/generic_dataset.py Outdated Show resolved Hide resolved
astrojuanlu and others added 2 commits August 26, 2023 10:24
Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
@astrojuanlu astrojuanlu requested review from merelcht and removed request for AhdraMeraliQB August 26, 2023 08:30
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
@astrojuanlu
Copy link
Member Author

Polars 0.19 is out 😄 after this PR is merged I'll verify that nothing broke here https://github.com/pola-rs/polars/releases/tag/py-0.19.0

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Let's get it merged 👍

@astrojuanlu astrojuanlu merged commit dad3edd into main Aug 30, 2023
17 checks passed
@astrojuanlu astrojuanlu deleted the feat/polars-generic-dataset branch August 30, 2023 14:22
@astrojuanlu
Copy link
Member Author

🔥 Thanks a lot reviewers! Super happy to see this land 👏🏽

@astrojuanlu
Copy link
Member Author

And thanks a lot to @wmoreiraa for starting the original effort!

@grofte
Copy link

grofte commented Mar 4, 2024

Oh, I thought this was already available in 0.19.1. If I add a pandas<->polars conversion at the beginning and end of every node I could have a polars pipeline, right? And then switch over once a new version is released that reads and writes Polars?

@astrojuanlu
Copy link
Member Author

@grofte This is available in kedro-datasets! You have to pip install kedro-datasets[polars]

@grofte
Copy link

grofte commented Mar 4, 2024

I'm running Kedro 0.19.2 and I've done pip install kedro-dataset[polars] (and repeated it after uninstalling kedro-dataset). The kedro-dataset version is 2.1.0.

I kept getting kedro.io.core.DatasetError: Class 'polars.GenericDataSet' not found, is this a typo? but then I found out that S->s so now I get Class 'polars.GenericDataset' not found, is this a typo?. According to this https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.0.0/api/kedro_datasets.html GenericDataset is only a thing for Pandas.

EDIT OOOOOOOh. It's EagerPolarsDataset or LazyPolarsDataset to get a generic dataset with Polars.

@astrojuanlu
Copy link
Member Author

Oh, yes it changed its name I believe 🙏🏽 Glad you got it to work! If you have any troubles with it, feel free to open an issue.

@grofte
Copy link

grofte commented Mar 4, 2024

Thank you!

And update your blog post 😉 https://kedro.org/blog/a-polars-exploration-into-kedro

It's the premiere source for kedro+polars information 😁

@grofte
Copy link

grofte commented Mar 5, 2024

For any future folks that come across this: There's a bug, I think in Polars but maybe in Kedro Datasets, where Kedro's EagerDataset opens the parquet file but polars doesn't recognize it as bytes / io.BufferedIOBase / io.RawIOBase and therefore sends it to .scan_parquet which only takes paths as an argument. I don't know which library it is happening in, or why, so I can't make a PR to fix it but there's a work-around. In your catalog.yml add

load_args:
    use_pyarrow: true

@astrojuanlu
Copy link
Member Author

Ugh, could you open a separate issue about this @grofte ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding Polars' dataframe as a dataset
9 participants