
polars generic dataset #116

Closed

Conversation

@wmoreiraa (Contributor) commented Feb 13, 2023

Signed-off-by: wmoreiraa walber3@gmail.com

Description

Add polars.GenericDataSet as discussed in #95

Development notes

Similar to pandas.GenericDataSet, changes made:

  1. Added a "write_mode" entry to support formats for which polars doesn't provide a write method (e.g. delta, excel); see the usage sketch after this list.
  2. Changed the file write permission from "w" to "wb".
  3. Added more test cases for different file formats.
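
As a rough usage sketch of the proposed dataset (the import path, class name and constructor arguments follow the signature in this diff; they are not a final API):

```python
# Hypothetical usage sketch of the dataset proposed in this PR; the import path,
# class name and write_mode argument follow this diff and are assumptions.
import polars as pl
from kedro_datasets.polars import GenericDataSet

# Parquet: polars has a native write method, so the default write_mode="overwrite" works.
parquet_ds = GenericDataSet(filepath="data/01_raw/items.parquet", file_format="parquet")
parquet_ds.save(pl.DataFrame({"id": [1, 2, 3]}))

# Delta: polars only exposes a reader here, so the dataset is declared read-only
# via write_mode="ignore"; calling save() on it raises a DataSetError.
delta_ds = GenericDataSet(
    filepath="data/01_raw/items_delta",
    file_format="delta",
    write_mode="ignore",
)
df = delta_ds.load()
```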

Non-polars related

Changed delta-spark to deltalake when using spark.DeltaTableDataSet.

Checklist

  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@astrojuanlu (Member)

Hi @wmoreiraa, thanks a lot for your pull request! 🙌🏽

I didn't look at the code yet, but a few pipelines are failing because some lines are not covered by tests.

On the other hand, I'm spotting a dependencies issue on Python 3.10. pip keeps backtracking until it fails to build an old version of NumPy. On another pull request from a few hours ago it was working and there were some dependency changes introduced here, so it might be related.

@wmoreiraa (Contributor, Author)

Hi @astrojuanlu ! Yeah! Working on it right now!

@wmoreiraa (Contributor, Author) commented Feb 14, 2023

@astrojuanlu, I've pushed a commit, but it will still fail at pip install -r test_requirements.txt and I don't know how to proceed:

  1. If you check the pip freeze output from a PR a few hours ago, the pandas and numpy versions are not following pandas~=1.3, and it's probably the numpy version that's the problem (1.22.4 doesn't work with pyspark).

@datajoely (Contributor)

@wmoreiraa this is looking really really great.

Tangentially, I've been thinking about how to help users work with Delta Lake without needing Spark, and in my research I realised the Pythonic solution uses a Rust library anyway! So this is actually our answer to that question as well.

We probably don't need to do it in this PR, but we should look to add the scan_* readers to the scope later on so that users can read things lazily too 💪
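
For reference, lazy reading with polars' own scan_* API looks roughly like this (plain polars, independent of the dataset proposed in this PR):

```python
import polars as pl

# scan_csv builds a lazy query plan instead of reading the file eagerly.
lazy_df = pl.scan_csv("data/01_raw/items.csv")

# Filters and column selections are pushed down into the scan;
# collect() is what actually triggers the read.
result = (
    lazy_df
    .filter(pl.col("price") > 100)
    .select(["id", "price"])
    .collect()
)
print(result)
```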

@wmoreiraa (Contributor, Author)

@datajoely I think I finally did it... couldn't find another way 🤣

@datajoely (Contributor) left a comment

Nice, thank you for also fixing that Deltalake dataset too! I will defer to engineers in the team to get this over the line, but this is looking realllllyyyy good.

Review thread on kedro-datasets/kedro_datasets/polars/generic_dataset.py (outdated, resolved)
@merelcht (Member) left a comment

Thanks so much for another contribution @wmoreiraa 🥇

I've added some initial comments and questions.

Comment on lines 9 to 42
| Type | Description | Location |
| ------------------------------------ | -------------------------------------------------------------------------- | ----------------------------- |
| `polars.CSVDataSet` | A `CSVDataSet` backed by [polars](https://www.pola.rs/), a lightning-fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |
| `polars.GenericDataSet` | A `GenericDataSet` backed by [polars](https://www.pola.rs/), a lightning-fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |

## Bug fixes and other changes


# Release 1.0.2:

## Bug fixes and other changes
* Change reference to `kedro.pipeline.Pipeline` object throughout test suite with `kedro.modular_pipeline.pipeline` factory.

* Relaxed PyArrow range in line with Pandas
* Fixed outdated links to the dill package documentation
Member

I think you accidentally deleted some lines from the release notes. The 1.0.2 section should stay and so should the table header.

Contributor Author

Whoops!

@@ -7,7 +7,7 @@ black~=22.0
compress-pickle[lz4]~=1.2.0
coverage[toml]
dask[complete]~=2021.10 # pinned by Snyk to avoid a vulnerability
delta-spark~=1.2.1 # 1.2.0 has a bug that breaks some of our tests: https://github.com/delta-io/delta/issues/1070
delta-spark
Member

We always try to pin our dependencies to some extent, even if it's loosely. Could you add a pin that's suitable?

Contributor Author

Yeah, sure. But just a reminder here: those pins aren't quite being respected. The dependency resolver is just finding a solution that "works", like I've commented earlier (e.g. pandas~=1.3 is pinned, but it's 1.5.3 that's being used).
Honestly, why aren't we using pip install -e ".[all]"? (I may be missing the bigger picture.)

Member

You're absolutely right. IMO we don't necessarily have the most optimal testing and development flow for the datasets. We're improving our dev setup, but it's a slow process. I guess for the time being it would just be nice to keep it consistent with how other dependencies are pinned.

@@ -30,13 +30,13 @@ networkx~=2.4
opencv-python~=4.5.5.64
openpyxl>=3.0.3, <4.0
pandas-gbq>=0.12.0, <0.18.0
pandas~=1.3 # 1.3 for read_xml/to_xml
pandas>=1.3, <2.0 # 1.3 for read_xml/to_xml
Member

👍 A reminder for me that we'll soon need to look at supporting pandas 2.0 👀

Contributor Author

"Our work is never over" PUNK, Daft.

@@ -6,26 +6,21 @@

Member

Can you also add a note about the deltatable changes you made?

Comment on lines 88 to 90
DeltaTable(load_path)
except PyDeltaTableError as exception:
if "Not a Delta table" in str(exception):
Member

I don't know much about DeltaTable, so could you describe what's changed here? It looks fine, but I'm curious to understand 🙂

Contributor Author

@merelcht, while I was writing the response to you, I noticed that I probably introduced a mistake.
The spark.DeltaTableDataSet used the delta-spark lib, which sits on top of Spark to communicate with Delta files. I changed it to the native Delta reader, implemented in Rust (delta-rs). Does it work? Yes, it works.
Should it be done in spark.DeltaTableDataSet? Hell no, it's not Spark.
IMO, the solution should be a spark.DeltaTableDataSet implementation using the prior delta-spark, and then an other.DeltaTableDataSet using the native delta-rs.
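
For context, the existence check in the diff quoted above boils down to roughly this (a minimal sketch; the helper name is made up, and the PyDeltaTableError import location is an assumption about the deltalake version in use at the time):

```python
# Minimal sketch of the delta-rs based check discussed above; the function name is
# hypothetical and the exception import assumes the deltalake version of that era.
from deltalake import DeltaTable, PyDeltaTableError


def is_delta_table(load_path: str) -> bool:
    """Return True if load_path points at a valid Delta table, False otherwise."""
    try:
        DeltaTable(load_path)
    except PyDeltaTableError as exception:
        # delta-rs raises with a "Not a Delta table" message for plain paths.
        if "Not a Delta table" in str(exception):
            return False
        raise
    return True
```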

Contributor

For background, we implemented this super early on in Delta's life, so this is now the better way to do it.

Contributor Author

@datajoely Yeah, I know! Might as well wait a bit, they're implementing the write mode in delta-rs but it's still experimental.

Member

Yeah, in that case maybe we should leave it as is and create a completely new DeltaTableDataSet that uses the newer and better way of doing things.

Comment on lines 206 to 210
if self._file_format in ACCEPTED_READ_FILE_FORMATS:
raise DataSetError(
f"This file format is read-only: '{self._file_format}' "
f"If you want only to read, change write_mode to 'ignore'"
)
Member

Maybe I'm misunderstanding this, but if you're using a read-only data format, you simply shouldn't call save, right? It kind of sounds like if you change the write mode to 'ignore' you wouldn't get an error, but you would actually get raise DataSetError(f"Write mode '{self._write_mode}' is read-only.") as set in line 219.
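
To make the concern concrete, the two guards being discussed reduce to something like this (a standalone paraphrase of the diff, not the actual method body; the format set is assumed from the PR description):

```python
# Standalone paraphrase of the two save-time guards under discussion;
# the ACCEPTED_READ_FILE_FORMATS values are assumed from the PR description.
from kedro.io.core import DataSetError

ACCEPTED_READ_FILE_FORMATS = {"delta", "excel"}


def check_save_allowed(file_format: str, write_mode: str) -> None:
    if file_format in ACCEPTED_READ_FILE_FORMATS:
        # Guard 1: read-only format; the message suggests switching write_mode to "ignore"...
        raise DataSetError(
            f"This file format is read-only: '{file_format}'. "
            "If you only want to read, change write_mode to 'ignore'"
        )
    if write_mode == "ignore":
        # Guard 2: ...but "ignore" itself also raises on save, which is what the
        # reviewer is questioning here.
        raise DataSetError(f"Write mode '{write_mode}' is read-only.")
```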

)
if self._write_mode == "ignore":
raise DataSetError(f"Write mode '{self._write_mode}' is read-only.")
self._ensure_file_system_target()
Member

Why do you need to call this as well as the other checks above?

self,
filepath: str,
file_format: str,
write_mode: str = "overwrite",
Member

When you say "overwrite" do you mean that the data would be overwritten every time you call save? The dataset is inheriting from AbstractVersionedDataSet so I'd expect it to still be able to save versions of the dataset.



class TestGenericExcelDataSet:
def test_assert_write_mode(self):
Member

I would change this to

Suggested change
def test_assert_write_mode(self):
def test_assert_write_mode_fails(self):

in line with the tests below that have fail in the name for a test that should raise an exception.
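
For reference, one plausible shape for such a failing-mode test (the constructor arguments and the exact failure asserted here are assumptions, not the actual test body in this PR):

```python
import pytest
from kedro.io.core import DataSetError
from kedro_datasets.polars import GenericDataSet  # dataset proposed in this PR


class TestGenericExcelDataSet:
    def test_assert_write_mode_fails(self):
        # An unsupported write_mode value should be rejected with a DataSetError.
        with pytest.raises(DataSetError):
            GenericDataSet(
                filepath="test.xlsx",
                file_format="excel",
                write_mode="not_a_real_mode",
            )
```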

@astrojuanlu (Member)

FYI: #124

@merelcht merelcht linked an issue Mar 21, 2023 that may be closed by this pull request
@wmoreiraa wmoreiraa closed this Apr 1, 2023
@wmoreiraa wmoreiraa mentioned this pull request Apr 1, 2023