
polars generic dataset #116

Closed

Conversation

@wmoreiraa (Contributor) commented Feb 13, 2023

Signed-off-by: wmoreiraa walber3@gmail.com

Description

Add polars.GenericDataSet as discussed in #95

Development notes

Similar to pandas.GenericDataSet, changes made:

  1. Added a "write_mode" entry to support formats for which polars doesn't provide a write method (e.g. delta, excel); see the usage sketch after this list.
  2. Changed the file write permission from "w" to "wb".
  3. Added more test cases for different file formats.
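
As a rough usage sketch of the proposed dataset (the import path, class name and constructor arguments follow the signature in this diff; they are not a final API):

```python
# Hypothetical usage sketch of the dataset proposed in this PR; the import path,
# class name and write_mode argument follow this diff and are assumptions.
import polars as pl
from kedro_datasets.polars import GenericDataSet

# Parquet: polars has a native write method, so the default write_mode="overwrite" works.
parquet_ds = GenericDataSet(filepath="data/01_raw/items.parquet", file_format="parquet")
parquet_ds.save(pl.DataFrame({"id": [1, 2, 3]}))

# Delta: polars only exposes a reader here, so the dataset is declared read-only
# via write_mode="ignore"; calling save() on it raises a DataSetError.
delta_ds = GenericDataSet(
    filepath="data/01_raw/items_delta",
    file_format="delta",
    write_mode="ignore",
)
df = delta_ds.load()
```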

Non-polars related

Changed delta-spark to deltalake when using spark.DeltaTableDataSet.

Checklist

  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@astrojuanlu (Member)

Hi @wmoreiraa, thanks a lot for your pull request! 🙌🏽

I didn't look at the code yet, but a few pipelines are failing because some lines are not covered by tests.

On the other hand, I'm spotting a dependencies issue on Python 3.10. pip keeps backtracking until it fails to build an old version of NumPy. On another pull request from a few hours ago it was working and there were some dependency changes introduced here, so it might be related.

@wmoreiraa (Contributor, Author)

Hi @astrojuanlu ! Yeah! Working on it right now!

@wmoreiraa (Contributor, Author) commented Feb 14, 2023

@astrojuanlu, I've pushed a commit, but it will still fail at pip install -r test_requirements.txt and I don't know how to proceed:

  1. If you check the pip freeze output from a PR a few hours ago, the pandas and numpy versions are not following pandas~=1.3, and it's probably the numpy version that's the problem (1.22.4 doesn't work with pyspark).

@datajoely (Contributor)

@wmoreiraa this is looking really really great.

Tangentially, I've been thinking about how to help users work with Delta Lake without needing Spark, and in my research I realised the Pythonic solution uses a Rust library anyway! So this is actually our answer to that question as well.

We probably don't need to do it in this PR, but we should look to add the scan_* readers to the scope later on so that users can read things lazily too 💪
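
For reference, lazy reading with polars' own scan_* API looks roughly like this (plain polars, independent of the dataset proposed in this PR):

```python
import polars as pl

# scan_csv builds a lazy query plan instead of reading the file eagerly.
lazy_df = pl.scan_csv("data/01_raw/items.csv")

# Filters and column selections are pushed down into the scan;
# collect() is what actually triggers the read.
result = (
    lazy_df
    .filter(pl.col("price") > 100)
    .select(["id", "price"])
    .collect()
)
print(result)
```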

@wmoreiraa (Contributor, Author)

@datajoely I think I finally did it... couldn't find another way 🤣

@datajoely (Contributor) left a comment

Nice, thank you for also fixing that Deltalake dataset too! I will defer to engineers in the team to get this over the line, but this is looking realllllyyyy good.

Review thread on kedro-datasets/kedro_datasets/polars/generic_dataset.py (outdated, resolved)
@merelcht (Member) left a comment

Thanks so much for another contribution @wmoreiraa 🥇

I've added some initial comments and questions.

Comment on lines 9 to 42
| Type | Description | Location |
| ------------------------------------ | -------------------------------------------------------------------------- | ----------------------------- |
| `polars.CSVDataSet` | A `CSVDataSet` backed by [polars](https://www.pola.rs/), a lightning-fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |
| `polars.GenericDataSet` | A `GenericDataSet` backed by [polars](https://www.pola.rs/), a lightning-fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |

## Bug fixes and other changes


# Release 1.0.2:

## Bug fixes and other changes
* Change reference to `kedro.pipeline.Pipeline` object throughout test suite with `kedro.modular_pipeline.pipeline` factory.

* Relaxed PyArrow range in line with Pandas
* Fixed outdated links to the dill package documentation
Member

I think you accidentally deleted some lines from the release notes. The 1.0.2 section should stay and so should the table header.

Contributor Author

Whoops!

@@ -7,7 +7,7 @@ black~=22.0
compress-pickle[lz4]~=1.2.0
coverage[toml]
dask[complete]~=2021.10 # pinned by Snyk to avoid a vulnerability
delta-spark~=1.2.1 # 1.2.0 has a bug that breaks some of our tests: https://github.com/delta-io/delta/issues/1070
delta-spark
Member

We always try to pin our dependencies to some extent, even if it's loosely. Could you add a pin that's suitable?

Contributor Author

Yeah, sure. But just a reminder here: those pins aren't quite being respected. The dependency resolver is just finding a solution that "works", like I've commented earlier (e.g. pandas~=1.3 is pinned, but it's 1.5.3 that's being used).
Honestly, why aren't we using pip install -e ".[all]"? (I may be missing the bigger picture.)

Member

You're absolutely right. IMO we don't necessarily have the most optimal testing and development flow for the datasets. We're improving our dev setup, but it's a slow process. I guess for the time being it would just be nice to keep it consistent with how other dependencies are pinned.

@@ -30,13 +30,13 @@ networkx~=2.4
opencv-python~=4.5.5.64
openpyxl>=3.0.3, <4.0
pandas-gbq>=0.12.0, <0.18.0
pandas~=1.3 # 1.3 for read_xml/to_xml
pandas>=1.3, <2.0 # 1.3 for read_xml/to_xml
Member

👍 A reminder for me that we'll soon need to look at supporting pandas 2.0 👀

Contributor Author

"Our work is never over" PUNK, Daft.

@@ -6,26 +6,21 @@

Member

Can you also add a note about the deltatable changes you made?

Comment on lines 88 to 90
DeltaTable(load_path)
except PyDeltaTableError as exception:
if "Not a Delta table" in str(exception):
Member

I don't know much about DeltaTable, so could you describe what's changed here? It looks fine, but I'm curious to understand 🙂

Contributor Author

@merelcht, while I was writing the response to you, I noticed that I probably introduced a mistake.
The spark.DeltaTableDataSet used the delta-spark lib, which sits on top of Spark to communicate with Delta files. I changed it to the native Delta reader, implemented in Rust (delta-rs). Does it work? Yes, it works.
Should it be done in spark.DeltaTableDataSet? Hell no, it's not Spark.
IMO, the solution should be a spark.DeltaTableDataSet implementation using the prior delta-spark, and then an other.DeltaTableDataSet using the native delta-rs.
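
For context, the existence check in the diff quoted above boils down to roughly this (a minimal sketch; the helper name is made up, and the PyDeltaTableError import location is an assumption about the deltalake version in use at the time):

```python
# Minimal sketch of the delta-rs based check discussed above; the function name is
# hypothetical and the exception import assumes the deltalake version of that era.
from deltalake import DeltaTable, PyDeltaTableError


def is_delta_table(load_path: str) -> bool:
    """Return True if load_path points at a valid Delta table, False otherwise."""
    try:
        DeltaTable(load_path)
    except PyDeltaTableError as exception:
        # delta-rs raises with a "Not a Delta table" message for plain paths.
        if "Not a Delta table" in str(exception):
            return False
        raise
    return True
```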

Contributor

For background, we implemented this super early on in Delta's life, so this is now the better way to do it.

Contributor Author

@datajoely Yeah, I know! Might as well wait a bit, they're implementing the write mode in delta-rs but it's still experimental.

Member

Yeah, in that case maybe we should leave it as is and create a completely new DeltaTableDataSet that uses the newer and better way of doing things.

Comment on lines 206 to 210
if self._file_format in ACCEPTED_READ_FILE_FORMATS:
raise DataSetError(
f"This file format is read-only: '{self._file_format}' "
f"If you want only to read, change write_mode to 'ignore'"
)
Member

Maybe I'm misunderstanding this, but if you're using a read-only data format, you simply shouldn't call save, right? It kind of sounds like if you change the write mode to 'ignore' you wouldn't get an error, but you would actually get raise DataSetError(f"Write mode '{self._write_mode}' is read-only.") as set in line 219.
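
To make the concern concrete, the two guards being discussed reduce to something like this (a standalone paraphrase of the diff, not the actual method body; the format set is assumed from the PR description):

```python
# Standalone paraphrase of the two save-time guards under discussion;
# the ACCEPTED_READ_FILE_FORMATS values are assumed from the PR description.
from kedro.io.core import DataSetError

ACCEPTED_READ_FILE_FORMATS = {"delta", "excel"}


def check_save_allowed(file_format: str, write_mode: str) -> None:
    if file_format in ACCEPTED_READ_FILE_FORMATS:
        # Guard 1: read-only format; the message suggests switching write_mode to "ignore"...
        raise DataSetError(
            f"This file format is read-only: '{file_format}'. "
            "If you only want to read, change write_mode to 'ignore'"
        )
    if write_mode == "ignore":
        # Guard 2: ...but "ignore" itself also raises on save, which is what the
        # reviewer is questioning here.
        raise DataSetError(f"Write mode '{write_mode}' is read-only.")
```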

)
if self._write_mode == "ignore":
raise DataSetError(f"Write mode '{self._write_mode}' is read-only.")
self._ensure_file_system_target()
Member

Why do you need to call this as well as the other checks above?

self,
filepath: str,
file_format: str,
write_mode: str = "overwrite",
Member

When you say "overwrite" do you mean that the data would be overwritten every time you call save? The dataset is inheriting from AbstractVersionedDataSet so I'd expect it to still be able to save versions of the dataset.



class TestGenericExcelDataSet:
def test_assert_write_mode(self):
Member

I would change this to

Suggested change
def test_assert_write_mode(self):
def test_assert_write_mode_fails(self):

in line with the tests below that have fail in the name for a test that should raise an exception.
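
For reference, one plausible shape for such a failing-mode test (the constructor arguments and the exact failure asserted here are assumptions, not the actual test body in this PR):

```python
import pytest
from kedro.io.core import DataSetError
from kedro_datasets.polars import GenericDataSet  # dataset proposed in this PR


class TestGenericExcelDataSet:
    def test_assert_write_mode_fails(self):
        # An unsupported write_mode value should be rejected with a DataSetError.
        with pytest.raises(DataSetError):
            GenericDataSet(
                filepath="test.xlsx",
                file_format="excel",
                write_mode="not_a_real_mode",
            )
```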

@astrojuanlu (Member)

FYI: #124

@merelcht merelcht linked an issue Mar 21, 2023 that may be closed by this pull request
@wmoreiraa wmoreiraa closed this Apr 1, 2023
@wmoreiraa wmoreiraa mentioned this pull request Apr 1, 2023