polars generic dataset #116
Conversation
Hi @wmoreiraa, thanks a lot for your pull request! 🙌🏽 I didn't look at the code yet, but a few pipelines are failing because some lines are not covered by tests. On the other hand, I'm spotting a dependency issue on Python 3.10. pip keeps backtracking until it fails to build an old version of NumPy. On another pull request from a few hours ago it was working and there were some dependency changes introduced here, so it might be related.
Hi @astrojuanlu! Yeah! Working on it right now!
@astrojuanlu, I've pushed a commit, but it will still fail at
@wmoreiraa this is looking really really great. Tangentially, I've been thinking of how to help users work with Delta Lake without needing Spark, and in my research I realised the Pythonic solution uses a Rust library anyway! So this is actually our answer to that question as well. We probably don't need to do it in this PR, but we should look to add the
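For context on the Spark-free path: reading a Delta table with delta-rs (the `deltalake` package) and handing it to polars looks roughly like this. A minimal sketch, not what this PR ships; the path is illustrative and the `DeltaTable.to_pyarrow_table` / `pl.from_arrow` combination is my assumption about how the pieces fit together.

```python
import polars as pl
from deltalake import DeltaTable

# Open an existing Delta table with the Rust-backed delta-rs bindings
# (no Spark session needed), then convert to a polars DataFrame via Arrow.
table = DeltaTable("data/01_raw/shuttles_delta")  # illustrative path
df = pl.from_arrow(table.to_pyarrow_table())
print(df.shape)
```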
@datajoely I think I finally did it... couldn't find another way 🤣
Nice, thank you for also fixing that Deltalake dataset! I will defer to engineers in the team to get this over the line, but this is looking realllllyyyy good.
Thanks so much for another contribution @wmoreiraa 🥇
I've added some initial comments and questions.
kedro-datasets/RELEASE.md
Outdated
| Type | Description | Location |
| ------------------------------------ | -------------------------------------------------------------------------- | ----------------------------- |
| `polars.CSVDataSet` | A `CSVDataSet` backed by [polars](https://www.pola.rs/), a lightning fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |
| `polars.GenericDataSet` | A `GenericDataSet` backed by [polars](https://www.pola.rs/), a lightning fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |

## Bug fixes and other changes

# Release 1.0.2:

## Bug fixes and other changes
* Change reference to `kedro.pipeline.Pipeline` object throughout test suite with `kedro.modular_pipeline.pipeline` factory.
* Relaxed PyArrow range in line with Pandas
* Fixed outdated links to the dill package documentation
I think you accidentally deleted some lines from the release notes. The 1.0.2
section should stay and so should the table header.
Whoops!
kedro-datasets/test_requirements.txt
Outdated
@@ -7,7 +7,7 @@ black~=22.0
 compress-pickle[lz4]~=1.2.0
 coverage[toml]
 dask[complete]~=2021.10 # pinned by Snyk to avoid a vulnerability
-delta-spark~=1.2.1 # 1.2.0 has a bug that breaks some of our tests: https://github.com/delta-io/delta/issues/1070
+delta-spark
We always try to pin our dependencies to some extent, even if it's loosely. Could you add a pin that's suitable?
Yeah, sure. But just a reminder here: those pins aren't quite being respected. The dependency resolver is just finding a solution that "works", like I've commented earlier (e.g. `pandas~=1.3`, but it's 1.5.3 that's being used).
Honestly, why aren't we using `pip install -e ".[all]"`? (I may be missing the bigger picture)
You're absolutely right. IMO we don't necessarily have the most optimal testing and development flow for the datasets. We're improving our dev setup, but it's a slow process. I guess for the time being it would just be nice to keep it consistent with how other dependencies are pinned.
kedro-datasets/test_requirements.txt
Outdated
@@ -30,13 +30,13 @@ networkx~=2.4
 opencv-python~=4.5.5.64
 openpyxl>=3.0.3, <4.0
 pandas-gbq>=0.12.0, <0.18.0
-pandas~=1.3 # 1.3 for read_xml/to_xml
+pandas>=1.3, <2.0 # 1.3 for read_xml/to_xml
👍 A reminder for me that we'll soon need to look at supporting pandas 2.0
👀
"Our work is never over" PUNK, Daft.
kedro-datasets/RELEASE.md
Outdated
@@ -6,26 +6,21 @@
Can you also add a note about the deltatable changes you made?
DeltaTable(load_path)
except PyDeltaTableError as exception:
    if "Not a Delta table" in str(exception):
I don't know much about `DeltaTable`, so could you describe what's changed here? It looks fine, but I'm curious to understand 🙂
@merelcht, while I was writing the response to you, I noticed that I probably introduced a mistake.
The `spark.DeltaTableDataSet` used the `delta-spark` lib, which sits on top of Spark to communicate with Delta files. I changed it to the native Delta reader, implemented in Rust (`delta-rs`). Does it work? Yes, it works.
Should it be done in `spark.DeltaTableDataSet`? Hell no, it's not Spark.
IMO, the solution should be a `spark.DeltaTableDataSet` implementation using the prior `delta-spark`, and then an `other.DeltaTableDataSet` using the native `delta-rs`.
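To make the change concrete, this is roughly the exists-check pattern the snippet above amounts to. A minimal sketch: the wrapper function name is mine, and I'm assuming `PyDeltaTableError` is importable from the top-level `deltalake` package.

```python
from deltalake import DeltaTable, PyDeltaTableError


def _is_delta_table(load_path: str) -> bool:
    # delta-rs raises PyDeltaTableError when the path has no valid
    # _delta_log; treat that specific case as "not a Delta table".
    try:
        DeltaTable(load_path)
    except PyDeltaTableError as exception:
        if "Not a Delta table" in str(exception):
            return False
        raise  # any other failure is a genuine error
    return True
```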
For background, we implemented this super early on in Delta's life, so this is now the better way to do it.
@datajoely Yeah, I know! Might as well wait a bit, they're implementing the write mode on `delta-rs`, but it's still experimental.
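For reference, the experimental Rust-native write path mentioned above looks roughly like this, assuming the `write_deltalake` helper that delta-rs exposes; the path and data are illustrative and nothing in this PR relies on it.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Write (or overwrite) a Delta table without a Spark session,
# then read back the current table version.
df = pd.DataFrame({"id": [1, 2, 3]})
write_deltalake("data/01_raw/ids_delta", df, mode="overwrite")  # illustrative path
print(DeltaTable("data/01_raw/ids_delta").version())
```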
Yeah, in that case maybe we should leave it as is and create a completely new `DeltaTableDataSet` that uses the newer and better way of doing things.
if self._file_format in ACCEPTED_READ_FILE_FORMATS:
    raise DataSetError(
        f"This file format is read-only: '{self._file_format}' "
        f"If you want only to read, change write_mode to 'ignore'"
    )
Maybe I'm misunderstanding this, but if you're using a read-only data format, you simply shouldn't call `save`, right? It kind of sounds like if you change the write mode to ignore you wouldn't get an error, but you would actually get `raise DataSetError(f"Write mode '{self._write_mode}' is read-only.")` as set in line 219.
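If I'm reading the diff right, the two guards interact roughly like this. A sketch only: the ordering and the surrounding `_save` body are my assumption, while the names and messages are taken from the diff.

```python
def _save(self, data) -> None:
    # Guard 1: formats that polars can only read are rejected outright,
    # whatever write_mode is configured.
    if self._file_format in ACCEPTED_READ_FILE_FORMATS:
        raise DataSetError(
            f"This file format is read-only: '{self._file_format}' "
            f"If you want only to read, change write_mode to 'ignore'"
        )
    # Guard 2: write_mode='ignore' also refuses to save, so following the
    # advice in the first message still ends in an error when save() is called.
    if self._write_mode == "ignore":
        raise DataSetError(f"Write mode '{self._write_mode}' is read-only.")
    ...  # actual write happens after these checks
```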
)
if self._write_mode == "ignore":
    raise DataSetError(f"Write mode '{self._write_mode}' is read-only.")
self._ensure_file_system_target()
Why do you need to call this as well as the other checks above?
self,
filepath: str,
file_format: str,
write_mode: str = "overwrite",
When you say "overwrite", do you mean that the data would be overwritten every time you call `save`? The dataset is inheriting from `AbstractVersionedDataSet`, so I'd expect it to still be able to save versions of the dataset.
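For what it's worth, this is how I read the constructor from the diff; the instantiation below is purely illustrative (path and format are made up), and whether `write_mode="overwrite"` bypasses or coexists with versioning is exactly the open question.

```python
from kedro_datasets.polars import GenericDataSet  # import path per this PR

# Hypothetical usage following the signature shown in the diff.
data_set = GenericDataSet(
    filepath="data/01_raw/shuttles.parquet",
    file_format="parquet",
    write_mode="overwrite",
)
```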
class TestGenericExcelDataSet:
    def test_assert_write_mode(self):
I would change this to

- def test_assert_write_mode(self):
+ def test_assert_write_mode_fails(self):

in line with the tests below that have `fail` in the name for a test that should raise an exception.
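A minimal sketch of what the renamed test could look like; the body is illustrative only (I don't know the actual assertion in this PR), it just shows the naming plus `pytest.raises` pattern, and I'm assuming `DataSetError` comes from `kedro.io.core` and that "excel" counts as a read-only format here.

```python
import pytest
from kedro.io.core import DataSetError

from kedro_datasets.polars import GenericDataSet  # import path per this PR


class TestGenericExcelDataSet:
    def test_assert_write_mode_fails(self):
        # Illustrative body: per the diff, a read-only file format combined
        # with a writing write_mode should raise DataSetError.
        with pytest.raises(DataSetError, match="read-only"):
            GenericDataSet(filepath="test.xlsx", file_format="excel")
```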
FYI: #124
Force-pushed from 4448b9f to b07a57e
Signed-off-by: wmoreiraa <walber3@gmail.com>
Description

Add `polars.GenericDataSet` as discussed in #95

Development notes

Similar to `pandas.GenericDataSet`, changes made:

Non-polars related

Changed `delta-spark` to `deltalake` when using `spark.DeltaTableDataSet`.

Checklist

- `RELEASE.md` file