Draft PR for issue 1946 - snowflake integration #78
Conversation
It's really coming together @heber-urdaneta, really good work!
raise DataSetError("'database' argument cannot be empty.")

if not schema:
    raise DataSetError("'schema' argument cannot be empty.")
Is there not a default schema? I'm all for explicit, but I think the underlying API will pick a default IIRC
Not necessarily. When you create a user, you can define the default namespace, but the default value is NULL:
https://docs.snowflake.com/en/sql-reference/sql/create-user.html
Thanks @sfc-gh-mgorkow, then we'll be conservative here!
sp_df.write.mode(self._save_args["mode"]).save_as_table(
    table_name,
    column_order=self._save_args["column_order"],
let's just do **self._save_args :)
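For reference, the simplification would look roughly like this (a sketch; it assumes every key in self._save_args, e.g. mode and column_order, is accepted by save_as_table as a keyword argument):

    # sketch of the suggestion; assumes all save_args keys are valid
    # save_as_table keyword arguments (mode, column_order, ...)
    sp_df.write.save_as_table(table_name, **self._save_args)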
kedro-datasets/test_requirements.txt
Outdated
@@ -34,7 +34,7 @@ Pillow~=9.0
 plotly>=4.8.0, <6.0
 pre-commit>=2.9.2, <3.0  # The hook `mypy` requires pre-commit version 2.9.2.
 psutil==5.8.0
-pyarrow>=1.0, <7.0
+pyarrow>=1.0, <9.0
This will need sign off from the wider team if it affects other pieces of the framework - hopefully not an issue.
I will just remove the upper bound; I expect it mostly works with pandas, and pandas itself has an open pin pyarrow>6.0. Of course, the assumption is that it still needs to pass the tests.
See this previous PR: I think we keep a relatively open bound, but pyarrow bumps the major version with almost every release. kedro-org/kedro#1057
Or we should include 10.0.0 at least?
kedro-datasets/setup.py
Outdated
@@ -73,6 +73,9 @@ def _collect_requirements(requires):
     "spark.SparkJDBCDataSet": [SPARK, HDFS, S3FS],
     "spark.DeltaTableDataSet": [SPARK, HDFS, S3FS, "delta-spark~=1.0"],
 }
+snowpark_require = {
+    "snowflake.SnowParkDataSet": ["snowflake-snowpark-python~=1.0.0", "pyarrow>=8.0, <9.0"]
Does it only work with 8.0? This is equivalent to pyarrow==8.0.0, since the next version after 8.0.0 is 9.0.0.
I tested snowpark with pyarrow 10.0.0 and it is incompatible; the upper bound is actually more strict: "please install a version that adheres to: 'pyarrow<8.1.0,>=8.0.0'"
Version 9.0.0 doesn't work, so I think we can be more specific and adjust to pyarrow==8.0.0
Cool, that's fine, thanks for checking that 🙏. Do you have an idea why it is not working? I think it would be nice to leave a comment there if we know a certain API is not compatible.
Sure! When creating the snowpark session, it shows a warning if the pyarrow version is different from 8.0.0 ("please install a version that adheres to: 'pyarrow<8.1.0,>=8.0.0'"), and there are potential crashes when interacting with the dataframe API.
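If it helps, the pin plus that explanatory comment could look like this in setup.py (a sketch based on the warning above, not final wording):

    snowpark_require = {
        "snowflake.SnowParkDataSet": [
            "snowflake-snowpark-python~=1.0.0",
            # snowflake-snowpark-python requires pyarrow<8.1.0,>=8.0.0; newer
            # versions warn at session creation and can crash in the dataframe API
            "pyarrow~=8.0",
        ]
    }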
reuse database and warehouse from credentials if not provided with dataset
Improved credentials handling
Add pyarrow dependency
kedro-datasets/test_requirements.txt
Outdated
@@ -34,7 +34,7 @@ Pillow~=9.0
 plotly>=4.8.0, <6.0
 pre-commit>=2.9.2, <3.0  # The hook `mypy` requires pre-commit version 2.9.2.
 psutil==5.8.0
-pyarrow>=1.0, <7.0
+pyarrow==8.0
Suggested change:
-pyarrow==8.0
+pyarrow~=8.0
>>> database: meteorology
>>> schema: observations
>>> credentials: db_credentials
>>> load_args (WIP):
You don't need it in the example, but it would be good to get examples of doing this form of credentials, and also the externalbrowser SSO approach :)
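For instance, the two styles of credentials.yml entry might look like this (a sketch; the exact keys the dataset will accept are still to be decided, and the account value here is hypothetical):

    db_credentials:
      account: my_account.eu-west-1   # hypothetical account identifier
      user: analyst
      password: my_password
      warehouse: my_warehouse

    db_credentials_sso:
      account: my_account.eu-west-1
      user: analyst@example.com
      authenticator: externalbrowser  # Snowflake's browser-based SSO flow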
SnowParkDataSet tests
SnowParkDataSet documentation
Hey everyone. We're wrapping up the SnowParkDataSet connector and I want to get feedback from the community before finalising the PR. There are other ways of doing this, so below is our thinking process and why we concluded that this pattern seems to be a good fit.

Using a hook like the pyspark dataset?
We could open the connection to Snowflake from a hook, as the pyspark dataset does. But a hook seems to give no value to a snowpark connector (as opposed to pyspark, where a hook makes sense), at the cost of complicating the project structure.

Pipeline uses Snowflake only at very late stages
If we imagine a kedro pipeline that does some file data preprocessing (hours?) and saves data into Snowflake only at later stages, the connection to Snowflake would be open all that time (hours?) before actually being used. To address this we can initiate the connection in a lazy fashion (a sketch follows at the end of this comment).

Snowflake costs consideration
Just opening a connection to Snowflake (from snowpark) does not wake the Snowflake virtual warehouse you provided in the connection string. So the user is not charged for the connection being open, and from this perspective there is no drawback to opening the connection in advance of actually using Snowflake data.

Appreciate feedback on the thinking above and on whether you think we should change the implementation of opening the connection for Snowpark. @datajoely @noklam @marrrcin
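Here is the sketch of the lazy approach mentioned above (illustrative only; names and structure may differ in the final dataset):

    from snowflake.snowpark import Session

    class SnowParkDataSet:
        # class-level session so all snowpark datasets share one connection
        _session = None

        def __init__(self, table_name: str, credentials: dict):
            self._table_name = table_name
            self._credentials = credentials  # stored eagerly, connection deferred

        def _get_session(self) -> Session:
            # the connection is opened on first _load/_save,
            # not when the catalog is created
            if SnowParkDataSet._session is None:
                SnowParkDataSet._session = Session.builder.configs(self._credentials).create()
            return SnowParkDataSet._session

        def _load(self):
            return self._get_session().table(self._table_name)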
Multiple CircleCI checks have failed and we have some questions; here's a summary:

1. lint and unit-tests 3.8: it seems there was a timeout (10 min) when installing requirements (particularly sqlalchemy). I am not sure this is caused by our commits. Have you seen this before? Can the timeout maybe be extended beyond 10 min?

@datajoely @noklam, let us know your thoughts, thanks!
Hi @Vladimir-Filimonov - thank you for the hard work! I've checked with Kedro's Tech Lead @idanov that we don't need a hook here, so your proposal is perfect. @heber-urdaneta, can you run the check locally?
Regarding this part:
Is there a middle ground where we do an eager credentials check and a lazy data load/save?
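Perhaps something along these lines (a sketch; the required keys are an assumption and would differ between password and SSO credentials):

    from kedro.io.core import DataSetError

    def _validate_credentials(credentials: dict) -> None:
        # eager check at catalog creation: fail fast on obviously bad config
        # without opening a Snowflake connection yet
        required = {"account", "user"}  # assumed minimal set
        missing = required - credentials.keys()
        if missing:
            raise DataSetError(f"Missing Snowflake credentials: {sorted(missing)}")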
Are we far away from marking this as ready for review?
Signed-off-by: heber-urdaneta <98349957+heber-urdaneta@users.noreply.github.com>
@@ -34,7 +34,7 @@ min-public-methods = 1
 [tool.coverage.report]
 fail_under = 100
 show_missing = true
-omit = ["tests/*", "kedro_datasets/holoviews/*"]
+omit = ["tests/*", "kedro_datasets/holoviews/*", "kedro_datasets/snowflake/*"]
Why exclude?
This was to exclude the snowpark tests from the coverage report and avoid coverage-percentage errors.
To execute the snowpark tests we need a live connection to a Snowflake account, so we planned to keep them separate from the main tests and only execute them when triggering tests with a snowflake marker (we added make test-snowflake-only). While test execution is skipped if no snowflake marker is provided, it still affects the coverage report, which expects 100%; that is why I added the path to the report's omit list.
Let us know your thoughts and if there's any other way around it, and we can discuss further.
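For reference, the marker gating can be wired up roughly like this in conftest.py (a sketch; our actual Makefile/conftest wiring may differ):

    import pytest

    def pytest_collection_modifyitems(config, items):
        # with `-m snowflake` (e.g. via make test-snowflake-only) the marked
        # tests run; otherwise skip them, since they need a live account
        if "snowflake" in (config.getoption("-m") or ""):
            return
        skip_snowflake = pytest.mark.skip(reason="requires a live Snowflake connection")
        for item in items:
            if "snowflake" in item.keywords:
                item.add_marker(skip_snowflake)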
No it's looking good - makes sense :)
The ready-for-review PR was opened here: #104. Due to the need to fix DCO issues, we had to make a clean start and sign all commits.
Description
Moving PR from main kedro repo to kedro-plugins
Development notes
Adding snowpark dataset and tests
Checklist
RELEASE.md file