
Draft PR for issue 1946 - snowflake integration #78

Closed
wants to merge 15 commits

Conversation

heber-urdaneta

Description

Moving PR from main kedro repo to kedro-plugins

Development notes

Adding snowpark dataset and tests

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Contributor

@datajoely left a comment

It's really coming together @heber-urdaneta, really good work!

raise DataSetError("'database' argument cannot be empty.")

if not schema:
raise DataSetError("'schema' argument cannot be empty.")
Contributor

Is there not a default schema? I'm all for explicit, but I think the underlying API will pick a default IIRC

@sfc-gh-mgorkow Nov 23, 2022

Not necessarily. When you create a user, you can define the default namespace, but the default value is NULL:
https://docs.snowflake.com/en/sql-reference/sql/create-user.html

Contributor

Thanks @sfc-gh-mgorkow then we'll be conservative here!


sp_df.write.mode(self._save_args["mode"]).save_as_table(
    table_name,
    column_order=self._save_args["column_order"],
Contributor

let's just do **self._save_args :)
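A minimal, runnable illustration of the kwargs-forwarding pattern being suggested, using a stand-in function rather than the real Snowpark writer (the function signature and values here are assumptions for illustration, not the dataset's actual code):

```python
# Stand-in for DataFrameWriter.save_as_table, just to show the forwarding pattern.
def save_as_table(table_name, *, mode=None, column_order="index"):
    print(f"writing {table_name!r} with mode={mode!r}, column_order={column_order!r}")


save_args = {"mode": "overwrite", "column_order": "name"}

# Instead of spelling out each keyword argument, forward the whole dict:
save_as_table("weather_observations", **save_args)
```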

@@ -34,7 +34,7 @@ Pillow~=9.0
plotly>=4.8.0, <6.0
pre-commit>=2.9.2, <3.0 # The hook `mypy` requires pre-commit version 2.9.2.
psutil==5.8.0
pyarrow>=1.0, <7.0
pyarrow>=1.0, <9.0
Contributor

This will need sign-off from the wider team if it affects other pieces of the framework - hopefully not an issue.

Contributor

@noklam @merelcht do you have a view here? What's the process for this sort of change?

Contributor

I would just remove the upper bound; I expect it mostly works with pandas, and pandas has an open pin pyarrow>6.0.

Of course, the assumption is that it still passes the tests.

See this previous PR - I think we keep a relatively open bound, but pyarrow bumps the major version for almost every release.
kedro-org/kedro#1057

Contributor

Or should we include 10.0.0 at least?

@@ -73,6 +73,9 @@ def _collect_requirements(requires):
"spark.SparkJDBCDataSet": [SPARK, HDFS, S3FS],
"spark.DeltaTableDataSet": [SPARK, HDFS, S3FS, "delta-spark~=1.0"],
}
snowpark_require = {
"snowflake.SnowParkDataSet": ["snowflake-snowpark-python~=1.0.0", "pyarrow>=8.0, <9.0"]
Contributor

Does it only work with 8.0? This is equivalent to pyarrow==8.0.0, since the next version after 8.0.0 is 9.0.0.

Author

I tested snowpark with pyarrow 10.0.0 and it is incompatible; the upper bound is actually stricter: "please install a version that adheres to: 'pyarrow<8.1.0,>=8.0.0'"

Contributor

How about 9.0.0?

You are right technically, but this is the pyarrow release history: [screenshot of pyarrow releases]

Author

Version 9.0.0 doesn't work, so I think we can be more specific and adjust to pyarrow==8.0.0

Contributor

Cool, that's fine, thanks for checking that 🙏. Do you have an idea why it is not working? I think it would be nice to leave a comment there if we know a certain API is not compatible.

Author

Sure! When creating the Snowpark session it shows a warning if the pyarrow version is different from 8.0.0 ("please install a version that adheres to: 'pyarrow<8.1.0,>=8.0.0'"), and there are potential crashes when interacting with the DataFrame API.

heber-urdaneta and others added 4 commits November 28, 2022 22:03
reuse database and warehouse from credentials if not provided with dataset
Improved credentials handling
Add pyarrow dependency
@@ -34,7 +34,7 @@ Pillow~=9.0
plotly>=4.8.0, <6.0
pre-commit>=2.9.2, <3.0 # The hook `mypy` requires pre-commit version 2.9.2.
psutil==5.8.0
pyarrow>=1.0, <7.0
pyarrow==8.0
Contributor

Suggested change
pyarrow==8.0
pyarrow~=8.0

>>> database: meteorology
>>> schema: observations
>>> credentials: db_credentials
>>> load_args (WIP):
Contributor

Suggested change
>>> load_args (WIP):
>>> load_args (WIP):

You don't need it in the example, but it would be good to get examples of this form of credentials, and also the externalbrowser SSO approach :)
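For reference, a hedged sketch (not code from this PR) of what the two credential styles could look like when handed to Snowpark directly; the keys follow snowflake-snowpark-python's standard connection parameters and all values are purely illustrative:

```python
from snowflake.snowpark import Session

# Username/password credentials, roughly what the catalog example above maps to:
password_credentials = {
    "account": "my_account",        # illustrative values
    "user": "my_user",
    "password": "my_password",
    "warehouse": "my_warehouse",
    "database": "meteorology",
    "schema": "observations",
}

# SSO via the external browser authenticator instead of a password:
sso_credentials = {
    "account": "my_account",
    "user": "my_user",
    "authenticator": "externalbrowser",
    "warehouse": "my_warehouse",
    "database": "meteorology",
    "schema": "observations",
}

session = Session.builder.configs(sso_credentials).create()
```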

@Vladimir-Filimonov
Contributor

Hey everyone. We're wrapping up the Snowpark dataset connector and I want to get feedback from the community before finalising the PR.
We implemented the connection to Snowflake via Kedro's standard use of credentials, and the connection to Snowflake is opened at the moment of catalog initialisation. The same pattern is used for the pandas SQL dataset.

But there are other ways of doing this, so below is the thinking process and why we concluded that this pattern seems to be a good fit.

Using a hook like the pyspark dataset?

We could open the connection to Snowflake using an after_context_created hook, like the pyspark starter does. But hooks are fired AFTER catalog & context initialisation, so the connection would be opened later than in the current approach (no benefit of starting it earlier), at the cost of complicating project configuration (the user has to use a starter, and for users combining e.g. AWS + Snowflake, credentials get split between the credentials file and a separate snowpark.yml if we follow the analogy of the pyspark dataset).

So a hook seems to add no value for a Snowpark connector (as opposed to pyspark, where a hook makes sense), at the cost of complicating the project structure.
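For context, a rough sketch of the hook-based alternative being set aside here, modelled loosely on the pyspark pattern (class and parameter names are hypothetical and not part of this PR):

```python
from kedro.framework.hooks import hook_impl
from snowflake.snowpark import Session


class SnowparkHooks:
    """Hypothetical hook that would own the Snowflake connection."""

    def __init__(self, connection_params: dict):
        self._connection_params = connection_params
        self._session = None

    @hook_impl
    def after_context_created(self, context) -> None:
        # Fires only AFTER catalog & context initialisation, so the connection
        # cannot be opened any earlier than with the current approach anyway.
        self._session = Session.builder.configs(self._connection_params).create()
```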

Pipeline uses Snowflake only at very late stages

If we imagine a kedro pipeline that does some file data preprocessing (hours?) and saves data into Snowflake only at later stages, the connection to Snowflake will be open all that time (hours?) before it is actually used.

To address this we could initiate the connection lazily: load, save and exists would check whether a connection is open and, if not, open it on first use (see the sketch below). This avoids a connection being open without use, BUT has the downside that if the connection credentials are wrong, we only find out when the pipeline reaches the stage that actually uses Snowflake - which might be late in the pipeline and hurt the user experience.
Based on this thinking I consider this a suboptimal design.
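For illustration, a minimal sketch of the lazy alternative described above (all names are hypothetical; this is not the code in the PR):

```python
from snowflake.snowpark import Session


class LazySnowparkTableDataSet:
    """Hypothetical dataset that only connects on first use."""

    def __init__(self, table_name: str, credentials: dict):
        self._table_name = table_name
        self._credentials = credentials
        self._session = None  # nothing is opened at catalog initialisation

    def _get_session(self) -> Session:
        # The first call from _load/_save/_exists opens the connection, so bad
        # credentials only surface at this point, possibly hours into a run.
        if self._session is None:
            self._session = Session.builder.configs(self._credentials).create()
        return self._session

    def _load(self):
        return self._get_session().table(self._table_name)
```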

Snowflake costs consideration

Just opening a connection to Snowflake (from Snowpark) does not wake the Snowflake virtual warehouse you provided in the connection string. So the user is not charged for the connection being open, and from this perspective there are no drawbacks to opening the connection in advance of actually using Snowflake data.

Appreciate feedback on the thinking above and whether you think we should change how the connection is opened for Snowpark. @datajoely @noklam @marrrcin

@heber-urdaneta
Author

Multiple CircleCI checks have failed and we have some questions; here's a summary:

1. lint and unit-tests 3.8: it seems there was a timeout (10m) when installing requirements (particularly sqlalchemy) -> I'm not sure this is caused by our commits. Have you seen this before? Can the timeout maybe be extended beyond 10m?
2. win-unit-test 3.8: fails when installing a GDAL-related library -> this also doesn't look like it's caused by our changes; have you seen this occur before?
3. tests with Python versions other than 3.8: for the time being, snowpark only works with Python 3.8 -> not sure how to proceed; could those tests be ignored, under the acknowledgement that only version 3.8 should be used for snowpark?

@datajoely @noklam, let us know your thoughts, thanks!

@datajoely
Contributor

Hi @Vladimir-Filimonov -

Thank you for the hard work. I've checked with Kedro's Tech Lead @idanov that we don't need a hook here, so your proposal is perfect.

@heber-urdaneta if you run the checks locally using the makefile, do they pass?

@datajoely
Contributor

Regarding this part

To address this we could initiate the connection lazily: load, save and exists would check whether a connection is open and, if not, open it on first use. This avoids a connection being open without use, BUT has the downside that if the connection credentials are wrong, we only find out when the pipeline reaches the stage that actually uses Snowflake - which might be late in the pipeline and hurt the user experience.
Based on this thinking I consider this a suboptimal design.

Is there a middle ground where we do an eager credentials check and a lazy data load/save? Perhaps this is where an after_context_created hook would be useful?
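A rough sketch of that middle ground, assuming a cheap "probe" connection at catalog initialisation is acceptable (everything here is illustrative, not a proposal for the final API):

```python
from snowflake.snowpark import Session


class EagerCheckLazyLoadDataSet:
    """Hypothetical dataset: fail fast on bad credentials, connect lazily for data."""

    def __init__(self, table_name: str, credentials: dict):
        self._table_name = table_name
        self._credentials = credentials
        self._session = None
        self._validate_credentials()  # eager check at catalog initialisation

    def _validate_credentials(self) -> None:
        # Open a short-lived session just to surface bad credentials early,
        # then close it so nothing stays open while earlier nodes run.
        probe = Session.builder.configs(self._credentials).create()
        probe.close()

    def _load(self):
        if self._session is None:  # the real connection is opened on first use
            self._session = Session.builder.configs(self._credentials).create()
        return self._session.table(self._table_name)
```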

@datajoely
Contributor

Are we far away from marking this as ready for review?

Signed-off-by: heber-urdaneta <98349957+heber-urdaneta@users.noreply.github.com>
@@ -34,7 +34,7 @@ min-public-methods = 1
[tool.coverage.report]
fail_under = 100
show_missing = true
omit = ["tests/*", "kedro_datasets/holoviews/*"]
omit = ["tests/*", "kedro_datasets/holoviews/*", "kedro_datasets/snowflake/*"]
Contributor

Why exclude?

Author

This was to exclude the execution of snowpark tests from the coverage report and avoid % coverage errors.

To execute the snowpark tests we need a live connection to a Snowflake account, so we planned to keep them separate from the main tests and only execute them when triggering tests with a snowflake marker (we added make test-snowflake-only). While test execution is skipped if no snowflake marker is provided, it still affects the coverage report, which expects 100% - which is why I added it to the omit list for the report.

Let us know your thoughts and if there's another way around it, and we can discuss further.
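For clarity, a minimal sketch of how marker-based selection like this typically works with pytest (the snowflake marker name matches the description above; the deselect-by-default configuration is an assumption about the setup, not a quote from this PR):

```python
import pytest


@pytest.mark.snowflake
def test_snowpark_save_and_load_roundtrip():
    # Needs a live Snowflake account, so it is kept out of the default run
    # (e.g. via addopts = -m "not snowflake" in the pytest config, assumed here)
    # and omitted from the 100% coverage gate.
    ...
```

Under that assumed configuration, `pytest -m snowflake` (or the make test-snowflake-only target) would select these tests explicitly, while the default run deselects them.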

Contributor

No it's looking good - makes sense :)

@Vladimir-Filimonov
Contributor

The ready-for-review PR was opened here: #104. Due to the need to fix DCO issues we had to make a clean start and sign all commits.
