
Snowpark (Snowflake) dataset for kedro #104

Merged: 21 commits into kedro-org:main on Mar 9, 2023

Conversation

@Vladimir-Filimonov (Contributor) commented Jan 23, 2023

Description

Ready-for-review PR summarising the work and discussions that happened as part of #78 (we needed a clean start to get all git commits signed properly).

This PR:

  • Implements Snowpark dataset as a connector to Snowflake data
  • Implements unit tests for the connector

This allows kedro users to work with Snowflake data using Snowpark dataframes, which attempt to mimic the pyspark dataframe interface.

Development notes

The Snowpark package from Snowflake works only with Python 3.8 link. It also requires a higher version of pyarrow, so we had to bump the version in the requirements.

How to run tests

To run the tests you need a Snowflake instance to run them against.
Under kedro-datasets/tests/snowflake you can find a README explaining how to run the tests locally, along with guidance on what permissions a Snowflake user needs in order for the tests to execute successfully.

Snowpark-related tests are excluded from the default pytest scope. The Snowpark dataset class is also excluded from the test coverage report (since the tests don't run by default, they would otherwise lower the overall coverage figure).
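The exclusion mechanism itself isn't shown in this thread; a minimal sketch of one common way to do it, assuming a `conftest.py` hook and an opt-in `--snowflake` flag (both assumptions, the PR's actual mechanism may differ), would be:

```python
# conftest.py sketch -- an assumption about how Snowpark tests could be
# excluded from the default pytest scope; not code taken from this PR.
import pytest


def pytest_addoption(parser):
    # Opt-in flag: Snowpark tests only run when explicitly requested.
    parser.addoption(
        "--snowflake",
        action="store_true",
        default=False,
        help="run tests that need a live Snowflake instance",
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--snowflake"):
        return  # flag given: run everything, including Snowpark tests
    skip_marker = pytest.mark.skip(reason="needs --snowflake and a live instance")
    for item in items:
        # Skip anything collected from the snowflake test directory.
        if "snowflake" in str(item.fspath):
            item.add_marker(skip_marker)
```

The coverage exclusion would typically be a separate `omit` entry in the coverage configuration, kept in sync with this skip rule.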

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Vladimir-Filimonov and others added 4 commits January 23, 2023 17:35
Signed-off-by: Vladimir Filimonov <vladimir_filimonov@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Snowflake/snowpark dataset implementation
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@heber-urdaneta commented Feb 3, 2023

@deepyaman I addressed the comments and pushed a couple of commits, but I see the lint check failing now. I don't think it's related to the Snowpark changes, can you confirm? Thanks!
FYI, I've run lint locally without errors.

@merelcht merelcht added Community Issue/PR opened by the open-source community and removed Community Issue/PR opened by the open-source community labels Feb 6, 2023
@AhdraMeraliQB (Contributor)

@heber-urdaneta Does the lint still fail if you pull the latest changes from main?

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@heber-urdaneta

@AhdraMeraliQB thanks! Most errors were fixed, but I had to push an additional commit to fix the video_dataset, hope that's fine! All checks pass now.

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@datajoely (Contributor)

datajoely commented Feb 23, 2023

Hi @Vladimir-Filimonov, thank you again for your patience on this. We've got together as a team and reached a consensus on the right way forward.

We make an effort not to extend the underlying API of a dataset, and this is why we're a little uncomfortable supporting the pd.DataFrame -> sp.DataFrame -> Snowflake table journey. The user value is obvious, but we'd prefer to be a little less magic in the short term, especially since we can add this later if users ask for it.

Asks for you:

  • Please remove the pd.DataFrame parts of the class so this is a pure Snowpark class. I've also realised pandas isn't part of the setup.py requirements, so this would have failed if the user just ran pip install kedro[snowflake.SnowparkTableDataSet], which is another reason to keep this simple!
  • (Optional) Would you mind including a YAML example in your docstring which demonstrates that the user can use externalbrowser as credentials for SSO login?
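A docstring YAML example along the lines requested might look like the sketch below. Only `table_name`, `database`, `schema`, the entry names "weather"/"polygons", and the `user`/`authenticator: externalbrowser` keys come from this thread; the account value, credentials-key name, and concrete database/schema values are illustrative assumptions.

```yaml
# catalog.yml -- illustrative sketch, not taken from the PR
weather:
  type: snowflake.SnowparkTableDataSet
  table_name: weather_data
  database: meteorology
  schema: observations
  credentials: snowflake_sso

# credentials.yml -- SSO login via external browser
snowflake_sso:
  account: "myaccount.eu-west-1"          # assumed placeholder
  user: "john_doe@wdomain.com"
  authenticator: "externalbrowser"        # opens a browser window for SSO
  database: meteorology
  schema: observations
```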

Roadmap for ourselves (or any community contributors who would like to get involved):

  • Once this is released, we recommend users write to Snowflake using the following preferred options:
    • pandas via pandas.SQLTableDataSet
    • spark via spark.JDBCTableDataSet
  • We also introduce snowflake.SnowparkQueryDataSet, which returns a sp.DataFrame on read and raises NotImplementedError on write. What's nice is we can steal all the good work you've done on authentication + existence checks.
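The proposed read-only dataset could be sketched roughly as follows. This is a hypothetical outline of the roadmap item, not code from this PR; the constructor arguments and the absence of kedro base-class wiring are assumptions made for brevity.

```python
# Hypothetical sketch of the roadmap item snowflake.SnowparkQueryDataSet:
# read-only -- returns a snowpark DataFrame on load, refuses to save.
from typing import Any, Optional


class SnowparkQueryDataSet:
    """Runs a SQL query against Snowflake and returns a sp.DataFrame."""

    def __init__(self, sql: str, credentials: Optional[dict] = None) -> None:
        self._sql = sql
        self._credentials = credentials or {}

    def _load(self) -> Any:
        # A real implementation would reuse the authentication and
        # existence-check work from this PR to build a snowpark Session,
        # then run session.sql(self._sql). Stubbed here.
        raise NotImplementedError("needs a live Snowflake session")

    def _save(self, data: Any) -> None:
        # Write is intentionally unsupported, per the roadmap.
        raise NotImplementedError("SnowparkQueryDataSet does not support save")
```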

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Remove pd interactions and add docs
@datajoely (Contributor) left a comment

Thank you, this is really, really looking good

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@AhdraMeraliQB AhdraMeraliQB self-assigned this Feb 27, 2023
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@datajoely (Contributor)

@merelcht I think this is ready for final review :)

@AhdraMeraliQB (Contributor) left a comment

Fantastic work @Vladimir-Filimonov ! Thanks for the contribution 🎉

@Vladimir-Filimonov (Contributor, Author)

> Fantastic work @Vladimir-Filimonov ! Thanks for the contribution 🎉

kudos to @heber-urdaneta !

@merelcht (Member) left a comment

Thank you so much for this contribution!! ⭐
I left some minor comments around the wording of the docs. Also, don't forget to update the release notes with this addition. See the previous notes for how we format dataset additions: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/RELEASE.md

Comment on lines 32 to 35 (Member):

    One can skip everything but "table_name" if database and
    schema provided via credentials. Therefore catalog entries can be shorter
    if ex. all used Snowflake tables live in same database/schema.
    Values in dataset definition take priority over ones defined in credentials

Suggested change:

    You can skip everything but "table_name" if the database and
    schema are provided via credentials. That way catalog entries can be shorter
    if, for example, all used Snowflake tables live in the same database/schema.
    Values in the dataset definition take priority over those defined in credentials.
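The precedence rule described in that docstring (dataset definition beats credentials) amounts to a dict merge. A minimal sketch, where `resolve_connection_params` is my own name rather than the PR's actual API:

```python
# Sketch of the documented precedence rule: values in the dataset
# definition override the same keys coming from credentials.
# (resolve_connection_params is a hypothetical name, not the PR's API.)
def resolve_connection_params(credentials: dict, dataset_definition: dict) -> dict:
    # In a dict merge the later source wins, so dataset_definition
    # takes priority over credentials for any shared key.
    return {**credentials, **dataset_definition}
```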

Comment on lines 38 to 41 (Member):

    Credentials file provides all connection attributes, catalog entry
    "weather" reuse credentials parameters, "polygons" catalog entry reuse
    all credentials parameters except providing different schema name.
    Second example of credentials file uses externalbrowser authentication

Suggested change:

    The credentials file provides all connection attributes; the "weather" catalog
    entry reuses the credentials parameters, and the "polygons" catalog entry reuses
    all credentials parameters except for providing a different schema name.
    The second example of the credentials file uses ``externalbrowser`` authentication.

Comment on the docstring credentials example (Member):

    user: "john_doe@wdomain.com"
    authenticator: "externalbrowser"

    As of Jan-2023, the snowpark connector only works with Python 3.8

I think it's worth putting this all the way at the top of the class docstring. I can imagine a lot of users would just skip reading the examples.

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@merelcht (Member) left a comment

Thanks again for this contribution! 🎉 I'll get it merged in.

@merelcht (Member)

merelcht commented Mar 8, 2023

@Vladimir-Filimonov I don't seem to be allowed to push changes to your branch. Could you please resolve the merge conflicts for the release notes? Then we can merge it in.

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@heber-urdaneta

> @Vladimir-Filimonov I don't seem to be allowed to push changes to your branch. Could you please resolve the merge conflicts for the release notes? Then we can merge it in.

@merelcht thanks for the note, the conflict has been resolved and the branch should be ready to merge!

@merelcht merelcht merged commit 3dac0d4 into kedro-org:main Mar 9, 2023
dannyrfar pushed a commit to dannyrfar/kedro-plugins that referenced this pull request Mar 13, 2023
* Add Snowpark datasets

Signed-off-by: Vladimir Filimonov <vladimir_filimonov@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
8 participants