
Snowpark (Snowflake) dataset for kedro #104

Merged: 21 commits into kedro-org:main on Mar 9, 2023

Conversation

@Vladimir-Filimonov (Contributor) commented Jan 23, 2023

Description

Ready-for-review PR summarising the work and discussions that happened as part of #78 (we needed a clean start to get all git commits signed properly).

This PR:

  • Implements Snowpark dataset as a connector to Snowflake data
  • Implements unit tests for the connector

This allows kedro users to work with Snowflake data using Snowpark dataframes, which attempt to mimic the pyspark dataframe interface.

Development notes

The Snowpark package from Snowflake works only with Python 3.8 link. It also requires a higher version of pyarrow, so we had to bump the version in the requirements.

How to run tests

To run the tests you need a Snowflake instance to run them against.
Under kedro-datasets/tests/snowflake you can find a README explaining how to run the tests locally, along with guidance on what permissions a Snowflake user needs in order for the tests to execute successfully.

Snowpark-related tests are excluded from the default pytest scope. The Snowpark dataset class is also excluded from the test coverage report (since the tests don't run by default, they would otherwise lower the overall coverage figure).
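The exclusion mechanism itself isn't shown in this thread; a minimal sketch of one common way to do it, assuming a `conftest.py` hook and an opt-in `--snowflake` flag (both assumptions, the PR's actual mechanism may differ), would be:

```python
# conftest.py sketch -- an assumption about how Snowpark tests could be
# excluded from the default pytest scope; not code taken from this PR.
import pytest


def pytest_addoption(parser):
    # Opt-in flag: Snowpark tests only run when explicitly requested.
    parser.addoption(
        "--snowflake",
        action="store_true",
        default=False,
        help="run tests that need a live Snowflake instance",
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--snowflake"):
        return  # flag given: run everything, including Snowpark tests
    skip_marker = pytest.mark.skip(reason="needs --snowflake and a live instance")
    for item in items:
        # Skip anything collected from the snowflake test directory.
        if "snowflake" in str(item.fspath):
            item.add_marker(skip_marker)
```

The coverage exclusion would typically be a separate `omit` entry in the coverage configuration, kept in sync with this skip rule.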

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Vladimir-Filimonov and others added 4 commits January 23, 2023 17:35
Signed-off-by: Vladimir Filimonov <vladimir_filimonov@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Snowflake/snowpark dataset implementation
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@heber-urdaneta commented Feb 3, 2023

@deepyaman I addressed the comments and pushed a couple of commits, but I see the lint check failing now. I don't think it's related to the Snowpark changes, can you confirm? Thanks!
FYI, I've run lint locally without errors.

@merelcht merelcht added Community Issue/PR opened by the open-source community and removed Community Issue/PR opened by the open-source community labels Feb 6, 2023
@AhdraMeraliQB (Contributor)

@heber-urdaneta Does the lint still fail if you pull the latest changes from main?

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@heber-urdaneta

@AhdraMeraliQB thanks! Most errors were fixed, but I had to push an additional commit to fix the video_dataset, hope that's fine! All checks pass now.

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@datajoely (Contributor)

datajoely commented Feb 23, 2023

Hi @Vladimir-Filimonov, thank you again for your patience on this. We've got together as a team and reached a consensus on the right way forward.

We make an effort not to extend the underlying API of a dataset, and this is why we're a little uncomfortable supporting the pd.DataFrame -> sp.DataFrame -> Snowflake table journey. The user value is obvious, but we'd prefer to be a little less magic in the short term, especially since we can add this later if users ask for it.

Asks for you:

  • Please remove the pd.DataFrame parts of the class so this is a pure Snowpark class. I've also realised pandas isn't part of the setup.py requirements, so this would have failed if the user just ran pip install kedro[snowflake.SnowparkTableDataSet], which is another reason to keep this simple!
  • (Optional) Would you mind including a YAML example in your docstring which demonstrates that the user can use externalbrowser as credentials for SSO login?
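A docstring YAML example along the lines requested might look like the sketch below. Only `table_name`, `database`, `schema`, the entry names "weather"/"polygons", and the `user`/`authenticator: externalbrowser` keys come from this thread; the account value, credentials-key name, and concrete database/schema values are illustrative assumptions.

```yaml
# catalog.yml -- illustrative sketch, not taken from the PR
weather:
  type: snowflake.SnowparkTableDataSet
  table_name: weather_data
  database: meteorology
  schema: observations
  credentials: snowflake_sso

# credentials.yml -- SSO login via external browser
snowflake_sso:
  account: "myaccount.eu-west-1"          # assumed placeholder
  user: "john_doe@wdomain.com"
  authenticator: "externalbrowser"        # opens a browser window for SSO
  database: meteorology
  schema: observations
```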

Roadmap for ourselves (or any community contributors who would like to get involved):

  • Once this is released, we recommend users write to Snowflake using the following preferred options:
    • pandas via pandas.SQLTableDataSet
    • spark via spark.JDBCTableDataSet
  • We also introduce snowflake.SnowparkQueryDataSet, which returns a sp.DataFrame on read and raises NotImplementedError on write. What's nice is we can steal all the good work you've done on authentication + existence checks.
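The proposed read-only dataset could be sketched roughly as follows. This is a hypothetical outline of the roadmap item, not code from this PR; the constructor arguments and the absence of kedro base-class wiring are assumptions made for brevity.

```python
# Hypothetical sketch of the roadmap item snowflake.SnowparkQueryDataSet:
# read-only -- returns a snowpark DataFrame on load, refuses to save.
from typing import Any, Optional


class SnowparkQueryDataSet:
    """Runs a SQL query against Snowflake and returns a sp.DataFrame."""

    def __init__(self, sql: str, credentials: Optional[dict] = None) -> None:
        self._sql = sql
        self._credentials = credentials or {}

    def _load(self) -> Any:
        # A real implementation would reuse the authentication and
        # existence-check work from this PR to build a snowpark Session,
        # then run session.sql(self._sql). Stubbed here.
        raise NotImplementedError("needs a live Snowflake session")

    def _save(self, data: Any) -> None:
        # Write is intentionally unsupported, per the roadmap.
        raise NotImplementedError("SnowparkQueryDataSet does not support save")
```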

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Remove pd interactions and add docs
@datajoely (Contributor) left a comment

Thank you, this is really, really looking good

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@AhdraMeraliQB AhdraMeraliQB self-assigned this Feb 27, 2023
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@datajoely (Contributor)

@merelcht I think this is ready for final review :)

@AhdraMeraliQB (Contributor) left a comment

Fantastic work @Vladimir-Filimonov ! Thanks for the contribution 🎉

@Vladimir-Filimonov (Contributor, Author)

> Fantastic work @Vladimir-Filimonov ! Thanks for the contribution 🎉

kudos to @heber-urdaneta !

@merelcht (Member) left a comment

Thank you so much for this contribution!! ⭐
I left some minor comments around the wording of the docs. Also, don't forget to update the release notes with this addition. See the previous notes for how we format dataset additions: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/RELEASE.md

Comment on lines 32 to 35 (Member):

    One can skip everything but "table_name" if database and
    schema provided via credentials. Therefore catalog entries can be shorter
    if ex. all used Snowflake tables live in same database/schema.
    Values in dataset definition take priority over ones defined in credentials

Suggested change:

    You can skip everything but "table_name" if the database and
    schema are provided via credentials. That way catalog entries can be shorter
    if, for example, all used Snowflake tables live in the same database/schema.
    Values in the dataset definition take priority over those defined in credentials.
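The precedence rule described in that docstring (dataset definition beats credentials) amounts to a dict merge. A minimal sketch, where `resolve_connection_params` is my own name rather than the PR's actual API:

```python
# Sketch of the documented precedence rule: values in the dataset
# definition override the same keys coming from credentials.
# (resolve_connection_params is a hypothetical name, not the PR's API.)
def resolve_connection_params(credentials: dict, dataset_definition: dict) -> dict:
    # In a dict merge the later source wins, so dataset_definition
    # takes priority over credentials for any shared key.
    return {**credentials, **dataset_definition}
```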

Comment on lines 38 to 41 (Member):

    Credentials file provides all connection attributes, catalog entry
    "weather" reuse credentials parameters, "polygons" catalog entry reuse
    all credentials parameters except providing different schema name.
    Second example of credentials file uses externalbrowser authentication

Suggested change:

    The credentials file provides all connection attributes; the "weather" catalog
    entry reuses the credentials parameters, and the "polygons" catalog entry reuses
    all credentials parameters except for providing a different schema name.
    The second example of the credentials file uses ``externalbrowser`` authentication.

Comment on the docstring credentials example (Member):

    user: "john_doe@wdomain.com"
    authenticator: "externalbrowser"

    As of Jan-2023, the snowpark connector only works with Python 3.8

I think it's worth putting this all the way at the top of the class docstring. I can imagine a lot of users would just skip reading the examples.

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@merelcht (Member) left a comment

Thanks again for this contribution! 🎉 I'll get it merged in.

@merelcht (Member)

merelcht commented Mar 8, 2023

@Vladimir-Filimonov I don't seem to be allowed to push changes to your branch. Could you please resolve the merge conflicts for the release notes? Then we can merge it in.

Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
@heber-urdaneta

> @Vladimir-Filimonov I don't seem to be allowed to push changes to your branch. Could you please resolve the merge conflicts for the release notes? Then we can merge it in.

@merelcht thanks for the note, the conflict has been resolved and the branch should be ready to merge!

@merelcht merelcht merged commit 3dac0d4 into kedro-org:main Mar 9, 2023
dannyrfar pushed a commit to dannyrfar/kedro-plugins that referenced this pull request Mar 13, 2023
* Add Snowpark datasets

Signed-off-by: Vladimir Filimonov <vladimir_filimonov@mckinsey.com>
Signed-off-by: heber-urdaneta <heber_urdaneta@mckinsey.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
8 participants