
Snowflake Data Connectors (SnowPark) #108

Closed
yetudada opened this issue Oct 18, 2022 · 13 comments

Comments

@yetudada
Contributor

Description

I think there's scope to create a series of data connectors that would allow Kedro users to connect to Snowflake in different ways. This usage pattern was identified in the kedro-org/kedro#1653 research, which found that some of our users want to leverage SQL-based workflows for their data engineering pipelines. These connectors would essentially simplify the use of Python where it is needed for the data science part of a workflow.

While I have created this issue, I think it's important to document why we have seen users create their own Snowflake datasets rather than leverage our pandas.SQLTableDataSet and pandas.SQLQueryDataSet for the same purpose, especially for Pandas-based workflows.

Possible Implementation

This task proposes building out the following data connectors:

  • spark.SnowflakeTableDataSet - Loads and saves data from/to a table as a Spark DataFrame.
  • spark.SnowflakeQueryDataSet - Executes a SQL query and loads the result as a Spark DataFrame.
  • pandas.SnowflakeTableDataSet - Loads and saves data from/to a table as a Pandas DataFrame.
  • pandas.SnowflakeQueryDataSet - Executes a SQL query and loads the result as a Pandas DataFrame.
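For context, connectors like these would be declared in a Kedro catalog.yml roughly as follows. This is a hypothetical sketch: the type names come from the list above, and the parameter and credential keys are assumptions modelled on existing kedro-datasets conventions, not a released API.

```yaml
# catalog.yml -- hypothetical entries for the proposed connectors
weather_table:
  type: pandas.SnowflakeTableDataSet   # proposed above, not yet released
  table_name: WEATHER
  credentials: snowflake_creds

weather_2022:
  type: pandas.SnowflakeQueryDataSet   # proposed above, not yet released
  sql: SELECT * FROM WEATHER WHERE YEAR = 2022
  credentials: snowflake_creds
```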
@yetudada yetudada added the good first issue Good for newcomers label Oct 18, 2022
@datajoely
Contributor

datajoely commented Oct 18, 2022

So I've been wanting to work on this for a while - in my opinion the right way to approach this is the new Snowpark API, which exposes an almost identical interface to Spark DataFrames (context hook, DataFrame class).

https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html

@Vladimir-Filimonov
Contributor

Implementing via Snowpark is tempting, but I would be cautious about using Snowpark Python for production workloads as it is still in preview.

@Vladimir-Filimonov
Contributor

Never mind, Snowpark Python 1.0 just arrived: https://pypi.org/project/snowflake-snowpark-python/

@datajoely
Contributor

Excellent news!

I'm a big proponent of going the Snowpark route - to do this right I think we need to do three things, in the following sequence, as separate PRs:

  1. We need to implement SnowParkDataSet, this should mostly be a copy paste of SparkDataSet. It will be a much simpler implementation as we don't need to worry about file paths, just table schemas and names.

  2. We need to introduce a Kedro starter that works the same way as the pyspark one we have today. This will give users a ready-to-go Snowflake example via kedro new --starter=snowflake. We would also need to add it to the default starter scope here.

  3. We would then need some lightweight docs explaining how to use it!

  4. (Extra credit) ensure transcoding to pandas works okay :)
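A minimal sketch of what step 1 might look like. In kedro-datasets this would subclass AbstractDataSet; it is written standalone here so the shape of _load/_save is visible without Kedro or Snowflake installed, and the class and argument names are illustrative, not the shipped API.

```python
# Hypothetical sketch of a Snowpark-backed table dataset (not the
# actual kedro-datasets implementation).
from typing import Any, Optional


class SnowparkTableDataSet:
    """Loads and saves a Snowflake table as a Snowpark DataFrame."""

    def __init__(self, table_name: str, session: Any,
                 save_args: Optional[dict] = None) -> None:
        self._table_name = table_name
        self._session = session  # a snowflake.snowpark.Session (or a test double)
        self._save_args = save_args or {"mode": "overwrite"}

    def _load(self) -> Any:
        # Snowpark mirrors Spark: session.table() returns a lazy DataFrame.
        return self._session.table(self._table_name)

    def _save(self, data: Any) -> None:
        # Snowpark DataFrames expose write.save_as_table(), much like Spark.
        data.write.save_as_table(self._table_name, **self._save_args)

    def _exists(self) -> bool:
        # No file paths to worry about -- only whether the table resolves.
        try:
            self._session.table(self._table_name)
            return True
        except Exception:
            return False
```

Because there is no filesystem path involved, _exists reduces to a table lookup, which is why this ends up much simpler than SparkDataSet.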

@datajoely
Contributor

I hope to eventually get to this, but any help would be greatly appreciated!

@datajoely datajoely changed the title Snowflake Data Connectors Snowflake Data Connectors (SnowPark) Nov 2, 2022
@marrrcin

@datajoely We (GetInData) have recently researched the topic of Snowpark + Kedro extensively and we will be happy to take over this feature and help to implement it.
Let's discuss the details 🙂

@datajoely
Contributor

@marrrcin that's awesome - I connected with @Vladimir-Filimonov and @heber-urdaneta last week who have started work on the prototype. Are you able to raise your PRs even if they're in draft and we can discuss next steps together?

@heber-urdaneta

@datajoely sure, just created the draft PR (still in progress), we can discuss further

@datajoely
Contributor

Awesome - I've done a mini review and it's in a really good place.

@datajoely
Contributor

I'm still not sure how to write integration tests for this without an actual snowflake instance spun up.

Looking at some examples in the Snowflake Labs repo perhaps we can do it like this:
https://github.com/Snowflake-Labs/snowpark-devops/blob/main/tests/procedure_test.py

@Vladimir-Filimonov
Contributor

> I'm still not sure how to write integration tests for this without an actual snowflake instance spun up.
>
> Looking at some examples in the Snowflake Labs repo perhaps we can do it like this: https://github.com/Snowflake-Labs/snowpark-devops/blob/main/tests/procedure_test.py

We reached out to Snowflake folks asking whether we can expect any mocks to be provided as part of Snowpark to make unit tests easier, but it doesn't seem something like this can be expected near term. So I'm afraid that to cover methods like _save with unit tests we will have to use a real Snowflake instance. Let me push a proposal early next week of which tests can be done locally vs. which require real SF.
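Until such mocks exist, the session-dependent logic can at least be exercised locally by stubbing the Snowpark Session with unittest.mock. A sketch, in which save_pandas_to_snowflake is a hypothetical stand-in for the dataset's _save method, not the actual implementation under review:

```python
# Sketch: unit-testing Snowpark-dependent code without a real Snowflake
# instance by mocking the Session. save_pandas_to_snowflake is a
# hypothetical stand-in, not the actual kedro-datasets implementation.
from typing import Any
from unittest.mock import MagicMock


def save_pandas_to_snowflake(session: Any, pandas_df: Any, table_name: str) -> None:
    # Snowpark's Session.create_dataframe accepts a pandas DataFrame.
    snowpark_df = session.create_dataframe(pandas_df)
    snowpark_df.write.mode("overwrite").save_as_table(table_name)


def test_save_uses_session_correctly() -> None:
    session = MagicMock()
    fake_df = object()  # stands in for a pandas DataFrame
    save_pandas_to_snowflake(session, fake_df, "WEATHER")
    session.create_dataframe.assert_called_once_with(fake_df)
    writer = session.create_dataframe.return_value.write
    writer.mode.assert_called_once_with("overwrite")
    writer.mode.return_value.save_as_table.assert_called_once_with("WEATHER")


test_save_uses_session_correctly()
```

This only verifies that the right Session calls are made with the right arguments; end-to-end behaviour (types, schemas, actual writes) would still need the real-instance integration tests discussed above.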

@heber-urdaneta

@datajoely @marrrcin thanks for your comments on the PR!
@Vladimir-Filimonov took a shot at addressing the feedback, just created a new PR: kedro-org/kedro#2032

@merelcht merelcht transferred this issue from kedro-org/kedro Jan 26, 2023
@merelcht
Member

A Snowpark dataset was added in #104 and released in kedro-datasets 1.1.0.
