Snowflake Data Connectors (SnowPark) #108
Comments
So I've been wanting to work on this for a while - in my opinion the right way to approach this is to use the new Snowpark approach, which exposes an almost identical API to Spark DataFrames (context hook, DataFrame class). https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html
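To illustrate the "almost identical API" point: because Snowpark DataFrames mirror the Spark DataFrame surface (`session.table(...)`, `.filter(...)`, `.count()`), the same node function can in principle be written once and run against either engine. A minimal sketch, assuming `snowflake-snowpark-python` or `pyspark` provides the session; the table and column names here are hypothetical:

```python
def count_active_rows(session, table_name: str) -> int:
    """Works with either a pyspark or a snowflake.snowpark Session:
    both expose session.table(name) returning a DataFrame that
    supports .filter(<SQL condition string>) and .count().
    The table/column names are illustrative, not from the source."""
    df = session.table(table_name)
    return df.filter("is_active = true").count()
```

This duck-typed overlap is what makes a shared `spark.`/`pandas.`-style dataset design plausible.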
Implementing via Snowpark is tempting, but I would be cautious about using Snowpark Python for production workloads while it is still in preview.
Never mind, Snowpark Python 1.0 just arrived: https://pypi.org/project/snowflake-snowpark-python/
Excellent news! I'm a big proponent of going the Snowpark route - to do this right, I think we need to do three things, in the following sequence, as separate PRs.
I hope to eventually get to this, but any help would be greatly appreciated!
@datajoely We (GetInData) have recently researched the topic of Snowpark + Kedro extensively and we will be happy to take over this feature and help to implement it.
@marrrcin that's awesome - I connected with @Vladimir-Filimonov and @heber-urdaneta last week, who have started work on the prototype. Are you able to raise your PRs even if they're in draft, so we can discuss next steps together?
@datajoely sure, just created the draft PR (still in progress), we can discuss it further.
Awesome - I've done a mini review and it's in a really good place.
I'm still not sure how to write integration tests for this without an actual Snowflake instance spun up. Looking at some examples in the Snowflake Labs repo, perhaps we can do it like this:
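One common way to test dataset load/save logic without a live Snowflake instance is to inject a mocked session and assert on the calls it receives. A minimal sketch, assuming dependency injection of the session; the class here is a simplified, hypothetical stand-in, not the real plugin code:

```python
from unittest.mock import MagicMock


class SnowparkTableDataSet:
    """Simplified, hypothetical stand-in for the real dataset class."""

    def __init__(self, table_name: str, session):
        self._table_name = table_name
        self._session = session

    def _load(self):
        # Delegates to Snowpark's session.table(...), which returns a DataFrame.
        return self._session.table(self._table_name)


def test_load_reads_expected_table():
    session = MagicMock()
    ds = SnowparkTableDataSet(table_name="WEATHER", session=session)
    ds._load()
    # Verify the dataset asked the session for the right table.
    session.table.assert_called_once_with("WEATHER")
```

This only exercises the wiring, of course; it cannot substitute for integration tests against a real instance.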
We reached out to Snowflake folks asking if we can expect any mocks provided as part of Snowpark to make unit tests easier, but it doesn't seem that anything like this can be expected in the near term. So I'm afraid that for us to cover methods like
@datajoely @marrrcin thanks for your comments on the PR!
A Snowpark dataset was added in #104 and released in
Description
I think there's scope to create a series of data connectors that would allow Kedro users to connect to Snowflake in different ways. This usage pattern was identified in the kedro-org/kedro#1653 research: sometimes our users want to leverage SQL-based workflows for their data engineering pipelines. These connectors would essentially simplify the use of Python when you need it for the data science part of your workflow.
While I have created this issue, I think it's important to document why we have seen users create Snowflake datasets rather than leverage our pandas.SQLTableDataSet and pandas.SQLQueryDataSet to do the same, especially in the case of Pandas-based workflows.

Possible Implementation
This task proposes building out the following data connectors:
- spark.SnowflakeTableDataSet - loads and saves data from/to a table as a Spark DataFrame.
- spark.SnowflakeQueryDataSet - executes a SQL query on save(); can also load the table as a Spark DataFrame.
- pandas.SnowflakeTableDataSet - loads and saves data from/to a table as a Pandas DataFrame.
- pandas.SnowflakeQueryDataSet - executes a SQL query on save(); can also load the table as a Pandas DataFrame.
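Hypothetically, catalog entries for such connectors might look like the sketch below. This is a config sketch only: the dataset type paths, parameter names, table, and credentials key are all assumptions, not a released API.

```yaml
# Hypothetical catalog.yml entries; type paths and parameter names
# are assumptions for illustration, not the released dataset API.
weather_table:
  type: pandas.SnowflakeTableDataSet
  table_name: WEATHER
  credentials: snowflake_creds

weather_features:
  type: spark.SnowflakeQueryDataSet
  sql: SELECT * FROM WEATHER WHERE IS_ACTIVE = TRUE
  credentials: snowflake_creds
```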