
Snowflake Data Connectors (SnowPark) #108

Closed
yetudada opened this issue Oct 18, 2022 · 13 comments

Comments

@yetudada
Contributor

Description

I think there's scope to create a series of data connectors that would allow Kedro users to connect to Snowflake in different ways. This usage pattern was identified in the kedro-org/kedro#1653 research, which found that some of our users want to leverage SQL-based workflows for their data engineering pipelines. These connectors would essentially simplify the use of Python where it is needed for the data science part of a workflow.

While I have created this issue, I think it's important to document why we have seen users create their own Snowflake datasets rather than leverage our pandas.SQLTableDataSet and pandas.SQLQueryDataSet for the same purpose, especially for Pandas-based workflows.

Possible Implementation

This task proposes building out the following data connectors:

  • spark.SnowflakeTableDataSet - Loads and saves data from/to a table as a Spark DataFrame.
  • spark.SnowflakeQueryDataSet - Executes a SQL query and loads the result as a Spark DataFrame.
  • pandas.SnowflakeTableDataSet - Loads and saves data from/to a table as a Pandas DataFrame.
  • pandas.SnowflakeQueryDataSet - Executes a SQL query and loads the result as a Pandas DataFrame.
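For context, connectors like these would be declared in a Kedro catalog.yml roughly as follows. This is a hypothetical sketch: the type names come from the list above, and the parameter and credential keys are assumptions modelled on existing kedro-datasets conventions, not a released API.

```yaml
# catalog.yml -- hypothetical entries for the proposed connectors
weather_table:
  type: pandas.SnowflakeTableDataSet   # proposed above, not yet released
  table_name: WEATHER
  credentials: snowflake_creds

weather_2022:
  type: pandas.SnowflakeQueryDataSet   # proposed above, not yet released
  sql: SELECT * FROM WEATHER WHERE YEAR = 2022
  credentials: snowflake_creds
```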
@yetudada yetudada added the good first issue Good for newcomers label Oct 18, 2022
@datajoely
Contributor

datajoely commented Oct 18, 2022

So I've been wanting to work on this for a while - in my opinion the right way to approach this is the new Snowpark API, which exposes an almost identical interface to Spark DataFrames (context hook, DataFrame class).

https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html

@Vladimir-Filimonov
Contributor

Implementing via Snowpark is tempting, but I would be cautious about using Snowpark Python for production workloads as it is still in preview.

@Vladimir-Filimonov
Contributor

Never mind, Snowpark Python 1.0 just arrived: https://pypi.org/project/snowflake-snowpark-python/

@datajoely
Contributor

Excellent news!

I'm a big proponent of going the Snowpark route - to do this right I think we need to do three things, in the following sequence, as separate PRs:

  1. We need to implement SnowParkDataSet, this should mostly be a copy paste of SparkDataSet. It will be a much simpler implementation as we don't need to worry about file paths, just table schemas and names.

  2. We need to introduce a Kedro starter that works the same way as the pyspark one we have today. This will give users a ready-to-go Snowflake example via kedro new --starter=snowflake. We would also need to add it to the default starter scope here.

  3. We would then need some lightweight docs explaining how to use it!

  4. (Extra credit) ensure transcoding to pandas works okay :)
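A minimal sketch of what step 1 might look like. In kedro-datasets this would subclass AbstractDataSet; it is written standalone here so the shape of _load/_save is visible without Kedro or Snowflake installed, and the class and argument names are illustrative, not the shipped API.

```python
# Hypothetical sketch of a Snowpark-backed table dataset (not the
# actual kedro-datasets implementation).
from typing import Any, Optional


class SnowparkTableDataSet:
    """Loads and saves a Snowflake table as a Snowpark DataFrame."""

    def __init__(self, table_name: str, session: Any,
                 save_args: Optional[dict] = None) -> None:
        self._table_name = table_name
        self._session = session  # a snowflake.snowpark.Session (or a test double)
        self._save_args = save_args or {"mode": "overwrite"}

    def _load(self) -> Any:
        # Snowpark mirrors Spark: session.table() returns a lazy DataFrame.
        return self._session.table(self._table_name)

    def _save(self, data: Any) -> None:
        # Snowpark DataFrames expose write.save_as_table(), much like Spark.
        data.write.save_as_table(self._table_name, **self._save_args)

    def _exists(self) -> bool:
        # No file paths to worry about -- only whether the table resolves.
        try:
            self._session.table(self._table_name)
            return True
        except Exception:
            return False
```

Because there is no filesystem path involved, _exists reduces to a table lookup, which is why this ends up much simpler than SparkDataSet.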

@datajoely
Contributor

I hope to eventually get to this, but any help would be greatly appreciated!

@datajoely datajoely changed the title Snowflake Data Connectors Snowflake Data Connectors (SnowPark) Nov 2, 2022
@marrrcin

@datajoely We (GetInData) have recently researched the topic of Snowpark + Kedro extensively and we will be happy to take over this feature and help to implement it.
Let's discuss the details 🙂

@datajoely
Contributor

@marrrcin that's awesome - I connected with @Vladimir-Filimonov and @heber-urdaneta last week who have started work on the prototype. Are you able to raise your PRs even if they're in draft and we can discuss next steps together?

@heber-urdaneta

@datajoely sure, just created the draft PR (still in progress), we can discuss further

@datajoely
Contributor

Awesome - I've done a mini review and it's in a really good place.

@datajoely
Contributor

I'm still not sure how to write integration tests for this without an actual snowflake instance spun up.

Looking at some examples in the Snowflake Labs repo perhaps we can do it like this:
https://github.com/Snowflake-Labs/snowpark-devops/blob/main/tests/procedure_test.py

@Vladimir-Filimonov
Contributor

> I'm still not sure how to write integration tests for this without an actual snowflake instance spun up.
>
> Looking at some examples in the Snowflake Labs repo perhaps we can do it like this: https://github.com/Snowflake-Labs/snowpark-devops/blob/main/tests/procedure_test.py

We reached out to Snowflake folks asking whether we can expect any mocks to be provided as part of Snowpark to make unit tests easier, but it doesn't seem something like this can be expected near term. So I'm afraid that to cover methods like _save with unit tests we will have to use a real Snowflake instance. Let me push a proposal early next week of which tests can be done locally vs. which require real SF.
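Until such mocks exist, the session-dependent logic can at least be exercised locally by stubbing the Snowpark Session with unittest.mock. A sketch, in which save_pandas_to_snowflake is a hypothetical stand-in for the dataset's _save method, not the actual implementation under review:

```python
# Sketch: unit-testing Snowpark-dependent code without a real Snowflake
# instance by mocking the Session. save_pandas_to_snowflake is a
# hypothetical stand-in, not the actual kedro-datasets implementation.
from typing import Any
from unittest.mock import MagicMock


def save_pandas_to_snowflake(session: Any, pandas_df: Any, table_name: str) -> None:
    # Snowpark's Session.create_dataframe accepts a pandas DataFrame.
    snowpark_df = session.create_dataframe(pandas_df)
    snowpark_df.write.mode("overwrite").save_as_table(table_name)


def test_save_uses_session_correctly() -> None:
    session = MagicMock()
    fake_df = object()  # stands in for a pandas DataFrame
    save_pandas_to_snowflake(session, fake_df, "WEATHER")
    session.create_dataframe.assert_called_once_with(fake_df)
    writer = session.create_dataframe.return_value.write
    writer.mode.assert_called_once_with("overwrite")
    writer.mode.return_value.save_as_table.assert_called_once_with("WEATHER")


test_save_uses_session_correctly()
```

This only verifies that the right Session calls are made with the right arguments; end-to-end behaviour (types, schemas, actual writes) would still need the real-instance integration tests discussed above.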

@heber-urdaneta

@datajoely @marrrcin thanks for your comments on the PR!
@Vladimir-Filimonov took a shot at addressing the feedback, just created a new PR: kedro-org/kedro#2032

@merelcht merelcht transferred this issue from kedro-org/kedro Jan 26, 2023
@merelcht
Member

A Snowpark dataset was added in #104 and released in kedro-datasets 1.1.0.
