
Draft PR for issue 1946 - snowflake integration #2029

Closed · wants to merge 9 commits

Conversation

@heber-urdaneta commented Nov 15, 2022

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in the kedro-plugins repository. Any changes to the dataset implementations should be made by opening a pull request in that repository.

Description

Issue 1946, kedro-org/kedro-plugins#108

Development notes

  • Created snowflake_dataset.py. Save and load have been tested with dummy datasets in a sandbox Snowflake environment.
  • Pending: integrate Snowflake session creation with the Snowflake Kedro starter.
  • Pending: create tests.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@datajoely (Contributor) left a comment


Really really nice work @heber-urdaneta 💪

    # in case module import error does not match our expected pattern
    # we have no recommendation
    if not res:
        return None
Contributor:

Do we want to raise an error here?

    if not credentials:
        raise DataSetError("Please configure expected credentials")

    # print(self._load_args)
Contributor:

Suggested change: delete this leftover debug line.

    # print(self._load_args)


    @classmethod
    def _get_session(cls, credentials: dict) -> None:
        """Given a connection string, create singleton connection
Contributor:

We did something similar in pandas.SQL*DataSet - is this the same pattern?

Author:

Yes, that's the same pattern as the SQLDataSet. This would change when implementing a session hook, similar to the SparkDataSet.
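
For reference, a minimal sketch of that singleton pattern as it might look here (the class name and caching attribute are assumptions; Session.builder.configs(...).create() is the standard Snowpark call for opening a session from a dict of connection parameters):

    from snowflake.snowpark import Session

    class SnowflakeDataSet:
        _session = None  # cached at class level, like the engine cache in pandas.SQLDataSet

        @classmethod
        def _get_session(cls, credentials: dict) -> Session:
            """Create the Snowpark session on first use and reuse it afterwards."""
            if cls._session is None:
                cls._session = Session.builder.configs(credentials).create()
            return cls._session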


    def _load(self) -> pd.DataFrame:
        sp_df = self._session.table(self._load_args["table_name"])
        return sp_df.to_pandas()
Contributor:

I don't think we want to return a pandas DataFrame; I would return a Snowpark DataFrame and let the user do the pandas casting themselves.

Contributor:

➕ to that, Snowpark DataFrames are lazy, so the user node could potentially leverage that.
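
A minimal sketch of the lazy alternative being suggested (the Snowpark return annotation is an assumption; Session.table only builds a query plan, so no rows are fetched until the consuming node runs an action such as to_pandas() or collect()):

    from snowflake.snowpark import DataFrame as SnowparkDataFrame

    def _load(self) -> SnowparkDataFrame:
        # Lazy: nothing is pulled here; downstream nodes can add filters
        # or aggregations that Snowflake then executes server-side.
        return self._session.table(self._load_args["table_name"])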

        sp_df = self._session.table(self._load_args["table_name"])
        return sp_df.to_pandas()

    def _save(self, data: pd.DataFrame) -> None:
Contributor:

Suggested change:

    -def _save(self, data: pd.DataFrame) -> None:
    +def _save(self, data: [pd.DataFrame, snowpark.DataFrame]) -> None:

I'm not actually sure of the type signature, but the push here is to accept either option and handle it gracefully.

Contributor:

The union should be written either as Union[pd.DataFrame, snowpark.DataFrame] or as pd.DataFrame | snowpark.DataFrame.
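
A sketch of a _save accepting either type, using the Union spelling above (the pandas-to-Snowpark conversion via Session.create_dataframe is an assumption about how the branch might be handled):

    from typing import Union

    import pandas as pd
    from snowflake.snowpark import DataFrame as SnowparkDataFrame

    def _save(self, data: Union[pd.DataFrame, SnowparkDataFrame]) -> None:
        if isinstance(data, pd.DataFrame):
            # Convert pandas input to a Snowpark DataFrame before writing.
            data = self._session.create_dataframe(data)
        data.write.mode(self._save_args["mode"]).save_as_table(
            self._save_args["table_name"]
        )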

    @@ -37,7 +37,7 @@ Pillow~=9.0
     plotly>=4.8.0, <6.0
     pre-commit>=2.9.2, <3.0 # The hook `mypy` requires pre-commit version 2.9.2.
     psutil==5.8.0
    -pyarrow>=1.0, <7.0
    +pyarrow>=1.0, <9.0
Contributor:

Hopefully this doesn't break any other bits of Kedro!

Author:

Hopefully not! But this dependency update was needed for the Snowpark df.to_pandas() method to work. It may no longer be required if _load() returns a Snowpark DataFrame, but it could still matter if the data gets converted to pandas within a pipeline.

    @@ -50,6 +50,7 @@ requests-mock~=1.6
     requests~=2.20
     s3fs>=0.3.0, <0.5 # Needs to be at least 0.3.0 to make use of `cachable` attribute on S3FileSystem.
     SQLAlchemy~=1.2
    +snowflake-snowpark-python~=0.12.0
Contributor:

1.0.0 came out on 1st November.



>>> "role": "",
>>> "warehouse": "",
>>> "database": "",
>>> "schema": ""
Contributor:

Why are database, schema, warehouse part of credentials?
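
For illustration, one way the split could look, keeping only secrets under credentials and moving the execution context to regular dataset arguments (all key names here are hypothetical, not the PR's final schema):

    # Secret connection details -> credentials
    credentials = {
        "account": "...",
        "user": "...",
        "password": "...",
    }

    # Non-secret execution context -> load/save args on the dataset entry
    load_args = {
        "role": "...",
        "warehouse": "...",
        "database": "...",
        "schema": "...",
        "table_name": "...",
    }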

    ]
    sp_df.write.mode(self._save_args["mode"]).save_as_table(
        table_name,
        column_order=self._save_args["column_order"],
Contributor:

Couldn't this be inferred from the passed dataframe, falling back to save_args only when it is set?
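
A sketch of that fallback: forward column_order only when the user set it in save_args, so save_as_table otherwise applies Snowpark's own default (a hedged pattern, not the PR's final code):

    # Only pass column_order when explicitly configured.
    save_kwargs = {}
    if "column_order" in self._save_args:
        save_kwargs["column_order"] = self._save_args["column_order"]
    sp_df.write.mode(self._save_args["mode"]).save_as_table(
        table_name, **save_kwargs
    )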

Updated snowpark test prerequisites
Draft implementation of SnowParkDataSet class
@datajoely (Contributor):

I'm confused between this and #2032. What's the difference?

Bumped python version to 3.8 as required by snowpark
@heber-urdaneta (Author):

We can ignore #2032; the latest commit on this PR now has snowpark_dataset.py, addressing the previous observations on snowflake_dataset.py.

@heber-urdaneta (Author):

Development of the Snowpark integration has moved to this PR: kedro-org/kedro-plugins#78
