# Snowpark (Snowflake) dataset for kedro #104

**Merged.** Showing changes from 14 of 21 commits.

## Commits
* `2a861cd` Add Snowpark datasets (Vladimir-Filimonov)
* `4469b5f` Add snowpark tests (heber-urdaneta)
* `d21fce4` Update tests requirements and config (heber-urdaneta)
* `05b5059` Snowpark dataset implementation (Vladimir-Filimonov)
* `ec6de72` Update snowpark class name and docs formatting (heber-urdaneta)
* `3cc53d7` Adjustments for lint check (heber-urdaneta)
* `de89ed7` Merge branch 'kedro-org:main' into main (heber-urdaneta)
* `c4a1fd9` Change describe to remove dict call (heber-urdaneta)
* `4168e64` Remove pylint too many args suppression (heber-urdaneta)
* `5fcd1b9` Update SnowparkTableDataSet and env vars (heber-urdaneta)
* `8f61ab7` Remove pd interactions and add docs (heber-urdaneta)
* `5bc11e1` Fix docs example (heber-urdaneta)
* `e968eec` Merge pull request #2 from Vladimir-Filimonov/remove_pd (heber-urdaneta)
* `386a1b9` Fix sp for other py versions (heber-urdaneta)
* `4898c5e` Remove leftover TODO (heber-urdaneta)
* `ca7de93` Adjust documentation wording (heber-urdaneta)
* `47d9afc` Add SnowparkTableDataSet to release notes (heber-urdaneta)
* `1904104` Revert Add SnowparkTableDataSet to release notes (heber-urdaneta)
* `c16b778` Merge branch 'kedro-org:main' into update_branch (heber-urdaneta)
* `c399e7c` Correct RELEASE.md conflict (heber-urdaneta)
* `2716be6` Merge pull request #3 from Vladimir-Filimonov/update_branch (heber-urdaneta)
## Files changed

### `kedro-datasets/kedro_datasets/snowflake/__init__.py` (new file, +8)

```python
"""Provides I/O modules for Snowflake."""

__all__ = ["SnowparkTableDataSet"]

from contextlib import suppress

with suppress(ImportError):
    from .snowpark_dataset import SnowparkTableDataSet
```
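The `suppress(ImportError)` guard presumably exists so that `kedro_datasets.snowflake` can still be imported when the optional `snowflake-snowpark-python` dependency is not installed; in that case `SnowparkTableDataSet` is simply left undefined instead of the whole package import failing.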
### `kedro-datasets/kedro_datasets/snowflake/snowpark_dataset.py` (new file, +233)
"""``AbstractDataSet`` implementation to access Snowflake using Snowpark dataframes | ||||||||||||||||||
""" | ||||||||||||||||||
import logging | ||||||||||||||||||
from copy import deepcopy | ||||||||||||||||||
from typing import Any, Dict | ||||||||||||||||||
|
||||||||||||||||||
import snowflake.snowpark as sp | ||||||||||||||||||
from kedro.io.core import AbstractDataSet, DataSetError | ||||||||||||||||||
|
||||||||||||||||||
logger = logging.getLogger(__name__) | ||||||||||||||||||
|
||||||||||||||||||
|
||||||||||||||||||
class SnowparkTableDataSet(AbstractDataSet): | ||||||||||||||||||
"""``SnowparkTableDataSet`` loads and saves Snowpark dataframes. | ||||||||||||||||||
|
||||||||||||||||||
Example usage for the | ||||||||||||||||||
`YAML API <https://kedro.readthedocs.io/en/stable/data/\ | ||||||||||||||||||
data_catalog.html#use-the-data-catalog-with-the-yaml-api>`_: | ||||||||||||||||||
|
||||||||||||||||||
.. code-block:: yaml | ||||||||||||||||||
weather: | ||||||||||||||||||
type: kedro_datasets.snowflake.SnowparkTableDataSet | ||||||||||||||||||
table_name: "weather_data" | ||||||||||||||||||
database: "meteorology" | ||||||||||||||||||
schema: "observations" | ||||||||||||||||||
credentials: db_credentials | ||||||||||||||||||
save_args: | ||||||||||||||||||
mode: overwrite | ||||||||||||||||||
column_order: name | ||||||||||||||||||
table_type: '' | ||||||||||||||||||
|
||||||||||||||||||
One can skip everything but "table_name" if database and | ||||||||||||||||||
schema provided via credentials. Therefore catalog entries can be shorter | ||||||||||||||||||
if ex. all used Snowflake tables live in same database/schema. | ||||||||||||||||||
Values in dataset definition take priority over ones defined in credentials | ||||||||||||||||||
|
||||||||||||||||||
Example: | ||||||||||||||||||
Credentials file provides all connection attributes, catalog entry | ||||||||||||||||||
"weather" reuse credentials parameters, "polygons" catalog entry reuse | ||||||||||||||||||
all credentials parameters except providing different schema name. | ||||||||||||||||||
Second example of credentials file uses externalbrowser authentication | ||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||||||
|
||||||||||||||||||
catalog.yml | ||||||||||||||||||
|
||||||||||||||||||
.. code-block:: yaml | ||||||||||||||||||
weather: | ||||||||||||||||||
type: kedro_datasets.snowflake.SnowparkTableDataSet | ||||||||||||||||||
table_name: "weather_data" | ||||||||||||||||||
Vladimir-Filimonov marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||
database: "meteorology" | ||||||||||||||||||
schema: "observations" | ||||||||||||||||||
credentials: snowflake_client | ||||||||||||||||||
save_args: | ||||||||||||||||||
mode: overwrite | ||||||||||||||||||
column_order: name | ||||||||||||||||||
table_type: '' | ||||||||||||||||||
|
||||||||||||||||||
polygons: | ||||||||||||||||||
type: kedro_datasets.snowflake.SnowparkTableDataSet | ||||||||||||||||||
table_name: "geopolygons" | ||||||||||||||||||
credentials: snowflake_client | ||||||||||||||||||
schema: "geodata" | ||||||||||||||||||
|
||||||||||||||||||
credentials.yml | ||||||||||||||||||
|
||||||||||||||||||
.. code-block:: yaml | ||||||||||||||||||
snowflake_client: | ||||||||||||||||||
account: 'ab12345.eu-central-1' | ||||||||||||||||||
port: 443 | ||||||||||||||||||
warehouse: "datascience_wh" | ||||||||||||||||||
database: "detailed_data" | ||||||||||||||||||
schema: "observations" | ||||||||||||||||||
user: "service_account_abc" | ||||||||||||||||||
password: "supersecret" | ||||||||||||||||||
|
||||||||||||||||||
credentials.yml (with externalbrowser authenticator) | ||||||||||||||||||
|
||||||||||||||||||
.. code-block:: yaml | ||||||||||||||||||
snowflake_client: | ||||||||||||||||||
account: 'ab12345.eu-central-1' | ||||||||||||||||||
port: 443 | ||||||||||||||||||
warehouse: "datascience_wh" | ||||||||||||||||||
database: "detailed_data" | ||||||||||||||||||
schema: "observations" | ||||||||||||||||||
user: "john_doe@wdomain.com" | ||||||||||||||||||
authenticator: "externalbrowser" | ||||||||||||||||||
|
||||||||||||||||||
As of Jan-2023, the snowpark connector only works with Python 3.8 | ||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's worth to put this all the way at the top of the class doc string. I can imagine a lot of users would just skip reading the examples. |
||||||||||||||||||
""" | ||||||||||||||||||
|
||||||||||||||||||
# this dataset cannot be used with ``ParallelRunner``, | ||||||||||||||||||
# therefore it has the attribute ``_SINGLE_PROCESS = True`` | ||||||||||||||||||
# for parallelism within a pipeline please consider | ||||||||||||||||||
# ``ThreadRunner`` instead | ||||||||||||||||||
_SINGLE_PROCESS = True | ||||||||||||||||||
DEFAULT_LOAD_ARGS = {} # type: Dict[str, Any] | ||||||||||||||||||
DEFAULT_SAVE_ARGS = {} # type: Dict[str, Any] | ||||||||||||||||||
|
||||||||||||||||||
# TODO: Update docstring | ||||||||||||||||||
Vladimir-Filimonov marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||
def __init__( # pylint: disable=too-many-arguments | ||||||||||||||||||
self, | ||||||||||||||||||
table_name: str, | ||||||||||||||||||
schema: str = None, | ||||||||||||||||||
database: str = None, | ||||||||||||||||||
load_args: Dict[str, Any] = None, | ||||||||||||||||||
save_args: Dict[str, Any] = None, | ||||||||||||||||||
credentials: Dict[str, Any] = None, | ||||||||||||||||||
) -> None: | ||||||||||||||||||
"""Creates a new instance of ``SnowparkTableDataSet``. | ||||||||||||||||||
|
||||||||||||||||||
Args: | ||||||||||||||||||
table_name: The table name to load or save data to. | ||||||||||||||||||
schema: Name of the schema where ``table_name`` is. | ||||||||||||||||||
Optional as can be provided as part of ``credentials`` | ||||||||||||||||||
dictionary. Argument value takes priority over one provided | ||||||||||||||||||
in ``credentials`` if any. | ||||||||||||||||||
database: Name of the database where ``schema`` is. | ||||||||||||||||||
Optional as can be provided as part of ``credentials`` | ||||||||||||||||||
dictionary. Argument value takes priority over one provided | ||||||||||||||||||
in ``credentials`` if any. | ||||||||||||||||||
load_args: Currently not used | ||||||||||||||||||
save_args: Provided to underlying snowpark ``save_as_table`` | ||||||||||||||||||
To find all supported arguments, see here: | ||||||||||||||||||
https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.DataFrameWriter.saveAsTable.html | ||||||||||||||||||
credentials: A dictionary with a snowpark connection string. | ||||||||||||||||||
To find all supported arguments, see here: | ||||||||||||||||||
https://docs.snowflake.com/en/user-guide/python-connector-api.html#connect | ||||||||||||||||||
""" | ||||||||||||||||||
|
||||||||||||||||||
if not table_name: | ||||||||||||||||||
raise DataSetError("'table_name' argument cannot be empty.") | ||||||||||||||||||
|
||||||||||||||||||
if not credentials: | ||||||||||||||||||
raise DataSetError("'credentials' argument cannot be empty.") | ||||||||||||||||||
|
||||||||||||||||||
if not database: | ||||||||||||||||||
if not ("database" in credentials and credentials["database"]): | ||||||||||||||||||
raise DataSetError( | ||||||||||||||||||
"'database' must be provided by credentials or dataset." | ||||||||||||||||||
) | ||||||||||||||||||
database = credentials["database"] | ||||||||||||||||||
|
||||||||||||||||||
if not schema: | ||||||||||||||||||
if not ("schema" in credentials and credentials["schema"]): | ||||||||||||||||||
raise DataSetError( | ||||||||||||||||||
"'schema' must be provided by credentials or dataset." | ||||||||||||||||||
) | ||||||||||||||||||
schema = credentials["schema"] | ||||||||||||||||||
# Handle default load and save arguments | ||||||||||||||||||
self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS) | ||||||||||||||||||
if load_args is not None: | ||||||||||||||||||
self._load_args.update(load_args) | ||||||||||||||||||
self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS) | ||||||||||||||||||
if save_args is not None: | ||||||||||||||||||
self._save_args.update(save_args) | ||||||||||||||||||
|
||||||||||||||||||
self._table_name = table_name | ||||||||||||||||||
self._database = database | ||||||||||||||||||
self._schema = schema | ||||||||||||||||||
|
||||||||||||||||||
connection_parameters = credentials | ||||||||||||||||||
connection_parameters.update( | ||||||||||||||||||
{"database": self._database, "schema": self._schema} | ||||||||||||||||||
) | ||||||||||||||||||
self._connection_parameters = connection_parameters | ||||||||||||||||||
self._session = self._get_session(self._connection_parameters) | ||||||||||||||||||
|
||||||||||||||||||
def _describe(self) -> Dict[str, Any]: | ||||||||||||||||||
return { | ||||||||||||||||||
"table_name": self._table_name, | ||||||||||||||||||
"database": self._database, | ||||||||||||||||||
"schema": self._schema, | ||||||||||||||||||
} | ||||||||||||||||||
|
||||||||||||||||||
@staticmethod | ||||||||||||||||||
def _get_session(connection_parameters) -> sp.Session: | ||||||||||||||||||
"""Given a connection string, create singleton connection | ||||||||||||||||||
to be used across all instances of `SnowparkTableDataSet` that | ||||||||||||||||||
need to connect to the same source. | ||||||||||||||||||
connection_parameters is a dictionary of any values | ||||||||||||||||||
supported by snowflake python connector: | ||||||||||||||||||
https://docs.snowflake.com/en/user-guide/python-connector-api.html#connect | ||||||||||||||||||
example: | ||||||||||||||||||
connection_parameters = { | ||||||||||||||||||
"account": "", | ||||||||||||||||||
"user": "", | ||||||||||||||||||
"password": "", (optional) | ||||||||||||||||||
"role": "", (optional) | ||||||||||||||||||
"warehouse": "", (optional) | ||||||||||||||||||
"database": "", (optional) | ||||||||||||||||||
"schema": "", (optional) | ||||||||||||||||||
"authenticator: "" (optional) | ||||||||||||||||||
} | ||||||||||||||||||
""" | ||||||||||||||||||
try: | ||||||||||||||||||
logger.debug("Trying to reuse active snowpark session...") | ||||||||||||||||||
session = sp.context.get_active_session() | ||||||||||||||||||
except sp.exceptions.SnowparkSessionException: | ||||||||||||||||||
logger.debug("No active snowpark session found. Creating") | ||||||||||||||||||
session = sp.Session.builder.configs(connection_parameters).create() | ||||||||||||||||||
return session | ||||||||||||||||||
|
||||||||||||||||||
def _load(self) -> sp.DataFrame: | ||||||||||||||||||
table_name = [ | ||||||||||||||||||
self._database, | ||||||||||||||||||
self._schema, | ||||||||||||||||||
self._table_name, | ||||||||||||||||||
] | ||||||||||||||||||
|
||||||||||||||||||
sp_df = self._session.table(".".join(table_name)) | ||||||||||||||||||
return sp_df | ||||||||||||||||||
|
||||||||||||||||||
def _save(self, data: sp.DataFrame) -> None: | ||||||||||||||||||
table_name = [ | ||||||||||||||||||
self._database, | ||||||||||||||||||
self._schema, | ||||||||||||||||||
self._table_name, | ||||||||||||||||||
] | ||||||||||||||||||
|
||||||||||||||||||
data.write.save_as_table(table_name, **self._save_args) | ||||||||||||||||||
|
||||||||||||||||||
def _exists(self) -> bool: | ||||||||||||||||||
session = self._session | ||||||||||||||||||
query = "SELECT COUNT(*) FROM {database}.INFORMATION_SCHEMA.TABLES \ | ||||||||||||||||||
Vladimir-Filimonov marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||
WHERE TABLE_SCHEMA = '{schema}' \ | ||||||||||||||||||
AND TABLE_NAME = '{table_name}'" | ||||||||||||||||||
rows = session.sql( | ||||||||||||||||||
query.format( | ||||||||||||||||||
database=self._database, | ||||||||||||||||||
schema=self._schema, | ||||||||||||||||||
table_name=self._table_name, | ||||||||||||||||||
) | ||||||||||||||||||
).collect() | ||||||||||||||||||
return rows[0][0] == 1 |
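For a sense of how this class behaves at runtime, here is a minimal usage sketch outside of `catalog.yml`. The credential values and the `TEMPERATURE` column are illustrative placeholders; in a kedro project the data catalog would normally construct the dataset from the YAML entries shown in the docstring.

```python
from kedro_datasets.snowflake import SnowparkTableDataSet

# Placeholder credentials; in practice these come from credentials.yml
credentials = {
    "account": "ab12345.eu-central-1",
    "user": "service_account_abc",
    "password": "supersecret",
    "warehouse": "datascience_wh",
    "database": "meteorology",
    "schema": "observations",
}

ds = SnowparkTableDataSet(
    table_name="weather_data",
    credentials=credentials,
    save_args={"mode": "overwrite"},
)

df = ds.load()  # a lazy snowflake.snowpark.DataFrame; no rows pulled yet
above_zero = df.filter(df["TEMPERATURE"] > 0)  # executes inside Snowflake
ds.save(above_zero)  # written back via DataFrameWriter.save_as_table
```

Because `_get_session` reuses an active Snowpark session when one exists, several catalog entries sharing the same credentials end up sharing a single connection.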
### Snowpark tests README (new file, +34)
# Snowpark connector testing

Execution of automated tests for the Snowpark connector requires access to a real Snowflake instance. Tests located in this folder are therefore **disabled** by default from the pytest execution scope using [conftest.py](conftest.py).

[Makefile](/Makefile) provides a separate target, ``test-snowflake-only``, to run only the tests related to the Snowpark connector. To run the tests, provide Snowflake connection parameters via environment variables:
* SNOWSQL_ACCOUNT - Snowflake account name with region, e.g. `ab12345.eu-central-2`
* SNOWSQL_WAREHOUSE - Snowflake virtual warehouse to use
* SNOWSQL_DATABASE - database to use
* SNOWSQL_SCHEMA - schema to use when creating tables for tests
* SNOWSQL_ROLE - role to use for connection
* SNOWSQL_USER - username to use for connection
* SNOWSQL_PWD - plain password to use for connection

All environment variables must be provided for the tests to run.

Here is an example shell command to run the Snowpark tests via the make utility:
```bash
SNOWSQL_ACCOUNT='ab12345.eu-central-2' SNOWSQL_WAREHOUSE='DEV_WH' SNOWSQL_DATABASE='DEV_DB' SNOWSQL_ROLE='DEV_ROLE' SNOWSQL_USER='DEV_USER' SNOWSQL_SCHEMA='DATA' SNOWSQL_PWD='supersecret' make test-snowflake-only
```

Running the tests currently supports only simple username and password authentication, not SSO/MFA.

As of January 2023, the Snowpark connector only works with Python 3.8.

## Snowflake permissions required
Credentials provided via environment variables should have the following permissions granted to run the tests successfully:
* Create tables in a given schema
* Drop tables in a given schema
* Insert rows into tables in a given schema
* Query tables in a given schema
* Query `INFORMATION_SCHEMA.TABLES` of the respective database

## Extending tests
Contributors adding new tests should add the `@pytest.mark.snowflake` decorator to each test. The exclusion of Snowpark-related tests from the overall pytest execution scope in [conftest.py](conftest.py) works based on these markers.
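A hypothetical example of such a marked test (the fixture names and table used here are made up for illustration):

```python
import pytest


@pytest.mark.snowflake  # skipped unless pytest runs with "-m snowflake"
def test_save_creates_table(sample_sp_df, sf_session):
    # sample_sp_df and sf_session are assumed fixtures providing a
    # Snowpark DataFrame and an open Snowpark session, respectively
    sample_sp_df.write.save_as_table("test_table", mode="overwrite")
    assert sf_session.table("test_table").count() == sample_sp_df.count()
```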
An empty file is also added.

### `conftest.py` (new file, +24)
""" | ||
We disable execution of tests that require real Snowflake instance | ||
to run by default. Providing -m snowflake option explicitly to | ||
pytest will make these and only these tests run | ||
""" | ||
import pytest | ||
|
||
|
||
def pytest_collection_modifyitems(config, items): | ||
markers_arg = config.getoption("-m") | ||
|
||
# Naive implementation to handle basic marker expressions | ||
# Will not work if someone will (ever) run pytest with complex marker | ||
# expressions like "-m spark and not (snowflake or pandas)" | ||
if ( | ||
"snowflake" in markers_arg.lower() | ||
and "not snowflake" not in markers_arg.lower() | ||
): | ||
return | ||
|
||
skip_snowflake = pytest.mark.skip(reason="need -m snowflake option to run") | ||
for item in items: | ||
if "snowflake" in item.keywords: | ||
item.add_marker(skip_snowflake) |
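With this hook in place, a plain `pytest` run skips every test marked `snowflake`, while `pytest -m snowflake` (which the `test-snowflake-only` Makefile target presumably wraps) runs those tests and only those.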