
Draft PR for issue 1946 - snowflake integration #78

Closed
wants to merge 15 commits

Conversation

heber-urdaneta

Description

Moving PR from main kedro repo to kedro-plugins

Development notes

Adding snowpark dataset and tests

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Contributor

@datajoely left a comment

It's really coming together @heber-urdaneta, really good work!

raise DataSetError("'database' argument cannot be empty.")

if not schema:
raise DataSetError("'schema' argument cannot be empty.")
Contributor

Is there not a default schema? I'm all for explicit, but I think the underlying API will pick a default IIRC

@sfc-gh-mgorkow Nov 23, 2022

Not necessarily. When you create a user, you can define the default namespace, but the default value is NULL:
https://docs.snowflake.com/en/sql-reference/sql/create-user.html

Contributor

Thanks @sfc-gh-mgorkow then we'll be conservative here!


sp_df.write.mode(self._save_args["mode"]).save_as_table(
    table_name,
    column_order=self._save_args["column_order"],
Contributor

let's just do **self._save_args :)
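A minimal, runnable illustration of the kwargs-forwarding pattern being suggested, using a stand-in function rather than the real Snowpark writer (the function signature and values here are assumptions for illustration, not the dataset's actual code):

```python
# Stand-in for DataFrameWriter.save_as_table, just to show the forwarding pattern.
def save_as_table(table_name, *, mode=None, column_order="index"):
    print(f"writing {table_name!r} with mode={mode!r}, column_order={column_order!r}")


save_args = {"mode": "overwrite", "column_order": "name"}

# Instead of spelling out each keyword argument, forward the whole dict:
save_as_table("weather_observations", **save_args)
```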

@@ -34,7 +34,7 @@ Pillow~=9.0
plotly>=4.8.0, <6.0
pre-commit>=2.9.2, <3.0 # The hook `mypy` requires pre-commit version 2.9.2.
psutil==5.8.0
pyarrow>=1.0, <7.0
pyarrow>=1.0, <9.0
Contributor

This will need sign-off from the wider team if it affects other pieces of the framework - hopefully not an issue.

Contributor

@noklam @merelcht do you have a view here? What's the process for this sort of change?

Contributor

I would just remove the upper bound; I expect it mostly works with pandas, and pandas has an open pin pyarrow>6.0.

Of course, the assumption is that it still passes the tests.

See this previous PR - I think we keep a relatively open bound, but pyarrow bumps the major version for almost every release.
kedro-org/kedro#1057

Contributor

Or should we include 10.0.0 at least?

@@ -73,6 +73,9 @@ def _collect_requirements(requires):
"spark.SparkJDBCDataSet": [SPARK, HDFS, S3FS],
"spark.DeltaTableDataSet": [SPARK, HDFS, S3FS, "delta-spark~=1.0"],
}
snowpark_require = {
"snowflake.SnowParkDataSet": ["snowflake-snowpark-python~=1.0.0", "pyarrow>=8.0, <9.0"]
Contributor

Does it only work with 8.0? This is equivalent to pyarrow==8.0.0, since the next version after 8.0.0 is 9.0.0.

Author

I tested snowpark with pyarrow 10.0.0 and it is incompatible; the upper bound is actually stricter: "please install a version that adheres to: 'pyarrow<8.1.0,>=8.0.0'"

Contributor

How about 9.0.0?

You are right technically, but this is the pyarrow release history: [screenshot of pyarrow releases]

Author

Version 9.0.0 doesn't work, so I think we can be more specific and adjust to pyarrow==8.0.0

Contributor

Cool, that's fine, thanks for checking that 🙏. Do you have an idea why it is not working? I think it would be nice to leave a comment there if we know a certain API is not compatible.

Author

Sure! When creating the Snowpark session it shows a warning if the pyarrow version is different from 8.0.0 ("please install a version that adheres to: 'pyarrow<8.1.0,>=8.0.0'"), and there are potential crashes when interacting with the DataFrame API.

heber-urdaneta and others added 4 commits November 28, 2022 22:03
reuse database and warehouse from credentials if not provided with dataset
Improved credentials handling
Add pyarrow dependency
@@ -34,7 +34,7 @@ Pillow~=9.0
plotly>=4.8.0, <6.0
pre-commit>=2.9.2, <3.0 # The hook `mypy` requires pre-commit version 2.9.2.
psutil==5.8.0
pyarrow>=1.0, <7.0
pyarrow==8.0
Contributor

Suggested change
pyarrow==8.0
pyarrow~=8.0

>>> database: meteorology
>>> schema: observations
>>> credentials: db_credentials
>>> load_args (WIP):
Contributor

Suggested change
>>> load_args (WIP):
>>> load_args (WIP):

You don't need it in the example, but it would be good to get examples of this form of credentials, and also the externalbrowser SSO approach :)
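For reference, a hedged sketch (not code from this PR) of what the two credential styles could look like when handed to Snowpark directly; the keys follow snowflake-snowpark-python's standard connection parameters and all values are purely illustrative:

```python
from snowflake.snowpark import Session

# Username/password credentials, roughly what the catalog example above maps to:
password_credentials = {
    "account": "my_account",        # illustrative values
    "user": "my_user",
    "password": "my_password",
    "warehouse": "my_warehouse",
    "database": "meteorology",
    "schema": "observations",
}

# SSO via the external browser authenticator instead of a password:
sso_credentials = {
    "account": "my_account",
    "user": "my_user",
    "authenticator": "externalbrowser",
    "warehouse": "my_warehouse",
    "database": "meteorology",
    "schema": "observations",
}

session = Session.builder.configs(sso_credentials).create()
```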

@Vladimir-Filimonov
Contributor

Hey everyone. We're wrapping up the Snowpark dataset connector and I want to get feedback from the community before finalising the PR.
We implemented the connection to Snowflake via Kedro's standard use of credentials, and the connection to Snowflake is opened at the moment of catalog initialisation. The same pattern is used for the pandas SQL dataset.

But there are other ways of doing this, so below is the thinking process and why we concluded that this pattern seems to be a good fit.

Using a hook like the pyspark dataset?

We could open the connection to Snowflake using an after_context_created hook, like the pyspark starter does. But hooks are fired AFTER catalog & context initialisation, so the connection would be opened later than in the current approach (no benefit of starting it earlier), at the cost of complicating project configuration (the user has to use a starter, and for users combining e.g. AWS + Snowflake, credentials get split between the credentials file and a separate snowpark.yml if we follow the analogy of the pyspark dataset).

So a hook seems to add no value for a Snowpark connector (as opposed to pyspark, where a hook makes sense), at the cost of complicating the project structure.
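For context, a rough sketch of the hook-based alternative being set aside here, modelled loosely on the pyspark pattern (class and parameter names are hypothetical and not part of this PR):

```python
from kedro.framework.hooks import hook_impl
from snowflake.snowpark import Session


class SnowparkHooks:
    """Hypothetical hook that would own the Snowflake connection."""

    def __init__(self, connection_params: dict):
        self._connection_params = connection_params
        self._session = None

    @hook_impl
    def after_context_created(self, context) -> None:
        # Fires only AFTER catalog & context initialisation, so the connection
        # cannot be opened any earlier than with the current approach anyway.
        self._session = Session.builder.configs(self._connection_params).create()
```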

Pipeline uses Snowflake only at very late stages

If we imagine a kedro pipeline that does some file data preprocessing (hours?) and saves data into Snowflake only at later stages, the connection to Snowflake will be open all that time (hours?) before it is actually used.

To address this we could initiate the connection lazily: load, save and exists would check whether a connection is open and, if not, open it on first use (see the sketch below). This avoids a connection being open without use, BUT has the downside that if the connection credentials are wrong, we only find out when the pipeline reaches the stage that actually uses Snowflake - which might be late in the pipeline and hurt the user experience.
Based on this thinking I consider this a suboptimal design.
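For illustration, a minimal sketch of the lazy alternative described above (all names are hypothetical; this is not the code in the PR):

```python
from snowflake.snowpark import Session


class LazySnowparkTableDataSet:
    """Hypothetical dataset that only connects on first use."""

    def __init__(self, table_name: str, credentials: dict):
        self._table_name = table_name
        self._credentials = credentials
        self._session = None  # nothing is opened at catalog initialisation

    def _get_session(self) -> Session:
        # The first call from _load/_save/_exists opens the connection, so bad
        # credentials only surface at this point, possibly hours into a run.
        if self._session is None:
            self._session = Session.builder.configs(self._credentials).create()
        return self._session

    def _load(self):
        return self._get_session().table(self._table_name)
```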

Snowflake costs consideration

Just opening a connection to Snowflake (from Snowpark) does not wake the Snowflake virtual warehouse you provided in the connection string. So the user is not charged for the connection being open, and from this perspective there are no drawbacks to opening the connection in advance of actually using Snowflake data.

Appreciate feedback on the thinking above and whether you think we should change how the connection is opened for Snowpark. @datajoely @noklam @marrrcin

@heber-urdaneta
Author

Multiple CircleCI checks have failed and we have some questions; here's a summary:

1. lint and unit-tests 3.8: it seems there was a timeout (10m) when installing requirements (particularly sqlalchemy) -> I'm not sure this is caused by our commits. Have you seen this before? Can the timeout maybe be extended beyond 10m?
2. win-unit-test 3.8: fails when installing a GDAL-related library -> this also doesn't look like it's caused by our changes; have you seen this occur before?
3. tests with Python versions other than 3.8: for the time being, snowpark only works with Python 3.8 -> not sure how to proceed; could those tests be ignored, under the acknowledgement that only version 3.8 should be used for snowpark?

@datajoely @noklam, let us know your thoughts, thanks!

@datajoely
Contributor

Hi @Vladimir-Filimonov -

Thank you for the hard work. I've checked with Kedro's Tech Lead @idanov that we don't need a hook here, so your proposal is perfect.

@heber-urdaneta if you run the checks locally using the makefile, do they pass?

@datajoely
Contributor

Regarding this part

To address this we could initiate the connection lazily: load, save and exists would check whether a connection is open and, if not, open it on first use. This avoids a connection being open without use, BUT has the downside that if the connection credentials are wrong, we only find out when the pipeline reaches the stage that actually uses Snowflake - which might be late in the pipeline and hurt the user experience.
Based on this thinking I consider this a suboptimal design.

Is there a middle ground where we do an eager credentials check and a lazy data load/save? Perhaps this is where an after_context_created hook would be useful?
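A rough sketch of that middle ground, assuming a cheap "probe" connection at catalog initialisation is acceptable (everything here is illustrative, not a proposal for the final API):

```python
from snowflake.snowpark import Session


class EagerCheckLazyLoadDataSet:
    """Hypothetical dataset: fail fast on bad credentials, connect lazily for data."""

    def __init__(self, table_name: str, credentials: dict):
        self._table_name = table_name
        self._credentials = credentials
        self._session = None
        self._validate_credentials()  # eager check at catalog initialisation

    def _validate_credentials(self) -> None:
        # Open a short-lived session just to surface bad credentials early,
        # then close it so nothing stays open while earlier nodes run.
        probe = Session.builder.configs(self._credentials).create()
        probe.close()

    def _load(self):
        if self._session is None:  # the real connection is opened on first use
            self._session = Session.builder.configs(self._credentials).create()
        return self._session.table(self._table_name)
```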

@datajoely
Contributor

Are we far away from marking this as ready for review?

Signed-off-by: heber-urdaneta <98349957+heber-urdaneta@users.noreply.github.com>
@@ -34,7 +34,7 @@ min-public-methods = 1
[tool.coverage.report]
fail_under = 100
show_missing = true
omit = ["tests/*", "kedro_datasets/holoviews/*"]
omit = ["tests/*", "kedro_datasets/holoviews/*", "kedro_datasets/snowflake/*"]
Contributor

Why exclude?

Author

This was to exclude the execution of snowpark tests from the coverage report and avoid % coverage errors.

To execute the snowpark tests we need a live connection to a Snowflake account, so we planned to keep them separate from the main tests and only execute them when triggering tests with a snowflake marker (we added make test-snowflake-only). While test execution is skipped if no snowflake marker is provided, it still affects the coverage report, which expects 100% - which is why I added it to the omit list for the report.

Let us know your thoughts and if there's another way around it, and we can discuss further.
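For clarity, a minimal sketch of how marker-based selection like this typically works with pytest (the snowflake marker name matches the description above; the deselect-by-default configuration is an assumption about the setup, not a quote from this PR):

```python
import pytest


@pytest.mark.snowflake
def test_snowpark_save_and_load_roundtrip():
    # Needs a live Snowflake account, so it is kept out of the default run
    # (e.g. via addopts = -m "not snowflake" in the pytest config, assumed here)
    # and omitted from the 100% coverage gate.
    ...
```

Under that assumed configuration, `pytest -m snowflake` (or the make test-snowflake-only target) would select these tests explicitly, while the default run deselects them.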

Contributor

No it's looking good - makes sense :)

@Vladimir-Filimonov
Contributor

The ready-for-review PR was opened here: #104. Due to the need to fix DCO issues we had to make a clean start and sign all commits.
