Allow for dynamic SQL filtering of datasets through lazy loading #2374
Comments
@inigohidalgo thank you for putting a lot of thought into this - I spent last night reading about Ibis, which I think may be the modern and correct way to approach the problem space you're trying to work with. I'm still not sure how to make the syntax work neatly in the Kedro world, but I'm hoping it can work like this polars dataset we nearly have merged.
@inigohidalgo while this doesn't solve for your more detailed use case, have you looked at using globals? This is how I currently handle the use case of updating my query for specific dates, and I believe they can be passed at runtime.
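Roughly like this (a sketch; the dataset, table, and variable names are made up, and it assumes `TemplatedConfigLoader` is enabled in `settings.py` with `globals_pattern="*globals.yml"`):

```yaml
# conf/base/globals.yml  (picked up via TemplatedConfigLoader's globals_pattern)
start_date: "2022-01-01"
```

```yaml
# conf/base/catalog.yml  (the query is templated with the global value)
recent_orders:
  type: pandas.SQLQueryDataSet
  credentials: db_credentials
  sql: SELECT * FROM orders WHERE order_date >= '${start_date}'
```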
Thanks for the heads up about Ibis @datajoely, it does seem to be the direction we could go in, though it goes a step further than what we'd need (full computational backend vs. just simple filtering, joining, etc.). I would note I haven't been able to connect to our Azure SQL databases using our usual connection method (pyodbc), so I'm not sure how mature it is, in case you're considering putting development towards it. I might open an issue in the project regarding this if it seems promising: basically, at the moment they only provide the pymssql driver for connecting to mssql DBs, but it doesn't seem to be compatible with our authentication method (Azure AD).

Hey @bpmeek, yeah, that's the "compromise" solution we'd been using until now, basically just passing it as a general parameter. But it isn't really flexible enough for what we're trying to do a lot of the time. For the moment I haven't made progress on this since I've been sidetracked on other issues, but it is still very much an important question for my team and me, and I am still considering pursuing my original idea, since it seems simple enough to implement through SQLAlchemy, which is much more mature and developed.
Hey all, I'm the lead developer of ibis! We're happy to support alternative methods of connecting to mssql-oriented DBs. We haven't added that capability yet because no one has asked for it! If that'll tip the scales in favor of ibis, then I think we'd be happy to work on adding ODBC support to the MSSQL backend.
Thanks for chiming in @cpcloud! 🙌🏽 We haven't started exploring Ibis for real yet, but if the only limitation is mssql DBs, maybe we could build a proof of concept with PostgreSQL or something else. We'll surely reach out soon if we make progress on this front.
Went ahead and built a small poc using duckdb (link). I really like Ibis' API for achieving what I want, much simpler than SQLAlchemy. I will keep testing it, but it definitely seems like it could be the way forward for us. I will open a separate issue in the appropriate repo when I investigate further, but in short: passing the loaded Ibis table between nodes raises an error. (Note: I tested it with both duckdb and sqlite as a backend and ran into the same issue.) I haven't been able to dig too deep into what is causing it, but I suspect it could be related to how kedro passes datasets from node to node (depending on the MemoryDataset copy_mode arg).
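For flavour, the core of the PoC looks roughly like this (a sketch, not the actual PoC code; the database and table names are made up):

```python
import ibis

con = ibis.duckdb.connect("poc.duckdb")  # hypothetical local database file
events = con.table("events")

# Nothing executes yet: filter() just extends the expression tree.
recent = events.filter(events.event_date >= ibis.date("2023-01-01"))

df = recent.execute()  # SQL runs here; only the filtered rows come back
```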
Hi @inigohidalgo, sorry it took some time to get back, but thanks a lot for making this POC! This week I spoke to @datajoely and @marlenezw about Ibis & Kedro and I'm thrilled to see this is moving forward. Two comments:

- The error when passing the Ibis table between nodes should go away if you declare the intermediate dataset with copy_mode: assign, so the object is passed by reference instead of deep-copied (see the sketch below).
- Ideally there should be a repo that only contains the IbisDataSet, so it can be shared as a standalone package.
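A minimal sketch of that catalog entry (the dataset name is made up; in recent Kedro versions the class is MemoryDataset):

```yaml
# catalog.yml
filtered_events:
  type: MemoryDataset
  copy_mode: assign  # pass the Ibis expression by reference instead of copying it
```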
Yeah that solves it, but it's kind of cumbersome to have to specify copy_mode for each Ibis memorydataset I'd be passing through, as at the moment I am not specifying any memorydatasets at all in my catalog (note my comments in the kedro slack); I suggested more flexibility in the default copy_mode behaviour there.

I like the idea of sharing the IbisDataSet as a standalone package. Do you think the IbisDataSet as-is would be in a state to publish? With the required tests and documentation, obviously; all the necessary methods are implemented already, though potentially in a barebones manner.
Release early, release often 🚀
I extracted the IbisDataSet and filled out the missing functionality here: https://github.com/inigohidalgo/kedro-ibis-dataset

I also updated the PoC to use the external dataset package instead of the bundled extra: https://github.com/inigohidalgo/ibis-kedro-poc

I'd appreciate feedback on the implementation :)
This is super cool @inigohidalgo - until now SQL has always been a 2nd class citizen in Kedro, and this is a huge step forward in terms of usability and robustness. Longer term it would be cool to think about how to enable more comprehensive ibis support in Kedro; a first step may be to bring this into the kedro-datasets repo.

I'd love to provide blanket support for all ibis datasets. The problem I'm not sure how to solve is how to manage optional dependencies twice (e.g. an extra in kedro-datasets on top of an extra in ibis).
AFAIK @astrojuanlu is the python packaging 👑 so he might have more of an insight into what is possible using the new PEP 621 metadata, but bringing it into kedro-datasets would probably complicate that a bit more than keeping it as a separate package (or a subpackage within the kedro-extras repo), which would allow for specifying different dependency groups for kedro-ibis-dataset aligned with those Ibis defines.

As far as project maintenance goes, a GH action which parses Ibis' dependency groups and reproduces them as kedro-ibis-dataset groups shouldn't be too complicated to implement, though it might be kind of brittle if Ibis decides to move to a package manager other than poetry. Though I do know the tool I'm using (PDM) has a workflow to import poetry dependencies into PEP 621-compatible deps.

Sidenote: is there a better place we could be having this discussion? I guess the package repo would solve this issue, but maybe visibility is limited buried inside an issue.
Yeah, I don't think there's a way to have two levels of optional dependencies. We either include those as part of the optional dependency groups we already have:

```python
# kedro-datasets/setup.py
extras_require["spark.SparkDataSet"] = [SPARK, HDFS, S3FS, "ibis[pyspark]"]
```

or we provide another group:

```python
extras_require["spark.SparkDataSet"] = [SPARK, HDFS, S3FS]
extras_require["spark.SparkDataSet_lazy"] = [SPARK, HDFS, S3FS, "ibis[pyspark]"]
```
We don't have a well-defined process for how a dataset should "graduate" to be part of kedro-datasets, or whether we should control the maintenance burden in any way, so I'm improvising here. Since everybody can now do `pip install kedro-ibis-dataset`, we can start gathering real-world feedback.
We can connect on Slack.
I think the more solid path forward would be to say "all things SQL in Kedro go through Ibis" and make it a "first level" optional dependency. No need to add more levels of indirection, I'd say. To confidently commit to that plan, I propose we put this initial dataset in the hands of users and play around with it a bit more.
Dropping this here for future reference: https://delta.io/blog/2023-06-15-using-ibis-pyspark-delta-lake/
We did recently add Delta Lake read/write support in Ibis. Let us know if prioritizing that further would be helpful!
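For example, with the DuckDB backend (a sketch; the path is hypothetical, and the deltalake package is assumed to be installed):

```python
import ibis

con = ibis.duckdb.connect()                # in-memory DuckDB, no cluster involved
t = con.read_delta("path/to/delta_table")  # hypothetical Delta table path
print(t.head().execute())
```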
That's an amazing addition for interoperability. I don't use Spark, so I can't comment on that, but the ability to read/write Delta without needing a cluster running is very valuable.
That's super cool @lostmygithubaccount
Now that the package is out, I think the next steps are using it, finding bugs and corner cases, and promoting it a bit more heavily. Maybe present a conference talk in 2024 :) Other than that, any other loose ends? Or can we go ahead and close this issue?
🎉 https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis 🎉 Source code: https://github.com/deepyaman/jaffle-shop

So, technically this is possible. And now the question is what do we do about it:

1. Bring the Ibis dataset into kedro-datasets.
2. Keep it as a separate plugin.

A third option is not doing anything, but I'd love to take one of the two paths. We've discussed sometimes that, since Ibis has lots of different backends, managing that from within kedro-datasets could get complicated.
So I have 3 cascading questions: do we bring this into kedro-datasets or keep it as a separate plugin? If we bring it in, how do we manage Ibis' many backends and their optional dependencies? And do we need separate SQL and DataFrame flavours of the dataset, or just one?

Longer term I'd love to decrease the number of 1st-party datasets we maintain in kedro-datasets.
From my experience, ibis currently comes with quite a lot of dependencies even by default, so adding it to kedro-datasets would IMO be difficult to manage. I think there might be something that can be done on the ibis side to lighten that a bit (@cpcloud), but I wouldn't really see the current dependencies as something desirable to install by default. Keeping it as a separate package and managing that complexity on a more 1:1 basis between the IbisDataset package and Ibis makes more sense to me.
Agree with @inigohidalgo. Installing the union of all supported backends' dependencies by default would place quite a burden on end users. On our side, we're in the middle of slimming down Ibis' core dependencies as well.
We have to deal with roughly the same issue on our side, since most end users don't need 20 backends 😂 Could kedro-datasets take a similar approach to Ibis and use setuptools extras to let users install only the backend they need?
Yeah, they already do that AFAIK, but here, if it were included within kedro-datasets, the extras would have to be two levels deep (something like kedro-datasets[ibis-duckdb]). That would be easier to manage if you just pip installed kedro-ibis, which directly had the different dependency groups aligned with ibis'.
Ah, right. Thanks for the link, I didn't realize we're dealing with two levels of optionality here. Hm, I'm not entirely sure what the best option is here. Perhaps including it as part of the existing extras might be okay?
I'd be in favour of having the Ibis dataset live inside kedro-datasets.
Essentially this 💯
In progress: kedro-org/kedro-plugins#535
This is the key design question, I'd say.
I will defer to @deepyaman for his opinion on different backends, as I have only really worked with the mssql and polars backends, but I'm not sure what the functional difference would be between the SQL and DataFrame datasets, as the API should (here is where I will defer to Deepyaman and Phillip) be the same. Instantiating the connection will have its own parameters, but I know the team tries to keep ibis.connect as a common entrypoint for all backends. The main complexity is Ibis' different optional dependencies, and that would probably be the main maintenance burden within this hypothetical IbisDataset project, though it could probably be automated relatively easily, as we would only need to keep track of Ibis' dependency groups and make the same groups available within IbisDataset, each just pointing to the corresponding ibis[backend] extra.
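A sketch of what that common entrypoint looks like (the connection strings and table names are hypothetical):

```python
import ibis

# One entrypoint for every backend; only the URL scheme changes.
con = ibis.connect("duckdb://local.ddb")
# con = ibis.connect("postgres://user:pass@host:5432/analytics")

orders = con.table("orders")                 # hypothetical table
recent = orders.filter(orders.amount > 100)  # builds an expression, runs nothing
df = recent.execute()                        # executes on whichever backend con wraps
```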
Thanks @inigohidalgo. It's largely correct that there are no differences in backend output for a given API, assuming that both backends implement the operation. There are two major places where there is a difference in output when results are materialized to pandas.

The first is when dealing with data types whose pandas representation isn't pinned down across backends.

The second is NULL output. This one is kind of a mess right now, because pandas has 3 options for representing NULL and we haven't been very rigorous in Ibis about enforcing a single one in our outputs.

We're going to address both of these soon, likely in the next quarter.
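For reference, a quick illustration of the three pandas NULL representations in question:

```python
import numpy as np
import pandas as pd

# Three different "missing" markers a pandas result can carry:
pd.Series([None, "a"])                # None, in object columns
pd.Series([np.nan, 1.0])              # NaN, in float columns
pd.Series([pd.NA, 1], dtype="Int64")  # pd.NA, in nullable extension dtypes
```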
To solve the 2-level optional dependency conundrum, @deepyaman suggested flattening the two levels into one extra per backend (along the lines of kedro-datasets[ibis-duckdb]), so we can keep the ibis dataset inside kedro-datasets (helping discoverability).
Turns out @inigohidalgo actually suggested the same thing 5 days ago. 🤣 #2374 (comment)
Hah, right 😄 credit where it's due, @inigohidalgo!!
@deepyaman and I are in sync 🤣 This implementation makes sense to me, as long as the kedro team is happy to take on the complexity of keeping track of Ibis' published dependency groups, as we would have to maintain 1:1 parity of kedro-datasets[ibis-{{backend}}] for every backend which Ibis publishes.
I might be wrong, but in principle shouldn't it be

```toml
[project.optional-dependencies]
# ...
ibis-pandas = ["ibis[pandas]"]
```

?
If we do this in a separate plug-in, I wonder if we could automate the 1:1 parity with some sort of tooling which raises PRs in a Dependabot-like fashion.
The only exception is for Flink right now, where you have to include apache-flink yourself.
I was hoping to avoid this, because you can programmatically access the list of Ibis backends. As such, I'd written https://github.com/kedro-org/kedro-plugins/blob/09493963cf854cc852ef2377023e2aae13504f71/kedro-datasets/setup.py#L19-L27. However, with the move to static pyproject.toml metadata, that approach no longer works.

In the absence of an automated solution, I'm willing to maintain parity as Ibis publishes additional backends for now, as this is fairly infrequent. Furthermore, the extras are a convenience, and their absence will not block somebody from using a new Ibis backend with Kedro.
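That programmatic discovery looks roughly like this (a sketch; requires Python 3.10+ for the `group` keyword):

```python
from importlib.metadata import entry_points

# Each installed Ibis backend registers itself under the "ibis.backends"
# entry-point group, so the list can be built at runtime:
backends = sorted(ep.name for ep in entry_points(group="ibis.backends"))
print(backends)  # e.g. ['duckdb', 'pandas', 'sqlite', ...]
```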
If we were to have a separate kedro-ibis plugin, we could manage the backend extras there instead.
I'm okay with closing this, as the use case is wonderfully solved by ibis now.
Description
We are pulling from some upstream tables that have a long history, and when running our feature engineering, training, etc. pipelines, we only want to grab a certain window of data (the last year, for example). This isn't a static query and can change with each successive run of the pipeline.
Currently, the SQL Datasets only load either a static query or an entire table. I would like to be able to dynamically filter these tables based on certain conditions.
Context
It is important because currently, if we are consuming upstream tables with 6 years of history but only care about the last year, it takes way too long to ingest data that we then immediately discard (>80% of it).
Possible Implementation
An idea I am considering, thanks to @datajoely's feedback, is to create a new dataset type which, when loaded, returns an object that only actually loads the data when called, working the same way as PartitionedDataSet. This object can be manipulated within a node to add filters to it before finally calling pd.read_sql or the Spark equivalent. Below is a possible implementation which is working for me. I haven't written it up as a true Kedro dataset, as that involves a lot of extra code which would not actually help illustrate the functionality. The first block of code is the dataset definition, and the second block simulates how it would run in a kedro workflow (without actually implementing any nodes or pipelines, just for illustration).

lazy_load_sql_query_dataset.py
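(The original attachment is collapsed in the GitHub view; the following is a reconstruction sketch of the idea described above, using SQLAlchemy. Class, table, and parameter names are illustrative, not the author's.)

```python
import pandas as pd
import sqlalchemy as sa
from kedro.io import AbstractDataset  # kedro >= 0.19 naming


class LazySQLTable:
    """Accumulates WHERE clauses; nothing hits the database until .load()."""

    def __init__(self, con: str, table: str):
        self._engine = sa.create_engine(con)
        self._table = sa.Table(table, sa.MetaData(), autoload_with=self._engine)
        self._query = sa.select(self._table)

    @property
    def c(self):
        # Expose columns so nodes can build filter clauses, e.g. ds.c.date > x
        return self._table.c

    def filter(self, clause) -> "LazySQLTable":
        self._query = self._query.where(clause)
        return self

    def load(self) -> pd.DataFrame:
        # The only place data is actually pulled from the database.
        return pd.read_sql(self._query, self._engine)


class LazySQLQueryDataset(AbstractDataset):
    """Kedro dataset whose load() returns the lazy wrapper, not a DataFrame."""

    def __init__(self, table: str, credentials: dict):
        self._table = table
        self._con = credentials["con"]

    def _load(self) -> LazySQLTable:
        return LazySQLTable(self._con, self._table)

    def _save(self, data) -> None:
        raise NotImplementedError("This dataset is read-only")

    def _describe(self) -> dict:
        return {"table": self._table}
```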
test_dataset.py
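(Again a reconstruction sketch; the table name and connection string are hypothetical.)

```python
# Simulates the node-side usage, without wiring up an actual pipeline.
import datetime as dt

from lazy_load_sql_query_dataset import LazySQLQueryDataset

dataset = LazySQLQueryDataset(table="orders", credentials={"con": "sqlite:///example.db"})

lazy = dataset.load()  # returns a LazySQLTable; no data pulled yet
cutoff = dt.date.today() - dt.timedelta(days=365)
df = lazy.filter(lazy.c.order_date >= cutoff).load()  # pd.read_sql fires here
```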
Possible Alternatives
Messing with the context and other ways of modifying kedro behavior through TemplatedConfigLoader and hooks, but it's just hacky and not scalable.