Support for inserting data in ibis.TableDataset #834

Open
vishu1994 opened this issue Sep 14, 2024 · 1 comment

@vishu1994

Description

In ETL pipelines, loading transformed data into various data warehouses is a critical requirement. Currently, the ibis.TableDataset connector in Kedro does not support data insertion into Ibis backends.

Context

Why is this change important to me?

We are developing ETL pipelines in our organization, and inserting records into data warehouses is an essential requirement. At present, without support for data insertion, we must bypass the Kedro DataCatalog and rely on external tools such as SQLAlchemy or dataset to handle native data storage operations.

How would I use it?

Supporting data insertion in ibis.TableDataset would allow us to maintain a clean and consistent pipeline, avoiding the need for custom load operations within nodes. This would simplify the workflow and allow Kedro to manage the complete I/O process.

How can it benefit other users?

With this feature, users could avoid writing custom loading logic inside nodes and keep their pipelines cleaner. This would make Kedro more usable where heavy I/O operations are involved, particularly for teams working with data warehouses or similar storage backends.

@deepyaman
Member

Sounds good! I'm going to assign you, since you've expressed interest in contributing to Kedro, and I think this is a great starting point. Happy to help provide guidance (and I think anybody on the Kedro team can also help answer questions, as this should be fairly standard to add).

ibis.TableDataset currently works by calling create_table or create_view here: https://github.com/kedro-org/kedro-plugins/blob/kedro-datasets-4.1.0/kedro-datasets/kedro_datasets/ibis/table_dataset.py#L181

You will need to figure out an ergonomic way to specify that it's going to be an "insert" operation. One possible way is to define a mode argument, similar to PySpark's DataFrameWriter.mode (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.mode.html#pyspark.sql.DataFrameWriter.mode) or the mode parameter of pandas.DataFrame.to_csv (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html). I feel like this would be pretty familiar to Kedro users, but I also haven't given much thought to alternatives so far. :)
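To make the mode idea concrete, here is a rough sketch of how a save could dispatch on such an argument. It is only an illustration, not the dataset's actual code: the function name save_with_mode, the mode values "append", "overwrite" and "error", and the RecordingConnection stand-in are all hypothetical. The stand-in mimics the create_table(..., overwrite=...) and insert(...) methods that Ibis backends generally expose, so the snippet runs without a database; the exact signatures on a real backend should be checked against the Ibis docs.

```python
from typing import Any, Literal

# Hypothetical mode names, loosely modeled on PySpark's writer modes.
SaveMode = Literal["append", "overwrite", "error"]


class RecordingConnection:
    """Stand-in for an Ibis backend connection (assumption for illustration).

    It records which method was called instead of touching a database.
    """

    def __init__(self) -> None:
        self.calls: list[tuple[str, str, dict[str, Any]]] = []

    def create_table(self, name: str, obj: Any, *, overwrite: bool = False) -> None:
        self.calls.append(("create_table", name, {"overwrite": overwrite}))

    def insert(self, name: str, obj: Any) -> None:
        self.calls.append(("insert", name, {}))


def save_with_mode(
    connection: Any, table_name: str, data: Any, mode: SaveMode = "error"
) -> None:
    """Dispatch a save to the backend call that matches the requested mode."""
    if mode == "append":
        # Add rows to a table that already exists.
        connection.insert(table_name, data)
    elif mode == "overwrite":
        # Replace the table if it exists.
        connection.create_table(table_name, data, overwrite=True)
    elif mode == "error":
        # Create the table; a real backend would fail if it already exists.
        connection.create_table(table_name, data, overwrite=False)
    else:
        raise ValueError(f"Unsupported mode: {mode!r}")


con = RecordingConnection()
save_with_mode(con, "events", [{"id": 1}], mode="append")
print(con.calls)
```

In a catalog entry, the mode would presumably sit alongside the existing save arguments, but where exactly it belongs is one of the design questions to settle in this thread.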

Please feel free to further discuss how you want to implement it here, or raise a PR with an initial stab that we can discuss—whatever works best for you!
