Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add delta format to FileSource, add support for it in ibis/duckdb #4123

Merged
merged 2 commits into from
Apr 20, 2024

Conversation

tokoko
Copy link
Collaborator

@tokoko tokoko commented Apr 20, 2024

What this PR does / why we need it:

This PR adds delta format to FileSource (only parquet was supported before), also implements logic for handling delta sources in ibis offline store. It also adds integration tests for duckdb offline store with delta sources, dukdb tests are effectively run twice, once for parquet and then for delta.

Signed-off-by: tokoko <togurg14@freeuni.edu.ge>
Signed-off-by: tokoko <togurg14@freeuni.edu.ge>
Copy link
Collaborator

@HaoXuAI HaoXuAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.
I was wondering how we can use delta lake. One question, how is it possible to use the delta format with the spark offline store or spark materialization?

@tokoko
Copy link
Collaborator Author

tokoko commented Apr 20, 2024

SparkSource already supports it afaik, SparkSource can also be used to basically just describe the source as a table name without actually caring about the format (I think). while that is probably fine for spark offline store, one downside is that it's tied to just a single implementation.

What I'm hoping to achieve with FileSource is that we can have a single source type that can be read by multiple offline stores (dask, duckdb, spark and others) so you can define a set of sources in your feature repository that won't lock you to use a specific offline store only.

@HaoXuAI
Copy link
Collaborator

HaoXuAI commented Apr 20, 2024

SparkSource already supports it afaik, SparkSource can also be used to basically just describe the source as a table name without actually caring about the format (I think). while that is probably fine for spark offline store, one downside is that it's tied to just a single implementation.

What I'm hoping to achieve with FileSource is that we can have a single source type that can be read by multiple offline stores (dask, duckdb, spark and others) so you can define a set of sources in your feature repository that won't lock you to use a specific offline store only.

Make sense. I guess for the pyspark, in order to use delta format, it needs to install a delta spark plugin?

@HaoXuAI
Copy link
Collaborator

HaoXuAI commented Apr 20, 2024

LGTM!
Would be great to update the document as well

@HaoXuAI HaoXuAI merged commit 2b6f1d0 into feast-dev:master Apr 20, 2024
21 checks passed
@tokoko tokoko deleted the delta-file-source branch April 20, 2024 17:07
@tokoko
Copy link
Collaborator Author

tokoko commented Apr 20, 2024

Make sense. I guess for the pyspark, in order to use delta format, it needs to install a delta spark plugin?

Yeah, delta data source needs to be configured beforehand in some way.

@ion-elgreco
Copy link

@tokoko nice PR!

I kind of had put this on a back burner : P and focused my energy more on delta-rs.

@tokoko
Copy link
Collaborator Author

tokoko commented Apr 23, 2024

thanks, this wouldn't be possible without delta-rs.

franciscojavierarceo pushed a commit that referenced this pull request May 24, 2024
# [0.38.0](v0.37.0...v0.38.0) (2024-05-24)

### Bug Fixes

* Add vector database doc ([#4165](#4165)) ([37f36b6](37f36b6))
* Change checkout action back to v3 from v5 which isn't released yet ([#4147](#4147)) ([9523fff](9523fff))
* Change numpy version <1.25 dependency to <2 in setup.py ([#4085](#4085)) ([2ba71ff](2ba71ff)), closes [#4084](#4084)
* Changed the code the way mysql container is initialized.  ([#4140](#4140)) ([8b5698f](8b5698f)), closes [#4126](#4126)
* Correct nightly install command, move all installs to uv ([#4164](#4164)) ([c86d594](c86d594))
* Default value is not set in Redis connection string using environment variable ([#4136](#4136)) ([95acfb4](95acfb4)), closes [#3669](#3669)
* Get container host addresses from testcontainers (java) ([#4125](#4125)) ([9184dde](9184dde))
* Get rid of empty string `name_alias` during feature view projection deserialization  ([#4116](#4116)) ([65056ce](65056ce))
* Helm chart `feast-feature-server`, improve Service template name ([#4161](#4161)) ([dedc164](dedc164))
* Improve the code related to on-demand-featureview. ([#4203](#4203)) ([d91d7e0](d91d7e0))
* Integration tests for async sdk method ([#4201](#4201)) ([08c44ae](08c44ae))
* Make sure schema is used when calling `get_table_query_string` method for Snowflake datasource ([#4131](#4131)) ([c1579c7](c1579c7))
* Make sure schema is used when generating `from_expression` for Snowflake ([#4177](#4177)) ([5051da7](5051da7))
* Pass native input values to `get_online_features` from feature server ([#4117](#4117)) ([60756cb](60756cb))
* Pass region to S3 client only if set (Java) ([#4151](#4151)) ([b8087f7](b8087f7))
* Pgvector patch ([#4108](#4108)) ([ad45bb4](ad45bb4))
* Update doc ([#4153](#4153)) ([e873636](e873636))
* Update master-only benchmark bucket name due to credential update ([#4183](#4183)) ([e88f1e3](e88f1e3))
* Updating the instructions for quickstart guide. ([#4120](#4120)) ([0c30e96](0c30e96))
* Upgrading the test container so that local tests works with updated d… ([#4155](#4155)) ([93ddb11](93ddb11))

### Features

* Add a Kubernetes Operator for the Feast Feature Server ([#4145](#4145)) ([4a696dc](4a696dc))
* Add delta format to `FileSource`, add support for it in ibis/duckdb ([#4123](#4123)) ([2b6f1d0](2b6f1d0))
* Add materialization support to ibis/duckdb ([#4173](#4173)) ([369ca98](369ca98))
* Add optional private key params to Snowflake config ([#4205](#4205)) ([20f5419](20f5419))
* Add s3 remote storage export for duckdb ([#4195](#4195)) ([6a04c48](6a04c48))
* Adding DatastoreOnlineStore 'database' argument. ([#4180](#4180)) ([e739745](e739745))
* Adding get_online_features_async to feature store sdk ([#4172](#4172)) ([311efc5](311efc5))
* Adding support for dictionary writes to online store  ([#4156](#4156)) ([abfac01](abfac01))
* Elasticsearch vector database ([#4188](#4188)) ([bf99640](bf99640))
* Enable other distance metrics for Vector DB and Update docs ([#4170](#4170)) ([ba9f4ef](ba9f4ef))
* Feast/IKV datetime edgecase errors ([#4211](#4211)) ([bdae562](bdae562))
* Feast/IKV documenation language changes ([#4149](#4149)) ([690a621](690a621))
* Feast/IKV online store contrib plugin integration ([#4068](#4068)) ([f2b4eb9](f2b4eb9))
* Feast/IKV online store documentation ([#4146](#4146)) ([73601e4](73601e4))
* Feast/IKV upgrade client version ([#4200](#4200)) ([0e42150](0e42150))
* Incorporate substrait ODFVs into ibis-based offline store queries ([#4102](#4102)) ([c3a102f](c3a102f))
* Isolate input-dependent calculations in `get_online_features` ([#4041](#4041)) ([2a6edea](2a6edea))
* Make arrow primary interchange for online ODFV execution ([#4143](#4143)) ([3fdb716](3fdb716))
* Move data source validation entrypoint to offline store ([#4197](#4197)) ([a17725d](a17725d))
* Upgrading python version to 3.11, adding support for 3.11 as well. ([#4159](#4159)) ([4b1634f](4b1634f)), closes [#4152](#4152) [#4114](#4114)

### Reverts

* Reverts "fix: Using version args to install the correct feast version" ([#4112](#4112)) ([b66baa4](b66baa4)), closes [#3953](#3953)
franciscojavierarceo pushed a commit that referenced this pull request May 27, 2024
# [0.38.0](v0.37.0...v0.38.0) (2024-05-24)

### Bug Fixes

* Add vector database doc ([#4165](#4165)) ([37f36b6](37f36b6))
* Change checkout action back to v3 from v5 which isn't released yet ([#4147](#4147)) ([9523fff](9523fff))
* Change numpy version <1.25 dependency to <2 in setup.py ([#4085](#4085)) ([2ba71ff](2ba71ff)), closes [#4084](#4084)
* Changed the code the way mysql container is initialized.  ([#4140](#4140)) ([8b5698f](8b5698f)), closes [#4126](#4126)
* Correct nightly install command, move all installs to uv ([#4164](#4164)) ([c86d594](c86d594))
* Default value is not set in Redis connection string using environment variable ([#4136](#4136)) ([95acfb4](95acfb4)), closes [#3669](#3669)
* Get container host addresses from testcontainers (java) ([#4125](#4125)) ([9184dde](9184dde))
* Get rid of empty string `name_alias` during feature view projection deserialization  ([#4116](#4116)) ([65056ce](65056ce))
* Helm chart `feast-feature-server`, improve Service template name ([#4161](#4161)) ([dedc164](dedc164))
* Improve the code related to on-demand-featureview. ([#4203](#4203)) ([d91d7e0](d91d7e0))
* Integration tests for async sdk method ([#4201](#4201)) ([08c44ae](08c44ae))
* Make sure schema is used when calling `get_table_query_string` method for Snowflake datasource ([#4131](#4131)) ([c1579c7](c1579c7))
* Make sure schema is used when generating `from_expression` for Snowflake ([#4177](#4177)) ([5051da7](5051da7))
* Pass native input values to `get_online_features` from feature server ([#4117](#4117)) ([60756cb](60756cb))
* Pass region to S3 client only if set (Java) ([#4151](#4151)) ([b8087f7](b8087f7))
* Pgvector patch ([#4108](#4108)) ([ad45bb4](ad45bb4))
* Update doc ([#4153](#4153)) ([e873636](e873636))
* Update master-only benchmark bucket name due to credential update ([#4183](#4183)) ([e88f1e3](e88f1e3))
* Updating the instructions for quickstart guide. ([#4120](#4120)) ([0c30e96](0c30e96))
* Upgrading the test container so that local tests works with updated d… ([#4155](#4155)) ([93ddb11](93ddb11))

### Features

* Add a Kubernetes Operator for the Feast Feature Server ([#4145](#4145)) ([4a696dc](4a696dc))
* Add delta format to `FileSource`, add support for it in ibis/duckdb ([#4123](#4123)) ([2b6f1d0](2b6f1d0))
* Add materialization support to ibis/duckdb ([#4173](#4173)) ([369ca98](369ca98))
* Add optional private key params to Snowflake config ([#4205](#4205)) ([20f5419](20f5419))
* Add s3 remote storage export for duckdb ([#4195](#4195)) ([6a04c48](6a04c48))
* Adding DatastoreOnlineStore 'database' argument. ([#4180](#4180)) ([e739745](e739745))
* Adding get_online_features_async to feature store sdk ([#4172](#4172)) ([311efc5](311efc5))
* Adding support for dictionary writes to online store  ([#4156](#4156)) ([abfac01](abfac01))
* Elasticsearch vector database ([#4188](#4188)) ([bf99640](bf99640))
* Enable other distance metrics for Vector DB and Update docs ([#4170](#4170)) ([ba9f4ef](ba9f4ef))
* Feast/IKV datetime edgecase errors ([#4211](#4211)) ([bdae562](bdae562))
* Feast/IKV documenation language changes ([#4149](#4149)) ([690a621](690a621))
* Feast/IKV online store contrib plugin integration ([#4068](#4068)) ([f2b4eb9](f2b4eb9))
* Feast/IKV online store documentation ([#4146](#4146)) ([73601e4](73601e4))
* Feast/IKV upgrade client version ([#4200](#4200)) ([0e42150](0e42150))
* Incorporate substrait ODFVs into ibis-based offline store queries ([#4102](#4102)) ([c3a102f](c3a102f))
* Isolate input-dependent calculations in `get_online_features` ([#4041](#4041)) ([2a6edea](2a6edea))
* Make arrow primary interchange for online ODFV execution ([#4143](#4143)) ([3fdb716](3fdb716))
* Move data source validation entrypoint to offline store ([#4197](#4197)) ([a17725d](a17725d))
* Upgrading python version to 3.11, adding support for 3.11 as well. ([#4159](#4159)) ([4b1634f](4b1634f)), closes [#4152](#4152) [#4114](#4114)

### Reverts

* Reverts "fix: Using version args to install the correct feast version" ([#4112](#4112)) ([b66baa4](b66baa4)), closes [#3953](#3953)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants