Separate transformation and materialization #4365
Comments
@franciscojavierarceo Hey, thanks for starting a discussion on this. Two initial questions from me so far:
I don't really get why you think this is the case. I guess they are sort of mixed in the streaming engine, is that what you're referring to? As far as batch materialization goes, materialization right now doesn't do any transforms whatsoever (it will after we introduce BatchFeatureViews) and they are imho properly decoupled.
Again, I don't understand why we're talking about write patterns here. The way I think about it, transformations don't have anything to do with materialization directly. Different types of Feature Views specify at what stage in the Feast workflow transformations should be applied (and by what component):
Lastly, I don't understand how a generic |
Correct but if we follow the existing What I'm really after is a good representation of that use case; i.e., "Compute features on demand and write them for efficient retrieval". At the moment that can be done via an ODFV+FV (where you write to the FV using the output of the ODFV), but really this should just be a transformation applied before the FV is written to the online store (like what is done with the ODFV code). |
Let's look at the docs: From the feature view page:
From the stream feature view page:
And the data source is described as:
Going back to my goal of finding a way to represent "Compute features on demand and write them for efficient retrieval", I don't think the existing conventions make this obvious. What do you think? Any thoughts on a naming convention? Maybe Batch Feature View is fine, and we add a new FV construct to represent what we want, like Persistent On Demand or something else? |
Because transformations are independent of materialization. For example, with a Spark offline store and MySQL online store you could use the same UDF to create your historical features and execute them in the use case I outlined. |
Sure, this probably comes down to naming. I'm not a big fan of |
What about |
Here are 17 suggestions from ChatGPT:
|
Here are the ones I like:
|
From my perspective, The It's probably more clear if we introduced the OnDemandFeatureView is special, I would think of it as "online" feature view, while stream/batch are offline feature view. The transformation of OnDemandFeatureView is pretty much up to the user's run environment. |
Agreed.
Agreed. I like the idea of creating a "Transformation Engine" construct, though I think for the
Currently, "On Demand" means "do this transformation at request time" and that language is fairly consistent for For a For this new use case, the language gets a little more opaque in my opinion. I suppose that's why I like |
If we think of it in terms of what happens at run time:
flowchart LR
D[On Demand] --> | get_online_features | E(Transformation + Retrieval)
A["Batch (Transformation)"] --> | get_online_features | B(Pure Retrieval)
C["Stream (Transformation + Write)"] --> | get_online_features | B(Pure Retrieval)
F["On Demand with Writes (Transformation + Write)"] --> | get_online_features | B(Pure Retrieval)
And so I think the clarity is worthwhile and valuable. |
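A minimal sketch of the paths in the flowchart, using an in-memory dict as a stand-in for the online store. Every name here (`ONLINE_STORE`, `write_features`, `add_total`, and this toy `get_online_features`) is hypothetical and chosen for illustration; none of it is Feast's actual API.

```python
from typing import Callable, Dict, Optional

Row = Dict[str, float]
Transform = Callable[[Row], Row]

# Stand-in for an online store: entity key -> feature row.
ONLINE_STORE: Dict[str, Row] = {}


def add_total(row: Row) -> Row:
    """Toy transformation: derive one feature from two inputs."""
    return {**row, "total": row["a"] + row["b"]}


def write_features(key: str, row: Row, transform: Optional[Transform] = None) -> None:
    """Batch, Stream, and 'On Demand with Writes' transform before the write."""
    ONLINE_STORE[key] = transform(row) if transform else row


def get_online_features(key: str, on_demand: Optional[Transform] = None) -> Row:
    """Pure retrieval unless an on-demand transform is supplied at read time."""
    row = ONLINE_STORE[key]
    return on_demand(row) if on_demand else row


# On Demand: raw values are stored, and the transform runs on every read.
write_features("raw", {"a": 1.0, "b": 2.0})
on_demand_result = get_online_features("raw", on_demand=add_total)

# On Demand with Writes: the transform runs once at write time; reads are lookups.
write_features("precomputed", {"a": 1.0, "b": 2.0}, transform=add_total)
precomputed_result = get_online_features("precomputed")
```

Both paths return the same row to the client; the difference is whether the derived value lives in the store or is recomputed per request.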
Personally not that into the "PrecomputedFeatureView", it's not a standard or widely used thing in industry... Also the Batch and Feature has nothing to do with the |
I think we have to steer the industry here, as this use case is needed but not well understood. Other feature stores settle on "On-Demand" or "Streaming"; see Featureform, Tecton, and Databricks for examples. The precomputed variant relies on a synchronous API call from the client to write to the feature store, so that's really the only difference, otherwise it would behave like a |
I just read through the notes--fascinating discussion here. I have some thoughts on transformations and UX. I like the idea of a transform method implementation on a FeatureView, that makes chaining (or more complex DAG operations) possible and more intuitive to code.
The transform function would output another FeatureView, and have the option of persisting the transformation to a store (or possibly locally in memory, and possibly lazily). This would make it easier to support "feature pipelines" later. It would also be more intuitive to keep transformation functionality centered in the transformation method (rather than in other parts of the FeatureView class): the FeatureView is the "noun" and transform is the "verb". A last thought is to consider the different types of transforms (Spark, Python func, Ray, Dask, etc.) and supported options (do we have several transform methods, or just one generic one?). Curious about your thoughts. |
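To make the chaining idea concrete, here is a rough sketch under stated assumptions. `SketchFeatureView`, its `transform`, `compute`, and `persist` methods, and the two toy functions are all hypothetical; this is not Feast's `FeatureView` class and not a proposed final API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, float]
RowFn = Callable[[Row], Row]


@dataclass(frozen=True)
class SketchFeatureView:
    """Hypothetical stand-in for a feature view (the 'noun')."""

    name: str
    source_rows: List[Row]
    # Transformations are recorded here and only run when compute() is called.
    steps: Tuple[RowFn, ...] = ()

    def transform(self, fn: RowFn, name: Optional[str] = None) -> "SketchFeatureView":
        """The 'verb': return a new view with fn appended to the pipeline."""
        return SketchFeatureView(
            name=name or f"{self.name}__{fn.__name__}",
            source_rows=self.source_rows,
            steps=self.steps + (fn,),
        )

    def compute(self) -> List[Row]:
        """Run the recorded steps, in order, over every source row."""
        rows = self.source_rows
        for fn in self.steps:
            rows = [fn(row) for row in rows]
        return rows

    def persist(self, store: Dict[str, List[Row]]) -> None:
        """Optionally write the computed rows to a store keyed by view name."""
        store[self.name] = self.compute()


def double_conv_rate(row: Row) -> Row:
    return {**row, "conv_rate_x2": row["conv_rate"] * 2}


def flag_high(row: Row) -> Row:
    return {**row, "is_high": float(row["conv_rate_x2"] > 1.0)}


base = SketchFeatureView("driver_stats", [{"conv_rate": 0.4}, {"conv_rate": 0.7}])
derived = base.transform(double_conv_rate).transform(flag_high, name="driver_flags")

store: Dict[str, List[Row]] = {}
derived.persist(store)  # nothing is computed until this point
```

Because each `transform` call returns a new view and leaves the original untouched, the same base view can feed several downstream pipelines, which is the DAG-style composition described above.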
Cool idea!
This was my thinking with the decorator approach.
We would be and we would declare this in the decorator via the |
@dandawg you should probably check #4277 out as well. The syntax should probably follow odfv pattern instead, but sure, the idea is the same.
As Francisco pointed out we have a concept of |
I hear you, I think there are pros and cons to this approach. Since you implemented ibis already, I think we should share this with folks but if someone wants to contribute a mode in |
I love support for ibis, as long as we don't force users to have to learn ibis (I know, it's easy, but there are lots of reasons developers may want/need to use specific tools). Also, I absolutely love odfv functionality. I'm glad we have it, but the current pattern makes it hard to understand where/how/when transformations are happening. I'd love for it to be more explicitly ordered and composable. In this example (from the docs), the transform and feature itself are conceptually convolved, which is somewhat confusing.
Above, we have "transformed_conv_rate". To know that this is an odfv, we have to traverse the code. In most IDEs, if it was a Python method, we could right click and view the definition and all references where the transform is used (but I can't if it's a string). I'm also not as confident about stack traces being able to reference the right line of code in the event of an error. I realize we may not want to address all of these issues here, but I wanted to comment on them for visibility. |
Sure, unless ibis itself becomes a lot more popular in the future, it's unlikely we can get away with just ibis. Another easy-to-maintain alternative is to have a generic sql mode, with users being responsible for ensuring that the SQL queries they provide match the dialect the engine expects. Or we can have more engine-specific modes as well, of course.
I hate to use the "it's a feature, not a bug" argument here, but there are good reasons for this separation (and for the API expecting strings as arguments) due to feast architecture. Even though feast has a "definitions as python code" approach, object definition and actual execution are completely decoupled. In other words, when an "administrator" runs a In a realistic production setting, those two environments may have nothing in common; in fact, python functions won't even be directly accessible most of the time, instead they are serialized with |
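To illustrate the definition/execution split in general terms, here is a small sketch that ships a function body as bytes and rebuilds it elsewhere. It uses only the standard library (`marshal` and `types.FunctionType`) as an assumption for demonstration; it is not a claim about which serializer feast uses, and the toy function is hypothetical.

```python
import builtins
import marshal
import types


def transformed_conv_rate(conv_rate: float) -> float:
    """Toy UDF written in the environment where definitions are applied."""
    return conv_rate * 100.0


# Definition side: only the compiled function body is shipped, as bytes.
payload: bytes = marshal.dumps(transformed_conv_rate.__code__)

# Execution side: rebuild a callable from the bytes. The original module and
# source file are not needed here, and in production may not exist at all.
rebuilt = types.FunctionType(
    marshal.loads(payload),
    {"__builtins__": builtins},
    "transformed_conv_rate",
)

result = rebuilt(0.5)
```

The serving side only ever sees `rebuilt`, never the original symbol, which is why "go to definition" and precise stack traces are harder to offer across that boundary. Note that `marshal` output is specific to a Python version, so this is a demonstration of the idea rather than a portable format.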
+1 There are ways to treat the UDFs as actual code but that has important consequences to the feature server behavior. |
Correct. In my previous role, we actually did couple them and the big consequence was that new features required a full deployment, which was costly in time. |
#4376 will solve this. |
Is your feature request related to a problem? Please describe.
As briefly mentioned in #4277, our current structure of having feature view decorators with a naming convention that references the ingestion and transformation pattern is confusing.
Transformation and Materialization are two separate constructs that should be decoupled.
Feature Views are simply schema definitions that can be used online and offline and historically did not support transformation. We should change this.
As a concrete, simple example suppose a user had a Spark offline store and MySQL online store using the Python feature server.
Suppose further that the user of the Feast service had 3 sets of data that required 3 different write patterns:
write-to-online-store endpoint).
Cases (1) and (2) are asynchronous and have no guarantees about the consistency of the data when a client requests those features, but (3), if explicitly chosen to be a synchronous write, would have much stronger guarantees about the consistency of the data.
If Feature Views allowed for Feature Transformations before writes, then the current view of Feature Views representing Batch Features alone breaks down. This poor clarity is rooted in the mixture of transformations and materializations. Transformations can happen as part of a batch job, a streaming pipeline, or during an API call, and can be run by different computation engines (Spark, a Flink application, or a simple Python microservice). Materialization can technically be done independently of the computation engine (e.g., the output of a Spark job can be materialized to the online store using something else).
If we enable Feature Views to allow for transformations, they no longer represent only batch feature views, so adding a decorator (as proposed in #4277) to represent that would be confusing.
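One way to picture the decoupling argued for here is the rough sketch below. The `transform` decorator, the `TRANSFORMS` registry, `materialize`, and the `trips_per_hour` function are all hypothetical names for illustration, and plain Python stands in for whichever engine actually runs the transformation.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]
TRANSFORMS: Dict[str, Callable[[Row], Row]] = {}


def transform(fn: Callable[[Row], Row]) -> Callable[[Row], Row]:
    """Hypothetical decorator: registers a transformation and nothing else."""
    TRANSFORMS[fn.__name__] = fn
    return fn


@transform
def trips_per_hour(row: Row) -> Row:
    return {**row, "trips_per_hour": row["trips"] / row["hours"]}


def materialize(rows: Iterable[Row], key_field: str, online_store: Dict[float, Row]) -> None:
    """Write already-computed rows; unaware of which engine produced them."""
    for row in rows:
        online_store[row[key_field]] = row


online_store: Dict[float, Row] = {}

# Batch path: an engine applies the UDF to many rows, then hands them over.
batch_rows: List[Row] = [
    {"driver_id": 1, "trips": 10, "hours": 5},
    {"driver_id": 2, "trips": 9, "hours": 3},
]
materialize((trips_per_hour(r) for r in batch_rows), "driver_id", online_store)

# API path: a single request applies the same UDF and writes immediately.
materialize([trips_per_hour({"driver_id": 3, "trips": 4, "hours": 2})], "driver_id", online_store)
```

The transformation code is identical on both paths; only the caller and the timing of the write differ, which is the distinction the naming of feature view types would need to capture.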
Describe the solution you'd like
We should update Feast to use a transform decorator, and the write patterns should be more tightly coupled with the Feature View type. For example, Stream, On Demand, Batch, and regular Feature Views could all use the same transformation code but offer different guarantees about how the data will be written (Stream: asynchronously, On Demand: not at all, Batch: asynchronously, and Feature View: synchronously).
Describe alternatives you've considered
N/a
Additional context
@tokoko @HaoXuAI @shuchu what do you think here? My thoughts here aren't perfectly fleshed out, but it's something that I've been thinking about and trying to find a way to articulate well.