Feast API: Feature references, concept hierarchy, and data model #479
7 Proposed changes
This section will contain proposed changes to Feast. These changes can serve as "straw men" to further the discussion.
7.1 Remove versions and migrate user data on changes
This change attempts to address problem (6.1 Feature set versions are unnecessary).
The removal of feature set versions has been proposed and discussed in #386. After the introduction of mutable feature sets in Feast 0.6, there will no longer be any value in keeping versions. Users will be able to make changes to their existing feature sets and reuse the feature set name. This would be a major quality of life upgrade for our users.
7.2 Projects as a property of feature sets
This change attempts to address problems (6.2 Projects could be unnecessary at the top of the concept hierarchy, 6.3 Projects are a cause for code smell in the data model, 6.4 Feature sets are a leaky abstraction).
Any changes to the data model or user facing API would require migrations, so the following proposal should not be taken lightly. That being said, if we believe the current API needs to change then the change should be incorporated as soon as possible. With the removal of feature set versions in 0.6, teams will have to migrate their data in any case, which would be an opportune time to cement these changes.
7.2.1 Change 1: Make projects an attribute/property of features
The idea here is to remove projects at the top of the concept hierarchy
Pros
Cons
Feature reference changes
Data model changes:
Concept hierarchy changes:
7.2.2 Change 2: Remove projects from data model
This is an extension of (Change 1). This proposed change is to make projects purely a retrieval abstraction.
Pros
Cons
Feature reference changes
Data model changes:
Concept hierarchy changes:
7.2.3 Change 3: Allow feature sharing between projects
This is an extension of (Change 2). In (Change 1) projects would become an attribute of a feature. The goal is still to provide a convenient way for access control, isolation, and referencing features. If features are still only unique up to a project or feature set level, then referencing features in different contexts (projects or domains) or directly by name will still be difficult. References would become
Instead of having a one-to-many mapping from project to features, this proposal is to make the mapping a many-to-many relationship. The same features can be found in multiple projects. This would allow a user to set their project and reference these features directly by name.
Pros
Cons
Feature reference changes
Data model changes:
Concept hierarchy changes:
7.2.4 Change 4: Consider renaming projects
This is an extension of (Change 3). Given that features would occur in multiple projects, these projects would probably be logically grouped according to various contexts that are not necessarily related to user projects; they could be grouped arbitrarily. A potentially more intuitive name might be applicable for referencing features, such as feature groups, domains, or repositories. This proposal would require more thought, and is probably safe to ignore for the time being.
7.2.5 Change 5: Remove feature references from feature rows
This change would attempt to address (6.4 Feature sets are a leaky abstraction).
Instead of having FeatureRows contain a "feature_set" field where producers should set the identity of a feature set, FeatureRows should be unique based on source location (table, topic). The means of identifying the source data should be contained within the feature set specification.
Pros
Cons
Feature reference changes
Data model changes:
Concept hierarchy changes: |
So just to share a possibility from my experience and wheelhouse: the plan for Feast in my org is to have a features repo that defines Avro schemas for feature sets. The feature set schemas (similarly to all of our event schemas) are generated programmatically, along with Python and Go glue code, annotated with a version for evolution purposes, and then applied on master merge to the Feast clusters. When a user wants to ingest features, they use the generated schema object to ingest, validate, and publish a dataframe (note that as an organization we use "ts" instead of datetime, so this also abstracts away that difference):
from pmfeatures.buyer import CustomerFeatures

def generate_features():
    ...
    # Wrap the feature dataframe in the generated schema object,
    # which validates it and publishes it for ingestion.
    customer_features = CustomerFeatures(
        customer_uuid=features["customer_uuid"],
        ts=features["ts"],
        features=features,
    )
    customer_features.publish()
    ...
We automatically annotate schema changes with an updated version, and enforce schema evolution rules (it seems we would want similar rules for updating feature sets in BigQuery if you want to use the same table) to make sure schemas are forward compatible. If Feast had an ability to specify the version, this is the one I would use. However, when ingesting features the version doesn't usually appear, and by enforcing schema evolution rules we can be sure that any serving code will work with updated schemas, since the only allowed operations are adding new nullable fields and relaxing the type of a field. I mention this because we are adopting the Confluent Schema Registry in our general Kafka strategy so that we don't have to have schema information encoded in the body of the message, so it seems like it could be used to help solve the outlined issues about an event knowing about its feature set (6.3). Additionally, we have a concept of namespace in our schemas, and we use that in the feature set name, and I've found that most users want the latest version of a feature set. It's for this reason that project and version seem safe to remove, perhaps by incorporating this into 7.2.1 Change 1. The first piece of utility code that I wrote for my feature set objects was a method that takes a list of features and annotates them with the latest version (it calls Feast core, I believe). |
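The evolution rules described in this comment (the only allowed operations are adding new nullable fields and relaxing the type of a field) can be sketched as a small compatibility check. This is a minimal illustration and not the tooling the commenter uses: the schema representation (a dict from field name to a set of allowed type names, with "null" marking a nullable field) and the function name `is_compatible_evolution` are assumptions made for the example.

```python
from typing import Dict, Set

# Hypothetical schema representation: field name -> set of allowed type names.
# A field counts as nullable when its type set contains "null".
Schema = Dict[str, Set[str]]


def is_compatible_evolution(old: Schema, new: Schema) -> bool:
    """Return True if `new` only adds nullable fields or relaxes field types."""
    for name, old_types in old.items():
        # Removing an existing field is not an allowed operation.
        if name not in new:
            return False
        # Relaxing a type means the new type set is a superset of the old one.
        if not old_types <= new[name]:
            return False
    for name, new_types in new.items():
        # Any newly added field must be nullable.
        if name not in old and "null" not in new_types:
            return False
    return True


v1: Schema = {"customer_uuid": {"string"}, "ts": {"long"}}
v2: Schema = {**v1, "lifetime_value": {"null", "double"}}  # adds a nullable field
v3: Schema = {"customer_uuid": {"string"}}                 # drops a field

print(is_compatible_evolution(v1, v2))
print(is_compatible_evolution(v1, v3))
```

A check like this could run in CI before a schema is applied, so that serving code written against an older schema keeps working after the update.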
We've run into this pain point before. Essentially the users have data in various formats and they need to map it to a schema in Feast. Feast now introduces its own format and supported data types, which is in some part driven by protos. But we could in theory move towards a model where the protos are limited to defining the Feast API, while the data still conforms to a standard like Avro or Arrow. I think that would make data handling somewhat easier. @ches, not sure if you have opinions here.
We are removing versions as they stand right now. After that we won't have a way to set metadata on a feature set, but we will have a way to set it on a feature. I think this would be a strong use case for being able to capture metadata on a feature set.
Is this simply a prefix to the name? Who defines these names?
The million dollar question is whether projects are still valuable as a means of isolation. We could implement 7.2.1, but how would we deal with users wanting their own workspaces where they could create their own feature sets and features? Name conflicts would be mysterious, because you won't be able to see the feature sets in another project that you have a conflict with. I just want to be 100% sure that this change in the way we use |
Just an update on this issue. I'd like to delay a decision on this until we have higher adoption. The most conservative approach, and the one I think we will try to enforce for 0.5, is as follows: Changes to Feast:
Changes to policies around feature creation:
What does this buy us?
|
I've spoken to quite a few folks over the last couple of weeks on this topic. It seems everyone wants something different so I would appreciate some input in order to get everyone aligned and to commit to an approach.
Background
Feast 0.3
Feast 0.4
Feast 0.5
What do we want?
Proposal
Both proposals
Proposal 1: Projects as selectors/tags
Proposal 2: Projects as folders
Keep projects out of feature references, enable global uniqueness in feature names again in Feast 0.5.
I am ignoring entities as part of this discussion because adding them to a feature reference (in the way that Uber does it) would not be a breaking change. Feedback would be highly appreciated now in order to avoid breaking changes in the future. If I don't hear back from anyone then we will probably proceed with |
What Currently Exists
Feature Sets
Feature Sets were introduced as a way to group data sources with a schema of the Features that are ingested into Feast: a Feature Set stores configuration for the data source and the schema of the sourced features. As a purely ingestion concept, it typically bears no relation to how users retrieve their features. Hence users find it a hassle to have to deal with Feature Sets.
I think this happens when we are trying to stretch the Feature Set, a purely ingestion concept, to also serve as a way to logically group Features (ie driver features, customer features). Feature Set schemas are coupled to how the data is stored in the data sources, so it's not possible to serve the other aim of logically grouping features with feature sets alone.
Here the intuitive need for logical grouping and namespacing of Features is present. Users work around this by prefixing their feature names:
Projects
Projects were introduced to attempt to solve the shortcomings of Feature Sets and as a stepping stone for Authentication. Projects also gave users the ability to namespace Features. However, how projects should be used is not clearly defined, nor do they fit cleanly into the data model. Should each project be meant for an entire team/company, or should a project be created for each model? Just like Feature Sets, this adds another layer of complexity that users would rather not have to worry about. Projects present an isolated view for users of Feast, providing a bubble for users to work in. On the surface, this is beneficial, as users will never step on the toes of users in another project. However, this is the antithesis of one of the objectives of Feast: Feature Reuse. Users would not be able to discover new features in Feast, as we present only the features in this artificial project bubble by default. Projects, as they stand currently, do not seem to have a proper place in the Feast data model. Within Gojek, we are trying to move away from projects by moving all Features/Feature Sets into one mono project.
What We Want
In my opinion, this is a misnomer, as users will engineer complex feature names to manually organize and namespace their Features. (ie
Proposals
Proposal 2: Projects as folders
Globally unique feature names push users to manually namespace their Feature names of their own accord. This wild west of naming with no conventions enforced might result in Feast becoming a sea of cryptic Feature names. Some users might start naming features
Removing the namespacing effect of Projects is effectively akin to doing what we did to Feature Sets in v0.4: turning it into another leaky abstraction.
Proposal 1: Projects as selectors/tags
As outlined above, there is still a need for a way to create a logical grouping of Features that is not tied to the data source like a Feature Set is. This "reincarnation" of projects as a view of Features exposed by Feature Sets could serve as that missing piece. However, I think trying to anticipate what could be relevant to the retriever in a "view" should be a non-goal, as it's hard to anticipate. Firming up this "view" into something more concrete: a Feature Entity that logically groups features based on a specific concrete entity. For example, let's say that I would like to track features for a driver entity. I would like to track his average rating and vehicle model. I can create a Feature Entity
Each Feature Entity should be an authoritative view of the Features for that entity.
A Consideration
Adding a new concept would add even more complexity to an already bloated data model
Currently, we regularly see these Feature names as some combination of entity name and actual feature name (ie |
Thanks for this post @mrzzy, you've outdone yourself as usual.
Yip, agreed.
I agree with the point you are making, but I am not sure how this relates to
Yes, agreed. I don't see a major downside to more verbose feature names for the time being though, especially if it buys us time to make more informed decisions.
As @ches noted, auth can happen at other layers; it doesn't have to be at the project layer. Projects were meant to abstract away feature sets (and ingestion level grouping) and to allow for direct referencing of features within your project (or across). I think the design mistake here, looking back, was incorporating them into the feature reference. Projects as an isolation system still add value in my opinion, but should not add complexity to the workflow of the user, as you have rightly pointed out.
I wouldn't say that this is true. The original design of projects was to allow for sharing of features (and later possibly entities) across projects. So you should be able to retrieve features from multiple projects, not just one, in a single query. That is why projects was included in the feature reference.
This isn't exactly true. We aren't moving away from projects; we are collocating our production feature sets in a single project, which happens to be the default project, in order to finish the very discussion we are having right now and decide on the future direction of the data/concept model. The reason for this approach is to ensure forward compatibility. If we go with the tag/label based approach then we might need to have unique feature set names, so collocating all feature sets in a project ensures that. If we go with the folder based approach then we have also ensured that, because all features and feature sets are in one project. Finally, using one project, especially the default project, also ensures that all feature references on clients only have the feature names. So it's less likely that we will have a breaking change in the future than if we had multi-component feature references client side.
At the modeling stage the feature reference will be collapsed into a single string, so whatever concepts we come up with will just fall away. The question is just how prescriptive we want to be and how many concepts we want to introduce. So we might not be able to get away with just feature names, but feature names are the essential complexity we have right now. Projects,
I don't think anybody would disagree with you, but from an API design perspective it is easier to go from
I resisted talking about entities because I thought the conversation might be orthogonal, but now might be a good time to bring it up. My first question to you is: How is a
The reason I thought this was an orthogonal discussion is because unique features could be a good first step. For example, let's say you have an account balance feature on drivers. Feature ref is just
The way that Uber does feature references is something like
What I was expecting this
Your suggestion seems to be more focused on the
Am I correct in saying that |
Between proposals 1 and 2, I would choose 2, simply due to the limited scope of changes and less work needed for a migration. However, as @mrzzy mentioned, going down this path will only make feature names more complicated. If we are allowed to propose non backward compatible changes: If we use a relational database as an analogy:
|
So by implication you are suggesting |
Yes, just like how it is not possible to write a SQL query without referencing the table names. I am aware, however, that this is a very big change from the previous versions of Feast. Using familiar database concepts would address one of the pain points about Feast, namely the abstract concepts. Databases, tables, and views are already concepts which data scientists are familiar with. |
It's not 100% clear what you mean. Tables / Columns function quite similarly to feature sets and features right now. They are inferred, and otherwise conflict. If you have
Then you don't specify the table as part of the column reference. If you have two tables then you have something like
If you don't provide a table alias then you get a conflict. The only real difference is that the table is provided per query/request.
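The table/column analogy can be checked directly against a SQL engine. The sketch below uses Python's built-in `sqlite3` module with two hypothetical tables (`driver_features` and `customer_features`, names invented for this illustration) that both contain a `balance` column: the bare column name is enough with a single table, and is rejected as ambiguous once both tables appear in the same query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE driver_features (entity_id INTEGER, balance REAL);
    CREATE TABLE customer_features (entity_id INTEGER, balance REAL);
    INSERT INTO driver_features VALUES (1, 10.0);
    INSERT INTO customer_features VALUES (1, 25.0);
    """
)

# One table: the column name alone is enough, the table is inferred.
single = conn.execute("SELECT balance FROM driver_features").fetchall()

# Two tables sharing a column name: the unqualified reference conflicts.
try:
    conn.execute(
        "SELECT balance FROM driver_features d "
        "JOIN customer_features c ON d.entity_id = c.entity_id"
    )
    conflict = None
except sqlite3.OperationalError as exc:
    conflict = str(exc)

# Qualifying each column with a table alias resolves the conflict.
qualified = conn.execute(
    "SELECT d.balance, c.balance FROM driver_features d "
    "JOIN customer_features c ON d.entity_id = c.entity_id"
).fetchall()

print(single, conflict, qualified)
```

In Feast terms, the alias plays the role of the feature set (or project) component of a feature reference: optional while names are unique, required once they collide.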
Yea I agree with this principle, but not necessarily the recommendation. |
Hi friends, just going to chime in here because I've been thinking about this from the perspective of Bigtable keys, and also the expressed desire for teams to collaborate and reuse feature references. We have a few different feature sets that have the same name, but this works because we auto-prefix all of our feature set applies with the namespace of the feature set. We have also taken the route right now of only using one project, but slip a namespace into our keys. We end up with something like
as our Cassandra keys, with the feature column having the feature name. While imagining how to construct our Bigtable keys, we want to make reads performant when looking up features from the same feature set, thus they must be ordered lexicographically. And teams may want to use features from the same feature set name across namespaces (the risk team and the buyer team both have features associated with places, for example), so the idea would be something like
This would allow for performant reads of prefixes, where you only need to read one interval with every call, and you don't have to read other namespaces' features if you don't want to, but if you do it will be performant, since every entity's features are adjacent lexicographically. This requires coordination from the ML org since you need to name your feature sets corresponding to the entities that represent their keys. I'm pretty agnostic as to what we ultimately do, since I can fit the implementation to be performant even if I need to do some fork trickery, but it does seem like project (I think in terms of namespaces) is a useful piece of information if you want to colocate yet partition features. |
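The prefix-read idea can be sketched with a sorted in-memory list standing in for a lexicographically ordered key-value store. The exact key layout from this comment did not survive in the thread, so the layout used here (`<feature_set>#<entity_id>#<namespace>`), the separator, and the helper names `row_key` and `prefix_scan` are all assumptions made for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

SEP = "#"  # assumed separator, must not occur inside key components

Row = Tuple[str, Dict[str, float]]


def row_key(feature_set: str, entity_id: str, namespace: str) -> str:
    """Build a key so that all namespaces of one entity sort next to each other."""
    return SEP.join((feature_set, entity_id, namespace))


def prefix_scan(sorted_rows: List[Row], prefix: str) -> List[Row]:
    """Read the single contiguous interval of rows whose key starts with `prefix`."""
    keys = [key for key, _ in sorted_rows]
    start = bisect_left(keys, prefix)
    result = []
    for key, value in sorted_rows[start:]:
        if not key.startswith(prefix):
            break
        result.append((key, value))
    return result


rows: List[Row] = sorted(
    [
        (row_key("places", "42", "buyer"), {"visits": 3.0}),
        (row_key("places", "42", "risk"), {"fraud_score": 0.1}),
        (row_key("places", "43", "buyer"), {"visits": 7.0}),
    ]
)

# All namespaces for entity 42 in one interval read.
all_namespaces = prefix_scan(rows, row_key("places", "42", ""))
# Only the risk namespace for entity 42.
risk_only = prefix_scan(rows, row_key("places", "42", "risk"))

print(len(all_namespaces), len(risk_only))
```

Ending the prefix with the separator matters: without it, a scan for entity `4` would also pick up entity `42`.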
Can I ask why features are stored with this table structure? For example, why are the features not split across tables by feature instead of by feature set? Is it for performance reasons at query time? Maybe it is tangential to this issue, but I am just trying to understand the reasoning for the design choice. |
If there is no mechanism for feature name consistency, then I believe that people will have poorly constructed and inconsistent names regardless of whether or not there are unique feature names. For example, the same happens with column names in databases, with inconsistencies with the use of lower cases and upper cases. This happens regardless of whether people can use the same name in different tables. If you want some form of consistency in naming, then I think you should have a mechanism to do that and it is not clear to me that projects is that. I agree that consistent naming is important. I am just not sure having the ability to use the same name in different projects is going to solve that issue. |
To my mind, the main proposition of projects would be to have more granular control of how features are organised across a company's feature store. This comes not just in terms of individuals wanting to develop in isolation, but also in terms of adding logical structure to the features such that they are easier to use and maintain across the business. To use the example of data engineering, in the company where I work our BigQuery data warehouse is split across different BigQuery projects. For example, we have core data models that are made available across the business and come with a high level of maturity in terms of things such as SLAs on the corresponding tables. Meanwhile, we also have extension data models that are used by individual teams or groups of teams in a particular department. The barrier for putting ETLs into production in these extension models is lower, but the SLAs are sufficient for the needs of these teams. Not only does this type of separation make logical sense for the business, but it also allows for easy control of user access settings. For example, if a user needs access to tables that are relevant to a given department, then it is only necessary to give them access to the few appropriate projects. Naively, I could see projects in Feast playing an analogous role to this type of separation, e.g., one project could be the core features used across the company for a particular entity of interest, such as a customer. |
Performance. We used to do this for Feast 0.1, but the performance was much worse. It requires a lookup for each feature, and this is especially costly for large joins. Not to mention that you would have 75% of the storage being used for the keys and 25% for the values.
Agreed.
Correct, projects just try to solve namespacing and isolation. They won't solve the naming issue.
This was one of the reasons for bringing in projects in the first place. We wanted to have isolated namespaces with some form of access control. I still think that is valuable, and I agree with everything you said. Projects are an intuitive concept that people just "get". And in fact our auth PR adds more controls there (#554), which I hope we can expand upon. All this being said, @ches has also called out that it would be just as simple to allow access control on feature sets as it is on projects.
How we are internally structuring our projects is as follows.
Once we roll out access control we will also allow teams to start using custom project namespaces for their development and iteration, with different SLAs.
Division of projects along entities seems like it could lead to users wanting to request features from multiple projects in one go, which leads to more complex feature references. One approach that has been mentioned that I think can achieve the same thing is namespacing through entities, so basically
But we would still have all features relevant to a consumer in a single project. |
Any chance you will open-source this CI pipeline? We want to do much the same thing. |
Don't mind open sourcing it, but not sure if we will have an opportunity any time soon. I'll set a reminder for myself. |
Thanks for the response @woop
OK, I see. Thanks for explaining. I assumed it must be something like this. This pattern still ties consumption patterns to the way in which features are ingested though, right? This is something that still feels a bit odd to me. It feels like a lot of these issues have a data engineering/data modelling type feel, e.g., how to structure your underlying data so that queries are performant. I just wonder whether there is a one-size-fits-all solution to this that will work for everyone, or whether it would be possible to provide users with more flexibility in how they structure the underlying data. For example, allowing people to make intermediary tables themselves that contain the features they want and that will be synced to the underlying features. I think several people have suggested something similar above. Again, I know this is probably tangential to this issue, so feel free to ignore me. I was just wondering if this is something you have considered. |
@tfurmston for the record I 100% agree that there is room for optimizing the data model here and to provide users with more flexibility. One way to do this is allow feature sets to be defined as materialized views.
During ingestion we could write to these materialized views as well as the original feature set tables. However, I don't see too much value of doing this for historical data. I do think it would be useful for online serving. In the case of online serving though, it would probably require a read + write since data will be coming in separate events. Alternatively we could maintain state in the ingestion jobs to support this. |
Perhaps that would work. To be honest, I still don't have enough usage of Feast from the user perspective to know either way. Just from reading this thread, it does seem that there are issues with the data model. Hence my comments. Re-reading @mrzzy's proposal from the 25th, i.e., grouping by entity, I think this makes a lot of sense. Maybe this would make a good default, and if it transpires that people need more flexibility, that can be addressed then. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
hey folks, is there a decision here? I think the discussion is super relevant, any reason for the issue to be closed? |
This issue is meant to be a discussion of the current Feast API as it relates to feature references, a key component of the user facing API. Additionally, it will also discuss the current data model and our concept hierarchy.
1. Background
The Feast user facing API and data model changed dramatically from 0.1 to 0.2+. The original intention was to simplify the API as much as possible and gradually evolve it as new user requirements became available.
Two important reference documents on this topic are
2. Problem statement
The Feast API is evolving as more and more teams adopt the software and share their requirements with us. In most cases this means an expansion of the API, but in some cases it means a reversal.
With the introduction of projects into Feast (Feast Projects RFC), our API has evolved again. This change has affected feature references, the data model, and concept hierarchy.
The most critical feedback on this change has been that it introduces unnecessary complexity to address problems (isolation, namespacing, security) that could be solved in a different way.
3. Objective
The point of this GitHub issue is to settle our API for feature references, our concept hierarchy, and data model in such a way that we
Put simply, we want to make sure that we are on the right path and make the necessary changes now, when it's least disruptive.
4. What are feature references?
Feature references (previously Feature Ids) are strings/objects within Feast that allow Feast and users of Feast to reference specific features. Feature references are primarily used as a means of indicating to Feast which features a user would like to retrieve.
Originally, feature references were defined as follows
<feature-set>:<feature-name>:<feature-version>
All parts of the above reference were required at the time.
Feature references have recently been updated (as part of the Projects RFC)
The move towards project namespaces now moves feature sets and features/entities into the following hierarchy
Feature references are now defined as:
<project>/<feature-name>:<feature-version>
The following constraints apply
One of our primary motivations was to allow users to reference features directly by name. With versions becoming optional and allowing the project to be set externally, this is now possible. Users can provide features as a list of feature names.
An example of feature references being used below (from the Python SDK):
5. How are feature references used?
5.1 During online serving
During online serving the user will provide two sets of information to Feast during feature retrieval.
Feast wants to construct a response object with all of the data from these features on all of these entities.
For example, if a user sends a request with a single feature reference as daily_transactions, Feast will attempt to add the missing information. It will add the project id (which currently must be provided by the user), it will then determine the feature set that contains that feature name, and then finally it will determine the latest version of the feature set in which the feature occurs.
Internally, Feast is left with something that resembles the following
my_customer_project/my_customer_feature_set:daily_transactions:3
Since features are stored based on feature sets, Feast first converts the above into what we can informally define as a feature set reference, resembling the following
<project>/<feature-set-name>:<feature-set-version>
or tangibly
my_customer_project/my_customer_feature_set:3
In the case of Redis, Feast will use the above feature set reference, along with the entities the user has provided, to construct a list of keys to look up. The responses from the database are then used to build a response object that is returned to the user.
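The resolution steps described in this section can be sketched as follows. This is a minimal illustration rather than Feast's actual implementation: the in-memory `REGISTRY`, the function names, and in particular the key layout produced by `lookup_keys` are assumptions made for the example; only the reference formats come from the text above.

```python
from typing import Dict, List, Set

# Hypothetical registry: project -> feature set name -> version -> feature names.
Registry = Dict[str, Dict[str, Dict[int, Set[str]]]]

REGISTRY: Registry = {
    "my_customer_project": {
        "my_customer_feature_set": {
            2: {"total_transactions"},
            3: {"total_transactions", "daily_transactions"},
        }
    }
}


def resolve(feature_name: str, project: str, registry: Registry) -> str:
    """Expand a bare feature name into <project>/<feature-set>:<feature>:<version>."""
    matches = []
    for feature_set, versions in registry[project].items():
        containing = [v for v, names in versions.items() if feature_name in names]
        if containing:
            # Use the latest feature set version in which the feature occurs.
            matches.append((feature_set, max(containing)))
    if len(matches) != 1:
        raise ValueError(f"{feature_name!r} matched {len(matches)} feature sets")
    feature_set, version = matches[0]
    return f"{project}/{feature_set}:{feature_name}:{version}"


def feature_set_reference(full_reference: str) -> str:
    """Drop the feature name to get <project>/<feature-set-name>:<feature-set-version>."""
    project, rest = full_reference.split("/", 1)
    feature_set, _feature, version = rest.split(":")
    return f"{project}/{feature_set}:{version}"


def lookup_keys(feature_set_ref: str, entity_ids: List[str]) -> List[str]:
    """Combine the feature set reference with each entity id (assumed key layout)."""
    return [f"{feature_set_ref}:{entity_id}" for entity_id in entity_ids]


full = resolve("daily_transactions", "my_customer_project", REGISTRY)
fs_ref = feature_set_reference(full)
keys = lookup_keys(fs_ref, ["customer_1", "customer_2"])
print(full, fs_ref, keys)
```

Note that `resolve` fails when a feature name appears in more than one feature set within the project, which is the name conflict case discussed elsewhere in this thread.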
5.2 During batch serving
The batch serving case is very similar to the online serving case, but with more complexity on queries and joins.
The user provides the following during batch retrieval
Feature references are converted into their full form, as well as used to create feature set references (as in online serving). In the case of BigQuery, the feature set reference maps directly to a table. For each feature set table that Feast needs to query features from, Feast runs a point-in-time correct query using the entities+timestamps for the specific feature columns. This produces a result table with the user's requested feature data for the given timestamps and features, but only for one specific feature set.
Feast then uses the entity columns in each feature set table as a means of joining the results of these sub-queries into a single resultant dataframe.
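A point-in-time correct lookup followed by a join on the entity column can be sketched in plain Python. In Feast this work is done by queries against the store (BigQuery in the example above); the in-memory tables, the column names, and the function names below are assumptions used only to show the logic: for each entity and timestamp, take the most recent feature value at or before that timestamp, then combine the per-feature-set results into one row.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical feature set table: entity id -> [(event_timestamp, value)],
# with each list sorted by event_timestamp.
FeatureTable = Dict[str, List[Tuple[int, float]]]


def point_in_time_lookup(table: FeatureTable, entity_id: str, ts: int) -> Optional[float]:
    """Return the latest value with event_timestamp <= ts, or None if there is none."""
    rows = table.get(entity_id, [])
    timestamps = [event_ts for event_ts, _ in rows]
    idx = bisect_right(timestamps, ts)
    return rows[idx - 1][1] if idx else None


def batch_retrieve(
    entity_rows: List[Tuple[str, int]], tables: Dict[str, FeatureTable]
) -> List[Dict[str, object]]:
    """Run one lookup per feature set table and join the results per entity row."""
    result = []
    for entity_id, ts in entity_rows:
        row: Dict[str, object] = {"entity_id": entity_id, "timestamp": ts}
        for column, table in tables.items():
            row[column] = point_in_time_lookup(table, entity_id, ts)
        result.append(row)
    return result


tables = {
    "daily_transactions": {"c1": [(10, 1.0), (20, 2.0)]},
    "balance": {"c1": [(15, 100.0)]},
}
output = batch_retrieve([("c1", 12), ("c1", 25)], tables)
print(output)
```

The first entity row gets no `balance` value because the only balance event (timestamp 15) happened after the requested timestamp (12), which is the property that prevents feature leakage from the future.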
5.3 During ingestion of data into stores
When loading data into Feast, data first needs to be converted into FeatureRow format and then pushed into a Kafka stream.
During this conversion to feature row form, it is necessary to set a field called feature_set with the feature set reference. To reiterate, the feature set reference looks something like:
<project>/<feature-set-name>:<feature-set-version>
Ingestion jobs that pick up these rows are then able to easily identify the row as belonging to a specific project and feature set. The jobs then write all of these rows to all of the stores that subscribe to these feature sets.
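The routing step can be sketched as follows. The `FeatureRow` and `Store` classes here are simplified stand-ins invented for this example (the real FeatureRow is a protobuf message, and real subscriptions and store writes are more involved); the only detail taken from the text is that each row carries a feature_set field and is written to every store subscribed to that feature set.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureRow:
    # Feature set reference: <project>/<feature-set-name>:<feature-set-version>
    feature_set: str
    fields: Dict[str, float]


@dataclass
class Store:
    name: str
    subscriptions: List[str]  # feature set references this store subscribes to
    rows: List[FeatureRow] = field(default_factory=list)


def ingest(rows: List[FeatureRow], stores: List[Store]) -> None:
    """Write each row to every store that subscribes to the row's feature set."""
    for row in rows:
        for store in stores:
            if row.feature_set in store.subscriptions:
                store.rows.append(row)


online = Store("online", ["my_customer_project/my_customer_feature_set:3"])
warehouse = Store(
    "warehouse",
    [
        "my_customer_project/my_customer_feature_set:3",
        "my_driver_project/my_driver_feature_set:1",
    ],
)

ingest(
    [
        FeatureRow("my_customer_project/my_customer_feature_set:3", {"daily_transactions": 4.0}),
        FeatureRow("my_driver_project/my_driver_feature_set:1", {"rating": 4.8}),
    ],
    [online, warehouse],
)
print(len(online.rows), len(warehouse.rows))
```

Because the routing key lives inside each row, a single topic can carry rows from several feature sets, which is the behaviour discussed in section 6.3.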
6. Problems with the current implementation
6.1 Feature set versions are unnecessary:
The concept of feature set versions was introduced in order to allow users to reuse feature set names. However, they add additional complexity at both ingestion time and retrieval time. Users need to know the correct version of a feature set to ingest data to and to retrieve data from. If they don't pin their retrieval to a specific version then they risk having their system go down at a version increment.
6.2 Projects could be unnecessary at the top of the concept hierarchy:
Projects as a concept was introduced to provide a means of
The problem with projects is that it introduces a layer into the concept hierarchy that makes Feast harder to understand and could be introducing unnecessary complexity. It's possible that all of the above requirements for introducing projects could be addressed while still maintaining feature sets as the top level concept.
6.3 Projects are a cause for code smell in the data model:
There are currently three locations where projects occur.
The current approach has code smell in the fact that FeatureRows have to know their own identity. Today, having each FeatureRow know its own identity allows Feast to consume from topics that contain mixed feature sets (versions and names). Feast is able to differentiate FeatureRows from each other and can know how to interpret their contents based on a feature reference contained within the row.
However, in the case that Feast were to consume features from an external stream that it had no control over (not even the data model), Feast would not have the feature set reference conveniently available inside the event payload.
The second occurrence of projects is in the store. Tables are currently named according to projectName_featureSet_version. Projects are a necessity here since feature set names can be duplicated across projects. However, projects are not essential complexity in the same way a feature set is, and don't seem natural to encode into the data model itself.
6.4 Feature sets are a leaky abstraction:
Feature sets are a core part of the existing data model. Feature data is stored on a feature set within a Feast store like Redis or BigQuery. In order to find the features a user is looking for, it is still necessary to determine the feature set they need from their feature reference. This seems to work at retrieval time since Feast Serving can maintain a cache of available feature sets (albeit introducing a new inefficiency during lookup). Two problems exist here:
(feature set references) and how users are consuming data (feature references). Users are loading in FeatureRows into feature sets, but they are querying out features from projects. Ideally these two concepts wouldn't be so distinct.
<project>/<feature-name>:<feature-version>. However, the concept of a feature-version doesn't exist. Features are currently inheriting their version from their feature set. So right now feature references still contain trace information about the parent feature set.