Entity types as a higher-level concept #405

woop · 2020-01-04T03:30:09Z

Introduction

Currently an entity, or more formally an entity type, is treated as a special type of field within a feature set. There has been an attempt to simplify the creation and management of entities and to keep them consistent with features, however some challenges exist with our current approach.

Note: The terms entity and entity type will be used interchangeably in this issue.

How are entities created?

Users define an entity as part of a feature set. An entity in this case is a field like any other within the feature set. More than one entity can exist within a feature set.
An entity's name must be unique within a feature set.
There are no constraints on entities outside of a feature set, either at the project or global level. This means that multiple feature sets can define the same entities again.

How are entities used?

Retrieving feature values: Entities are used as a key for retrieving features. In order to retrieve feature values within a feature set, all entities must be provided as part of the lookup.
Joining feature sets: In the event that feature values are being retrieved from multiple feature sets, entities are used to look up these feature values. Entities are also used to join across these feature sets to construct a single result set.
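To make the lookup and join behaviour above concrete, here is a minimal in-memory sketch. It is not Feast's API: `FeatureSet`, `put`, `get` and `join_lookup` are hypothetical names, and the `feature_set:feature` output naming is an assumption made only for illustration.

```python
from typing import Any, Dict, List, Tuple

EntityKey = Tuple[Tuple[str, Any], ...]


class FeatureSet:
    """Hypothetical feature set whose rows are keyed by all of its entities."""

    def __init__(self, name: str, entity_names: List[str]) -> None:
        self.name = name
        self.entity_names = sorted(entity_names)
        self._rows: Dict[EntityKey, Dict[str, Any]] = {}

    def _key(self, entities: Dict[str, Any]) -> EntityKey:
        # Every entity of the feature set must be provided for a lookup.
        missing = [n for n in self.entity_names if n not in entities]
        if missing:
            raise KeyError(f"{self.name}: missing entities {missing}")
        return tuple((n, entities[n]) for n in self.entity_names)

    def put(self, entities: Dict[str, Any], features: Dict[str, Any]) -> None:
        self._rows[self._key(entities)] = dict(features)

    def get(self, entities: Dict[str, Any]) -> Dict[str, Any]:
        return dict(self._rows.get(self._key(entities), {}))


def join_lookup(feature_sets: List[FeatureSet], entities: Dict[str, Any]) -> Dict[str, Any]:
    """Look up each feature set by its own entities and merge into one row."""
    row = dict(entities)
    for fs in feature_sets:
        for feature, value in fs.get(entities).items():
            row[f"{fs.name}:{feature}"] = value
    return row
```

Note that the join only works if both feature sets mean the same thing by `customer_id`, which is exactly the consistency that is not enforced today.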

What is the problem?

Discovery: It seems intuitive that users would start their discovery experience from the point of view of an entity type, since their business problem is generally framed around one or more entities. Nesting entities within feature sets and projects, with no dedicated way to browse them, makes discovery harder.
Consistency: Entities are typically consistent across all projects and systems in most organizations. This consistency is not enforced in Feast at the moment. Users are bound to redefine entities in their local projects if no consistency is enforced at an organizational level. Inconsistent definitions would then cause failures during lookups or joins across feature sets, especially when joins need to happen across projects.
Key building: If entities and features maintain mutual compatibility in terms of supported data types, then support must be maintained for building keys from all feature value types. This adds a lot of complexity to key building since support must be maintained to serialize complex composite data structures in order to build these keys.
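As a rough illustration of why a restricted set of entity types simplifies key building, the sketch below only accepts `int` and `str` entity values and serializes them with the standard `json` module. The function name `build_key`, the allowed types, and the key layout are all assumptions for the example, not how Feast builds keys.

```python
import json
from typing import Dict, Union

EntityValue = Union[int, str]


def build_key(project: str, entities: Dict[str, EntityValue]) -> str:
    """Build a deterministic lookup key from a small set of allowed types."""
    for name, value in entities.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, str)):
            raise TypeError(f"entity {name!r} has unsupported type {type(value).__name__}")
    # Sorting by entity name makes the key independent of argument order.
    # JSON keeps 1 and "1" distinct, so int and string ids never collide.
    return json.dumps([project, sorted(entities.items())], separators=(",", ":"))
```

If entities had to support every feature value type (lists, maps, nested structures), the type check above would turn into a full serializer for composite values.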

Proposals

1. Project-level entities

Functionality

Entities are created outside of feature sets, but they still reside in a specific project namespace.
Entities have their own distinct API and supported data types (which may be more limited than features).
Entities must be unique within a project namespace, but can be duplicated across an organization. Uniqueness is ensured through a full entity reference (gojek/customer).
Entities are still defined as part of a feature set, but this is a selection process instead of creation.
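A minimal sketch of what this could look like, assuming a hypothetical `EntityRegistry` (none of these names exist in Feast today): entities are created once per project, and feature sets select them by full reference.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Entity:
    project: str
    name: str
    value_type: str  # e.g. "INT64" or "STRING"; the type names are illustrative

    @property
    def reference(self) -> str:
        # Full entity reference, e.g. "gojek/customer".
        return f"{self.project}/{self.name}"


class EntityRegistry:
    """Hypothetical registry enforcing uniqueness within a project namespace."""

    def __init__(self) -> None:
        self._entities: Dict[Tuple[str, str], Entity] = {}

    def create(self, project: str, name: str, value_type: str) -> Entity:
        key = (project, name)
        if key in self._entities:
            raise ValueError(f"entity {project}/{name} already exists")
        entity = Entity(project, name, value_type)
        self._entities[key] = entity
        return entity

    def select(self, reference: str) -> Entity:
        # Feature sets select an existing entity instead of creating one.
        project, name = reference.split("/", 1)
        return self._entities[(project, name)]
```

The same name can exist in two projects (`gojek/customer` and `gopay/customer`), but creating `gojek/customer` twice is rejected.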

Advantages

Entities receive all the sharing and isolation benefits of "projects". Entities would not have to be treated separately from a logical and/or development standpoint. There would also be no explosion of a global entity namespace.
Users are free to experiment and develop within their projects without affecting other users, since duplication is allowed across projects.
No need for a central team to gate-keep the creation of entities.

Disadvantages

By not elevating entities to the global level, end users would be required to know which projects contain the entities they should be referencing. This means an organizational process must exist in order to select these entities.
Most projects would have to reference entities from another more authoritative project. In fact, it's likely that an organization will have a central project which contains only entities. This could be a little counter-intuitive if a feature set contains fields that are referencing an external project.

2. Global-level entities

Functionality

Entities are defined globally for a Feast deployment.
Entities have their own distinct API and supported data types (which may be more limited than features).
Entities must be globally unique.
Entities are still defined as part of a feature set, but this is a selection process instead of creation.

Advantages

Central authoritative listing of entities within an organization.
Easier to discover which entities should be used, without needing an organizational policy.
Easy to reason about and easier to understand when referencing an entity within a feature set.

Disadvantages

Requires development of separate logic from projects, feature sets, and features.
Requires a team and process to manage the creation of entities.
No way to isolate conflicts. If one team wants to use a float and another wants to use a string for an entity data type, then it would likely result in two entities being created. This would still be the case in the Project-level entity proposal, but at least in that proposal the unorthodox approach (maybe string) could be isolated to a specific project.

3. Default project entities

Functionality

If a user does not specify a project, then they are automatically located inside of the default project. This would be similar to how Kubernetes does namespacing.
All other functionality would be the same as the project level entities proposal, except users don't actually have to create an entity inside of a named project.
Feature references could be created that allow users to reference entities without a project. So instead of having my_company/customer, it would be possible to refer to "global" entities by either using customer or default/customer.
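The reference resolution described above could be sketched roughly as follows. The function name `resolve_entity_ref` and the literal project name `default` are assumptions for illustration only.

```python
from typing import Tuple

DEFAULT_PROJECT = "default"  # assumed name of the default project


def resolve_entity_ref(ref: str) -> Tuple[str, str]:
    """Resolve 'customer' or 'project/customer' into (project, entity name)."""
    parts = ref.split("/")
    if len(parts) == 1 and parts[0]:
        # No project given: fall back to the default project.
        return DEFAULT_PROJECT, parts[0]
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    raise ValueError(f"invalid entity reference: {ref!r}")
```

With this, `customer` and `default/customer` resolve to the same entity, while `my_company/customer` stays scoped to its own project.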

Advantages

All of the advantages of project-level entities.
Most of the advantages of global-level entities, except that this default project would still not be a true global namespace. There would still need to be an organizational process that informs users to use the entities in this project.
Simplifies development since project-level sharing and isolation can be reused.

Disadvantages

Still requires access control on the default namespace.


khorshuheng · 2020-01-05T12:22:12Z

Most projects would have to reference entities from another more authoritative project

What would be an example scenario where this approach is the most sensible? For Gojek at least, I would imagine that project-based entities make more sense, with one project per service type (food, ride, gopay), each having entities which might share the same name (customer id, driver id).

woop · 2020-01-05T14:08:24Z

Most projects would have to reference entities from another more authoritative project

What would be an example scenario where this approach is the most sensible? For Gojek at least, I would imagine that project-based entities make more sense, with one project per service type (food, ride, gopay), each having entities which might share the same name (customer id, driver id).

The example you are referring to would be for project-level entities. Meaning an organization could have authoritative projects like:

gojek/customer
gopay/customer

It seems to provide a cleaner isolation, but it is also the case that "users" would have to define their own projects and feature sets from which they would reference these authoritative entities.

So I am only seeing one option here, not two. The disadvantage comes from having to know whether to use either of these two projects.

woop · 2020-01-22T04:59:55Z

Another possible solution would be a hybrid model between global and project level entities. I have added this as (3) in the comment above, titled 3. Default project entities

khorshuheng · 2020-01-22T06:02:33Z

I am in favour of 3. Option 2 (unique global entity name) may lead to complicated entity management in some cases. For example, let's say we have drivers for different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead must have multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code involving drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline would need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

khorshuheng · 2020-01-22T06:14:17Z

Though, if we go for option 3, we might want to explore if the concept of default project should be extended to feature retrieval as well, for consistency. For example, if no project / default project has been set and project is not explicitly specified in feature ref, then the fallback would be the 'default' project.

woop · 2020-01-22T06:45:06Z

I am in favour of 3. Option 2 (unique global entity name) may lead to complicated entity management in some cases. For example, let's say we have drivers for different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead must have multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code involving drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline would need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

It's not clear what you mean here. What prevents you from simply having driver as a global entity?

woop · 2020-01-22T06:45:57Z

Though, if we go for option 3, we might want to explore if the concept of default project should be extended to feature retrieval as well, for consistency. For example, if no project / default project has been set and project is not explicitly specified in feature ref, then the fallback would be the 'default' project.

Absolutely, that was my hope as well!

khorshuheng · 2020-01-22T06:57:52Z

I am in favour of 3. Option 2 (unique global entity name) may lead to complicated entity management in some cases. For example, let's say we have drivers for different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead must have multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code involving drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline would need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

It's not clear what you mean here. What prevents you from simply having driver as a global entity?

Actually, yeah you are correct, I can just have driver in a global project instead of having the entity defined in each regional project. Too entrenched in the code base that I am currently working on and didn't consider this possibility.

stale · 2020-04-14T06:55:54Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

woop · 2020-06-04T01:36:01Z

Moving this out of the 0.6 milestone because I think we can live without it for the time being.

dr3s · 2020-07-01T00:27:33Z

Isn't 3. the same as 1. with just a special project called default? The fact there is a special default project doesn't change the fact that all entities are scoped to a project ie 1. Right?

woop · 2020-07-01T06:00:47Z

Isn't 3. the same as 1. with just a special project called default? The fact there is a special default project doesn't change the fact that all entities are scoped to a project ie 1. Right?

Correct.

KshitizLohia · 2022-03-03T14:31:25Z

I believe entity as a construct is increasing complexity in the system. What I fail to understand is how the notion of an entity helps in grouping semantically related features together (as per the definition of entity in the documentation). It also introduces more problems, since joins happen at a later point in time while the entity is defined at the start of the user experience.

A few questions:

Shouldn't an entity just be a logical container specifying join keys? In which case, how can we specify join keys before the join operation? For instance, a join on entity A and entity B could use one join key, while a join on entity A and entity C could use another.
How can we chain join operations and perform complex joins? For example, ((A left join B) right join C)?
How can we handle shadow mapping using entities? For example, the customer id of entity customer is linked to the user id of entity user.

I'd like to hear others' suggestions on this!

woop assigned zhilingc, woop, thirteen37, davidheryanto and khorshuheng Jan 4, 2020

woop mentioned this issue Jan 19, 2020

Feature Search/List/Browse #435

Closed

woop added the kind/discussion, kind/feature and priority/p1 labels Jan 26, 2020

lgvital mentioned this issue Feb 11, 2020

Duplicate entity row created when multiple feature set apply() calls happen asynchronously #470

Closed

ches added this to the v0.6.0 milestone Feb 14, 2020

stale bot added the wontfix label Apr 14, 2020

woop added the keep-open label Apr 15, 2020

stale bot removed the wontfix label Apr 15, 2020

woop mentioned this issue Apr 18, 2020

Add computation and retrieval of batch feature statistics #612

Closed

ches mentioned this issue Apr 26, 2020

Add feature and feature set labels, for metadata #536

Merged

woop mentioned this issue May 26, 2020

Feast API: Feature references, concept hierarchy, and data model #479

Closed

woop removed this from the v0.6.0 milestone Jun 4, 2020

woop mentioned this issue Jun 29, 2020

Feast 0.7 Release #834

Closed

terryyylim mentioned this issue Sep 23, 2020

Introduce Entity as higher-level concept #1014

Merged

feast-ci-bot closed this as completed in #1014 Sep 28, 2020
