Removal of storing historical value of feature in serving storage #53

Closed
pradithya opened this issue Jan 10, 2019 · 18 comments

@pradithya
Collaborator

pradithya commented Jan 10, 2019

Is your feature request related to a problem? Please describe.
Storing historical values of features in the serving store consumes a lot of storage space. This becomes a real problem for storage-limited backends such as Redis.

Describe the solution you'd like
Remove the functionality of storing features as time-series data and only store the latest value of each feature (which we already have today).

Impact

  1. Ingestion
    Remove the code which writes features as time-series data into Redis and Bigtable.
  2. Serving
    Serving currently has one API which can request either the LAST feature value or a LIST of feature values.
    If the time-series data is removed, a LIST request will always return empty features.
    Currently, a LAST request ignores the time range filter; it has to be changed to respect it so that stale features can be filtered out.
  3. Client
    The client will always have to use a LAST request. (It currently uses LIST when time range filtering is not requested.)

To be discussed further

  • Should we add a new API to the serving service which simplifies the request and response, since we know it will always return LAST? Currently the request and response are complex in order to cater for both LAST and LIST.

Thoughts?
@tims @zhilingc @woop @baskaranz @mansi @budi

@pradithya pradithya added the kind/feature New feature or request label Jan 10, 2019
@woop
Member

woop commented Jan 10, 2019

@pradithya, I think you should also state why we have this functionality in the first place, and also why the impact doesn't affect that requirement. That is more important than the technical implementation details.

In this case there was a need for time-series "intermediate" features to be stored in Feast, and these features are then turned into final features by the client. However, it seems like this is only a requirement in exceptional cases and that most models only require the latest value, which is why the impact of removing it is low.

@tims
Contributor

tims commented Jan 10, 2019

I support this. This would also mean we don't need granularities, as discussed in #17.

I agree that the need for time-series features seems unusual, and thus it probably shouldn't be supported in a way that requires so much extra effort, especially when it has impacted the pleasantness of the serving API. If we remove them from the serving API, it will ensure that served features match a format that is more consumable by models.

Instead of providing time series for model creators who want to do extra feature transformations after Feast, we should provide better tooling upstream to make creating features super easy. I think because we said feature creation is out of scope for Feast, we ended up shoehorning in a workaround, and now we realise we probably don't need it anyway.

Developers that truly need access to the raw time series can still get it from the warehouse; this is a typical approach when coming up with new features. They can then create features from those and ingest them in.

@pradithya
Collaborator Author

I support this. This would also mean we don't need granularities, as discussed in #17.

True, but I think removing granularity will require bigger changes, so let's consider it out of scope for now.

@tims
Contributor

tims commented Jan 11, 2019

True, but I think removing granularity will require bigger changes, so let's consider it out of scope for now.

Agreed, #17 will be dependent on this one.

@woop
Member

woop commented Jan 11, 2019

I support this. This would also mean we don't need granularities, as discussed in #17.

I agree that the need for time-series features seems unusual, and thus it probably shouldn't be supported in a way that requires so much extra effort, especially when it has impacted the pleasantness of the serving API. If we remove them from the serving API, it will ensure that served features match a format that is more consumable by models.

Agreed, it adds a lot of complexity.

Instead of providing time series for model creators who want to do extra feature transformations after Feast, we should provide better tooling upstream to make creating features super easy. I think because we said feature creation is out of scope for Feast, we ended up shoehorning in a workaround, and now we realise we probably don't need it anyway.

The original requirement wasn't a workaround; it was built for a use case which seems to be an exception, where you are essentially doing "feature transformations" from within your model. Your point is still valid though, in that we should reevaluate whether this is worth all the extra complexity. It does not seem to be.

Also, feature creation is meant to be upstream from Feast, but we should also evaluate what we are classifying as feature creation. Dataset construction would happen through queries to both Feast APIs, and those queries are not fully matured yet. It's possible that this area might require more advanced functionality, which could be akin to doing feature transformations. In fact, this is what is currently available in the Uber Feature Store in the form of a DSL. Whether or not we actually need that or should implement it is another matter, and probably not something we should pick up soon.

Developers that truly need access to the raw time series can still get it from the warehouse; this is a typical approach when coming up with new features. They can then create features from those and ingest them in.

Agreed.

@pradithya pradithya self-assigned this Jan 14, 2019
@pradithya
Collaborator Author

For this issue I am planning to simplify the serving API's request and response structure as shown below.

  1. Request
  • Remove requestDetails and replace it with just a list of feature IDs.

Old

message QueryFeatures {
    message Request {
        // e.g. "driver", "customer", "city".
        string entityName = 1;
        // List of entity ID.
        repeated string entityId = 2;
        // List of request details, contains: featureId, type of query, and limit.
        repeated RequestDetail requestDetails = 3;
        // Filter specifying only to retrieve features having timestamp within this range.
        TimestampRange timestampRange = 4;
    }
}

New

message QueryFeaturesRequest {
    // e.g. "driver", "customer", "city".
    string entityName = 1;
    // List of entity ID.
    repeated string entityId = 2;
    // List of feature IDs.
    // A feature ID is in the form [entity_name].[granularity].[feature_name],
    // e.g. "driver.day.total_accepted_booking".
    // All requested feature IDs shall have the same entity name.
    repeated string featureId = 3;
    // [optional] time range filter. If not specified, features with any timestamp will be returned.
    TimestampRange timeRange = 4;
}
  2. Response
  • Replace FeatureValueList with FeatureValue.

Old

message QueryFeatures {
    message Response {
        // e.g. "driver", "customer", "city".
        string entityName = 1;
        // map of entity ID and its entity's properties.
        map<string, Entity> entities = 2;
    }
}
message Entity {
    // map of feature ID and its feature value.
    map<string, FeatureValueList> features = 1;
}

message FeatureValueList {
    // list of feature value
    // if "type" in "requestDetail" is "LAST", then the length will always be 1.
    // if "type" in "requestDetail" is "LIST", then the length is <= "limit".
    feast.types.ValueList valueList = 1;
    // list of timestamp of the value.
    // the i-th timestamps correspond to the i-th value.
    feast.types.TimestampList timestampList = 2;
}

New

message QueryFeaturesResponse {
    // Entity name of the response
    string entityName = 1;
    // map of entity ID and its entity's properties.
    map<string, Entity> entities = 2;
}

message Entity {
    // map of feature ID and its feature value.
    map<string, FeatureValue> features = 1;
}

message FeatureValue {
    // value of feature
    feast.types.Value value = 1;
    // timestamp of the feature
    google.protobuf.Timestamp timestamp = 2;
}
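
For illustration, here is a rough sketch of what a client call against these new messages might look like, assuming standard protobuf-java code generation from the definitions above (the stub name servingStub and its queryFeatures method are placeholders, not actual Feast-generated names):

QueryFeaturesRequest request = QueryFeaturesRequest.newBuilder()
    .setEntityName("driver")
    .addEntityId("1234")
    .addFeatureId("driver.day.total_accepted_booking")
    .build();

// hypothetical serving stub exposing the query method
QueryFeaturesResponse response = servingStub.queryFeatures(request);

// each feature now resolves to a single (latest) value and its timestamp
FeatureValue feature = response.getEntitiesMap()
    .get("1234")
    .getFeaturesMap()
    .get("driver.day.total_accepted_booking");
double value = feature.getValue().getDoubleVal();
com.google.protobuf.Timestamp eventTime = feature.getTimestamp();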

What do you guys think? @tims @zhilingc @woop
The SDK will also need to be changed @mansiib

@pradithya
Collaborator Author

@tims if we only update the latest value per key, is there a way to prevent late data from overwriting newer data?

@tims
Contributor

tims commented Jan 15, 2019

We could. I don't think it's practical to look up what is already in the stores on startup, but we could add a window in ingestion and then select the max timestamp in the window, emitting on every feature row with a newer event timestamp.

Different feature rows can contain different features though so it would add some complexity. What do you do if you get a newer timestamp and a feature is now missing?

This would obviously require resources for ingestion, so we'd want it to be configurable on window size, and possible to disable.

We can do this for streaming and batch, but for batch it's only meaningful for what is ingested during that job.

@pradithya
Collaborator Author

Different feature rows can contain different features though so it would add some complexity. What do you do if you get a newer timestamp and a feature is now missing?

Wouldn't it just be a missing value for that feature?

This would obviously require resources for ingestion, so we'd want it to be configurable on window size, and possible to disable.

What do you mean by resources?

We can do this for streaming and batch, but for batch it's only meaningful for what is ingested during that job.

The problem is that currently, for batch data with different timestamps, the resulting last value is random.

@tims
Contributor

tims commented Jan 15, 2019

Different feature rows can contain different features though so it would add some complexity. What do you do if you get a newer timestamp and a feature is now missing?

Wouldn't it just be a missing value for that feature?

I assume you wouldn't want to delete it? And you'd definitely want to include it, so you need to coalesce the features and detect updates. If you get an older row last, you need to write the parts that were missing in the newer row that you wrote first.

This would obviously require resources for ingestion, so we'd want it to be configurable on window size, and possible to disable.

What do you mean by resources?

CPU and memory

We can do this for streaming and batch, but for batch it's only meaningful for what is ingested during that job.

The problem is that currently, for batch data with different timestamps, the resulting last value is random.

For batch we could choose to combine globally just fine, and that would solve your issue. I'd actually prefer to do this just for batch.

Thinking more, for streaming we'd actually probably want to use a session-based window. But regardless, with a large backlog, you could potentially have multiple windows competing with each other. Basically it's hard.

@tims
Contributor

tims commented Jan 15, 2019

Documenting our discussion in Slack:

For streaming, the way to actually make it consistent is to use a global window for the entity. This utilises the Beam model as it's intended. The tradeoff is using more resources during ingestion, dependent on the size of the entity keyspace.

With a global window and a combiner per key, the same approach will apply to batch; we just need triggering in the global window.

I think this should have a setting to disable it per import spec, so users can just make a best effort when they want to, or when they are confident the entity keys are unique.
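
As a rough sketch of that shape (not the actual Feast ingestion code; FeatureRow, getEventTimestamp() returning epoch millis, and the keyedRows collection are assumed stand-ins), a per-key "keep the newest row" combine inside a triggered global window could look roughly like this with the Beam Java SDK:

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// keyedRows: PCollection<KV<String, FeatureRow>>, keyed by entity key earlier in the pipeline
PCollection<KV<String, FeatureRow>> latestPerKey = keyedRows
    .apply(Window.<KV<String, FeatureRow>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(Combine.perKey((Iterable<FeatureRow> rows) -> {
        // keep only the row with the newest event timestamp so that
        // late-arriving data can never overwrite a newer value
        FeatureRow newest = null;
        for (FeatureRow row : rows) {
            if (newest == null || row.getEventTimestamp() > newest.getEventTimestamp()) {
                newest = row;
            }
        }
        return newest;
    }));

Whether this runs at all (and with what trigger delay) would then be driven by the per-import-spec setting mentioned above.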

@tims
Contributor

tims commented Feb 12, 2019

With the merge of #88, only the latest values are being stored in the serving stores now. Historical values are not retrievable, so we can address this issue now.

The only task left is changing the serving API?

Feedback on the API:

I don't think we need timestamps grouped per value now. Since many use cases may not care about the timestamp, we could make the responses (especially in JSON) prettier by separating them.

So rather than the suggested:

message QueryFeaturesResponse {
    // Entity name of the response
    string entityName = 1;
    // map of entity ID and its entity's properties.
    map<string, Entity> entities = 2;
}

message Entity {
    // map of feature ID and its feature value.
    map<string, FeatureValue> features = 1;
}

message FeatureValue {
    // value of feature
    feast.types.Value value = 1;
    // timestamp of the feature
    google.protobuf.Timestamp timestamp = 2;
}

How about this:

message QueryFeaturesResponse {
    // Entity name of the response
    string entityName = 1;
    // map of entity ID and its entity's properties.
    map<string, Entity> entities = 2;
}

message Entity {
    // map of feature ID to feature value.
    map<string, feast.types.Value> features = 1;
    // map of feature ID to event timestamp for the corresponding values.
    map<string, google.protobuf.Timestamp> timestamps = 2;
}

Or maybe even this?

message QueryFeaturesResponse {
    // Entity name of the response
    string entityName = 1;
    // map of entity ID to map of feature IDs to values
    map<string,  map<string, feast.types.Value>> entities = 2;
    map<string,  map<string, google.protobuf.Timestamp>> timestamps = 3;
}

@pradithya
Collaborator Author

pradithya commented Feb 12, 2019

Yes, the implementation based on the first proposal is already in #87.

Your last suggestion is not possible in protobuf. I like the second one though.
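
(For reference, protobuf map fields cannot use another map as their value type, so the nested form would need a small wrapper message; FeatureValueMap below is just an illustrative name, roughly:)

message FeatureValueMap {
    // map of feature ID to feature value.
    map<string, feast.types.Value> values = 1;
}

message QueryFeaturesResponse {
    string entityName = 1;
    // map of entity ID to its latest feature values.
    map<string, FeatureValueMap> entities = 2;
}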

@pradithya
Collaborator Author

Tried to render sample JSON

Original proposal:

{
    "entityName": "entity",
    "entities": {
        "1234": {
            "features": {
                "entity.granularity.feature1": {
                    "value": {
                        "doubleVal": 1.5795105695724487
                    },
                    "timestamp": "2019-02-13T07:44:00Z"
                },
                "entity.granularity.feature2": {
                    "value": {
                        "doubleVal": 2.122074604034424
                    },
                    "timestamp": "2019-02-13T07:44:00Z"
                }
            }
        },
        "5678": {
            "features": {
                "entity.granularity.feature1": {
                    "value": {
                        "doubleVal": 1.5795105695724487
                    },
                    "timestamp": "2019-02-13T07:44:00Z"
                },
                "entity.granularity.feature2": {
                    "value": {
                        "doubleVal": 2.122074604034424
                    },
                    "timestamp": "2019-02-13T07:44:00Z"
                }
            }
        }
    }
}

Second proposal:


{
  "entityName": "entity",
  "entities": {
    "1234": {
      "features": {
        "entity.granularity.feature1": {
          "int32Val": 12
        },
        "entity.granularity.feature2": {
          "int32Val": 12
        }
      },
      "timestamps": {
        "entity.granularity.feature1": "1970-01-02T10:17:36Z",
        "entity.granularity.feature2": "1970-01-02T10:17:36Z"
      }
    },
    "5678": {
      "features": {
        "entity.granularity.feature1": {
          "int32Val": 12
        },
        "entity.granularity.feature2": {
          "int32Val": 12
        }
      },
      "timestamps": {
        "entity.granularity.feature1": "1970-01-02T10:17:36Z",
        "entity.granularity.feature2": "1970-01-02T10:17:36Z"
      }
    }
  }
}

Which one do you think is preferable? @tims @woop @zhilingc

@zhilingc
Collaborator

I prefer the former, but IMO it shouldn't matter that much since the SDK does the parsing of these responses, right?

@pradithya
Collaborator Author

Yes, it doesn't matter that much since the SDK will wrap it.

@tims
Contributor

tims commented Feb 13, 2019

I'm happy to defer to you and go with the former.

@pradithya
Collaborator Author

Closed via #87.

dmartinol added a commit to dmartinol/feast that referenced this issue Jul 26, 2024
Add feast CLI command to identify resources not covered by permissions