Removal of storing historical value of feature in serving storage #53

Closed
pradithya opened this issue Jan 10, 2019 · 18 comments

@pradithya
Collaborator

pradithya commented Jan 10, 2019

Is your feature request related to a problem? Please describe.
Storing historical values of features in the serving store consumes a lot of storage space. This becomes a real problem for storage-limited backends such as Redis.

Describe the solution you'd like
Remove the functionality of storing features as time-series data and only store the latest value of each feature (which we already have today).

Impact

  1. Ingestion
    Remove the code which writes features as time-series data into Redis and Bigtable.
  2. Serving
    Serving currently has one API which can request either the LAST feature value or a LIST of feature values.
    If the time-series data is removed, a LIST request will always return empty features.
    Currently, a LAST request ignores the time range filter; it has to be changed to respect it so that stale features can be filtered out.
  3. Client
    The client will always have to use a LAST request. (It currently uses LIST when time range filtering is not requested.)

To be discussed further

  • Should we add a new API to the serving service which simplifies the request and response, since we know it will always return LAST? Currently the request and response are complex in order to cater for both LAST and LIST.

Thoughts?
@tims @zhilingc @woop @baskaranz @mansi @budi

@pradithya pradithya added the kind/feature New feature or request label Jan 10, 2019
@woop
Member

woop commented Jan 10, 2019

@pradithya, I think you should also state why we have this functionality in the first place, and also why the impact doesn't affect that requirement. That is more important than the technical implementation details.

In this case there was a need for time-series "intermediate" features to be stored in Feast, and these features are then turned into final features by the client. However, it seems like this is only a requirement in exceptional cases and that most models only require the latest value, which is why the impact of removing it is low.

@tims
Contributor

tims commented Jan 10, 2019

I support this. This would also mean we don't need granularities, as discussed in #17.

I agree that the need for time-series features seems unusual, and thus it probably shouldn't be supported in a way that requires so much extra effort, especially when it has impacted the pleasantness of the serving API. If we remove them from the serving API, it will ensure that served features match a format that is more consumable by models.

Instead of providing time series for model creators who want to do extra feature transformations after Feast, we should provide better tooling upstream to make creating features super easy. I think because we said feature creation is out of scope for Feast, we ended up shoehorning in a workaround, and now we realise we probably don't need it anyway.

Developers that truly need access to the raw time series can still get it from the warehouse; this is a typical approach when coming up with new features. They can then create features from those and ingest them in.

@pradithya
Collaborator Author

I support this. This would also mean we don't need granularities, as discussed in #17.

True, but I think removing granularity will require bigger changes, so let's consider it out of scope for now.

@tims
Contributor

tims commented Jan 11, 2019

True, but I think removing granularity will require bigger changes, so let's consider it out of scope for now.

Agreed, #17 will be dependent on this one.

@woop
Member

woop commented Jan 11, 2019

I support this. This would also mean we don't need granularities, as discussed in #17.

I agree that the need for time-series features seems unusual, and thus it probably shouldn't be supported in a way that requires so much extra effort, especially when it has impacted the pleasantness of the serving API. If we remove them from the serving API, it will ensure that served features match a format that is more consumable by models.

Agreed, it adds a lot of complexity.

Instead of providing time series for model creators who want to do extra feature transformations after Feast, we should provide better tooling upstream to make creating features super easy. I think because we said feature creation is out of scope for Feast, we ended up shoehorning in a workaround, and now we realise we probably don't need it anyway.

The original requirement wasn't a workaround; it was built for a use case which seems to be an exception, where you are essentially doing "feature transformations" from within your model. Your point is still valid though, in that we should reevaluate whether this is worth all the extra complexity. It does not seem to be.

Also, feature creation is meant to be upstream from Feast, but we should also evaluate what we are classifying as feature creation. Dataset construction would happen through queries to both Feast APIs, and those queries are not fully matured yet. It's possible that this area might require more advanced functionality, which could be akin to doing feature transformations. In fact, this is what is currently available in the Uber Feature Store in the form of a DSL. Whether or not we actually need that or should implement it is another matter, and probably not something we should pick up soon.

Developers that truly need access to the raw time series can still get it from the warehouse; this is a typical approach when coming up with new features. They can then create features from those and ingest them in.

Agreed.

@pradithya pradithya self-assigned this Jan 14, 2019
@pradithya
Collaborator Author

For this issue I am planning to simplify the serving API's request and response structure as shown below.

  1. Request
  • Remove requestDetails and replace it with just a list of feature IDs.

Old

message QueryFeatures {
    message Request {
        // e.g. "driver", "customer", "city".
        string entityName = 1;
        // List of entity ID.
        repeated string entityId = 2;
        // List of request details, contains: featureId, type of query, and limit.
        repeated RequestDetail requestDetails = 3;
        // Filter specifying only to retrieve features having timestamp within this range.
        TimestampRange timestampRange = 4;
    }
}

New

message QueryFeaturesRequest {
    // e.g. "driver", "customer", "city".
    string entityName = 1;
    // List of entity ID.
    repeated string entityId = 2;
    // List of feature IDs.
    // A feature ID is in the form [entity_name].[granularity].[feature_name],
    // e.g. "driver.day.total_accepted_booking".
    // All requested feature IDs shall have the same entity name.
    repeated string featureId = 3;
    // [optional] time range filter. If not specified, features with any timestamp will be returned.
    TimestampRange timeRange = 4;
}
  2. Response
  • Replace FeatureValueList with FeatureValue.

Old

message QueryFeatures {
    message Response {
        // e.g. "driver", "customer", "city".
        string entityName = 1;
        // map of entity ID and its entity's properties.
        map<string, Entity> entities = 2;
    }
}
message Entity {
    // map of feature ID and its feature value.
    map<string, FeatureValueList> features = 1;
}

message FeatureValueList {
    // list of feature value
    // if "type" in "requestDetail" is "LAST", then the length will always be 1.
    // if "type" in "requestDetail" is "LIST", then the length is <= "limit".
    feast.types.ValueList valueList = 1;
    // list of timestamp of the value.
    // the i-th timestamps correspond to the i-th value.
    feast.types.TimestampList timestampList = 2;
}

New

message QueryFeaturesResponse {
    // Entity name of the response
    string entityName = 1;
    // map of entity ID and its entity's properties.
    map<string, Entity> entities = 2;
}

message Entity {
    // map of feature ID and its feature value.
    map<string, FeatureValue> features = 1;
}

message FeatureValue {
    // value of feature
    feast.types.Value value = 1;
    // timestamp of the feature
    google.protobuf.Timestamp timestamp = 2;
}
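
For illustration, here is a rough sketch of what a client call against these new messages might look like, assuming standard protobuf-java code generation from the definitions above (the stub name servingStub and its queryFeatures method are placeholders, not actual Feast-generated names):

QueryFeaturesRequest request = QueryFeaturesRequest.newBuilder()
    .setEntityName("driver")
    .addEntityId("1234")
    .addFeatureId("driver.day.total_accepted_booking")
    .build();

// hypothetical serving stub exposing the query method
QueryFeaturesResponse response = servingStub.queryFeatures(request);

// each feature now resolves to a single (latest) value and its timestamp
FeatureValue feature = response.getEntitiesMap()
    .get("1234")
    .getFeaturesMap()
    .get("driver.day.total_accepted_booking");
double value = feature.getValue().getDoubleVal();
com.google.protobuf.Timestamp eventTime = feature.getTimestamp();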

What do you guys think? @tims @zhilingc @woop
The SDK will also need to be changed @mansiib

@pradithya
Collaborator Author

@tims if we only update the latest value per key, is there a way to prevent late data from overwriting newer data?

@tims
Contributor

tims commented Jan 15, 2019

We could. I don't think it's practical to look up what is already in the stores on startup, but we could add a window in ingestion and then select the max timestamp in the window, emitting on every feature row with a newer event timestamp.

Different feature rows can contain different features though so it would add some complexity. What do you do if you get a newer timestamp and a feature is now missing?

This would obviously require resources for ingestion, so we'd want it to be configurable on window size, and possible to disable.

We can do this for streaming and batch, but for batch it's only meaningful for what is ingested during that job.

@pradithya
Collaborator Author

Different feature rows can contain different features though so it would add some complexity. What do you do if you get a newer timestamp and a feature is now missing?

Wouldn't it just be a missing value for that feature?

This would obviously require resources for ingestion, so we'd want it to be configurable on window size, and possible to disable.

What do you mean by resources?

We can do this for streaming and batch, but for batch it's only meaningful for what is ingested during that job.

The problem is that currently, for batch data with different timestamps, the resulting last value is random.

@tims
Contributor

tims commented Jan 15, 2019

Different feature rows can contain different features though so it would add some complexity. What do you do if you get a newer timestamp and a feature is now missing?

Wouldn't it just be a missing value for that feature?

I assume you wouldn't want to delete it? And you'd definitely want to include it, so you need to coalesce the features and detect updates. If you get an older row last, you need to write the parts that were missing in the newer row that you wrote first.

This would obviously require resources for ingestion, so we'd want it to be configurable on window size, and possible to disable.

What do you mean by resources?

CPU and memory

We can do this for streaming and batch, but for batch it's only meaningful for what is ingested during that job.

The problem is that currently, for batch data with different timestamps, the resulting last value is random.

For batch we could choose to combine globally just fine, and that would solve your issue. I'd actually prefer to do this just for batch.

Thinking more, for streaming we'd actually probably want to use a session-based window. But regardless, with a large backlog, you could potentially have multiple windows competing with each other. Basically it's hard.

@tims
Contributor

tims commented Jan 15, 2019

Documenting our discussion in Slack:

For streaming, the way to actually make it consistent is to use a global window for the entity. This utilises the Beam model as it's intended. The tradeoff is using more resources during ingestion, dependent on the size of the entity keyspace.

With a global window and a combiner per key, the same approach will apply to batch; we just need triggering in the global window.

I think this should have a setting to disable it per import spec, so users can just make a best effort when they want to, or when they are confident the entity keys are unique.
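
As a rough sketch of that shape (not the actual Feast ingestion code; FeatureRow, getEventTimestamp() returning epoch millis, and the keyedRows collection are assumed stand-ins), a per-key "keep the newest row" combine inside a triggered global window could look roughly like this with the Beam Java SDK:

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// keyedRows: PCollection<KV<String, FeatureRow>>, keyed by entity key earlier in the pipeline
PCollection<KV<String, FeatureRow>> latestPerKey = keyedRows
    .apply(Window.<KV<String, FeatureRow>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(Combine.perKey((Iterable<FeatureRow> rows) -> {
        // keep only the row with the newest event timestamp so that
        // late-arriving data can never overwrite a newer value
        FeatureRow newest = null;
        for (FeatureRow row : rows) {
            if (newest == null || row.getEventTimestamp() > newest.getEventTimestamp()) {
                newest = row;
            }
        }
        return newest;
    }));

Whether this runs at all (and with what trigger delay) would then be driven by the per-import-spec setting mentioned above.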

@tims
Contributor

tims commented Feb 12, 2019

With the merge of #88, only the latest values are being stored in the serving stores now. Historical values are not retrievable, so we can address this issue now.

The only task left is changing the serving API?

Feedback on the API:

I don't think we need timestamps grouped per value now. Since many use cases may not care about the timestamp, we could make the responses (especially in JSON) prettier by separating them.

So rather than the suggested:

message QueryFeaturesResponse {
    // Entity name of the response
    string entityName = 1;
    // map of entity ID and its entity's properties.
    map<string, Entity> entities = 2;
}

message Entity {
    // map of feature ID and its feature value.
    map<string, FeatureValue> features = 1;
}

message FeatureValue {
    // value of feature
    feast.types.Value value = 1;
    // timestamp of the feature
    google.protobuf.Timestamp timestamp = 2;
}

How about this:

message QueryFeaturesResponse {
    // Entity name of the response
    string entityName = 1;
    // map of entity ID and its entity's properties.
    map<string, Entity> entities = 2;
}

message Entity {
    // map of feature ID to feature value.
    map<string, feast.types.Value> features = 1;
    // map of feature ID to event timestamp for the corresponding values.
    map<string, google.protobuf.Timestamp> timestamps = 2;
}

Or maybe even this?

message QueryFeaturesResponse {
    // Entity name of the response
    string entityName = 1;
    // map of entity ID to map of feature IDs to values
    map<string,  map<string, feast.types.Value>> entities = 2;
    map<string,  map<string, google.protobuf.Timestamp>> timestamps = 3;
}

@pradithya
Collaborator Author

pradithya commented Feb 12, 2019

Yes, the implementation based on the first proposal is already in #87.

Your last suggestion is not possible in protobuf. I like the second one though.
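
(For reference, protobuf map fields cannot use another map as their value type, so the nested form would need a small wrapper message; FeatureValueMap below is just an illustrative name, roughly:)

message FeatureValueMap {
    // map of feature ID to feature value.
    map<string, feast.types.Value> values = 1;
}

message QueryFeaturesResponse {
    string entityName = 1;
    // map of entity ID to its latest feature values.
    map<string, FeatureValueMap> entities = 2;
}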

@pradithya
Collaborator Author

Tried to render sample JSON

Original proposal:

{
    "entityName": "entity",
    "entities": {
        "1234": {
            "features": {
                "entity.granularity.feature1": {
                    "value": {
                        "doubleVal": 1.5795105695724487
                    },
                    "timestamp": "2019-02-13T07:44:00Z"
                },
                "entity.granularity.feature2": {
                    "value": {
                        "doubleVal": 2.122074604034424
                    },
                    "timestamp": "2019-02-13T07:44:00Z"
                }
            }
        },
        "5678": {
            "features": {
                "entity.granularity.feature1": {
                    "value": {
                        "doubleVal": 1.5795105695724487
                    },
                    "timestamp": "2019-02-13T07:44:00Z"
                },
                "entity.granularity.feature2": {
                    "value": {
                        "doubleVal": 2.122074604034424
                    },
                    "timestamp": "2019-02-13T07:44:00Z"
                }
            }
        }
    }
}

Second proposal:


{
  "entityName": "entity",
  "entities": {
    "1234": {
      "features": {
        "entity.granularity.feature1": {
          "int32Val": 12
        },
        "entity.granularity.feature2": {
          "int32Val": 12
        }
      },
      "timestamps": {
        "entity.granularity.feature1": "1970-01-02T10:17:36Z",
        "entity.granularity.feature2": "1970-01-02T10:17:36Z"
      }
    },
    "5678": {
      "features": {
        "entity.granularity.feature1": {
          "int32Val": 12
        },
        "entity.granularity.feature2": {
          "int32Val": 12
        }
      },
      "timestamps": {
        "entity.granularity.feature1": "1970-01-02T10:17:36Z",
        "entity.granularity.feature2": "1970-01-02T10:17:36Z"
      }
    }
  }
}

Which one do you think is preferable? @tims @woop @zhilingc

@zhilingc
Collaborator

I prefer the former, but IMO it shouldn't matter that much since the SDK does the parsing of these responses, right?

@pradithya
Collaborator Author

Yes, it doesn't matter that much since the SDK will wrap it.

@tims
Contributor

tims commented Feb 13, 2019

I'm happy to defer to you and go with the former.

@pradithya
Collaborator Author

Closed via #87.

dmartinol added a commit to dmartinol/feast that referenced this issue Jul 26, 2024
Add feast CLI command to identify resources not covered by permissions