Optimize Redis memory footprint for ingestion / serving #515

Closed
khorshuheng opened this issue Mar 6, 2020 · 15 comments · Fixed by #530

Comments

@khorshuheng
Collaborator

Is your feature request related to a problem? Please describe.
As of now, the Redis value size is directly proportional to the length of the feature name strings. This becomes more significant when there are many features. As a result, the memory footprint can be significantly higher than that of the equivalent data stored in CSV / Parquet / Avro.

Describe the solution you'd like
Instead of writing the byte representation of the Feature Row, which has a list of Fields, it might suffice to simply store the list of Values instead. Feast Online Serving is already aware of the FeatureSetSpec, and hence the field names. This information can be used to reconstruct the Feature Row from the list of Values. Without storing the feature names, the Redis value size can be drastically reduced.
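
A minimal sketch of the idea (plain Python and JSON instead of Feast's actual protobuf encoding; the feature names and values are made up) showing why dropping the names shrinks the stored value:

```python
# Illustration only: not Feast's wire format. Feature names/values are made up.
import json

feature_names = ["driver_completed_trips_7d", "driver_cancelled_trips_7d",
                 "driver_average_rating_30d"]
feature_values = [148, 5, 4.87]

# Current encoding: every row repeats the (long) feature name strings.
with_names = json.dumps(dict(zip(feature_names, feature_values)))

# Proposed encoding: values only, in the order defined by the FeatureSetSpec,
# which Feast Online Serving already knows and can use to reattach the names.
values_only = json.dumps(feature_values)

print(len(with_names), len(values_only))  # the names dominate the per-row size
```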

Describe alternatives you've considered
Another alternative would be to store the hash code of each feature name instead of the feature name itself. Again, the corresponding Feature Row can be reconstructed on the Feast Online Serving side.

The last alternative would be to apply a compression algorithm (e.g. Gzip / Bzip2) to the byte array representation of the Feature Row, though it is unlikely to surpass the above alternatives in terms of efficiency. There will also be an impact on latency due to the additional processing required during retrieval.
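
A rough sketch of this alternative, treating the serialized Feature Row as an opaque byte string (the payload below is fabricated, but has the repetitive feature-name prefixes that make compression attractive):

```python
# Sketch only: a fabricated payload standing in for a serialized Feature Row.
import bz2

serialized_row = b";".join(b"driver_feature_%03d=0" % i for i in range(100))

compressed = bz2.compress(serialized_row)
print(len(serialized_row), len(compressed))  # repetitive names compress well

# ...but every retrieval now pays the decompression cost:
assert bz2.decompress(compressed) == serialized_row
```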

Additional context
Any approach to resolving this issue will likely result in backward-incompatible changes: the existing Redis values will no longer be interpretable once the way we store them changes. Therefore, we might have to include a Store-level configuration to toggle this feature on only when the user is ready to migrate.

@zhilingc
Collaborator

zhilingc commented Mar 6, 2020

I agree with this, but I think there are a few additional side effects (which may not be a bad thing, tbh):

  • Lower tolerance for malformed feature rows. Currently, we are able to ingest feature rows containing too many or too few values, because ingestion rebuilds the feature row and feature row key based on the feature names in the field object. If we strictly follow the spec for the schema of the row, it means that such rows should be rejected.
  • Changing field order in the spec breaks backward compatibility. Validation should be added to prevent this (or we can strongly advise against it in the docs).

@idahoakl

idahoakl commented Mar 6, 2020

I like the idea of reducing the size of data stored per feature row. I've been advising team members to avoid storing features as a single massive FeatureSet due to the performance implications of retrieving large data values from Redis.

Has there been any investigation into using the hash data structure in Redis and storing each feature as a hash entry? The key for the value in Redis would still represent the FeatureRow. It would be an implementation specific to Redis, but it would also allow for a more efficient implementation in terms of storage space and speed.

@khorshuheng
Collaborator Author

@idahoakl We will investigate that approach and see whether it can effectively reduce the memory footprint.

@woop
Member

woop commented Mar 6, 2020

From a user-facing API perspective, I would want us to be consistent about whether field order matters or only field names matter. It is true that we are storing these values as lists, but the order doesn't matter.

Lower tolerance for malformed feature rows. Currently, we are able to ingest feature rows containing too many or too few values, because ingestion rebuilds the feature row and feature row key based on the feature names in the field object. If we strictly follow the spec for the schema of the row, it means that such rows should be rejected.

Is it possible for us to maintain a tolerance for extra/missing fields in source data? I don't see why we need to impose a new restriction there based on how Redis stores its values, unless that is functionality that we actually want.

Changing field order in the spec breaks backward compatibility. Validation should be added to prevent this (or we can strongly advise against it in the docs).

Would the user now have to maintain both consistent naming as well as order in the spec? How would this play with auto-migration? Isn't it possible for us to abstract this from users as long as they either provide consistent naming or ordering?

Also some comments on compression

though [compression] is unlikely to surpass the above alternatives in terms of efficiency.

This depends on the data being stored. If a row contains a bunch of zeros then it can probably be compressed to 10% of its original size. A similar effect will happen if the user is storing a lot of strings/text.

There will also be an impact on latency due to the additional processing required during retrieval.

True, but with the proper algorithm this would probably be very small.

Instead of going on gut instinct, can we run a test to compare our options here (no field names, compression, no field names with compression)? We can use an internal dataset as a benchmark if we want something representative.

Has there been any investigation into using the hash data structure in Redis and storing each feature as a hash entry? The key for the value in Redis would still represent the FeatureRow. It would be an implementation specific to Redis, but it would also allow for a more efficient implementation in terms of storage space and speed.

@idahoakl It's not clear what you mean by "The key for the value in Redis would still represent the FeatureRow", but if you are suggesting storing each feature separately then I don't think that is an efficient structure. Redis storage in Feast 0.1 used to be on a per-feature basis. The problem with this was that we had to do 10-50x more writes and reads, depending on the number of features being stored. We also don't leave room for optimizations like compression.

Never mind, I did a bit of research on the data type. It's distinct from the 0.1 approach. Worth exploring!

@idahoakl

idahoakl commented Mar 6, 2020

I should have been clearer in my explanation. The entities for a FeatureRow could still be used to identify the hash within Redis. This value could be the "key" in the hash commands (HMGET, for instance: https://redis.io/commands/hmget). The "field" entries in each hash could be mapped to individual features within the FeatureRow. A while back I found some articles that looked at the memory efficiency of storing keys within a hash. I'll see if I can dig up the results.
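
A small sketch of what this could look like with redis-py (the key layout and feature names are hypothetical, not Feast's actual encoding):

```python
# Hypothetical key/field layout; assumes a local Redis and the redis-py client.
import redis

r = redis.Redis(host="localhost", port=6379)

# The entities of the FeatureRow identify the hash (the Redis "key")...
row_key = "driver_features:v1:driver_id=1001"

# ...and each feature becomes one field in that hash.
r.hset(row_key, mapping={
    "completed_trips_7d": 148,
    "cancelled_trips_7d": 5,
    "average_rating_30d": "4.87",
})

# Retrieval can then fetch just the requested features with a single HMGET.
print(r.hmget(row_key, ["completed_trips_7d", "average_rating_30d"]))
```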

Regarding compression, I believe there is an operational benefit to storing the data in a human-readable format in the serving data store. Being able to query the data store and determine what data is stored there can be immensely helpful for debugging a live system.

@khorshuheng
Collaborator Author

khorshuheng commented Mar 9, 2020

We have tested three different approaches and their impact on processing and storage. The sample dataset we chose contains 600k rows and occupies 160 MB when stored as a CSV file. Feast 0.4 currently requires ~1 GB of memory in Redis to store this data. On a local development machine with the direct runner, processing the 600k rows takes 20 minutes.

Findings so far:

  1. Storing the features as a hash, where the fields are the individual features within the Feature Row, does not help reduce the memory footprint. It still requires close to 1 GB.

  2. Compressing the Feature Row using BZip2 prior to storing it in Redis slashes the memory requirement to 436 MB. This is because the sample dataset happens to have many features which are named similarly (same suffix / prefix). However, the processing time has also increased significantly, to 40 minutes.

  3. Storing only the field values reduces the memory requirement to 240 MB, while the processing time stays similar.

@khorshuheng
Collaborator Author

Elaboration on the 3rd approach:
The Redis value will contain both the feature set version and the feature values, sorted according to the feature order in the spec. If the Feature Row from Kafka is missing some features specified in the spec, the value of that feature will be stored as empty.

As both the ingestion job and Feast Serving have information on the specs of all the feature sets, the correct Feature Row can be reconstructed during retrieval.
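
A simplified sketch of that encode/decode round trip (plain Python lists and dicts instead of the actual protobuf types; the spec below is made up):

```python
# Hypothetical spec; in Feast this information comes from the FeatureSetSpec.
SPEC = {
    "version": 3,
    "features": ["completed_trips_7d", "cancelled_trips_7d", "average_rating_30d"],
}

def encode(row: dict) -> list:
    # Store the feature set version plus values in spec order; features missing
    # from the incoming Feature Row are stored as empty (None).
    return [SPEC["version"]] + [row.get(name) for name in SPEC["features"]]

def decode(stored: list) -> dict:
    version, values = stored[0], stored[1:]
    assert version == SPEC["version"]  # reader and writer must agree on the spec
    return dict(zip(SPEC["features"], values))

stored = encode({"completed_trips_7d": 148, "average_rating_30d": 4.87})
print(decode(stored))  # cancelled_trips_7d comes back as None
```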

@woop
Member

woop commented Mar 9, 2020

The Redis value will contain both the feature set version and the feature values, sorted according to the feature order in the spec.

Does this mean users need to manage both the order and uniqueness of names? They should ideally only care about one of the two.

@woop
Member

woop commented Mar 9, 2020

sorted according to the feature order in the spec.

Is there any reason why we can't use our own encoding instead of having it come from the user? Let's say storing the values alphabetically by name?

@khorshuheng
Collaborator Author

sorted according to the feature order in the spec.

Is there any reason why we can't use our own encoding instead of having it come from the user? Let's say storing the values alphabetically by name?

Yes, that could be done.

Does this mean users need to manage both the order and uniqueness of names?

They shouldn't need to maintain the order, but uniqueness of names should be required, even for the current implementation. Otherwise, there could be scenarios where two features have the same name but different values, which would be ambiguous.

@woop
Member

woop commented Mar 9, 2020

FYI, there is already a card tracking this issue (#310), but since we are already discussing it here I will close that one.

@woop
Member

woop commented Mar 9, 2020

This is an internal blocker for us, so I am adding it to our roadmap for 0.5. I think we can just use this specific issue to flesh out the implementation. If you do want to create a new issue, then please move the 0.5 milestone to that issue.

@woop woop added this to the v0.5.0 milestone Mar 9, 2020
@woop woop mentioned this issue Mar 9, 2020
@idahoakl

idahoakl commented Mar 9, 2020

The 3rd approach sounds like a good one: less storage with the same ingest performance.

If field values are only going to be associated with field names at runtime by external configuration, has any thought been given to a method for ensuring that the same configuration used to write the data is the configuration used to read it? Something such as a checksum/fingerprint of the FeatureSet configuration, stored alongside the data in Redis (or in the key), would help prevent a mismatch of configuration and data due to a bug somewhere else in the system.
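
A minimal sketch of that fingerprint check (hashing a canonical JSON form of a hypothetical spec; Feast's actual spec objects and hashing may differ):

```python
# Sketch: detect writer/reader spec mismatch via a fingerprint stored with the data.
import hashlib
import json

def spec_fingerprint(spec: dict) -> str:
    canonical = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:8]

writer_spec = {"name": "driver_features", "version": 3,
               "features": ["completed_trips_7d", "average_rating_30d"]}

stored = {"spec_hash": spec_fingerprint(writer_spec), "values": [148, 4.87]}

# On read, compare against the spec the reader would use for decoding.
reader_spec = writer_spec  # in a healthy system these match
if stored["spec_hash"] != spec_fingerprint(reader_spec):
    raise ValueError("FeatureSet spec mismatch between writer and reader")
```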

@woop
Member

woop commented Mar 9, 2020

If field values are only going to be associated with field names at runtime by external configuration, has any thought been given to a method for ensuring that the same configuration used to write the data is the configuration used to read it? Something such as a checksum/fingerprint of the FeatureSet configuration, stored alongside the data in Redis (or in the key), would help prevent a mismatch of configuration and data due to a bug somewhere else in the system.

This is a good idea. We haven't discussed that yet, but it should be pretty straightforward to achieve. We are already generating hashes for object comparison, so we could standardize that and store it as part of the row. It should be easy to add later, though, and I think it might be out of scope for this specific issue.

@khorshuheng
Collaborator Author

#530 is now in release 0.4.7.
To upgrade an existing installation:
Since modifications were made to the ingestion step, the existing Dataflow jobs corresponding to the Redis stores will need to be stopped (they will then be rescheduled automatically by Feast Core) for the changes to take effect. Otherwise, the Dataflow jobs will be updated automatically when the feature set version is updated.

Before the jobs are updated, do make sure that Feast Serving is already updated to 0.4.7. Otherwise, Feast Serving 0.4.6 (and earlier) will not be able to retrieve the features ingested by the updated jobs. Feast Serving 0.4.7 is backward compatible with Feast Core 0.4.6, so it can be upgraded in place without issue.
