Optimize Redis memory footprint for ingestion / serving #515
I agree with this, but I think there are a few additional side effects (which may not be a bad thing tbh).
I like the idea of reducing the size of data stored per feature row. I've been telling team members to avoid storing features as a single massive FeatureSet due to the performance implications of retrieving large data values from Redis. Has there been any investigation into using the hash data structure in Redis and storing each feature as a hash entry? The key for the value in Redis would still represent the FeatureRow. It would be an implementation specific to Redis, but it would also allow for a more efficient implementation in terms of storage space and speed.
@idahoakl We will investigate that approach and see whether it can effectively reduce the memory footprint.
From a user-facing API perspective, I would want to be consistent when it comes to
Is it possible for us to maintain a tolerance for extra/missing fields in source data? I don't see why we need to impose a new restriction there based on how Redis stores its values, unless that is functionality that we actually want.
Would the user now have to maintain both consistent naming and order in the spec? How would this play with
Also, some comments on compression:
This depends on the data being stored. If a row contains a bunch of zeros then it can probably be compressed to 10% of its original size. A similar effect will happen if the user is storing a lot of strings/text.
True, but with the proper algorithm this overhead would probably be very small. Instead of going on gut instinct, can we run a test to compare our options here (no field names, compression, no field names with compression)? We can use an internal dataset as a benchmark if we want something representative.
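A minimal sketch of what such a comparison could look like, using JSON and gzip as stand-ins for Feast's actual protobuf encoding (the feature names and values below are made up purely for illustration):

```python
import gzip
import json

# Illustrative stand-in for a FeatureRow: field names mapped to values.
feature_row = {
    "customer_total_purchases_30d": 42,
    "customer_avg_basket_size": 3.7,
    "customer_days_since_last_order": 12,
}

# Option 1: full row with field names (roughly what is stored today).
with_names = json.dumps(feature_row).encode()

# Option 2: values only, in a fixed order known from the FeatureSetSpec.
values_only = json.dumps(list(feature_row.values())).encode()

# Options 3 and 4: gzip applied to either encoding.
for label, payload in [
    ("with names", with_names),
    ("values only", values_only),
    ("with names + gzip", gzip.compress(with_names)),
    ("values only + gzip", gzip.compress(values_only)),
]:
    print(f"{label}: {len(payload)} bytes")
```

Running the same measurement over a representative internal dataset (and the real protobuf encoding) would give the numbers needed to decide.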
@idahoakl Never mind, I did a bit of research on the data type. It's distinct from 0.1. Worth exploring!
I should have been clearer in my explanation. The entities for a FeatureRow could still be used to identify the hash within Redis. This value could be the "key" in the hash commands (HMGET, for instance: https://redis.io/commands/hmget). The "field" entries in each hash could be mapped to individual features within the FeatureRow. A while back I found some articles that looked at the memory efficiency of storing keys within a hash; I'll see if I can dig up the results. Regarding compression, I believe there is an operational benefit to storing the data in a human-readable format in the serving data store. Being able to query the data store and determine what data is stored there can be immensely helpful for debugging a live system.
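A rough sketch of that layout using redis-py; the key naming scheme and feature names are hypothetical, and a locally running Redis instance is assumed:

```python
import redis

r = redis.Redis()

# Hypothetical key: the entity values identify the hash, one hash per FeatureRow.
key = "feature_set:customer_features:customer_id=1001"

# Each feature becomes a field in the hash instead of part of one serialized blob.
r.hset(key, mapping={
    "total_purchases_30d": 42,
    "avg_basket_size": 3.7,
})

# Retrieval can then fetch only the requested features (HMGET).
values = r.hmget(key, ["total_purchases_30d", "avg_basket_size"])
print(values)
```

One side effect worth noting is that per-feature fields also make ad-hoc inspection with redis-cli (HGETALL / HMGET) possible, which supports the debugging argument above.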
We have tested three different approaches and their impacts on processing and storage. The sample dataset we chose contains 600k rows and occupies 160MB when stored as a CSV file. Feast 0.4 currently requires ~1GB of memory in Redis to store this data. On a local development machine with the direct runner, processing 600k rows takes 20 minutes. Findings so far:
Elaboration on the 3rd approach: as both the ingestion job and Feast Serving have information on the specs of all the feature sets, the correct FeatureRow can be reconstructed during retrieval.
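A small sketch of that reconstruction, assuming the field order comes from a FeatureSetSpec shared by the ingestion job and Feast Serving (the names and plain Python types here are illustrative, not Feast's actual protobuf types):

```python
# Field names come from the FeatureSetSpec, which both the ingestion job and
# Feast Serving already have; only the values are stored in Redis.
spec_field_names = ["total_purchases_30d", "avg_basket_size", "days_since_last_order"]

def encode_row(row: dict) -> list:
    # Store values in the order defined by the spec, dropping the names.
    return [row[name] for name in spec_field_names]

def decode_row(values: list) -> dict:
    # Reconstruct the full FeatureRow by zipping the spec's names back on.
    return dict(zip(spec_field_names, values))

stored = encode_row({"total_purchases_30d": 42,
                     "avg_basket_size": 3.7,
                     "days_since_last_order": 12})
assert decode_row(stored)["avg_basket_size"] == 3.7
```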
Does this mean users need to manage both the order and uniqueness of names? They should ideally only care about one of the two.
Is there any reason why we can't use our own encoding instead of having it come from the user? Let's say storing the values alphabetically by name?
Yes, that could be done.
They shouldn't need to maintain the order, but uniqueness of names should be required, even for the current implementation. Otherwise, there could be scenarios where two features have the same name but different values, and it would be ambiguous.
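For example, a canonical ordering could be derived internally so that users only need to guarantee uniqueness; a sketch under that assumption, not Feast's actual behaviour:

```python
def canonical_order(field_names: list) -> list:
    # Duplicate names would make the encoding ambiguous, so reject them up front.
    if len(set(field_names)) != len(field_names):
        raise ValueError("feature names within a FeatureSet must be unique")
    # Users need not care about order: sort alphabetically before encoding.
    return sorted(field_names)

print(canonical_order(["b_feature", "a_feature"]))  # ['a_feature', 'b_feature']
```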
FYI, there is already a card tracking this issue (#310), but since we are already discussing it here I will close that one.
This is an internal blocker for us, so I am adding it to our roadmap for 0.5. I think we can just use this specific issue to flesh out the implementation. If you do want to create a new issue, then please move the 0.5 milestone to that issue.
The 3rd approach sounds like a good one: less storage with the same ingest performance. If field values are only going to be associated with field names at runtime by external configuration, has any thought been given to a method for ensuring that the same configuration that was used to write the data is the configuration used to read it? Something such as a checksum/fingerprint of the FeatureSet configuration, stored alongside the data in Redis (or in the key), would help prevent a mismatch of configuration and data due to a bug somewhere else in the system.
This is a good idea. We haven't discussed that yet, but it should be pretty straightforward to achieve: we are already generating hashes for object comparison, so we could standardize that and store it as part of the row. It should be easy to add later, though, and I think it might be out of scope for this specific issue.
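A sketch of how such a fingerprint could be stored alongside the values; the hashing scheme and payload layout here are assumptions for illustration, not the existing Feast object-comparison hashes:

```python
import hashlib

def spec_fingerprint(field_names: list) -> str:
    # A stable fingerprint of the spec's (sorted) field names; stored next to
    # the values so a reader can detect a spec/data mismatch before decoding.
    digest = hashlib.sha256("|".join(sorted(field_names)).encode()).hexdigest()
    return digest[:8]

fields = ["avg_basket_size", "days_since_last_order", "total_purchases_30d"]
payload = {"spec_hash": spec_fingerprint(fields), "values": [3.7, 12, 42]}

# On read: recompute spec_fingerprint from the serving-side spec and compare it
# against payload["spec_hash"] before reassociating names with values.
assert payload["spec_hash"] == spec_fingerprint(fields)
```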
#530 is now in release 0.4.7. Before the jobs are updated, do make sure that Feast Serving is already updated to 0.4.7. Otherwise, Feast Serving 0.4.6 (and earlier) will not be able to retrieve the features ingested by the updated job. Feast Serving 0.4.7 is backward compatible with Feast Core 0.4.6, hence it can be upgraded in place without issue.
Is your feature request related to a problem? Please describe.
As of now, the Redis value size is directly proportional to the length of the feature name strings. This becomes more significant when there are many features. As a result, the memory footprint can be significantly higher than the equivalent data stored in CSV / Parquet / Avro.
Describe the solution you'd like
Instead of writing the byte representation of a FeatureRow, which contains a list of Fields, it might suffice to simply store the list of Values instead. Feast Online Serving is already aware of the FeatureSetSpec, and hence the field names. This information can be used to reconstruct the FeatureRow from the list of Values. Without storing the feature names, the Redis value size can be drastically reduced.
Describe alternatives you've considered
Another alternative would be, instead of storing the feature names in Redis, to store the hash code of each feature name. Again, the corresponding FeatureRow can be reconstructed on the Feast Online Serving side.
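A minimal sketch of that alternative, using a made-up 4-byte hash; a real implementation would also need to handle hash collisions between feature names within a FeatureSet:

```python
import hashlib

def name_hash(feature_name: str) -> bytes:
    # Illustrative: a short, fixed-size stand-in for the full feature name.
    return hashlib.md5(feature_name.encode()).digest()[:4]

# Ingestion stores (hash, value) pairs; serving keeps a lookup table built
# from the FeatureSetSpec to map each hash back to the original name.
lookup = {name_hash(n): n for n in ["total_purchases_30d", "avg_basket_size"]}
stored = [(name_hash("total_purchases_30d"), 42),
          (name_hash("avg_basket_size"), 3.7)]
reconstructed = {lookup[h]: v for h, v in stored}
print(reconstructed)
```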
A last alternative would be to apply a compression algorithm (e.g. gzip/bzip2) to the byte array representation of the FeatureRow, though it is unlikely to surpass the above alternatives in terms of efficiency. There will also be an impact on latency due to the additional processing required during retrieval.
Additional context
Any approach to resolve this issue will likely result in backward-incompatible changes: the existing Redis values will no longer be interpretable once the way we store them changes. Therefore, we might have to include a Store-level configuration to toggle this feature on only when the user is ready to migrate.