Count-Min Sketch Support #2496
Replies: 2 comments 9 replies
-
Sorry but I don't know how would you like to storing the CMS Array? With "List" or self encoding int32? Is it in one key-value? Or having subfields? What would the metadata like? |
Beta Was this translation helpful? Give feedback.
-
Sorry, I haven't been able to work on this for a week. For the functionality, does it just need to be a data structure that can persist in the rocksdb storage layer and have elements added to it using the cmsketch's incrby command? |
Beta Was this translation helpful? Give feedback.
-
References: #2425
Overview
Count-Min Sketch is a probabilistic data structure used to estimate the frequency of elements in a data stream. It is particularly useful in scenarios where memory efficiency is important, such as network traffic analysis, real-time analytics, etc.
The core idea behind Count-Min Sketch is to use multiple hash functions to map each element in a data stream to different positions in a set of arrays. Each bucket in these arrays holds a count that represents the estimated frequency of the elements mapped to it.
Hashing: When an element appears in the data stream, it is passed through multiple hash functions. Each hash function produces an index corresponding to a position in a different array. The count at each of these positions is then incremented.
Estimating Frequency: To estimate the frequency of an element, the element is passed through the same hash functions, and the counts at the corresponding positions in the arrays are retrieved. The minimum value among these counts is taken as the estimate for the frequency of that element. This approach helps mitigate overestimation caused by hash collisions.
Structure of Count-Min Sketch
Array of Counters: The Count-Min Sketch maintains a 2-D array, where each row corresponds to a different hash function, and each column represents a possible index generated by these hash functions. The size of the array is determined by the desired accuracy and confidence level. The more rows and columns, the more accurate the results are at the cost of increased memory usage.
Updating the Sketch: When a new element is encountered, it is hashed multiple times, each hash directing the element to a different row of counters. The counts at the corresponding columns are incremented.
Querying: To estimate the frequency of an element, the same hash functions are applied, and the counts from the respective positions are retrieved.
Supported Commands on Redis
CMS.INCRBY - Increases the count of one or more items by increment.
CMS.INFO - Returns information about a sketch.
CMS.INITBYDIM - Initializes a Count-Min Sketch to dimensions specified by the user.
CMS.INITBYPROB - Initializes a Count-Min Sketch to accommodate requested tolerances.
CMS.MERGE - Merges several sketches into one sketch.
CMS.QUERY - Returns the count for one or more items in a sketch.
Implementation
I was looking into Redis' count min sketch implementation and noticed these things:
Hash Function Choice: Redis uses a modified version of the MurmurHash2 function to generate hash values for the items.
Storage Format
CMS Array: The CMS data structure is stored as a 2D array (flattened into a 1D array) of uint32_t integers. Each element in this array represents the count for a particular combination of hash function and item. The size of the array is determined by the width (number of columns) and depth (number of rows) of the CMS.
String Representation: In Redis, data structures like strings and bitmaps are generally treated as "string" types. However, for CMS, the structure is specifically tailored to handle frequency counts. The memory is made to be manipulated easily. Counts will be stored as uin32_t in binary format making it super compact.
Metadata: Metadata like width, depth, and counter can be stored at the beginning of the serialized value if using a single key-value pair.
Tuning
Redis' CMS implementation follows a fixed configuration for the depth and width based on error and probability parameters. However, tuning these parameters can significantly impact the performance and accuracy of the CMS:
Depth and Width Calculation: The width and depth of the CMS can be derived from the desired error rate (error) and the probability of failure (delta). The function CMS_DimFromProb calculates the width as ceil(2 / error) and the depth as ceil(log10f(delta) / log10f(0.5)). This fixed-size approach simplifies the usage and ensures consistent performance.
Merging CMS Structures: The function CMS_Merge allows combining multiple CMS instances into one. This is particularly useful in distributed systems where frequency data from different nodes need to be aggregated. The function takes care of summing the counts from multiple sources while checking for potential overflows to prevent data corruption.
I will be looking into how to properly integrate this with the storage layer, and support the commands on top of it, I will share my solution here once I have properly mapped it out. If anyone has any insights please feel free to share
Note: will be adding complexities
Resources
http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
https://github.com/RedisBloom/RedisBloom/blob/master/docs/docs/count-min-sketch.md
https://redis.io/docs/latest/develop/data-types/probabilistic/count-min-sketch/
https://redis.io/docs/latest/commands/?group=cms
ty @mapleFU for the format outline :)
Beta Was this translation helpful? Give feedback.
All reactions