Documentation/pointers to internal datastructures used by tinysets #5
Comments
I'll leave this open until I get documentation written, but the quick answer is that for large sets that require heap storage, Robin Hood hash maps are used, but with your number itself as the hash, so probably O(1) for most operations and O(N) for unions. For small sets that fit in 32 or 24 bytes, we just store the numbers themselves in an unordered array with no heap storage. In any case, we use the fewest bytes we can get away with for each set, based on the largest member, so if all your integers are less than 256 they are stored as single bytes. This lets us pack more numbers onto the stack.
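A minimal sketch of the byte-packing idea described above (this is illustrative, not tinyset's actual code): pick the fewest bytes per element based on the largest member, so small values pack more densely into a fixed inline buffer.

```rust
// Hypothetical helper (not from tinyset): how many bytes are needed
// per element, given the largest member of the set.
fn bytes_per_element(max_member: u64) -> usize {
    if max_member == 0 {
        1
    } else {
        let bits = 64 - max_member.leading_zeros() as usize;
        (bits + 7) / 8 // round up to whole bytes
    }
}

fn main() {
    // Everything below 256 fits in a single byte...
    assert_eq!(bytes_per_element(200), 1);
    // ...so a 24-byte inline buffer could hold 24 such values.
    assert_eq!(24 / bytes_per_element(200), 24);
    // Larger members need wider storage, so fewer fit inline.
    assert_eq!(bytes_per_element(70_000), 3);
    assert_eq!(24 / bytes_per_element(70_000), 8);
    println!("ok");
}
```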
Thanks for the explanation and looking forward to the more detailed documentation. I implemented bindings for the …
Can you add this to the documentation? That would help. I looked for these implementation details in the docs too, but only found them here.
I should mention that the explanation here is no longer accurate. Tinysets are now the size of a pointer. On a 64-bit system they can hold up to 7 values without heap allocation, provided the values are small enough. With fewer values, your values may be larger without heap allocation. It's a little complicated to predict when you'll hit the heap, since we store differences between sorted values, so it's not just how big your values are, but also how they are spaced.
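The inline storage described above can be illustrated with a deliberately simplified sketch (this is not tinyset's real layout, which also delta-encodes sorted values): one pointer-sized word holds a count in its low byte and up to 7 one-byte values in the remaining bytes, so no heap allocation is needed until an eighth value arrives.

```rust
// Simplified illustration of a pointer-sized inline set. Not tinyset's
// actual representation; just the "7 small values in one word" idea.
#[derive(Clone, Copy)]
struct TinyInline(u64);

impl TinyInline {
    fn new() -> Self {
        TinyInline(0)
    }
    /// Number of values currently stored (kept in the low byte).
    fn len(self) -> usize {
        (self.0 & 0xff) as usize
    }
    /// Store a one-byte value inline; returns false once full,
    /// which is where a real implementation would spill to the heap.
    fn push(&mut self, v: u8) -> bool {
        let n = self.len();
        if n >= 7 {
            return false;
        }
        self.0 |= (v as u64) << (8 * (n + 1)); // place value in byte n+1
        self.0 = (self.0 & !0xff) | (n as u64 + 1); // bump the count
        true
    }
    fn get(self, i: usize) -> u8 {
        ((self.0 >> (8 * (i + 1))) & 0xff) as u8
    }
}

fn main() {
    // The whole structure is one machine word on a 64-bit system.
    assert_eq!(std::mem::size_of::<TinyInline>(), 8);
    let mut s = TinyInline::new();
    for v in [3u8, 1, 4, 5, 9, 2, 6].iter().copied() {
        assert!(s.push(v)); // seven values fit without the heap
    }
    assert!(!s.push(7)); // the eighth would need heap storage
    assert_eq!(s.get(2), 4);
    println!("ok");
}
```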
Once your data hits the heap, storage again depends on the pattern of the set. If your set is dense up to its max, it will be stored as a bit set. If it is sparser, then your data will be stored as a sort of hash map with the more significant bits as the key, and a value that is a small bit set. It's all a little complicated, which is why I haven't found the time to document it. It should work very well at achieving the scaling of a hash set with minimal memory use, particularly on sets corresponding to indexes into an array (or Vec), where there is a natural maximum. It's especially good if values are "bunched", as could happen if data is added to a Vec at the same time that the corresponding indexes are added to sets.
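The sparse case above can be sketched as follows. This is a hedged illustration of the idea, not tinyset's internals: each value is split into its high bits, used as a map key, and its low 6 bits, used as a position in a 64-bit chunk, so "bunched" values share chunks and cost roughly one bit each.

```rust
use std::collections::HashMap;

// Illustrative sketch (names and split point are assumptions, not
// tinyset's): a map from high bits to a small 64-element bit set.
struct ChunkedBitSet {
    chunks: HashMap<u32, u64>,
}

impl ChunkedBitSet {
    fn new() -> Self {
        ChunkedBitSet { chunks: HashMap::new() }
    }
    fn insert(&mut self, v: u32) {
        let (key, bit) = (v >> 6, v & 63); // high bits key, low 6 bits position
        *self.chunks.entry(key).or_insert(0) |= 1 << bit;
    }
    fn contains(&self, v: u32) -> bool {
        let (key, bit) = (v >> 6, v & 63);
        self.chunks.get(&key).map_or(false, |c| (c >> bit) & 1 == 1)
    }
    fn len(&self) -> usize {
        self.chunks.values().map(|c| c.count_ones() as usize).sum()
    }
}

fn main() {
    let mut s = ChunkedBitSet::new();
    // "Bunched" values, like indexes into the same Vec, land in the
    // same few chunks, so memory use stays low.
    for v in 100..160 {
        s.insert(v);
    }
    assert!(s.contains(120));
    assert!(!s.contains(99));
    assert_eq!(s.len(), 60);
    assert!(s.chunks.len() <= 2); // 60 values, only 2 heap entries
    println!("ok");
}
```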
So, as far as I understand, you keep track of the max number in the set (in something like pow2 form) and cut off the "insignificant bits". Did you consider making some …
Yes, we track the max size at runtime. The point of SetU32 is that it stores data in a way that is better if you know at compile time that it can't exceed 32 bits, e.g. we can store the number of elements in a 32-bit number. In SetU64 we store everything in an array of u64, but in SetU32 we store an array of u32 internally. Honestly, you could use Set64 instead for both purposes if you'd like, and you could benchmark to see whether there are differences in performance.
According to your benchmarks, the 32-bit version is usually a tiny bit slower... maybe because u32 is not register-native on modern 64-bit CPUs.
But since you strip off most of the number's bytes, why does that matter? As far as I understand, you find the max number of bits anyway. Thanks for the explanation. Though, a description of that part would still be nice to have in the docs.
Yeah, I'll leave this open and hopefully add docs in the next week or two. The SetU32 will sometimes use less space than a SetU64. In the limit of a very sparse set, it will occupy half the space, which seems worth keeping as an option.
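A back-of-the-envelope sketch of the half-the-space claim above (this is arithmetic about element widths, not tinyset code): in the sparse limit, where each element occupies its own array slot, u32 storage is half the size of u64 storage.

```rust
// Illustrative only: compare the raw buffer size of a sparse set
// stored as an array of u32 versus an array of u64.
fn main() {
    let elements = 1_000;
    let setu32_bytes = elements * std::mem::size_of::<u32>();
    let setu64_bytes = elements * std::mem::size_of::<u64>();
    assert_eq!(setu32_bytes, 4_000);
    assert_eq!(setu64_bytes, 8_000);
    assert_eq!(setu64_bytes / setu32_bytes, 2); // half the space
    println!("ok");
}
```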
Nice library, thanks for releasing it on crates.io!
I did not find anywhere in the documentation an explanation of the data structures used by tinyset internally and the size/complexity bounds they provide. Even a couple of paragraphs or links to relevant papers would help me decide if this is a good match for my workloads.
I need to store between 0 and 100 16-bit numbers space-efficiently, as well as compute unions of such sets.
Thanks!