Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation/pointers to internal datastructures used by tinysets #5

Closed
ryzhyk opened this issue Mar 18, 2019 · 9 comments
Closed

Comments

@ryzhyk
Copy link
Contributor

ryzhyk commented Mar 18, 2019

Nice library, thanks for releasing it on crates.io!

I did not find anywhere in documentation an explanation of the data structures used by tinyset internally and the size/complexity bounds they provide. Even a couple of paragraphs or links to relevant papers would help me to decide if this is a good match for my workloads.

I need to store between 0 and 100 16-bit number space efficiently, as well as to compute unions of such sets.

Thanks!

@droundy
Copy link
Owner

droundy commented Apr 3, 2019

I'll leave this open until I get documentation written, but the quick answer is that for large sets that require heap storage Robin Hood hash maps are used, but with your number itself as the hash, so probably O(1) for most operations, O(N) for unions.

For small sets that can fit in 32 or 24 bytes, we just store the numbers themselves in an unordered array with no heap storage.

In any case, we use the fewest bytes we can get away with for each set, based on the largest member, so if all your integers are less than 255 they are stored as single bytes. This lets us pack more numbers onto the stack.

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Apr 3, 2019

Thanks for the explanation and looking forward to the more detailed documentation.

I implemented bindings for the tinysets::Set64 type as a library for our incremental Datalog engine and it's making a huge difference in workloads that manipulate many small sets.

@tower120
Copy link

I'll leave this open until I get documentation written, but the quick answer is that for large sets that require heap storage Robin Hood hash maps are used, but with your number itself as the hash, so probably O(1) for most operations, O(N) for unions.

For small sets that can fit in 32 or 24 bytes, we just store the numbers themselves in an unordered array with no heap storage.

In any case, we use the fewest bytes we can get away with for each set, based on the largest member, so if all your integers are less than 255 they are stored as single bytes. This lets us pack more numbers onto the stack.

Can you add this to documentation? That would help. I looked for this implementation details in doc too, but found them here.

@droundy
Copy link
Owner

droundy commented Sep 19, 2022

I should mention that the explanation here is no longer accurate. Tinysets now are the size of a pointer. On a 64-bit system they can hold up to 7 values without heap allocation, provided the values are small enough. With fewer values, your values may be larger without heap allocation. It's a little complicated to predict when you'll hit the heap, since we store differences between sorted values, so it's not just how big your values are, but also how they are spaced.

@droundy
Copy link
Owner

droundy commented Sep 19, 2022

Once your data hits the heap, storage again depends on the pattern of the set. If your set is dense up to its max, it will be stored as a bit set. If more sparse, then your data will be stored as a sort of hash map with the more significant bits as key, and a value that is a small bit set.

It's all a little complicated, which is why I haven't found the time to document it. It should work very well at achieving the scaling of a hash set with minimal memory use, particularly on sets corresponding to indexes into an array (or Vec), so there would be a natural maximum. It's especially good if values are "bunched", as could happen if data is added to a Vec at the same time that the corresponding indexes are added to sets.

@droundy droundy reopened this Sep 19, 2022
@tower120
Copy link

So, as far as I understand you keep tracking of max number in set (in something like pow2 form), and cut off "insignificant bits".
What's a point of Set32 then? Why not Fit64 everywhere?

Did you consider making some Set64<N> (where N - number of u64's), for greater inline storage?

@droundy
Copy link
Owner

droundy commented Sep 19, 2022

Yes, we track the max size at runtime.

The point of SetU32 is that it stores data in a way that is better if you know at compile time that it can't exceed 32 bits, e.g. we can store the number of elements in a 32-bit number. In SetU64 we store everything in an array of u64, but in SetU32 we store an array of u32 internally.

Honestly, you could use Set64 instead for both purposes, if you'd like, and you could time to see if there are differences in performance.

@tower120
Copy link

... if you'd like, and you could time to see if there are differences in performance.

According to your benchmarks 32 version usually a tiny bit slower... Maybe because u32 register-wise non-native for modern 64bit CPUs.

... In SetU64 we store everything in an array of u64, but in SetU32 we store an array of u32 internally. ...

But since you rip-off most of the number bytes, why that matters? As far as I understand you find max number of bits
that needed to hold the trimmed number, and put THAT amount of bits. For example, if you found that you need 16bits,
you put 2 numbers into one u32, and 4 into u64.
So... wouldn't storing them as [u8] be the same?

Anyway - thanks for explanation. Thou, description of that part would still be nice to have in doc.

@droundy
Copy link
Owner

droundy commented Sep 20, 2022

Yeah I'll leave open and hopefully add docs in the next week or two.

The SetU32 will sometimes use less space than a SetU64. In the limit of a very sparse set, it will occupy half the space, which seems worth keeping as an option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants