Documentation/pointers to internal datastructures used by tinysets #5

ryzhyk · 2019-03-18T22:19:52Z

Nice library, thanks for releasing it on crates.io!

I did not find anywhere in documentation an explanation of the data structures used by tinyset internally and the size/complexity bounds they provide. Even a couple of paragraphs or links to relevant papers would help me to decide if this is a good match for my workloads.

I need to store between 0 and 100 16-bit number space efficiently, as well as to compute unions of such sets.

Thanks!

The text was updated successfully, but these errors were encountered:

droundy · 2019-04-03T16:09:09Z

I'll leave this open until I get documentation written, but the quick answer is that for large sets that require heap storage Robin Hood hash maps are used, but with your number itself as the hash, so probably O(1) for most operations, O(N) for unions.

For small sets that can fit in 32 or 24 bytes, we just store the numbers themselves in an unordered array with no heap storage.

In any case, we use the fewest bytes we can get away with for each set, based on the largest member, so if all your integers are less than 255 they are stored as single bytes. This lets us pack more numbers onto the stack.

ryzhyk · 2019-04-03T17:10:32Z

Thanks for the explanation and looking forward to the more detailed documentation.

I implemented bindings for the tinysets::Set64 type as a library for our incremental Datalog engine and it's making a huge difference in workloads that manipulate many small sets.

tower120 · 2022-09-18T06:51:06Z

I'll leave this open until I get documentation written, but the quick answer is that for large sets that require heap storage Robin Hood hash maps are used, but with your number itself as the hash, so probably O(1) for most operations, O(N) for unions.

For small sets that can fit in 32 or 24 bytes, we just store the numbers themselves in an unordered array with no heap storage.

In any case, we use the fewest bytes we can get away with for each set, based on the largest member, so if all your integers are less than 255 they are stored as single bytes. This lets us pack more numbers onto the stack.

Can you add this to documentation? That would help. I looked for this implementation details in doc too, but found them here.

droundy · 2022-09-19T03:22:35Z

I should mention that the explanation here is no longer accurate. Tinysets now are the size of a pointer. On a 64-bit system they can hold up to 7 values without heap allocation, provided the values are small enough. With fewer values, your values may be larger without heap allocation. It's a little complicated to predict when you'll hit the heap, since we store differences between sorted values, so it's not just how big your values are, but also how they are spaced.

droundy · 2022-09-19T03:47:28Z

Once your data hits the heap, storage again depends on the pattern of the set. If your set is dense up to its max, it will be stored as a bit set. If more sparse, then your data will be stored as a sort of hash map with the more significant bits as key, and a value that is a small bit set.

It's all a little complicated, which is why I haven't found the time to document it. It should work very well at achieving the scaling of a hash set with minimal memory use, particularly on sets corresponding to indexes into an array (or Vec), so there would be a natural maximum. It's especially good if values are "bunched", as could happen if data is added to a Vec at the same time that the corresponding indexes are added to sets.

tower120 · 2022-09-19T05:19:12Z

So, as far as I understand you keep tracking of max number in set (in something like pow2 form), and cut off "insignificant bits".
What's a point of Set32 then? Why not Fit64 everywhere?

Did you consider making some Set64<N> (where N - number of u64's), for greater inline storage?

droundy · 2022-09-19T23:03:56Z

Yes, we track the max size at runtime.

The point of SetU32 is that it stores data in a way that is better if you know at compile time that it can't exceed 32 bits, e.g. we can store the number of elements in a 32-bit number. In SetU64 we store everything in an array of u64, but in SetU32 we store an array of u32 internally.

Honestly, you could use Set64 instead for both purposes, if you'd like, and you could time to see if there are differences in performance.

tower120 · 2022-09-20T07:10:24Z

... if you'd like, and you could time to see if there are differences in performance.

According to your benchmarks 32 version usually a tiny bit slower... Maybe because u32 register-wise non-native for modern 64bit CPUs.

... In SetU64 we store everything in an array of u64, but in SetU32 we store an array of u32 internally. ...

But since you rip-off most of the number bytes, why that matters? As far as I understand you find max number of bits
that needed to hold the trimmed number, and put THAT amount of bits. For example, if you found that you need 16bits,
you put 2 numbers into one u32, and 4 into u64.
So... wouldn't storing them as [u8] be the same?

Anyway - thanks for explanation. Thou, description of that part would still be nice to have in doc.

droundy · 2022-09-20T18:54:04Z

Yeah I'll leave open and hopefully add docs in the next week or two.

The SetU32 will sometimes use less space than a SetU64. In the limit of a very sparse set, it will occupy half the space, which seems worth keeping as an option.

droundy closed this as completed in f1b90ef May 16, 2019

droundy reopened this Sep 19, 2022

droundy closed this as completed in 3712588 Sep 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documentation/pointers to internal datastructures used by tinysets #5

Documentation/pointers to internal datastructures used by tinysets #5

ryzhyk commented Mar 18, 2019

droundy commented Apr 3, 2019

ryzhyk commented Apr 3, 2019

tower120 commented Sep 18, 2022

droundy commented Sep 19, 2022

droundy commented Sep 19, 2022

tower120 commented Sep 19, 2022

droundy commented Sep 19, 2022

tower120 commented Sep 20, 2022

droundy commented Sep 20, 2022

Documentation/pointers to internal datastructures used by tinysets #5

Documentation/pointers to internal datastructures used by tinysets #5

Comments

ryzhyk commented Mar 18, 2019

droundy commented Apr 3, 2019

ryzhyk commented Apr 3, 2019

tower120 commented Sep 18, 2022

droundy commented Sep 19, 2022

droundy commented Sep 19, 2022

tower120 commented Sep 19, 2022

droundy commented Sep 19, 2022

tower120 commented Sep 20, 2022

droundy commented Sep 20, 2022