Added new method #559

Merged
merged 5 commits into master from fix_getMaxCompactSketchBytes on May 15, 2024
Conversation

@leerho (Contributor) commented May 10, 2024

Added new method to compute absolute maximum number of storage bytes required for a CompactSketch given the configured number of nominal entries (power of 2).

The other method of a similar name really didn't need to be public, but it is used extensively in tests. Making it package-private now would force a major release.

Deprecated the old method as a public method.

Added a test.
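
For context, a minimal usage sketch of the new method (illustrative only; it assumes the final form of the API discussed below, i.e., a lgNomEntries parameter on the theta Sketch class, and the standard Memory classes):

  // Pre-allocate enough space to hold any compact form of a sketch configured with 2^12 nominal entries.
  final int lgNomEntries = 12;
  final int maxBytes = Sketch.getCompactSketchMaxBytes(lgNomEntries);
  final WritableMemory wmem = WritableMemory.allocate(maxBytes);
  final UpdateSketch sk = UpdateSketch.builder().setLogNominalEntries(lgNomEntries).build();
  // ... update sk with data ...
  final CompactSketch csk = sk.compact(true, wmem);  // guaranteed to fit in wmem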

The new method under review:

 * to compute absolute maximum number of storage bytes required for a
 * CompactSketch given the configured number of nominal entries (power of 2).
 * @return the maximum number of storage bytes required for a CompactSketch with the given
 * nomEntries.
 */
public static int getCompactSketchMaxBytes(final int nomEntries) {
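  // Hypothetical completion, for illustration only (the merged implementation is discussed in the
  // review thread below): retained entries are capped at 2 * nomEntries * 15/16 (REBUILD_THRESHOLD),
  // and the compact preamble is at most 3 longs.
  return ((2 * nomEntries * 15) / 16 + 3) * Long.BYTES;
}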
Contributor commented:

I would suggest reusing the existing logic like this:
(HeapQuickSelectSketch.setHashTableThreshold(lgK, lgK + 1) + Family.QUICKSELECT.getMaxPreLongs()) * Long.BYTES

By the way, I noticed that the setHashTableThreshold() method is duplicated in DirectQuickSelectSketch (not introduced by this PR)

Contributor commented:

By the way, the method name setHashTableThreshold is a bit misleading. It does not set anything. It computes and returns the result.

Contributor Author (leerho) commented:

  • I changed the name of the HeapQuickSelectSketch function to getHashTableThreshold(..)
  • However, the DirectQuickSelectSketchR function with the same name is actually different and uses a different constant. The Direct version is tuned to resize less often, because resizing off-heap is much more expensive than resizing on-heap. Nonetheless, I changed its name to getOffHeapHashTableThreshold(...).

As for the internal computation of getCompactSketchMaxBytes(...) I changed it to be a little different from what you proposed.

You proposed using the HeapQuickSelectSketch.getHashTableThreshold(lgK, lgK + 1).

Unfortunately, the function input was nomEntries and not lgNomEntries, which meant I would have to convert to lgK before using the proposed function. This is not a single-CPU-cycle operation. If you look at the JVM source code for Long.numberOfTrailingZeros(long i), you'll see that it is not trivial.

Meanwhile, we have for some time also been using lgNomEntries in Theta. (Go look at the Theta builder.) So I decided to switch the input to lgNomEntries.
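
For illustration, the conversion that the nomEntries signature would have required looks something like this (a sketch only; it assumes nomEntries is already an exact power of 2):

  // Hypothetical: derive lgK from a power-of-2 nomEntries before calling the threshold helper.
  final int lgK = Integer.numberOfTrailingZeros(nomEntries);  // e.g., 4096 -> 12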

  • The intent of the functions getHashTableThreshold(..) and getOffHeapHashTableThreshold(...) is to allow fine tuning of the resizing operation of the sketch, which we actually do, because we tune the resizing of the heap sketch differently from the off-heap sketch. Both of these were package-private, and neither was used in tests. So I decided to make the heap function private and the off-heap version protected, because it is used by the child sketch DirectQuickSelectSketch.

I really didn't want to use these special functions for getCompactSketchMaxBytes(...), so I rewrote it using one of your suggestions and leveraging the REBUILD_THRESHOLD constant already defined.
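
A sketch of the kind of computation described above (illustrative only, not necessarily the exact merged code; it assumes REBUILD_THRESHOLD = 15/16 and a maximum QUICKSELECT preamble of 3 longs):

  public static int getCompactSketchMaxBytes(final int lgNomEntries) {
    // max retained entries = 2^(lgNomEntries + 1) * REBUILD_THRESHOLD, plus up to 3 preamble longs
    return (int) ((2 << lgNomEntries) * REBUILD_THRESHOLD + Family.QUICKSELECT.getMaxPreLongs()) * Long.BYTES;
  }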

@AlexanderSaydakov (Contributor) commented May 15, 2024:

I don't quite understand why not use getHashTableThreshold() for max serialized size computation. The size depends exactly on the maximum number of retained entries in the hash table. Now the knowledge about 2 * K * rebuild_threshold is sitting in two places

Contributor Author (leerho) commented:

I see your point. But my point is that getHashTableThreshold(...) is doing multiple things, and forcing it to just return the rebuild threshold is a hack.

To fix this I will need to separate it into two functions, getHashTableResizeThreshold and getHashTableRebuildThreshold. The rebuild threshold should never change, but the resize threshold depends on the type of sketch.

I will fix this, but you will have to approve it again :)
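
A minimal sketch of what that separation might look like (hypothetical signatures; the values follow the constants discussed below, 15/16 for rebuild and 0.5 or 15/16 for resize):

  // Hypothetical refactor: one fixed rebuild threshold, one sketch-dependent resize threshold.
  private static int getHashTableRebuildThreshold(final int lgArrLongs) {
    return (int) Math.floor(REBUILD_THRESHOLD * (1 << lgArrLongs)); // 15/16, same for all sketches
  }

  private static int getHashTableResizeThreshold(final int lgArrLongs, final double resizeThreshold) {
    return (int) Math.floor(resizeThreshold * (1 << lgArrLongs));   // 0.5 on-heap, 15/16 off-heap
  }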

Contributor Author (leerho) commented:

I started to implement this idea of having one place that computes the 2 * K * rebuild_threshold, but it turned out to be much more complicated, so I decided to leave it alone for now.

  • The getHashTableThreshold(lgNomLongs, lgArrLongs) in both the Direct and Heap sketches actually does the computation from the lgArrLongs parameter, not the lgNomLongs parameter. This makes sense because it is computing the threshold for the hash table. The lgNomLongs is ONLY used to determine whether the sketch is in a growing/resizing mode or at its full size, when a full rebuild is required. (See the sketch after this list.)
  • This means that these 2 methods cannot use a common method that computes maximum capacity based on lgNomLongs.
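
As referenced in the list above, the heap helper has roughly this shape (an illustrative sketch, not necessarily the exact code):

  // The capacity comes from lgArrLongs; lgNomLongs only selects which fraction applies
  // (still resizing vs. at full size and due for a rebuild).
  private static int getHashTableThreshold(final int lgNomLongs, final int lgArrLongs) {
    final double fraction = (lgArrLongs <= lgNomLongs) ? RESIZE_THRESHOLD : REBUILD_THRESHOLD;
    return (int) Math.floor(fraction * (1 << lgArrLongs));
  }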

If and when we decide to make an improvement in this area, I want to do one more thing:

We have 3 static final double constants, REBUILD_THRESHOLD, RESIZE_THRESHOLD, and DQS_RESIZE_THRESHOLD, that are used for tuning the performance of the hash tables in different situations.

Unfortunately, at the time, I decided to use doubles for these constants. This forces a computation like this every time they are used:
return (int) Math.floor(<double_constant> * (1 << lgArrLongs));

However, the hash array sizes are always powers of 2 and never smaller than 16, and the constants (15/16 and 0.5) are fractions that produce an exact integer for all possible hash table sizes.

This means, instead of using double constants, we could use static final methods that eliminate the double arithmetic and Math.floor. So the above equation becomes, for one example:
return ((1 << lgArrLongs) * 15) / 16;
or even better:
return ((16 << lgArrLongs) - (1 << lgArrLongs)) >>> 4;
which is much faster.
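
For illustration, a quick check that the integer forms agree with the double form over a representative range of table sizes (a throwaway sketch, not part of the PR):

  // Hypothetical verification loop: all three expressions produce the same value.
  for (int lgArrLongs = 4; lgArrLongs <= 26; lgArrLongs++) {
    final int viaDouble = (int) Math.floor((15.0 / 16.0) * (1 << lgArrLongs));
    final int viaInts   = ((1 << lgArrLongs) * 15) / 16;
    final int viaShifts = ((16 << lgArrLongs) - (1 << lgArrLongs)) >>> 4;
    assert (viaDouble == viaInts) && (viaInts == viaShifts);
  }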

I looked into this, but eliminating these double constants propagates everywhere and is more complicated than I want to attempt right now. We can put this into our backlog.

Improvements to getCompactSketchMaxBytes(...)

Renamed two special functions, one in HeapQuickSelectSketch and the other in DirectQuickSelectSketchR, and changed their access privileges.
Also changed DirectQuickSelectSketchR::getOffHeapHashTableThreshold(..) to eliminate the unnecessary Math.floor(..) call.
@leerho merged commit f8772e4 into master on May 15, 2024
4 checks passed
@leerho deleted the fix_getMaxCompactSketchBytes branch on May 15, 2024 at 22:31