-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added new method #559
Added new method #559
Conversation
to compute absolute maximum number of storage bytes required for a CompactSketch given the configured number of nominal entries (power of 2).
* @return the maximum number of storage bytes required for a CompactSketch with the given | ||
* nomEntries. | ||
*/ | ||
public static int getCompactSketchMaxBytes(final int nomEntries) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest reusing the existing logic like this:
HeapQuickSelectSketch.setHashTableThreshold(lgK, lgK + 1) + Family.QUICKSELECT.getMaxPreLongs()) * Long.BYTES
By the way, I noticed that the setHashTableThreshold() method is duplicated in DirectQuickSelectSketch (not introduced by this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, the method name setHashTableThreshold is a bit misleading. It does not set anything. It computes and returns the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I changed the name of the
HeapQuickSelectSketch
function togetHashTableThreshold(..)
- However the
DirectQuickSelectSketchR
function with the same name is actually different and uses a different constant. The Direct version is tuned to delay having to resize less often because the cost of resizing in off-heap is much more expensive than on-heap. Nonetheless, I changed its name togetOffHeapHashTableThreshold(...)
.
As for the internal computation of getCompactSketchMaxBytes(...) I changed it to be a little different from what you proposed.
You proposed using the HeapQuickSelectSketch.getHashTableThreshold(lgK, lgK + 1).
Unfortunately, the function input was nomEntries and not lgNomEntries, which meant I would have to convert to LgK before using the proposed function. This is not a single CPU cycle operation. If you go look at the JVM source code for Long.numberOfTrailingZeros(long i)
you'll see that it is not trivial.
Meanwhile, we have for some time also been using lgNomEntries
in Theta. (Go look at the Theta builder.) So I decided to switch the input to lgNomEntries
.
- The intent of the functions
getHashTableThreshold(..)
andgetOffHeapHashTableThreshold(...)
is to allow fine tuning of the resizing operation the sketch, which we actually do because we tune the resizing of the heap sketch differently from the off-heap sketch. Both of these were package-private, and neither was used in test. So I decided to make the heap functionprivate
and the off-heap versionprotected
because it is used by the child sketchDirectQuickSelectSketch
.
I really didn't want to use these special functions for this getCompactSketchMaxBytes(...)
, so I rewrote it with one of your suggestions and leveraging the REBUILD_THRESHOLD constant already defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand why not use getHashTableThreshold() for max serialized size computation. The size depends exactly on the maximum number of retained entries in the hash table. Now the knowledge about 2 * K * rebuild_threshold is sitting in two places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. But my point is that the getHashTableThreshold(...) is doing multiple things and forcing it to just return the rebuild Thresh is a hack.
But to fix this I will need to separate the two functions getHashTableResizeThreshold and getHashTableRebuildThreshold. The resize Thresh should never change but the Resize Thresh depends on the type of sketch.
I will fix this, but you will have to approve it again :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started to implement this idea of trying to have one place that computes the 2 * K * rebuild_Threshold.
But it turned out to be much more complicated so I decided to leave it alone for now.
- The getHashTableThreshold(lgNomLongs, lgArrLongs) in both Direct and Heap sketches actually does the computation from the lgArrLongs parameter, not the lgNomLongs parameter. This makes sense since because this is computing the threshold for the hashTable. The lgNomLongs is ONLY used to determine if the sketch is in a growing/resizing mode or at its full size, when a full rebuild is required.
- This means that these 2 methods cannot use a common method that computes maximum capacity based on lgNomLongs.
If, and when, we decide to make an improvement in this area, I want to do one more improvement:
We have 3 static final double constants: REBUILD_THRESHOLD, RESIZE_THRESHOLD, AND DQS_RESIZE_THRESHOLD that are used for tuning the performance of the hash tables in different situations.
Unfortunately, at the time, I decided to use doubles for these constants. This forces a computation like this:
return (int) Math.floor(<double_constant> * (1 << lgArrLongs));
every time they are used.
However, the hash array sizes are always powers of 2, and never smaller than 16, and the constants are either 15/16 or 0.5, or fractions that can exactly produce an integer for all possible sizes of the hash tables.
This means, instead of using double constants, we could use static final methods that eliminate the double arithmetic and Math.floor. So the above equation becomes, for one example:
return ((1 << lgArrLongs) * 15) / 16;
or even better:
return ((16 << lgArrLongs) - (1 << lgArrLongs)) >>> 4
which is much faster.
I looked into this, but eliminating these double constants propagates everywhere and is a more complicated than I want to attempt right now. We can put this into our backlog.
Improvements to getCompactSketchMaxBytes(...) Renaming of two special functions one in HeapQuickSelectSketch and the other in DirectQuickSelectSketchR. Also changed their access privilege.
And DirectQuickSelectSketchR::getOffHeapHashTableThreshold(..) to eliminate the unnecessary Math.floor(..) function.
Added new method to compute absolute maximum number of storage bytes required for a CompactSketch given the configured number of nominal entries (power of 2).
The other method of a similar name really didn't need to be public, but it is used extensively in test.
Making it package-private now would force a major release.
Deprecated the old method as a public method.
Added a test.