Added new method #559

Merged
merged 5 commits into master from fix_getMaxCompactSketchBytes on May 15, 2024
Conversation

@leerho (Contributor) commented May 10, 2024

Added new method to compute absolute maximum number of storage bytes required for a CompactSketch given the configured number of nominal entries (power of 2).

The other method of a similar name really didn't need to be public, but it is used extensively in tests. Making it package-private now would force a major release.

Deprecated the old method as a public method.

Added a test.
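
For context, a minimal usage sketch of the new method (illustrative only; it assumes the final form of the API discussed below, i.e., a lgNomEntries parameter on the theta Sketch class, and the standard Memory classes):

  // Pre-allocate enough space to hold any compact form of a sketch configured with 2^12 nominal entries.
  final int lgNomEntries = 12;
  final int maxBytes = Sketch.getCompactSketchMaxBytes(lgNomEntries);
  final WritableMemory wmem = WritableMemory.allocate(maxBytes);
  final UpdateSketch sk = UpdateSketch.builder().setLogNominalEntries(lgNomEntries).build();
  // ... update sk with data ...
  final CompactSketch csk = sk.compact(true, wmem);  // guaranteed to fit in wmem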

The new method under review:

 * to compute absolute maximum number of storage bytes required for a
 * CompactSketch given the configured number of nominal entries (power of 2).
 * @return the maximum number of storage bytes required for a CompactSketch with the given
 * nomEntries.
 */
public static int getCompactSketchMaxBytes(final int nomEntries) {
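  // Hypothetical completion, for illustration only (the merged implementation is discussed in the
  // review thread below): retained entries are capped at 2 * nomEntries * 15/16 (REBUILD_THRESHOLD),
  // and the compact preamble is at most 3 longs.
  return ((2 * nomEntries * 15) / 16 + 3) * Long.BYTES;
}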
Contributor commented:

I would suggest reusing the existing logic like this:
(HeapQuickSelectSketch.setHashTableThreshold(lgK, lgK + 1) + Family.QUICKSELECT.getMaxPreLongs()) * Long.BYTES

By the way, I noticed that the setHashTableThreshold() method is duplicated in DirectQuickSelectSketch (not introduced by this PR)

Contributor commented:

By the way, the method name setHashTableThreshold is a bit misleading. It does not set anything. It computes and returns the result.

Contributor Author (leerho) commented:

  • I changed the name of the HeapQuickSelectSketch function to getHashTableThreshold(..)
  • However, the DirectQuickSelectSketchR function with the same name is actually different and uses a different constant. The Direct version is tuned to resize less often, because resizing off-heap is much more expensive than resizing on-heap. Nonetheless, I changed its name to getOffHeapHashTableThreshold(...).

As for the internal computation of getCompactSketchMaxBytes(...) I changed it to be a little different from what you proposed.

You proposed using the HeapQuickSelectSketch.getHashTableThreshold(lgK, lgK + 1).

Unfortunately, the function input was nomEntries and not lgNomEntries, which meant I would have to convert to lgK before using the proposed function. This is not a single-CPU-cycle operation. If you look at the JVM source code for Long.numberOfTrailingZeros(long i), you'll see that it is not trivial.

Meanwhile, we have for some time also been using lgNomEntries in Theta. (Go look at the Theta builder.) So I decided to switch the input to lgNomEntries.
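
For illustration, the conversion that the nomEntries signature would have required looks something like this (a sketch only; it assumes nomEntries is already an exact power of 2):

  // Hypothetical: derive lgK from a power-of-2 nomEntries before calling the threshold helper.
  final int lgK = Integer.numberOfTrailingZeros(nomEntries);  // e.g., 4096 -> 12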

  • The intent of the functions getHashTableThreshold(..) and getOffHeapHashTableThreshold(...) is to allow fine tuning of the resizing operation of the sketch, which we actually do, because we tune the resizing of the heap sketch differently from the off-heap sketch. Both of these were package-private, and neither was used in tests. So I decided to make the heap function private and the off-heap version protected, because it is used by the child sketch DirectQuickSelectSketch.

I really didn't want to use these special functions for getCompactSketchMaxBytes(...), so I rewrote it using one of your suggestions and leveraging the REBUILD_THRESHOLD constant already defined.
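
A sketch of the kind of computation described above (illustrative only, not necessarily the exact merged code; it assumes REBUILD_THRESHOLD = 15/16 and a maximum QUICKSELECT preamble of 3 longs):

  public static int getCompactSketchMaxBytes(final int lgNomEntries) {
    // max retained entries = 2^(lgNomEntries + 1) * REBUILD_THRESHOLD, plus up to 3 preamble longs
    return (int) ((2 << lgNomEntries) * REBUILD_THRESHOLD + Family.QUICKSELECT.getMaxPreLongs()) * Long.BYTES;
  }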

@AlexanderSaydakov (Contributor) commented May 15, 2024:

I don't quite understand why not use getHashTableThreshold() for max serialized size computation. The size depends exactly on the maximum number of retained entries in the hash table. Now the knowledge about 2 * K * rebuild_threshold is sitting in two places

Contributor Author (leerho) commented:

I see your point. But my point is that getHashTableThreshold(...) is doing multiple things, and forcing it to just return the rebuild threshold is a hack.

To fix this I will need to separate it into two functions, getHashTableResizeThreshold and getHashTableRebuildThreshold. The rebuild threshold should never change, but the resize threshold depends on the type of sketch.

I will fix this, but you will have to approve it again :)
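
A minimal sketch of what that separation might look like (hypothetical signatures; the values follow the constants discussed below, 15/16 for rebuild and 0.5 or 15/16 for resize):

  // Hypothetical refactor: one fixed rebuild threshold, one sketch-dependent resize threshold.
  private static int getHashTableRebuildThreshold(final int lgArrLongs) {
    return (int) Math.floor(REBUILD_THRESHOLD * (1 << lgArrLongs)); // 15/16, same for all sketches
  }

  private static int getHashTableResizeThreshold(final int lgArrLongs, final double resizeThreshold) {
    return (int) Math.floor(resizeThreshold * (1 << lgArrLongs));   // 0.5 on-heap, 15/16 off-heap
  }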

Contributor Author (leerho) commented:

I started to implement this idea of having one place that computes the 2 * K * rebuild_threshold, but it turned out to be much more complicated, so I decided to leave it alone for now.

  • The getHashTableThreshold(lgNomLongs, lgArrLongs) in both the Direct and Heap sketches actually does the computation from the lgArrLongs parameter, not the lgNomLongs parameter. This makes sense because it is computing the threshold for the hash table. The lgNomLongs is ONLY used to determine whether the sketch is in a growing/resizing mode or at its full size, when a full rebuild is required. (See the sketch after this list.)
  • This means that these 2 methods cannot use a common method that computes maximum capacity based on lgNomLongs.
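
As referenced in the list above, the heap helper has roughly this shape (an illustrative sketch, not necessarily the exact code):

  // The capacity comes from lgArrLongs; lgNomLongs only selects which fraction applies
  // (still resizing vs. at full size and due for a rebuild).
  private static int getHashTableThreshold(final int lgNomLongs, final int lgArrLongs) {
    final double fraction = (lgArrLongs <= lgNomLongs) ? RESIZE_THRESHOLD : REBUILD_THRESHOLD;
    return (int) Math.floor(fraction * (1 << lgArrLongs));
  }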

If and when we decide to make an improvement in this area, I want to do one more thing:

We have 3 static final double constants, REBUILD_THRESHOLD, RESIZE_THRESHOLD, and DQS_RESIZE_THRESHOLD, that are used for tuning the performance of the hash tables in different situations.

Unfortunately, at the time, I decided to use doubles for these constants. This forces a computation like this every time they are used:
return (int) Math.floor(<double_constant> * (1 << lgArrLongs));

However, the hash array sizes are always powers of 2 and never smaller than 16, and the constants (15/16 and 0.5) are fractions that produce an exact integer for all possible hash table sizes.

This means, instead of using double constants, we could use static final methods that eliminate the double arithmetic and Math.floor. So the above equation becomes, for one example:
return ((1 << lgArrLongs) * 15) / 16;
or even better:
return ((16 << lgArrLongs) - (1 << lgArrLongs)) >>> 4;
which is much faster.
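
For illustration, a quick check that the integer forms agree with the double form over a representative range of table sizes (a throwaway sketch, not part of the PR):

  // Hypothetical verification loop: all three expressions produce the same value.
  for (int lgArrLongs = 4; lgArrLongs <= 26; lgArrLongs++) {
    final int viaDouble = (int) Math.floor((15.0 / 16.0) * (1 << lgArrLongs));
    final int viaInts   = ((1 << lgArrLongs) * 15) / 16;
    final int viaShifts = ((16 << lgArrLongs) - (1 << lgArrLongs)) >>> 4;
    assert (viaDouble == viaInts) && (viaInts == viaShifts);
  }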

I looked into this, but eliminating these double constants propagates everywhere and is more complicated than I want to attempt right now. We can put this into our backlog.

Improvements to getCompactSketchMaxBytes(...)

Renamed two special functions, one in HeapQuickSelectSketch and the other in DirectQuickSelectSketchR, and changed their access privileges.
Also changed DirectQuickSelectSketchR::getOffHeapHashTableThreshold(..) to eliminate the unnecessary Math.floor(..) call.
@leerho merged commit f8772e4 into master on May 15, 2024
4 checks passed
@leerho deleted the fix_getMaxCompactSketchBytes branch on May 15, 2024 at 22:31