Optimize TypedSet and map_concat, array_union #15362
Conversation
The perf numbers look good. But is there a memory overhead? Is there a way to measure that?
Force-pushed 980ce35 to 7fdfe62 (compare)
@kaikalur Sreeni, thank you very much for reviewing the PR. Let's see the top memory consumers in OptimizedTypedSet.
Now look at the retained size calculation; it's now
vs. the one in TypedSet.
In BenchmarkMapConcat, the baseline TypedSet retained size was 530 bytes while the new implementation is 232 bytes.
@kaikalur This is the GC profile for BenchmarkMapConcat (note that with GC profiling enabled, the numbers for both cases went up a bit): Baseline
Now:
startEntryOffset * 2,
(endEntryOffset - startEntryOffset) * 2,
this);
(endEntryOffset - startEntryOffset) * 2, this);
}
nit: this on a new line, or everything on the same line
int positionCount = block.getPositionCount();

currentBlockIndex++;
blocks[currentBlockIndex] = block;
maybe blocks[++currentBlockIndex] = block?
I believe using ++ on the same line is not preferred by Presto. But it does save one line.
}
}
}
OptimizedTypedSet typedSet = new OptimizedTypedSet(type, 2, leftArrayCount + rightArrayCount, "array_union");
let's declare the 2 as a constant at the top
I actually think making the caller set the maxBlockCount was not a good idea. In the new iteration, I removed maxBlockCount and made the number of blocks growable, so this constant 2 is now removed.
A couple of questions. First, I don't see NULL being handled explicitly anywhere; hopefully we have some tests that take care of it.
Secondly, we are creating many arrays in these functions. I wonder if that will cause memory churn and GC issues.
@kaikalur Thanks for reviewing the PR. NULL was actually handled. In the past, there was a history where array functions didn't handle nulls correctly, please see #11978. It was because it was using In fact, all tests in
To have even more test coverage, I added a new test in
@kaikalur It may appear so, but it's not as simple as that. We discussed array_union last time; let's look at array_intersect and array_except. Take array_intersect for example: the original implementation requires creating two TypedSets and one external BlockBuilder, and each TypedSet contains an internal BlockBuilder:
But the new implementation builds only one OptimizedTypedSet and does not use any external BlockBuilder.
Let's look at retained size first: Baseline 5180
Using OptimizedTypedSet 2592
Even though blockPositionByHash is larger because it uses long[] instead of int[], the total retained size is still much smaller. Now let's look at allocations. In fact, almost all allocation metrics for the new implementation are better than the old one's for these functions, as shown in the GC profile.
To
This reduced the allocation rate from 6038 bytes/op to 5633 bytes/op and further reduced the CPU and elapsed cost by about 15%.
Force-pushed f35720d to 13ed5a5 (compare)
*/
public void union(Block block)
{
    int positionCount = block.getPositionCount();
let's move this assignment to L90 where it is being used
*/
public void intersect(Block block)
{
    int positionCount = block.getPositionCount();
let's move this inside the else to keep the scope small
perhaps we don't even need this variable. Can inline it into the loop directly
positionCount is used in the performance-critical for loop. I'm sure the function call will be inlined here, but it may still be an indirect memory access through the block pointer. Besides, it's used in two different places, and having a variable doesn't reduce readability. I'd prefer keeping the variable. I did move the declaration inside the else block.
int positionCount = block.getPositionCount();

currentBlockIndex++;
ensureBlocksCapacity(currentBlockIndex + 1);
perhaps it is cleaner to do the +1 inside the ensureCapacity instead of here
@sujay-jain The ensureCapacity function, by its name and by numerous uses of similar functions, implies that it ensures the object's size is at least the value of the input capacity parameter. Here the capacity to be ensured is currentBlockIndex + 1, because the size of an array is the largest index plus 1. If we move the +1 logic inside the function, then the function "ensures the capacity of the object array to be capacity + 1", which is weird. So I think it's better to keep it this way.
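The idiom under discussion can be sketched in isolation. This is a minimal, self-contained illustration with hypothetical names, not the PR's actual code:

```java
import java.util.Arrays;

public class GrowableBlocks
{
    private Object[] blocks = new Object[2];
    private int currentBlockIndex = -1;

    // Grows the backing array so it can hold at least `capacity` elements
    private void ensureBlocksCapacity(int capacity)
    {
        if (blocks.length < capacity) {
            blocks = Arrays.copyOf(blocks, Math.max(capacity, blocks.length * 2));
        }
    }

    public void addBlock(Object block)
    {
        currentBlockIndex++;
        // The caller passes index + 1 because an array holding index i needs length i + 1
        ensureBlocksCapacity(currentBlockIndex + 1);
        blocks[currentBlockIndex] = block;
    }

    public int size()
    {
        return currentBlockIndex + 1;
    }

    public static void main(String[] args)
    {
        GrowableBlocks g = new GrowableBlocks();
        for (int i = 0; i < 5; i++) {
            g.addBlock("block" + i);
        }
        System.out.println(g.size()); // prints 5
    }
}
```

Keeping the +1 at the call site, as argued above, preserves the usual contract that ensureCapacity(n) guarantees room for n elements.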
makes sense. I missed the part that we were passing in the index
int positionWithinBlock = (int) (blockPosition & 0xffff_ffff);
if (positionEqualsPosition(elementType, blocks[blockIndex], positionWithinBlock, block, position)) {
    return false;
}
the else part can be extracted into a function, since getInsertPosition is doing more or less the same thing.
@sujay-jain I understand they share a lot of similarities, but there are subtle differences. addElement returns a boolean while getInsertPosition returns an int; addElement actually inserts the element into the hash table while getInsertPosition doesn't. The reason for having both was to avoid the expensive hash value calculation when there are multiple hash tables, so that the hash value calculated for the first hash table can be reused by the second. I could change addElement to return an int, but then the call site would become something like this:
if (addElement(newBlockPositionByHash, hash, block, i) == INVALID_POSITION) {
    positions[positionsIndex++] = i;
}
This is not as easy to understand as returning a boolean. It is also common in standard libraries for a function that adds a new element to a set to return a boolean rather than an int. What do you think?
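For reference, this mirrors the convention in java.util.Set, where add returns whether the set changed:

```java
import java.util.HashSet;
import java.util.Set;

public class SetAddConvention
{
    public static void main(String[] args)
    {
        Set<Integer> set = new HashSet<>();
        // add returns true when the element was newly inserted...
        boolean inserted = set.add(42);
        // ...and false when it was already present
        boolean insertedAgain = set.add(42);
        System.out.println(inserted + " " + insertedAgain); // prints "true false"
    }
}
```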
union(block);
return;
}
else {
There's common functionality between except and intersect, e.g. the for loop. Can we try to extract the common parts into functions? It makes the code more readable and reduces the chance of bugs :)
@sujay-jain I understand they look similar, but the logic is different. intersect has
if (positionInBlockPositionByHash == INVALID_POSITION) {
    // Add the new element
}
but except has
if (positionInBlockPositionByHash != INVALID_POSITION) {
    // Add the new element
}
This makes it hard to extract a common method. The only part that can be extracted is
if (addElement(hashtable, hash, block, position)) {
positions[positionsIndex++] = position;
}
I tried to extract it to a private method:
private int addElementIfNotExists(Block block, int position, int[] positions, long[] hashtable, int hash, int positionsIndex)
{
if (addElement(hashtable, hash, block, position)) {
positions[positionsIndex++] = position;
}
return positionsIndex;
}
And now the callsites become
intersect:
for (int i = 0; i < positionCount; i++) {
int hash = getMaskedHash(hashPosition(elementType, block, i));
int positionInBlockPositionByHash = getInsertPosition(blockPositionByHash, hash, block, i);
if (positionInBlockPositionByHash == INVALID_POSITION) {
// add to the hash table if it exists in blockPositionByHash
positionsIndex = addElementIfNotExists(block, i, positions, newBlockPositionByHash, hash, positionsIndex);
}
}
except:
for (int i = 0; i < positionCount; i++) {
int hash = getMaskedHash(hashPosition(elementType, block, i));
int positionInBlockPositionByHash = getInsertPosition(blockPositionByHash, hash, block, i);
// add to the hash table if it does not exist in blockPositionByHash
if (positionInBlockPositionByHash != INVALID_POSITION) {
positionsIndex = addElementIfNotExists(block, i, positions, newBlockPositionByHash, hash, positionsIndex);
}
}
Now you have to jump and scroll down to check what addElementIfNotExists does. Its 6 parameters make it hard to know what's done inside. Also, it mixes operations on the hash table with operations on the positions array, which are two different objects in the class that serve different purposes. It doesn't save lines either: the two lines that got extracted into a function just add inconvenience for readers.
Combining these considerations, I think keeping it as is may be better. If you see other ways of doing this, please let me know!
ok, I think we can leave it as is if you feel that's cleaner. Thanks for trying it.
@sujay-jain Thank you very much for reviewing! I addressed your comments. Do you want to read it again?
LGTM
@sujay-jain thanks again for reviewing this PR. @mbasmanova Will you be able to review it for the second round?
@yingsu00 This is a rather large PR. I don't believe I have bandwidth to review it.
Mostly looks good. Only some nits
hashtable[hashPosition] = ((long) currentBlockIndex << 32) | position;
return true;
}
else {
The else is not necessary.
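The assignment above packs a block index and an intra-block position into a single long: the index in the high 32 bits and the position in the low 32. A minimal sketch of the round trip (illustrative only; the PR's exact masks may differ slightly):

```java
public class BlockPositionEncoding
{
    // Pack blockIndex into the high 32 bits and position into the low 32 bits
    static long pack(int blockIndex, int position)
    {
        return ((long) blockIndex << 32) | (position & 0xffff_ffffL);
    }

    static int blockIndex(long packed)
    {
        return (int) (packed >>> 32);
    }

    static int position(long packed)
    {
        return (int) (packed & 0xffff_ffffL);
    }

    public static void main(String[] args)
    {
        long packed = pack(3, 17);
        System.out.println(blockIndex(packed) + " " + position(packed)); // prints "3 17"
    }
}
```

Storing both coordinates in one long[] slot is what lets the hash table reference positions in previously added blocks without copying values.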
@@ -139,42 +142,36 @@ public static Block mapConcat(MapType mapType, Object state, Block[] maps)
    return maps[lastMapIndex];
}

Type keyType = mapType.getKeyType();
Type valueType = mapType.getValueType();
OptimizedTypedSet typedSet = new OptimizedTypedSet(keyType, maps.length, entries / 2, FUNCTION_NAME);
entries / 2 is very confusing to me. Maybe when computing entries on line 135, just divide by 2 there?
Agree it's confusing. This is because SingleMapBlock counts the positionCount for both the key and value blocks. But line 135 is inside the loop and I'd rather not add more computation there. I added the following comment:
// We need to divide the entries by 2 because the maps array is SingleMapBlocks and it had the positionCount twice as large as a normal Block
OptimizedTypedSet typedSet = new OptimizedTypedSet(keyType, maps.length, entries / 2, FUNCTION_NAME);
checkArgument(maxPositionCount >= 0, "maxPositionCount must not be negative");

this.elementType = requireNonNull(elementType, "elementType must not be null");
this.functionName = functionName;
What is this for?
@rongrong It used to be in TypedSet and was used in the error message for EXCEEDED_FUNCTION_MEMORY_LIMIT. But the new implementation removed this limit, so we don't need functionName anymore. I have removed it. Thank you for the catch!
@@ -92,7 +92,7 @@
private String name = "map_concat";

@Param({"left_empty", "right_empty", "both_empty", "non_empty"})
private String mapConfig = "left_empty";
private String mapConfig = "non_empty";
Is this change intended?
Yes, it is intended. left_empty means the left input array is empty, which is not the case we care about most.
Force-pushed f9bd8b9 to 6526b0a (compare)
The high-level idea (e.g. keeping the block and position in OptimizedTypedSet instead of copying by value) makes sense.
Curious whether OptimizedTypedSet supports different operations in a row (e.g. first union, then intersect)?
SelectedPositions selectedPositions = getPositionsForBlocks().get(i);
int positionCount = selectedPositions.size();

if (!selectedPositions.isList()) {
curious: when would this happen? (in that case we will return the current block positions)
@wenleix Nice catch. Right now there isn't one. I put the code there in case there are cases where the whole block (without duplicates) is added. Do you think I should remove it?
@wenleix I removed the if (!selectedPositions.isList()) {} code block. Thank you for the catch. Will you be able to take another look? Thank you!
@wenleix Thanks for reviewing! Yes, it does support different operations in a row. Example:
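The original example did not survive page extraction. Purely as an illustration of chaining operations, here is a toy stand-in (not the actual Presto class) that supports union, intersect, and except called one after another:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model of a set supporting chained operations; OptimizedTypedSet's
// real implementation works on Blocks and hash tables, not Integer sets.
public class ToyTypedSet
{
    private final Set<Integer> current = new LinkedHashSet<>();

    public void union(int[] values)
    {
        for (int v : values) {
            current.add(v);
        }
    }

    public void intersect(int[] values)
    {
        Set<Integer> other = new LinkedHashSet<>();
        for (int v : values) {
            other.add(v);
        }
        current.retainAll(other);
    }

    public void except(int[] values)
    {
        for (int v : values) {
            current.remove(v);
        }
    }

    public Set<Integer> result()
    {
        return current;
    }

    public static void main(String[] args)
    {
        ToyTypedSet set = new ToyTypedSet();
        set.union(new int[] {1, 2, 3});
        set.intersect(new int[] {2, 3, 4}); // now {2, 3}
        set.union(new int[] {5});           // now {2, 3, 5}
        System.out.println(set.result());   // prints [2, 3, 5]
    }
}
```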
The verifier has passed: https://www.internalfb.com/intern/presto/verifier/results/?test_id=47678 cc @kaikalur
@yingsu00 Ying, I remember that the verifier excludes potentially non-deterministic queries, e.g. queries with various map and array functions. How much coverage does this run have for the functions affected in this diff?
@mbasmanova I built a suite containing only queries with these functions.
JMH benchmark shows 1.82x improvement for the non_empty case when keyCount=100 and POSITIONS=1000:

Baseline: BenchmarkMapConcat.mapConcat  avgt  20  26710.925 ± 2005.756 ns/op  (retained size: 1,402,374 bytes)
After:    BenchmarkMapConcat.mapConcat  avgt  20  14605.437 ± 1209.786 ns/op  (retained size: 1,373,273 bytes)

When keyCount=1000 and POSITIONS=1000, the baseline just OOMed; the optimized version succeeded. Add different sizes to BenchmarkMapConcat
JMH benchmark shows up to 40% improvement:

Type    | Baseline | Specialized Baseline | OptimizedTypedSet | Gain%
BIGINT  | 5511     | 3742                 | 3320              | 40%
VARCHAR | 20414    | N/A                  | 14155             | 31%
JMH benchmark shows 40% improvement:

Before: BenchmarkArrayIntersect.arrayIntersect  avgt  10  618074.452 ± 119912.203 ns/op
After:  BenchmarkArrayIntersect.arrayIntersect  avgt  10  376854.064 ± 21616.063 ns/op
JMH benchmark shows 35% improvement:

Before: BenchmarkArrayIntersect.arrayIntersect  avgt  10  540349.423 ± 66298.751 ns/op
After:  BenchmarkArrayIntersect.arrayIntersect  avgt  10  350934.564 ± 34092.598 ns/op
@kaikalur I made BenchmarkMapConcat take multiple keySet sizes. Both retained size and allocations improved. When keyCount=1000 and POSITIONS=1000, the baseline just OOMed, but the optimized version succeeded. I think it's safe to say the optimized version is both more memory efficient and more CPU efficient. The JMH benchmark shows an 82% improvement for the non_empty case for map_concat when keyCount=100.
Do you have more questions about this PR? cc @rongrong
LGTM. I only went over the general logic for TypedSet and the map_concat and array_union implementations -- the other function implementations are straightforward, and @rongrong, @sujay-jain, and @kaikalur have already provided careful and in-depth review.
As a side note -- for a separate PR, it might also be worth considering optimizing some of the functions (e.g. array functions) to use the PROVIDED_BLOCKBUILDER convention to avoid one extra copy, as done in #13874 (essentially, instead of returning the block representing the array, the callee is provided a BlockBuilder to append the result to). It might only provide a moderate speedup, but if it's easy to implement it's probably still good to have :)
Merged #15362. Thanks for the contribution!
This PR resolves the "Avoid block building inside" part of #15361. The optimizations included in this PR are:

Avoid block building inside.
TypedSet has an internal BlockBuilder and appends each Block position being added to it. Block building using BlockBuilder is a costly and inefficient operation. Here the BlockBuilder is needed to 1) resolve hash table probing collisions and 2) rehash. However, neither is actually a problem. In most use cases, whole blocks (instead of several positions of a block) are added to the TypedSet, so problem 1) can be resolved by keeping track of the blocks being added; and 2) rehashing is not needed, since for most use cases we know the max number of entries before creating a TypedSet. So we provide a method that adds a whole Block and just records the positions in the set using SelectedPositions objects. With this new interface, internal operations can be streamlined into more efficient loops. It also gives the opportunity to produce the result Block using more efficient APIs that allow encapsulated memory copying.
In the new OptimizedTypedSet class, we offer several operations: union, intersect, and except. The operations can be called multiple times, but the user does need to specify the max set size when creating the OptimizedTypedSet.

Avoid computing the hash positions multiple times
The previous TypedSet usage sometimes required calculating the hash position multiple times when there are multiple TypedSets, for example in array_intersect and array_except. Such usages build one TypedSet R and, based on the probe result on R, insert new elements into another TypedSet B. This requires calculating the hash position multiple times, which can be avoided if the new TypedSet encapsulates the operations internally: the hash position calculated for hash table A can be reused by hash table B if the sizes and hash functions are the same.
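A minimal sketch of the hash-reuse idea described above (illustrative only; all names here are hypothetical, not Presto's): compute the masked hash once, then probe two same-sized tables with the same slot.

```java
import java.util.Arrays;

public class HashReuse
{
    private static final long EMPTY = -1;

    // Hypothetical mask for a power-of-two table size
    static int maskedHash(long rawHash, int tableSize)
    {
        return (int) (rawHash & (tableSize - 1));
    }

    // Linear-probing lookup: returns the slot holding `value`, or -1 if absent
    static int find(long[] table, int startSlot, long value)
    {
        for (int i = 0; i < table.length; i++) {
            int slot = (startSlot + i) % table.length;
            if (table[slot] == EMPTY) {
                return -1;
            }
            if (table[slot] == value) {
                return slot;
            }
        }
        return -1;
    }

    public static void main(String[] args)
    {
        long[] tableA = new long[8];
        long[] tableB = new long[8];
        Arrays.fill(tableA, EMPTY);
        Arrays.fill(tableB, EMPTY);

        long value = 42;
        // Hash computed once...
        int slot = maskedHash(Long.hashCode(value), tableA.length);
        tableA[slot] = value;

        // ...and reused to probe table A and insert into table B,
        // which is valid because both tables share size and hash function
        if (find(tableA, slot, value) != -1 && find(tableB, slot, value) == -1) {
            tableB[slot] = value;
        }
        System.out.println(tableB[slot]); // prints 42
    }
}
```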
JMH benchmark shows up to 40% improvement for array_union.
JMH benchmark shows 82% improvement for the non_empty case for map_concat when keyCount=100 and POSITIONS=1000.
When keyCount=1000 and POSITIONS=1000, the baseline just OOMed; the optimized version succeeded.
array_intersect showed up to 49% improvement in time and 72% savings in allocation rate. More detailed comparisons:
array_except JMH benchmark shows 40% improvement:
Production testing:
Tested using verifier on 1220 queries with map_concat, array_union, array_intersect, array_except. No failures found.