Optimize TypedSet and map_concat, array_union #15362
Conversation
The perf numbers look good. But is there a memory overhead? Is there a way to measure that?
Force-pushed 980ce35 to 7fdfe62 (compare)
@kaikalur Sreeni, thank you very much for reviewing the PR. Let's see the top memory consumers in OptimizedTypedSet.
Now look at the retained size calculation; it's now
vs. the one in TypedSet.
In BenchmarkMapConcat, the baseline TypedSet retained size was 530 bytes while the new implementation is 232 bytes.
@kaikalur This is the GC profile for BenchmarkMapConcat (note that with GC profiling enabled, the numbers for both cases went up a bit): Baseline
Now:
startEntryOffset * 2,
(endEntryOffset - startEntryOffset) * 2,
this);
(endEntryOffset - startEntryOffset) * 2, this);
}
nit: this on a new line, or everything on the same line
int positionCount = block.getPositionCount();

currentBlockIndex++;
blocks[currentBlockIndex] = block;
maybe blocks[++currentBlockIndex] = block?
I believe using ++ on the same line is not preferred by Presto. But it does save one line.
}
}
}
OptimizedTypedSet typedSet = new OptimizedTypedSet(type, 2, leftArrayCount + rightArrayCount, "array_union");
let's declare the 2 as a constant at the top
I actually think making the caller set the maxBlockCount was not a good idea. In the new iteration, I removed maxBlockCount and made the number of blocks growable, so this constant 2 is now removed.
A couple of questions. First, I don't see NULL being handled explicitly anywhere; hopefully we have some tests that take care of it.
Secondly, we are creating many arrays in these functions. I wonder if that will cause memory churn and GC issues.
@kaikalur Thanks for reviewing the PR. NULL was actually handled. In the past, there was a history where array functions didn't handle nulls correctly, please see #11978. It was because it was using In fact, all tests in
To have even more test coverage, I added a new test in
@kaikalur It may appear so, but it's not as simple as that. We discussed array_union last time; let's look at array_intersect and array_except. Take array_intersect for example: the original implementation requires creating two TypedSets and one external BlockBuilder, and each TypedSet contains an internal BlockBuilder:
But the new implementation builds only one OptimizedTypedSet and does not use any external BlockBuilder.
Let's look at retained size first: Baseline 5180
Using OptimizedTypedSet 2592
Even though blockPositionByHash is larger because it uses long[] instead of int[], the total retained size is still much smaller. Now let's look at allocations. In fact, almost all allocation metrics for the new implementation are better than the old one's for these functions, as shown in the GC profile.
To
This reduced the allocation rate from 6038 bytes/op to 5633 bytes/op and further reduced the CPU and elapsed cost by about 15%.
Force-pushed f35720d to 13ed5a5 (compare)
*/
public void union(Block block)
{
    int positionCount = block.getPositionCount();
let's move this assignment to L90 where it is being used
*/
public void intersect(Block block)
{
    int positionCount = block.getPositionCount();
let's move this inside the else to keep the scope small
perhaps we don't even need this variable. Can inline it into the loop directly
positionCount is used in the performance-critical for loop. I'm sure the function call will be inlined here, but it may still be an indirect memory access through the block pointer. Besides, it's used in two different places, and having a variable doesn't reduce readability. I'd prefer keeping the variable. I did move the declaration inside the else block.
int positionCount = block.getPositionCount();

currentBlockIndex++;
ensureBlocksCapacity(currentBlockIndex + 1);
perhaps it is cleaner to do the +1 inside the ensureCapacity instead of here
@sujay-jain The ensureCapacity function, by its name and by numerous uses of similar functions, implies that it ensures the object's size is at least the value of the input capacity parameter. Here the capacity to be ensured is currentBlockIndex + 1, because the size of an array is the largest index plus 1. If we move the +1 logic inside the function, then the function "ensures the capacity of the object array to be capacity + 1", which is weird. So I think it's better to keep it this way.
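The idiom under discussion can be sketched in isolation. This is a minimal, self-contained illustration with hypothetical names, not the PR's actual code:

```java
import java.util.Arrays;

public class GrowableBlocks
{
    private Object[] blocks = new Object[2];
    private int currentBlockIndex = -1;

    // Grows the backing array so it can hold at least `capacity` elements
    private void ensureBlocksCapacity(int capacity)
    {
        if (blocks.length < capacity) {
            blocks = Arrays.copyOf(blocks, Math.max(capacity, blocks.length * 2));
        }
    }

    public void addBlock(Object block)
    {
        currentBlockIndex++;
        // The caller passes index + 1 because an array holding index i needs length i + 1
        ensureBlocksCapacity(currentBlockIndex + 1);
        blocks[currentBlockIndex] = block;
    }

    public int size()
    {
        return currentBlockIndex + 1;
    }

    public static void main(String[] args)
    {
        GrowableBlocks g = new GrowableBlocks();
        for (int i = 0; i < 5; i++) {
            g.addBlock("block" + i);
        }
        System.out.println(g.size()); // prints 5
    }
}
```

Keeping the +1 at the call site, as argued above, preserves the usual contract that ensureCapacity(n) guarantees room for n elements.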
makes sense. I missed the part that we were passing in the index
int positionWithinBlock = (int) (blockPosition & 0xffff_ffff);
if (positionEqualsPosition(elementType, blocks[blockIndex], positionWithinBlock, block, position)) {
    return false;
}
the else part can be extracted into a function, since getInsertPosition is doing more or less the same thing.
@sujay-jain I understand they share a lot of similarities, but there are subtle differences. addElement returns a boolean while getInsertPosition returns an int; addElement actually inserts the element into the hash table while getInsertPosition doesn't. The reason for having both was to avoid the expensive hash value calculation when there are multiple hash tables, so that the hash value calculated for the first hash table can be reused by the second. I could change addElement to return an int, but then the call site would become something like this:
if (addElement(newBlockPositionByHash, hash, block, i) == INVALID_POSITION) {
    positions[positionsIndex++] = i;
}
This is not as easy to understand as returning a boolean. It is also common in standard libraries for a function that adds a new element to a set to return a boolean rather than an int. What do you think?
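For reference, this mirrors the convention in java.util.Set, where add returns whether the set changed:

```java
import java.util.HashSet;
import java.util.Set;

public class SetAddConvention
{
    public static void main(String[] args)
    {
        Set<Integer> set = new HashSet<>();
        // add returns true when the element was newly inserted...
        boolean inserted = set.add(42);
        // ...and false when it was already present
        boolean insertedAgain = set.add(42);
        System.out.println(inserted + " " + insertedAgain); // prints "true false"
    }
}
```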
union(block);
return;
}
else {
There's common functionality between except and intersect, e.g. the for loop. Can we try to extract the common parts into functions? It makes the code more readable and reduces the chance of bugs :)
@sujay-jain I understand they look similar, but the logic is different. intersect has
if (positionInBlockPositionByHash == INVALID_POSITION) {
    // Add the new element
}
but except has
if (positionInBlockPositionByHash != INVALID_POSITION) {
    // Add the new element
}
This makes it hard to extract a common method. The only part that can be extracted is
if (addElement(hashtable, hash, block, position)) {
positions[positionsIndex++] = position;
}
I tried to extract it to a private method:
private int addElementIfNotExists(Block block, int position, int[] positions, long[] hashtable, int hash, int positionsIndex)
{
if (addElement(hashtable, hash, block, position)) {
positions[positionsIndex++] = position;
}
return positionsIndex;
}
And now the callsites become
intersect:
for (int i = 0; i < positionCount; i++) {
int hash = getMaskedHash(hashPosition(elementType, block, i));
int positionInBlockPositionByHash = getInsertPosition(blockPositionByHash, hash, block, i);
if (positionInBlockPositionByHash == INVALID_POSITION) {
// add to the hash table if it exists in blockPositionByHash
positionsIndex = addElementIfNotExists(block, i, positions, newBlockPositionByHash, hash, positionsIndex);
}
}
except:
for (int i = 0; i < positionCount; i++) {
int hash = getMaskedHash(hashPosition(elementType, block, i));
int positionInBlockPositionByHash = getInsertPosition(blockPositionByHash, hash, block, i);
// add to the hash table if it does not exist in blockPositionByHash
if (positionInBlockPositionByHash != INVALID_POSITION) {
positionsIndex = addElementIfNotExists(block, i, positions, newBlockPositionByHash, hash, positionsIndex);
}
}
Now you have to jump and scroll down to check what addElementIfNotExists does. Its 6 parameters make it hard to know what's done inside. Also, it mixes operations on the hash table with operations on the positions array, which are two different objects in the class that serve different purposes. It doesn't save lines either: the two lines that got extracted into a function just add inconvenience for readers.
Combining these considerations, I think keeping it as is may be better. If you see other ways of doing this, please let me know!
ok, I think we can leave it as is if you feel that's cleaner. Thanks for trying it.
@sujay-jain Thank you very much for reviewing! I addressed your comments. Do you want to read it again?
LGTM
@sujay-jain thanks again for reviewing this PR. @mbasmanova Will you be able to review it for the second round?
@yingsu00 This is a rather large PR. I don't believe I have bandwidth to review it.
Mostly looks good. Only some nits
hashtable[hashPosition] = ((long) currentBlockIndex << 32) | position;
return true;
}
else {
The else is not necessary.
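The assignment above packs a block index and an intra-block position into a single long: the index in the high 32 bits and the position in the low 32. A minimal sketch of the round trip (illustrative only; the PR's exact masks may differ slightly):

```java
public class BlockPositionEncoding
{
    // Pack blockIndex into the high 32 bits and position into the low 32 bits
    static long pack(int blockIndex, int position)
    {
        return ((long) blockIndex << 32) | (position & 0xffff_ffffL);
    }

    static int blockIndex(long packed)
    {
        return (int) (packed >>> 32);
    }

    static int position(long packed)
    {
        return (int) (packed & 0xffff_ffffL);
    }

    public static void main(String[] args)
    {
        long packed = pack(3, 17);
        System.out.println(blockIndex(packed) + " " + position(packed)); // prints "3 17"
    }
}
```

Storing both coordinates in one long[] slot is what lets the hash table reference positions in previously added blocks without copying values.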
@@ -139,42 +142,36 @@ public static Block mapConcat(MapType mapType, Object state, Block[] maps)
    return maps[lastMapIndex];
}

Type keyType = mapType.getKeyType();
Type valueType = mapType.getValueType();
OptimizedTypedSet typedSet = new OptimizedTypedSet(keyType, maps.length, entries / 2, FUNCTION_NAME);
entries / 2 is very confusing to me. Maybe when computing entries on line 135, just divide by 2 there?
Agree it's confusing. This is because SingleMapBlock counts the positionCount for both the key and value blocks. But line 135 is inside the loop and I'd rather not add more computation there. I added the following comment:
// We need to divide the entries by 2 because the maps array is SingleMapBlocks and it had the positionCount twice as large as a normal Block
OptimizedTypedSet typedSet = new OptimizedTypedSet(keyType, maps.length, entries / 2, FUNCTION_NAME);
checkArgument(maxPositionCount >= 0, "maxPositionCount must not be negative");

this.elementType = requireNonNull(elementType, "elementType must not be null");
this.functionName = functionName;
What is this for?
@rongrong It used to be in TypedSet and was used in the error message for EXCEEDED_FUNCTION_MEMORY_LIMIT. But the new implementation removed this limit, so we don't need functionName anymore. I have removed it. Thank you for the catch!
@@ -92,7 +92,7 @@
private String name = "map_concat";

@Param({"left_empty", "right_empty", "both_empty", "non_empty"})
private String mapConfig = "left_empty";
private String mapConfig = "non_empty";
Is this change intended?
Yes, it is intended. left_empty means the left input array is empty, which is not the case we care about most.
Force-pushed f9bd8b9 to 6526b0a (compare)
The high-level idea (e.g. keeping the block and position in OptimizedTypedSet instead of copying by value) makes sense.
Curious whether OptimizedTypedSet supports different operations in a row (e.g. first union, then intersect)?
SelectedPositions selectedPositions = getPositionsForBlocks().get(i);
int positionCount = selectedPositions.size();

if (!selectedPositions.isList()) {
curious: when would this happen? (in that case we will return the current block positions)
@wenleix Nice catch. Right now there isn't one. I put the code there in case there are cases where the whole block (without duplicates) is added. Do you think I should remove it?
@wenleix I removed the if (!selectedPositions.isList()) {} code block. Thank you for the catch. Will you be able to take another look? Thank you!
@wenleix Thanks for reviewing! Yes, it does support different operations in a row. Example:
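The original example did not survive page extraction. Purely as an illustration of chaining operations, here is a toy stand-in (not the actual Presto class) that supports union, intersect, and except called one after another:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model of a set supporting chained operations; OptimizedTypedSet's
// real implementation works on Blocks and hash tables, not Integer sets.
public class ToyTypedSet
{
    private final Set<Integer> current = new LinkedHashSet<>();

    public void union(int[] values)
    {
        for (int v : values) {
            current.add(v);
        }
    }

    public void intersect(int[] values)
    {
        Set<Integer> other = new LinkedHashSet<>();
        for (int v : values) {
            other.add(v);
        }
        current.retainAll(other);
    }

    public void except(int[] values)
    {
        for (int v : values) {
            current.remove(v);
        }
    }

    public Set<Integer> result()
    {
        return current;
    }

    public static void main(String[] args)
    {
        ToyTypedSet set = new ToyTypedSet();
        set.union(new int[] {1, 2, 3});
        set.intersect(new int[] {2, 3, 4}); // now {2, 3}
        set.union(new int[] {5});           // now {2, 3, 5}
        System.out.println(set.result());   // prints [2, 3, 5]
    }
}
```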
The verifier has passed: https://www.internalfb.com/intern/presto/verifier/results/?test_id=47678 cc @kaikalur
@yingsu00 Ying, I remember that the verifier excludes potentially non-deterministic queries, e.g. queries with various map and array functions. How much coverage does this run have for the functions affected in this diff?
@mbasmanova I built a suite containing only queries with these functions.
JMH benchmark shows 1.82x improvement for the non_empty case when keyCount=100 and POSITIONS=1000:

Baseline: BenchmarkMapConcat.mapConcat  avgt  20  26710.925 ± 2005.756 ns/op  (retained size: 1,402,374 bytes)
After:    BenchmarkMapConcat.mapConcat  avgt  20  14605.437 ± 1209.786 ns/op  (retained size: 1,373,273 bytes)

When keyCount=1000 and POSITIONS=1000, the baseline just OOMed; the optimized version succeeded. Add different sizes to BenchmarkMapConcat
JMH benchmark shows up to 40% improvement:

Type    | Baseline | Specialized Baseline | OptimizedTypedSet | Gain%
BIGINT  | 5511     | 3742                 | 3320              | 40%
VARCHAR | 20414    | N/A                  | 14155             | 31%
JMH benchmark shows 40% improvement:

Before: BenchmarkArrayIntersect.arrayIntersect  avgt  10  618074.452 ± 119912.203 ns/op
After:  BenchmarkArrayIntersect.arrayIntersect  avgt  10  376854.064 ± 21616.063 ns/op
JMH benchmark shows 35% improvement:

Before: BenchmarkArrayIntersect.arrayIntersect  avgt  10  540349.423 ± 66298.751 ns/op
After:  BenchmarkArrayIntersect.arrayIntersect  avgt  10  350934.564 ± 34092.598 ns/op
@kaikalur I made BenchmarkMapConcat take multiple keySet sizes. Both retained size and allocations improved. When keyCount=1000 and POSITIONS=1000, the baseline just OOMed, but the optimized version succeeded. I think it's safe to say the optimized version is both more memory efficient and more CPU efficient. The JMH benchmark shows an 82% improvement for the non_empty case for map_concat when keyCount=100.
Do you have more questions about this PR? cc @rongrong
LGTM. I only went over the general logic for TypedSet and the map_concat and array_union implementations -- the other function implementations are straightforward, and @rongrong, @sujay-jain, and @kaikalur have already provided careful and in-depth review.
As a side note -- for a separate PR, it might also be worth considering optimizing some of the functions (e.g. array functions) to use the PROVIDED_BLOCKBUILDER convention to avoid one extra copy, as done in #13874 (essentially, instead of returning the block representing the array, the callee is provided a BlockBuilder to append the result to). It might only provide a moderate speedup, but if it's easy to implement it's probably still good to have :)
Merged #15362. Thanks for the contribution!
This PR resolves the "Avoid block building inside" part of #15361. The optimizations included in this PR are:

Avoid block building inside.
TypedSet has an internal BlockBuilder and appends each Block position being added to it. Block building using BlockBuilder is a costly and inefficient operation. Here the BlockBuilder is needed to 1) resolve hash table probing collisions and 2) rehash. However, neither is actually a problem. In most use cases, whole blocks (instead of several positions of a block) are added to the TypedSet, so problem 1) can be resolved by keeping track of the blocks being added; and 2) rehashing is not needed, since for most use cases we know the max number of entries before creating a TypedSet. So we provide a method that adds a whole Block and just records the positions in the set using SelectedPositions objects. With this new interface, internal operations can be streamlined into more efficient loops. It also gives the opportunity to produce the result Block using more efficient APIs that allow encapsulated memory copying.
In the new OptimizedTypedSet class, we offer several operations: union, intersect, and except. The operations can be called multiple times, but the user does need to specify the max set size when creating the OptimizedTypedSet.

Avoid computing the hash positions multiple times
The previous TypedSet usage sometimes required calculating the hash position multiple times when there are multiple TypedSets, for example in array_intersect and array_except. Such usages build one TypedSet R and, based on the probe result on R, insert new elements into another TypedSet B. This requires calculating the hash position multiple times, which can be avoided if the new TypedSet encapsulates the operations internally: the hash position calculated for hash table A can be reused by hash table B if the sizes and hash functions are the same.
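A minimal sketch of the hash-reuse idea described above (illustrative only; all names here are hypothetical, not Presto's): compute the masked hash once, then probe two same-sized tables with the same slot.

```java
import java.util.Arrays;

public class HashReuse
{
    private static final long EMPTY = -1;

    // Hypothetical mask for a power-of-two table size
    static int maskedHash(long rawHash, int tableSize)
    {
        return (int) (rawHash & (tableSize - 1));
    }

    // Linear-probing lookup: returns the slot holding `value`, or -1 if absent
    static int find(long[] table, int startSlot, long value)
    {
        for (int i = 0; i < table.length; i++) {
            int slot = (startSlot + i) % table.length;
            if (table[slot] == EMPTY) {
                return -1;
            }
            if (table[slot] == value) {
                return slot;
            }
        }
        return -1;
    }

    public static void main(String[] args)
    {
        long[] tableA = new long[8];
        long[] tableB = new long[8];
        Arrays.fill(tableA, EMPTY);
        Arrays.fill(tableB, EMPTY);

        long value = 42;
        // Hash computed once...
        int slot = maskedHash(Long.hashCode(value), tableA.length);
        tableA[slot] = value;

        // ...and reused to probe table A and insert into table B,
        // which is valid because both tables share size and hash function
        if (find(tableA, slot, value) != -1 && find(tableB, slot, value) == -1) {
            tableB[slot] = value;
        }
        System.out.println(tableB[slot]); // prints 42
    }
}
```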
JMH benchmark shows up to 40% improvement for array_union.
JMH benchmark shows 82% improvement for the non_empty case for map_concat when keyCount=100 and POSITIONS=1000.
When keyCount=1000 and POSITIONS=1000, the baseline just OOMed; the optimized version succeeded.
array_intersect showed up to 49% improvement in time and 72% savings in allocation rate. More detailed comparisons:
array_except JMH benchmark shows 40% improvement:
Production testing:
Tested using verifier on 1220 queries with map_concat, array_union, array_intersect, array_except. No failures found.