
Optimize TypedSet and map_concat, array_union #15362

Merged: 5 commits merged into prestodb:master on Nov 30, 2020

Conversation

@yingsu00 (Contributor) commented Oct 27, 2020

This PR resolves the "Avoid block building inside" part of #15361. The optimizations included in this PR are:

  • Avoid block building inside.

    TypedSet has an internal BlockBuilder and appends each Block position being added to it. Block building via BlockBuilder is a highly costly and inefficient operation. The BlockBuilder is needed to 1) resolve hash-table probing collisions and 2) rehash. However, neither is actually a problem: in most use cases, whole blocks (rather than a few positions of a block) are added to the TypedSet, so 1) can be resolved by keeping track of the blocks being added, and 2) rehashing is not needed since, for most use cases, we know the maximum number of entries before creating a TypedSet. So we provide a method that adds a whole Block and records the positions in the set using SelectedPositions objects. With this new interface, internal operations can be streamlined into more efficient loops. It also gives an opportunity to get the result Block using more efficient APIs that allow encapsulated memory copying.

    In the new OptimizedTypedSet class, we offer several operations: union, intersect, and except. The operations can be called multiple times, but the user does need to specify the max set size when creating the OptimizedTypedSet (see the usage sketch after this list).

  • Avoid computing the hash positions multiple times
    The previous TypedSet usage sometimes requires calculating the hash position multiple times when there are multiple TypedSets, for example in array_intersect and array_except. Such usages build one TypedSet A and, based on the probe result on A, insert new elements into another TypedSet B. This requires calculating the hash position multiple times, which can be avoided if the new TypedSet encapsulates the operations inside: the hash position calculated for hash table A can be reused by hash table B if the sizes and hash functions are the same.

  • JMH benchmark shows up to 40% improvement for array_union:

    Type    | Baseline | Specialized Baseline | OptimizedTypedSet | Gain%
    --------|----------|----------------------|-------------------|------
    BIGINT  |     5511 |                 3742 |              3320 |   40%
    VARCHAR |    20414 |                  N/A |             14155 |   31%
  • JMH benchmark shows 82% improvement for non_empty case for map_concat when keyCount=100
    and POSITIONS=1000:

    Baseline

    Benchmark                     Mode  Cnt      Score      Error  Units
    BenchmarkMapConcat.mapConcat  avgt   20  26710.925 ± 2005.756  ns/op
    Retained Size: 1,402,374 bytes

    After
    Benchmark                     Mode  Cnt      Score      Error  Units
    BenchmarkMapConcat.mapConcat  avgt   20  14605.437 ± 1209.786  ns/op
    Retained Size: 1,373,273 bytes

When keyCount=1000 and POSITIONS=1000, the baseline just OOMed. The optimized version succeeded.

  • Array_intersect showed up to 49% improvement in time and 72% savings in allocation rate. More detailed comparisons:
    [Two screenshots: detailed JMH comparison of time and allocation for array_intersect, Nov 5 2020]

  • array_except JMH benchmark shows 40% improvement:

    Before
    Benchmark                               Mode  Cnt       Score        Error  Units
    BenchmarkArrayIntersect.arrayIntersect  avgt   10  618074.452 ± 119912.203  ns/op

    After
    Benchmark                               Mode  Cnt       Score       Error  Units
    BenchmarkArrayIntersect.arrayIntersect  avgt   10  376854.064 ± 21616.063  ns/op
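
As a usage sketch (assembled from snippets later in this thread; the constructor signature changed during review and the result accessor shown here is a hypothetical name, so treat this as an assumption rather than the final API), array_intersect over the new set looks roughly like:

// Sketch: build the set once, then chain set operations on whole Blocks.
// OptimizedTypedSet records surviving positions per block as SelectedPositions
// instead of copying values into an internal BlockBuilder.
OptimizedTypedSet typedSet = new OptimizedTypedSet(type, rightPositionCount, "array_intersect");
typedSet.union(rightArray);      // seed the set with the distinct elements of rightArray
typedSet.intersect(leftArray);   // keep only elements also present in leftArray
Block result = typedSet.getBlock();  // hypothetical accessor for the result Block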

Production testing:
Tested using verifier on 1220 queries with map_concat, array_union, array_intersect, array_except. No failures found.

== NO RELEASE NOTE ==

@kaikalur (Contributor) commented:

The perf numbers look good. But is there a memory overhead? Is there a way to measure that?

@yingsu00 force-pushed the TypedSet branch 2 times, most recently from 980ce35 to 7fdfe62 on October 27, 2020 19:16
@yingsu00 (Contributor, Author) commented Oct 28, 2020

The perf numbers look good. But is there a memory overhead? Is there a way to measure that?

@kaikalur Sreeni, thank you very much for reviewing the PR. Let's look at the top memory consumers in OptimizedTypedSet:

  • Block[] blocks
    This keeps references to the incoming blocks. In the original implementation, the caller of TypedSet also needed to keep track of them implicitly while adding elements to it, so this neither increases memory consumption nor prolongs the Block objects' life cycle.

  • List<SelectedPositions> positionsForBlocks
    The list of SelectedPositions is a small fraction of the original block builder elementBlock. Since we removed elementBlock, memory usage actually dropped.

  • long[] blockPositionByHash
    This is the hash table, which was an IntArrayList in the original TypedSet. Since we know the upper limit of the hash table size, there is no need for a dynamically growing structure; a plain array (each entry packs a block index and a position into a long) is more efficient.

So looking at the retained size calculation, it's now

public long getRetainedSizeInBytes()
{
    // Instance overhead, the hash table, and the per-block position arrays
    long sizeInBytes = INSTANCE_SIZE + ARRAY_LIST_INSTANCE_SIZE + sizeOf(blockPositionByHash);
    for (int i = 0; i <= currentBlockIndex; i++) {
        sizeInBytes += sizeOf(positionsForBlocks.get(i).getPositions());
    }
    return sizeInBytes;
}

vs. the one in TypedSet

public long getRetainedSizeInBytes()
{
    return INSTANCE_SIZE + INT_ARRAY_LIST_INSTANCE_SIZE + elementBlock.getRetainedSizeInBytes() + blockPositionByHash.size() * Integer.BYTES;
}

In BenchmarkMapConcat, the baseline TypedSet retained size was 530 bytes, while the new implementation's is 232 bytes.

@yingsu00 (Contributor, Author) commented Oct 28, 2020

@kaikalur This is the GC profile for BenchmarkMapConcat (note that with GC profiling enabled, the numbers for both cases went up a bit):

Baseline

Benchmark                                                      Mode  Cnt     Score      Error   Units
BenchmarkMapConcat.mapConcat                                   avgt   20  1090.895 ±   84.847   ns/op
BenchmarkMapConcat.mapConcat:·gc.alloc.rate                    avgt   20   692.512 ±   46.312  MB/sec
BenchmarkMapConcat.mapConcat:·gc.alloc.rate.norm               avgt   20  1565.082 ±   13.184    B/op
BenchmarkMapConcat.mapConcat:·gc.churn.PS_Eden_Space           avgt   20   703.407 ±  633.867  MB/sec
BenchmarkMapConcat.mapConcat:·gc.churn.PS_Eden_Space.norm      avgt   20  1606.711 ± 1448.248    B/op
BenchmarkMapConcat.mapConcat:·gc.churn.PS_Survivor_Space       avgt   20     0.168 ±    0.285  MB/sec
BenchmarkMapConcat.mapConcat:·gc.churn.PS_Survivor_Space.norm  avgt   20     0.377 ±    0.637    B/op
BenchmarkMapConcat.mapConcat:·gc.count                         avgt   20    10.000             counts
BenchmarkMapConcat.mapConcat:·gc.time                          avgt   20    14.000                 ms

Now:

Benchmark                                                      Mode  Cnt     Score      Error   Units
BenchmarkMapConcat.mapConcat                                   avgt   20   816.268 ±   71.719   ns/op
BenchmarkMapConcat.mapConcat:·gc.alloc.rate                    avgt   20   791.006 ±   58.960  MB/sec
BenchmarkMapConcat.mapConcat:·gc.alloc.rate.norm               avgt   20  1337.685 ±    0.005    B/op
BenchmarkMapConcat.mapConcat:·gc.churn.PS_Eden_Space           avgt   20   739.708 ±  738.168  MB/sec
BenchmarkMapConcat.mapConcat:·gc.churn.PS_Eden_Space.norm      avgt   20  1296.181 ± 1295.956    B/op
BenchmarkMapConcat.mapConcat:·gc.churn.PS_Survivor_Space       avgt   20     0.111 ±    0.316  MB/sec
BenchmarkMapConcat.mapConcat:·gc.churn.PS_Survivor_Space.norm  avgt   20     0.208 ±    0.602    B/op
BenchmarkMapConcat.mapConcat:·gc.count                         avgt   20     9.000             counts
BenchmarkMapConcat.mapConcat:·gc.time                          avgt   20    13.000                 ms

The gc.alloc.rate.norm, gc.churn.PS_Eden_Space.norm, gc.count, and gc.time all dropped.

     startEntryOffset * 2,
-    (endEntryOffset - startEntryOffset) * 2,
-    this);
+    (endEntryOffset - startEntryOffset) * 2, this);
 }

@sujay-jain (Member) commented Nov 1, 2020:

nit: this on a new line or everything on same line

int positionCount = block.getPositionCount();

currentBlockIndex++;
blocks[currentBlockIndex] = block;
@sujay-jain (Member) commented Nov 1, 2020:
maybe blocks[++currentBlockIndex] = block?

@yingsu00 (Contributor, Author) replied:

I believe using ++ on the same line is not preferred in Presto's style, but it does save one line.

}
}
}
OptimizedTypedSet typedSet = new OptimizedTypedSet(type, 2, leftArrayCount + rightArrayCount, "array_union");

@sujay-jain (Member) commented:
let's declare the 2 as a constant at the top

@yingsu00 (Contributor, Author) replied:

I actually think making the caller set the maxBlockCount was not very good. In the new iteration, I removed maxBlockCount and made the number of blocks growable, so this constant 2 is now removed.

@kaikalur (Contributor) left a comment:

Couple of questions - I don't see NULL being handled explicitly anywhere. Hopefully we have some tests that take care of it.

Secondly, we are creating many arrays in these functions. I wonder if that will cause memory churn and GC issues.

@yingsu00 (Contributor, Author) commented Nov 5, 2020

Couple of questions - I don't see NULL being handled explicitly anywhere. Hopefully we have some tests that take care of it.

@kaikalur Thanks for reviewing the PR. NULL is actually handled. Historically, array functions didn't handle nulls correctly (see #11978) because they used type.equalTo(), which couldn't distinguish 0 from null. This was corrected by using positionEqualsPosition(), and the internal hashtable can handle null correctly now. The original logic of having a hasNulls variable is no longer needed.
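
As a sketch of the distinction (based on the standard Presto TypeUtils semantics, not code from this PR): a NULL slot's underlying value storage is typically 0, so comparing values directly makes NULL collide with a real 0; checking the null flags first keeps them distinct.

// Why positionEqualsPosition handles NULLs while type.equalTo alone cannot
static boolean positionEqualsPosition(Type type, Block leftBlock, int leftPosition, Block rightBlock, int rightPosition)
{
    boolean leftIsNull = leftBlock.isNull(leftPosition);
    boolean rightIsNull = rightBlock.isNull(rightPosition);
    if (leftIsNull || rightIsNull) {
        return leftIsNull && rightIsNull; // NULL matches only NULL
    }
    return type.equalTo(leftBlock, leftPosition, rightBlock, rightPosition);
}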

In fact, all tests in TestOptimizedTypedSet have cases for nulls. For example, in testUnionWithDistinctValues:

@Test
public void testUnionWithDistinctValues()
{
    OptimizedTypedSet typedSet = new OptimizedTypedSet(BIGINT, POSITIONS_PER_PAGE + 1, FUNCTION_NAME);

    Block block = createLongSequenceBlock(0, POSITIONS_PER_PAGE / 2);
    testUnion(typedSet, block, block);

    block = createLongSequenceBlock(0, POSITIONS_PER_PAGE);
    testUnion(typedSet, createLongSequenceBlock(POSITIONS_PER_PAGE / 2, POSITIONS_PER_PAGE), block);

    // Test adding a block with null
    Block blockWithNull = block.appendNull();
    testUnion(typedSet, blockWithNull, blockWithNull);
}

To have even more test coverage, I added a new test in TestOptimizedTypedSet:

@Test
public void testNulls()
{
    OptimizedTypedSet typedSet = new OptimizedTypedSet(BIGINT, POSITIONS_PER_PAGE + 1, FUNCTION_NAME);

    // Empty block
    Block emptyBlock = createLongSequenceBlock(0, 0);

    testUnion(typedSet, emptyBlock, emptyBlock);

    // Block with a single null
    Block singleNullBlock = emptyBlock.appendNull();

    testUnion(typedSet, singleNullBlock, singleNullBlock);
    testIntersect(typedSet, singleNullBlock, singleNullBlock);
    testIntersect(typedSet, emptyBlock, emptyBlock);
    testExcept(typedSet, singleNullBlock, singleNullBlock);
    testIntersect(typedSet, emptyBlock, emptyBlock);

    // Block with a 0, and block with a 0 and a null
    Block zero = createLongSequenceBlock(0, 1);
    Block zeroAndNull = zero.appendNull();

    testUnion(typedSet, zero, zero);
    testUnion(typedSet, singleNullBlock, zeroAndNull);
    testIntersect(typedSet, zero, zero);
    testExcept(typedSet, singleNullBlock, singleNullBlock);
    testUnion(typedSet, zero, zeroAndNull);
}

@yingsu00 (Contributor, Author) commented Nov 5, 2020

Secondly, we are creating many arrays in these functions. I wonder if that will cause some memory churn causing gc issues.

@kaikalur It may appear so but it's not as simple as that.

We discussed array_union last time, so let's look at array_intersect and array_except. Taking array_intersect as an example, the original implementation requires creating two TypedSets and one external BlockBuilder, and each TypedSet contains an internal BlockBuilder:

TypedSet rightTypedSet = new TypedSet(type, rightPositionCount, "array_intersect");
for (int i = 0; i < rightPositionCount; i++) {
    rightTypedSet.add(rightArray, i);
}

BlockBuilder blockBuilder = type.createBlockBuilder(null, Math.min(leftPositionCount, rightPositionCount));

// The intersected set can have at most rightPositionCount elements
TypedSet intersectTypedSet = new TypedSet(type, blockBuilder, rightPositionCount, "array_intersect");
for (int i = 0; i < leftPositionCount; i++) {
    if (rightTypedSet.contains(leftArray, i)) {
        intersectTypedSet.add(leftArray, i);
    }
}

But the new implementation builds only one OptimizedTypedSet and does not use any external BlockBuilder:

OptimizedTypedSet typedSet = new OptimizedTypedSet(type, rightPositionCount, "array_intersect");
typedSet.union(rightArray);
typedSet.intersect(leftArray);

Let's look at retained size first.
If we print the retained size for array size 100, the original one is 5180 bytes while the new one is 2592:

Baseline 5180

  • rightTypedSet = 2100
    • INSTANCE_SIZE = 72
    • INT_ARRAY_LIST_INSTANCE_SIZE = 24
    • elementBlock.getRetainedSizeInBytes() = 980
    • blockPositionByHash.size() * Integer.BYTES = 1024
  • intersectTypedSet = 2100
    • Same as above
  • blockBuilder = 980

  (72 + 24 + 980 + 1024 = 2100 per TypedSet; 2100 × 2 + 980 = 5180)

Using OptimizedTypedSet 2592

  • INSTANCE_SIZE = 72
  • ARRAY_LIST_INSTANCE_SIZE = 24
  • sizeOf(blockPositionByHash) = 2064
  • positionsForBlocks = 432

Even though blockPositionByHash is larger because it's using long[] instead of int[], the total retained size is still much smaller.

Now let's look at allocations. In fact, almost all allocation metrics for the new implementation are better than the old one's for these functions, as shown in the GC profile.

  • The original implementation has 3 BlockBuilders. These BlockBuilders all do memory allocation and copying for their internal arrays like values[], isNull[], offsets[], etc. The new implementation only allocates one positions array for each set operation, not the real block content.
  • The original implementation creates two TypedSets, which means two hashtables, the same as the OptimizedTypedSet.
  • The new OptimizedTypedSet creates a positions array of size positionCount each time. This size may be larger than required because of the dedup effect. Here I have an idea to reuse the existing positions array from the last set operation, which works because the previous Blocks and blockPositions are no longer needed (a sketch of the assumed helper appears after the screenshots below). I.e., change
int[] positions = new int[positionCount];

To

int[] positions = ensureCapacity(positionsForBlocks.get(currentBlockIndex - 1).getPositions(), positionCount);

This reduced the allocation rate from 6038 bytes/op to 5633 bytes/op and further reduced the CPU and elapsed cost by about 15%.
The allocation rate and GC metrics are actually much better than the original's:

[Two screenshots: JMH GC-profile comparison for array_intersect, Nov 5 2020]
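
For reference, a minimal sketch of the ensureCapacity helper assumed by the snippet above (the real utility may differ; contents need not be preserved here because the positions array is fully overwritten by the next set operation):

// Reuse the array when it is already big enough; otherwise allocate a larger one.
static int[] ensureCapacity(int[] array, int capacity)
{
    if (array == null || array.length < capacity) {
        return new int[capacity];
    }
    return array;
}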

@yingsu00 force-pushed the TypedSet branch 2 times, most recently from f35720d to 13ed5a5 on November 6, 2020 02:10
*/
public void union(Block block)
{
int positionCount = block.getPositionCount();
@sujay-jain (Member) commented:
let's move this assignment to L90 where it is being used

*/
public void intersect(Block block)
{
int positionCount = block.getPositionCount();
@sujay-jain (Member) commented:
let's move this inside the else to keep the scope small

@sujay-jain (Member) commented:
perhaps we don't even need this variable. Can inline it into the loop directly

@yingsu00 (Contributor, Author) replied:
perhaps we don't even need this variable. Can inline it into the loop directly

positionCount is used in the performance-critical for loop. I'm sure the function call will be inlined here, but it may still be an indirect memory access through the block pointer. Besides, it's used in two different places, and having a variable doesn't reduce readability. I'd prefer keeping the variable. I did move the declaration inside the else block.

int positionCount = block.getPositionCount();

currentBlockIndex++;
ensureBlocksCapacity(currentBlockIndex + 1);
@sujay-jain (Member) commented Nov 11, 2020:
perhaps it is cleaner to do the +1 inside the ensureCapacity instead of here

@yingsu00 (Contributor, Author) replied:
@sujay-jain The ensureCapacity function, by its name and by numerous uses of similar functions, implies that it ensures the object's size is the value of the input capacity parameter. Here the capacity to be ensured is currentBlockIndex + 1, because the size of an array is its largest index plus 1. If we move the +1 logic inside the function, the function would then ensure the capacity of the object array to be capacity + 1, which is weird. So I think it's better to keep it this way.

@sujay-jain (Member) replied:
makes sense. I missed the part that we were passing in the index

int positionWithinBlock = (int) (blockPosition & 0xffff_ffff);
if (positionEqualsPosition(elementType, blocks[blockIndex], positionWithinBlock, block, position)) {
return false;
}
@sujay-jain (Member) commented:
the else part can be extracted into a function since getInsertPosition is also doing the same thing more or less.

@yingsu00 (Contributor, Author) replied Nov 12, 2020:
@sujay-jain I understand they share lots of similarities, but there are some subtle differences. addElement returns boolean while getInsertPosition returns int. addElement does insert the element into the hashtable, but getInsertPosition doesn't. The reason for having both was to avoid the expensive hash value calculation when there are multiple hash tables, so that the hash value calculated on the first hash table can be reused by the second. I could change addElement to return an int, but then the callsite would become something like this:

if (addElement(newBlockPositionByHash, hash, block, i) == INVALID_POSITION) {
    positions[positionsIndex++] = i;
}

This is not as easy to understand as returning a boolean. It is common in standard libraries for a function that adds a new element to a set to return a boolean rather than an int. What do you think?
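
For context on the snippets in this thread: each hashtable slot packs a block index and a position into a single long, which is what the shift and mask in the diffs above decode. A minimal sketch of the convention:

// Pack: high 32 bits hold the block index, low 32 bits hold the position.
long entry = ((long) blockIndex << 32) | position;

// Unpack (as in the diff above):
int decodedBlockIndex = (int) (entry >>> 32);
int decodedPosition = (int) (entry & 0xffff_ffff);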

union(block);
return;
}
else {
@sujay-jain (Member) commented:
There's common functionality between except and intersect, e.g. the for loop. Can we try to extract the common things into functions - it makes the code more readable and reduces the chances of bugs :)

@yingsu00 (Contributor, Author) replied:

@sujay-jain I understand they look similar, but the logic is different. intersect has

if (positionInBlockPositionByHash == INVALID_POSITION) {
   // Add the new element
}

But the except has

if (positionInBlockPositionByHash != INVALID_POSITION) {
   // Add the new element
}

This makes it hard to extract a common method. The only part that can be extracted is

if (addElement(hashtable, hash, block, position)) {
    positions[positionsIndex++] = position;
}

I tried extracting it into a private method:

private int addElementIfNotExists(Block block, int position, int[] positions, long[] hashtable, int hash, int positionsIndex)
{
    if (addElement(hashtable, hash, block, position)) {
        positions[positionsIndex++] = position;
    }
    return positionsIndex;
}

And now the callsites become
intersect:

for (int i = 0; i < positionCount; i++) {
    int hash = getMaskedHash(hashPosition(elementType, block, i));
    int positionInBlockPositionByHash = getInsertPosition(blockPositionByHash, hash, block, i);
    if (positionInBlockPositionByHash == INVALID_POSITION) {
        // add to the hash table if it exists in blockPositionByHash
        positionsIndex = addElementIfNotExists(block, i, positions, newBlockPositionByHash, hash, positionsIndex);
    }
}

except:

for (int i = 0; i < positionCount; i++) {
    int hash = getMaskedHash(hashPosition(elementType, block, i));
    int positionInBlockPositionByHash = getInsertPosition(blockPositionByHash, hash, block, i);

    // add to the hash table if it does not exist in blockPositionByHash
    if (positionInBlockPositionByHash != INVALID_POSITION) {
        positionsIndex = addElementIfNotExists(block, i, positions, newBlockPositionByHash, hash, positionsIndex);
    }
}

Now you have to jump and scroll down to check what addElementIfNotExists does. Its six parameters make it hard to know what's done inside. Also, it mixes operations on the hash table with operations on the positions array, which are two different objects in the class serving different purposes. It doesn't save lines either: the two lines that got extracted into a function just add inconvenience for the readers.

Combining these considerations, I think keeping it as it was may be better. If you see other ways of doing this, please let me know!

@sujay-jain (Member) replied:
ok, I think we can leave it as is if you feel that's cleaner. Thanks for trying it.

@yingsu00 (Contributor, Author) commented:
@sujay-jain Thank you very much for reviewing! I addressed your comments. Would you like to take another look?

@sujay-jain (Member) left a comment:
LGTM

@yingsu00 (Contributor, Author) commented:
@sujay-jain thanks again for reviewing this PR. @mbasmanova Will you be able to review it for the second round?

@mbasmanova (Contributor) commented:
@yingsu00 This is a rather large PR. I don't believe I have bandwidth to review it.

@rongrong (Contributor) left a comment:
Mostly looks good. Only some nits

hashtable[hashPosition] = ((long) currentBlockIndex << 32) | position;
return true;
}
else {
@rongrong (Contributor) commented:
else is not necessary.

@@ -139,42 +142,36 @@ public static Block mapConcat(MapType mapType, Object state, Block[] maps)
return maps[lastMapIndex];
}

Type keyType = mapType.getKeyType();
Type valueType = mapType.getValueType();
OptimizedTypedSet typedSet = new OptimizedTypedSet(keyType, maps.length, entries / 2, FUNCTION_NAME);
@rongrong (Contributor) commented:
entries / 2 is very confusing to me. Maybe when computing entries on line 135, just divide by 2 there?

@yingsu00 (Contributor, Author) replied:

Agreed, it's confusing. This is because SingleMapBlock counts the positionCount for both the key and value blocks. But line 135 is inside the loop, and I'd rather not add more computation there. I added the following comment:

// Divide entries by 2 because the maps array holds SingleMapBlocks, whose positionCount covers both keys and values and is therefore twice that of a normal Block
OptimizedTypedSet typedSet = new OptimizedTypedSet(keyType, maps.length, entries / 2, FUNCTION_NAME);
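
For example (hypothetical numbers): if the input maps together hold 50 key-value pairs, the loop accumulates entries = 100, since each SingleMapBlock reports one position per key and one per value; the distinct-key set can therefore hold at most entries / 2 = 50 elements.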

checkArgument(maxPositionCount >= 0, "maxPositionCount must not be negative");

this.elementType = requireNonNull(elementType, "elementType must not be null");
this.functionName = functionName;
@rongrong (Contributor) commented:
What is this for?

@yingsu00 (Contributor, Author) replied:

@rongrong It was carried over from TypedSet, where it was used in the error message for EXCEEDED_FUNCTION_MEMORY_LIMIT. But the new implementation removed this limit, so we don't need functionName anymore. I have removed it. Thank you for the catch!

@@ -92,7 +92,7 @@
private String name = "map_concat";

@Param({"left_empty", "right_empty", "both_empty", "non_empty"})
private String mapConfig = "left_empty";
private String mapConfig = "non_empty";
@rongrong (Contributor) commented:
Is this change intended?

@yingsu00 (Contributor, Author) replied:

Yes, it is intended. left_empty means the left input array is empty, which is not the case we care about most.

@yingsu00 force-pushed the TypedSet branch 2 times, most recently from f9bd8b9 to 6526b0a on November 14, 2020 02:47
@wenleix (Contributor) left a comment:
The high-level idea (e.g. keep the block and position in OptimizedTypedSet instead of copying by value) makes sense.

Curious if the OptimizedTypedSet supports different operations in a row? (e.g. first do union, then do intersect).

SelectedPositions selectedPositions = getPositionsForBlocks().get(i);
int positionCount = selectedPositions.size();

if (!selectedPositions.isList()) {
@wenleix (Contributor) commented Nov 17, 2020:
curious: when would this happen? (in that case we will return the current block positions)

@yingsu00 (Contributor, Author) replied:

curious: when would this happen? (in that case we will return the current block positions)

@wenleix Nice catch. Right now there isn't one. I put the code there in case there are cases where the whole block (without duplicates) is added. Do you think I should remove it?

@yingsu00 (Contributor, Author) replied:
curious: when would this happen? (in that case we will return the current block positions)

@wenleix I removed the if (!selectedPositions.isList()) {} code block. Thank you for the catch. Will you be able to take another look? Thank you!

@yingsu00 (Contributor, Author) commented Nov 18, 2020

Curious if the OptimizedTypedSet supports different operations in a row? (e.g. first do union, then do intersect).

@wenleix Thanks for reviewing! Yes, it does support different operations. Example:

@Test
public void testMultipleOperations()
{
    OptimizedTypedSet typedSet = new OptimizedTypedSet(BIGINT, POSITIONS_PER_PAGE + 1);

    Block block = createLongSequenceBlock(0, POSITIONS_PER_PAGE / 2).appendNull();

    testUnion(typedSet, block, block);
    testIntersect(typedSet, block, block);

    block = createLongSequenceBlock(POSITIONS_PER_PAGE / 2, POSITIONS_PER_PAGE);
    testExcept(typedSet, block.appendNull(), block);
    testExcept(typedSet, createLongSequenceBlock(0, POSITIONS_PER_PAGE), createLongSequenceBlock(0, POSITIONS_PER_PAGE / 2));

    testUnion(typedSet, block, createLongSequenceBlock(0, POSITIONS_PER_PAGE));
    testIntersect(typedSet, createEmptyBlock(BIGINT).appendNull(), createEmptyBlock(BIGINT));
}

@yingsu00 (Contributor, Author) commented:

The verifier has passed https://www.internalfb.com/intern/presto/verifier/results/?test_id=47678 cc @kaikalur

@mbasmanova (Contributor) commented:

The verifier has passed https://www.internalfb.com/intern/presto/verifier/results/?test_id=47678 cc @kaikalur

@yingsu00 Ying, I remember that verifier excludes potentially non-deterministic queries, e.g. queries with various map and array functions. How much coverage does this run have for the functions affected in this diff?

@yingsu00 (Contributor, Author) replied:

The verifier has passed https://www.internalfb.com/intern/presto/verifier/results/?test_id=47678 cc @kaikalur

@yingsu00 Ying, I remember that verifier excludes potentially non-deterministic queries, e.g. queries with various map and array functions. How much coverage does this run have for the functions affected in this diff?

@mbasmanova I built the suite to contain only queries with these functions.

Ying Su added 5 commits November 19, 2020 20:15
JMH benchmark shows 1.82x improvement for non_empty case when keyCount=100
and POSITIONS=1000:

Baseline
Benchmark                     Mode  Cnt      Score      Error  Units
BenchmarkMapConcat.mapConcat  avgt   20  26710.925 ± 2005.756  ns/op
Retained Size: 1,402,374 bytes

After
Benchmark                     Mode  Cnt      Score      Error  Units
BenchmarkMapConcat.mapConcat  avgt   20  14605.437 ± 1209.786  ns/op
Retained Size: 1,373,273 bytes

When keyCount=1000 and POSITIONS=1000, the baseline just OOMed. The
optimized version succeeded.

Add different sizes to BenchmarkMapConcat
JMH benchmark shows up to 40% improvement:

Type       | Baseline  | Specialized Baseline | OptimizedTypedSet | Gain%
-----------|-----------|----------------------|-------------------|----------
BIGINT     |      5511 |                 3742 |            3320   |      40%
VARCHAR    |     20414 |                 N/A  |            14155  |      31%
JMH benchmark shows 40% improvement:

Before
Benchmark                               Mode  Cnt       Score        Error  Units
BenchmarkArrayIntersect.arrayIntersect  avgt   10  618074.452 ± 119912.203  ns/op

After
Benchmark                               Mode  Cnt       Score       Error  Units
BenchmarkArrayIntersect.arrayIntersect  avgt   10  376854.064 ± 21616.063  ns/op
JMH benchmark shows 35% improvement:

Before
Benchmark                               Mode  Cnt       Score       Error  Units
BenchmarkArrayIntersect.arrayIntersect  avgt   10  540349.423 ± 66298.751  ns/op

After
Benchmark                               Mode  Cnt       Score       Error  Units
BenchmarkArrayIntersect.arrayIntersect  avgt   10  350934.564 ± 34092.598  ns/op
@yingsu00 (Contributor, Author) commented Nov 20, 2020

@kaikalur I made BenchmarkMapConcat take multiple keySet sizes. Both retained size and allocation were improved. When keyCount=1000 and POSITIONS=1000, the baseline just OOMed, but the optimized version succeeded. I think it's safe to say the optimized version is both more memory and CPU efficient.

JMH benchmark shows 82% improvement for non_empty case for map_concat when keyCount=100
and POSITIONS=1000:

    Baseline

    Benchmark                     Mode  Cnt      Score      Error  Units
    BenchmarkMapConcat.mapConcat  avgt   20  26710.925 ± 2005.756  ns/op
    Retained Size: 1,402,374 bytes

    After
    Benchmark                     Mode  Cnt      Score      Error  Units
    BenchmarkMapConcat.mapConcat  avgt   20  14605.437 ± 1209.786  ns/op
    Retained Size: 1,373,273 bytes

Do you have more questions about this PR?

cc @rongrong

@wenleix (Contributor) left a comment:
LGTM. I only reviewed the general logic for TypedSet and the map_concat and array_union implementations -- the other function implementations are straightforward, and @rongrong, @sujay-jain, and @kaikalur have already provided careful and in-depth reviews.

As a side note -- for a separate PR, it might also be worth considering optimizing some of the functions (e.g. array functions) to use the PROVIDED_BLOCKBUILDER convention to avoid one extra copy, as is done in #13874 (essentially, instead of returning the block representing the array, the callee appends the result into a BlockBuilder provided by the caller). It might only provide a moderate speedup, but if it's easy to implement it's probably still good to have? :)

@wenleix wenleix merged commit 1e00009 into prestodb:master Nov 30, 2020
@wenleix (Contributor) commented Nov 30, 2020

Merged #15362 . Thanks for the contribution!
