[improve][client][PIP-389] Add a producer config to improve compression performance #23525

Open

wants to merge 2 commits into master
Conversation

@liangyepianzhou (Contributor) commented on Oct 29, 2024:

PIP: #23526

Motivation

The motivation of this PIP is to improve compression performance by skipping compression for small messages.
We want to add a new configuration, compressMinMsgBodySize, to the producer configuration.
This configuration lets the user set the minimum message body size that will be compressed.
If the message body is smaller than compressMinMsgBodySize, the message will not be compressed.
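As a sketch of the intended usage (the builder method name is assumed here to mirror the proposed config name and may differ in the final API):

```java
import org.apache.pulsar.client.api.CompressionType;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .build();

Producer<byte[]> producer = client.newProducer()
        .topic("my-topic")
        .compressionType(CompressionType.LZ4)
        // Proposed option (name assumed from the PIP): bodies smaller than
        // 4 KiB are sent uncompressed, since compressing tiny payloads can
        // cost more CPU than the bandwidth it saves.
        .compressMinMsgBodySize(4 * 1024)
        .create();
```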

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick one of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

liangyepianzhou marked this pull request as draft on October 29, 2024 08:34
github-actions bot added the doc-required label ("Your PR changes impact docs and you will update later.") and removed the doc-label-missing label on Oct 29, 2024
liangyepianzhou changed the title from "[improve][client] Add a producer config to improve compaction performance" to "[improve][client] Add a producer config to improve compression performance" on Oct 29, 2024
@lhotari (Member) commented on Oct 30, 2024:

Please add the PIP number to the PR title as we usually do.

BewareMyPower changed the title from "[improve][client] Add a producer config to improve compression performance" to "[improve][client][PIP-389] Add a producer config to improve compression performance" on Oct 30, 2024
@lhotari (Member) commented on Oct 30, 2024:

@liangyepianzhou Regarding performance optimizations for compression in Pulsar, there is other work that should be done as well.
For example, for gzip compression/decompression, this implementation is very inefficient:

```java
public ByteBuf encode(ByteBuf source) {
    byte[] array;
    int length = source.readableBytes();
    int sizeEstimate = (int) Math.ceil(source.readableBytes() * 1.001) + 14;
    ByteBuf compressed = PulsarByteBufAllocator.DEFAULT.heapBuffer(sizeEstimate);
    int offset = 0;
    if (source.hasArray()) {
        array = source.array();
        offset = source.arrayOffset() + source.readerIndex();
    } else {
        // If it's a direct buffer, we need to copy it
        array = new byte[length];
        source.getBytes(source.readerIndex(), array);
    }
    Deflater deflater = this.deflater.get();
    deflater.reset();
    deflater.setInput(array, offset, length);
    while (!deflater.needsInput()) {
        deflate(deflater, compressed);
    }
    return compressed;
}
```
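For comparison, here is a rough sketch of a copy-avoiding variant using the `ByteBuffer` overloads that `java.util.zip.Deflater` has offered since Java 11 (this is not the current Pulsar code; output-buffer growth and error handling are simplified):

```java
import java.nio.ByteBuffer;
import java.util.zip.Deflater;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public ByteBuf encodeWithoutArrayCopy(ByteBuf source, Deflater deflater) {
    int length = source.readableBytes();
    // Same size estimate as the existing code; a real implementation would
    // grow the output buffer when the estimate turns out to be too small.
    int sizeEstimate = (int) Math.ceil(length * 1.001) + 14;
    ByteBuf compressed = PooledByteBufAllocator.DEFAULT.directBuffer(sizeEstimate);
    deflater.reset();
    // nioBuffers() exposes the underlying buffers of a (possibly composite)
    // ByteBuf as views, instead of merging them into a temporary byte[].
    for (ByteBuffer chunk : source.nioBuffers(source.readerIndex(), length)) {
        deflater.setInput(chunk); // Java 11+ overload: no heap array copy
        while (!deflater.needsInput() && compressed.isWritable()) {
            drainTo(deflater, compressed);
        }
    }
    deflater.finish();
    while (!deflater.finished() && compressed.isWritable()) {
        drainTo(deflater, compressed);
    }
    return compressed;
}

private void drainTo(Deflater deflater, ByteBuf compressed) {
    // Deflate directly into a view of the output buffer's writable region.
    ByteBuffer out = compressed.nioBuffer(compressed.writerIndex(), compressed.writableBytes());
    compressed.writerIndex(compressed.writerIndex() + deflater.deflate(out));
}
```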
Another detail is that the current implementation isn't using the "zero copy" approaches that are available.
For example, in Snappy:
```java
@Override
public ByteBuf encode(ByteBuf source) {
    int uncompressedLength = source.readableBytes();
    int maxLength = Snappy.maxCompressedLength(uncompressedLength);
    ByteBuffer sourceNio = source.nioBuffer(source.readerIndex(), source.readableBytes());
    ByteBuf target = PooledByteBufAllocator.DEFAULT.buffer(maxLength, maxLength);
    ByteBuffer targetNio = target.nioBuffer(0, maxLength);
    int compressedLength = 0;
    try {
        compressedLength = Snappy.compress(sourceNio, targetNio);
    } catch (IOException e) {
        log.error("Failed to compress to Snappy: {}", e.getMessage());
    }
    target.writerIndex(compressedLength);
    return target;
}

@Override
public ByteBuf decode(ByteBuf encoded, int uncompressedLength) throws IOException {
    ByteBuf uncompressed = PooledByteBufAllocator.DEFAULT.buffer(uncompressedLength, uncompressedLength);
    ByteBuffer uncompressedNio = uncompressed.nioBuffer(0, uncompressedLength);
    ByteBuffer encodedNio = encoded.nioBuffer(encoded.readerIndex(), encoded.readableBytes());
    Snappy.uncompress(encodedNio, uncompressedNio);
    uncompressed.writerIndex(uncompressedLength);
    return uncompressed;
}
```

In BookKeeper, I added zero-copy checksum calculation in apache/bookkeeper#4196. The ByteBufVisitor approach could be used to avoid copying source buffers into an extra NIO buffer.
Calling Netty's io.netty.buffer.CompositeByteBuf#nioBuffer allocates a new NIO ByteBuffer on the heap and copies the content there. That's not great from a performance perspective, especially when we want to reduce allocations and garbage. With the ByteBufVisitor approach it's possible to read the source's direct byte buffers without extra copies.
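As a toy illustration of that copy (not from the Pulsar code base): `nioBuffer()` on a composite with several components merges them into one newly allocated `ByteBuffer`, while `nioBuffers()` returns views over the underlying memory:

```java
import java.nio.ByteBuffer;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

static void demo() {
    CompositeByteBuf composite = ByteBufAllocator.DEFAULT.compositeBuffer();
    composite.addComponents(true,
            Unpooled.directBuffer(64).writeZero(64),
            Unpooled.directBuffer(64).writeZero(64));

    // Allocates a fresh ByteBuffer and copies both components into it.
    ByteBuffer merged = composite.nioBuffer(0, composite.readableBytes());

    // Returns views over the two underlying direct buffers, without copying.
    ByteBuffer[] views = composite.nioBuffers(0, composite.readableBytes());

    composite.release();
}
```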
Have you considered addressing this performance issue in the Pulsar message compression solution?

@liangyepianzhou (Contributor, Author) commented:
> Have you considered addressing this performance issue in the Pulsar message compression solution?

Sounds good, maybe I can try optimizing it in other PRs.

@lhotari (Member) commented on Oct 30, 2024:

> > Have you considered addressing this performance issue in the Pulsar message compression solution?
>
> Sounds good, maybe I can try optimizing it in other PRs.

+1. In the Pulsar code base, we have a dedicated module, microbench, for microbenchmarks with JMH: https://github.com/apache/pulsar/tree/master/microbench. Testing performance with JMH could be useful for such improvements.
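As an illustration, a minimal JMH benchmark over the client's codecs could look roughly like this (a sketch, not an existing benchmark in the microbench module; it assumes the usual CompressionCodecProvider entry point in pulsar-common):

```java
import java.util.concurrent.ThreadLocalRandom;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import org.apache.pulsar.common.api.proto.CompressionType;
import org.apache.pulsar.common.compression.CompressionCodec;
import org.apache.pulsar.common.compression.CompressionCodecProvider;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
public class CompressionCodecBenchmark {

    // Small sizes matter for this PIP: they show where compression stops paying off.
    @Param({"128", "1024", "65536"})
    int payloadSize;

    CompressionCodec codec;
    ByteBuf payload;

    @Setup
    public void setup() {
        codec = CompressionCodecProvider.getCompressionCodec(CompressionType.ZLIB);
        byte[] data = new byte[payloadSize];
        ThreadLocalRandom.current().nextBytes(data);
        payload = Unpooled.wrappedBuffer(data);
    }

    @Benchmark
    public int encode() {
        // Assumes encode() reads from readerIndex without consuming the source.
        ByteBuf compressed = codec.encode(payload);
        int size = compressed.readableBytes();
        compressed.release();
        return size;
    }
}
```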

@liangyepianzhou (Contributor, Author) commented:
> > > Have you considered addressing this performance issue in the Pulsar message compression solution?
> >
> > Sounds good, maybe I can try optimizing it in other PRs.
>
> +1. In the Pulsar code base, we have a dedicated module, microbench, for microbenchmarks with JMH: https://github.com/apache/pulsar/tree/master/microbench. Testing performance with JMH could be useful for such improvements.

Thanks for the reminder.
