KAFKA-4514: Add Codec for ZStandard Compression #2267
Conversation
@@ -69,6 +69,14 @@ public Constructor get() throws ClassNotFoundException, NoSuchMethodException {
        }
    });

    private static MemoizingConstructorSupplier zStd4OutputStreamSupplier = new MemoizingConstructorSupplier(new ConstructorSupplier() {
I have no context on this patch but I'm guessing you have a typo:
currently: zStd4OutputStreamSupplier
probably intended: zStdOutputStreamSupplier
You have a 4 in the variable name.
Oh my. Got it.
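For reference, a minimal sketch of the corrected declaration, following the reflective-supplier pattern in the diff above (the target class and constructor signature here are assumptions, not taken from the patch):

```java
// Sketch only: assumes zstd-jni's ZstdOutputStream is the class being loaded
// reflectively, mirroring how the other codec suppliers in this file look.
private static MemoizingConstructorSupplier zStdOutputStreamSupplier =
        new MemoizingConstructorSupplier(new ConstructorSupplier() {
            @Override
            public Constructor get() throws ClassNotFoundException, NoSuchMethodException {
                return Class.forName("com.github.luben.zstd.ZstdOutputStream")
                        .getConstructor(OutputStream.class);
            }
        });
```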
2 big questions here.
Also, thanks for working on this. Did you have a chance to run any performance comparisons?
Thanks for the PR. To answer @tgravescs's questions:
Performance numbers would definitely help the discussion.
Thanks for your advice. Let me summarize: I will run a benchmark test over the next couple of days and submit a KIP proposal with the results. Let's discuss the following topics there:
I just started configuring the benchmark environment on AWS. Please stay tuned!
This patch adds support for zstandard compression to Kafka as documented in KIP-110: https://cwiki.apache.org/confluence/display/KAFKA/KIP-110%3A+Add+Codec+for+ZStandard+Compression. Reviewers: Ivan Babrou <ibobrik@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
Merged to trunk and 2.1. Thanks again for your persistence! Great contribution!
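As a usage note for anyone landing here later: once on 2.1, enabling the new codec from a producer is a one-line config change. A minimal sketch (broker address and topic are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ZstdProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // "zstd" joins gzip, snappy, and lz4 as a valid codec after this patch.
        props.put("compression.type", "zstd");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"));
        }
    }
}
```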
Thanks for your contribution @dongjinleekr! Is https://twitter.com/dongjinleekr your Twitter handle?
PR #2267 introduced support for Zstandard compression. The relevant test expects values for `num_nodes` and `num_producers` based on the (now-incremented) count of compression types. Passed the affected, previously-failing test: `ducker-ak test tests/kafkatest/tests/client/compression_test.py` Reviewers: Jason Gustafson <jason@confluent.io>
Was a param introduced to adjust the compression level?
I hope so. Half the power of Zstd is its huge range of compression levels on the Pareto frontier.
@dongjinleekr where is the compression level specified? I can't seem to locate it, even in the source. Right now I am still getting better compression with gzip, and want to adjust the level. Thanks.
It would only need to exist and be configured on the producer. It's irrelevant to the consumer, broker, or wire protocol. Therefore it isn't in the KIP or specification, and is not related to compatibility.
@scottcarey while it is an anti-pattern, the broker can compress/recompress messages depending on the topic's `compression.type` configuration, which could be set to `zstd`.
Producer configs have to be in the KIP (any config is considered public API). I don't think there's a way to configure the compression level at this point. If someone wants to contribute that, it would make sense to allow it for other compression types too.
IIRC this was disallowed for zstd.
I may be mistaken. At least some of this was disallowed for zstd when converting for clients that don't have zstd support. That is distinct from broker recompression.
A broker could recompress, and the level at which it does so is a major factor in broker CPU use if so.
Setting the topic config to zstd while not compressing in the producer is allowed for all compression algorithms.
Yes, I got automatic broker down-conversion confused with topic compression. I was under the impression both were disabled for zstd, but only one is.
@scottcarey I wouldn't say it's irrelevant on the broker. We have older producers that don't/won't speak zstd, but our (testing) brokers force zstd on the topic for storage/space requirements. The broker's ability to set the compression level is key in our use case. I just can't figure out where in this PR the level is set - though now I am thinking we are just using the default (3).
@davewat @scottcarey @ijuma Sorry for the late reply. In fact, I already investigated this issue (supporting a compression level for zstd), but I concluded that it would be much better to put it into a separate issue and focus on implementing the ZSTD feature only.
Here is why: all compression codecs (i.e., Gzip, Snappy, LZ4, and ZSTD) support some parameters to change the degree of compression; however, only LZ4 and ZSTD support the concept of a 'level' - in the case of GZIP and Snappy, they take a block size parameter, not a 'level'.
To make the compression level feature available, we must modify the API signatures of MemoryRecordsBuilder to support a compression level, and add some validation logic: check whether the given CompressionCodec supports the concept of a 'level', and whether the given compression level is valid for that CompressionCodec (e.g., ZSTD supports 22 levels but LZ4 supports only 4). It also requires additional modifications to read the ProducerConfig value and pass it into MemoryRecordsBuilder. Of course, this work requires a bunch of modifications and some policies on the various codecs. That is why I decided to put off this issue and use the default level for ZSTD, that is, 3.
What is your opinion? Do you really need it? Does it sound reasonable? If the answer is yes, please file it to Jira with the 'needs-kip' tag; then I will take the issue and make the proposal.
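To make the validation idea concrete, here is a hypothetical sketch using the ranges cited in the comment above (ZSTD: 22 levels, LZ4: 4 levels); neither this class nor these exact checks are part of this patch:

```java
import org.apache.kafka.common.record.CompressionType;

// Hypothetical helper, for illustration only: rejects levels for codecs that
// have no level concept and range-checks the ones that do.
public final class CompressionLevelValidator {
    public static void validate(CompressionType type, int level) {
        switch (type) {
            case ZSTD:
                if (level < 1 || level > 22)
                    throw new IllegalArgumentException("zstd level must be in [1, 22]: " + level);
                break;
            case LZ4:
                if (level < 1 || level > 4)
                    throw new IllegalArgumentException("lz4 level must be in [1, 4]: " + level);
                break;
            default:
                throw new IllegalArgumentException(type + " does not take a compression level");
        }
    }
}
```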
@dongjinleekr, zstd also supports negative levels for faster compression; it's the equivalent of `--fast X` in the CLI.
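For illustration, this is what a negative level looks like against zstd-jni directly; Kafka itself does not expose this parameter in this patch, and the buffer and payload here are made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import com.github.luben.zstd.ZstdOutputStream;

public class NegativeLevelExample {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Negative levels trade compression ratio for speed, like `zstd --fast=5`.
        try (ZstdOutputStream zstd = new ZstdOutputStream(sink, -5)) {
            zstd.write("hello zstd".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("compressed size: " + sink.size());
    }
}
```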
@dongjinleekr our use case needs it, not sure if we are just a corner case. I have created the issue in Jira. Thanks!
Gzip supports levels 1 to 9. The API in Java hides it a bit; it has a very significant impact on CPU use when compressing. I suspect much of the reason that recompressing broker-side has a bad rap is that it was done with the default level of 6. Level 1 is about 10x as fast. Zstd level 3 is faster than that at higher compression ratios, however.
Being able to tune the compression level is very important. Compression is all about CPU-to-I/O tradeoffs, and which tradeoff is best is use-case dependent. Zstd ranges from snappy-like speeds and compression ratios to lzma-like compression ratios. Setting compression levels will get even more important if dictionary support is added.
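For context on how the Java API "hides" the gzip level: `GZIPOutputStream` takes no level argument, but it inherits the protected `Deflater` field `def` from `DeflaterOutputStream`, so a small subclass can expose it. A sketch (not Kafka code):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

// Subclass workaround: set the Deflater level right after construction.
public class ConfigurableGzipOutputStream extends GZIPOutputStream {
    public ConfigurableGzipOutputStream(OutputStream out, int level) throws IOException {
        super(out);
        def.setLevel(level); // 1 = fastest, 9 = best ratio; the zlib default is 6
    }
}
```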
When I added zstd support to the Go library, I also added compression level support; it applies to zstd and gzip.
@scottcarey @luben Thank you for the correction. Right, ZSTD now supports negative compression levels (#1 #2), and GZIP is also able to use compression levels, although this is blocked in the official API - we can make use of it with some workaround. These features should be supported in the implementation. @davewat Thank you for filing the issue. I just updated the issue with the comments here and am now working on the KIP. I will let you know when I complete the document and open the discussion thread. @bobrik Great. Sarama always shows us the direction! I will include Sarama's case in the KIP.
@davewat @scottcarey @eliaslevy @ijuma I just opened the discussion thread in the dev mailing list. Let's continue the discussion there.
@dongeforever, just released zstd-jni-1.3.7-2 with support for querying min/max compression levels.
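A quick sketch of querying that range with the new zstd-jni helpers:

```java
import com.github.luben.zstd.Zstd;

public class LevelRangeCheck {
    public static void main(String[] args) {
        // Available as of zstd-jni 1.3.7-2: the supported compression level range.
        System.out.println("min level: " + Zstd.minCompressionLevel());
        System.out.println("max level: " + Zstd.maxCompressionLevel());
    }
}
```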
Hello. This PR resolves KAFKA-4514: Add Codec for ZStandard Compression. Please have a look when you are free. Since I am a total newbie to Apache Kafka, feel free to point out any deficiencies.
In addition to the feature itself, I have a question: should we support an option for the ZStandard compression level?
According to the ZStandard official documentation, it supports compression levels 1 ~ 22. Because of that, Hadoop added a new configuration option named "io.compression.codec.zstd.level", whose default value is 3. In this PR, I configured the compression level to 1 as a temporary measure, but I am wondering about the following problems:
I am looking forward to your advice. Thanks.