[fix][client] Fix client side memory leak when call MessageImpl.create and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram #22393

Merged

9 commits merged into apache:master on Apr 7, 2024

Conversation

@poorbarcode (Contributor) commented Apr 2, 2024

Motivation

Issue 1: the logs are shown below [1].

Issue 2:

[1]:

2024-04-02T14:37:27,801 - WARN  - [pulsar-client-io-35-4:ProducerImpl] - [persistent://my-property/my-ns/tp-7c4d35f4-0c0b-4415-b565-e88722b4879a] [test-0-0] Got exception while completing the callback for msg 1:
java.lang.NullPointerException: Cannot invoke "io.netty.buffer.ByteBuf.release()" because the return value of "org.apache.pulsar.client.impl.MessageImpl.getDataBuffer()" is null
	at org.apache.pulsar.client.impl.ProducerImpl$1.sendComplete(ProducerImpl.java:421) ~[classes/:?]
	at org.apache.pulsar.client.impl.ProducerImpl$OpSendMsg.sendComplete(ProducerImpl.java:1594) ~[classes/:?]
	at org.apache.pulsar.client.impl.ProducerImpl.ackReceived(ProducerImpl.java:1277) ~[classes/:?]
	at org.apache.pulsar.client.impl.ClientCnx.handleSendReceipt(ClientCnx.java:485) ~[classes/:?]
	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:236) ~[classes/:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.handler.flush.FlushConsolidationHandler.channelRead(FlushConsolidationHandler.java:152) ~[netty-handler-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.108.Final.jar:4.1.108.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.108.Final.jar:4.1.108.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

The scope that the issue affects

Regarding issue 2 described in Motivation: it happens when batch sending is enabled (without calling the API ProducerBase.sendAsync(Message)). However, it will not cause an OOM, because every message payload created through the public API is built this way:

  • ByteBuffer jdkBuffer = ByteBuffer.wrap( byte[] )
  • ByteBuf nettyByteBuf = Unpooled.wrappedBuffer(jdkBuffer)

So there is no memory leak: the unpooled, wrapped ByteBuf does not cause issues even if it is never released.
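For illustration, here is a minimal, self-contained sketch of that construction path (the class and variable names are illustrative, not the actual Pulsar internals). The point is that the ByteBuf merely wraps a heap byte[], so an unreleased buffer is reclaimed by the JVM GC instead of leaking pooled memory:

import java.nio.ByteBuffer;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class WrappedPayloadSketch {
    public static void main(String[] args) {
        byte[] data = "hello".getBytes();

        // Public-API path: wrap the user's byte[] without copying.
        ByteBuffer jdkBuffer = ByteBuffer.wrap(data);
        ByteBuf nettyByteBuf = Unpooled.wrappedBuffer(jdkBuffer);

        // The wrapped buffer is backed by the heap array; it owns no pooled
        // (direct) memory, so a missing release() cannot exhaust Netty's
        // buffer pool -- the array is simply garbage-collected later.
        System.out.println(nettyByteBuf.readableBytes()); // prints 5
        nettyByteBuf.release(); // releasing is still good practice
    }
}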

Note: if someone builds messages via public static MessageImpl<T> create(ByteBuf payload) and enables batch sending, they will encounter issue 2. However, we do not provide a public API that sends a message typed org.apache.pulsar.client.api.Message, so this is not a concern in practice. Highlight: it only affects users who use ProducerImpl or ProducerBase directly.

Modifications

  • Fix the memory leak caused by issue 1.
  • Fix the inaccurate metrics (see the sketch below):
    • pendingMessagesUpDownCounter
    • pendingBytesUpDownCounter
    • latencyHistogram
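As a rough illustration of the accounting these metrics need (an OpenTelemetry-style sketch with illustrative names, not the actual Pulsar ProducerImpl code): the pending counters should be incremented exactly once when a message is queued and decremented exactly once when its send completes, whether it succeeds or fails, and the latency should be recorded at that same completion point.

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongUpDownCounter;
import io.opentelemetry.api.metrics.Meter;

class PendingSendMetricsSketch {
    private final LongUpDownCounter pendingMessages;
    private final LongUpDownCounter pendingBytes;
    private final DoubleHistogram latency;

    PendingSendMetricsSketch(OpenTelemetry otel) {
        Meter meter = otel.getMeter("producer-metrics-sketch");
        pendingMessages = meter.upDownCounterBuilder("producer.pending.messages").build();
        pendingBytes = meter.upDownCounterBuilder("producer.pending.bytes").build();
        latency = meter.histogramBuilder("producer.send.latency").setUnit("s").build();
    }

    // Called exactly once when a message is enqueued for sending.
    void onEnqueue(long msgSize) {
        pendingMessages.add(1);
        pendingBytes.add(msgSize);
    }

    // Called exactly once per message when the send completes -- on success
    // AND on failure -- so the up/down counters never drift.
    void onComplete(long msgSize, long enqueuedAtNanos) {
        pendingMessages.add(-1);
        pendingBytes.add(-msgSize);
        latency.record((System.nanoTime() - enqueuedAtNanos) / 1e9);
    }
}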

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode added the type/bug, category/reliability, release/2.10.7, release/2.11.5, release/3.2.3, and release/3.0.5 labels on Apr 2, 2024
@poorbarcode poorbarcode added this to the 3.3.0 milestone Apr 2, 2024
@poorbarcode poorbarcode self-assigned this Apr 2, 2024
The github-actions bot added the doc-not-needed label on Apr 2, 2024
@merlimat (Contributor) left a comment

Nice catch!

Is there any condition in which this was getting triggered from the public APIs?

@lhotari lhotari changed the title [fix] [client] client side memory leak when call ProducerBase.sendAsync(Message) [fix][client] client side memory leak when call ProducerBase.sendAsync(Message) Apr 2, 2024
@dao-jun (Member) left a comment

lgtm

@dao-jun (Member) commented Apr 2, 2024

Just a minor comment:

Will

if (ReferenceCountUtil.refCnt(payload) > 0) {
   ReferenceCountUtil.release(payload);
} 

be better?

Double releasing a reference-counted object throws an exception, whereas ReferenceCountUtil.safeRelease(...) only prints a WARN log.
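For context, a minimal sketch of the two release styles being discussed (assuming Netty's ReferenceCountUtil; this is not the actual PR diff): release() on an already-released buffer throws IllegalReferenceCountException, while safeRelease() swallows the error and logs a warning.

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.ReferenceCountUtil;

public class ReleaseSketch {
    public static void main(String[] args) {
        ByteBuf payload = Unpooled.wrappedBuffer("hello".getBytes());

        // Guarded release, as suggested above: only release while refCnt > 0.
        if (ReferenceCountUtil.refCnt(payload) > 0) {
            ReferenceCountUtil.release(payload);
        }

        // A second plain release() would throw IllegalReferenceCountException;
        // safeRelease() catches it and logs a WARN instead, which is the
        // signal the PR author wants to keep so double releases stay visible.
        ReferenceCountUtil.safeRelease(payload);
    }
}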

@lhotari (Member) commented Apr 2, 2024

Is there any condition in which this was getting triggered from the public APIs?

@poorbarcode Please answer this question. I'm wondering about the valid usages of this API. Is it used internally in Pulsar, since this is an internal API? How did you run into the problem?

@poorbarcode (Contributor, Author) commented Apr 2, 2024

@merlimat

Is there any condition in which this was triggered by the public APIs?

Regarding issue 2 described in Motivation: it happens when batch sending is enabled (without calling the API ProducerBase.sendAsync(Message)). However, it will not cause an OOM, because every message payload created through the public API is built this way:

  • ByteBuffer jdkBuffer = ByteBuffer.wrap( byte[] )
  • ByteBuf nettyByteBuf = Unpooled.wrappedBuffer(jdkBuffer)

So there is no memory leak: the unpooled, wrapped ByteBuf does not cause issues even if it is never released.

Note: if someone builds messages via public static MessageImpl<T> create(ByteBuf payload) and enables batch sending, they will encounter issue 2. However, we do not provide a public API that sends a message typed org.apache.pulsar.client.api.Message, so this is not a concern in practice. It only affects users who use ProducerImpl or ProducerBase directly.

@poorbarcode (Contributor, Author)

@lhotari

Is there any condition in which this was getting triggered from the public APIs?

@poorbarcode Please answer this question. I'm wondering about the valid usages of this API. Is it used internally in Pulsar, since this is an internal API? How did you run into the problem?

A custom plugin for our company uses this API.

@poorbarcode (Contributor, Author)

@dao-jun

double release a ReferentCount Object will throw an exception, and ReferenceCountUtil.safeRelease(...) prints WARN log.

I think it is OK. If it gets released twice, we want this warning log to let us know that something is not working as expected.

@lhotari (Member) left a comment

LGTM

@poorbarcode (Contributor, Author) commented Apr 2, 2024

@dao-jun

@poorbarcode I just noticed there are some PRs fixing WARN/ERROR logs recently; maybe it's better to consider this carefully.

ReferenceCountUtil.safeRelease(payload) will do nothing if the payload is null, so I added an error log manually.

@lhotari (Member) left a comment

LGTM. The current title of the PR isn't optimal, since it also fixes issues in other cases as well as some glitches with the stats.

@poorbarcode poorbarcode changed the title [fix][client] client side memory leak when call ProducerBase.sendAsync(Message) [fix][client] Fix client side memory leak when call MessageImpl.create and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram Apr 6, 2024
@poorbarcode (Contributor, Author)

LGTM. The current title of the PR isn't optimal, since it also fixes issues in other cases as well as some glitches with the stats.

Renamed the title.

@poorbarcode poorbarcode merged commit 2469b97 into apache:master Apr 7, 2024
52 checks passed
poorbarcode added a commit that referenced this pull request Apr 10, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (#22393)

(cherry picked from commit 2469b97)
poorbarcode added a commit that referenced this pull request Apr 10, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (#22393)

(cherry picked from commit 2469b97)
poorbarcode added a commit that referenced this pull request Apr 10, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (#22393)

(cherry picked from commit 2469b97)
poorbarcode added a commit that referenced this pull request Apr 10, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (#22393)

(cherry picked from commit 2469b97)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 15, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (apache#22393)

(cherry picked from commit 2469b97)
(cherry picked from commit 14b6279)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 17, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (apache#22393)

(cherry picked from commit 2469b97)
(cherry picked from commit 14b6279)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 17, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (apache#22393)

(cherry picked from commit 2469b97)
(cherry picked from commit 14b6279)
lhotari added a commit that referenced this pull request Apr 17, 2024
…s SpotBugs failure

- SpotBugs check failed since there was an unused field "msgSize"
  - remove the unused field in branch-2.11
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 19, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (apache#22393)

(cherry picked from commit 2469b97)
(cherry picked from commit 14b6279)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 23, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (apache#22393)

(cherry picked from commit 2469b97)
(cherry picked from commit 14b6279)
nodece pushed a commit to ascentstream/pulsar that referenced this pull request May 13, 2024
…e and fix imprecise client-side metrics: pendingMessagesUpDownCounter, pendingBytesUpDownCounter, latencyHistogram (apache#22393)

(cherry picked from commit 2469b97)
@nodece (Member) commented May 13, 2024

This PR broke the branch-2.10 CI. Fixed by f89fa72

Error:  Medium: Unread field: org.apache.pulsar.client.impl.ProducerImpl$DefaultSendMessageCallback.msgSize [org.apache.pulsar.client.impl.ProducerImpl$DefaultSendMessageCallback] At ProducerImpl.java:[line 342] URF_UNREAD_FIELD
Error:  Failed to execute goal com.github.spotbugs:spotbugs-maven-plugin:4.2.2:check (spotbugs) on project pulsar-client-original: failed with 1 bugs and 0 errors -> [Help 1]

Labels: category/reliability, cherry-picked/branch-2.10, cherry-picked/branch-2.11, cherry-picked/branch-3.0, cherry-picked/branch-3.2, doc-not-needed, ready-to-test, release/2.10.7, release/2.11.5, release/3.0.5, release/3.2.3, type/bug