WebSocket server outgoing message queue memory growth #4824

Closed
jbfru45 opened this issue Apr 28, 2020 · 15 comments · Fixed by #5059

jbfru45 commented Apr 28, 2020

Jetty version: 9.4.27.v20200227
Java version: OpenJDK 11.0.3
OS type/version: Amazon Linux 2

Description
We have observed excessive JVM heap growth twice in the last couple of days. A heap dump shows that the majority of the growth is due to a backlog of outgoing Jetty WebSocket FrameEntries held in a single org.eclipse.jetty.websocket.common.extensions.ExtensionStack (see the heap dump screenshot below). We are calling RemoteEndpoint.sendString(String text, WriteCallback callback) to send messages asynchronously to connected WebSocket sessions. Are there any known scenarios that could cause this, and is there anything we can do to prevent it?
[Screenshot: heap dump showing the backlog of FrameEntry objects in the ExtensionStack]

jbfru45 changed the title from "WebSocket server outgoing message queue heap growth" to "WebSocket server outgoing message queue memory growth" on Apr 28, 2020
joakime (Contributor) commented Apr 28, 2020

Are you paying attention to the results of the WriteCallback?

It is up to the application not to overwhelm the async write path, since it is non-blocking and queues messages.
The application should back off from writing if there is an excess of messages that have not yet resulted in success/failure.
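
For illustration only, a minimal sketch of that callback-aware sending against the Jetty 9.4 org.eclipse.jetty.websocket.api types; the class name and threshold are invented for this example:

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.eclipse.jetty.websocket.api.RemoteEndpoint;
import org.eclipse.jetty.websocket.api.WriteCallback;

// Hypothetical helper: count writes whose WriteCallback has not completed yet,
// and refuse (or back off) once a threshold is reached.
public class ThrottledSender
{
    private static final int MAX_PENDING = 1024; // illustrative threshold
    private final AtomicInteger pending = new AtomicInteger();

    public boolean trySend(RemoteEndpoint remote, String message)
    {
        // Too many writes still outstanding: tell the caller to back off.
        if (pending.get() >= MAX_PENDING)
            return false;

        pending.incrementAndGet();
        remote.sendString(message, new WriteCallback()
        {
            @Override
            public void writeSuccess()
            {
                pending.decrementAndGet();
            }

            @Override
            public void writeFailed(Throwable cause)
            {
                pending.decrementAndGet();
                // Log and/or close the session as appropriate.
            }
        });
        return true;
    }
}
```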

jbfru45 (Author) commented Apr 28, 2020

We are logging the result in WriteCallback but do not track a pending count. CPU was under 15% and network bandwidth was not saturated from our server's perspective.

Is this queue for the entire server, or is it per session? If it is per session, I could track the outgoing count for each connection and kill the connection if it becomes too backed up.

joakime (Contributor) commented Apr 28, 2020

It is per WebSocket Session.

There is no exposed API for accessing the queue depth, either before or after the extension stack.
The queue is an internal implementation detail that has undergone many changes over its lifetime to address reported issues.
Even through those changes, the queue itself has always been present; that is the nature of async versus blocking writes.

Many projects using the async modes in Jetty WebSocket (or the javax.websocket API, or the newer jakarta.websocket API) have the same requirements.
The application (or library) is responsible for:

  • paying attention to the async callbacks (or futures, or other techniques) to know if it is sending too much.
  • perhaps dropping messages that are no longer important (their moment has passed).
  • perhaps ordering messages based on some kind of application logic.
  • perhaps noting which messages were confirmed as sent successfully, and ensuring unsent messages are resent (even across a WebSocket client reconnect).

jbfru45 (Author) commented Apr 28, 2020

Okay, the heap dump showed over 300k messages pending in that ExtensionStack, so I assume that was likely just a single slow client connection?

We operate a "fire and forget" outgoing notification system, so I think tracking the pending count per session and dropping messages if the count gets too high is a reasonable approach. We have similar logic at other levels of the stack.

joakime (Contributor) commented Apr 28, 2020

300k is a lot, and it could be from a single slow client if you don't pay attention to the async results.

Have you considered a more robust library for notifying clients, such as cometd (which has WebSocket support)?
https://cometd.org/

It manages the write queue for you.
It is a pub/sub messaging setup, with client libraries (standalone or browser/JavaScript) and server libraries (it works very well with Jetty).
Don't worry about the clustering features; they are optional.
There are even configurations to ensure that specific messages are sent and acknowledged by clients (even ones that reconnect).

@gregw @sbordet what do you think about putting a (configurable in the session) upper limit on the writeflusher queue in websocket?
If it hits the limit, all further attempts to add fail the requested writes immediately?
Also, I think this concept would have to be on the pre-extensionStack writeflusher.

jbfru45 (Author) commented Apr 29, 2020

We have many 3rd party clients that use our platform, so even if we added support for a new standard on the server side we would have to support both implementations for quite a while. We eventually hope to move to a managed MQTT solution, but that is likely a ways off as well.

I found this issue stating that a websocket async send timeout was implemented in Jetty 10, but not 9.4: #847

Something like that would be quite useful here. Could that be backported to 9.4? Also, is the OP of that ticket correct when he claims that the WriteCallback is called before the message is actually sent to the client? If so, monitoring those callbacks won't give a good indication of the actual queue length.

rymsha commented Apr 29, 2020

Also, is the OP of that ticket correct when he claims that the WriteCallback is called before the message is actually sent to the client? If so, monitoring those callbacks won't give a good indication of the actual queue length.

This is my current problem as well.
The callback is called before an ACK is received from the client.
Actually, the "blocking" send also returns before the ACK!

In all the examples I have found (including cometd), the async call is either immediately followed by future.get() (how is that better than blocking?) or the future/callback is ignored altogether (causing the internal queue to grow without limit).
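
For reference, the pattern being criticized here, sketched against Jetty 9.4's RemoteEndpoint (the timeout value and class name are arbitrary): calling get() immediately after an async send simply re-creates a blocking send.

```java
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.websocket.api.RemoteEndpoint;

public class FutureSendExample
{
    // Waits for the write to complete; this blocks the calling thread just like
    // the plain blocking send, and it still only confirms that the frame was
    // written to the network, not that the client received it.
    public static void sendAndWait(RemoteEndpoint remote, String message) throws Exception
    {
        Future<Void> result = remote.sendStringByFuture(message);
        result.get(5, TimeUnit.SECONDS);
    }
}
```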

joakime (Contributor) commented Apr 29, 2020

I found this issue stating that a websocket async send timeout was implemented in Jetty 10, but not 9.4: #847

Every time you use the send methods, that async send timeout resets.
If you continue to send periodically, that timeout will not help you.

joakime (Contributor) commented Apr 29, 2020

Also, is the OP of that ticket correct when he claims that the WriteCallback is called before the message is actually sent to the client? If so, monitoring those callbacks won't give a good indication of the actual queue length.

This is my current problem as well.
The callback is called before an ACK is received from the client.
Actually, the "blocking" send also returns before the ACK!

In all the examples I have found (including cometd), the async call is either immediately followed by future.get() (how is that better than blocking?) or the future/callback is ignored altogether (causing the internal queue to grow without limit).

The callback success is invoked once the message has been successfully written to the network.

After that, all of the normal TCP behaviors apply.
The network can do all sorts of buffering / aggregation and whatnot on its own without Java being involved.
With writes to a slow remote, you'll eventually reach a point of TCP congestion where enough has been written, but the remote side hasn't read enough.
That congestion will cause the next write you attempt to not succeed (yet).
In blocking send scenarios, the send() method will block until it can actually write (or fail and throw an exception).
In async send scenarios, the callback/future will not complete until the write can actually proceed (or fails).
This is typical backpressure that you can use to detect a slow client.

This is not unique to WebSocket; it is typical behavior for all TCP-based communication.

joakime (Contributor) commented Apr 29, 2020

The callback is called before an ACK is received from the client.
Actually, the "blocking" send also returns before the ACK!

TCP ACK is layer 4.
We (Jetty, HTTP, WebSocket) are all the way up at layer 7.
If you want to know with 100% certainty that the remote side got the WebSocket message, then you need a layer-7 ACK: a WebSocket-message-specific ACK.
That is what many libraries built on WebSocket do (including cometd, which has an ACK extension).
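
A rough sketch of such a layer-7 ACK on top of Jetty 9.4's annotated WebSocket API; the wire format ("MSG <id> <payload>" / "ACK <id>") and all names here are invented for illustration, and libraries like cometd provide this for you:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.eclipse.jetty.websocket.api.Session;
import org.eclipse.jetty.websocket.api.WriteCallback;
import org.eclipse.jetty.websocket.api.annotations.OnWebSocketMessage;
import org.eclipse.jetty.websocket.api.annotations.WebSocket;

@WebSocket
public class AckingEndpoint
{
    // Messages sent but not yet acknowledged by the client at layer 7.
    private final Map<Long, String> unacked = new ConcurrentHashMap<>();

    public void send(Session session, long id, String payload)
    {
        unacked.put(id, payload);
        session.getRemote().sendString("MSG " + id + " " + payload, new WriteCallback()
        {
            @Override
            public void writeSuccess()
            {
                // Written to the network, but still waiting for the client's ACK.
            }

            @Override
            public void writeFailed(Throwable cause)
            {
                // The message stays in 'unacked' so it can be resent later.
            }
        });
    }

    @OnWebSocketMessage
    public void onMessage(Session session, String text)
    {
        if (text.startsWith("ACK "))
            unacked.remove(Long.parseLong(text.substring(4).trim()));
        // Anything remaining in 'unacked' can be replayed, e.g. after a reconnect.
    }
}
```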

jbfru45 (Author) commented Apr 29, 2020

In async send scenarios, the callback/future will not complete until the write can actually proceed (or fails). This is typical backpressure that you can use to detect a slow client.

Got it, so the TCP layer will eventually apply backpressure when its send buffers fill up, which will delay Jetty's WriteCallback as well. That's what I hoped, but I wanted to confirm. I should be able to try this out tomorrow. Thanks for all your help.

rymsha commented Apr 29, 2020

Thanks for such a great response @joakime.
Repeating your question:

what about putting a (configurable in the session) upper limit on the writeflusher queue in websocket?

For me:

  • it could prevent severe OOMEs (replacing them with a less severe exception, keeping the server alive)
  • it could be used as a signal to stop sending more messages to a slow client

joakime added a commit that referenced this issue Apr 29, 2020
joakime (Contributor) commented Apr 29, 2020

lachlan-roberts added a commit that referenced this issue Apr 29, 2020
lachlan-roberts (Contributor) commented

what do you think about putting a (configurable in the session) upper limit on the writeflusher queue in websocket?
If it hits the limit, all further attempts to add fail the requested writes immediately?
Also, I think this concept would have to be on the pre-extensionStack writeflusher.

@joakime I'm not really sure it is correct to fail frames without failing the connection. What if I send three frames, it drops the middle one, and then the other side sees only the first and last frame?

But I think this would be pretty simple to implement. I had a go at implementing this in RemoteEndpoint and it seems to work fine. Have a look at commit 71df3b5 and let me know what you think.

joakime added a commit that referenced this issue May 1, 2020
joakime added a commit that referenced this issue May 1, 2020 (reverts commit a21e833)
joakime added a commit that referenced this issue May 1, 2020
joakime added a commit that referenced this issue May 4, 2020
lachlan-roberts added a commit that referenced this issue Sep 4, 2020
lachlan-roberts added a commit that referenced this issue Sep 4, 2020
lachlan-roberts added a commit that referenced this issue Sep 4, 2020
lachlan-roberts added a commit that referenced this issue Sep 10, 2020: Issue #4824 - add configuration on RemoteEndpoint for maxOutgoingFrames
lachlan-roberts (Contributor) commented

We have added configuration on RemoteEndpoint to optionally restrict the number of data frames which can be queued up to be sent.

This can be set with RemoteEndpoint.setMaxOutgoingFrames(), where the default value of -1 does not restrict the number of frames which can be queued up. With the default of -1, the user will need to pay attention to the WriteCallback to ensure they do not send too many frames. If a limit is set and the number of queued writes exceeds it, the frame's callback will be failed with a WritePendingException, but the connection will not be closed.
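
For example, a usage sketch of the behavior described above (the limit of 64 and the class name are arbitrary):

```java
import java.nio.channels.WritePendingException;

import org.eclipse.jetty.websocket.api.RemoteEndpoint;
import org.eclipse.jetty.websocket.api.Session;
import org.eclipse.jetty.websocket.api.WriteCallback;

public class LimitedSender
{
    public void send(Session session, String message)
    {
        RemoteEndpoint remote = session.getRemote();
        // Cap the number of data frames queued for this session
        // (the default of -1 leaves the queue unbounded).
        remote.setMaxOutgoingFrames(64);

        remote.sendString(message, new WriteCallback()
        {
            @Override
            public void writeSuccess()
            {
            }

            @Override
            public void writeFailed(Throwable cause)
            {
                if (cause instanceof WritePendingException)
                {
                    // The limit was exceeded: this frame was not queued, but the
                    // connection is still open. Drop or retry per application policy.
                }
            }
        });
    }
}
```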
