Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add connect timeout for exec websockets to avoid hanging #1247

Merged
merged 3 commits into from
Jun 6, 2024

Conversation

benhoyt
Copy link
Collaborator

@benhoyt benhoyt commented Jun 5, 2024

This is the fix for the issue described at #1246. Essentially, if the Pebble timeout has already elapsed, Pebble will happily wait indefinitely for the connect to go through, and the Python side will hang. Add a timeout during the connect phase to cut this short.

Fixes #1246.

ops/pebble.py Outdated
@@ -2832,10 +2832,17 @@ def _cancel_stdin():

def _connect_websocket(self, task_id: str, websocket_id: str) -> '_WebSocket':
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
# Set socket timeout to a short timeout during connection phase, in
# case Pebble side times out (5s), so this side doesn't hang. See:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default timeout for this class is 5s. If pebble timeout is also 5s, the net result feels race-y...

I wonder how to test this exhaustively.

One option would be to set up a dummy but programmable websocket listener. What do you reckon, is it worth the effort?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not actually racy, as those two timeouts (while they happen to be the same value) are completely independent: the Pebble timeout happens first, then this 5s timeout is added to that. So the total timeout here before the Python side will fail is 10s.

It's easy to test manually with the instructions given in #1246. I wondered whether it was worth an automated test, but decided it was too hard and awkward. We'd need some hook to patch in an arbitrary delay (like the time.sleep(5.1) in the manual test). I decided that because the fix is simple and clear, and it's easy to repro manually, it's probably not worth the hours we'd spend on an automated test for this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a caveat to this way of injecting the timeout.

Each raw socket operation may take up to 5s, meaning that the total an unsuccessful websocket connection may take would be 5 * (N+1) where N depends entirely on the websocket-client code and deps.

I figure that this is still way better than not setting a timeout :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. I have an idea how to set up automated tests for this, do tell if you'd like me to try.

I had done something similar in the past for some other protocol.

On the other hand, we'd be mostly testing our expectations of the websocket-client library, rather than our code.

@benhoyt
Copy link
Collaborator Author

benhoyt commented Jun 5, 2024

To try this easily in a charm, you can patch the pebble.Client._connect_websocket by including this at the top of src/charm.py (or an imported Python file):

from ops import pebble
import socket
import websocket


def _connect_websocket(self, task_id, websocket_id):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(self.timeout)
    sock.connect(self.socket_path)
    url = self._websocket_url(task_id, websocket_id)
    ws = websocket.WebSocket(skip_utf8_validation=True)
    ws.connect(url, socket=sock)
    sock.settimeout(None)
    return ws


pebble.Client._connect_websocket = _connect_websocket

Copy link
Contributor

@tonyandrewmeyer tonyandrewmeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I could reproduce reliably with the steps in #1246, and running from your branch indeed nicely errors out instead.

ops/pebble.py Outdated Show resolved Hide resolved
Co-authored-by: Tony Meyer <tony.meyer@gmail.com>
@benhoyt benhoyt merged commit d8c9807 into canonical:main Jun 6, 2024
26 checks passed
@benhoyt benhoyt deleted the websocket-connect-timeout branch June 6, 2024 22:42
tonyandrewmeyer added a commit to tonyandrewmeyer/operator that referenced this pull request Jun 26, 2024
…cal#1247)

This is the fix for the issue described at
canonical#1246. Essentially, if the
Pebble timeout has already elapsed, Pebble will happily wait
indefinitely for the connect to go through, and the Python side will
hang. Add a timeout during the connect phase to cut this short.

Fixes canonical#1246.

---------

Co-authored-by: Tony Meyer <tony.meyer@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Websocket connect hangs if Pebble timeout elapses first
3 participants