
Retry certain errors between server and gateway #944

Merged: 5 commits into jupyter-server:main on Aug 15, 2022

Conversation

kevin-bates
Member

Recently, after switching from Docker Desktop to Rancher Desktop to deploy Enterprise Gateway in Kubernetes, I found a fair number of requests encountering HTTP errors with code 599 (timeout/stream closed). These were not seen previously, nor when running against other K8s clusters, so I believe this to be an issue with the port-forwarding implemented in Rancher Desktop. Looking into this, a common practice is to retry a certain class of HTTP and general connection errors, and such errors are typically resolved on the first retry.

While implementing various approaches, I found the Retry class from urllib3.util, which can be used with the requests library via its HTTP adapters. Since Tornado doesn't appear to use requests, and because I didn't want to introduce another dependency, I decided to implement a RetryableHTTPClient that contains an AsyncHTTPClient instance. This class is configured to retry a certain set of HTTPErrors, as well as subclasses of ConnectionError, with backoffs similar to those of Retry.
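
To make the mechanics concrete, here is a minimal sketch of the approach described above: a wrapper class holding an AsyncHTTPClient instance that retries a limited set of HTTP status codes and ConnectionError subclasses with exponential backoff. The retryable codes, backoff factor, and logging shown here are illustrative assumptions, not the exact implementation in jupyter_server/gateway/gateway_client.py.

import asyncio
import logging

from tornado.httpclient import AsyncHTTPClient, HTTPError, HTTPResponse


class RetryableHTTPClient:
    """Sketch: wrap AsyncHTTPClient and retry a limited class of errors."""

    # Illustrative set of retryable HTTP status codes; 599 is the code observed
    # with Rancher Desktop's port-forwarding, the others are common gateway errors.
    retryable_http_codes = {502, 503, 504, 599}

    def __init__(self, max_retries: int = 2):
        self.max_retries = max_retries
        self.client = AsyncHTTPClient()

    async def fetch(self, endpoint: str, **kwargs) -> HTTPResponse:
        retries = 0
        while True:
            try:
                return await self.client.fetch(endpoint, **kwargs)
            except (HTTPError, ConnectionError) as err:
                if not self._is_retryable(err) or retries >= self.max_retries:
                    raise  # not retryable, or retries exhausted: surface the error as before
                retries += 1
                logging.info(
                    "Attempting retry (%d) against endpoint '%s'.  Retried error: '%s'",
                    retries, endpoint, err,
                )
                # Back off roughly in the spirit of urllib3's Retry: 0.1s, 0.2s, 0.4s, ...
                await asyncio.sleep(0.1 * (2 ** (retries - 1)))

    def _is_retryable(self, err: Exception) -> bool:
        return isinstance(err, ConnectionError) or getattr(err, "code", None) in self.retryable_http_codes

A caller would simply replace a direct AsyncHTTPClient().fetch(...) call with RetryableHTTPClient().fetch(...), so non-retryable failures still propagate exactly as they did before this change.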

When a request fails and the error is deemed retryable, an informational message is logged indicating the retry count, the error that triggered the retry, and the endpoint on which the error occurred.

[I 2022-08-12 13:43:36.612 ServerApp] Attempting retry (1) against endpoint 'http://localhost:53433/api/kernels'.  Retried error: 'HTTP 599: Stream closed'
[D 2022-08-12 13:43:36.721 ServerApp] 200 GET /api/kernels?1660337016448 (33fb1efb75cc4be5a127946cf37b178e@::1) 270.29ms

If the error is not retryable or the maximum number of retries has been exceeded, the exception is raised (as was the case prior to this change).

Since this class is particular to interaction with a Gateway server, the parameters that determine retries are not currently configurable, with the exception of the maximum retry count, which is configurable only via the environment variable JUPYTER_GATEWAY_MAX_REQUEST_RETRIES and defaults to 2 with a maximum value of 10.
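
As a rough illustration of that configuration point (assumed behavior reconstructed from the description above, not a copy of the implementation), the retry ceiling could be derived like this:

import os

# Default to 2 retries, allow override via JUPYTER_GATEWAY_MAX_REQUEST_RETRIES,
# and never exceed the hard cap of 10.
max_retries = min(int(os.environ.get("JUPYTER_GATEWAY_MAX_REQUEST_RETRIES", "2")), 10)

So, for example, exporting JUPYTER_GATEWAY_MAX_REQUEST_RETRIES=5 before starting the server would allow up to five retries per gateway request.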

codecov-commenter commented Aug 12, 2022

Codecov Report

Merging #944 (97aae1f) into main (7623966) will decrease coverage by 0.16%.
The diff coverage is 40.42%.

@@            Coverage Diff             @@
##             main     #944      +/-   ##
==========================================
- Coverage   72.30%   72.14%   -0.17%     
==========================================
  Files          65       65              
  Lines        8001     8041      +40     
  Branches     1336     1343       +7     
==========================================
+ Hits         5785     5801      +16     
- Misses       1810     1834      +24     
  Partials      406      406              
Impacted Files                             Coverage   Δ
jupyter_server/gateway/gateway_client.py   73.24%     <37.77%> (-7.18%) ⬇️
jupyter_server/gateway/managers.py         81.26%     <100.00%> (+0.04%) ⬆️


@blink1073
Contributor

You checked in the package-lock.json changes. 😄

@kevin-bates
Member Author

Sorry about that. It should be reverted now.

What's the best way to avoid checking in that file when it gets updated? I didn't run git status because I assumed my change was limited to the one file I intentionally modified; running git status is how I've caught unintended updates to that file previously.

@blink1073
Contributor

I don't know of a good way, since it is already under source control. It gets updated whenever the npm scripts are run. It isn't a big deal if we accidentally merge it, but better to do it intentionally.

@kiersten-stokes (Contributor) left a comment

As another Rancher Desktop user, I can confirm this is working nicely for me! Log excerpt below:

[I 2022-08-15 13:22:36.841 ServerApp] Attempting retry (1) against endpoint 'http://localhost:51928/api/kernels/2b5d0896-7bfc-422c-9eea-9ec825cb34f8'.  Retried error: 'HTTP 599: Stream closed'
[D 2022-08-15 13:22:37.043 ServerApp] Kernel retrieved: {'id': '2b5d0896-7bfc-422c-9eea-9ec825cb34f8', 'name': 'python_kubernetes', 'last_activity': '2022-08-15T18:22:34.067261Z', 'execution_state': 'starting', 'connections': 0}

blink1073 merged commit 260351e into jupyter-server:main on Aug 15, 2022
@blink1073
Contributor

Thanks! Should this be backported?

@kevin-bates
Member Author

Thank you for the reviews.

Should this be backported?

I think this will be generally useful, so a back-port would be appreciated. Thank you @blink1073!

@blink1073
Contributor

@meeseeksdev please backport to 1.x

meeseeksmachine pushed a commit to meeseeksmachine/jupyter_server that referenced this pull request Aug 15, 2022
blink1073 pushed a commit that referenced this pull request Aug 15, 2022: …between server and gateway (#946)

Co-authored-by: Kevin Bates <kbates4@gmail.com>