
Requests fail intermittently w/ hyper error #173

Closed
stearnsc opened this issue Apr 17, 2021 · 8 comments

@stearnsc
Member

On a small portion of requests, we're getting the error:
error communicating with stripe: connection closed before message completed
and the request is failing.

This appears to be related to hyperium/hyper#2136, in which hyper attempts to reuse a connection that the server (in this case, Stripe) has already closed.

I opened #172 to trial disabling connection pooling; after pinning this library to that revision in our production environment, the errors stopped.

In my mind we have 3 viable solutions:

  1. We disable connection pooling entirely, as done in disable connection pooling #172 (see the sketch after this list)
  2. We determine Stripe's keepalive timeout and set our timeout to less than that in order to prevent the race condition (also sketched below)
  3. We implement retry logic of some sort
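
For context, options (1) and (2) correspond to two knobs on hyper's client builder. The sketch below assumes hyper 0.14 and a plain HTTP connector for brevity; it is not this crate's actual client construction, just an illustration of the two settings:

```rust
// A minimal sketch, assuming hyper 0.14; not the crate's real client setup.
use std::time::Duration;
use hyper::{Body, Client};

fn example_clients() {
    // Option (1): disable pooling entirely, so every request opens a fresh connection.
    let _no_pool = Client::builder()
        .pool_max_idle_per_host(0)
        .build_http::<Body>();

    // Option (2): keep pooling, but drop idle connections well before the server's
    // keepalive timeout so we never reuse a connection Stripe has already closed.
    // The 30-second value is an arbitrary illustration; Stripe's actual behaviour
    // is discussed later in this thread.
    let _short_idle = Client::builder()
        .pool_idle_timeout(Duration::from_secs(30))
        .build_http::<Body>();
}
```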

I'm reasonably concerned about (3) as a solution: if we retry a request that actually succeeded, and we aren't using idempotency keys, we open our users up to double-charging their customers, which is obviously Very Bad™.

I think right now I'm in favor of (1) as a short-term fix, possibly moving to (2) if/when we can determine with confidence how Stripe handles such connections. Alternatively, we could disable connection pooling by default but allow it to be enabled through configuration, which puts the burden on our users to configure it correctly, handle retries, etc.

@arlyon
Collaborator

arlyon commented Apr 17, 2021

#172 seems like a decent short-term solution. I am currently speaking with Stripe to see if they can give a proper number (for solution 2) which you could test with; I'll comment when they get back to me.

Edit: they're handling the issue internally and will send me an email. Even if we decide to go with solution 1, setting a sane timeout is a clear benefit anyway.

@arlyon
Collaborator

arlyon commented Apr 20, 2021

Contents of the email, reworded because I'd rather not post verbatim:

Recommended solution from Stripe: set the timeout limit to 30 seconds, the same value used in their Java library (https://github.com/stripe/stripe-java/blob/4262fcc6e4b0f084c03de438886b2a9f0fbea306/src/main/java/com/stripe/Stripe.java#L9).

It was also mentioned that we can (and probably should, eventually) set up retry logic as a nice-to-have, though I agree with your original concerns here. In that case, we can make use of idempotency headers, as their Node client does (https://github.com/stripe/stripe-node/blob/e89cc76ecbf64979cc012d81ab299eee32d8500d/lib/StripeResource.js#L302-L335), generating them automatically if the caller doesn't supply one.
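
As a rough illustration of that approach (the helper name and the `uuid` dependency here are assumptions for the sketch, not existing crate API), generating a key when the caller doesn't supply one could look like:

```rust
// Hypothetical sketch: mint an Idempotency-Key automatically, mirroring the
// Node client's behaviour, so a retried POST can safely be sent more than once.
use uuid::Uuid;

fn idempotency_key(user_supplied: Option<String>) -> String {
    // Use the caller's key if given; otherwise generate a fresh v4 UUID per logical request.
    user_supplied.unwrap_or_else(|| Uuid::new_v4().to_string())
}
```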

@arlyon
Collaborator

arlyon commented Apr 20, 2021

Would you be willing to set the timeout to 30 seconds and test? I can make some time to implement idempotent retries, but that will obviously need a good deal of testing before it can go into production. Maybe we can use the mock Stripe Docker image with a load generator to count dropped connections and detect duplicate charges.

@stearnsc
Member Author

Sure; I should have time to test that this afternoon.

@arlyon
Collaborator

arlyon commented Apr 20, 2021

Just received a follow-up: "while we terminate idle connections after 180s, we may terminate them early for a number of reasons (e.g. host management) and you should not explicitly depend on a connection staying open that long". I think retries and idempotency seem like the only bulletproof solution; PR #130 makes some progress there, but it is a year old.
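
To make the shape of that concrete, here is an illustrative sketch (not crate code) of a retry wrapper; the idea is that the request, including its Idempotency-Key, is built once and captured by the closure, so a re-send cannot be applied twice by Stripe:

```rust
// Illustrative sketch only: re-run a request closure up to `max_retries` extra
// times. The caller is assumed to build the request once, Idempotency-Key
// included, and capture it in `attempt`.
async fn send_with_retry<F, Fut, T, E>(mut attempt: F, max_retries: usize) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..=max_retries {
        match attempt().await {
            Ok(value) => return Ok(value),
            // A real implementation would only retry errors known to be
            // connection-level (e.g. hyper's "connection closed before message
            // completed"), not application errors returned by Stripe.
            Err(err) => last_err = Some(err),
        }
    }
    Err(last_err.expect("loop runs at least once"))
}
```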

@stearnsc
Member Author

I'm going to go ahead and merge the "stop connection pooling" change until we implement idempotency. I'll make a follow-up story for that.

@stearnsc
Member Author

Made #176 to track the ongoing issue

@stearnsc
Member Author

Closed by #172
