This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Investigate spike in response time #4460

Closed
rohitpaulk opened this issue May 11, 2017 · 31 comments

@rohitpaulk
Contributor

Looks like the spike started after #4305 was deployed. Split out from #4459

@rohitpaulk
Contributor Author

#4459 (comment)

@rohitpaulk rohitpaulk self-assigned this May 11, 2017
@rohitpaulk
Contributor Author

notes on slack

@rohitpaulk
Contributor Author

A few stats to get started:

Gratipay serves an average of 300 requests/minute, i.e. 5 requests/sec (screenshot from Heroku attached)

[screenshot: Heroku metrics, 2017-05-11]

Our web server is gunicorn, and we run it with a single worker of the 'synchronous' worker type. If we think of Gratipay as a restaurant, we have a single waiter (worker, in gunicorn terms) serving multiple customers (requests, in gunicorn terms). Given 5 customers/sec and a waiter who can only serve one customer at a time, the waiter can keep up as long as each customer takes less than 200ms to serve. If the rate falls below 5 requests/sec, the waiter has idle time. If it rises above 5 requests/sec, some customers have to wait for the waiter to finish with other customers before they can be served.
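
In gunicorn terms, that single-waiter setup looks roughly like this (a minimal sketch of a gunicorn config file, not necessarily our actual configuration):

```python
# gunicorn.conf.py -- sketch of the setup described above (illustrative only).
worker_class = "sync"   # the default: each worker handles one request at a time
workers = 1             # a single "waiter"
```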

This can be demonstrated by hitting Gratipay concurrently with a sizable request (the homepage is a good example) and observing response times.
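
For illustration, here's a rough Python sketch of that kind of concurrency test (the measurements below were actually taken with ab; the URL, concurrency, and request count here are placeholders, not the exact values used):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://gratipay.com/"   # homepage; swap in /version.txt for a light endpoint
CONCURRENCY = 5                 # roughly our average request rate
TOTAL_REQUESTS = 50

def timed_get(_):
    # Time a single request from start to last byte received.
    start = time.time()
    requests.get(URL)
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    durations = list(pool.map(timed_get, range(TOTAL_REQUESTS)))

print("min %.0f ms  max %.0f ms  mean %.0f ms" % (
    min(durations) * 1000,
    max(durations) * 1000,
    sum(durations) / len(durations) * 1000,
))
```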

First, let's take github.com as an example of how a web server operating within capacity responds.

[screenshot: concurrent request timings against github.com]

Next, let's take the Gratipay homepage. Since we're nearly exhausting our capacity, the response times vary a lot because some requests are waiting on others to finish.

[screenshot: concurrent request timings against the Gratipay homepage]

Note, though, that this is only visible for endpoints with significant processing time. If the endpoint is light (gratipay.com/version.txt, for example), the effect can't be observed:

[screenshot: concurrent request timings against gratipay.com/version.txt]

@rohitpaulk
Contributor Author

rohitpaulk commented May 11, 2017

There are three solutions to this:

  • Increase the number of waiters in the restaurant (add a dyno, add a gunicorn worker, switch to a threaded server, etc.)
  • Decrease the time it takes to serve a customer (optimize expensive requests)
  • EDIT: As @mattbk mentions, a third option is to reduce the number of customers (drop unauthenticated requests past a certain rate)

@rohitpaulk
Contributor Author

rohitpaulk commented May 11, 2017

The first option is easier, as long as we've designed for that possibility. I remember an issue created about this... trying to fetch it

@rohitpaulk
Contributor Author

Related: #1098

@rohitpaulk
Contributor Author

Similar work was done in #1617; apparently ab on Mac has limitations?

@rohitpaulk
Contributor Author

apparently ab on Mac has limitations?

I'm not going down this route right now. If it works fine against github.com but not gratipay.com, it's unlikely to be an ab limitation.

@rohitpaulk
Contributor Author

rohitpaulk commented May 11, 2017

I remember an issue created about this... trying to fetch it

Hmm, still can't find it. I vaguely remember that it listed two reasons why we won't be able to run multiple dynos/processes:

  • Crons were locking in-memory (they should use the database). I think this is solved now
  • Something OAuth-related, where we store an authorization token in memory to be accessed between requests?

@mattbk
Contributor

mattbk commented May 11, 2017

Even allowing multiple requests for non-auth pages seems like it would help.

EDIT: I may be misunderstanding the problem.

@Changaco
Contributor

@rohitpaulk I believe #715 is the issue you're thinking of.

@rohitpaulk
Contributor Author

rohitpaulk commented May 11, 2017

Exactly, thanks.

@rohitpaulk
Contributor Author

Even allowing multiple requests for non-auth pages seems like it would help.

True. I'd treat that as a last resort, though; better if we don't have to drop requests :)

@rohitpaulk
Contributor Author

Edited #4460 (comment) to include ^

@rohitpaulk
Contributor Author

Now that #715 is closed, I think we can go ahead and add more capacity. I'm pretty confident that adding a dyno will work fine; not sure about increasing the worker count for gunicorn. Will wait for @whit537's confirmation on the latter - I'm going to go ahead and add another dyno for now.

@rohitpaulk
Contributor Author

Heroku apps running on Professional-tier dynos (any dyno type except Free or Hobby) can be scaled to run on multiple dynos simultaneously.

😞 Heroku removed support for horizontal scaling for dynos of the 'Hobby' type. Professional dynos start at $25/month; we're currently at $7/month.

Might have to look into the gunicorn worker count.

@mattbk
Contributor

mattbk commented May 11, 2017

Slack

Gittip's slowness mostly comes from non-optimized DB queries; adding dynos doesn't scale the DB AFAIK, so it potentially just makes it worse by increasing the rate of queries, iirc

ok i remember our scaling issues were because some tokens were held in memory vs database

@chadwhitacre
Contributor

If we're looking at the gunicorn worker count, we should also keep h1:203366 in mind.

@rohitpaulk
Contributor Author

rohitpaulk commented May 12, 2017

Gittip's slowness mostly comes from non-optimized DB queries; adding dynos doesn't scale the DB AFAIK, so it potentially just makes it worse by increasing the rate of queries

I think the situation was different then. Our baseline response time might be slower now because of non-optimized DB queries (customers who take a long time to order), but the huge spikes are not because of them - they are because of requests piling up (not enough waiters).

@rohitpaulk
Contributor Author

I've increased the gunicorn worker count to 2 on production. Will post results in a while.

@rohitpaulk
Contributor Author

I'm seeing a few spikes - will revert in 30 minutes if it doesn't settle down

@rohitpaulk
Contributor Author

Reverted, that didn't work out well. I'll figure out a way to try this in a staging environment.

@rohitpaulk
Contributor Author

I've created staging apps on my personal account before; going to experiment with Heroku pipelines now so that we can document how this is done.

@rohitpaulk
Contributor Author

After a lot of experimenting with staging, I'm coming to the conclusion that our spikes are mostly due to the homepage. I'm going to run a quick experiment on staging to prove it; I'll need to provision a Standard database for a couple of hours.

@rohitpaulk
Contributor Author

Confirmed. Most of the spikes correspond to periods when the homepage was accessed directly. Surprisingly, the vast majority of our requests are for widget.html and public.json, not the homepage.

Tests against staging with the same database as production:

[screenshot: response-time graph from staging tests]

(yellow is the response time within Aspen; green is the total response time, including time spent in Heroku)

@rohitpaulk
Contributor Author

rohitpaulk commented May 15, 2017

Two steps to fix this:

Note: I've already enabled async workers with a count of 2. Increasing the number isn't going to help here; we have so much data to throw down the wire that the worker count won't matter past a point.
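
For reference, the async-worker setup mentioned above would look roughly like this in a gunicorn config file (a sketch assuming gevent workers; not necessarily the exact production settings):

```python
# gunicorn.conf.py -- sketch of the async-worker setup described above.
# worker_class "gevent" requires the gevent package to be installed.
worker_class = "gevent"   # cooperative workers; many in-flight requests per worker
workers = 2
```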

@mattbk
Contributor

mattbk commented May 15, 2017

I like it. Rather than listing all the teams on the homepage, maybe we should beef up the search capabilities with additional filters?

@rohitpaulk
Contributor Author

Rather than listing all the teams on the homepage, maybe we should beef up the search capabilities with additional filters?

Yep, that is something that we can look into too.

@rohitpaulk
Contributor Author

Investigation done; we now know what to focus on. Closing.

@chadwhitacre
Contributor

🕵️‍♂️

@chadwhitacre
Contributor

So satisfying! 💃

http://inside.gratipay.com/appendices/health

24 Hours

[screenshot: 24-hour response-time graph]

4 Weeks

[screenshot: 4-week response-time graph]
