This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Investigate spike in response time #4460

Closed
rohitpaulk opened this issue May 11, 2017 · 31 comments

@rohitpaulk
Contributor

Looks like the spike started after #4305 was deployed. Split out from #4459

@rohitpaulk
Contributor Author

#4459 (comment)

@rohitpaulk rohitpaulk self-assigned this May 11, 2017
@rohitpaulk
Contributor Author

notes on slack

@rohitpaulk
Contributor Author

A few stats to get started:

Gratipay serves an average of 300 requests/minute, i.e. 5 requests/sec (screenshot from Heroku attached)

[screenshot: Heroku metrics, 2017-05-11]

Our web server is gunicorn, and we run it with a single worker of the 'synchronous' worker type. If we think of Gratipay as a restaurant, we have a single waiter (worker, in gunicorn terms) serving multiple customers (requests, in gunicorn terms). Given 5 customers/sec and a waiter who can only serve one customer at a time, the waiter can keep up as long as each customer takes less than 200ms to serve. If the rate falls below 5 requests/sec, the waiter has idle time. If it rises above 5 requests/sec, some customers have to wait for the waiter to finish with other customers before they can be served.
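
In gunicorn terms, that single-waiter setup looks roughly like this (a minimal sketch of a gunicorn config file, not necessarily our actual configuration):

```python
# gunicorn.conf.py -- sketch of the setup described above (illustrative only).
worker_class = "sync"   # the default: each worker handles one request at a time
workers = 1             # a single "waiter"
```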

This can be demonstrated by hitting Gratipay concurrently with a sizable request (the homepage is a good example) and observing response times.
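
For illustration, here's a rough Python sketch of that kind of concurrency test (the measurements below were actually taken with ab; the URL, concurrency, and request count here are placeholders, not the exact values used):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://gratipay.com/"   # homepage; swap in /version.txt for a light endpoint
CONCURRENCY = 5                 # roughly our average request rate
TOTAL_REQUESTS = 50

def timed_get(_):
    # Time a single request from start to last byte received.
    start = time.time()
    requests.get(URL)
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    durations = list(pool.map(timed_get, range(TOTAL_REQUESTS)))

print("min %.0f ms  max %.0f ms  mean %.0f ms" % (
    min(durations) * 1000,
    max(durations) * 1000,
    sum(durations) / len(durations) * 1000,
))
```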

First, let's take github.com as an example of how a web server operating within capacity responds.

[screenshot: concurrent request timings against github.com]

Next, let's take the Gratipay homepage. Since we're nearly exhausting our capacity, the response times vary a lot because some requests are waiting on others to finish.

[screenshot: concurrent request timings against the Gratipay homepage]

Note, though, that this is only visible for endpoints with significant processing time. If the endpoint is light (gratipay.com/version.txt, for example), the effect can't be observed:

[screenshot: concurrent request timings against gratipay.com/version.txt]

@rohitpaulk
Contributor Author

rohitpaulk commented May 11, 2017

There are three solutions to this:

  • Increase the number of waiters in the restaurant (add a dyno, add a gunicorn worker, switch to a threaded server, etc.)
  • Decrease the time it takes to serve a customer (optimize expensive requests)
  • EDIT: As @mattbk mentions, a third option is to reduce the number of customers (drop unauthenticated requests past a certain rate)

@rohitpaulk
Contributor Author

rohitpaulk commented May 11, 2017

The first option is easier, as long as we've designed for that possibility. I remember an issue created about this... trying to fetch it

@rohitpaulk
Contributor Author

Related: #1098

@rohitpaulk
Contributor Author

Similar work was done in #1617; apparently ab on Mac has limitations?

@rohitpaulk
Contributor Author

apparently ab on Mac has limitations?

I'm not going down this route right now. If it works fine against github.com but not gratipay.com, it's unlikely to be an ab limitation.

@rohitpaulk
Contributor Author

rohitpaulk commented May 11, 2017

I remember an issue created about this... trying to fetch it

Hmm, still can't find it. I vaguely remember that it listed two reasons why we won't be able to run multiple dynos/processes:

  • Crons were locking in-memory (they should use the database). I think this is solved now
  • Something OAuth-related, where we store an authorization token in memory to be accessed between requests?

@mattbk
Contributor

mattbk commented May 11, 2017

Even allowing multiple requests for non-auth pages seems like it would help.

EDIT: I may be misunderstanding the problem.

@Changaco
Contributor

@rohitpaulk I believe #715 is the issue you're thinking of.

@rohitpaulk
Contributor Author

rohitpaulk commented May 11, 2017

Exactly, thanks.

@rohitpaulk
Contributor Author

Even allowing multiple requests for non-auth pages seems like it would help.

True. I'd treat that as a last resort, though; better if we don't have to drop requests :)

@rohitpaulk
Contributor Author

Edited #4460 (comment) to include ^

@rohitpaulk
Contributor Author

Now that #715 is closed, I think we can go ahead and add more capacity. I'm pretty confident that adding a dyno will work fine; not sure about increasing the worker count for gunicorn. Will wait for @whit537's confirmation on the latter - I'm going to go ahead and add another dyno for now.

@rohitpaulk
Contributor Author

Heroku apps running on Professional-tier dynos (any dyno type except Free or Hobby) can be scaled to run on multiple dynos simultaneously.

😞 Heroku removed support for horizontal scaling for dynos of the 'Hobby' type. Professional dynos start at $25/month; we're currently at $7/month.

Might have to look into the gunicorn worker count.

@mattbk
Contributor

mattbk commented May 11, 2017

Slack

Gittip's slowness mostly comes from non-optimized DB queries; adding dynos doesn't scale the DB AFAIK, so it potentially just makes it worse by increasing the rate of queries, iirc

ok i remember our scaling issues were because some tokens were held in memory vs database

@chadwhitacre
Contributor

If we're looking at the gunicorn worker count, we should also keep h1:203366 in mind.

@rohitpaulk
Contributor Author

rohitpaulk commented May 12, 2017

Gittip's slowness mostly comes from non-optimized DB queries; adding dynos doesn't scale the DB AFAIK, so it potentially just makes it worse by increasing the rate of queries

I think the situation was different then. Our baseline response time might be slower now because of non-optimized DB queries (customers who take a long time to order), but the huge spikes are not because of them - they are because of requests piling up (not enough waiters).

@rohitpaulk
Contributor Author

I've increased the gunicorn worker count to 2 on production. Will post results in a while.

@rohitpaulk
Contributor Author

I'm seeing a few spikes - will revert in 30 minutes if it doesn't settle down

@rohitpaulk
Contributor Author

Reverted, that didn't work out well. I'll figure out a way to try this in a staging environment.

@rohitpaulk
Contributor Author

I've created staging apps on my personal account before; going to experiment with Heroku pipelines now so that we can document how this is done.

@rohitpaulk
Contributor Author

After a lot of experimenting with staging, I'm coming to the conclusion that our spikes are mostly due to the homepage. I'm going to run a quick experiment on staging to prove it; I'll need to provision a Standard database for a couple of hours.

@rohitpaulk
Contributor Author

Confirmed. Most of the spikes correspond to periods when the homepage was accessed directly. Surprisingly, the vast majority of our requests are for widget.html and public.json, not the homepage.

Tests against staging with the same database as production:

[screenshot: response-time graph from staging tests]

(yellow is the response time within Aspen; green is the total response time, including time spent in Heroku)

@rohitpaulk
Contributor Author

rohitpaulk commented May 15, 2017

Two steps to fix this:

Note: I've already enabled async workers with a count of 2. Increasing the number isn't going to help here; we have so much data to throw down the wire that the worker count won't matter past a point.
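
For reference, the async-worker setup mentioned above would look roughly like this in a gunicorn config file (a sketch assuming gevent workers; not necessarily the exact production settings):

```python
# gunicorn.conf.py -- sketch of the async-worker setup described above.
# worker_class "gevent" requires the gevent package to be installed.
worker_class = "gevent"   # cooperative workers; many in-flight requests per worker
workers = 2
```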

@mattbk
Contributor

mattbk commented May 15, 2017

I like it. Rather than listing all the teams on the homepage, maybe we should beef up the search capabilities with additional filters?

@rohitpaulk
Contributor Author

Rather than listing all the teams on the homepage, maybe we should beef up the search capabilities with additional filters?

Yep, that is something that we can look into too.

@rohitpaulk
Contributor Author

Investigation done; we now know what to focus on. Closing.

@chadwhitacre
Contributor

🕵️‍♂️

@chadwhitacre
Contributor

So satisfying! 💃

http://inside.gratipay.com/appendices/health

24 Hours

[screenshot: 24-hour response-time graph]

4 Weeks

[screenshot: 4-week response-time graph]
