Investigate spike in response time #4460

Looks like the spike started after #4305 was deployed. Split out from #4459.
A few stats to get started: Gratipay does an average of 300 requests/minute, i.e. 5 requests/sec (screenshot from Heroku attached). Our web server is gunicorn, and we run it with a single worker of the 'synchronous' worker type.

If we look at Gratipay as a restaurant, we have a single waiter (worker, in gunicorn terms) for multiple customers (requests, in gunicorn terms). Given 5 customers/sec and a waiter who can only serve one customer at a time, the waiter can keep up as long as each customer takes less than 200ms to serve (1 second / 5 = 200ms per request). If the request rate falls below 5/sec, the waiter has idle time. If it goes above 5/sec, some customers have to wait for the waiter to finish with other customers before they can be served.

This can be proven by hitting Gratipay concurrently with a sizable request (the homepage is a good example) and observing response times. First, take github.com as an example of how a web server within capacity responds. Next, take the Gratipay homepage: since we're almost exhausting our capacity, the response times vary a lot because certain requests are waiting on others to finish. Note, though, that this is only visible for endpoints with significant processing time; if the endpoint is light, the queueing delay is too small to notice.
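The experiment described above can be reproduced with a few lines of Python. This is a minimal sketch under my own assumptions (the target URL, the concurrency level, and the use of the third-party `requests` library are not from the thread); against a server with a single synchronous worker, the spread between the fastest and slowest responses should be wide.

```python
# Sketch: hit one endpoint with several concurrent requests and compare timings.
from concurrent.futures import ThreadPoolExecutor
import time

import requests  # third-party; pip install requests

URL = "https://gratipay.com/"   # placeholder: any endpoint with noticeable processing time
CONCURRENCY = 10                # placeholder: number of simultaneous "customers"

def timed_get(_):
    start = time.time()
    requests.get(URL)
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    durations = sorted(pool.map(timed_get, range(CONCURRENCY)))

# With a single synchronous worker on the server, later requests queue behind
# earlier ones, so the slowest time ends up far above the fastest one.
for seconds in durations:
    print("%.0f ms" % (seconds * 1000))
```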
There are two ways to fix this: run more workers, or make the slow requests faster.
The former is easier, as long as we've designed for that possibility. I remember an issue created about this... trying to fetch it.
Related: #1098
Similar work done in #1617, apparently.
I'm not going down this route right now - if it works fine on GitHub.com but not on Gratipay.com, it's unlikely to be an issue outside our own setup.
Hmm, still can't find it. I vaguely remember that it listed two reasons why we wouldn't be able to run multiple dynos/processes.
Even allowing multiple requests for non-auth pages seems like it would help. EDIT: I may be misunderstanding the problem.
@rohitpaulk I believe #715 is the issue you're thinking of.
Exactly, thanks.
True. I'd treat that as a last resort though - better if we don't have to drop requests :)
Edited #4460 (comment) to include ^
😞 Heroku removed support for horizontal scaling for 'Hobby' dynos. Professional dynos start at $25/month; we're currently at $7/month. Might have to look into the gunicorn worker count.
If we're looking at the gunicorn worker count, we should also have h1:203366 in mind.
I think the situation was different then. Our baseline response time might be slower now because of non-optimized DB queries (customers who take a long time to order), but the huge spikes are not because of them - they happen because requests are piling up (not enough waiters).
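To illustrate the "not enough waiters" point, here is a toy single-worker queue simulation; the numbers are illustrative assumptions, not measurements from Gratipay. Every request is served in a fixed 180 ms (under the 200 ms budget), yet random bursts of arrivals at ~5 requests/sec are enough to push the slowest responses far above the baseline.

```python
# Toy single-worker queue: fixed service time, Poisson arrivals (illustrative numbers).
import random

random.seed(1)

ARRIVAL_RATE = 5.0    # requests per second, roughly Gratipay's average
SERVICE_TIME = 0.18   # seconds per request, under the 200 ms budget

now = 0.0
worker_free_at = 0.0
response_times = []

for _ in range(10000):
    now += random.expovariate(ARRIVAL_RATE)   # next request arrives
    start = max(now, worker_free_at)          # it waits if the single worker is busy
    worker_free_at = start + SERVICE_TIME
    response_times.append(worker_free_at - now)

response_times.sort()
print("median: %4.0f ms" % (1000 * response_times[len(response_times) // 2]))
print("p95:    %4.0f ms" % (1000 * response_times[int(0.95 * len(response_times))]))
print("max:    %4.0f ms" % (1000 * response_times[-1]))
```

The median stays close to the 180 ms baseline while the tail grows by multiples of it, which matches the shape of the spikes: the slow part is the waiting, not the serving.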
I've increased the gunicorn worker count to 2 on production. Will post results in a while.
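For reference, the worker count and worker type are standard gunicorn settings. Here is a sketch of what the change could look like in a `gunicorn.conf.py`; the file name and layout are assumptions - Gratipay may well pass these as command-line flags in its Procfile instead.

```python
# gunicorn.conf.py -- sketch only, not Gratipay's actual configuration
worker_class = "sync"   # the 'synchronous' worker type described earlier
workers = 2             # bumped from 1; each sync worker serves one request at a time
```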
I'm seeing a few spikes - will revert in 30 minutes if it doesn't settle down.
Reverted, that didn't work out well. I'll figure out a way to try this in a staging environment.
I've created staging apps on my personal account before, going to experiment with Heroku pipelines now so that we can document how this is done.
After a lot of experimenting with staging, I'm coming to the conclusion that our spikes are mostly due to the homepage. I'm going to run a quick experiment on staging to prove it - will need to provision a Standard database for a couple of hours.
Confirmed. Most of the spikes correspond with periods when the homepage was accessed directly. Surprisingly, a vast majority of our requests are for …
Tests against staging with the same database back this up (yellow is the response time within Aspen, green is the total response time including time spent in Heroku).
Two steps to fix this: switch to async workers so a slow homepage response can't block other requests, and cut down the amount of data the homepage sends.
Note: I've already enabled async workers with a count of 2. Increasing the number isn't going to help here - we have so much data to throw down the wire that the worker count won't matter past a point.
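The thread doesn't say which async worker class was enabled, so the value below is an assumption; in gunicorn terms the switch looks roughly like this.

```python
# gunicorn.conf.py -- sketch of an async-worker setup; the worker class actually
# used on Gratipay (gevent, eventlet, ...) isn't stated in the thread.
worker_class = "gevent"   # assumption: one of gunicorn's async worker types
workers = 2               # as noted above, more workers won't help much once the
                          # bottleneck is the volume of data being sent
```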
I like it. Rather than listing all the teams on the homepage, maybe we should beef up the search capabilities with additional filters?
Yep, that is something that we can look into too.
Investigation done, we now know what to focus on. Closing.
🕵️‍♂️
So satisfying! 💃 http://inside.gratipay.com/appendices/health (24 Hours / 4 Weeks charts)