This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Something is really broken in our http serving #1617

Closed
zbynekwinkler opened this issue Oct 24, 2013 · 90 comments

@zbynekwinkler
Contributor

Something is, IMO, horribly broken in Gittip's HTTP serving. The file 'reset.css' is under 1 kB in size. When testing the site with 'ab', it is easy to see that no matter the number of threads, with just 20 parallel connections we cannot get more than 5 requests per second. That is about 4 kB/s transfer speed for a plain, small file fully cached in memory. BTW: that transfer rate is only about twice the rate at which we produce logs.

Just for perspective: I believe each browser uses up to 6 connections per domain, and the homepage makes 8 requests to the main site. When someone on Twitter with 100 followers sends out a tweet and 1 in 10 of them clicks the link, we get 10 users, 60 parallel connections, and 80 requests... and we slow down to a crawl :(

And I am not even mentioning that when hitting the homepage, the most db-optimized page on Gittip, we get not 5 but 1 request per second. So if the site does not die right away (that is the better case), we go to 10-15 s per request.
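The arithmetic in the tweet scenario above can be checked with a quick back-of-the-envelope script (the numbers are the ones from this comment, not new measurements):

```python
# Load estimate from the tweet scenario above; assumptions as stated
# in the comment (6 connections per browser, 8 requests per homepage).
followers = 100
click_rate = 0.10

users = int(followers * click_rate)   # 10 visitors arrive at once
connections = users * 6               # ~6 parallel connections per browser
requests = users * 8                  # homepage makes 8 requests

print(users, connections, requests)   # 10 60 80

# At the observed ceiling of ~5 requests/second:
print(requests / 5)                   # 16.0 seconds to drain the burst
```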

@bruceadams
Contributor

I see essentially the same results. The maximum transfer speed I'm seeing is about 6 Kbytes/sec, and I get about that at any concurrency (ab -c#) from 2 up through 20. With a single connection, I get 5 Kbytes/sec.

@zbynekwinkler
Contributor Author

The best thing we can do right now is consolidate the CSS and JS files and put all static assets on a CDN. That would bring the request count per page down to 1. That alone would be an eightfold improvement for users with a cold cache (the storm of new visitors we are expecting on Friday, after Chad's interview is out).

@zbynekwinkler
Contributor Author

I guess what I would really like is to get performance like this (seznam.cz is a fairly high-profile site in the Czech Republic, something like the Czech Google):

C:\home\zbynek\www.gittip.com>ab.exe -c20 -n100 http://www.seznam.cz/st/3.33.21/css/medium.css?3.33.21
This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking www.seznam.cz (be patient).....done


Server Software:        nginx/1.4.2
Server Hostname:        www.seznam.cz
Server Port:            80

Document Path:          /st/3.33.21/css/medium.css?3.33.21
Document Length:        4167 bytes

Concurrency Level:      20
Time taken for tests:   0.773 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      450000 bytes
HTML transferred:       416700 bytes
Requests per second:    129.35 [#/sec] (mean)
Time per request:       154.616 [ms] (mean)
Time per request:       7.731 [ms] (mean, across all concurrent requests)
Transfer rate:          568.45 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        2    7   7.8      4      37
Processing:    13   98  53.2     85     347
Waiting:        5   62  60.0     50     347
Total:         22  105  59.7     89     379

Percentage of the requests served within a certain time (ms)
  50%     89
  66%     97
  75%    104
  80%    110
  90%    166
  95%    268
  98%    354
  99%    379
 100%    379 (longest request)

but even something like this would be ok :).

C:\home\zbynek\www.gittip.com>ab.exe -c20 -n100 http://assets-gittipllc.netdna-ssl.com/10.1.36/hero-bg.png
This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking assets-gittipllc.netdna-ssl.com (be patient).....done


Server Software:        NetDNA-cache/2.2
Server Hostname:        assets-gittipllc.netdna-ssl.com
Server Port:            80

Document Path:          /10.1.36/hero-bg.png
Document Length:        9746 bytes

Concurrency Level:      20
Time taken for tests:   4.849 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      1002800 bytes
HTML transferred:       974600 bytes
Requests per second:    20.62 [#/sec] (mean)
Time per request:       969.897 [ms] (mean)
Time per request:       48.495 [ms] (mean, across all concurrent requests)
Transfer rate:          201.94 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       39   47  11.7     41      87
Processing:    69  786 203.6    781    1187
Waiting:       65  458 258.1    448    1187
Total:        130  834 203.4    822    1252

Percentage of the requests served within a certain time (ms)
  50%    822
  66%    840
  75%    898
  80%    956
  90%   1102
  95%   1188
  98%   1235
  99%   1252
 100%   1252 (longest request)

@bruceadams
Contributor

This issue appears to be fundamental to Aspen. Running gittip locally (which is difficult far too often, see #1619) I get worse throughput with more requesting threads.

$ ab -n100 -c1 http://127.0.0.1:8537/assets/reset.css | grep '(mean)'
Requests per second:    14.73 [#/sec] (mean)
Time per request:       67.910 [ms] (mean)
$ ab -n100 -c2 http://127.0.0.1:8537/assets/reset.css | grep '(mean)'
Requests per second:    16.75 [#/sec] (mean)
Time per request:       119.411 [ms] (mean)
$ ab -n100 -c4 http://127.0.0.1:8537/assets/reset.css | grep '(mean)'
Requests per second:    5.83 [#/sec] (mean)
Time per request:       686.101 [ms] (mean)
$ ab -n100 -c8 http://127.0.0.1:8537/assets/reset.css | grep '(mean)'
Requests per second:    4.86 [#/sec] (mean)
Time per request:       1646.592 [ms] (mean)

The CPU on my 4-core laptop never went anywhere near 100%. My theory is that Aspen is, in effect, single-threaded.

This behavior is much like Rails not too long ago. Heroku has made a lot of money because they allow one-request-at-a-time Rails apps to scale up easily with many dynos.

I assume the quickest (and not cheap) fix is to add more Gittip Aspen servers. This requires some code fixes, to move user session handling (for Twitter and Bitbucket) out of memory.

@zbynekwinkler
Contributor Author

What is your value of MIN_THREADS in local.env? In production we are now running with 40.

Under the hood, Aspen uses the HTTP server from CherryPy (Cheroot). I've never looked at it closely, so I don't know anything about it. When I get home I am planning to try a simple WSGI hello-world app under different WSGI containers/servers (like gunicorn or uwsgi) locally, to see what difference the server makes.
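For reference, a hello-world WSGI app is only a few lines; something like this sketch (in the spirit of the Gunicorn quickstart, not Gittip code) is enough to compare servers:

```python
# Minimal WSGI application for isolating server overhead.
# Run it under different servers, e.g.:
#   gunicorn hello:app
#   uwsgi --http :8000 --wsgi-file hello.py
def app(environ, start_response):
    body = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```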

@bruceadams
Contributor

I hadn't even thought about MIN_THREADS. Looking now, it's set to the default of MIN_THREADS=10.

@clone1018
Contributor

I've been saying for a while that Aspen is a little slow. I wanted to do tests like the ones @zwn will be doing, but I never found the time. Good luck, let us know what happens!

@bruceadams
Contributor

Testing a single static page on a vanilla Aspen server gives much better performance. It is fast enough that I'm not sure I can tell how well it processes requests in parallel. (These tests were run on my 4-core i5 laptop.)

ba:~$ ab -n1000 -c1 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    793.02 [#/sec] (mean)
Time per request:       1.261 [ms] (mean)
ba:~$ ab -n1000 -c2 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    441.07 [#/sec] (mean)
Time per request:       4.534 [ms] (mean)
ba:~$ ab -n1000 -c4 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    383.14 [#/sec] (mean)
Time per request:       10.440 [ms] (mean)
ba:~$ ab -n1000 -c8 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    447.95 [#/sec] (mean)
Time per request:       17.859 [ms] (mean)
ba:~$ ab -n1000 -c16 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    490.06 [#/sec] (mean)
Time per request:       32.649 [ms] (mean)

@clone1018
Contributor

@bruceadams was your test using postgression for your local Gittip?

@bruceadams
Contributor

@clone1018 yes, I used Postgression. I don't expect fetching a static page to exercise the database. I could capture database network traffic to be sure.

@clone1018
Contributor

Since we serve the static page via aspen as a webserver, is there a chance we're opening a database connection?

@chadwhitacre
Contributor

We should also look at the effect of logging on performance. I have a tingly sense that lots of stdout chatter could slow us down.

@bruceadams
Contributor

Logging has the potential to have a huge impact on performance. The best logging frameworks jump through all kinds of hoops to avoid synchronization, excessive object creation, and other overhead.
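One common hoop, for illustration: hand records off to a background thread so the request path never blocks on log I/O. A minimal sketch using the Python 3 stdlib's QueueHandler/QueueListener (not what Gittip was running at the time):

```python
import logging
import logging.handlers
import queue

# Collect formatted messages in a list so the effect is observable here;
# in a real app the listener's handler would write to stdout or a file.
records = []

class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record.getMessage())

log_queue = queue.Queue(-1)
listener = logging.handlers.QueueListener(log_queue, ListHandler())
listener.start()                    # background thread drains the queue

logger = logging.getLogger("app")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)

logger.info("request served")       # fast: just enqueues the record
listener.stop()                     # joins the worker, flushing the queue

print(records)  # ['request served']
```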

@chadwhitacre
Contributor

Since we serve the static page via aspen as a webserver, is there a chance we're opening a database connection?

If we are it'd be in a hook, and the places to check would be in configure-aspen.py and gittip/wireup.py.

@zbynekwinkler
Contributor Author

@bruceadams I am not getting the same results as you.

$ ab -c1 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    166.02 [#/sec] (mean)
Time per request:       6.024 [ms] (mean)
$ ab -c2 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    117.18 [#/sec] (mean)
Time per request:       17.068 [ms] (mean)
$ ab -c4 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    87.36 [#/sec] (mean)
Time per request:       45.790 [ms] (mean)
$ ab -c8 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    78.20 [#/sec] (mean)
Time per request:       102.305 [ms] (mean)
$ ab -c16 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    77.45 [#/sec] (mean)
Time per request:       206.598 [ms] (mean)

But the pattern is clear: the more concurrency, the worse. I also have a 4-core i5 laptop. I am only able to get to about 50% CPU utilization in total.

@bruceadams
Contributor

@zwn interesting. Do you have a local database? If yes, your faster results suggest the database is being hit when serving this static asset. (I haven't had a chance to check.) (Also, I could not use "localhost" because my machine resolves "localhost" to IPv6 and something misbehaved.)

@zbynekwinkler
Contributor Author

@bruceadams I have a local db. I am not sure what work we do with the db per request (I have looked, but we need some instrumentation to be sure).
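One cheap form of instrumentation would be to wrap the DB cursor and count execute() calls per request. Everything below is hypothetical (the names are invented for the sketch), but the wrapping technique applies to any DB-API-style cursor:

```python
import threading

# Thread-local counter so concurrent requests don't mix their counts.
_local = threading.local()

class CountingCursor:
    """Delegates to a real DB-API cursor, counting execute() calls."""

    def __init__(self, real_cursor):
        self._real = real_cursor

    def execute(self, sql, params=None):
        _local.queries = getattr(_local, "queries", 0) + 1
        return self._real.execute(sql, params)

    def __getattr__(self, name):
        # Pass fetchall(), close(), etc. through to the real cursor.
        return getattr(self._real, name)

def queries_this_request():
    """How many queries the current thread has issued (reset per request)."""
    return getattr(_local, "queries", 0)
```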

@zbynekwinkler
Contributor Author

Running gunicorn -w 10 with the default hello-world app from http://docs.gunicorn.org/en/latest/run.html:

$ ab -c1 -n100 http://localhost:8000/ | grep '(mean)'
Requests per second:    1291.87 [#/sec] (mean)
Time per request:       0.774 [ms] (mean)
$ ab -c10 -n100 http://localhost:8000/ | grep '(mean)'
Requests per second:    5396.07 [#/sec] (mean)
Time per request:       1.853 [ms] (mean)
$ ab -c100 -n100 http://localhost:8000/ | grep '(mean)'
Requests per second:    3472.34 [#/sec] (mean)
Time per request:       28.799 [ms] (mean)

That is a different ballpark.

@chadwhitacre
Contributor

It is possible to mount Aspen as a WSGI app on gunicorn:

http://aspen.io/deployment/

@bruceadams
Contributor

Watching network traffic to the database: yes, there is a round trip to the database for every fetch of assets/reset.css. I'll see if I can figure out where that is in the code (by stumbling around in the dark).

@chadwhitacre
Contributor

@bruceadams It's probably in a hook somewhere. Actually, it's probably related to authentication.

@chadwhitacre
Contributor

I'm in the process of setting up a load testing environment locally. I have the FreeBSD machine I set up while my laptop was out of commission. I'm installing ab there to hit Gittip on my laptop over the LAN.

@chadwhitacre
Contributor

We removed the extra db hit per request in #1626. However, I'm still seeing ~3 req/sec for reset.css in production, as we were the other day.

@chadwhitacre
Contributor

For kicks, I stood up an instance of Gittip at Digital Ocean (smallest droplet, NY1). It's reading from the production database and is configured with MIN_THREADS=40 and DATABASE_MAXCONN=100, as in production. I edited configure-aspen.py by hand to turn off the call to utils.update_homepage_tables_once.

In general, I'm seeing ~115 r/s for /assets/reset.css and ~9 r/s for /. From our Heroku dyno we're getting ~6 and ~3, respectively. In other words, we're faster from the DO droplet than we are from the Heroku dyno, even though both are reading from the same Heroku Postgres database.

Two possible ways to account for the discrepancy are SSL and a baseline load of ~2 req/sec, which are present on the Heroku dyno but not the DO droplet.

@chadwhitacre
Contributor

Testing locally I am seeing ~177 r/s for /assets/reset.css and 19 r/s for /. That's with a recent backup in a Postgres on a different machine on the same LAN.

@chadwhitacre
Contributor

I ran two ab commands at once, one from the laptop and another from the FreeBSD machine. The FreeBSD machine is serving Gittip connected to Postgres on the laptop. I set up a slow -n300 -c1 trickle against / on the laptop, then I fired off a -n100 -c20 against /assets/reset.css from the FreeBSD machine (I get lots of failed requests in such circumstances, but some runs complete). I'm seeing 130 r/s for /assets/reset.css even when the server is under load for /, and meanwhile the / run sees 11 r/s.

This result seems to argue against the "baseline load" explanation for the slowness of reset.css at Heroku.

@chadwhitacre
Contributor

Using Gunicorn locally, I am seeing similar results as with Cheroot: ~170 r/s for /assets/reset.css and ~20 r/s for /.

Edit: I may very well not have been running Gunicorn with multiple processes, cf. #1617 (comment).

@chadwhitacre
Contributor

Which, I suppose, is all to say that, yes, something is really broken in our http serving.

💔

@chadwhitacre
Contributor

BTW, here are the steps I followed to run Gittip with Gunicorn on port 8537:

  • [gittip] $ pip install gunicorn
  • edit local.env and add ASPEN_WWW_ROOT=www and ASPEN_PROJECT_ROOT=.
  • [gittip] $ foreman run -e local.env gunicorn aspen.wsgi:website -b:8537

@bruceadams
Contributor

I can see https://help.heroku.com/tickets/101694 (and don't have time to pay attention to this right now).

@chadwhitacre
Contributor

(Thanks for the confirmation, @bruceadams.)

@chadwhitacre
Contributor

You probably won't see any dramatic performance in an ab test as that's just testing loading of the homepage content

Actually, the test we're concerned with here is this one (from your test run):

$ ab -n300 -c20 -q https://www.gittip.com/assets/reset.css | grep Requests
Requests per second:    6.15 [#/sec] (mean)

That tests loading a static page, not the homepage content. The expected result is to see an order of magnitude more requests per second here.

@neilmiddleton

Sure. Does this 'static' file get routed through the app process maybe? This is probably where @kennethreitz can help more.

@chadwhitacre
Contributor

which seems to be the original issue here.

The original issue as I understand it is:

The file 'reset.css' is under 1kB in size. When testing the site with 'ab' it is easy to see that no matter the number of threads, with just 20 parallel connections we cannot get more than 5 requests per second.

@chadwhitacre
Contributor

Does this 'static' file get routed through the app process maybe?

Yup. We've been using that as a benchmark to understand the performance characteristics of our app within Heroku. Our reasoning is that serving a static file out of our app server represents an upper bound on the performance we can expect. If we can only get ~8 req/sec for a static file from our app server at Heroku, then we're not going to see more than that for dynamic pages.

For comparison, we are seeing performance from 64 to 177 req/sec for the same /assets/reset.css file from the app server in other contexts. Here's what our results look like so far:

If we vary the WSGI server we're using to gunicorn, we get:

@chadwhitacre
Contributor

I just remembered that we have a QA instance of the app running at https://gittip.whit537.org/. I'm going to dust it off and see what we get there.

@chadwhitacre
Contributor

Okay!

  • 72—QA at Heroku

@chadwhitacre
Contributor

New hypothesis: we're hitting limitations with ab (cf.). Now looking at wrk ... IRC

@aisipos
Contributor

aisipos commented Nov 4, 2013

Sorry if this isn't strictly germane to this ticket, but has anyone run gittip.com through http://webpagetest.org?
It doesn't measure throughput, but it can highlight many web-page performance issues. I ran it on the URL 'gittip.com' and got these results:
http://www.webpagetest.org/result/131104_BM_VS8/1/details

A couple of things I noted:

Hopefully webpagetest can help identify and fix some performance issues.

@chadwhitacre
Contributor

Thanks @aisipos. I agree that it makes sense to do an outside-in performance audit, though per #1555 we're already aware of some obvious culprits. Let's use this ticket for understanding our basic performance characteristics at Heroku, and work on client-facing performance on #1555.

@chadwhitacre
Contributor

New results using wrk:

[screenshot of wrk results, 2013-11-04]

@chadwhitacre
Contributor

The question, in order to close #1617, is whether we are satisfied with this explanation of the difference for /assets/reset.css between production and QA: the baseline production traffic, a fair amount of which hits the database. #1617 is about a crisis in HTTP serving.

IRC

@chadwhitacre
Contributor

We ran a test where @zwn loaded up gittip.whit537.org with this command:

wrk -t2 -c5 -d60 https://gittip.whit537.org/

He achieved ~6 req/sec over the 60 seconds, which is higher than our average load in production.

I then ran wrk -t2 -c20 -d5 http://gittip.whit537.org/assets/reset.css four times, with 67, 122, 163, and 218 req/sec as the results. This seems to indicate either that our test isn't adequate, or that the hypothesis that baseline load in production accounts for the difference with QA is wrong.

@chadwhitacre
Contributor

On a tip from @clone1018 (IRC), I just did a heroku restart and then benchmarked production. I saw 88, 100, 84, 64, 77, 46, 79. However, we had so much variation in the test results on the spreadsheet (DO hitting production /assets/reset.css) that I don't trust the comparison to the 34 figure above. I would want to take more samples after letting the app run for a day before venturing a comparison between cold and hot.

@chadwhitacre
Contributor

Hi,
Runtime have had a look at this and we can't see any issues with the runtimes you're on. It seems that the issues here are purely down to the application itself. This could be due to some characteristic of your app which doesn't suit Heroku, it could be something else. However, we're sure there's no Heroku issue as such.
I hope this helps. Let me know if you have any questions.
Neil

@chadwhitacre
Contributor

Thanks Neil. We'll keep working on it on #1617. It looks like some (but not all) of the issue is explained by using ab (we're now using wrk).

@chadwhitacre
Contributor

Let's see if we can land this.

@chadwhitacre
Contributor

IRC

@chadwhitacre
Contributor

Whoa! Now I'm seeing ~200 req/sec from production:

root@loader:~/wrk# ./wrk -t2 -c20 -d5 https://www.gittip.com/assets/reset.css
Running 5s test @ https://www.gittip.com/assets/reset.css
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    66.36ms   71.39ms 394.61ms   87.50%
    Req/Sec   110.92     32.30   176.00     64.58%
  1053 requests in 5.01s, 1.17MB read
Requests/sec:    210.30
Transfer/sec:    239.72KB

root@loader:~/wrk# ./wrk -t2 -c20 -d5 https://www.gittip.com/assets/reset.css
Running 5s test @ https://www.gittip.com/assets/reset.css
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    94.92ms  215.24ms   2.16s    95.76%
    Req/Sec   112.51     33.23   172.00     64.41%
  1064 requests in 5.02s, 1.18MB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:    211.96
Transfer/sec:    241.61KB

root@loader:~/wrk# ./wrk -t2 -c20 -d5 https://www.gittip.com/assets/reset.css
Running 5s test @ https://www.gittip.com/assets/reset.css
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    94.25ms  148.62ms   1.13s    95.24%
    Req/Sec    88.05     28.31   143.00     64.29%
  848 requests in 5.00s, 0.94MB read
Requests/sec:    169.46
Transfer/sec:    193.14KB
root@loader:~/wrk#

@chadwhitacre
Contributor

Comparable performance from QA:

root@loader:~/wrk# ./wrk -t2 -c20 -d5 https://gittip.whit537.org/assets/reset.css
Running 5s test @ https://gittip.whit537.org/assets/reset.css
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   173.27ms  253.72ms   1.39s    90.38%
    Req/Sec    77.12     36.79   152.00     61.54%
  786 requests in 5.01s, 0.88MB read
Requests/sec:    157.02
Transfer/sec:    179.62KB

root@loader:~/wrk# ./wrk -t2 -c20 -d5 https://gittip.whit537.org/assets/reset.css
Running 5s test @ https://gittip.whit537.org/assets/reset.css
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   109.86ms  114.58ms 487.90ms   82.14%
    Req/Sec    96.04     24.72   137.00     64.29%
  967 requests in 5.00s, 1.08MB read
Requests/sec:    193.21
Transfer/sec:    221.04KB

root@loader:~/wrk# ./wrk -t2 -c20 -d5 https://gittip.whit537.org/assets/reset.css
Running 5s test @ https://gittip.whit537.org/assets/reset.css
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   103.67ms  144.18ms 792.95ms   88.71%
    Req/Sec    93.32     28.03   149.00     61.29%
  920 requests in 5.01s, 1.03MB read
Requests/sec:    183.49
Transfer/sec:    209.86KB
root@loader:~/wrk#

@kennethreitz
Contributor

What changed?

@chadwhitacre
Contributor

@kennethreitz You tell us? :-)

Very little in the way of code, just some JavaScript cleanup (irrelevant to /assets/reset.css).

@chadwhitacre
Contributor

Per @kennethreitz in IRC, "noisy neighbors" (or other AWS weather factors?) could explain the difference between QA and production we were seeing at #1617 (comment). This explanation is consistent with the behavior at #1617 (comment), where performance in production increased after a restart, as well as the better performance (and consistent between QA and production) we're seeing today.

@chadwhitacre
Contributor

So the worst of the problem is explained by using ab and is fixed by using wrk, and the (non-critical) remainder of the problem is explained by AWS and/or Heroku weather, and can be worked around ad hoc by restarting the application.

@zwn We good to close?

@chadwhitacre
Contributor

Closing per IRC.

@chadwhitacre
Contributor

💃

@clone1018
Contributor

!m *
