Benchmark polling system powered Ember and friends #3692

armanbilge opened this issue Jun 14, 2023 · 11 comments

@armanbilge
Member

armanbilge commented Jun 14, 2023

I've published a snapshot of FS2 I/O that integrates with the new polling system (based on CE 3.6-e9aeb8c).

libraryDependencies += "co.fs2" %% "fs2-io" % "3.8-1af22dd"

You can drop this dependency into any application using the latest stable versions of http4s Ember, Skunk, etc. Please try it out and report back! We would really appreciate it 😊

Follow-up to:

armanbilge added this to the v3.6.0 milestone Jun 14, 2023
@armanbilge
Member Author

CI is green in both http4s and Skunk.

@armanbilge
Member Author

I wrote the laziest server 😅

//> using dep co.fs2::fs2-io::3.8-1af22dd
//> using dep org.http4s::http4s-ember-server::0.23.20

import cats.effect.*
import org.http4s.ember.server.EmberServerBuilder

object App extends IOApp.Simple:
  def run = EmberServerBuilder.default[IO].build.useForever

Then I benchmarked it with hey (after some warmup). Looks like a 20% improvement!

FS2 3.7.0

Summary:
  Total:        30.0515 secs
  Slowest:      0.0634 secs
  Fastest:      0.0001 secs
  Average:      0.0028 secs
  Requests/sec: 17845.4902

FS2 3.8-1af22dd

Summary:
  Total:        30.0243 secs
  Slowest:      0.1221 secs
  Fastest:      0.0001 secs
  Average:      0.0023 secs
  Requests/sec: 21782.6589
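
For reference, the builder above answers every request with a 404. A variant that actually returns a body (closer in spirit to the plaintext benchmarks discussed further down) could look like the sketch below; the http4s-dsl route and dependency are illustrative additions, not the code that produced the numbers above.

//> using dep co.fs2::fs2-io::3.8-1af22dd
//> using dep org.http4s::http4s-ember-server::0.23.20
//> using dep org.http4s::http4s-dsl::0.23.20

import cats.effect.*
import org.http4s.HttpRoutes
import org.http4s.dsl.io.*
import org.http4s.ember.server.EmberServerBuilder
import org.http4s.implicits.*

object HelloApp extends IOApp.Simple:
  // A single plaintext route, so the server responds 200 with a small body
  // instead of Ember's default 404.
  val routes = HttpRoutes.of[IO] { case GET -> Root => Ok("Hello, world!") }

  def run =
    EmberServerBuilder
      .default[IO]
      .withHttpApp(routes.orNotFound)
      .build
      .useForever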

@armanbilge
Member Author

Thanks to @ChristopherDavenport for doing some additional benchmarking! tl;dr Ember has now surpassed Blaze 🔥

https://gist.github.com/ChristopherDavenport/2e5ad15cd293aa0816090f8677b8cc3b

@danicheg
Member

w00t! 🔥

Ember has now surpassed Blaze 🔥

On throughput but not latency, right? Also, has something changed in Ember's latency? cc @ChristopherDavenport. Or maybe I'm just misreading the results...

Ember Before

Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.19ms   17.05ms 487.71ms   98.89%
    Req/Sec     5.05k     2.51k   11.17k    61.68%

Ember After

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.98ms   55.97ms   1.30s    98.65%
    Req/Sec     7.06k     3.41k   14.89k    66.11%

@ChristopherDavenport
Member

Nothing that I'm aware of on the server, which is what this tested. These were just a couple of one-off tests on my local machine. It's very possible I screwed something up by actually using the machine for something else during the run (although I tried not to).

To get more concrete answers we'd need to actually spin something up and test properly with something like Gatling, and for latency I'd need to switch from wrk to wrk2. I'll see if I can run the latter later.

@wjoel
Contributor

wjoel commented Jun 16, 2023

I've been spending quite a bit of time on this today, and I'm not seeing anything like a 20% improvement. It's possible that results are much better at the low levels of concurrency that hey seems to use; to be honest, that program does not seem very well suited to this kind of testing, and I'd suggest using wrk instead and trying different numbers of connections to get a better idea of the performance changes. Maybe performance has also improved more when using fewer threads? I'm not really sure. My results are from a Ryzen 5900X (12 cores, 24 threads) on a hot summer day.

The TFB results are quite similar, but it looks like there's a small improvement when enabling polling. There's some variance, perhaps more than usual because it's a very warm day here in Oslo (a bit over 30 degrees Celsius), so I ran it several times:
1.0-416dd46-SNAPSHOT baseline https://www.techempower.com/benchmarks/#section=test&shareid=882fc784-4260-45e9-a6ae-e875b407e191
1.0-416dd46-SNAPSHOT baseline take 2 https://www.techempower.com/benchmarks/#section=test&shareid=e18dfeb9-b66c-4835-805d-8ff23724be29
1.0-416dd46-SNAPSHOT polling https://www.techempower.com/benchmarks/#section=test&shareid=468d67ed-e10b-41e7-ae13-93c0534f9c18
1.0-416dd46-SNAPSHOT polling take 2 https://www.techempower.com/benchmarks/#section=test&shareid=59b9fc19-d633-4ccf-85a1-16589eb1c42a

I was wondering if maybe there was some important fix in 0.23 that hadn't been merged into main yet, so I also ran it with 0.23.20, but the overall picture is similar.
0.23.20 baseline https://www.techempower.com/benchmarks/#section=test&shareid=d39c3e94-5a3c-472a-9a63-168e350b85e9
0.23.20 baseline take 2 https://www.techempower.com/benchmarks/#section=test&shareid=7bf77b3c-ae15-4702-966a-405213e6fe8f
0.23.20 polling https://www.techempower.com/benchmarks/#section=test&shareid=e8406980-2a06-4cce-824d-398e73a15e92
0.23.20 polling take 2 https://www.techempower.com/benchmarks/#section=test&shareid=fca65726-83cc-4df2-95d5-ebbb1c1b6666

I then ran the Ember benchmark ("simplest") by @ChristopherDavenport, using wrk with 300 connections and otherwise the same settings as TFB uses. I first ran it with Eclipse Temurin JDK 17 (just starting RealApp from within IntelliJ, with -Xmx4g -Xms2g as VM options) but didn't see much difference between polling and not, so I then switched to Azul Zulu Community JDK 20 and explored further. There sometimes seems to be quite a bit of variance when running these tests (probably GC, but I'm not sure), so I ran them 10 times:
simplest ember baseline, Azul 20, -Xmx4g -Xms2g
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ | grep Requests/sec ; done
Requests/sec: 111094.77
Requests/sec: 116257.67
Requests/sec: 118451.51
Requests/sec: 120371.12
Requests/sec: 120780.44
Requests/sec: 117579.88
Requests/sec: 116879.41
Requests/sec: 119444.93
Requests/sec: 113217.95
Requests/sec: 116717.88
simplest ember polling, Azul 20, -Xmx4g -Xms2g
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ | grep Requests/sec ; done
Requests/sec: 119709.67
Requests/sec: 120327.51
Requests/sec: 116398.66
Requests/sec: 119436.00
Requests/sec: 120864.85
Requests/sec: 116291.97
Requests/sec: 118581.67
Requests/sec: 116910.25
Requests/sec: 116744.99
Requests/sec: 120922.49

They seem to be close enough that I'd be hesitant to call it anything other than a draw.

Enabling pipelining (as TFB does) improves the numbers and seems to make them more consistent, but there's still not much of a difference when using polling:
simplest ember baseline, Azul 20, -Xmx4g -Xms2g, with pipelining and -- 24:
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 24 | grep Requests/sec ; done
Requests/sec: 188395.82
Requests/sec: 187820.24
Requests/sec: 187938.91
Requests/sec: 187871.00
Requests/sec: 187557.91
Requests/sec: 187908.12
Requests/sec: 186906.13
Requests/sec: 187941.53
Requests/sec: 187964.85
Requests/sec: 188687.31
simplest ember polling, Azul 20, -Xmx4g -Xms2g, with pipelining and -- 24:
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 24 | grep Requests/sec ; done
Requests/sec: 188291.70
Requests/sec: 188193.24
Requests/sec: 188321.19
Requests/sec: 188406.69
Requests/sec: 188796.68
Requests/sec: 187570.31
Requests/sec: 188174.94
Requests/sec: 181370.82
Requests/sec: 186937.12
Requests/sec: 188360.46

Results get better, but variance increases, when running with -- 16 as the argument instead (which I guess has to do with the number of threads used when making requests, but I haven't checked), and the variance is lower when using polling, though that may have been a coincidence:
simplest ember baseline, Azul 20, -Xmx4g -Xms2g, with pipelining and -- 16:
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 16 | grep Requests/sec ; done
Requests/sec: 264116.61
Requests/sec: 174607.66
Requests/sec: 131080.75
Requests/sec: 262927.01
Requests/sec: 131162.33
Requests/sec: 131334.81
Requests/sec: 252555.60
Requests/sec: 231018.25
Requests/sec: 202961.52
Requests/sec: 190722.09
simplest ember polling, Azul 20, -Xmx4g -Xms2g, with pipelining and -- 16:
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 16 | grep Requests/sec ; done
Requests/sec: 252128.79
Requests/sec: 243816.50
Requests/sec: 237298.43
Requests/sec: 240201.98
Requests/sec: 240870.38
Requests/sec: 205297.17
Requests/sec: 243058.46
Requests/sec: 239905.62
Requests/sec: 179937.06
Requests/sec: 232942.17

I think the results are promising: nothing seems to have gotten worse, and it looks like it quite likely is better, but I'd love to see more benchmark results. 20% may be possible, but only in some circumstances, it seems.

@ChristopherDavenport
Member

Thank you for putting all the work into that! That's some really good information!

@armanbilge
Member Author

armanbilge commented Aug 28, 2023

For folks following along here: we've also published a snapshot of @antoniojimeneznieto's work implementing a JVM polling system based on io_uring. The initial prototype piggy-backs on Netty's internal io_uring APIs.

Please give it a try! The linked PR demonstrates how to create a JVM Ember server using the fs2-io_uring snapshot.
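
For context, wiring Ember to the io_uring-backed socket group looks roughly like the sketch below. This is only a sketch: the fs2-io_uring snapshot coordinates are omitted, the UringSocketGroup import path is an assumption, and withSocketGroup is the builder call used in the linked example (it may require the snapshot builds referenced there).

// Dependencies: http4s-ember-server plus the fs2-io_uring snapshot from the
// linked PR (exact coordinates and versions as referenced there).

import cats.effect.*
import fs2.io.uring.net.UringSocketGroup // import path assumed; see the linked PR
import org.http4s.ember.server.EmberServerBuilder

object UringApp extends IOApp.Simple:
  // The only change versus a plain Ember server: swap the default (NIO-backed)
  // socket group for the io_uring-backed one.
  def run =
    EmberServerBuilder
      .default[IO]
      .withSocketGroup(UringSocketGroup[IO]) // as used in the linked example
      .build
      .useForever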

@wjoel
Contributor

wjoel commented Aug 29, 2023

I ran the TFB benchmarks with my branch using your latest and greatest snapshots last weekend, and it wasn't worth writing about: slightly worse results than before. Well, it turns out I hadn't noticed the withSocketGroup(UringSocketGroup[IO]) in your example, and it makes quite the difference...

Baseline: https://www.techempower.com/benchmarks/#section=test&shareid=9b70928b-24e8-4a39-a5dc-7832d8b02cd6&test=plaintext
fs2_iouring and also withSocketGroup(UringSocketGroup[IO]): https://www.techempower.com/benchmarks/#section=test&shareid=2a700d9b-1d3d-4835-b366-12dbeb063575&test=plaintext

On "meaningless" benchmarks you've made Ember approximately 340% faster, and I'd bet real money that it translates into real-world improvements as well. Whereas before, my CPU cores (or vCPU cores) were at most 30% loaded when running those benchmarks, they're now blazing along at 100%, as they should.

Fantastic. Amazing. Well done.

@djspiewak
Member

On "meaningless" benchmarks you've made Ember approximately 340% faster, and I'd bet real money that it translates into real-world improvements as well. Whereas before, my CPU cores (or vCPU cores) were at most 30% loaded when running those benchmarks, they're now blazing along at 100%, as they should.

What the holy crap.

I expected io_uring to make things faster, but I didn't expect it to be that much faster. There's a long way to go to productionize this, and I'm sure there's loads of stuff that'll get faster and slower along the way, but WOW. The fact that this is happening while we're still going through Netty's indirection is pretty impressive.

@djspiewak
Member

@wjoel I had an interesting thought: these results almost certainly say something about our syscall overhead. If we're reaping this magnitude of performance improvement just from swapping out the polling system, then it's basically saying that Ember is almost entirely bounded by the overhead of NIO. We won't know for sure until we get some more polling systems into the mix (this is a three-variable algebra problem and we only have two equations so far), but this is really fascinating.
