network sysctls #556
Conversation
-  # use TCP BBR has significantly increased throughput and reduced latency for connections
+  # https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html
+  # In some cases, TCP BBR can significantly increase throughput and reduce latency,
+  # however this is not true in all cases, and should be used with caution
What cases in particular?
Are we talking about the long fat pipes issue?
At Edgecast, with ~20k machines covering something like >75% of routes in the world, we set up socket performance sampling using a tool like xtcp2 (https://github.com/randomizedcoder/xtcp2). This essentially streams "ss --tcp --info" output back to a big ClickHouse cluster, which gave us global visibility into socket performance.
Then we ran a series of "canary" experiments to enable BBR on some machines in different PoPs all over the world.
We then carefully analyzed the results to observe the impact on socket performance. In many cases, socket performance (like throughput) dropped. This was particularly true for small HTTP transactions and for connections with low RTTs (like 20-40ms). This might be because BBR takes time to find the target rate. We did see benefits with BBR, particularly at higher RTTs, and in particular on cellular networks. In the end, we kept cubic as the default and wrote automation to detect the sockets where BBR would benefit them; the Edgecast web server would then switch to BBR only for those destination routes. This gave a ~4% performance improvement globally.
Of course, this was all for Edgecast, which has traffic patterns that could be very different from your use cases.
... I find that if I enable BBR on my laptop at home, it sucks, particularly for talking to local machines on my LAN, so I don't use BBR on my laptop anymore. My internet connection is pretty good (I'm ~12ms to most CDNs), so that's also why BBR isn't really required there.
Anyway, my point is that just blindly turning on BBR could be doing more harm than good.
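For illustration only (not from the thread): a minimal NixOS sketch of the kind of socket sampling described above. The service name, interval, and log path are made up; a real deployment like xtcp2 parses the output and streams it to a central store such as ClickHouse instead of appending to a local file.

```nix
{ pkgs, ... }:
{
  # Hypothetical sampler: dump per-socket TCP statistics once a minute.
  systemd.services.tcp-socket-sample = {
    description = "Sample TCP socket performance (ss --tcp --info)";
    serviceConfig.Type = "oneshot";
    script = ''
      ${pkgs.iproute2}/bin/ss --tcp --info >> /var/log/tcp-socket-sample.log
    '';
  };
  systemd.timers.tcp-socket-sample = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnCalendar = "minutely";
  };
}
```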
Ok. So my assumption when I enabled BBR here was that it would help for the typical server (1-10 Gbps uplink, running in some data center with reasonable global peering). That's why we also only put it in the "server" profile while keeping the default for the "desktop" profile.
But your data suggests that it would actually be worse for these cases, and that BBR should mainly be used to improve throughput when serving mobile clients? Interestingly, your ~4% performance improvement seems to match the result that YouTube reported as well.
Which version of BBR did you use for your testing? It looks like since 2023 we now also have BBRv3, with some enhancements: https://datatracker.ietf.org/meeting/117/materials/slides-117-ccwg-bbrv3-algorithm-bug-fixes-and-public-internet-deployment-00
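For readers following along, this is roughly what "only in the server profile" means: a minimal sketch showing just the BBR-related lines from srvos' nixos/server/default.nix, with everything else omitted. The desktop profile does not set these, so it keeps the kernel default (cubic).

```nix
{
  boot.kernel.sysctl = {
    "net.core.default_qdisc" = "fq";
    "net.ipv4.tcp_congestion_control" = "bbr";
  };
}
```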
Thanks for the slides.
Even BBR2 isn't merged into the kernel, so currently BBR1 is the only one available. The team working on BBR2 and L4S/Prague (https://www.rfc-editor.org/rfc/rfc9330.html) does have a branch with bbr2 in it. I guess something that might be cool would be to get Nix to apply the BBR2 patches.
https://github.com/L4STeam/linux/blob/56eae305cddf172b87c54d8a61db8d1e9e2204f0/net/ipv4/tcp_bbr2.c#L1304
Apparently BBR3 isn't in the repo :(
https://github.com/search?q=repo%3AL4STeam%2Flinux+bbr3&type=code
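As a sketch of the "get Nix to apply the BBR2 patches" idea: NixOS can carry out-of-tree kernel patches via boot.kernelPatches. The patch file name below is a placeholder; the L4STeam work is published as a whole kernel branch, so in practice one might build their tree directly rather than apply a single patch.

```nix
{
  # Hypothetical: apply an out-of-tree BBR2 patch on top of the stock kernel.
  boot.kernelPatches = [
    {
      name = "tcp-bbr2";
      patch = ./tcp-bbr2.patch; # placeholder; would be extracted from the L4STeam branch
    }
  ];
}
```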
@randomizedcoder if your recommendation would be to disable BBR for most server usage, maybe we should drop it from srvos then.
@@ -111,10 +111,56 @@
      '';
    };

-  # use TCP BBR has significantly increased throughput and reduced latency for connections
+  # https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html
@randomizedcoder Regarding the optimization you applied, maybe we should first agree on what use case we are optimizing for. Could you describe what type of server/client setup you based your considerations on? I would like to add this as a comment for future readers.
Something like this (feel free to change it to whatever you want):
Suggested change:
-  # https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html
+  # These settings optimize the network configuration for servers with X-Y Gbps uplinks serving clients with X ms latency
+  # https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html
Apparently TCP BBR is not a good idea without doing any performance measurements: #556
`fq` is actually incorrect. It should be `fq_codel`, and systemd already applies this by default for us.
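For context, a one-line sketch (not from the PR): since systemd already applies fq_codel by default, NixOS needs no sysctl at all for the qdisc; pinning it explicitly would look like this.

```nix
{
  # Redundant on a default NixOS/systemd install; shown only for explicitness.
  boot.kernel.sysctl."net.core.default_qdisc" = "fq_codel";
}
```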
"net.core.default_qdisc" = "fq"; | ||
"net.ipv4.tcp_congestion_control" = "bbr"; |
We dropped both settings in #576
G'day numtide,
Thanks for all the great numtide projects! I've been learning Nix and you guys are definitely leaders. Thank you.
In the spirit of giving back, I was reading the blog post about SrvOS (https://numtide.com/blog/donating-srvos-to-nix-community/) and took a quick look. I noticed that some standard TCP performance tweaks I apply are missing (most importantly the TCP buffer sizes), and these tweaks will likely be suitable in the vast majority of cases. Therefore, I'm submitting this little pull request for your thoughts.
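As an illustration of the kind of buffer-size tweaks meant here, a minimal sketch; the values below are common examples, not the ones actually proposed in this PR's diff.

```nix
{
  # Raise the TCP read/write buffer ceilings so high-bandwidth, high-RTT
  # connections can keep enough data in flight. Example values only.
  boot.kernel.sysctl = {
    "net.core.rmem_max" = 67108864;               # 64 MiB
    "net.core.wmem_max" = 67108864;               # 64 MiB
    "net.ipv4.tcp_rmem" = "4096 131072 67108864"; # min / default / max
    "net.ipv4.tcp_wmem" = "4096 16384 67108864";  # min / default / max
  };
}
```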
It's interesting to see that you ARE changing to TCP BBR by default. This is probably mostly safe across the WAN, but BBR is not a silver bullet, and in many cases (like any connection with low RTTs) it is likely to make performance worse. I did a lot of testing at Edgecast CDN, and we determined NOT to change to BBR globally, but to selectively switch to BBR for some destination subnets:
https://edg.io/technical-articles/improving-network-performance-with-dynamic-congestion-control/
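To make "selectively switch to BBR for some destination subnets" concrete: one way to do this on Linux (not necessarily Edgecast's mechanism, which lived in their web server) is iproute2's per-route congctl attribute. The subnet, gateway, and interface below are placeholders.

```nix
{ pkgs, ... }:
{
  # Hypothetical: keep the system-wide default (cubic) and use BBR only for
  # one destination subnet known to be high-RTT. Addresses are examples.
  networking.localCommands = ''
    ${pkgs.iproute2}/bin/ip route replace 198.51.100.0/24 \
      via 192.0.2.1 dev eth0 congctl bbr
  '';
}
```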
Thanks again,
Dave