
Major performance optimization when using multiple channels #411

Merged
merged 5 commits into RevenantX:sendqueue on Jan 7, 2021

Conversation

@aienabled (Contributor) commented Dec 28, 2020

Hello!
I'm the developer of CryoFall, a multiplayer action survival game focused on persistent servers, each capable of hosting up to 300 players online on average VPS hardware (2 vCores at 3.0+ GHz, 3 GB RAM, targeting a fixed 40 updates per second). The game engine (called Renkei) is built from scratch and uses MonoGame only as a client framework.
The engine was originally (since 2015) built on the Lidgren Network RUDP library (the default choice for .NET back then), but as I found and reported more and more issues with it, and considering that the project is no longer actively maintained, I had been preparing to switch to LiteNetLib. Last month I finally made the switch, and it has been a great success, aside from the one issue that this pull request fixes!

I'm also the developer who opened the tickets 3.5 years ago requesting multiple channels per delivery method and the reliable sequenced delivery method. These two were major missing features that our game required to function properly, given its large number of entities and the fine-tuned network replication of their properties and RPCs. I'm glad to see that my feature requests were eventually implemented and that LiteNetLib has matured dramatically over the past years, making it perhaps a leading choice both among Unity users and among developers writing their own engines!

Nothing proves the maturity of LiteNetLib better than how little time the switch from Lidgren Network took in my case (the project is extremely large and used various Lidgren Network features extensively). The in-place library replacement was done within just a day, with one more day for extensive testing and polishing, and that includes the time I spent making a fork to support more channels. The default maximum number of channels (64 per delivery method) is not always enough for us, and moving this logic to the application level is not always viable or reasonable, especially since LiteNetLib already has well-written code for it. By no means are we using that many channels all the time, but occasionally (with a large number of entities within the player's scope, when each has, for example, an independent reliable sequenced channel for its HP property and/or a few other frequently modified properties) we need a little more than 64 channels for some delivery methods, and my fork handles this fine at the cost of slightly increased traffic overhead (one extra byte for the channel ID in the header). (Please note: the current pull request doesn't introduce such a change; I'm focusing strictly on the performance optimization for the case when NetPeer has a significant number of channels. Of course, the performance patch is even more beneficial with an increased channel count.)

Even without pushing the number of channels to the limit of 64 per delivery method, I noticed heavily unoptimized code related to the use of multiple channels. The problem is that NetPeer keeps a linked list of all its channels, and on every Update it iterates over all of them, calling SendNextPackets for each. The wasted work gets out of hand quickly as the number of channels rises (O(N*M), where N = active connections and M = total created channels per connection). Unfortunately, there are no performance tests included to measure this, but in production (we're running public experimental testing with two different machines hosting the game servers) I quickly noticed that with LiteNetLib the test server performance degraded dramatically, with most of the load concentrated on the network thread, and that was with just around 20 online players! See the picture: htop CPU load before my patch (it was happening on both machines that we run for public experimental testing).
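To illustrate the shape of the problem, here is a simplified sketch of that update pass (the names connectedPeers and Channels and the exact loop structure are illustrative here, not the library's actual code):

```csharp
// Simplified sketch of the pre-patch update pass; not the exact LiteNetLib source.
static void UpdateAllPeers(IEnumerable<NetPeer> connectedPeers)    // N active connections
{
    foreach (NetPeer peer in connectedPeers)
    {
        foreach (BaseChannel channel in peer.Channels)             // M created channels each
        {
            // Called even when the channel has had nothing to send for a long time;
            // at a 1 ms update rate this near-no-op work is paid N * M times per tick.
            channel.SendNextPackets();
        }
    }
}
```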

The good thing is that I had already hit a similar issue with the Lidgren library before (it also loops over all the channels of each peer in its heartbeat/update logic), and I implemented a patch for it in my fork 2.5 years ago, which has proven reliable in production all that time. So I needed to implement the same solution for LiteNetLib based on that past experience, hence this pull request. It proved a major success: the CPU load of the network thread dropped more than 30x while handling the same number of active connections, ensuring that we can handle hundreds of active connections with dozens of channels each (same as we did with the Lidgren library before, or maybe even better). See the pictures: htop CPU load after the patch and, most revealing, the server CPU load statistics over the week. As you can see, the CPU load barely increases as the number of players grows to around 15-20 (the spikes are now mostly related to increased gameplay logic load; e.g. boss battles are computation-expensive). And that's with a 1 ms logic update rate for NetManager. After the game update launches in January 2021, I will be able to produce performance stats from a server with hundreds of online players.

So, what's the solution to this performance issue? Not iterating over all the channels of the NetPeer, but only over those that have any packets to send (or re-send). Iterating over all of them is a waste, and there is simply no realistic scenario in any game or application where such a large number of channels would be active all the time. My optimization introduces a queue of channels scheduled for a call to their SendNextPackets method. A channel is enqueued (if it's not in the queue already) when it gets any packets to send, and it is removed from the queue as soon as it has nothing left to send (including ACK packets and possible re-sends). I've integrated this logic with changes as small as possible, making several clean cuts; the existing code was pretty clear. In one case, though, I stumbled over a mutable struct; a follow-up commit fixed it, and I added a comment to ensure nobody else makes the same mistake.
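The peer-side part of the idea, as a minimal sketch (the names _channelSendQueue, ScheduleChannelSend, SendQueuedChannels and the bool result of SendNextPackets are illustrative, not necessarily the exact code in the commits):

```csharp
// Members of a NetPeer-like class: channels with pending data register themselves here.
private readonly Queue<BaseChannel> _channelSendQueue = new Queue<BaseChannel>(64);

internal void ScheduleChannelSend(BaseChannel channel)
{
    lock (_channelSendQueue)
        _channelSendQueue.Enqueue(channel);
}

// The per-tick update now visits only the scheduled channels.
private void SendQueuedChannels()
{
    int scheduled;
    lock (_channelSendQueue)
        scheduled = _channelSendQueue.Count;

    while (scheduled-- > 0)
    {
        BaseChannel channel;
        lock (_channelSendQueue)
            channel = _channelSendQueue.Dequeue();

        // Illustrative bool result: true while the channel still has packets,
        // ACKs or re-sends pending, so it stays scheduled for the next tick;
        // a drained channel is simply not re-enqueued.
        if (channel.SendNextPackets())
        {
            lock (_channelSendQueue)
                _channelSendQueue.Enqueue(channel);
        }
    }
}
```

With this in place, an idle channel costs nothing per tick; the update loop only pays for channels that actually have work. The enqueue-once check on the channel side is sketched after the next paragraph.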

Performance-wise, for the queue I'm using a regular .NET Queue&lt;T&gt;, which is internally implemented as a circular buffer, and I allocate it with a large enough initial capacity, so both Enqueue and Dequeue are O(1). There is very little overhead from enqueuing due to the lock on the queue, and these two operations happen rarely because there is a check for whether the channel is already queued; in most cases this check doesn't require an extra lock, as it reuses a lock that is already taken (OutgoingQueue). Overall, the overhead should be negligible even in the worst-case scenario (when there are only a few channels and they're used all the time).
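Here is how that channel-side fast path could look, again as an illustrative sketch (the _isScheduled flag, the bool return value of SendNextPackets and the ScheduleChannelSend call are assumptions of this sketch; only the OutgoingQueue name corresponds to the existing lock mentioned above):

```csharp
// Members of a BaseChannel-like class that already protects its outgoing
// packets with a lock on OutgoingQueue.
protected readonly Queue<NetPacket> OutgoingQueue = new Queue<NetPacket>(64);
protected NetPeer Peer;      // the owning peer (set by the real channel's constructor)
private bool _isScheduled;   // true while the channel sits in the peer's send queue

public void AddToQueue(NetPacket packet)
{
    bool schedule;
    lock (OutgoingQueue)                    // the lock we take anyway to enqueue the packet,
    {                                       // so the "already scheduled" check costs no extra lock
        OutgoingQueue.Enqueue(packet);
        schedule = !_isScheduled;
        _isScheduled = true;
    }
    if (schedule)
        Peer.ScheduleChannelSend(this);     // rare path: one short lock on the peer's queue
}

public bool SendNextPackets()
{
    // ... send pending packets, ACKs and possible re-sends here ...
    lock (OutgoingQueue)
    {
        if (OutgoingQueue.Count > 0)
            return true;                    // still has work: stay scheduled
        _isScheduled = false;               // fully drained: drop out of the peer's queue
        return false;
    }
}
```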

I hope this pull request will be useful!

Regards!
Vladimir Kozlov aka ai_enabled
AtomicTorch Studio

@aienabled (Contributor, Author)

I see the CI build is failing due to the supported C# version (max 4.0). If necessary, I can make a commit ensuring that this pull request does not require a higher C# version.

Regards!

@aienabled (Contributor, Author)

The CI build is successful now!

@Moe-Baker (Contributor)

Looks like some good stuff; hopefully it gets merged.
I also find the 64-channel limit a bit small. How many does your fork support? 255?

@RevenantX (Owner)

> Looks like some good stuff; hopefully it gets merged.
> I also find the 64-channel limit a bit small. How many does your fork support? 255?

In most cases there is no sense in having that many channels, so this limit isn't a random pick. Also, increasing it would increase the size of all packets.

@Moe-Baker (Contributor) commented Dec 30, 2020

> In most cases there is no sense in having that many channels, so this limit isn't a random pick. Also, increasing it would increase the size of all packets.

Understandable regarding the channel limit; still, I think queuing channels seems like a nice addition.

@Moe-Baker (Contributor)

Lol, now I'm actually going to add something similar to the message queuing in my own networking solution: only queue the clients that have messages waiting to be sent.

@aienabled (Contributor, Author)

@Moe-Baker you can check my fork here: https://github.com/AtomicTorchStudio/LiteNetLib/tree/Renkei
To increase the total number of supported channels, it was necessary to encode the channel index as two bytes (ushort) instead of a single byte. Unfortunately, allowing this turned out to be quite a sweeping change across the source code (see the commits).

I agree with @RevenantX that an increased number of channels is neither necessary nor technically sensible to bring into the master branch. There are obvious tradeoffs (an increased packet header size), 64 channels per delivery type is already a significant number, and more channels are definitely not required for most games and applications.

(In my case, the increased number of channels is necessary due to the nature of the game: it's a massive open-world game with a persistent world, and each client stays connected for many hours. In the worst case, a single screen (the server view scope for each player) can contain dozens and dozens of characters and structures with network-replicated properties, e.g. health points, which are best pushed in a separate channel. For example, see this video, though it doesn't display UI (such as character healthbars) because we recorded it in spectator mode. I've researched other approaches, such as snapshots, which may work better and don't need an increased number of channels (using more bandwidth instead when a packet is lost), but the current approach works just great. The server temporarily allocates a channel only when an entity's state is modified and the change needs to be pushed over the network; this way I reuse the channels, but sometimes the number goes above 64 for a certain delivery method, as in the video above when many weapon shots and explosions damage a lot of structures and characters.)

I'm certain that my pull request, which introduces this optimization into the vanilla library, will improve performance for almost all use cases and will not hurt it noticeably in the worst case. The overhead of pushing and pulling elements through the queue (even when an extra lock is required) is imperceptibly low, and there are no memory allocations.
When the number of channels is low, this overhead is definitely negligible. When the number of channels is high, they also tend not to all be in use all the time (otherwise the application would be pushing a huge amount of data over the network, which is not how games are made, right?). In that case, my optimization improves performance enormously.
In my measurements above, the server is not pushing toward the vanilla channel limit; it's just a test server with very few players and a bare minimum of interactions, not the stress scenario from the video above, and you can still see the improvement. If more data is necessary, I can add debug code to output the total number of created channels and the number of active channels per connection.

My conclusion is that without this optimization, at a 1 ms tickrate the library will seriously struggle to process even 32 concurrent connections with 32 created channels each (32 x 32 = 1024 expensive SendNextPackets calls per tick), even on powerful server hardware like the machine I'm using for the test (a 3.7 GHz Xeon with multiple vCores, scoring over 600 points in single-threaded sysbench, which is well above the average server machine that usually scores 320-450 points in this test).

BTW, we're still running the two test game servers, and there have been zero server reboots since the date of this pull request. Over 500 unique players have joined, with thousands of hours played. No issues detected; performance and memory usage are both stable. The update will soon launch to an even larger audience (tens of thousands of unique players within a month), and I'm fairly certain that with this optimization we will see the server handling 200+ concurrent connections without breaking a sweat!

Regards!

@RevenantX RevenantX changed the base branch from master to sendqueue January 7, 2021 11:07
@RevenantX RevenantX merged commit d186857 into RevenantX:sendqueue Jan 7, 2021