Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Net] Enet optimizations and fixes for sending unreliable enet traffic larger than enets 1400 byte MTU #51466

Conversation

jordo
Copy link
Contributor

@jordo jordo commented Aug 10, 2021

This PR includes some of the enet optimizations and fixes we have been using on our private godot fork.

This PR address the following:

  • Current NetworkedMultiplayerEnet does not set the flag ENET_PACKET_FLAG_UNRELIABLE_FRAGMENT when sending data on the unreliable enet transport. Enet's internal MTU is configured to be 1400 bytes. If you do not set this flag, Enet will deliver messages that are larger than 1400 bytes RELIABLY. The following is output from a gameserver with ENET_DEBUG defined in enet's protocol.c source. This was really surprising to us to see enet queue up outgoing packets even though our godot application was sending everything unreliably. Unless the ENET_PACKET_FLAG_UNRELIABLE_FRAGMENT flag is set (https://github.com/lsalzman/enet/blob/74cea7abf52ddd355146aeb0a4077d2b95368122/include/enet/enet.h#L115), Enet will deliver unreliable messages over 1400 bytes reliably as enet by default will attempt to re-assemble fragments.
"peer 0: 0.000000%+-0.000000% packet loss, 16+-4 ms round trip time, 1.000000% throttle, 0 outgoing, 0/0 incoming\r\n"
"peer 0: 0.000000%+-0.000000% packet loss, 19+-4 ms round trip time, 1.000000% throttle, 0 outgoing, 0/0 incoming\r\n"
"peer 0: 0.000229%+-0.000458% packet loss, 16+-3 ms round trip time, 1.000000% throttle, 0 outgoing, 0/0 incoming\r\n"
"peer 0: 0.000839%+-0.001678% packet loss, 18+-4 ms round trip time, 1.000000% throttle, 0 outgoing, 0/0 incoming\r\n"
"peer 0: 0.000626%+-0.001160% packet loss, 20+-4 ms round trip time, 1.000000% throttle, 0 outgoing, 0/0 incoming\r\n"
"peer 0: 0.002182%+-0.003952% packet loss, 20+-4 ms round trip time, 1.000000% throttle, 9 outgoing, 0/0 incoming\r\n"
"peer 0: 0.000885%+-0.001404% packet loss, 23+-7 ms round trip time, 1.000000% throttle, 0 outgoing, 0/0 incoming\r\n"
"peer 0: 0.001907%+-0.003510% packet loss, 23+-7 ms round trip time, 1.000000% throttle, 0 outgoing, 0/0 incoming\r\n"
"peer 0: 0.002213%+-0.002319% packet loss, 94+-75 ms round trip time, 0.000000% throttle, 2404 outgoing, 0/0 incoming\r\n"
"peer 0: 0.002472%+-0.002274% packet loss, 394+-93 ms round trip time, 0.000000% throttle, 7114 outgoing, 0/0 incoming\r\n"
"peer 0: 0.029907%+-0.056595% packet loss, 639+-161 ms round trip time, 0.062500% throttle, 10664 outgoing, 0/0 incoming\r\n"
"peer 0: 0.063278%+-0.109177% packet loss, 824+-7 ms round trip time, 1.000000% throttle, 13068 outgoing, 0/0 incoming\r\n"
"peer 0: 0.093979%+-0.143295% packet loss, 1006+-7 ms round trip time, 1.000000% throttle, 15179 outgoing, 0/0 incoming\r\n"
  • The current implementation of enet_socket_send in godot (godot.cpp), makes a surperflous copy of outbound data before sending it out on the socket. https://github.com/godotengine/godot/pull/7985/files#diff-4b4db5dcc81b5bff9d43a2bf9b6193316d1b822fe8a310f34b324f0a9e451d3dR143-R148 This is due mainly to the fact that enet wants to to send data via sendmsg on posix platforms, and WSASendTo on windows (in win23.c and unix.c within the enet library). Both these platform functions allow an application to send a list of packets ( A packet being a data pointer and length). Netsocket does not expose a similar interface, so a copy has to be made from each of outgoing packets within enet_socket_send. This PR adds sendmsg type support to Netsocket which allows godot to send a list of packets implemented with sendmsg and WSASendTo on supported godot platforms. Supporting this api in Netsocket allows this PR to remove the superfluous memcopy in enet_socket_send in godot.cpp.

  • Lastly, an improvement is made to how enet is serviced in NetworkedMultiplayerENet::poll. The current implementation is to loop on enet_host_service(). This method of servicing enet has a potential starvation issue that we encountered, which can lead to unexpected durations of how long the godot main thread is spending in NetworkedMultiplayerENet::poll. This is especially not wanted, if you are polling enet on the main thread of a game engine (there is no guarantee the loop will ever stop). If you have clients that send data fast enough, enet will continue to service these messages, and it can be tough to break out of the loop and lead to starvation issues. A solution by the enet library developer was to add enet_host_check_events(). This can be used to process all the events queued up for processing by a single call to enet_host_service(). This is somewhat described in this response here: Q: Right usage of host service and check events lsalzman/enet#126 (comment) A single call toenet_host_service() will hit the socket, and queue all incoming messages for processing in enet. Repeatedly calling enet_host_check_events() doesn't touch the socket at all, and will process all queued messages until there are none left. This technique is described the enet library's developer here: https://www.mail-archive.com/enet-discuss@cubik.org/msg00947.html and http://lists.cubik.org/pipermail/enet-discuss/2013-September/002238.html.

  • Of note - I have NOT been able to test NetSocketPosix::sendmsg on windows. NetSocketPosix::sendmsg does have an #ifdef for the windows platform here but I believe it's a good tradeoff here to support this api, as it allows a lot of efficiencies especially when specifically working with the enet api. The spec for WSASendTo seems reasonably straight forward (https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsasendto) however I currently don't have a machine to test this path.

This change can be cherry picked to 4.0 master as well.

@jordo jordo requested review from a team as code owners August 10, 2021 05:45
@jordo jordo changed the title Network and enet optimizations and fixes for sending unreliable enet traffic larger than enets 1400 byte MTU [Net] Enet optimizations and fixes for sending unreliable enet traffic larger than enets 1400 byte MTU Aug 10, 2021
Comment on lines +37 to +46
typedef struct {
#ifdef WINDOWS_ENABLED
size_t dataLength;
void *data;
#else
void *data;
size_t dataLength;
#endif
} NetSocketBuffer;

Copy link
Contributor Author

@jordo jordo Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of note, on windows, the properties in these structs are inverted. (WSABUF https://docs.microsoft.com/en-us/windows/win32/api/ws2def/ns-ws2def-wsabuf)

@Faless
Copy link
Collaborator

Faless commented Aug 10, 2021

* Current NetworkedMultiplayerEnet does not set the flag `ENET_PACKET_FLAG_UNRELIABLE_FRAGMENT` when sending data on the unreliable enet transport.    Enet's internal MTU is configured to be 1400 bytes.  **If you do not set this flag, Enet will deliver messages that are larger than 1400 bytes RELIABLY**

I don't think we can't change this in 3.x, it would exponentially raise the drop probability of big packets sent with rpc_unreliable.
While I understand you may want to send big packets unreliably, this was not done in Godot historically, so if we are to change it we should do in 4.0.

* This PR adds `sendmsg` type support to `Netsocket` which allows godot to send a list of packets implemented with `sendmsg` and `WSASendTo` on supported godot platforms.  Supporting this api in `Netsocket` allows this PR to remove the superfluous memcopy in `enet_socket_send` in `godot.cpp`.

Do you have any data on how much this saves in terms of cpu usage? Adding a whole new function and struct to only use in the ENet base socket should be motivated by noticeable performance improvements.
(I don't really care much about the extra code iselft, more about the extra are code path to follow and test).

* Lastly, an improvement is made to how enet is serviced in `NetworkedMultiplayerENet::poll`.  The current implementation is to loop on `enet_host_service()`.  This method of servicing enet has a potential starvation issue that we encountered

This is interesting, I'll have a look, but this change seems fine.

@@ -217,21 +217,22 @@ void NetworkedMultiplayerENet::poll() {

_pop_current_packet();

if (!host || !active) // Might be disconnected
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ifs should always include curly brackets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}
/* Keep servicing until there are no available events left in the queue. */
do {
if (!host || !active) // Check again after every event
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ifs should always include curly brackets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@jordo
Copy link
Contributor Author

jordo commented Aug 10, 2021

I don't think we can't change this in 3.x, it would exponentially raise the drop probability of big packets sent with rpc_unreliable.
While I understand you may want to send big packets unreliably, this was not done in Godot historically, so if we are to change it we should do in 4.0.

I would argue against this, as it's currently broken for everyone in 3.x. Godot exposes three transfer modes TRANSFER_MODE_UNRELIABLE, TRANSFER_MODE_UNRELIABLE_ORDERED, TRANSFER_MODE_RELIABLE. If i send a packet that's 1600 bytes with TRANSFER_MODE_UNRELIABLE ( not out of the ordinary for a state snapshot of a networked multiplayer game ), godot will actually transfer this reliably. This is not good, because the expectation of a developer when explicitly setting TRANSFER_MODE_UNRELIABLE will be that the transfer mode is actually unreliable, it should be "fire and forget".

It shouldn't be possible to encounter application-side buffer bloat on an unreliable transport because if network congestion is encountered eventually somewhere along the way packets will simply just get dropped. However, unlimited outbound buffer bloat is exactly what we encountered as shown in the above output I posted. The fact that godot is currently converting unreliable packets over 1400 bytes to reliable, unknowingly to the developer, is not good. For any 3.x developer that is unfamiliar with enet internals this will be very difficult to identify and debug.

If developers are using rpc_unreliable, they should expect that this rpc can fail, regardless of packet size. If failure becomes a problem for their particular use case, it's a simple fix to change to rpc_reliable. Godot should not make that determination for them "under the hood".

Do you have any data on how much this saves in terms of cpu usage? Adding a whole new function and struct to only use in the ENet base socket should be motivated by noticeable performance improvements.
(I don't really care much about the extra code iselft, more about the extra are code path to follow and test).

For us, we're trying to make godot performant, as ultimately we want to support as many simultaneous client connections as possible. So 'noticeable performance improvements' are going to be very use-case specific. Our gameservers are hosted on the cloud, were we pay for cpu time, so any performance improvements mean less costs which is a win for us. Likewise we want to minimize as much cpu overhead because currently we are cpu bound and that determines how many concurrent client connections we can support in our battle-royale games. When you have hundreds of packets being sent to hundreds or potentially a thousand connected clients, superfluous memcpy's for every outgoing packet start to add up. The data to send is already available in buffers, minimizing memcpy's and memory operation overhead in netcode should be a goal to strive for regardless.

Part of the motivation behind the additional support for sendmsg in NetSocket is that NetSocket is really a nice wrapper around the platform specific socket interface (posix & windows). Both windows and posix expose an api to send a list of buffers on the socket https://man7.org/linux/man-pages/man2/send.2.html I would say adding sendmsg support to NetSocket provides a more complete coverage of the underlying platform's socket interface, and can be used for more than just interfacing efficiently with enet.

@jordo jordo force-pushed the winterpixel-enet-optimization-and-fixes branch from d1b0f45 to e336b2a Compare August 11, 2021 03:42
@Faless
Copy link
Collaborator

Faless commented Aug 11, 2021

I would say adding sendmsg support to NetSocket provides a more complete coverage of the underlying platform's socket interface, and can be used for more than just interfacing efficiently with enet.

It just worries me that you mention it being untested on windows, I don't want to merge something that theoretically improve ENet and in practise could break it on windows. So, it will have to undergo proper testing.

So, I would like to split this PR, have the enet_host_service change PRed directly to 3.x, while having the other 2 changes in master first. So they can undergo proper testing.

@Calinou
Copy link
Member

Calinou commented Aug 11, 2021

It just worries me that you mention it being untested on windows, I don't want to merge something that theoretically improve ENet and in practise could break it on windows. So, it will have to undergo proper testing.

We should probably have a networking tests project (similar to the physics tests demo project) that tests this kind of situation, such as sending packets very often, sending large packets, etc.

@jordo
Copy link
Contributor Author

jordo commented Aug 11, 2021

Ya I mean, I'm hoping someone, (I guess I could try), can test windows before this is merged.

@mhilbrunner
Copy link
Member

I'll try to test this on Windows soon-ish.

@Faless
Copy link
Collaborator

Faless commented Aug 11, 2021

Ya I mean, I'm hoping someone, (I guess I could try), can test windows before this is merged.

In any case this should be merged to master before being merged to 3.x.

@mhilbrunner
Copy link
Member

Thanks for this and your work!

  1. I tested this on current Windows 10 and it seems to work fine. I also found no obvious flaws in the implementation, it looks like I'd expect it to.
  2. I agree this should really go into master/Godot 4.0 and then maybe be backported for a future 3.x point release. (Although I'm not sure changing core networking parameters so far into a major release's lifecycle is advised, but we've shipped other major stuff with 3.x updates too.)
  3. This may need to be configurable. Both the MTU and whats happens if its exceeded (current behaviour or fragmenting, which maybe indeed should be default). I agree its somewhat surprising that packets that exceed MTU are sent reliably.
  4. IMO we should emit a warning when unreliable packets exceed the set MTU size, maybe only the first time it happens and maybe only on debug builds. Those are really meant for small, constant, fast updates, not for larger payloads.

@Faless
Copy link
Collaborator

Faless commented Aug 18, 2021

I agree its somewhat surprising that packets that exceed MTU are sent reliably.

The main reason behind this ENet default is the fact that with every additional fragment the drop probability will increase exponentially. In general, you want to avoid unreliable packets above the MTU exactly for this reason, so a warning in debug as suggested could be really useful.

This may need to be configurable.

Yeah, I think that's a good idea, and the 3.x default should be compatible with what we had before IMHO, but at least users can change it. We could use the other default in master though that's not the ENet default.

@mhilbrunner
Copy link
Member

Yeah, if we make it configurable and add a warning IMO the current default is fine for Godot 4.0 too.

@jordo
Copy link
Contributor Author

jordo commented Aug 18, 2021

This may need to be configurable. Both the MTU and whats happens if its exceeded (current behaviour or fragmenting, which maybe indeed should be default). I agree its somewhat surprising that packets that exceed MTU are sent reliably.

We don't want to really adjust the ENet MTU. It should be configured to be something smaller than 1500 bytes which is a common MTU on a bunch of different link level protocols. You really want to avoid fragmentation at the link and lower layers, which is why ENET settled on a 1400 MTU for it's library (which is a good choice, and allows some additional header room).

The main reason behind this ENet default is the fact that with every additional fragment the drop probability will increase exponentially. In general, you want to avoid unreliable packets above the MTU exactly for this reason, so a warning in debug as suggested could be really useful.

Well this is a weird thought, because ya, it's going to be 'exponentially' more drops because currently there are ZERO drops ( it's currently switched to reliable ). If ENET_PACKET_FLAG_UNRELIABLE_FRAGMENT is enabled and you're experiencing packet loss of 1-2%, then ya, you are going to have more drops because you aren't going to be able to re-assemble as many fragmented packets. But that's also making a lot of assumptions... If my packet size is 2000 bytes, it's only going to be fragmented into two pieces. So at 1-2% packet loss, that isn't an exponential increase in drop probability. It's maybe an extra drop or two every hundred packets.

And if you are currently experiencing anything greater than 5% packet loss, you are going to have a lot more issues other than what's happening here... And a 5% packet loss or greater for 'unreliable' egress, is generally caused by the sender sending more data than what the connection can handle. Loss here, is a good thing.

In the end, you are trading off one problem for another here. If you try to push more unreliable data than a connection can handle, you just end up with unlimited outbound buffer bloat (current status quo in godot implementation). With an unreliable transport, you shouldn't have to worry about outbound buffer bloat, it's kinda one of the perks. If someone is trying to send a 1MB packet unreliably, then ya the drop probability will increase exponentially, but at this point it's really user-error. And it becomes quite obvious during development if you really need this 1MB packet, you should just send it reliably.

Out of all the various network properties, real-time game networking is the most sensitive to latencies and jitter. Having unreliable packets, potentially queued up indefinitely at the application layer is probably the worst thing that can happen, and in fact much worse than dropping packets. Because ANY queuing/timers/re-transmissions of these packets is going to wildly change the jitter and latency parameters on the receiving end.

I can't say this more, but packets dropping can and should be expected when you explicitly ask to send something unreliably. In fact it's a good thing, because if you have network congestion, simply dropping packets eliminates additional latencies introduced by buffering.

I have heard other devs encounter strange behaviour and issues with godot enet connections, and I wouldn't be surprised if some developers are currently getting caught by this. It's incredibly difficult to diagnose and debug, as it just manifests like network latency hiccups (but they are actually application introduced latencies), which unless you are a network engineer will be tough to diagnose and fix. And the runaway outbound buffer can potentially bite developers big time, because again it will be hard for general developers to diagnose and understand what's currently going on.

I'm not sure what the solution is for godot devs in 3.X is, but my guess is we aren't the only studio to encounter this problem. Sending game state updates over 1400 bytes isn't uncommon at all... At a 30hz broadcast rate, 1400 bytes a state update is only ~ 40kB/s. I mean, developers COULD break up their state data into packets of 1400 bytes or less, but honestly whats the point? The whole game state still needs to be re-assembled by the client to be valid, in which case you've just shifted the point of re-assembly from the transport/enet layer into the application layer. It's totally pointless to work around this limitation at the application level.

My only concern with maintaining the status quo ( not making ENET_PACKET_FLAG_UNRELIABLE_FRAGMENT default ) would be how many current and future projects are being impacted by this and are just unaware of whats happening under the hood. My gut feeling is that we're not really doing anything special in our project and we ourselves encountered this issue. Basically a standard 'state update' client-server network game architecture.

@mhilbrunner
Copy link
Member

@jordo Thanks for taking the time to discuss this in depth and for the writeup. I think you convinced me, I can follow and see your points. Need to mull this over a bit, but you're probably right.

I think making MTU configurable still doesn't hurt (so devs can lower it if needed), and I still think a warning would be nice (because it is hard to gauge/debug), but even if we make the behaviour if it is exceeded configurable, ENET_PACKET_FLAG_UNRELIABLE_FRAGMENT does seem a saner default. @Faless thoughts?

@Faless
Copy link
Collaborator

Faless commented Aug 18, 2021

that isn't an exponential increase in drop probability

Well, according to maths, that's still exponential, that's just the second power, which is exponential.
That said, if we add a warning in debug we can likely change the default in 3.4 (since devs will realize what's going on).

@jasonwinterpixel
Copy link
Contributor

I think making MTU configurable still doesn't hurt

The MTU of the link level is usually 1500. Enet has specified 1400 as it's MTU. I think that enet's MTU is optimal. Setting the MTU higher than 1500 is going to cause drops and fragments on the link level, and setting it less than 1500 is going to cause less information density to travel over the network. Enet has specified 1400 using a 100 byte buffer under 1500. I think that's a good, battle-tested decision by enet.

TL;DR

Leave enet's MTU at 1400

@akien-mga akien-mga modified the milestones: 4.0, 3.5 Dec 16, 2021
@akien-mga
Copy link
Member

I believe this was superseded by #53129 (master) and #53130 (3.x).

@Faless
Copy link
Collaborator

Faless commented Dec 16, 2021

Well, the sendmsg changes where not included in those PRs. So there is some salvageable part.
Although the implementation of ENetDTLSClient and ENetDTLSServer sendmsg are wrong (because they send multiple packets instead of joining them, resulting in junk data being sent), and I think we want to avoid having platform ifdefs in core/ so we need a solution for the NetSocketBuffer being platform-dependent (maybe define it in net_socket_posix?).
So, there is some work to be done in case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants