Cleanup and fixes for Windows sockets #3816
Conversation
Hi @kulibali, the "changelog - fixed" label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do. Release notes are added by creating a uniquely named file in the `.release-notes` directory. The basic format of the release notes (using markdown) is a level-two heading followed by a short description of the change.
Thanks.
.release-notes/3816.md (Outdated)

## Cleanup and fixes for sockets on Windows

Fixed a few infelicities in the Windows socket code, in particular that clients connecting to an invalid port would think they were successfully connected.
I don't think this is very meaningful to an end user. It's basically "bug fix".
I think something like
## Address Windows socket errors
BLAH BLAH things fixed
with perhaps a description for users of what was wrong.
I've updated them.
@kulibali the tests are failing; do you want me to look into why, or will you be getting that?
I'll take a look.
Well, I just set up a new FreeBSD 13 VM and compiled and ran the tests there, and they all worked fine. I wonder if there are networking differences in Cirrus.
I made a FreeBSD 13 instance on Google Cloud, as that is what Cirrus uses (https://cirrus-ci.org/guide/FreeBSD/), and the tests run fine on that. So I'm stumped.
On Windows, the `TCPConnection` code was not immediately unsubscribing from ASIO events when trying to shut down a connection. It was also never checking whether all the relevant IOCP operations were complete, so in certain cases it would wait forever on the ASIO event and hang the program. This change makes `TCPConnection` unsubscribe immediately. There may be some pending IOCP events left over, but since we're shutting down we don't care about them reaching client code.

On Windows, the recommended way to check the status of a connection is to use `SO_CONNECT_TIME` rather than `SO_ERROR`. Certain failing IOCP operations will not set `SO_ERROR`, so this changes `TCPClient._is_sock_connected()` to use `SO_CONNECT_TIME` on Windows.

Fixes a case in `TCPConnection._event_notify` where clients might receive spurious `connecting` calls when a connection in a different network family failed.

Changes the use of `Sleep()` on Windows to `SleepEx()`, which will wake up for IO operations.

Adds a new test, `net/TCPConnectionFailed`, that checks that a connection to a bogus port fails immediately. This was broken on Windows.

Adds a new Unix-only test, `net/TCPConnectionToClosedServerFailed`, that checks that a connection to a recently closed server fails immediately. This test is disabled on Windows because, per the investigation of #3656, listening sockets that have an `AcceptEx` in progress stick around in the `CLOSE_WAIT` state and will allow connections even after the accepting and listening sockets have been closed.
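For reference, here is a minimal C sketch of the kind of `SO_CONNECT_TIME` check described above, assuming a hypothetical helper name (`is_sock_connected_windows`); it illustrates the technique and is not the runtime's actual code:

```c
// Minimal sketch (not the ponyc runtime's actual code): on Windows,
// SO_CONNECT_TIME reports how many seconds a socket has been connected,
// or 0xFFFFFFFF if it is not connected, which makes it a more reliable
// connectedness check than SO_ERROR for IOCP-based connects.
#include <winsock2.h>
#include <mswsock.h>   // SO_CONNECT_TIME
#include <stdbool.h>

static bool is_sock_connected_windows(SOCKET s)
{
  DWORD seconds = 0;
  int len = (int)sizeof(seconds);

  if(getsockopt(s, SOL_SOCKET, SO_CONNECT_TIME, (char*)&seconds, &len) != 0)
    return false;

  // 0xFFFFFFFF means the socket is not (yet) connected.
  return seconds != 0xFFFFFFFF;
}
```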
Force-pushed from 1a7fd6e to fa1a0d0.
```pony
class _TestTCPConnectionToClosedServerFailed is UnitTest
  """
  Check that you can't connect to a closed listener.
  """
  fun name(): String => "net/TCPConnectionToClosedServerFailed"
  fun exclusion_group(): String => "network"
```
This test lacks a `complete(true)` call. You should have that for any "success" cases.
The `complete(true)` call is in `_Connector`, in the `connect_failed` function of the notify object.
Ah, I missed the `_Connector` call.
```pony
class _TestTCPConnectionToClosedServerFailed is UnitTest
  """
  Check that you can't connect to a closed listener.
  """
  fun name(): String => "net/TCPConnectionToClosedServerFailed"
```
What is this testing? I don't see a client that is doing any connecting.
The call to `_Connector.connect()` on line 776 tries connecting.
How's this going @kulibali, do you need any assistance?
Assistance would be welcome. I haven't been able to find out exactly how Cirrus's FreeBSD images are configured. The tests that are failing on Cirrus run fine on a FreeBSD VM on my machine, as well as on a Google Cloud FreeBSD image.
Perhaps a timing issue that happens to occur only there? Or do you think it's a configuration issue?
The only tests we have that use 127.0.0.1 are the new tests. Perhaps the host is the problem.
I've found one small issue so far.
OK, I figured it out. Working on a "better" fix than what I am currently doing.
issues:
I'm working on setting a lower default on Cirrus and then will clean up my changes.
@kulibali ok, you are good now. Is there anything else, or is this ready to squash and merge?
@kulibali let me know if any of the updates I made to the release notes are incorrect.
Looks good @SeanTAllen, thanks!
Prior to this commit, we were using a mechanism on Windows called `WaitForSingleObject` when we wanted to put a thread to sleep until we activated a signal to wake it up later. Now we use `WaitForSingleObjectEx` instead, so we can be sure that the thread is kept in an "alertable" state if there is an APC (Async Procedure Call) that needs to be run during the wait period.

We use APCs for socket I/O completion callbacks, and they can only arrive on the thread that initiated them, so it's important to allow such callbacks to run even if the scheduler thread is suspended. Note that the callbacks we use for sockets will not run any Pony code; they will only dispatch a message via the ASIO subsystem, so running those APCs is safe even if the actor has been work-stolen to another thread and is running there. In such a situation, the ASIO event will just go into the actor's queue and it will process the message later, as normal. We just need to make sure the APC actually gets run, hence we need to ensure even suspended threads stay "alertable".

Note that this fixes a problem similar to one of the problems fixed in PR #3816, wherein some calls to `Sleep` were migrated to `SleepEx` for the same reason: to stay "alertable".
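A minimal, standalone C sketch of the "alertable wait" distinction described above (not the scheduler's actual code): it queues a user APC and shows that `WaitForSingleObjectEx` with its alertable flag set lets the APC run during the wait, whereas a plain `WaitForSingleObject` (or `Sleep`) would leave it queued.

```c
// Standalone sketch (not ponyc scheduler code): a queued APC only runs
// while its target thread is in an alertable wait, e.g.
// WaitForSingleObjectEx(..., TRUE) or SleepEx(..., TRUE).
#include <windows.h>
#include <stdio.h>

static void CALLBACK on_io_complete(ULONG_PTR param)
{
  // In the runtime's case this would be a socket I/O completion callback
  // that only posts an ASIO message; here it just prints.
  printf("APC ran with param %lu\n", (unsigned long)param);
}

int main(void)
{
  HANDLE wake = CreateEvent(NULL, FALSE, FALSE, NULL);

  // Queue an APC to this thread. It will not run until the thread enters
  // an alertable state.
  QueueUserAPC(on_io_complete, GetCurrentThread(), 42);

  // A non-alertable WaitForSingleObject(wake, 1000) would leave the APC
  // queued; the alertable wait below returns WAIT_IO_COMPLETION as soon
  // as the APC has been delivered.
  DWORD r = WaitForSingleObjectEx(wake, 1000, TRUE);

  if(r == WAIT_IO_COMPLETION)
    printf("woken by an APC, not by the event\n");

  CloseHandle(wake);
  return 0;
}
```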