Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup and fixes for Windows sockets #3816

Merged
merged 14 commits into from
Aug 15, 2021
Merged

Conversation

chalcolith
Copy link
Member

On Windows the TCPConnection code was not immediately unsubscribing to ASIO events when trying to shut down a connection. It was also not ever checking to see if all the relevant IOCP operations were complete, so in certain cases it would wait forever on the ASIO event and hang the program. This change makes TCPConnection immediately unsubscribe. There may be some pending IOCP events left over but since we're shutting down we don't care about them reaching client code.

On Windows the recommended way to check the status of a connection is to use SO_CONNECT_TIME rather than SO_ERROR. Certain failing IOCP operations will not set SO_ERROR, so this changes TCPClient._is_sock_connected() to use SO_CONNECT_TIME on Windows.

Fixes a case in TCPConnection._event_notify where clients might receive spurious connecting calls when a connection in a different network family failed.

Changes the use of Sleep() on Windows to SleepEX(), which will wake up for IO operations.

Adds a new test net/TCPConnectionFailed that tests if a connection to a bogus port will fail immediately. This was broken on Windows.

Adds a new test for Unix only new/TCPConnectionToClosedServerFailed that tests if a connection to a recently-closed server will fail immediately. This test is disabled on Windows because it seems that per investigation of #3656 listening sockets that have an AcceptEx in process stick around in the CLOSE_WAIT state and will allow connections even if the accepting and listening sockets have been closed.

@chalcolith chalcolith added the changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge label Aug 3, 2021
@ponylang-main
Copy link
Contributor

Hi @kulibali,

The changelog - fixed label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do.

Release notes are added by creating a uniquely named file in the .release-notes directory. We suggest you call the file 3816.md to match the number of this pull request.

The basic format of the release notes (using markdown) should be:

## Title

End user description of changes, why it's important,
problems it solves etc.

If a breaking change, make sure to include 1 or more
examples what code would look like prior to this change
and how to update it to work after this change.

Thanks.

@chalcolith chalcolith added the do not merge This PR should not be merged at this time label Aug 3, 2021
Comment on lines 1 to 3
## Cleanup and fixes for sockets on Windows

Fixed a few infelicities in the Windows socket code, in particular that clients connecting to an invalid port would think they were successfully connected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think this is very meaningful to an end user. It's basically "bug fix".

I think something like

## Address Windows socket errors

BLAH BLAH things fixed

with perhaps a description for users of what was wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated them.

packages/net/_test.pony Outdated Show resolved Hide resolved
packages/net/_test.pony Outdated Show resolved Hide resolved
@SeanTAllen
Copy link
Member

@kulibali the tests are failing, do you want me to review looking for why, or will you be getting that?

@chalcolith
Copy link
Member Author

I'll take a look.

@chalcolith
Copy link
Member Author

Well I just set up a new FreeBSD 13 VM and compiled and ran the tests there and they all worked fine. I wonder if there are networking differences in Cirrus.

@chalcolith
Copy link
Member Author

I made a FreeBSD 13 instance on Google Cloud, as that is what Cirrus uses (https://cirrus-ci.org/guide/FreeBSD/), and the tests run fine on that. So I'm stumped.

On Windows the `TCPConnection` code was not immediately unsubscribing to ASIO events when trying to shut down a connection. It was also not ever checking to see if all the relevant IOCP operations were complete, so in certain cases it would wait forever on the ASIO event and hang the program.  This change makes `TCPConnection` immediately unsubscribe. There may be some pending IOCP events left over but since we're shutting down we don't care about them reaching client code.

On Windows the recommended way to check the status of a connection is to use `SO_CONNECT_TIME` rather than `SO_ERROR`. Certain failing IOCP operations will not set `SO_ERROR`, so this changes `TCPClient._is_sock_connected()` to use `SO_CONNECT_TIME` on Windows.

Fixes a case in `TCPConnection._event_notify` where clients might receive spurious `connecting` calls when a connection in a different network family failed.

Changes the use of `Sleep()` on Windows to `SleepEX()`, which will wake up for IO operations.

Adds a new test `net/TCPConnectionFailed` that tests if a connection to a bogus port will fail immediately. This was broken on Windows.

Adds a new test for Unix only `new/TCPConnectionToClosedServerFailed` that tests if a connection to a recently-closed server will fail immediately. This test is disabled on Windows because it seems that per investigation of #3656 listening sockets that have an `AcceptEx` in process stick around in the `CLOSE_WAIT` state and will allow connections even if the accepting and listening sockets have been closed.
@chalcolith chalcolith force-pushed the kulibali/windows_socket_fixes branch from 1a7fd6e to fa1a0d0 Compare August 5, 2021 15:57
packages/net/_test.pony Outdated Show resolved Hide resolved
packages/net/_test.pony Outdated Show resolved Hide resolved
Comment on lines +740 to +746
class _TestTCPConnectionToClosedServerFailed is UnitTest
"""
Check that you can't connect to a closed listener.
"""
fun name(): String => "net/TCPConnectionToClosedServerFailed"
fun exclusion_group(): String => "network"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test lacks a complete(true) call. You should have that for any "success" cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complete(true) call is in _Connector, in the connect_failed function of the notify object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I missed the _Connector call

packages/net/_test.pony Outdated Show resolved Hide resolved
Comment on lines +740 to +744
class _TestTCPConnectionToClosedServerFailed is UnitTest
"""
Check that you can't connect to a closed listener.
"""
fun name(): String => "net/TCPConnectionToClosedServerFailed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this testing? I dont see a client that is doing any connecting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to _Connector.connect() on line 776 tries connecting.

@SeanTAllen
Copy link
Member

how's this going @kulibali, do you need any assistance?

@chalcolith
Copy link
Member Author

Assistance would be welcome. I haven't been able to find out how exactly Cirrus's FreeBSD images are configured. The tests that are failing on Cirrus run fine on a FreeBSD VM on my machine, as well as a Google Cloud FreeBSD image.

@SeanTAllen
Copy link
Member

Perhaps a timing issue that happens to happen only there? Or do you think it's a configuration issue?

The only tests we have that use 127.0.0.1 are the new tests.
Perhaps the host is the problem.
@SeanTAllen
Copy link
Member

I've found one small issue so far.

@SeanTAllen
Copy link
Member

ok i figured it out. working on a "better" fix than what i am currently doing.

@SeanTAllen
Copy link
Member

SeanTAllen commented Aug 15, 2021

issues:

  1. complete(true) shouldnt be called if expect_action is used. complete(true) will be called when all expected actions are finished. if you call complete(true) then even if expected actions dont happen, the test will pass.
  2. there was an incorrect complete action that was being hidden by problem 1
  3. the default connect timeout on the cirrus freebsd machines is really long. 75 seconds.

i'm working on setting a lower default on cirrus and then will clean up my changes.

@SeanTAllen
Copy link
Member

@kulibali ok, you are good now. is there anything else or is this ready to squash and merge?

@SeanTAllen
Copy link
Member

@kulibali let me know if any of the updates i made to release notes are incorrect.

@chalcolith
Copy link
Member Author

Looks good @SeanTAllen, thanks!

@SeanTAllen SeanTAllen removed the do not merge This PR should not be merged at this time label Aug 15, 2021
@SeanTAllen SeanTAllen merged commit d50259f into main Aug 15, 2021
@SeanTAllen SeanTAllen deleted the kulibali/windows_socket_fixes branch August 15, 2021 15:20
github-actions bot pushed a commit that referenced this pull request Aug 15, 2021
github-actions bot pushed a commit that referenced this pull request Aug 15, 2021
jemc added a commit to jemc/ponyc that referenced this pull request Feb 22, 2023
Prior to this commit we were using a mechanism on Windows called
`WaitForSingleObject` when we wanted to put a thread to sleep
until we activated a signal to wake it up later.

Now we use `WaitForSingleObjectEx` instead, so we can be sure that
the thread is kept in an "alertable" state if there is an APC
(Async Procedure Call) that needs to be run during the wait period.

We use APCs for socket I/O completion callbacks, and they can only
arrive on the thread that initiated them, so it's important to
allow such callbacks to run even if the scheduler thread is suspended.

Note that the callbacks we use for sockets will not run any Pony
code - they will only dispatch a message via the ASIO subsystem,
so running those APCs is safe even if the actor has been
work-stealed to another thread and is running there.
In such a situation, the ASIO event will just go into the actor's
queue and they'll process the message later, as normal.
We just need to make sure the APC actually will get run,
hence we need to ensure even suspended threads will stay "alertable".

Note that this fixes a similar problem to one of the problems
that was fixed in PR ponylang#3816, wherein some calls to `Sleep`
were migrated to `SleepEx` for the same reason - to stay "alertable".
SeanTAllen pushed a commit that referenced this pull request Feb 22, 2023
Prior to this commit we were using a mechanism on Windows called
`WaitForSingleObject` when we wanted to put a thread to sleep
until we activated a signal to wake it up later.

Now we use `WaitForSingleObjectEx` instead, so we can be sure that
the thread is kept in an "alertable" state if there is an APC
(Async Procedure Call) that needs to be run during the wait period.

We use APCs for socket I/O completion callbacks, and they can only
arrive on the thread that initiated them, so it's important to
allow such callbacks to run even if the scheduler thread is suspended.

Note that the callbacks we use for sockets will not run any Pony
code - they will only dispatch a message via the ASIO subsystem,
so running those APCs is safe even if the actor has been
work-stealed to another thread and is running there.
In such a situation, the ASIO event will just go into the actor's
queue and they'll process the message later, as normal.
We just need to make sure the APC actually will get run,
hence we need to ensure even suspended threads will stay "alertable".

Note that this fixes a similar problem to one of the problems
that was fixed in PR #3816, wherein some calls to `Sleep`
were migrated to `SleepEx` for the same reason - to stay "alertable".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants