Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SmtpClientTest.SendMail_SendQUITOnDispose timeout on OSX #41687

Closed
antonfirsov opened this issue Sep 1, 2020 · 10 comments
Closed

SmtpClientTest.SendMail_SendQUITOnDispose timeout on OSX #41687

antonfirsov opened this issue Sep 1, 2020 · 10 comments
Assignees
Milestone

Comments

@antonfirsov
Copy link
Member

antonfirsov commented Sep 1, 2020

Test case

System.Net.Mail.Tests.SmtpClientTest.SendMail_SendQUITOnDispose(asyncSend: True)
System.Net.Mail.Tests.SmtpClientTest.SendMail_SendQUITOnDispose(asyncSend: False)

Failed on

net5.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open
https://dev.azure.com/dnceng/public/_build/results?buildId=796625&view=ms.vss-test-web.build-test-results-tab&runId=25205974&paneView=debug&resultId=167484

With

System.TimeoutException : Task timed out after 00:00:30

Stack trace

at System.Threading.Tasks.TaskTimeoutExtensions.TimeoutAfter(Task task, TimeSpan timeout) in /_/src/libraries/Common/tests/System/Threading/Tasks/TaskTimeoutExtensions.cs:line 41
   at System.Net.Mail.Tests.SmtpClientTest.SendMail_SendQUITOnDispose(Boolean asyncSend) in /_/src/libraries/System.Net.Mail/tests/Functional/SmtpClientTest.cs:line 512
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Net untriaged New issue has not been triaged by the area owner labels Sep 1, 2020
@ghost
Copy link

ghost commented Sep 1, 2020

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

@wfurt
Copy link
Member

wfurt commented Sep 1, 2020

I hit that as well. I'm wondering if this is related to #41508. We did not see that before AFAIK.

cc: @tmds

@wfurt
Copy link
Member

wfurt commented Sep 1, 2020

4 failures in last 4 days and nothing before in 90 days. This looks like regression @karelz

@tmds
Copy link
Member

tmds commented Sep 2, 2020

This is the test that fails, for asyncSend=true.

        [Theory]
        [InlineData(false)]
        [InlineData(true)]
        public async Task SendMail_SendQUITOnDispose(bool asyncSend)
        {
            bool quitMessageReceived = false;
            using ManualResetEventSlim quitReceived = new ManualResetEventSlim();
            using var server = new LoopbackSmtpServer();
            server.OnQuitReceived += _ =>
            {
                quitMessageReceived = true;
                quitReceived.Set();
            };

            using (SmtpClient client = server.CreateClient())
            {
                client.Credentials = new NetworkCredential("Foo", "Bar");
                MailMessage msg = new MailMessage("foo@example.com", "bar@example.com", "hello", "howdydoo");
                if (asyncSend)
                {
                    await client.SendMailAsync(msg).TimeoutAfter((int)TimeSpan.FromSeconds(30).TotalMilliseconds);
                }
                else
                {
                    client.Send(msg);
                }
                Assert.False(quitMessageReceived, "QUIT received");
            }

            // There is a latency between send/receive.
            quitReceived.Wait(TimeSpan.FromSeconds(30));
            Assert.True(quitMessageReceived, "QUIT message not received");
        }

From the exception, it seems it is throwing on TimeoutAfter.

4 failures in last 4 days and nothing before in 90 days. This looks like regression @karelz

#41508 matches the timing. #41508 should only be changing behaviour when the Socket gets cleaned up by the finalizer. I don't think that happens in this test.

This was reported against macOS, has it occurred on Linux? I will try to reproduce it on Linux.

Maybe we need to increase the timeout used?

@wfurt
Copy link
Member

wfurt commented Sep 2, 2020

All cases I saw were on OSX 10.14. I don't see direct correlation either but it is suspicious that the test verifies dispose behavior. I'm wondering if all this is timing....? But I don't know why it would be much? slower.

@antonfirsov
Copy link
Member Author

antonfirsov commented Sep 2, 2020

What I see in Kusto, querying for Method == "SendMail_SendQUITOnDispose":

  • There are 5 failures since 2020-08-28, 2 with asyncSend: true and 3 with asyncSend: false. The test did not fail before.
  • The sync variant fails in client.Send(msg), the async variant fails in quitReceived.Wait(TimeSpan.FromSeconds(30))
  • The following 2 failures happened before merging SafeSocketHandle: avoid potential blocking of finalizer thread #41508 (which was around 2020-09-01 12:00 UTC):
    image

@karelz karelz added this to the 5.0.0 milestone Sep 3, 2020
@karelz karelz removed the untriaged New issue has not been triaged by the area owner label Sep 3, 2020
@karelz
Copy link
Member

karelz commented Sep 3, 2020

Triage: Check if it was 1 machine? Or if there are more random test failures on the OSX 10.14.
Let's keep it in 5.0 until we are sure it is NOT impacting production of 5.0, given it started failing just recently -- we don't want to punt it until we are sure it is not a regression in 5.0.

@wfurt
Copy link
Member

wfurt commented Sep 3, 2020

I did 1000+ runs locally and I did not see any failure.

@CarnaViire
Copy link
Member

I also wasn't able to repro it locally.
And we didn't have any additional failures for this test since 2020-09-01 18:56.

Also, querying for long timeouted tests in last month shows me that this number is pretty high for osx.1014.amd64.open (it took the first place)

QueueName Count
osx.1014.amd64.open 1819
windows.10.amd64.open 634
ubuntu.1604.amd64.open 480
... ...

link to Kusto

Query that I ran

TestResults 
| join kind=inner WorkItems on WorkItemId 
| join kind=inner Jobs on JobId
| where Finished >= datetime(2020-08-04) and Message contains 'timeout' and Result == 'Fail' and Duration > 30
| summarize Count = count() by QueueName
| order by Count desc
| project QueueName, Count

It makes me think that it might be CI overload for Mac machines...

@CarnaViire
Copy link
Member

As of today, there are still no additional failures for this test. I'm closing it for now, and will reopen if the failures re-appear.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants