
Queries with MultipleActiveResultSets=True (MARS) are very slow / time out on Linux #422

pawelpabich opened this issue Feb 12, 2020 · 106 comments


@pawelpabich

pawelpabich commented Feb 12, 2020

Describe the bug

TL;DR:
Queries on connections with MARS enabled are much slower on Linux, or even time out, even when they don't actually use MARS. The same queries are fast and reliable on Windows (whether MARS is enabled or disabled) and on Linux when MARS is disabled.

Context
Octopus Cloud hosts Octopus Deploy instances in Linux containers on Azure AKS, with data stored in Azure Files and Azure SQL. A couple of months ago we noticed that some of the SQL queries were much slower or even started timing out, which is not something we had experienced before on Windows using the full .NET Framework. Some of the slowdown might be caused by AKS (K8s), but we think that the SqlClient might also be playing a role here. 119112824000676 is our Azure Support Request number, if that helps in any way.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest transactionRequest, String name, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso)
   at reprocli.Program.Scenario4(String connString, Int32 number)
   at reprocli.Program.<>c__DisplayClass0_0.<Main>b__0(Int32 n)
   at System.Linq.Parallel.ForAllOperator`1.ForAllEnumerator`1.MoveNext(TInput& currentElement, Int32& currentKey)
   at System.Linq.Parallel.ForAllSpoolingTask`2.SpoolingWork()
   at System.Linq.Parallel.SpoolingTaskBase.Work()
   at System.Linq.Parallel.QueryTask.BaseWork(Object unused)
   at System.Linq.Parallel.QueryTask.<>c.<.cctor>b__10_0(Object o)
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__274_0(Object obj)
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
ClientConnectionId:005d2aae-9409-4711-aaa0-b03b70f2832e
Error Number:-2,State:0,Class:11
ClientConnectionId before routing:e3300799-fdd0-40a4-84ea-b9f383596b12
Routing Destination:fed2c41af7dc.tr5.westus2-a.worker.database.windows.net,11063<---

We also captured TCP dumps while running the tests on Linux and it looks like enabling MARS causes TCP RST.

[Screenshots of the TCP dumps showing the RST packets]

Full TCP Dumps: https://github.com/benPearce1/k8s-sql-timeout-repro/tree/tiny/source/reprocli/tcpdumps

To reproduce

Code

Repo with the sample app: https://github.com/benPearce1/k8s-sql-timeout-repro/blob/tiny/source/reprocli/Program.cs. The Compiled folder contains pre-compiled versions of the app so the .NET Core SDK doesn't have to be present on the target VMs.

The first parameter is the level of parallelism. The second parameter is the connection string.

using System;
using System.Data;
using System.Diagnostics;
using System.Linq;
using Microsoft.Data.SqlClient;

namespace reprocli
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                var count = int.Parse(args[0]);
                var connectionString = args[1];

                var total = Stopwatch.StartNew();

                PrepareData(connectionString);
                total.Restart();
                Enumerable.Range(0,count)
                    .AsParallel()
                    .WithDegreeOfParallelism(count)
                    .ForAll(n => Scenario4(connectionString, n));

                Console.WriteLine($"Total: {total.Elapsed}");

            }
            catch (Exception e)
            {
                Console.WriteLine(e);
                throw;
            }
        }

        private static void Scenario4(string connString, int number)
        {
            var userStopWatch = Stopwatch.StartNew();

            var buffer = new object[100];
            for (var i = 0; i < 210; i++)
            {
                var queryStopWatch = Stopwatch.StartNew();


                using (var connection = new SqlConnection(connString))
                {
                    connection.Open();
                    using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
                    {
                        using (var command = new SqlCommand("SELECT * From TestTable", connection, transaction))
                        {
                            using (var reader = command.ExecuteReader())
                            {
                                while (reader.Read())
                                {
                                    reader.GetValues(buffer);
                                }
                            }
                        }

                        transaction.Commit();
                    }
                }

                queryStopWatch.Stop();
                Console.WriteLine($"Number: {number}. Query: {i} Time: {queryStopWatch.Elapsed}");
            }

            userStopWatch.Stop();
            Console.WriteLine($"Number: {number}. All Queries. Time: {userStopWatch.Elapsed}");
        }

        static void PrepareData(string connectionString)
        {
            var createTable = @"
                DROP TABLE IF EXISTS TestTable;
                CREATE TABLE TestTable
                (
                    [Id] [nvarchar](50) NOT NULL PRIMARY KEY,
                    [Name] [nvarchar](20) NOT NULL
                );";

            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
                {
                    using (var command = new SqlCommand(createTable, connection, transaction))
                    {
                        command.ExecuteNonQuery();
                    }

                    transaction.Commit();
                }
            }

        }
    }
}

This is how we reproduced the problem; it doesn't mean you need this exact config.

The database was hosted in an Azure SQL Elastic Pool (Standard: 300 eDTUs) on a SQL Server in West US 2 region.

LINUX

Run the sample app with the following arguments on a Linux (Ubuntu 18.04) VM (Standard D8s v3, 8 vcpus, 32 GiB memory) in the Azure West US 2 region.

MARS ON

dotnet reprocli.dll 200 'Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;Persist Security Info=False;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=True;'

The expected result is that the app finishes without throwing any errors, but that's not the case: Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. is thrown.

Reducing the level of parallelism to 20 stops the app from crashing.

Also, when MARS is ON the console doesn't show any progress for 10+ seconds. This is not the case when MARS is OFF.

MARS OFF

dotnet reprocli.dll 200 'Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;Persist Security Info=False;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=False;'

The expected result is that the app finishes without throwing any errors, which is the case. The app finished in just under 25 seconds (Total: 00:00:24.9737616). The app also worked with much higher levels of parallelism (e.g. 500).

AKS

Same spec as above: Linux (Ubuntu 18.04) VM (Standard D8s v3, 8 vcpus, 32 GiB memory) in Azure West US 2. We also ran this test in a container in AKS and the results were pretty much the same. The only difference was that we had to lower the level of parallelism even more; K8s networking adds a bit of overhead, which might make the problem more pronounced.

WINDOWS

Run the sample app with the following arguments on a Windows (Windows Server 2016 Datacenter) VM (Standard D8s v3, 8 vcpus, 32 GiB memory) in the Azure West US 2 region.

dotnet reprocli.dll 200 'Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;Persist Security Info=False;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=True;'

The expected result is that the app finishes without throwing an exception, which is the case. The app finished in just under 24 seconds (Total: 00:00:23.4068641). It also worked with the level of parallelism set to 500. We achieved similar results with MARS disabled.

Note: We used .NET Core to run the tests on Windows.

Expected behavior

The sample app should not crash, and connections with the MARS feature enabled should behave the same way on Linux as on Windows.

Further technical details

Microsoft.Data.SqlClient version: 1.1.0 and 2.0.0-preview1.20021.1
.NET target: (Core 2.2 and Core 3.1)
SQL Server version: (Azure SQL)
Operating system: (Ubuntu 18.04 and AKS with Ubuntu 18.04)

Additional context
We've been battling this issue for a long time now, so we are happy to help in any way we can to get it resolved.

@Wraith2
Contributor

Wraith2 commented Feb 12, 2020

The Linux implementation of the network interface is entirely different from the one used on Windows, which is why you'll see different behaviour. The difference between MARS and non-MARS is probably caused by the fact that MARS takes part of the packet size for the MARS header, so it can require 2 received network packets to complete a single logical connection packet, and that will exacerbate timing problems; you'd see them roughly twice as fast.

One thing to note is that while SQL Server is quite capable of doing many things at once, it's not always a good idea to try and open 100 simultaneous connections and use them all in parallel. The Unix sockets implementation differs from Windows in ways which can exacerbate threading problems, e.g. dotnet/runtime#32016. There's also a past issue on this repo where, on investigation, I found that thread starvation with high levels of requested connection-level parallelism caused a problem.

I'll try and take a look at the repro and see if anything sticks out as a problem.

@cheenamalhotra
Member

Thanks @Wraith2 for looking into it.

@pawelpabich
Thanks for the detailed repro; hopefully it will help fast-track the investigation!
On a side note, would you mind testing the same with older System.Data.SqlClient driver versions (4.6.1, 4.7.0, ...) to confirm whether the behaviour has always been the same or performance has degraded recently?

@Wraith2
Contributor

Wraith2 commented Feb 12, 2020

The repro works on Windows in forced managed mode. By works I mean that I get a connection failure pretty quickly (should be 10 seconds) after I start the program. If you confirm that's what's expected then the repro is good.

Initial impression is that it's simple threadpool starvation. If you add in:

ThreadPool.GetMaxThreads(out var workerThreads, out var completionThreads);
ThreadPool.SetMinThreads(count, completionThreads);

before you start the stopwatch you should find that it runs through successfully.
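
For illustration, a minimal sketch (assuming the repro's Main shown earlier, plus using System.Threading;) of where those calls would go:

var count = int.Parse(args[0]);
var connectionString = args[1];

// Raise the thread pool's minimum so the many synchronous opens don't have to
// wait for the pool to grow roughly one thread per second (see below).
ThreadPool.GetMaxThreads(out var workerThreads, out var completionThreads);
ThreadPool.SetMinThreads(count, completionThreads);

var total = Stopwatch.StartNew();
Enumerable.Range(0, count)
    .AsParallel()
    .WithDegreeOfParallelism(count)
    .ForAll(n => Scenario4(connectionString, n));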

So why does that work? You're using Parallel, which attempts to start count tasks and then have as many as the specified maxdop running at once. It does this by creating tasks and then queuing them to the threadpool. When you start those tasks they all try to connect to the database synchronously, which blocks, but it blocks in a particular way in the managed implementation; it ends up at

public override uint Receive(out SNIPacket packet, int timeoutInMilliseconds)
which sits in a loop waiting for a ManualResetEventSlim to be signalled, telling it that a packet has arrived. The problem is that the thing that has to set that event needs to run somewhere, and if you've consumed all the threadpool threads with synchronous waits there's no thread for it to run on. So the wait times out, the receive fails, the connection isn't established, and all of this eventually manifests as a connection or query timeout.

It'd be nice if we weren't going sync-over-async in the library so this problem couldn't occur inside it. Changing that is challenging and impractical in the short to mid term; it's pretty much a core rewrite of the non-Windows implementation, with all the reliability and performance concerns that would come with it. So don't expect that to happen soon.

So why is this your problem? Well, you're the one using all the threadpool threads. The threadpool starts with a minimum of 10 ready threads in it, and at various times tasks are queued to those threads and then completed. Unless you hit a period of sustained load on the pool it's unlikely to need to grow past the minimum value, so most of the time you'll have a small number of threads. The threadpool is balanced for general workloads and a lot of research and tuning has gone into it. One of the behaviours that results from this tuning is that it doesn't immediately add large numbers of threads when there are more tasks waiting to be processed than there are threads available to do them. The threadpool usually only adds one thread a second to the pool. If you load it with 100 tasks quickly you'll get 10, then 11, then 12, etc., climbing slowly.
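
A small stand-alone sketch (not from the repro; ThreadPool.ThreadCount needs .NET Core 3.0+) that makes this injection rate visible by saturating the pool with blocking work items and printing the pool's thread count every second:

using System;
using System.Threading;

class PoolGrowthDemo
{
    static void Main()
    {
        // Queue far more blocking work items than the pool's minimum thread count.
        for (var i = 0; i < 100; i++)
            ThreadPool.QueueUserWorkItem(_ => Thread.Sleep(30_000));

        // Once the minimum is exhausted the pool typically injects roughly one
        // extra thread per second, so this prints a slowly climbing number.
        for (var s = 0; s < 15; s++)
        {
            Console.WriteLine($"{s}s: {ThreadPool.ThreadCount} pool threads");
            Thread.Sleep(1000);
        }
    }
}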

So how do you fix it? Don't do this. Don't try to use large numbers of simultaneous synchronous connections like this. Unless you're talking to a different server or database on each connection, it's going to be more efficient to use a pipeline approach, feeding results to an open connection and having it process multiple commands with a single connection. Or switch to using async if possible, because the awaited suspension will free up threadpool threads, allowing them to be used for callback signalling.
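
For the async route, a hedged sketch of what the repro's Scenario4 might look like using the driver's async APIs (OpenAsync/ExecuteReaderAsync/ReadAsync); the caller would also need to change from the PLINQ ForAll to awaiting tasks, e.g. via Task.WhenAll:

// Assumes the repro's usings plus System.Threading.Tasks.
private static async Task Scenario4Async(string connString, int number)
{
    var buffer = new object[100];
    for (var i = 0; i < 210; i++)
    {
        using (var connection = new SqlConnection(connString))
        {
            // Awaiting releases the thread pool thread while I/O is pending
            // instead of blocking it the way the synchronous calls do.
            await connection.OpenAsync();
            using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
            {
                using (var command = new SqlCommand("SELECT * From TestTable", connection, transaction))
                using (var reader = await command.ExecuteReaderAsync())
                {
                    while (await reader.ReadAsync())
                    {
                        reader.GetValues(buffer);
                    }
                }
                transaction.Commit();
            }
        }
    }
}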

@pawelpabich
Author

pawelpabich commented Feb 13, 2020

Thanks @Wraith2 for the investigation. Thread starvation on Linux (dotnet/runtime#14304) explains some of the results we see but I think there are still some unanswered questions here.

One thing to note is that while sql is quite capable of doing many things at once it's not always a good idea to try and open 100 simultaneous connections and to use them all in parallel.

This might not be obvious, but the repro is based on how Octopus actually works; it's not entirely a synthetic benchmark. A single deployment performs around that many queries against the database. We didn't want to include thousands of lines of code here, but the end result is the same. count represents the number of concurrent deployments, and our customers can run hundreds of deployments at once.

Also, we checked a bunch of metrics in the SQL Server and the resource utilization is far from being maxed out. This indicates that SQL Server is not the bottleneck here.

Question 1

Why does switching MARS OFF on Linux fix the problem entirely and allow the level of parallelism to be increased from 20 to 500? Would just one additional packet cause such a massive difference?

Question 2

Why do we see TCP RST with MARS ON but not with MARS OFF?

@pawelpabich
Author

@cheenamalhotra we saw the same problem when we were using S.D.SqlClient. We only migrated to M.D.SqlClient at the end of December 2019. We hoped that would solve our problem, but it didn't.

@pawelpabich
Author

ThreadPool.GetMaxThreads(out var workerThreads, out var completionThreads);
ThreadPool.SetMinThreads(count, completionThreads);

Threads on Windows are not cheap. @Wraith2, do you know whether this is also the case on Linux? If not, we could start with more threads as a short-term workaround.

@Wraith2
Contributor

Wraith2 commented Feb 13, 2020

From the little I know of it, threads are even more expensive on Linux than they are on Windows; IIRC Windows is set up for fast thread creation and Linux for fast process creation. So, as you say, not cheap, and it's a poor way to scale. Setting MinThreads on the pool is a way to get around it. Not a great idea for production, but it's a public API, so it is provided and supported for you to use if you need to do so. Read around the method and see if you think it's an acceptable thing to do in your case.

SQL Server is not the bottleneck; the server is barely doing anything throughout the benchmark. The problem is the client process's thread creation, as far as I can see.

A1: Turning off MARS on Linux removes an intermediate handle object and changes where and how receives are done. In the code I linked to, you see the Receive using a packetEvent wait handle. Without MARS you'd see its caller calling into

public override uint Receive(out SNIPacket packet, int timeoutInMilliseconds)

which doesn't do that kind of wait at all and instead does a direct socket call. So you don't hit the problem, because the wait is done in the socket implementation, which may not require a threadpool thread.
Really, if you can avoid using MARS I would advise doing so; it doesn't seem to have many upsides and significantly increases the likelihood of issues on the managed network implementation. So yes, that one packet really can make all the difference, because on the MARS path it immediately causes a synchronous wait that can't be unblocked without the resource it's blocking on.

A2: At a guess, if it's coming from the SqlClient side I'd say it's because the socket is being shut down because of the failure. Once successfully connected, sockets are kept open until they're removed from the pool; even if you close the SqlConnection object, the SqlConnectionInternal object it contains stays alive for a while, unless there is a failure, which dooms it to immediate closure. If it's from the server side I couldn't say.

@pawelpabich
Author

pawelpabich commented Feb 13, 2020

Thanks @Wraith2. I appreciate the time you took to explain the details.

Really, if you can avoid using MARS I would advise doing so; it doesn't seem to have many upsides and significantly increases the likelihood of issues on the managed network implementation

I would love to do that but this would be a significant change for us and we need a decent workaround/solution as soon as possible. @cheenamalhotra, is there a chance this code can be optimized?

At a guess, if it's coming from the SqlClient side I'd say it's because the socket is being shut down because of the failure. Once successfully connected, sockets are kept open until they're removed from the pool; even if you close the SqlConnection object, the SqlConnectionInternal object it contains stays alive for a while, unless there is a failure, which dooms it to immediate closure. If it's from the server side I couldn't say.

Yes, it is coming from the server side.
[Screenshot of the TCP dump; 172.xxx is the private IP of the VM]
@cheenamalhotra, I think we need some additional help here.

@Leonardo-Ferreira

Leonardo-Ferreira commented Mar 12, 2020

I just revalidated this using macOS 10.14.6 (Mojave) + .NET Core App 3.1.0 + System.Data.SqlClient 4.8.1.
I did notice the initial hang, ranging from 7 to 10 seconds, and did observe slowness on sporadic queries.

Edit 1: I fiddled a bit, but the "Enumerable.Range" count is always 20. When I reduced the degree of parallelism to 8 (the number of cores in my machine) the problem disappeared completely. With a degree of parallelism up to 8, MARS and no-MARS are very similar. The higher the degree of parallelism, the worse the initial hang for the first query.

This can also be solved, as @Wraith2 said, by expanding the thread pool. Note that Linux does not differentiate between WorkerThreads and CompletionThreads (IO threads).

@Leonardo-Ferreira

I was able to trace the slowness of your "mid period queries" to the transactions. If you do not open transactions explicitly, things will fly.
The slowness of the transaction happens whether MARS is on or not, so I have no reason to believe that there's a problem with the MARS implementation.

@pawelpabich
Author

This can also be solved, as @Wraith2 said, by expanding the thread pool. Note that Linux does not differentiate between WorkerThreads and CompletionThreads (IO threads).

We've tried this and it did not solve the problem for us because MARS connections still take much longer to open and we simply run out of connections in the pool.

@pawelpabich
Author

I was able to trace the slowness of your "mid period queries" to the transactions. If you do not open transactions explicitly things will fly.

So what do you propose we do instead?

The slowness of the transaction happens whether MARS is on or not, so I have no reason to believe that there's a problem with the MARS implementation.

From what I see, you ran your tests in a different environment (macOS) than ours, so I'm not sure how you can draw this conclusion.

@Leonardo-Ferreira

Well, about the validity of the tests: macOS behaves very similarly to Linux, including the thread management; after all, macOS is Unix-based. So I would take my repro with a grain of salt, but I'm quite confident in the analysis.

About the transactions blocking each other, there are several workarounds, but they rely on actual code changes, business-rule validations, etc. This is a database issue, not an implementation issue.
If you're out of your mind you could remove the transaction altogether (and when questioned, you can say that the fever from the coronavirus made you hallucinate), but in a more realistic scenario you can try to set the isolation level to "ReadUncommitted"; this way the blocking/deadlocking will decrease immensely.
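
For reference, in the repro that would be a one-line change (a sketch only; whether dirty reads are acceptable is a business decision):

// ReadUncommitted allows dirty reads but avoids most shared-lock blocking.
using (var transaction = connection.BeginTransaction(IsolationLevel.ReadUncommitted))
{
    // ... same command/reader code as in Scenario4 ...
    transaction.Commit();
}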

@cheenamalhotra
Member

cheenamalhotra commented Mar 13, 2020

@pawelpabich

I haven't been responding lately, but just to let you know I am investigating this issue. I'm able to reproduce the problem in Linux Docker containers when connecting to any SQL Server, irrespective of whether it is an Azure DB, and it does seem to be related to Linux memory management, as the problem does not reproduce when running the same app on Windows using the same Managed SNI codepath. @Wraith2 seems to have provided some details.

However, I'm investigating whether it is specifically related to containerized environment limitations. I'm not sure at this point whether this has anything to do with transactions (I don't think so), but I'll share updates soon!

@Leonardo-Ferreira

Leonardo-Ferreira commented Mar 13, 2020

@cheenamalhotra, I don't believe that this is related to containerised applications, because I was able to reproduce the issue on my local machine (to some degree, since I did not experience heavy delays mid-execution). Perhaps the containerised scenario exacerbates the problem, but I believe the problem is there.
Here is a summary of my tests, all conducted on my local machine (8 cores, 16 GB RAM) against a networked SQL Server (not local):

Count:20, Parallel:8, YES TRANSACTION - NO MARS
Number: 5. All Queries. Time: 00:00:01.0470491
Number: 3. All Queries. Time: 00:00:01.0500434
Number: 6. All Queries. Time: 00:00:01.0514407
Number: 0. All Queries. Time: 00:00:01.0542009
Number: 2. All Queries. Time: 00:00:01.0583711
Number: 4. All Queries. Time: 00:00:01.1497360
Number: 1. All Queries. Time: 00:00:01.1510161
Number: 7. All Queries. Time: 00:00:01.1630255
Number: 8. All Queries. Time: 00:00:00.6292422
Number: 10. All Queries. Time: 00:00:00.6340763
Number: 9. All Queries. Time: 00:00:00.6494815
Number: 12. All Queries. Time: 00:00:00.6463415
Number: 11. All Queries. Time: 00:00:00.6570911
Number: 14. All Queries. Time: 00:00:00.8896517
Number: 13. All Queries. Time: 00:00:00.9317428
Number: 15. All Queries. Time: 00:00:00.9248774
Number: 16. All Queries. Time: 00:00:01.2828993
Number: 17. All Queries. Time: 00:00:01.2867972
Number: 18. All Queries. Time: 00:00:01.2852665
Number: 19. All Queries. Time: 00:00:01.2847378
Total: 00:00:03.1413429
First Queries:
00:00:00.5860205
00:00:00.5860870
00:00:00.5859981
00:00:00.5860731
00:00:00.5860413
00:00:00.5860577
00:00:00.5860111
00:00:00.5873522
00:00:00.0018571
00:00:00.0018698
00:00:00.0018539
00:00:00.0020848
00:00:00.0015862
00:00:00.0024604
00:00:00.0026320
00:00:00.0022536
00:00:00.0066965
00:00:00.0068477
00:00:00.0068829
00:00:00.0065229

Count:20, Parallel:8, YES TRANSACTION - YES MARS
Number: 3. All Queries. Time: 00:00:01.8238989
Number: 2. All Queries. Time: 00:00:01.8260756
Number: 4. All Queries. Time: 00:00:01.8352712
Number: 7. All Queries. Time: 00:00:01.8375570
Number: 6. All Queries. Time: 00:00:01.8399148
Number: 5. All Queries. Time: 00:00:01.8444057
Number: 1. All Queries. Time: 00:00:01.8468115
Number: 0. All Queries. Time: 00:00:01.8583285
Number: 13. All Queries. Time: 00:00:01.4769483
Number: 15. All Queries. Time: 00:00:01.4741198
Number: 8. All Queries. Time: 00:00:01.5680454
Number: 9. All Queries. Time: 00:00:01.5665582
Number: 10. All Queries. Time: 00:00:01.5605656
Number: 12. All Queries. Time: 00:00:01.5582591
Number: 11. All Queries. Time: 00:00:01.5654078
Number: 14. All Queries. Time: 00:00:01.5613466
Number: 16. All Queries. Time: 00:00:00.4957817
Number: 17. All Queries. Time: 00:00:00.4917708
Number: 18. All Queries. Time: 00:00:00.4574173
Number: 19. All Queries. Time: 00:00:00.4571974
Total: 00:00:04.0195703
First Queries:
00:00:00.5965634
00:00:00.5965534
00:00:00.5965991
00:00:00.5965589
00:00:00.5965778
00:00:00.5965937
00:00:00.5965741
00:00:00.5965588
00:00:00.0064917
00:00:00.0066611
00:00:00.0068385
00:00:00.0067982
00:00:00.0068086
00:00:00.0066095
00:00:00.0066777
00:00:00.0066912
00:00:00.0045863
00:00:00.0058520
00:00:00.0021949
00:00:00.0018970

Count:20, Parallel:8, NO TRANSACTION - YES MARS
Number: 5. All Queries. Time: 00:00:00.8371968
Number: 3. All Queries. Time: 00:00:00.8407228
Number: 0. All Queries. Time: 00:00:00.8464102
Number: 2. All Queries. Time: 00:00:00.8510748
Number: 7. All Queries. Time: 00:00:00.8699850
Number: 4. All Queries. Time: 00:00:00.8858711
Number: 6. All Queries. Time: 00:00:00.8877594
Number: 1. All Queries. Time: 00:00:00.9125339
Number: 8. All Queries. Time: 00:00:00.4547782
Number: 9. All Queries. Time: 00:00:00.4537054
Number: 11. All Queries. Time: 00:00:00.4833543
Number: 13. All Queries. Time: 00:00:00.4650509
Number: 10. All Queries. Time: 00:00:00.5251250
Number: 12. All Queries. Time: 00:00:00.5050363
Number: 14. All Queries. Time: 00:00:00.5020542
Number: 15. All Queries. Time: 00:00:00.5916084
Number: 17. All Queries. Time: 00:00:00.5435370
Number: 18. All Queries. Time: 00:00:00.5602439
Number: 19. All Queries. Time: 00:00:00.5896770
Number: 16. All Queries. Time: 00:00:00.6996142
Total: 00:00:02.1313958
First Queries:
00:00:00.6278838
00:00:00.6279385
00:00:00.6278945
00:00:00.6278912
00:00:00.6279155
00:00:00.6279257
00:00:00.6278742
00:00:00.6279185
00:00:00.0012271
00:00:00.0006439
00:00:00.0007066
00:00:00.0006888
00:00:00.0011631
00:00:00.0006712
00:00:00.0012544
00:00:00.0049881
00:00:00.0019198
00:00:00.0005747
00:00:00.0007611
00:00:00.0006678

Count:20, Parallel:8, NO TRANSACTION - NO MARS
Number: 0. All Queries. Time: 00:00:00.7707507
Number: 5. All Queries. Time: 00:00:00.8029723
Number: 6. All Queries. Time: 00:00:00.8053098
Number: 3. All Queries. Time: 00:00:00.8165772
Number: 1. All Queries. Time: 00:00:00.8190445
Number: 2. All Queries. Time: 00:00:00.8348932
Number: 4. All Queries. Time: 00:00:00.8596580
Number: 7. All Queries. Time: 00:00:00.9215064
Number: 8. All Queries. Time: 00:00:00.6400341
Number: 10. All Queries. Time: 00:00:00.6576637
Number: 11. All Queries. Time: 00:00:00.6576382
Number: 9. All Queries. Time: 00:00:00.6908438
Number: 12. All Queries. Time: 00:00:00.6806739
Number: 13. All Queries. Time: 00:00:00.6704229
Number: 14. All Queries. Time: 00:00:00.6783818
Number: 15. All Queries. Time: 00:00:00.6482645
Number: 16. All Queries. Time: 00:00:00.4391626
Number: 18. All Queries. Time: 00:00:00.3995539
Number: 19. All Queries. Time: 00:00:00.3799171
Number: 17. All Queries. Time: 00:00:00.4165565
Total: 00:00:02.0165027
First Queries:
00:00:00.6144434
00:00:00.6144111
00:00:00.6143904
00:00:00.6144485
00:00:00.6143983
00:00:00.6175860
00:00:00.6177502
00:00:00.6193642
00:00:00.0005292
00:00:00.0022520
00:00:00.0020013
00:00:00.0022936
00:00:00.0026404
00:00:00.0021068
00:00:00.0021113
00:00:00.0027646
00:00:00.0044434
00:00:00.0067386
00:00:00.0040238
00:00:00.0034082

@naleczynskia

naleczynskia commented Mar 20, 2020

I do experience this issue on Azure Web App For Containers (on Linux), but not when running the same container locally on Docker Desktop (on Windows). Hope this helps.

@tunger

tunger commented Mar 27, 2020

Good that this GitHub issue exists (thanks!); we seem to have run into the same issue. The problem only appears when running the (ASP.NET Core + EF Core 3.1.2) app on Docker with Kubernetes with MARS on. Our background service handling lots of data would simply "die", sometimes with and sometimes without any exception thrown. As it is a BackgroundService/IHostedService, the web app continues to run; just the BackgroundService is gone.

I turned MARS off and now it works.

I got two kinds of exceptions; this one occurred with the default settings of DbContext:

An exception occurred while iterating over the results of a query for context type '***'.
Microsoft.Data.SqlClient.SqlException (0x80131904): The connection is broken and recovery is not possible.  The connection is marked by the server as unrecoverable.  No attempt was made to restore the connection.
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__164_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(DbContext _, Boolean result, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()

When setting the command timeout to five minutes, I got this exception - same as the opener of this issue.

An exception occurred while iterating over the results of a query for context type '"***"'."
""Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TrySetMetaData(_SqlMetaDataSet metaData, Boolean moreInfo)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at Microsoft.Data.SqlClient.SqlDataReader.get_MetaData()
   at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)
   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method)
   at Microsoft.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReader(RelationalCommandParameterObject parameterObject)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.Enumerator.InitializeReader(DbContext _, Boolean result)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.Enumerator.MoveNext()
ClientConnectionId:3d813f87-29be-4a5a-9e6d-faff5d0e0a5f
Error Number:-2,State:0,Class:11"

This issue cost many working days of diagnosing, as there is no clear indication of what is wrong, which hinders troubleshooting.

@JMan7777

Hi, I have exactly the same issue as @tunger.
I'm running under .NET Core 3.1.3 (Linux Docker container as an Azure Web App) with an Azure MSSQL DB.
I'm just using Dapper instead of EF.
With MARS disabled everything works fine.

@Wraith2
Contributor

Wraith2 commented Apr 10, 2020

Have a read of #502 and see if increasing the packet size helps things. If it doesn't, try turning off MARS. If you can't turn off MARS then you're stuck.
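
For example, packet size is set via the Packet Size connection string keyword (a sketch; 12000 is just an illustrative value, and a later comment in this thread uses the same number):

Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=True;Packet Size=12000;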

@junchen2000

junchen2000 commented Apr 30, 2020

Hi Team,

We have multiple users reporting the same issue. The issue only happens on Linux. The issue does not happen on Windows.

To narrow down the issue, we have simplified the reproducing code to the following:

using System;
using System.ComponentModel.DataAnnotations;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using Microsoft.Data.SqlClient;
namespace reprocli
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                var count = 40;
                var connectionString = "Server=tcp:<myAzureSQLName>.database.windows.net,1433;Initial Catalog=xxx;Persist Security Info=False;User ID=xxxxxxxxx;Password=xxxxxxx;MultipleActiveResultSets=TRUE;Connect Timeout=0;Max Pool Size=20;Packet Size=12000";
                var total = Stopwatch.StartNew();
                total.Restart();
                Enumerable.Range(0, count)
                    .AsParallel()
                    .WithDegreeOfParallelism(count)
                    .ForAll(n => Scenario4(connectionString, n));
                Console.WriteLine($"Total: {total.Elapsed}");
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
                throw;
            }
        }
        private static void Scenario4(string connString, int number)
        {
            var userStopWatch = Stopwatch.StartNew();
            var buffer = new object[100];
            {
                using (var connection = new SqlConnection(connString))
                {
                    var StartTimestamp = DateTime.Now;
                    connection.Open();
                    var FinishTimestamp = DateTime.Now;
                    TimeSpan span = FinishTimestamp.Subtract(StartTimestamp);
                    Console.WriteLine("Task " + number.ToString("00") + ": Connection.Open started at " + StartTimestamp.ToString() + ";ended at " + FinishTimestamp.ToString() + " - Elapsed " + span.TotalMilliseconds);
                    StartTimestamp = DateTime.Now;
                    using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
                    {
                        FinishTimestamp = DateTime.Now;
                        span = FinishTimestamp.Subtract(StartTimestamp);
                        Console.WriteLine("Task " + number.ToString("00") + ": BeginTransaction started at " + StartTimestamp.ToString() + ";ended at " + FinishTimestamp.ToString() + " - Elapsed " + span.TotalMilliseconds);
                        transaction.Commit();
                    }
                    connection.Close();
                }
            }
            userStopWatch.Stop();
            Console.WriteLine("Task " + number.ToString("00") + $": Total - Elapsed: {userStopWatch.Elapsed}");
        }
    }
}

Note in the above code:

  1. We have spawned 40 threads in parallel. With only 40 threads, thread pool starvation should not be a problem. (We can only see 51 threads when ManualResetEventSlim is being signaled)

  2. Each thread only connects to the Azure SQL Database once. In that connection, it only starts a transaction and commits the transaction. There are no queries.

  3. We have modified the connection string to include several properties:

a. Connect Timeout=0
The execution timeout error can be avoided by setting connect timeout to 0. Without hitting the timeout, we can observe the slow behavior more clearly.

b. Max Pool Size=20
We specify the pool size as half the number of parallel threads so that we can see how the connection pool takes effect here.

c. Packet Size=12000
We increased the packet size, but it does not make any difference.

  4. We have logged the timestamps and durations of Connection.Open and BeginTransaction for each thread/task.

Test result for Windows

Total execution took only 5.8 seconds. Each thread running in parallel took no more than 3 seconds. Most of the time was spent on Connection.Open:

Task 03: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:57 PM - Elapsed 1252.0322
Task 03: BeginTransaction started at 4/27/2020 5:26:57 PM;ended at 4/27/2020 5:26:57 PM - Elapsed 99.9115
Task 07: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:57 PM - Elapsed 1454.2317
Task 03: Total - Elapsed: 00:00:02.0012941
Task 07: BeginTransaction started at 4/27/2020 5:26:57 PM;ended at 4/27/2020 5:26:57 PM - Elapsed 94.4907
Task 01: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 1641.8124
Task 07: Total - Elapsed: 00:00:02.1885435
Task 01: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.4182
Task 01: Total - Elapsed: 00:00:02.3745601
Task 00: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 1827.7951
Task 00: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.6793
Task 00: Total - Elapsed: 00:00:02.5601853
Task 11: Connection.Open started at 4/27/2020 5:26:57 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 558.5962
Task 08: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2096.162
Task 11: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.5295
Task 08: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.8543
Task 11: Total - Elapsed: 00:00:00.7444362
Task 05: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2199.2868
Task 08: Total - Elapsed: 00:00:02.8291427
Task 06: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2282.408
Task 05: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.3598
Task 06: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.9552
Task 05: Total - Elapsed: 00:00:02.9313097
Task 04: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2384.6057
Task 06: Total - Elapsed: 00:00:03.0157004
Task 09: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 2360.5249
Task 04: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 92.6181
Task 09: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 93.0764
Task 04: Total - Elapsed: 00:00:03.1168025
Task 10: Connection.Open started at 4/27/2020 5:26:57 PM;ended at 4/27/2020 5:26:58 PM - Elapsed 1463.5762
Task 09: Total - Elapsed: 00:00:02.5470265
Task 17: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 372.9687
Task 10: BeginTransaction started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.4596
Task 17: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.8557
Task 10: Total - Elapsed: 00:00:01.6491569
Task 13: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 1113.3493
Task 17: Total - Elapsed: 00:00:00.5603184
Task 14: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 1014.9437
Task 13: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.3817
Task 14: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 93.0985
Task 02: Connection.Open started at 4/27/2020 5:26:56 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 2937.5539
Task 13: Total - Elapsed: 00:00:01.2987254
Task 16: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 741.6992
Task 02: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.2785
Task 14: Total - Elapsed: 00:00:01.2020219
Task 18: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 645.2825
Task 16: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.3808
Task 02: Total - Elapsed: 00:00:03.6693590
Task 19: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 653.5425
Task 18: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.8252
Task 16: Total - Elapsed: 00:00:00.9270470
Task 20: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 556.2838
Task 19: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.6183
Task 18: Total - Elapsed: 00:00:00.8327458
Task 12: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 1598.5975
Task 20: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.3875
Task 19: Total - Elapsed: 00:00:00.8397575
Task 21: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 701.6779
Task 12: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 93.0304
Task 20: Total - Elapsed: 00:00:00.7413675
Task 22: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 655.9256
Task 21: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.6944
Task 12: Total - Elapsed: 00:00:01.7852348
Task 23: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 648.3088
Task 22: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.4581
Task 21: Total - Elapsed: 00:00:00.8880851
Task 24: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 652.2877
Task 22: Total - Elapsed: 00:00:00.8415095
Task 25: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 555.9168
Task 23: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:26:59 PM - Elapsed 92.791
Task 24: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.6832
Task 25: BeginTransaction started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.3799
Task 23: Total - Elapsed: 00:00:00.8348116
Task 26: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 559.4023
Task 24: Total - Elapsed: 00:00:00.8388235
Task 28: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 555.3253
Task 25: Total - Elapsed: 00:00:00.7411146
Task 27: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 559.6749
Task 26: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.9759
Task 28: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 93.3051
Task 27: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.7269
Task 26: Total - Elapsed: 00:00:00.7461889
Task 29: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 559.9274
Task 15: Connection.Open started at 4/27/2020 5:26:58 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 1764.509
Task 28: Total - Elapsed: 00:00:00.7427732
Task 31: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 557.6932
Task 15: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 91.3374
Task 27: Total - Elapsed: 00:00:00.7468952
Task 34: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 372.3959
Task 29: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.7142
Task 15: Total - Elapsed: 00:00:01.9484886
Task 35: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 371.0839
Task 31: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.5221
Task 29: Total - Elapsed: 00:00:00.7463724
Task 34: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.283
Task 32: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 559.6628
Task 35: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 91.6887
Task 31: Total - Elapsed: 00:00:00.7448745
Task 33: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 561.1435
Task 34: Total - Elapsed: 00:00:00.5597399
Task 36: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 449.6328
Task 32: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.8304
Task 35: Total - Elapsed: 00:00:00.5563786
Task 37: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 465.2997
Task 33: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.5839
Task 36: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.3554
Task 32: Total - Elapsed: 00:00:00.7463895
Task 38: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 467.4738
Task 37: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 91.6418
Task 33: Total - Elapsed: 00:00:00.7467624
Task 39: Connection.Open started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 464.5236
Task 36: Total - Elapsed: 00:00:00.6352385
Task 38: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.8737
Task 37: Total - Elapsed: 00:00:00.6496484
Task 39: BeginTransaction started at 4/27/2020 5:27:00 PM;ended at 4/27/2020 5:27:00 PM - Elapsed 92.489
Task 38: Total - Elapsed: 00:00:00.6536980
Task 39: Total - Elapsed: 00:00:00.6576849
Task 30: Connection.Open started at 4/27/2020 5:26:59 PM;ended at 4/27/2020 5:27:01 PM - Elapsed 1316.457
Task 30: BeginTransaction started at 4/27/2020 5:27:01 PM;ended at 4/27/2020 5:27:01 PM - Elapsed 92.3981
Task 30: Total - Elapsed: 00:00:01.5020711
Total: 00:00:05.7179428

Test Result for Linux

Run the same on Linux. Total execution took 34 seconds. Connection.Open does not really execute in parallel. Once a BeginTransaction starts on one connection, it slows down the other threads' Connection.Open calls. The more Connection.Open calls, the more delay. Once the connection pool was full, it was blocked for tens of seconds until one of the BeginTransaction calls finished. Then the later connections queued into the connection pool could run properly without any delays:

dragon@marlonvm:~/k8s-sql-timeout-repro/source/reprocli$ dotnet run
Task 04: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 481.6339
Task 00: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 510.3399
Task 02: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 556.4548
Task 01: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 591.7924
Task 03: Connection.Open started at 4/27/2020 9:34:31 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 617.9443
Task 05: Connection.Open started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:34:32 AM - Elapsed 27.8998
Task 06: Connection.Open started at 4/27/2020 9:34:33 AM;ended at 4/27/2020 9:34:33 AM - Elapsed 30.827
Task 07: Connection.Open started at 4/27/2020 9:34:33 AM;ended at 4/27/2020 9:34:34 AM - Elapsed 27.4854
Task 08: Connection.Open started at 4/27/2020 9:34:34 AM;ended at 4/27/2020 9:34:35 AM - Elapsed 31.2146
Task 09: Connection.Open started at 4/27/2020 9:34:35 AM;ended at 4/27/2020 9:34:35 AM - Elapsed 30.8095
Task 10: Connection.Open started at 4/27/2020 9:34:36 AM;ended at 4/27/2020 9:34:36 AM - Elapsed 30.4743
Task 11: Connection.Open started at 4/27/2020 9:34:37 AM;ended at 4/27/2020 9:34:37 AM - Elapsed 30.617
Task 12: Connection.Open started at 4/27/2020 9:34:38 AM;ended at 4/27/2020 9:34:38 AM - Elapsed 29.1201
Task 13: Connection.Open started at 4/27/2020 9:34:39 AM;ended at 4/27/2020 9:34:39 AM - Elapsed 36.9007
Task 14: Connection.Open started at 4/27/2020 9:34:40 AM;ended at 4/27/2020 9:34:40 AM - Elapsed 30.8955
Task 15: Connection.Open started at 4/27/2020 9:34:41 AM;ended at 4/27/2020 9:34:41 AM - Elapsed 31.2848
Task 16: Connection.Open started at 4/27/2020 9:34:42 AM;ended at 4/27/2020 9:34:42 AM - Elapsed 29.2114
Task 17: Connection.Open started at 4/27/2020 9:34:43 AM;ended at 4/27/2020 9:34:43 AM - Elapsed 24.7978
Task 18: Connection.Open started at 4/27/2020 9:34:44 AM;ended at 4/27/2020 9:34:44 AM - Elapsed 32.4253
Task 19: Connection.Open started at 4/27/2020 9:34:45 AM;ended at 4/27/2020 9:34:45 AM - Elapsed 37.3471
Task 08: BeginTransaction started at 4/27/2020 9:34:35 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 30011.9233
Task 00: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32982.9174
Task 09: BeginTransaction started at 4/27/2020 9:34:35 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 29512.0317
Task 10: BeginTransaction started at 4/27/2020 9:34:36 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 28508.912
Task 02: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32936.5808
Task 11: BeginTransaction started at 4/27/2020 9:34:37 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 27462.8604
Task 04: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 33008.9028
Task 06: BeginTransaction started at 4/27/2020 9:34:33 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 31512.4423
Task 07: BeginTransaction started at 4/27/2020 9:34:34 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 31016.4188
Task 12: BeginTransaction started at 4/27/2020 9:34:38 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 26512.2184
Task 13: BeginTransaction started at 4/27/2020 9:34:39 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 25504.2651
Task 03: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32875.1887
Task 05: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32517.823
Task 01: BeginTransaction started at 4/27/2020 9:34:32 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 32901.4141
Task 14: BeginTransaction started at 4/27/2020 9:34:40 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 24504.4908
Task 15: BeginTransaction started at 4/27/2020 9:34:41 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 23473.666
Task 16: BeginTransaction started at 4/27/2020 9:34:42 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 22510.7361
Task 17: BeginTransaction started at 4/27/2020 9:34:43 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 21515.0665
Task 18: BeginTransaction started at 4/27/2020 9:34:44 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 20984.4067
Task 19: BeginTransaction started at 4/27/2020 9:34:45 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 19970.1126
Task 22: Connection.Open started at 4/27/2020 9:34:48 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 17040.9565
Task 24: Connection.Open started at 4/27/2020 9:34:49 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 15542.3034
Task 20: Connection.Open started at 4/27/2020 9:34:46 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 19045.6091
Task 13: Total - Elapsed: 00:00:25.5511950
Task 06: Total - Elapsed: 00:00:31.5553711
Task 08: Total - Elapsed: 00:00:30.0551484
Task 15: Total - Elapsed: 00:00:23.5144020
Task 10: Total - Elapsed: 00:00:28.5512858
Task 28: Connection.Open started at 4/27/2020 9:34:53 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 12040.2609
Task 25: Connection.Open started at 4/27/2020 9:34:50 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 14541.4863
Task 19: Total - Elapsed: 00:00:20.0156375
Task 26: Connection.Open started at 4/27/2020 9:34:51 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 14040.7895
Task 21: Connection.Open started at 4/27/2020 9:34:47 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 18045.5826
Task 01: Total - Elapsed: 00:00:33.5539375
Task 07: Total - Elapsed: 00:00:31.0564162
Task 00: Total - Elapsed: 00:00:33.5541849
Task 30: Connection.Open started at 4/27/2020 9:34:55 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 10039.1023
Task 18: Total - Elapsed: 00:00:21.0258884
Task 23: Connection.Open started at 4/27/2020 9:34:49 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 16044.2951
Task 09: Total - Elapsed: 00:00:29.5554896
Task 31: Connection.Open started at 4/27/2020 9:34:56 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 9036.8845
Task 03: Total - Elapsed: 00:00:33.5544169
Task 17: Total - Elapsed: 00:00:21.5493243
Task 33: Connection.Open started at 4/27/2020 9:34:58 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 7036.9824
Task 12: Total - Elapsed: 00:00:26.5529391
Task 34: Connection.Open started at 4/27/2020 9:34:59 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 6035.7954
Task 02: Total - Elapsed: 00:00:33.5545999
Task 35: Connection.Open started at 4/27/2020 9:35:00 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 5035.7386
Task 16: Total - Elapsed: 00:00:22.5499312
Task 36: Connection.Open started at 4/27/2020 9:35:01 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 4035.4121
Task 04: Total - Elapsed: 00:00:33.5546821
Task 14: Total - Elapsed: 00:00:24.5453515
Task 27: Connection.Open started at 4/27/2020 9:34:52 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 13040.8999
Task 11: Total - Elapsed: 00:00:27.5058150
Task 29: Connection.Open started at 4/27/2020 9:34:54 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 11036.1083
Task 37: Connection.Open started at 4/27/2020 9:35:02 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 3035.0194
Task 32: Connection.Open started at 4/27/2020 9:34:57 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 8038.2898
Task 38: Connection.Open started at 4/27/2020 9:35:03 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2023.0241
Task 05: Total - Elapsed: 00:00:32.5594002
Task 39: Connection.Open started at 4/27/2020 9:35:04 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1033.8102
Task 25: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.9938
Task 21: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8662
Task 22: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.5818
Task 28: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.0527
Task 33: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.5791
Task 31: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.7331
Task 30: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.0152
Task 24: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.8402
Task 23: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8164
Task 26: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.3343
Task 20: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.898
Task 34: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8759
Task 35: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8686
Task 36: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8287
Task 37: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.7166
Task 29: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.889
Task 32: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8629
Task 38: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8027
Task 27: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 2.0957
Task 39: BeginTransaction started at 4/27/2020 9:35:05 AM;ended at 4/27/2020 9:35:05 AM - Elapsed 1.8029
Task 25: Total - Elapsed: 00:00:14.5450434
Task 21: Total - Elapsed: 00:00:18.0492143
Task 22: Total - Elapsed: 00:00:17.0452281
Task 28: Total - Elapsed: 00:00:12.0440717
Task 33: Total - Elapsed: 00:00:07.0400924
Task 30: Total - Elapsed: 00:00:10.0426584
Task 31: Total - Elapsed: 00:00:09.0402773
Task 23: Total - Elapsed: 00:00:16.0478210
Task 26: Total - Elapsed: 00:00:14.0450272
Task 24: Total - Elapsed: 00:00:15.5469980
Task 34: Total - Elapsed: 00:00:06.0394797
Task 35: Total - Elapsed: 00:00:05.0394576
Task 36: Total - Elapsed: 00:00:04.0390648
Task 37: Total - Elapsed: 00:00:03.0384628
Task 38: Total - Elapsed: 00:00:02.0265013
Task 32: Total - Elapsed: 00:00:08.0421579
Task 27: Total - Elapsed: 00:00:13.0456037
Task 29: Total - Elapsed: 00:00:11.0406611
Task 20: Total - Elapsed: 00:00:19.0509156
Task 39: Total - Elapsed: 00:00:01.0373777
Total: 00:00:33.5828616

The interesting thing is that from the managed call stack, we see the threads were waiting on SNIMarsHandle.Receive:

(lldb) clrstack
OS Thread Id: 0x9654 (20)
        Child SP               IP Call Site
00007EFF421D6C00 00007effbe084360 [GCFrame: 00007eff421d6c00]
00007EFF421D6CE0 00007effbe084360 [HelperMethodFrame_1OBJ: 00007eff421d6ce0] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
00007EFF421D6E10 00007EFF42F207B4 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
00007EFF421D6EA0 00007EFF455A4595 Microsoft.Data.SqlClient.SNI.SNIMarsHandle.Receive(Microsoft.Data.SqlClient.SNI.SNIPacket ByRef, Int32)
00007EFF421D6F10 00007EFF44110B12 Microsoft.Data.SqlClient.SNI.SNIProxy.ReadSyncOverAsync(Microsoft.Data.SqlClient.SNI.SNIHandle, Microsoft.Data.SqlClient.SNI.SNIPacket ByRef, Int32)
00007EFF421D6F40 00007EFF44110A9D Microsoft.Data.SqlClient.SNI.TdsParserStateObjectManaged.ReadSyncOverAsync(Int32, UInt32 ByRef)
00007EFF421D6F90 00007EFF441103C1 Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
00007EFF421D6FF0 00007EFF441102B9 Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
00007EFF421D7010 00007EFF4412D7FB Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
00007EFF421D7030 00007EFF4412D6EE Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte ByRef)
00007EFF421D7060 00007EFF4412C778 Microsoft.Data.SqlClient.TdsParser.TryRun(Microsoft.Data.SqlClient.RunBehavior, Microsoft.Data.SqlClient.SqlCommand, Microsoft.Data.SqlClient.SqlDataReader, Microsoft.Data.SqlClient.BulkCopySimpleResultSet, Microsoft.Data.SqlClient.TdsParserStateObject, Boolean ByRef)
00007EFF421D7200 00007EFF4412C132 Microsoft.Data.SqlClient.TdsParser.Run(Microsoft.Data.SqlClient.RunBehavior, Microsoft.Data.SqlClient.SqlCommand, Microsoft.Data.SqlClient.SqlDataReader, Microsoft.Data.SqlClient.BulkCopySimpleResultSet, Microsoft.Data.SqlClient.TdsParserStateObject)
00007EFF421D7270 00007EFF455A3A53 Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[], TransactionManagerRequestType, System.String, TransactionManagerIsolationLevel, Int32, Microsoft.Data.SqlClient.SqlInternalTransaction, Microsoft.Data.SqlClient.TdsParserStateObject, Boolean)
00007EFF421D7320 00007EFF455A2723 Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest, System.String, System.Data.IsolationLevel, Microsoft.Data.SqlClient.SqlInternalTransaction, Boolean)
00007EFF421D73F0 00007EFF455A235D Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest, System.String, System.Data.IsolationLevel, Microsoft.Data.SqlClient.SqlInternalTransaction, Boolean)
00007EFF421D7440 00007EFF455A18D0 Microsoft.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(System.Data.IsolationLevel, System.String, Boolean)
00007EFF421D74D0 00007EFF455A14EF Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(System.Data.IsolationLevel, System.String)
00007EFF421D7520 00007EFF455A13EF Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(System.Data.IsolationLevel)
00007EFF421D7540 00007EFF4340E81B reprocli.Program.Scenario4(System.String, Int32) [/home/dragon/k8s-sql-timeout-repro/source/reprocli/Program.cs @ 54]

And we can trace where the ManualResetEventSlim was signaled:

(lldb) clrstack
OS Thread Id: 0x9d24 (51)
        Child SP               IP Call Site
00007EFEA8FF85B0 00007EFF42F20C46 System.Threading.ManualResetEventSlim.UpdateStateAtomically(Int32, Int32) [/_/src/System.Private.CoreLib/shared/System/Threading/ManualResetEventSlim.cs @ 720]
00007EFEA8FF85F0 00007EFF42F2026A System.Threading.ManualResetEventSlim.Set(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/ManualResetEventSlim.cs @ 310]
00007EFEA8FF8640 00007EFF42F2022D System.Threading.ManualResetEventSlim.Set() [/_/src/System.Private.CoreLib/shared/System/Threading/ManualResetEventSlim.cs @ 297]
00007EFEA8FF8650 00007EFF455C644B Microsoft.Data.SqlClient.SNI.SNIMarsHandle.HandleReceiveComplete(Microsoft.Data.SqlClient.SNI.SNIPacket, Microsoft.Data.SqlClient.SNI.SNISMUXHeader)
00007EFEA8FF86E0 00007EFF455C54A0 Microsoft.Data.SqlClient.SNI.SNIMarsConnection.HandleReceiveComplete(Microsoft.Data.SqlClient.SNI.SNIPacket, UInt32)
00007EFEA8FF8750 00007EFF455C4F1B Microsoft.Data.SqlClient.SNI.SNIPacket+<>c__DisplayClass33_0.<ReadFromStreamAsync>b__0(System.Threading.Tasks.Task`1<Int32>)
00007EFEA8FF8800 00007EFF4310DE1A System.Threading.Tasks.ContinuationTaskFromResultTask`1[[System.Int32, System.Private.CoreLib]].InnerInvoke() [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/TaskContinuation.cs @ 135]
00007EFEA8FF8820 00007EFF42F3A142 System.Threading.Tasks.Task+<>c.<.cctor>b__274_0(System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2427]
00007EFEA8FF8830 00007EFF42F1ECE2 System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 289]
00007EFEA8FF8870 00007EFF42F345C1 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2389]
00007EFEA8FF88F0 00007EFF42F343D8 System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2321]
00007EFEA8FF8910 00007EFF42F3436B System.Threading.Tasks.Task.ExecuteFromThreadPool(System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2312]
00007EFEA8FF8920 00007EFF42F28B14 System.Threading.ThreadPoolWorkQueue.Dispatch() [/_/src/System.Private.CoreLib/shared/System/Threading/ThreadPool.cs @ 663]
00007EFEA8FF89A0 00007EFF42F16B3B System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() [/_/src/System.Private.CoreLib/src/System/Threading/ThreadPool.CoreCLR.cs @ 29]
00007EFEA8FF8D50 00007effbc817f7f [DebuggerU2MCatchHandlerFrame: 00007efea8ff8d50]

The above managed call stack gives the impression that the thread was waiting to receive data from the network.

However, from the TCPDump trace, we see the server side had already sent back the packet (Frame 487) and the client had acknowledged it (Frame 562). The client side then waited 35+ seconds before sending another packet to the server (Frame 2177), after which network traffic returned to normal without delays.

(This TCPDump trace was captured in a different run, so the delay differs slightly from the output above.)

(screenshot: TCPDump trace showing the 35+ second gap between the client's ACK and its next outgoing packet)

It looks like the client socket layer received the data but did not notify the application promptly; something was blocking it.

How can we tell what blocked it?
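
For intuition, the wait pattern implicated by these stacks looks roughly like the hypothetical sketch below (this is not SqlClient's actual code): the synchronous caller blocks a thread-pool thread on a ManualResetEventSlim, and the Set() only happens when the read continuation gets a thread-pool thread of its own. If the pool is saturated with other blocked callers, the notification is delayed even though the bytes are already sitting in the socket buffer.

// Hypothetical sketch (not SqlClient's code): a synchronous Receive built on top
// of an asynchronous read. The caller blocks a thread-pool thread on the event;
// the event is only Set() by a continuation that itself needs a thread-pool
// thread to run. Under pool starvation the Set() - and therefore the "network
// notification" - is delayed even though the data has already arrived.
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class SyncOverAsyncSketch
{
    private readonly Stream _stream;
    private readonly ManualResetEventSlim _received = new ManualResetEventSlim(false);
    private byte[] _lastPacket = Array.Empty<byte>();

    public SyncOverAsyncSketch(Stream stream) => _stream = stream;

    public byte[] Receive(int timeoutMs)
    {
        _received.Reset();
        var buffer = new byte[4096];

        // The async read completes on the thread pool...
        _stream.ReadAsync(buffer, 0, buffer.Length).ContinueWith(t =>
        {
            _lastPacket = buffer.AsSpan(0, t.Result).ToArray();
            _received.Set(); // ...and only then is the waiting thread released.
        });

        // ...while this thread blocks synchronously, holding a pool thread hostage.
        if (!_received.Wait(timeoutMs))
            throw new TimeoutException("Timed out waiting for the receive notification.");

        return _lastPacket;
    }
}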

@voroninp
Copy link

voroninp commented Apr 30, 2020

@junchen2000 Hey, are you sure you want to share a connection string containing a password (even if it is local and only for testing)?

@junchen2000
Copy link

@voroninp, Thanks! Revised.

@Leonardo-Ferreira
Copy link

@junchen2000 would you mind re-running your tests WITHOUT the transaction?

@junchen2000
Copy link

@Leonardo-Ferreira, I commented out the transaction. Each thread now only opens and closes the connection.

No delays:

dragon@marlonvm:~/k8s-sql-timeout-repro/source/reprocli$ dotnet run
Task 00: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 381.0838
Task 00: Total - Elapsed: 00:00:00.4389760
Task 02: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 390.1555
Task 02: Total - Elapsed: 00:00:00.4405980
Task 01: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 390.2725
Task 01: Total - Elapsed: 00:00:00.4406879
Task 04: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 390.3906
Task 04: Total - Elapsed: 00:00:00.4405574
Task 05: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 1.8329
Task 05: Total - Elapsed: 00:00:00.0019143
Task 06: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.2083
Task 06: Total - Elapsed: 00:00:00.0002659
Task 07: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.2609
Task 07: Total - Elapsed: 00:00:00.0003122
Task 08: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.2356
Task 08: Total - Elapsed: 00:00:00.0003324
Task 09: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.269
Task 09: Total - Elapsed: 00:00:00.0003064
Task 10: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.2555
Task 10: Total - Elapsed: 00:00:00.0003168
Task 11: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.2079
Task 11: Total - Elapsed: 00:00:00.0002616
Task 12: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1608
Task 12: Total - Elapsed: 00:00:00.0002120
Task 13: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1788
Task 13: Total - Elapsed: 00:00:00.0002219
Task 14: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1594
Task 14: Total - Elapsed: 00:00:00.0002130
Task 15: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1685
Task 15: Total - Elapsed: 00:00:00.0002248
Task 16: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1765
Task 16: Total - Elapsed: 00:00:00.0002265
Task 17: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1884
Task 17: Total - Elapsed: 00:00:00.0002391
Task 18: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1702
Task 19: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1653
Task 18: Total - Elapsed: 00:00:00.0002414
Task 19: Total - Elapsed: 00:00:00.0002107
Task 20: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1636
Task 20: Total - Elapsed: 00:00:00.0003681
Task 21: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.3245
Task 21: Total - Elapsed: 00:00:00.0003703
Task 22: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.2136
Task 22: Total - Elapsed: 00:00:00.0002727
Task 23: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1729
Task 23: Total - Elapsed: 00:00:00.0002188
Task 24: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1678
Task 24: Total - Elapsed: 00:00:00.0002231
Task 25: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1831
Task 25: Total - Elapsed: 00:00:00.0002202
Task 26: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1574
Task 26: Total - Elapsed: 00:00:00.0002061
Task 27: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1653
Task 27: Total - Elapsed: 00:00:00.0001979
Task 28: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1363
Task 28: Total - Elapsed: 00:00:00.0002044
Task 29: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1558
Task 29: Total - Elapsed: 00:00:00.0002112
Task 30: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1525
Task 30: Total - Elapsed: 00:00:00.0002049
Task 31: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1786
Task 31: Total - Elapsed: 00:00:00.0002470
Task 32: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1836
Task 32: Total - Elapsed: 00:00:00.0002417
Task 33: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.2061
Task 33: Total - Elapsed: 00:00:00.0002580
Task 35: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.2134
Task 35: Total - Elapsed: 00:00:00.0002727
Task 37: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1847
Task 37: Total - Elapsed: 00:00:00.0002219
Task 39: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1435
Task 39: Total - Elapsed: 00:00:00.0001966
Task 34: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1441
Task 34: Total - Elapsed: 00:00:00.0001977
Task 36: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1657
Task 36: Total - Elapsed: 00:00:00.0002115
Task 38: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 0.1429
Task 38: Total - Elapsed: 00:00:00.0002218
Task 03: Connection.Open started at 5/6/2020 5:55:53 AM;ended at 5/6/2020 5:55:53 AM - Elapsed 404.2743
Task 03: Total - Elapsed: 00:00:00.4545276
Total: 00:00:00.4811442

If I add the SqlCommand part without an explicitly declared transaction, it shows behavior similar to the transaction case.

Also, calling connection.BeginTransaction with a different IsolationLevel (ReadCommitted, ReadUncommitted, or Snapshot) does not make any difference. It is always slow.
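
For context, the shape of the repro being timed here is roughly the following - a hedged reconstruction, not the original Program.cs; the server name, credentials, and task count are placeholders. With MARS enabled, the stalls show up in BeginTransaction; with the transaction removed, Open/Close alone completes in milliseconds, as the output above shows.

// Rough reconstruction of the repro loop (placeholders, not the original code).
using System;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

class MarsRepro
{
    const string ConnString =
        "Server=tcp:myserver,1433;Database=mydb;User Id=app;Password=***;" +
        "MultipleActiveResultSets=True";   // the problematic setting

    static async Task Main()
    {
        var total = Stopwatch.StartNew();
        var tasks = Enumerable.Range(0, 40).Select(i => Task.Run(() =>
        {
            var sw = Stopwatch.StartNew();
            using var connection = new SqlConnection(ConnString);
            connection.Open();
            Console.WriteLine($"Task {i:00}: Connection.Open - Elapsed {sw.Elapsed}");

            sw.Restart();
            using var tx = connection.BeginTransaction(IsolationLevel.ReadCommitted);
            Console.WriteLine($"Task {i:00}: BeginTransaction - Elapsed {sw.Elapsed}");
            tx.Rollback();
        }));

        await Task.WhenAll(tasks);
        Console.WriteLine($"Total: {total.Elapsed}");
    }
}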

@David-Engel
Copy link
Contributor

If anyone is feeling experimental, could you try the artifacts from #1543 and follow the instructions in that PR to see whether it alleviates this issue?

@arielmoraes @Will-Bill @DanielHillebrandSWMS @pawelpabich @sdesmond46 @Leonardo-Ferreira

Are any of you willing to give Wraith2's PR artifacts a try in a test environment and provide feedback on his proposed fix for this issue?

@pawelpabich
Copy link
Author

@David-Engel, thanks for the ping. In the meantime, we've rewritten all the code that needed MARS, so this problem doesn't affect us anymore. The repro code included above is the only test case.

@TheRockStarDBA
Copy link

This bit us big time. Setting MultipleActiveResultSets=true caused lots of timeouts when running a .NET Core app in a Linux pod on K8s. Removing it from the connection string made the app very fast and responsive, and the "Connection Timeout Expired" errors are all gone.
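
For anyone searching later, the change is literally just the connection string keyword - illustrative placeholders below:

// Illustrative placeholders - the only difference is the MARS keyword.
// Prone to stalls/timeouts on Linux under load:
var withMars    = "Server=tcp:myserver,1433;Database=mydb;User Id=app;Password=***;MultipleActiveResultSets=True";
// Workaround used in this thread - drop MARS entirely:
var withoutMars = "Server=tcp:myserver,1433;Database=mydb;User Id=app;Password=***";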

@Daniel-Svensson
Copy link

We recently had to abandon our plans to move to Linux (at least in the near term), which may very well be due to this issue.

I saw lots of thread pool starvation for both ExecuteNonQuery() and ExecuteReader() (sync loading from EF6), with "ReadSyncOverAsync" showing up in the call stacks. The issue got slightly better but persisted even when we raised the minimum thread pool threads by a factor of 10 (we might have to increase it more) and increased the rate of thread injection for blocking calls (https://learn.microsoft.com/en-us/dotnet/core/runtime-config/threading#thread-injection-in-response-to-blocking-work-items) to 4, though the latter did not make as big an improvement as expected.

The code works much better since we removed all sync IO from the common paths and increased the thread pool threads, but it is a large system, so it will take quite some time to remove enough of it to test again; more work is needed before we can tell whether disabling MARS makes a difference.

I had hoped that increasing thread injection for blocking work items (https://learn.microsoft.com/en-us/dotnet/core/runtime-config/threading#thread-injection-in-response-to-blocking-work-items) would be enough to make some sync SQL calls viable with MARS enabled, but that does not seem to make as big an improvement as expected (maybe due to how the waiting is done, using ManualResetEventSlim?).

Will try to disable MARS and see how it goes.

@JRahnama
Copy link
Contributor

@Daniel-Svensson you can also try reducing Max Pool Size, setting it to 20 for example. When the max is reached, further requests wait for a connection to be returned to the pool.
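
Illustrative only, with placeholder values - capping the pool looks like this:

// Hypothetical example: cap the connection pool at 20 for this connection string.
// "Max Pool Size" defaults to 100 in SqlClient.
var connString =
    "Server=tcp:myserver,1433;Database=mydb;User Id=app;Password=***;" +
    "MultipleActiveResultSets=True;Max Pool Size=20";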

@Daniel-Svensson
Copy link

Thank you for the feedback.

I've tested disabling MARS, and without it even the old version with lots of sync IO works fine.
Without MARS we are quite confident that performance won't be an issue with SqlClient on Linux.

@oleg-varlamov
Copy link

We ran into the same problem when migrating our microservices to a Linux environment. At first, we encountered non-obvious timeout errors: Win32Exception (258): Unknown error 258.
After enabling logging, it turned out that opening transactions on Linux takes unpredictably longer than on Windows. Disabling MARS in the connection string worked around the problem, but we would like to know whether the underlying issue will be fully fixed.

@SergeiPavlov
Copy link

We can reproduce this problem and see some improvement if we set the environment variable DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1.
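
If you want to try this, the variable must be set before the process starts - for example, assuming a plain `dotnet run` launch:

# Make socket completions run inline on the event-handling threads instead of
# being queued to the thread pool. This is an experimental knob; behaviour may
# vary by runtime version.
export DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1
dotnet run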

@maxsargentdev
Copy link

I am seeing an issue in a system that is along these lines and am looking for advice.

I have a single-node Kubernetes cluster running SQL Server (mssql-docker) and am connecting to that database from a service also running in the cluster, using the latest Microsoft.Data.SqlClient.

The behaviour I see is that the service establishes connections one at a time, each connection taking 15s to establish.

Running mssql-tools and connecting with sqlcmd is instant.

I have lots more data gathered and will post it over the weekend.

@maxsargentdev
Copy link

Just to follow up: not sure why, but the issue is resolved by not providing certs to mssql via cert-manager and just using the self-signed certificates.

Even with FORCEENCRYPT off and the connection from my service set to Encrypt=False, something was causing the TLS negotiation to stall.

Removing all references to custom certs resolved the issue, but I do have to trust the server certificate, which luckily isn't a big deal in this system.

@ronking-zhu
Copy link

@cheenamalhotra are there any plans to fix the MARS=true perf issue in future versions?

@Wraith2
Copy link
Contributor

Wraith2 commented Nov 26, 2023

If you read through the discussion on one of the PRs linked above, #1543, you'll see that the experiment with a dedicated scheduler showed that we're not the problem; it's async IO in general that is. At the moment there is no clear way to "fix" this issue. Either don't use MARS, or pre-size your thread pool to meet the incoming demand.
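
A minimal sketch of the pre-sizing workaround mentioned above - the numbers are illustrative, not a recommendation:

using System;
using System.Threading;

static class ThreadPoolWarmup
{
    // Illustrative only: raise the worker-thread floor at startup so that a burst
    // of blocking (sync-over-async) calls does not have to wait for the thread
    // pool's slow thread-injection rate. The right value depends on the workload.
    public static void Apply(int minWorkerThreads = 200)
    {
        ThreadPool.GetMinThreads(out int worker, out int io);
        ThreadPool.SetMinThreads(Math.Max(worker, minWorkerThreads), io);
    }
}

Call this once, early in Main, before the load arrives.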

@dazinator
Copy link

Even with MARS disabled, we see this error.

It seems that SqlClient on Linux is not really able to scale properly when using async. IMHO this is not given the recognition or warning it deserves in the docs. There should be prominent warnings for anything involving this driver at the moment, especially EF Core with SQL Server, which people often use in Linux Docker containers - they will hit this issue as soon as concurrency ramps up in their app.

@Wraith2
Copy link
Contributor

Wraith2 commented Dec 12, 2023

Do you have a reproduction?

@dazinator
Copy link

dazinator commented Dec 13, 2023

Do you have a reproduction?

I am not sure there is much point, as the design issue already seems to have been acknowledged by greater minds than mine in the discussion you linked here.

We have a mostly under-utilised Azure SQL database (auto-scaling) and an Ubuntu VM running a Linux Docker container. We aren't using MARS. I am curious to know whether you or anyone else has tried the repro you linked above for the non-MARS case, as my understanding is that the same issues on Linux will surface at some point even with MARS disabled - it still does sync-over-async and still relies on a BCL network API that needs the thread pool.

I think you've basically said the only real solution (workaround) for now is to prep the thread pool manually by boosting the min threads before placing load. However, I'm wondering how best to anticipate the load in order to adjust the thread pool's min thread count in advance. In our scenario we have cron jobs that cause things to ramp up, but those jobs query external systems, and depending on the data read from them there may be more or less work for the thread pool. Also, if this is the official workaround, could we get better guidance on how to set min/max threads to a reasonable number and whether we should decrease them again afterwards? Perhaps it is also worth talking with the EF Core team to ensure SQL Server examples or docs show, or at least mention, this issue for awareness? Otherwise it's just going to catch more and more teams out.

And if I have understood correctly and the Microsoft SqlClient team is aware of the issue with sync-over-async and scalability on Linux, may I ask whether they have run any tests to probe for points of failure and better characterise the acknowledged issue - i.e. what sorts of loads trigger it? This would help us create a repro, and teams designing new apps with load patterns that SqlClient struggles with may want to use something else, like Postgres, instead.

@eglauko
Copy link

eglauko commented Aug 20, 2024

If you read through the discussion on one of the PRs linked above, #1543, you'll see that the experiment with a dedicated scheduler showed that we're not the problem; it's async IO in general that is. At the moment there is no clear way to "fix" this issue. Either don't use MARS, or pre-size your thread pool to meet the incoming demand.

#1543 (comment)

We have sporadic problems with applications that run in an Ubuntu container and connect to SQL Server databases. The same application running on Ubuntu and connecting to Postgres has no problems. When running on Windows Server (IIS) there are no problems either.

Our impression matches what @Wraith2 wrote in his comment: "it looks like the networking implementation in the runtime just stops calling back when data arrives".
