Queries with MultipleActiveResultSets=True (MARS) are very slow / time out on Linux #422
The Linux implementation of the network interface is entirely different to the one used on Windows, which is why you'll see different behaviour. The difference between MARS and non-MARS is probably caused by the fact that MARS takes part of the packet size for the MARS header, so it can require 2 received network packets to complete a single logical connection packet, and that will exacerbate timing problems; you'd see them roughly twice as fast. One thing to note is that while SQL Server is quite capable of doing many things at once, it's not always a good idea to try to open 100 simultaneous connections and use them all in parallel. The Unix sockets implementation differs from Windows in ways which can exacerbate threading problems, e.g. dotnet/runtime#32016. There's also another issue on this repo in the past where, on investigation, I found that thread starvation under high levels of requested connection-level parallelism caused a problem. I'll try and take a look at the repro and see if anything sticks out as a problem.
Thanks @Wraith2 for looking into it. @pawelpabich
The repro works on Windows in forced managed mode. By "works" I mean that I get a connection failure pretty quickly (should be 10 seconds) after I start the program. If you confirm that's what's expected then the repro is good. Initial impression is that it's simple threadpool starvation. If you add in:
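(the snippet isn't preserved in this thread; presumably something along these lines, a reconstruction rather than the original code)

```csharp
// Hypothetical reconstruction of the suggested workaround: pre-grow the thread
// pool so ~100 blocked synchronous connections don't starve it.
System.Threading.ThreadPool.SetMinThreads(workerThreads: 100, completionPortThreads: 100);
```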
before you start the stopwatch you should find that it runs through successfully. So why does that work? You're using Parallel, which attempts to start a large number of tasks at once, each of which can end up blocked in the synchronous wait here: SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNIMarsHandle.cs (line 391 in bac3ab4).
It'd be nice if we weren't going sync-over-async in the library so this problem couldn't occur inside it. Changing that is challenging and short-to-mid-term impractical; it's pretty much a core rewrite of the non-Windows implementation, with all the reliability and performance concerns that would come with it. So don't expect that to happen soon.

So why is this your problem? Well, you're the one using all the threadpool threads. The threadpool starts with a minimum of 10 ready threads, and at various times tasks are queued to those threads and then completed. Unless you hit a period of sustained load on the pool it's unlikely to need to grow past the minimum value, so most of the time you'll have a small number of threads. The threadpool is balanced for general workloads and a lot of research and tuning has gone into it. One of the behaviours that results from this tuning is that it doesn't immediately add large numbers of threads when there are more tasks waiting to be processed than there are threads available to do them. The threadpool usually only adds one thread per second to the pool. If you load it with 100 tasks quickly you'll get 10, then 11, then 12, etc., climbing slowly.

So how do you fix it? Don't do this. Don't try to use large numbers of simultaneous synchronous connections like this. Unless you're talking to a different server or database on each connection, it's going to be more efficient to use a pipeline approach, feeding work to an open connection and having it process multiple commands on a single connection. Or, switch to using async if possible, because the awaited suspension will free up threadpool threads, allowing them to be used for work or callback signalling.
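As a rough illustration of that last suggestion (a sketch only, not code from this thread; `Scenario4Async` is a hypothetical async rewrite of the repro's `Scenario4`):

```csharp
// Awaited I/O suspends without holding a thread-pool thread, so high fan-out
// doesn't depend on the pool injecting new threads one per second.
// (Requires using System.Threading.Tasks; and Microsoft.Data.SqlClient.)
private static async Task Scenario4Async(string connString, int number)
{
    using (var connection = new SqlConnection(connString))
    {
        await connection.OpenAsync();
        using (var command = new SqlCommand("SELECT 1", connection)) // placeholder query
        {
            await command.ExecuteScalarAsync();
        }
    }
}

// Caller sketch: fan out with tasks instead of Parallel/PLINQ.
// await Task.WhenAll(Enumerable.Range(0, count).Select(n => Scenario4Async(connectionString, n)));
```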
Thanks @Wraith2 for the investigation. Thread starvation on Linux (dotnet/runtime#14304) explains some of the results we see but I think there are still some unanswered questions here.
This might not be obvious, but the repro is based on how Octopus actually works; it's not entirely a synthetic benchmark. A single deployment performs around that many queries against the database. We didn't want to include thousands of lines of code here, but the end result is the same. Also, we checked a bunch of metrics on the SQL Server and resource utilization is far from being maxed out, which indicates that SQL Server is not the bottleneck here.

Question 1: Why does switching MARS off on Linux fix the problem entirely and allow the level of parallelism to be increased from 20 to 500? Would just one additional packet cause such a massive difference?

Question 2: Why do we see
@cheenamalhotra we saw the same problem when we were using
Threads on Windows are not cheap. @Wraith2, do you know whether this is also the case on Linux? If not, we could start with more threads as a short-term workaround.
From the little I know of it, threads are even more expensive on Linux than they are on Windows; IIRC Windows is set up for fast thread creation and Linux for fast process creation. So, as you say, not cheap, and it's a poor way to scale. Setting MinThreads on the pool is a way to get around it. Not a great idea for production, but it's a public API, so it is provided and supported for you to use if you need to. Read around the method and see if you think it's an acceptable thing to do in your case.

SQL Server is not the bottleneck; the server is barely doing anything throughout the benchmark. The problem is client-process thread creation as far as I can see.

A1: Turning off MARS on Linux removes an intermediate handle object and changes where and how receives are done. In the code I linked to you see the Receive using a packetEvent wait handle. Without MARS you'd see its caller calling into SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNITcpHandle.cs (line 493 in bac3ab4),
which doesn't do that kind of wait at all and instead does a direct socket call. So you don't hit the problem, because the wait is done in the socket implementation, which may not require a threadpool thread. Really, if you can avoid using MARS I would advise doing so; it doesn't seem to have many upsides and significantly increases the likelihood of issues on the managed network implementation. So yes, one packet really can make all the difference, because that one packet on the MARS path immediately causes a synchronous wait that can't be unblocked without the resource it's blocking.

A2: At a guess, if it's coming from the SqlClient side I'd say it's because the socket is being shut down because of the failure. Once successfully connected, sockets are kept open until they're removed from the pool; even if you close the SqlConnection object, the internal connection object it contains stays alive for a while unless there is a failure which dooms it to immediate closure. If it's from the server side I couldn't say.
Thanks @Wraith2. I appreciate the time you took to explain the details.
I would love to do that but this would be a significant change for us and we need a decent workaround/solution as soon as possible. @cheenamalhotra, is there a chance this code can be optimized?
Yes, it is coming from the server side.
I just revalidated this using macOS 10.14.6 (Mojave) + .NET Core App 3.1.0 + System.Data.SqlClient 4.8.1.

Edit 1: I fiddled a bit, but the Enumerable.Range count is always 20. When I reduced the degree of parallelism to 8 (the number of cores in my machine) the problem disappeared completely. With a degree of parallelism up to 8, MARS and no-MARS behave very similarly. The higher the degree of parallelism, the worse the initial hang for the first query. This can also be solved, as @Wraith2 said, by expanding the thread pool. Note that Linux does not differentiate between worker threads and completion (IO) threads.
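Expanding the pool as suggested looks roughly like this (a sketch with illustrative values, not taken from this thread); as noted above, on Linux the worker and IO-completion minimums effectively feed the same pool, but the API still takes both:

```csharp
// Inspect the current minimums, then raise both worker and IO-completion minimums.
ThreadPool.GetMinThreads(out int workerMin, out int ioMin);
Console.WriteLine($"Current minimums: worker={workerMin}, io={ioMin}");
ThreadPool.SetMinThreads(Math.Max(workerMin, 64), Math.Max(ioMin, 64)); // 64 is an arbitrary example
```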
I was able to trace the slowness of your "mid-period queries" to the transactions. If you do not open transactions explicitly, things will fly.
We've tried this and it did not solve the problem for us because MARS connections still take much longer to open and we simply run out of connections in the pool.
So what do you propose we do instead?
From what I see you ran your tests in a different environment (macOS) than ours, so I'm not sure how you can draw this conclusion.
Well, about the validity of the tests: macOS behaves very similarly to Linux, including the thread management; after all, macOS is Unix-based. So I would take my repro with a grain of salt, but I'm quite confident in the analysis. About the transactions blocking each other, there are several workarounds, but they rely on actual code changes, business-rule validations, etc. This is a database issue, not an implementation issue.
I haven't been responding lately, but just to let you know I am investigating this issue. I'm able to reproduce the problem in Linux Docker containers when connecting to any SQL Server, irrespective of it being Azure DB, and it does seem to be related to Linux memory management, since the problem does not reproduce when running the same app on Windows using the same Managed SNI code path. @Wraith2 seems to have provided some details. However, I'm investigating whether it is specifically related to containerized environment limitations. I'm not sure at this point if this has anything to do with transactions; I don't think so, but I'll share updates soon!
@cheenamalhotra, I don't believe that this is related to containerised applications, because I was able to reproduce the issue on my local machine (up to some degree, since I did not experience the heavy delays mid-execution). Perhaps the containerised scenario exacerbates the problem, but I believe the problem is there...

Count: 20, Parallel: 8, YES TRANSACTION - NO MARS
Count: 20, Parallel: 8, YES TRANSACTION - YES MARS
Count: 20, Parallel: 8, NO TRANSACTION - YES MARS
Count: 20, Parallel: 8, NO TRANSACTION - NO MARS
I do experience this issue on Azure Web App For Containers (on Linux), but not when running the same container locally on Docker Desktop (on Windows). Hope this helps.
Good that this GitHub issue exists (thanks!), we seem to have run into the same issue. The problem only appears when running the (ASP.NET Core + EF Core 3.1.2) app on Docker with Kubernetes with MARS on. Our background service handling lots of data would simply "die", sometimes with and sometimes without an exception thrown. As it is a BackgroundService/IHostedService, the web app continues to run; just the BackgroundService is gone. I turned MARS off and now it works. I got two kinds of exceptions, this one with the default settings of DbContext.
When setting the command timeout to five minutes, I got this exception - same as the opener of this issue.
This issue cost many working days of diagnosis, as there is no clear indication of what is wrong, which hinders troubleshooting.
Hi, I have exactly the same issue as @tunger.
Have a read of #502 and see if increasing the packet size helps things. If it doesn't, try turning off MARS. If you can't turn off MARS then you're stuck.
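Both of those are connection-string settings; a hedged example of what trying them might look like (server and credentials are placeholders, and 16384 is just an example size):

```csharp
// Option 1: keep MARS but ask for a larger packet size (see #502).
var largerPackets =
    "Server=tcp:yourserver.database.windows.net,1433;Initial Catalog=TestDatabase;" +
    "User ID=...;Password=...;MultipleActiveResultSets=True;Packet Size=16384";

// Option 2: turn MARS off entirely.
var noMars =
    "Server=tcp:yourserver.database.windows.net,1433;Initial Catalog=TestDatabase;" +
    "User ID=...;Password=...;MultipleActiveResultSets=False";
```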
Hi Team,

We have multiple users reporting the same issue. The issue only happens on Linux. The issue does not happen on Windows. To narrow down the issue, we have simplified the reproducing code to the following:

```csharp
using System;
using System.ComponentModel.DataAnnotations;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using Microsoft.Data.SqlClient;

namespace reprocli
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                var count = 40;
                var connectionString = "Server=tcp:<myAzureSQLName>.database.windows.net,1433;Initial Catalog=xxx;Persist Security Info=False;User ID=xxxxxxxxx;Password=xxxxxxx;MultipleActiveResultSets=TRUE;Connect Timeout=0;Max Pool Size=20;Packet Size=12000";

                var total = Stopwatch.StartNew();
                total.Restart();

                Enumerable.Range(0, count)
                    .AsParallel()
                    .WithDegreeOfParallelism(count)
                    .ForAll(n => Scenario4(connectionString, n));

                Console.WriteLine($"Total: {total.Elapsed}");
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
                throw;
            }
        }

        private static void Scenario4(string connString, int number)
        {
            var userStopWatch = Stopwatch.StartNew();
            var buffer = new object[100];
            {
                using (var connection = new SqlConnection(connString))
                {
                    var StartTimestamp = DateTime.Now;
                    connection.Open();
                    var FinishTimestamp = DateTime.Now;
                    TimeSpan span = FinishTimestamp.Subtract(StartTimestamp);
                    Console.WriteLine("Task " + number.ToString("00") + ": Connection.Open started at " + StartTimestamp.ToString() + "; ended at " + FinishTimestamp.ToString() + " - Elapsed " + span.TotalMilliseconds);

                    StartTimestamp = DateTime.Now;
                    using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
                    {
                        FinishTimestamp = DateTime.Now;
                        span = FinishTimestamp.Subtract(StartTimestamp);
                        Console.WriteLine("Task " + number.ToString("00") + ": BeginTransaction started at " + StartTimestamp.ToString() + "; ended at " + FinishTimestamp.ToString() + " - Elapsed " + span.TotalMilliseconds);
                        transaction.Commit();
                    }
                    connection.Close();
                }
            }
            userStopWatch.Stop();
            Console.WriteLine("Task " + number.ToString("00") + $": Total - Elapsed: {userStopWatch.Elapsed}");
        }
    }
}
```

Note the following settings in the above connection string:

a. Connect Timeout=0
b. Max Pool Size=20
c. Packet Size=12000
Test result for Windows: Total execution took only 5.8 seconds. Each thread running in parallel took no more than 3 seconds. Most of the time was spent in Connection.Open:
Test result for Linux: Running the same code on Linux, total execution took 34 seconds. Connection.Open does not really execute in parallel. Once BeginTransaction starts on one connection, it slows down the other threads trying to start Connection.Open. The more Connection.Open calls, the more delay. Once the connection pool was full, it was blocked for tens of seconds until one of the BeginTransaction calls finished. After that, the later connections queued into the connection pool could run properly without any delays:
The interesting thing is that from the managed call stack, we see the threads were waiting on SNIMarsHandle.Receive:
And we can trace it to where the ManualResetEventSlim was signaled:
The above managed call stack gives the impression that the thread was waiting to receive data from the network. However, from the TCPDump trace, we see the server side had already sent back the packet (Frame 487) and the client had acknowledged it (Frame 562). The client side then waited for 35+ seconds before sending another packet to the server (Frame 2177). After that, network traffic became normal without delays. (This TCPDump trace was captured in a different run, so the delay time differs slightly from the output above.) It looks like the client socket layer received the data, but it did not notify the application in real time and was blocked somewhere. How can we tell what blocked it?
@junchen2000 Hey, are you sure you'd like to share a connection string containing a password? (Even if it is local and for testing.)
@voroninp, Thanks! Revised.
@junchen2000 would you mind re-running your tests WITHOUT the transaction?
@Leonardo-Ferreira, I commented out the transaction. Each thread now only opens the connection and closes it. No delays:
If I add a SqlCommand without the explicitly declared transaction, it shows similar behavior to the transaction version. Also, connection.BeginTransaction with a different IsolationLevel (ReadCommitted, ReadUncommitted, or Snapshot) does not make any difference. It is always slow.
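For reference, a command-only variant of the repro's `Scenario4` would look roughly like this (the exact statement used isn't shown in the thread; `SELECT 1` is a placeholder):

```csharp
// Hypothetical variant: no explicit transaction, just a trivial command per connection.
using (var connection = new SqlConnection(connString))
{
    connection.Open();
    using (var command = new SqlCommand("SELECT 1", connection))
    {
        command.ExecuteScalar();
    }
    connection.Close();
}
```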
@arielmoraes @Will-Bill @DanielHillebrandSWMS @pawelpabich @sdesmond46 @Leonardo-Ferreira Are any of you willing to give Wraith2's PR artifacts a try in a test environment and provide feedback on his proposal to help with this issue?
@David-Engel, thanks for the ping. In the meantime, we've rewritten all the code that needed MARS and this problem doesn't affect us anymore. The included repro code is the only test case.
This bit us big time. Setting
We recently had to abandon our plans to move to Linux (at least in the near term), which may very well be due to this issue. (I saw lots of thread pool starvation for both ExecuteNonQuery() and ExecuteDataReader() (sync loading from EF6), where "ReadSyncOverAsync" showed up in the call stacks.)

The issue got slightly better, but it still existed even when we raised the minimum thread pool threads by a factor of 10 (we might have to increase it more) and increased the rate of thread injection for blocking calls (https://learn.microsoft.com/en-us/dotnet/core/runtime-config/threading#thread-injection-in-response-to-blocking-work-items) to 4, though the latter seemed to not make as big an improvement as expected.

The code works much better since we removed all sync IO from the common paths and increased the thread pool threads, but it is a large system, so it will take quite some time to remove enough of it to test again, and so far there is more work to do before we can see whether disabling MARS makes a difference or not. I would have hoped that increasing the blocking IO thread injection (https://learn.microsoft.com/en-us/dotnet/core/runtime-config/threading#thread-injection-in-response-to-blocking-work-items) would be enough to make it possible to make some sync SQL calls with MARS enabled, but that seems to not make as big an improvement as expected (maybe due to how the waiting is done, using ManualResetEventSlim?). Will try to disable MARS and see how it goes.
@Daniel-Svensson you can also try reducing the max pool size and setting it to 20. New pools will be created when the max is reached.
Thank you for the feedback. I've tested disabling MARS, and without it even the old version with lots of sync IO works fine.
We ran into the same problem when migrating our microservices to a Linux environment. First, we encountered non-obvious timeout errors
We can reproduce this problem and see some improvement if we set the environment variable
I am seeing an issue in a system that is along these lines and am looking for advice. I have a single-node Kubernetes cluster running SQL Server (mssql-docker) and am connecting to that database via a service also running in the cluster, using the latest Microsoft.Data.SqlClient. The behaviour I see is that the service establishes connections one at a time, each connection taking 15s to establish. Running mssql-tools and connecting with sqlcmd is instant. I have lots more data gathered and will post it over the weekend.
Just to follow up: not sure why, but the issue is resolved by not providing certs to mssql via cert-manager and just using the self-signed certificates. Even when FORCEENCRYPT was off and the connection from my service was Encrypt=False, some kind of issue was causing the TLS negotiation to stall. Removing all references to custom certs has resolved the issue, but I do have to trust the server certificate, which luckily for me isn't such a big deal in this system.
@cheenamalhotra are there any plans to fix the MARS=true performance issue in future versions?
If you read through the discussion on one of the PRs linked above, #1543, you'll see that the experiment with a dedicated scheduler showed that we're not the problem; it's async IO in general that is. At the moment there is no clear way to "fix" this issue. Either don't use MARS or pre-size your threadpool to meet the incoming demand.
Even with MARS disabled, we see this error. It seems that SqlClient on Linux is not really able to scale properly when using async. IMHO this is not given the recognition or warning in the docs that it should be. There should be warnings in large text for anything involving this driver at the moment, especially EF Core SQL Server, which people often use in Linux Docker containers; they will hit this issue if concurrency ramps up in their app.
Do you have a reproduction?
I am not sure there is much point, as the design issue seems already acknowledged by greater minds than I in the discussion you linked here.

We have a mostly under-utilised Azure SQL database (auto-scaling) and an Ubuntu VM running a Linux Docker container. We aren't using MARS. I am curious to know if you or anyone else has tried the repro you linked above for the non-MARS case, as my understanding is that the same issues on Linux will surface at some point with MARS disabled, since it still does sync over async and still relies on a BCL network API that requires the threadpool?

I think you've basically said the only real solution (workaround) for now is to prep the threadpool manually by boosting the min threads in advance of placing load. However, I'm wondering how best one can anticipate the load in order to adjust the thread pool min thread count in advance. In our scenario we have cron jobs which cause things to ramp up, but those jobs query external systems, and depending on the data read from those systems there may be more or less work for the threadpool. Also, if this is the official workaround, is it possible to get some better guidance on how to set min/max threads to a reasonable number and whether we should decrease them again afterwards?

Perhaps it is also worth talking with the EF Core team to ensure SQL Server examples or docs show, or at least mention, this issue for awareness? Otherwise it's just going to catch more and more teams out.

And if I have understood correctly and the Microsoft SqlClient team are aware of the issue with sync over async and scalability on Linux, may I ask whether there are any tests they have done to probe for points of failure and increase understanding of the acknowledged issue, i.e. what sorts of loads will trigger it? This would be helpful for us to create a repro, but also teams designing new apps with load patterns that SqlClient has issues with may want to use something else like Postgres.
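One rough way to apply the workaround discussed above is to derive the minimum from the number of synchronous DB calls you expect to have blocked at once (a heuristic sketch, not official guidance; `expectedConcurrentDbCalls` is a value you would estimate for your own workload):

```csharp
// At startup, before the cron jobs ramp up: make sure the pool already has at
// least as many ready threads as simultaneously blocked synchronous DB calls.
int expectedConcurrentDbCalls = 200; // assumption: your own estimated peak
ThreadPool.GetMinThreads(out int worker, out int io);
int target = Math.Max(worker, expectedConcurrentDbCalls);
ThreadPool.SetMinThreads(target, Math.Max(io, target));
```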
We have sporadic problems with applications that run in an Ubuntu container and connect to SQL Server databases. The same application running on Ubuntu and connecting to Postgres has no problems. When running on Windows Server (IIS) there are no problems either. Our impression is the same as what @Wraith2 wrote in his comment: "it looks like the networking implementation in the runtime just stops calling back when data arrives".
Describe the bug
TL;DR:
Queries using connections with MARS enabled, even when they don't actually use MARS, are much slower or even time out on Linux. The same queries are fast and reliable on Windows whether MARS is enabled or disabled, and on Linux when MARS is disabled.
Context
Octopus Cloud hosts Octopus Deploy instances in Linux containers on Azure AKS, with data stored in Azure Files and Azure SQL. A couple of months ago we noticed that some of the SQL queries were much slower or even started timing out, which is not something we've experienced before on Windows using the full .NET Framework. Some of the slowdown might be caused by AKS (K8s), but we think that SqlClient might also be playing a role here. 119112824000676 is our Azure Support Request, if that helps in any way.
We also captured TCP dumps while running the tests on Linux and it looks like enabling MARS causes TCP RST.
Full TCP Dumps: https://github.com/benPearce1/k8s-sql-timeout-repro/tree/tiny/source/reprocli/tcpdumps
To reproduce
Code
Repo with the sample app: https://github.com/benPearce1/k8s-sql-timeout-repro/blob/tiny/source/reprocli/Program.cs.
The Compiled folder contains pre-compiled versions of the app so the .NET Core SDK doesn't have to be present on the target VMs. The first parameter is the level of parallelism. The second parameter is the connection string.
This is how we reproduced the problem, which doesn't mean you need this exact config.
The database was hosted in an Azure SQL Elastic Pool (Standard: 300 eDTUs) on a SQL Server in West US 2 region.
LINUX
Run the sample app with the following arguments on a Linux (Ubuntu 18.04) VM (Standard D8s v3, 8 vcpus, 32 GiB memory) in the Azure West US 2 region.
MARS ON
dotnet reprocli.dll 200 'Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;Persist Security Info=False;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=True;'
The expected result is that the app finishes without throwing any errors, but that's not the case; instead,
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
is thrown. Reducing the level of parallelism to 20 stops the app from crashing.
Also, when MARS is ON the console doesn't show any progress for 10+ seconds. This is not the case when MARS is OFF.
MARS OFF
dotnet reprocli.dll 200 'Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;Persist Security Info=False;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=False;'
The expected result is that the app finishes without throwing any errors, which is the case. The app finished in just under 25 seconds (Total: 00:00:24.9737616). The app also worked with much higher levels of parallelism (e.g. 500).

AKS
Same spec as above: Linux (Ubuntu 18.04) VM (Standard D8s v3, 8 vcpus, 32 GiB memory) in Azure West US 2. We also ran this test in a container in AKS and the results were pretty much the same. The only difference was that we had to lower the level of parallelism even more. K8s networking adds a bit of overhead, which might make the problem more pronounced.
WINDOWS
Run the sample app with the following arguments on a Windows (Windows Server 2016 Datacenter) VM (Standard D8s v3, 8 vcpus, 32 GiB memory) in the Azure West US 2 region.
dotnet reprocli.dll 200 'Server=tcp:YOURSERVER.database.windows.net,1433;Initial Catalog=TestDatabase;Persist Security Info=False;User ID=YOURUSER;Password=YOURPASSWORD;MultipleActiveResultSets=True;'
The expected result is that the app finishes without throwing an exception, which is the case. The app finished in just under 24 seconds (Total: 00:00:23.4068641). It also worked with the level of parallelism set to 500. We achieved similar results with MARS disabled.

Note: We used .NET Core to run the tests on Windows.
Expected behavior
The sample app should not crash, and connections with the MARS feature enabled should behave the same way on both Linux and Windows.
Further technical details
Microsoft.Data.SqlClient version: 1.1.0 and 2.0.0-preview1.20021.1
.NET target: (Core 2.2 and Core 3.1)
SQL Server version: (Azure SQL)
Operating system: (Ubuntu 18.04 and AKS with Ubuntu 18.4)
Additional context
We've been battling this issue for a long time now so we are happy to help in any way we can to get it resolved.