Lock that uses congestion detection for self-tuning #93879
Conversation
Note: this serves as a reminder that when your PR modifies a ref *.cs file and adds or modifies public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off on that change.
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas

Tagging subscribers to this area: @mangod9

Issue Details

This is basically the same as #87672, with some additions. The main difference is that TryEnterSlow uses the same algorithm as in the NativeAOT lock, so that congestion detection can be used for self-tuning. The essence of the algorithm is:

The goal of this scheme is to scale better in cases of heavy concurrent use of the lock (i.e. more than 4-8 threads).
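To make the self-tuning idea concrete, here is a minimal sketch of how a lock could adjust its spin budget from observed congestion. This is not the implementation in this PR; the class, field, and constant names below (SelfTuningSpinLock, _spinLimit, MinSpinLimit, and so on) are invented for illustration.

using System;
using System.Threading;

// A minimal sketch of the self-tuning idea; NOT the code in this PR.
// All names, constants, and thresholds here are hypothetical.
internal sealed class SelfTuningSpinLock
{
    private int _state;              // 0 = free, 1 = held
    private int _spinLimit = 100;    // current spin budget, tuned at run time
    private const int MinSpinLimit = 10;
    private const int MaxSpinLimit = 4000;

    public void Enter()
    {
        // Fast path: uncontended acquire.
        if (Interlocked.CompareExchange(ref _state, 1, 0) == 0)
            return;

        int spins = 0;
        int limit = Volatile.Read(ref _spinLimit);
        while (true)
        {
            if (Volatile.Read(ref _state) == 0 &&
                Interlocked.CompareExchange(ref _state, 1, 0) == 0)
            {
                // Spinning paid off quickly => little congestion,
                // so allow a slightly larger spin budget next time.
                if (spins < limit / 2 && limit < MaxSpinLimit)
                    Volatile.Write(ref _spinLimit, limit + 10);
                return;
            }

            if (++spins >= limit)
            {
                // Spinning did not pay off => treat it as congestion:
                // shrink the budget and let this thread "take a nap".
                if (limit > MinSpinLimit)
                    Volatile.Write(ref _spinLimit, limit / 2);
                Thread.Sleep(1);     // stand-in for a real wait on an event
                spins = 0;
                limit = Volatile.Read(ref _spinLimit);
            }
            else
            {
                Thread.SpinWait(1 << Math.Min(spins, 8)); // cheap, mostly local spinning
            }
        }
    }

    public void Exit() => Volatile.Write(ref _state, 0);
}

The point is only the feedback loop: acquisitions that succeed early in the spin phase grow the budget, while spins that time out shrink it and push waiters toward sleeping or waiting instead of burning cycles.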
Here is an example of using this Lock in a short-held scenario. Scenarios like 50/50 inside/outside can also be measured, but they are less interesting, since such a scenario on its own has little room to scale beyond 2 threads (3 threads can't all spend 50% of their total time inside the same lock).

The code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
namespace ConsoleApp12
{
internal class Program
{
private const int iters = 10000000;
static void Main(string[] args)
{
for (; ; )
{
for (int i = 0; i < 7; i++)
{
int thrCount = 1 << i;
System.Console.WriteLine("Threads:" + thrCount);
for (int j = 0; j < 4; j++)
{
System.Console.Write("Fat Lock: ");
RunMany(() => FatLock(thrCount), thrCount);
}
}
System.Console.WriteLine();
}
}
static void RunMany(Action f, int threadCount)
{
Thread[] threads = new Thread[threadCount];
bool start = false;
for (int i = 0; i < threads.Length; i++)
{
threads[i] = new Thread(
() =>
{
while (!Volatile.Read(ref start)) Thread.SpinWait(1); // volatile read so the check isn't hoisted out of the loop
f();
}
);
threads[i].Start();
}
Thread.Sleep(10);
start = true;
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < threads.Length; i++)
{
threads[i].Join();
}
System.Console.WriteLine("Ops per msec: " + iters / sw.ElapsedMilliseconds);
}
private static int Fib(int n) => n < 2 ? 1 : Fib(n - 1) + Fib(n - 2);
private static int ComputeSomething(Random random)
{
int delay = random.Next(4, 10);
return Fib(delay);
}
static System.Threading.Lock fatLock = new System.Threading.Lock();
public static int sharedCounter = 0;
public static Dictionary<int, int> sharedDictionary = new Dictionary<int, int>();
static void FatLock(int thrCount)
{
Random random = new Random();
for (int i = 0; i < iters / thrCount; i++)
{
// do some computation
int value = ComputeSomething(random);
var scope = fatLock.EnterScope();
{
// update shared state
sharedCounter += value;
sharedDictionary[i] = sharedCounter;
}
scope.Dispose();
}
}
}
}

Results on Windows 10 x64. Higher number is better.

=== Lock with congestion sensing (in this PR)
=== Results for #87672
Here are the results for the same benchmark as above, but with 2 locks (sometimes a program has more than one lock). It is the same driver code; just this part is different:

. . .
. . .
static System.Threading.Lock fatLock1 = new System.Threading.Lock();
static System.Threading.Lock fatLock2 = new System.Threading.Lock();
public static int sharedCounter1 = 0;
public static int sharedCounter2 = 0;
public static Dictionary<int, int> sharedDictionary1 = new Dictionary<int, int>();
public static Dictionary<int, int> sharedDictionary2 = new Dictionary<int, int>();
static void FatLock(int thrCount)
{
Random random = new Random();
for (int i = 0; i < iters / thrCount; i++)
{
// do some computation
int value = ComputeSomething(random);
if (i % 2 == 0)
{
var scope = fatLock1.EnterScope();
{
// update shared state
sharedCounter1 += value;
sharedDictionary1[i] = sharedCounter1;
}
scope.Dispose();
}
else
{
var scope = fatLock2.EnterScope();
{
// update shared state
sharedCounter2 += value;
sharedDictionary2[i] = sharedCounter2;
}
scope.Dispose();
}
}
}

=== Congestion-sensing (this PR)
=== Base PR
In the last example, the base PR ends up with 3X worse throughput than the congestion-sensing approach. The reason is excessive spinning on a lock that is heavily contended. When threads can't acquire the lock in one shot, they retry in a loop; the retries eventually succeed, but at great cost, since spinning on highly contended state is expensive (many cache misses, unnecessary transfers of cache line ownership between cores, possibly some thermal effects, ...). Excessive spinning also takes resources away from threads not involved in this lock, even if they are contending on another lock under similar conditions. Thus two locks end up behaving much worse than just one. While the system still makes progress, it is more fruitful for some of the contestants in a busy lock to leave the "traffic jam" and take a nap, while the rest of the contestants do the same work with less overhead.

NOTE: this should not be read as simply "spinning is bad". Spinning is good when it is cheap. It is expensive spinning that should be avoided.
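To illustrate the cheap-versus-expensive spinning distinction, here is a hedged sketch (again, not code from this PR): the first acquire loop retries the interlocked write on every iteration, so every waiter keeps pulling the cache line into exclusive state on its own core, while the second mostly performs plain reads and backs off between attempts.

using System.Threading;

internal static class SpinStyles
{
    private static int s_state; // 0 = free, 1 = held

    // Expensive spinning: every iteration attempts an interlocked write,
    // bouncing ownership of the cache line between the waiters' cores.
    public static void AcquireByHammering()
    {
        while (Interlocked.CompareExchange(ref s_state, 1, 0) != 0)
        {
            // busy retry, no pause
        }
    }

    // Cheaper spinning: test-and-test-and-set with backoff. Waiters mostly
    // do plain reads (served from a shared cache line) and pause between
    // attempts, so the owner can release with much less interference.
    public static void AcquireWithBackoff()
    {
        int pause = 1;
        while (true)
        {
            if (Volatile.Read(ref s_state) == 0 &&
                Interlocked.CompareExchange(ref s_state, 1, 0) == 0)
                return;

            Thread.SpinWait(pause);
            if (pause < 1024)
                pause <<= 1; // exponential backoff while still spinning
        }
    }

    public static void Release() => Volatile.Write(ref s_state, 0);
}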
This PR is premature. Please see the wealth of information shared in #87672 and take it into consideration before raising PRs.
Let's go ahead and reopen this PR; I should probably let you manage your own PRs, and we're continuing to discuss next steps.
/azp run runtime-extra-platforms
Azure Pipelines successfully started running 1 pipeline(s).
Force-pushed from f4c8351 to d58eade.
/azp run runtime-extra-platforms
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-nativeaot-outerloop
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-nativeaot-outerloop
Azure Pipelines successfully started running 1 pipeline(s).
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.
s_minSpinCount = DefaultMinSpinCount << SpinCountScaleShift;

// we can now use the slow path of the lock.
Volatile.Write(ref s_staticsInitializationStage, (int)StaticsInitializationStage.Usable);
The lock is now functional at this point and did not do anything that could take locks. The rest of the initialization is optional and just needs to eventually happen.
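The pattern under discussion is staged initialization: publish a "usable" stage with a volatile write as soon as the mandatory state is ready, and let the optional tuning finish later. A rough sketch of that general pattern follows; the stage and spin-count names echo the diff above, but the surrounding class, constants, and the processor-count tuning step are invented for illustration.

using System;
using System.Threading;

internal static class LockStaticsSketch
{
    private enum StaticsInitializationStage { NotStarted = 0, Usable = 1, Complete = 2 }

    private static int s_staticsInitializationStage; // holds a stage value
    private static int s_minSpinCount;
    private static int s_maxSpinCount;

    // Hypothetical defaults standing in for the real tuning values.
    private const int DefaultMinSpinCount = 10;
    private const int DefaultMaxSpinCount = 200;
    private const int SpinCountScaleShift = 3;

    public static void EnsureInitialized()
    {
        if (Volatile.Read(ref s_staticsInitializationStage) ==
            (int)StaticsInitializationStage.Complete)
            return;

        // Mandatory part: once this write is visible, other threads may
        // safely take the slow path of the lock, even if the tuning below
        // has not run yet. Nothing here takes any locks.
        s_minSpinCount = DefaultMinSpinCount << SpinCountScaleShift;
        s_maxSpinCount = DefaultMaxSpinCount << SpinCountScaleShift;
        Volatile.Write(ref s_staticsInitializationStage,
            (int)StaticsInitializationStage.Usable);

        // Optional part: refinement that only needs to happen eventually,
        // e.g. scaling the spin budget by the processor count.
        s_maxSpinCount = Math.Max(s_maxSpinCount,
            Environment.ProcessorCount * DefaultMaxSpinCount);
        Volatile.Write(ref s_staticsInitializationStage,
            (int)StaticsInitializationStage.Complete);
    }
}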