
New AsyncMutex implementation #3562

Merged
djspiewak merged 14 commits from revert-mutex-impl into typelevel:series/3.x on Apr 28, 2023

Conversation

@BalmungSan (Contributor) commented Apr 25, 2023

I was half-following the recent discussions about the Mutex issue, and in an internal conversation with Arman we both concluded that the current version of AsyncMutex is hard to follow, whereas the two previous versions were "easier" (they still require a fair number of comments and some mental effort, but they are more straightforward, IMHO). There is also the point that the previous versions always preserved fairness (FIFO semantics), unlike the current one.

Thus, considering that the main rationale for the implementation change was performance (especially on the happy path), I was curious and decided to re-run the benchmarks. Since the version Arman introduced in #3409 was heavily modified, I expected some changes.


For clarification, during the lifetime of this PR, we have had the following versions:

  • current is the version currently on series/3.x
  • previous is the final version presented in Optimize Mutex & AtomicCell #3346 (this version has been discarded since a new test showed it has a serious bug that would allow multiple acquires of the Mutex)
  • deferred is the original version presented in Optimize Mutex & AtomicCell #3346 (this version only requires Concurrent, so it can also replace the semaphore-based implementation)
  • new is the final version presented in this PR (it is based on deferred but takes advantage of Async)

Right now we (Arman & Luis) propose using either deferred or new, since both pass all the current tests (including the FIFO-order one) and are, in our humble opinion, easier to understand.
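
(For context, here is a minimal usage sketch of the cats.effect.std.Mutex API whose implementation is being discussed; the example program below is illustrative and not part of this PR.)

import cats.effect.{IO, IOApp}
import cats.effect.std.Mutex
import cats.syntax.all._

// Minimal usage sketch: `lock` is a Resource[IO, Unit] held for the duration
// of `surround`, so only one fiber at a time runs the critical section.
object MutexExample extends IOApp.Simple {
  def run: IO[Unit] =
    Mutex[IO].flatMap { mutex =>
      (1 to 5).toList.parTraverse_ { i =>
        mutex.lock.surround(IO.println(s"fiber $i holds the mutex"))
      }
    }
}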


Benchmark results

Current

Benchmark                           (fibers)  (iterations)   Mode  Cnt    Score   Error  Units
MutexBenchmark.happyPathAsync             10          1000  thrpt   20  219.723 ± 0.334  ops/s
MutexBenchmark.happyPathAsync             50          1000  thrpt   20   44.040 ± 0.563  ops/s
MutexBenchmark.happyPathAsync            100          1000  thrpt   20   22.485 ± 0.152  ops/s

MutexBenchmark.highContentionAsync        10          1000  thrpt   20   31.041 ± 1.161  ops/s
MutexBenchmark.highContentionAsync        50          1000  thrpt   20    9.442 ± 0.129  ops/s
MutexBenchmark.highContentionAsync       100          1000  thrpt   20    5.477 ± 0.131  ops/s

MutexBenchmark.cancellationAsync          10          1000  thrpt   20   21.953 ± 0.809  ops/s
MutexBenchmark.cancellationAsync          50          1000  thrpt   20    7.899 ± 0.202  ops/s
MutexBenchmark.cancellationAsync         100          1000  thrpt   20    4.844 ± 0.034  ops/s

Previous / Deferred

Benchmark                           (fibers)  (iterations)   Mode  Cnt    Score   Error  Units
MutexBenchmark.happyPathAsync             10          1000  thrpt   20  213.143 ± 1.696  ops/s
MutexBenchmark.happyPathAsync             50          1000  thrpt   20   44.325 ± 0.215  ops/s
MutexBenchmark.happyPathAsync            100          1000  thrpt   20   22.191 ± 0.215  ops/s

MutexBenchmark.highContentionAsync        10          1000  thrpt   20   34.244 ± 1.025  ops/s
MutexBenchmark.highContentionAsync        50          1000  thrpt   20    8.179 ± 0.230  ops/s
MutexBenchmark.highContentionAsync       100          1000  thrpt   20    4.424 ± 0.028  ops/s

MutexBenchmark.cancellationAsync          10          1000  thrpt   20   22.864 ± 0.901  ops/s
MutexBenchmark.cancellationAsync          50          1000  thrpt   20    7.927 ± 0.066  ops/s
MutexBenchmark.cancellationAsync         100          1000  thrpt   20    4.895 ± 0.045  ops/s

New

Benchmark                           (fibers)  (iterations)   Mode  Cnt    Score   Error  Units
MutexBenchmark.happyPathAsync             10          1000  thrpt   20  183.778 ± 0.156  ops/s
MutexBenchmark.happyPathAsync             50          1000  thrpt   20   37.519 ± 0.046  ops/s
MutexBenchmark.happyPathAsync            100          1000  thrpt   20   18.778 ± 0.121  ops/s

MutexBenchmark.highContentionAsync        10          1000  thrpt   20   32.455 ± 1.579  ops/s
MutexBenchmark.highContentionAsync        50          1000  thrpt   20    8.192 ± 0.276  ops/s
MutexBenchmark.highContentionAsync       100          1000  thrpt   20    4.486 ± 0.075  ops/s

MutexBenchmark.cancellationAsync          10          1000  thrpt   20   24.626 ± 1.514  ops/s
MutexBenchmark.cancellationAsync          50          1000  thrpt   20    8.150 ± 0.073  ops/s
MutexBenchmark.cancellationAsync         100          1000  thrpt   20    5.046 ± 0.089  ops/s

For context, here are the results of the semaphore implementation for all runs.

Current

[info] MutexBenchmark.happyPathConcurrent             10          1000  thrpt   20  198.903 ± 0.638  ops/s
[info] MutexBenchmark.happyPathConcurrent             50          1000  thrpt   20   40.689 ± 0.348  ops/s
[info] MutexBenchmark.happyPathConcurrent            100          1000  thrpt   20   20.214 ± 0.116  ops/s
[info] MutexBenchmark.highContentionConcurrent        10          1000  thrpt   20   31.392 ± 1.198  ops/s
[info] MutexBenchmark.highContentionConcurrent        50          1000  thrpt   20    7.466 ± 0.156  ops/s
[info] MutexBenchmark.highContentionConcurrent       100          1000  thrpt   20    4.014 ± 0.067  ops/s
[info] MutexBenchmark.cancellationConcurrent          10          1000  thrpt   20   21.554 ± 0.651  ops/s
[info] MutexBenchmark.cancellationConcurrent          50          1000  thrpt   20    7.455 ± 0.049  ops/s
[info] MutexBenchmark.cancellationConcurrent         100          1000  thrpt   20    3.619 ± 0.056  ops/s

Previous / Deferred

[info] MutexBenchmark.happyPathConcurrent             10          1000  thrpt   20  198.903 ± 0.638  ops/s
[info] MutexBenchmark.happyPathConcurrent             50          1000  thrpt   20   40.689 ± 0.348  ops/s
[info] MutexBenchmark.happyPathConcurrent            100          1000  thrpt   20   20.214 ± 0.116  ops/s
[info] MutexBenchmark.highContentionConcurrent        10          1000  thrpt   20   31.392 ± 1.198  ops/s
[info] MutexBenchmark.highContentionConcurrent        50          1000  thrpt   20    7.466 ± 0.156  ops/s
[info] MutexBenchmark.highContentionConcurrent       100          1000  thrpt   20    4.014 ± 0.067  ops/s
[info] MutexBenchmark.cancellationConcurrent          10          1000  thrpt   20   21.554 ± 0.651  ops/s
[info] MutexBenchmark.cancellationConcurrent          50          1000  thrpt   20    7.455 ± 0.049  ops/s
[info] MutexBenchmark.cancellationConcurrent         100          1000  thrpt   20    3.619 ± 0.056  ops/s

New

[info] MutexBenchmark.happyPathConcurrent             10          1000  thrpt   20  199.503 ± 0.725  ops/s
[info] MutexBenchmark.happyPathConcurrent             50          1000  thrpt   20   40.464 ± 0.115  ops/s
[info] MutexBenchmark.happyPathConcurrent            100          1000  thrpt   20   20.236 ± 0.263  ops/s
[info] MutexBenchmark.highContentionConcurrent        10          1000  thrpt   20   32.125 ± 1.281  ops/s
[info] MutexBenchmark.highContentionConcurrent        50          1000  thrpt   20    7.467 ± 0.117  ops/s
[info] MutexBenchmark.highContentionConcurrent       100          1000  thrpt   20    3.934 ± 0.026  ops/s
[info] MutexBenchmark.cancellationConcurrent          10          1000  thrpt   20   21.359 ± 0.838  ops/s
[info] MutexBenchmark.cancellationConcurrent          50          1000  thrpt   20    7.438 ± 0.102  ops/s
[info] MutexBenchmark.cancellationConcurrent         100          1000  thrpt   20    3.432 ± 0.017  ops/s

Lies, Damn Lies, and Benchmarks

Comment on lines 172 to 195
"preserve waiters order (fifo) on a non-race cancellation" in ticked { implicit ticker =>
val numbers = List.range(1, 10)
val p = mutex.flatMap { m =>
IO.ref(List.empty[Int]).flatMap { ref =>
for {
f1 <- m.lock.allocated
(_, f1Release) = f1
f2 <- m.lock.use_.start
_ <- IO.sleep(1.millis)
t <- numbers.parTraverse_ { i =>
IO.sleep(i.millis) >>
m.lock.surround(ref.update(acc => i :: acc))
}.start
_ <- IO.sleep(100.millis)
_ <- f2.cancel
_ <- f1Release
_ <- t.join
r <- ref.get
} yield r.reverse
}
}

p must completeAs(numbers)
}

This test passes for this PR but fails with the current implementation on series/3.x, because waiter #1 moves to the back of the line, after #10. IMHO that is a pretty significant corruption of fairness, especially considering there are no race conditions at all here: this test is fully ticked.

@durban durban marked this pull request as draft April 26, 2023 12:43
@durban (Contributor) commented Apr 26, 2023

@BalmungSan I've set this as "draft" due to your "do not merge" comment. I've also left another comment.

@BalmungSan (Contributor, Author)

As expected, this PR's version of Mutex fails the new test.
FWIW, the current version on series/3.x does pass the test! 🎉
cc @durban @armanbilge @djspiewak

@BalmungSan (Contributor, Author)

So, the "original" Deferred based lock-chain idea pass all the current tests.
I am pushing it as a milestone, I think I can optimize it a little bit but just in case.
If we decide to keep this version, since it only needs Concurrent we can just replace the Semaphore one with this.
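
Roughly, the Deferred-based lock-chain idea can be pictured as follows. This is a simplified, hypothetical sketch (the ChainedMutex name and its internals are made up for illustration, not the code in this PR): each fiber appends its own Deferred cell to the tail of the chain and then waits on the previous cell, so acquisition order is FIFO by construction, and only Concurrent is needed.

import cats.effect.kernel.{Concurrent, Deferred, Ref, Resource}
import cats.effect.syntax.all._
import cats.syntax.all._

object ChainedMutex {
  def apply[F[_]](implicit F: Concurrent[F]): F[Resource[F, Unit]] =
    for {
      // The chain starts with an already-completed cell: the lock is free.
      first <- Deferred[F, Unit]
      _ <- first.complete(())
      last <- Ref.of[F, Deferred[F, Unit]](first)
    } yield Resource
      .makeFull[F, Deferred[F, Unit]] { poll =>
        for {
          ourCell <- Deferred[F, Unit]
          // Atomically become the new tail of the chain.
          prevCell <- last.getAndSet(ourCell)
          // Wait (cancelably) for the previous holder to release. If we are
          // cancelled while waiting, relink the chain in the background so
          // the fibers queued behind us are not blocked forever.
          _ <- poll(prevCell.get)
            .onCancel((prevCell.get *> ourCell.complete(())).start.void)
        } yield ourCell
      } { ourCell =>
        // Releasing completes our cell, waking the next waiter (if any).
        ourCell.complete(()).void
      }
      .void
}

A lock built this way is fair: the order in which fibers run getAndSet is exactly the order in which they will acquire the lock.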

@BalmungSan BalmungSan marked this pull request as ready for review April 26, 2023 23:37
@BalmungSan BalmungSan changed the title from "Revert AsyncMutex implementation" to "New AsyncMutex implementation" on Apr 26, 2023
@BalmungSan (Contributor, Author) commented Apr 26, 2023

I just updated the PR description with the benchmark results for current, previous (deferred) & new, and it seems the performance of all versions is very similar.

Thus, I think the best option is to just use the deferred one and remove the semaphore-based one.

cc: @durban @armanbilge @djspiewak

@djspiewak (Member)

Yeah this is super-duper close. It definitely feels like it's best to pick the simplest one with the most reasonable semantics.

@BalmungSan (Contributor, Author) commented Apr 27, 2023

Okay, so I reverted and unified both the Async & Concurrent implementations around the original LockChain idea.
I fleshed out the naming a bit, added a couple of comments, and made some minor improvements like using eq, avoiding a wrapper allocation, and the like. Plus, a minor optimization on release for the case when no one is waiting for the Mutex.
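
To illustrate the kind of release fast path mentioned above, here is a hypothetical sketch building on the ChainedMutex idea from earlier in this thread (it is not the merged code): if the tail of the chain is still our own cell, nobody is queued behind us, so release can swing the tail back to a pre-completed "empty" cell instead of completing a Deferred that no one will ever wait on.

import cats.effect.kernel.{Concurrent, Deferred, Ref}
import cats.syntax.all._

object ChainedMutexRelease {
  // Hypothetical release fast path; `empty` is a pre-completed cell that
  // represents the "unlocked" state.
  def release[F[_]](
      ourCell: Deferred[F, Unit],
      last: Ref[F, Deferred[F, Unit]],
      empty: Deferred[F, Unit]
  )(implicit F: Concurrent[F]): F[Unit] =
    last.access.flatMap { case (tail, trySet) =>
      // `eq` (reference equality) is enough here and avoids an equals call.
      if (tail eq ourCell)
        // Nobody is queued behind us: try to reset the chain to "unlocked".
        trySet(empty).flatMap {
          case true  => F.unit
          case false => ourCell.complete(()).void // lost a race with a new waiter: wake them
        }
      else
        ourCell.complete(()).void // there are waiters: wake the next one (FIFO)
    }
}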

Final version benchmark results

Benchmark                                (fibers)  (iterations)   Mode  Cnt    Score   Error  Units
MutexBenchmark.happyPathConcurrent             10          1000  thrpt   20  213.946 ± 1.116  ops/s
MutexBenchmark.happyPathConcurrent             50          1000  thrpt   20   44.099 ± 0.356  ops/s
MutexBenchmark.happyPathConcurrent            100          1000  thrpt   20   21.624 ± 0.089  ops/s

MutexBenchmark.highContentionConcurrent        10          1000  thrpt   20   33.514 ± 1.624  ops/s
MutexBenchmark.highContentionConcurrent        50          1000  thrpt   20    7.655 ± 0.363  ops/s
MutexBenchmark.highContentionConcurrent       100          1000  thrpt   20    4.203 ± 0.024  ops/s

MutexBenchmark.cancellationConcurrent          10          1000  thrpt   20   22.853 ± 1.031  ops/s
MutexBenchmark.cancellationConcurrent          50          1000  thrpt   20    7.895 ± 0.176  ops/s
MutexBenchmark.cancellationConcurrent         100          1000  thrpt   20    4.997 ± 0.080  ops/s

cc: @djspiewak @armanbilge @durban

@BalmungSan BalmungSan requested a review from durban April 27, 2023 17:43
@djspiewak djspiewak merged commit 741f87a into typelevel:series/3.x Apr 28, 2023
@armanbilge armanbilge linked issue "Async mutex corrupts FIFO on happy(ish) path" on Apr 28, 2023, which may be closed by this pull request
@BalmungSan BalmungSan deleted the revert-mutex-impl branch April 28, 2023 16:14