
experiment: spin locks #22

Closed
wants to merge 2 commits into from

Conversation


@hugosenari hugosenari commented Sep 8, 2023

Disclaimer:

This is another experiment; I don't expect it to be approved.

Motivation: same as #21.

The initial idea was to use spin locks instead of regular locks, following the recommendations of this post. I'm using TTAS (test and test-and-set) instead of plain TAS. I would also like to use PAUSE, but it is compiler/platform specific.

Unexpected Outcome:

It went pretty badly, worse than the current implementation.
So I created another version using more spin locks (as I did for #21), one per thread, so they almost never contend with each other. That one went very well.

Results

We can't compare it with the #21 results because I changed the test to set EPOCH after waitAll and before spawn, which is more accurate to the test's intent (checking how much time we spend between spawns).

The benchmark results:

Since spin locks alone made the results worse, I focused on the second implementation, spin_doctors; it speeds up the results by roughly 3~5 times.

But it's still not perfect:

  1. T1E0 - T0E1 increased because T0E1's 75th percentile dropped from ~19767ns to ~2724ns
  2. OP increased because spin locks use 100% of the CPU
    • For this test, 1 run with 3 spawns means 5 other threads are just spinning.
    • Maybe that's why I get better results with more spawns.

Suggestion:
Some sleep, no stealing.

I know that threads let us sleep with nanosecond granularity; we could sleep after N runs without any task. This could also help with timeouts.

There is no task to steal, because I'm only scheduling as many threads as there are tasks.

@Araq
Owner

Araq commented Sep 8, 2023

There is cpuRelax() and it's mapped to PAUSE or similar, enjoy.

So I think the takeaway is that my stuff is very hard to improve upon? :P

Please note that it is important that there is eventually no busy-loop and the OS is allowed to pause threads, as this matters for energy consumption and Malebolgia is supposed to run well on embedded systems. However, a runtime flag like useBusyLoop: bool could be added to the master object.

Comment on lines +168 to +169
```nim
ackchyually = i
break
```
Author

Suggested change:

```diff
 ackchyually = i
+signal(chan.alarms[i])
 break
```

To wake up only when empty

Comment on lines +179 to +184
```nim
wip = tsWip
while not globalStopToken.load(moRelaxed):
  # Test
  if chan.board[i].load(moRelaxed) != todo:
    cpuRelax()
    continue
```
Author

Suggested change:

```diff
 wip = tsWip
+done = tsDone
+idle = 0
 while not globalStopToken.load(moRelaxed):
   # Test
   if chan.board[i].load(moRelaxed) != todo:
+    inc idle
+    if idle > 1000 and chan.board[i].compareExchange(done, empty, moRelaxed, moRelease):
+      idle = 0
+      wait(chan.alarms[i])
     cpuRelax()
     continue
```

wait after N idle runs

Comment on lines +233 to +237
```nim
if busyThreads.load(moRelaxed) < ThreadPoolSize:
  taskCreated master
  send PoolTask(m: addr(master), t: toTask(fn), result: nil)
else:
  fn
```
Author

I know it isn't so simple, but:

If the main thread is used to perform a task, who is left to tell the other threads what to do?
We win one more thread and may lose up to seven (they can sit idle).

Suggested change:

```diff
-if busyThreads.load(moRelaxed) < ThreadPoolSize:
+if isMainThread() or busyThreads.load(moRelaxed) < ThreadPoolSize:
   taskCreated master
   send PoolTask(m: addr(master), t: toTask(fn), result: nil)
 else:
   fn
```

Owner

That's a good idea.

@hugosenari
Author

There is cpuRelax() and it's mapped to PAUSE or similar, enjoy.

Updated, thanks,

So I think the takeaway is that my stuff is very hard to improve upon? :P

You are lazy when naming variables; other than that, you do your best, so there is less room for improvement (in performance) ;-)

However, a runtime flag like useBusyLoop: bool could be added to the master object.

I added suggestions for a wait/signal, but I forgot to review the send iterator.
It is currently round-robin, meaning that even if thread01 is free I assign the task to the next one; this wakes up all threads instead of using the ones that are already awake.

Round-robin:

  • more time spent on the global atomic currentThread (x1)
  • less likely to test the current thread as busy (x8)

Start from 0 always:

  • no time spent on the global atomic currentThread (x1)
  • more likely to test the current thread as busy (x8)
  • but won't wake all threads

@Araq
Owner

Araq commented Sep 8, 2023

You are lazy when naming variables; other than that, you do your best, so there is less room for improvement (in performance)

I don't really. This stuff is super hard, so I tried to write it as simply as possible, though I'm not afraid of atomic instructions.

Comment on lines +104 to +113
```nim
echo [
  $inNanoseconds(epoch - bigbang),  # measurement precision
  $inNanoseconds(ops[0] - epoch),   # how fast we perform the 1st task
  $inNanoseconds(ops[1] - epoch),   # how fast we perform the 2nd task, serial after the 1st
  $inNanoseconds(ops[2] - epoch),   # how fast we perform the 3rd task, parallel
  $inNanoseconds(ops[3] - epoch),   # how fast we perform the 4th task, serial after the 3rd
  $inNanoseconds(ops[4] - epoch),   # how fast we perform the 5th task, parallel
  $inNanoseconds(ops[1] - ops[0]),  # serial latency
  $inNanoseconds(ops[3] - ops[1]),  # parallel latency
].join(sep)
```
Author

This answers your question about the benchmark results:

```nim
echo [
  "OP",
  "T0E0",
  "T0E1",
  "T1E0",
  "T1E1",
  "T2E0",
  "T0E1-T0E0",
  "T1E0-T0E1",
].join(sep)
```

Now I realize that OP is misaligned: it used to be the delta of two consecutive getMonoTime() calls, like T0E1-T0E0 (ops[1] - ops[0]), which has the same effect.
Now it is the time from just before parApply until the first line of the body in waitAll.

@hugosenari
Author

Closing, as the experiment is completed.

The next one should be work stealing, but you already played with it and mentioned it was slower, so I'm not motivated.

I'm not afraid of atomic instructions.

I couldn't say the same; I'm still on page 183, and parallelism only starts on page 253.

@hugosenari hugosenari closed this Sep 12, 2023
@Araq
Owner

Araq commented Sep 12, 2023

The next one should be work stealing, but you already played with it and mentioned it was slower, so I'm not motivated.

But I didn't use a lockfree queue for it and I only tested it on an 8-core M1. Work stealing scales much better on bigger hardware. It shouldn't be hard to beat the existing implementation.

@hugosenari
Author

But I didn't use a lockfree queue for it and I only tested it on an 8-core M1. Work stealing scales much better on bigger hardware. It shouldn't be hard to beat the existing implementation.

Malebolgia on the MASTER branch, or this PR's malebolgia_spin.nim:

[graph image]

Malebolgia with this PR's malebolgia_spin_doctors.nim:

[graph image]

My view of what work stealing looks like:

[graph image]

...
NOPE! 🤣

OK, to be fair, this representation is worse than the real worst-case scenario at any point in time: if a thread is popping/shifting from another, it isn't shifting from itself, or from all the others at the same time.
But as with the other graphs, I represented a superposition of all possible cases. 🤔

Notes:
Simplified to a single producer and N consumers; in reality there can be N producers and N consumers.
I know that .L isn't the queue itself, but it is what we have to wait on to operate the queue.

@Araq
Owner

Araq commented Sep 13, 2023

How did you make these graphs? They are awesome!

However, as I said, work stealing doesn't even have to use locks at all and can be done with a lockfree queue. And more importantly, all the highly concurrent/parallel runtimes ended up doing work stealing so it's reasonable to assume that it simply is a very good mechanism.

@hugosenari
Author
