
experiment: spin locks #22

Closed
wants to merge 2 commits into from

Conversation


@hugosenari hugosenari commented Sep 8, 2023

Disclaimer:

This is another experiment; I don't expect it to be approved.

Motivation: same as #21.

The initial idea was to use spin locks instead of regular locks, following the recommendations of this post. I'm using TTAS (test and test-and-set) instead of plain TAS. I would also like to use PAUSE, but it is compiler/platform specific.

Unexpected Outcome:

It went pretty badly, worse than the current implementation.
So I created another version using more spin locks (as I did for #21), one per thread, so they almost never contend with each other. That one went very well.

Results

We can't compare it with the #21 results because I changed the test to set EPOCH after waitAll and before spawn, which is more accurate to the test's intent (checking how much time we spend between spawns).

The benchmark results:

Since spin locks alone made the results worse, I focused on the second implementation, spin_doctors; it speeds up the results by roughly 3~5 times.

But it's still not perfect:

  1. T1E0 - T0E1 increased because T0E1's 75th percentile dropped from ~19767ns to ~2724ns
  2. OP increased because spin locks use 100% of the CPU
    • For this test, 1 run with 3 spawns means 5 other threads are just spinning.
    • Maybe that's why I get better results with more spawns.

Suggestion:
Some sleep, no stealing.

I know that threads let us sleep with nanosecond granularity; we could sleep after N runs without any task. This could also help with timeouts.

There is no task to steal, because I'm only scheduling as many threads as there are tasks.

@Araq
Owner

Araq commented Sep 8, 2023

There is cpuRelax() and it's mapped to PAUSE or similar, enjoy.

So I think the takeaway is that my stuff is very hard to improve upon? :P

Please note that it is important that there is eventually no busy-loop and the OS is allowed to pause threads, as this matters for energy consumption and Malebolgia is supposed to run well on embedded systems. However, a runtime flag like useBusyLoop: bool could be added to the master object.

Comment on lines +168 to +169
```nim
ackchyually = i
break
```
Author

Suggested change:

```diff
 ackchyually = i
+signal(chan.alarms[i])
 break
```

To wake up only when empty

Comment on lines +179 to +184
```nim
wip = tsWip
while not globalStopToken.load(moRelaxed):
  # Test
  if chan.board[i].load(moRelaxed) != todo:
    cpuRelax()
    continue
```
Author

Suggested change:

```diff
 wip = tsWip
+done = tsDone
+idle = 0
 while not globalStopToken.load(moRelaxed):
   # Test
   if chan.board[i].load(moRelaxed) != todo:
+    inc idle
+    if idle > 1000 and chan.board[i].compareExchange(done, empty, moRelaxed, moRelease):
+      idle = 0
+      wait(chan.alarms[i])
     cpuRelax()
     continue
```

wait after N idle runs

Comment on lines +233 to +237
```nim
if busyThreads.load(moRelaxed) < ThreadPoolSize:
  taskCreated master
  send PoolTask(m: addr(master), t: toTask(fn), result: nil)
else:
  fn
```
Author

I know it isn't so simple, but:

If the main thread is used to perform a task, who is left to tell the other threads what to do?
We win one more thread and may lose up to seven (they can sit idle).

Suggested change:

```diff
-if busyThreads.load(moRelaxed) < ThreadPoolSize:
+if isMainThread() or busyThreads.load(moRelaxed) < ThreadPoolSize:
   taskCreated master
   send PoolTask(m: addr(master), t: toTask(fn), result: nil)
 else:
   fn
```

Owner

That's a good idea.

@hugosenari
Author

There is cpuRelax() and it's mapped to PAUSE or similar, enjoy.

Updated, thanks,

So I think the takeaway is that my stuff is very hard to improve upon? :P

You are lazy when naming variables; other than that, you do your best, so there is less room for improvement (in performance) ;-)

However, a runtime flag like useBusyLoop: bool could be added to the master object.

I added suggestions for a wait/signal, but I forgot to review the send iterator.
It is currently round-robin, meaning that even if thread01 is free I assign the task to the next one; this wakes up all threads instead of using the ones that are already awake.

Round-robin:

  • more time spent on the global atomic currentThread (x1)
  • less likely to test the current thread as busy (x8)

Start from 0 always:

  • no time spent on the global atomic currentThread (x1)
  • more likely to test the current thread as busy (x8)
  • but won't wake all threads

@Araq
Owner

Araq commented Sep 8, 2023

You are lazy when naming variables; other than that, you do your best, so there is less room for improvement (in performance)

I don't really. This stuff is super hard, so I tried to write it as simply as possible, though I'm not afraid of atomic instructions.

Comment on lines +104 to +113
```nim
echo [
  $inNanoseconds(epoch - bigbang),  # measurement precision
  $inNanoseconds(ops[0] - epoch),   # how fast we perform the 1st task
  $inNanoseconds(ops[1] - epoch),   # how fast we perform the 2nd task, serial after the 1st
  $inNanoseconds(ops[2] - epoch),   # how fast we perform the 3rd task, parallel
  $inNanoseconds(ops[3] - epoch),   # how fast we perform the 4th task, serial after the 3rd
  $inNanoseconds(ops[4] - epoch),   # how fast we perform the 5th task, parallel
  $inNanoseconds(ops[1] - ops[0]),  # serial latency
  $inNanoseconds(ops[3] - ops[1]),  # parallel latency
].join(sep)
```
Author

This answers your question about the benchmark results:

```nim
echo [
  "OP",
  "T0E0",
  "T0E1",
  "T1E0",
  "T1E1",
  "T2E0",
  "T0E1-T0E0",
  "T1E0-T0E1",
].join(sep)
```

Now I realize that OP is misaligned: it used to be the delta of two consecutive getMonoTime() calls, like T0E1-T0E0 (ops[1] - ops[0]), which has the same effect.
Now it is the time from just before parApply until the first line of the body in waitAll.

@hugosenari
Author

Closing, as the experiment is completed.

The next one should be work stealing, but you already played with it and mentioned it was slower, so I'm not motivated.

I'm not afraid of atomic instructions.

I couldn't say the same; I'm still on page 183, and parallelism only starts on page 253.

@hugosenari hugosenari closed this Sep 12, 2023
@Araq
Owner

Araq commented Sep 12, 2023

The next one should be work stealing, but you already played with it and mentioned it was slower, so I'm not motivated.

But I didn't use a lockfree queue for it and I only tested it on an 8-core M1. Work stealing scales much better on bigger hardware. It shouldn't be hard to beat the existing implementation.

@hugosenari
Author

But I didn't use a lockfree queue for it and I only tested it on an 8-core M1. Work stealing scales much better on bigger hardware. It shouldn't be hard to beat the existing implementation.

Malebolgia on the MASTER branch, or this PR's malebolgia_spin.nim:

[graph image]

Malebolgia with this PR's malebolgia_spin_doctors.nim:

[graph image]

My view of what work stealing looks like:

[graph image]

...
NOPE! 🤣

OK, to be fair, this representation is worse than the real worst-case scenario at any point in time: if a thread is popping/shifting from another, it isn't shifting from itself, or from all the others at the same time.
But as with the other graphs, I represented a superposition of all possible cases. 🤔

Notes:
Simplified to a single producer and N consumers; in reality there can be N producers and N consumers.
I know that .L isn't the queue itself, but it is what we have to wait on to operate the queue.

@Araq
Owner

Araq commented Sep 13, 2023

How did you make these graphs? They are awesome!

However, as I said, work stealing doesn't even have to use locks at all and can be done with a lockfree queue. And more importantly, all the highly concurrent/parallel runtimes ended up doing work stealing so it's reasonable to assume that it simply is a very good mechanism.

@hugosenari
Author
