experiment: spin locks #22
Conversation
So I think the takeaway is that my stuff is very hard to improve upon? :P Please note that it is important that eventually there is no busy-loop and the OS is allowed to pause threads, as this can be important for energy consumption, and Malebolgia is supposed to run well on embedded systems. However, a runtime flag like
```nim
ackchyually = i
break
```
Suggested change:

```nim
ackchyually = i
signal(chan.alarms[i])
break
```
To wake up only when empty
```nim
wip = tsWip
while not globalStopToken.load(moRelaxed):
  # Test
  if chan.board[i].load(moRelaxed) != todo:
    cpuRelax()
    continue
```
Suggested change:

```nim
wip = tsWip
done = tsDone
idle = 0
while not globalStopToken.load(moRelaxed):
  # Test
  if chan.board[i].load(moRelaxed) != todo:
    inc idle
    if idle > 1000 and chan.board[i].compareExchange(done, empty, moRelaxed, moRelease):
      idle = 0
      wait(chan.alarms[i])
    cpuRelax()
    continue
```
wait after N idle runs
```nim
if busyThreads.load(moRelaxed) < ThreadPoolSize:
  taskCreated master
  send PoolTask(m: addr(master), t: toTask(fn), result: nil)
else:
  fn
```
I know it isn't so simple, but: if the main thread is used to perform a task, who is there to tell the other threads what to do?
We win one more thread but may lose seven, since they can sit idle.
Suggested change:

```nim
if isMainThread() or busyThreads.load(moRelaxed) < ThreadPoolSize:
  taskCreated master
  send PoolTask(m: addr(master), t: toTask(fn), result: nil)
else:
  fn
```
That's a good idea.
Updated, thanks.
You are lazy naming variables; other than that, you try to do your best, so there is less room for improvement (in performance) ;-)
I added suggestions for a wait/signal, but I forgot to review the send iterator. Round-robin:
Start from 0 always
I don't, really. This stuff is super hard, so I tried to write it as simply as possible, though I'm not afraid of atomic instructions.
```nim
echo [
  $inNanoseconds(epoch - bigbang),  # measurement precision
  $inNanoseconds(ops[0] - epoch),   # how fast we perform the 1st task
  $inNanoseconds(ops[1] - epoch),   # how fast we perform the 2nd task, serial after the 1st
  $inNanoseconds(ops[2] - epoch),   # how fast we perform the 3rd task, parallel
  $inNanoseconds(ops[3] - epoch),   # how fast we perform the 4th task, serial after the 3rd
  $inNanoseconds(ops[4] - epoch),   # how fast we perform the 5th task, parallel
  $inNanoseconds(ops[1] - ops[0]),  # serial latency
  $inNanoseconds(ops[3] - ops[1]),  # parallel latency
].join(sep)
```
This answers your question about benchmark results
```nim
echo [
  "OP",
  "T0E0",
  "T0E1",
  "T1E0",
  "T1E1",
  "T2E0",
  "T0E1-T0E0",
  "T1E0-T0E1",
].join(sep)
```
Now I realize that OP is misaligned; it was the delta of two consecutive getMonoTime() calls, like T0E1-T0E0 (ops[1]-ops[0]), which has the same effect.
Now it is the time from just before parApply until the first line of the body in waitAll.
Closing, as the experiment is completed. The next one should be work stealing, but you already played with it and mentioned it was slower.
I couldn't say the same; I'm still on page 183, and parallelism only starts on page 253.
But I didn't use a lock-free queue for it and only tested it on an 8-core M1. Work stealing scales much better on bigger hardware. It shouldn't be hard to beat the existing implementation.
[Diagrams comparing the schedulers:]
- Malebolgia as the MASTER branch, or this PR in malebolgia_spin.nim
- Malebolgia as this PR in malebolgia_spin_doctors.nim
- My view of how work stealing looks like, ...

Ok, to be fair, this representation is worse than the real worst scenario at some point in time.
Notes:
How did you make these graphs? They are awesome! However, as I said, work stealing doesn't even have to use locks at all and can be done with a lock-free queue. More importantly, all the highly concurrent/parallel runtimes ended up doing work stealing, so it's reasonable to assume that it simply is a very good mechanism.
Coggle.it: https://coggle.it/diagram/ZQDcj9E4196r1jqQ/t/t0
Disclaimer:
This is another experiment; I would not expect it to be approved.
Motivation: same as for #21.
The initial idea was to use spin locks instead of locks, following this post's recommendations. I'm using TTAS instead of TAS; I would like to use PAUSE, but it is compiler/platform specific.
Unexpected Outcome:
It went pretty badly, worse than the current implementation.
So I created another version using more spin locks (like I did for #21), one per thread, so they almost never fight with each other. And that one went very well.
Results
We can't compare with the #21 results, because I changed the test to set EPOCH after waitAll and before spawn; this is more accurate to the test's intent (checking how much time we spend between spawns).
The benchmark results:
Since spin locks alone make the results worse, I focused on the second implementation, spin_doctors; it speeds up the results by 3~5 times.
But it is still not perfect:
Suggestion:
Some sleep, no stealing.
I know that threads let us sleep with nanosecond granularity; we could sleep after N runs without any task, and this could also help with timeouts.
There is no task to steal, because I'm only scheduling the same number of threads as tasks.