Investigate the 10x multi-threaded perf improvement hiding in distribution/scheduling. #9230
Comments
(that's entirely unrelated to this issue?)
comment withdrawn :D
The tracy fibers issue is that the fast thread-local queues get disabled and there's one central queue that gets all events from all threads. wolfpd says that we could probably fix this by having fiber enter/leave reset the
@benvanik Still open? Still P1?
Found a great example of memory system effects between single-threaded (red=1 core) and multi-threaded (yellow=32 cores) runs, plotted as cumulative time (ignore the total times, as these are two benchmark runs with different iteration counts; the mean/median is what's relevant):
(this is with the allocator zero page optimizations disabled so that no faults occur during execution)
This scales with worker count; so here's red=1 core/yellow=8 cores:
Without tracy, 1 core / 8 cores / 32 cores:
I believe this is either contention or a cache locality issue, but I have not dug in to identify which loads/stores are bad. It could also be memory bandwidth saturation; for 12MB buffers it's definitely putting pressure on the cache hierarchy:
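To separate memory-system effects from the dispatch logic itself, a minimal standalone harness (hypothetical names; not IREE code) could have N workers each stream stores over disjoint slices of one shared 12MB heap buffer, one store per 64B cache line, and time it at 1/8/32 workers:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical harness (not IREE code): N workers each stream stores over a
// disjoint slice of one shared 12MB heap buffer, one store per 64B cache
// line. Timing this at different worker counts isolates memory-system
// effects (bandwidth saturation, cache pressure) from the dispatch logic.
#define BUFFER_SIZE (12u << 20)  // 12MB, matching the dispatch in question
#define NUM_WORKERS 8

typedef struct {
  uint8_t *base;
  size_t begin, end;
} slice_t;

static void *fill_slice(void *arg) {
  slice_t *s = (slice_t *)arg;
  for (size_t i = s->begin; i < s->end; i += 64) s->base[i] = 1;
  return NULL;
}

// Returns the number of cache lines touched (sanity check for the harness).
size_t run_parallel_fill(uint8_t *buf) {
  pthread_t workers[NUM_WORKERS];
  slice_t slices[NUM_WORKERS];
  size_t chunk = BUFFER_SIZE / NUM_WORKERS;
  for (int t = 0; t < NUM_WORKERS; ++t) {
    slices[t] = (slice_t){buf, t * chunk, (t + 1) * chunk};
    pthread_create(&workers[t], NULL, fill_slice, &slices[t]);
  }
  for (int t = 0; t < NUM_WORKERS; ++t) pthread_join(workers[t], NULL);
  size_t touched = 0;
  for (size_t i = 0; i < BUFFER_SIZE; i += 64) touched += buf[i];
  return touched;
}
```

Wrapping the `pthread_create`/`pthread_join` region in a timer and sweeping `NUM_WORKERS` would reproduce the per-tile slowdown curve without involving the scheduler at all.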
What this hints at is that we run 10x slower per tile on 32x more threads, so we end up with ~3x lower latency overall for this particular dispatch, but we get there by being 10x more wasteful than we would be with perfect scaling.
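The arithmetic behind that claim, as a sketch: if each tile runs ~10x slower but 32 threads process tiles concurrently, wall-clock speedup is 32/10 ≈ 3.2x while total CPU time inflates by the full 10x.

```c
#include <math.h>

// Worked numbers from the comment above: 32 threads, each tile running
// ~10x slower than in the single-threaded case.

// Wall-clock speedup = parallelism / per-tile slowdown.
static double effective_speedup(double workers, double per_tile_slowdown) {
  return workers / per_tile_slowdown;
}

// Total CPU time (work) inflates by exactly the per-tile slowdown.
static double wasted_cpu_factor(double per_tile_slowdown) {
  return per_tile_slowdown;
}
```

With perfect scaling (per-tile slowdown of 1x) the same 32 threads would yield a 32x speedup instead of ~3.2x.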
I picked this dispatch because it's a fairly mundane elementwise + transpose:
Input MHLO model: https://storage.googleapis.com/iree-shared-files/nod-perf/bert_large_250M.mlir
Compilation flags:
Run flags:
Added to buffer_heap.c iree_hal_heap_buffer_create:
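The actual diff isn't reproduced above; as a purely hypothetical sketch of the kind of change implied by "zero page optimizations disabled" (all names invented here), touching every page at buffer creation pays the zero-page faults up front instead of during dispatch execution:

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical sketch only (not the actual buffer_heap.c change): touch
// every page of a heap buffer at creation time so that zero-page faults
// are paid up front rather than during dispatch execution.
void *allocate_prefaulted(size_t size) {
  void *ptr = malloc(size);
  if (ptr) memset(ptr, 0, size);  // forces the kernel to back every page
  return ptr;
}
```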
Next steps are to get perf counters to verify what's causing the 10x difference (cache misses, pending memory ops, something else spooky, etc), isolate this into a microbenchmark, and try it on some different systems.
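One way to collect those counters programmatically on Linux is the `perf_event_open` syscall; a sketch that counts hardware cache misses around a workload (returns -1 where counters are unavailable, e.g. due to a restrictive `perf_event_paranoid` setting — the demo workload here is an invented stand-in):

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Counts hardware cache misses around a workload via perf_event_open.
// Returns -1 if the syscall is unavailable or not permitted.
long long count_cache_misses(void (*workload)(void)) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_CACHE_MISSES;
  attr.disabled = 1;        // start stopped; enable explicitly below
  attr.exclude_kernel = 1;  // user-space misses only
  attr.exclude_hv = 1;
  int fd = (int)syscall(SYS_perf_event_open, &attr, /*pid=*/0, /*cpu=*/-1,
                        /*group_fd=*/-1, /*flags=*/0);
  if (fd < 0) return -1;
  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  workload();
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  long long count = -1;
  if (read(fd, &count, sizeof(count)) != sizeof(count)) count = -1;
  close(fd);
  return count;
}

// Demo workload (invented): stride over a 1MB buffer at cache-line pitch.
static unsigned char g_buf[1u << 20];
static void demo_workload(void) {
  for (size_t i = 0; i < sizeof(g_buf); i += 64) g_buf[i]++;
}
```

The same pattern extends to `PERF_COUNT_HW_STALLED_CYCLES_BACKEND` or raw events for pending-memory-op counters, which would distinguish cache misses from bandwidth saturation.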