
Investigate the 10x multi-threaded perf improvement hiding in distribution/scheduling. #9230

Open · benvanik opened this issue May 26, 2022 · 6 comments

@benvanik

Found a great example of memory system effects between single-threaded (red = 1 core) and multi-threaded (yellow = 32 cores) execution, shown as cumulative time (ignore the total times, as these are two benchmark runs with different iteration counts; the mean/median is what's relevant):
[tracy capture: cumulative dispatch time, 1 core vs 32 cores]
(This is with the allocator zero-page optimizations disabled so that there are no page faults during execution.)

This scales with worker count; here's red = 1 core / yellow = 8 cores:
[tracy capture: cumulative dispatch time, 1 core vs 8 cores]

Without tracy, 1 core / 8 cores / 32 cores:

--------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations
--------------------------------------------------------------------------
BM_learn/process_time/real_time      23854 ms        23849 ms            6   // 1
BM_learn/process_time/real_time       4776 ms        29411 ms           29   // 8
BM_learn/process_time/real_time       2684 ms        46996 ms           52   // 32

I believe this is either contention or a cache locality issue, but I have not dug in to see what exactly is bad (which loads/stores). It could also be memory bandwidth saturation. With 12 MB buffers it's definitely putting pressure on the cache hierarchy:

CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 512 KiB (x32)
  L3 Unified 16384 KiB (x8)

What this hints at is that with 32x more threads each tile runs ~10x slower, so we end up with only ~3x lower latency overall for this particular dispatch (32 / 10 ≈ 3x), while burning ~10x more CPU than we would if we were scaling perfectly.
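For the whole model (not just this dispatch) the table above works out to roughly a 9x wall-clock speedup for 2x the total CPU time on 32 cores, i.e. ~28% parallel efficiency. A trivial check of that arithmetic, with the numbers plugged straight from the table:

#include <stdio.h>

int main(void) {
  // Whole-model numbers copied from the benchmark table above (ms/iteration).
  const double wall_1 = 23854.0, cpu_1 = 23849.0;   // 1 core
  const double wall_32 = 2684.0, cpu_32 = 46996.0;  // 32 cores
  printf("wall-clock speedup:  %.1fx\n", wall_1 / wall_32);                     // ~8.9x
  printf("CPU-time inflation:  %.1fx\n", cpu_32 / cpu_1);                       // ~2.0x
  printf("parallel efficiency: %.0f%%\n", 100.0 * wall_1 / (32.0 * wall_32));   // ~28%
  return 0;
}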

I picked this dispatch because it's a fairly mundane elementwise + transpose:

#map0 = affine_map<(d0, d1) -> (d0, d1)>
#map9 = affine_map<(d0, d1) -> (d1, d0)>
  stream.executable private @__inference_learn_231830_dispatch_2094 {
    stream.executable.export public @__inference_learn_231830_dispatch_2094
    builtin.module {
      func.func @__inference_learn_231830_dispatch_2094(%arg0: !stream.binding {stream.alignment = 64 : index}, %arg1: !stream.binding {stream.alignment = 64 : index}, %arg2: !stream.binding {stream.alignment = 64 : index}, %arg3: i32, %arg4: i32, %arg5: i32 {stream.values = [36954688 : i32, 237015040 : i32, 327192576 : i32, 417370112 : i32, 507547648 : i32, 597725184 : i32, 687902720 : i32, 778080256 : i32, 868257792 : i32, 958435328 : i32, 1048612864 : i32, 1138790400 : i32, 1228967936 : i32, 1319145472 : i32, 1409323008 : i32, 1499500544 : i32, 1589678080 : i32, 1679855616 : i32, 1770033152 : i32, 1860210688 : i32, 2040565760 : i32, 2130743296 : i32, -2074046464 : i32, -2040492032 : i32]}, %arg6: i32 {stream.values = [212684864 : i32, 246255680 : i32, 279826496 : i32, 313397312 : i32, 346968128 : i32, 380538944 : i32, 414109760 : i32, 447680576 : i32, 481251392 : i32, 514822208 : i32, 548393024 : i32, 581963840 : i32, 615534656 : i32, 649105472 : i32, 682676288 : i32, 716247104 : i32, 749817920 : i32, 783388736 : i32, 816959552 : i32, 850530368 : i32, 884101184 : i32, 917672000 : i32, 951242816 : i32, 984813632 : i32]}) {
        %cst = arith.constant 0.00999999977 : f32
        %c32_i64 = arith.constant 32 : i64
        %0 = arith.extui %arg3 : i32 to i64
        %1 = arith.extui %arg4 : i32 to i64
        %2 = arith.shli %1, %c32_i64 : i64
        %3 = arith.ori %0, %2 : i64
        %4 = arith.index_cast %3 : i64 to index
        %5 = arith.extui %arg5 : i32 to i64
        %6 = arith.index_cast %5 : i64 to index
        %7 = arith.extui %arg6 : i32 to i64
        %8 = arith.index_cast %7 : i64 to index
        %9 = stream.binding.subspan %arg0[%4] : !stream.binding -> !flow.dispatch.tensor<readonly:1024x3072xf32>
        %10 = stream.binding.subspan %arg1[%6] : !stream.binding -> !flow.dispatch.tensor<readonly:3072x1024xf32>
        %11 = stream.binding.subspan %arg2[%8] : !stream.binding -> !flow.dispatch.tensor<writeonly:1024x3072xf32>
        %12 = flow.dispatch.tensor.load %9, offsets = [0, 0], sizes = [1024, 3072], strides = [1, 1] : !flow.dispatch.tensor<readonly:1024x3072xf32> -> tensor<1024x3072xf32>
        %13 = flow.dispatch.tensor.load %10, offsets = [0, 0], sizes = [3072, 1024], strides = [1, 1] : !flow.dispatch.tensor<readonly:3072x1024xf32> -> tensor<3072x1024xf32>
        %14 = linalg.init_tensor [1024, 3072] : tensor<1024x3072xf32>
        %15 = linalg.generic {indexing_maps = [#map0, #map9, #map0], iterator_types = ["parallel", "parallel"]} ins(%12, %13 : tensor<1024x3072xf32>, tensor<3072x1024xf32>) outs(%14 : tensor<1024x3072xf32>) {
        ^bb0(%arg7: f32, %arg8: f32, %arg9: f32):
          %16 = arith.mulf %arg8, %cst : f32
          %17 = arith.subf %arg7, %16 : f32
          linalg.yield %17 : f32
        } -> tensor<1024x3072xf32>
        flow.dispatch.tensor.store %15, %11, offsets = [0, 0], sizes = [1024, 3072], strides = [1, 1] : tensor<1024x3072xf32> -> !flow.dispatch.tensor<writeonly:1024x3072xf32>
        return
      }
    }
  }
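For scale: each of those 1024x3072 f32 tensors is 12 MiB, so the dispatch streams ~36 MiB against the 512 KiB L2s and 16 MiB L3 slices listed above. In scalar C the generic is roughly the following (a hand-written sketch of what the op computes, with names of my choosing; not the actual generated code):

// Scalar sketch of the linalg.generic above (illustrative only): out/lhs use
// #map0 (identity indexing), rhs uses #map9 (transposed indexing).
void dispatch_2094_reference(const float lhs[1024][3072],
                             const float rhs[3072][1024],
                             float out[1024][3072]) {
  const float cst = 0.00999999977f;
  for (int i = 0; i < 1024; ++i) {
    for (int j = 0; j < 3072; ++j) {
      // rhs is read at (j, i): each inner-loop step jumps a full 1024-float
      // (4 KiB) row, so consecutive accesses land on different cache lines.
      out[i][j] = lhs[i][j] - rhs[j][i] * cst;
    }
  }
}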

Input MHLO model: https://storage.googleapis.com/iree-shared-files/nod-perf/bert_large_250M.mlir

Compilation flags:

--iree-stream-resource-index-bits=64
--iree-vm-target-index-bits=64
--iree-hal-target-backends=dylib-llvm-aot
--iree-llvm-target-triple=x86_64
--iree-llvm-target-cpu=host
--iree-llvm-target-cpu-features=host
--iree-codegen-llvm-number-of-threads=32

Run flags:

--task_topology_group_count=32
--function_input=1x512xi32=0
--function_input=1x512xi32=0
--function_input=1x512xi32=0
--function_input=1xi32=0

Added to buffer_heap.c iree_hal_heap_buffer_create:

    memset(buffer->data.data, 0xCD, buffer->data.data_length);  // touch every page up front so no faults occur during execution

Next steps are to get perf counters to verify what's causing the 10x difference (cache misses, pending memory ops, something else spooky, etc.), isolate this into a microbenchmark, and try it on some different systems.
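Something along these lines should be enough for a first pass at the counters on Linux (standard perf events; substitute the actual iree-benchmark-module invocation built from the flags above):

perf stat -e cycles,instructions,cache-references,cache-misses,context-switches,cpu-migrations \
    <iree-benchmark-module invocation with the run flags above>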

benvanik added the performance ⚡ (Performance/optimization related work across the compiler and runtime) and hal/cpu (Runtime Host/CPU-based HAL backend) labels on May 26, 2022
benvanik self-assigned this on May 26, 2022
@powderluv

Are --iree-stream-resource-index-bits=64 and --iree-vm-target-index-bits=64 the default now? Or would/should they become the default?

@benvanik commented Jun 1, 2022

(that's entirely unrelated to this issue?)

@powderluv

comment withdrawn :D

@benvanik

Some of this may be thread migration and lock contention (though definitely not all). On Windows I'm noticing that tracy with fiber tracing enabled causes an excessive number of context switches as the shared tracy lock is contended; it's bad enough that I don't think we can use tracy fibers by default right now if we also want to do any kind of performance investigation.

Even with tracy off entirely, though, I still see far more switches than I expected. Almost all of them come from locks, in particular the one in iree_task_queue_try_steal and the main executor lock in iree_task_executor_coordinate. Nearly every time work is stolen or executor coordination runs on a worker, we switch cores:
[tracy capture: context switches attributed to the task queue/executor locks]

Our workers dancing around cores:
[tracy capture: workers migrating across cores]

By just changing the coordination lock and the try-steal source locks to try-locks, quite a few of these drop out:
[tracy capture: context switches after the try-lock change]

It should look like this (a capture of initialization in a model with fake weights, which requires no coordination as it runs):
[tracy capture: workers staying on their cores]
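The try-lock version of the steal path is roughly this shape (a generic sketch with hypothetical names on plain pthreads, not the actual IREE code):

// Generic sketch of the try-steal change (hypothetical names): instead of
// blocking on a busy victim's lock (and getting context switched), fail fast
// and let the caller move on to the next victim queue.
#include <pthread.h>
#include <stddef.h>

typedef struct {
  pthread_mutex_t mutex;
  void* tasks;  // stand-in for the real task list
} victim_queue_t;

static void* victim_queue_try_steal(victim_queue_t* victim) {
  if (pthread_mutex_trylock(&victim->mutex) != 0) {
    return NULL;  // contended: report "nothing stolen" instead of waiting
  }
  void* stolen = victim->tasks;  // the real code would split the list here
  victim->tasks = NULL;
  pthread_mutex_unlock(&victim->mutex);
  return stolen;
}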

Making iree_task_queue_t wait-free would help a lot, as would better executor coordination behavior. A quick peek at tracy's lock visualization shows the iree_atomic_slist_t (which isn't actually atomic yet 🤦) as the most heavily contended lock, and iree_task_pool_release shows up as fairly contended as well (which makes sense, as shard tasks cycle through it from all threads).
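For reference, the usual shape of an actually-atomic slist (a minimal C11 sketch, not the IREE data structure): producers push with a CAS loop and the consumer detaches the whole list with a single exchange, so there's no lock to contend and no ABA hazard on the pop path:

#include <stdatomic.h>
#include <stddef.h>

typedef struct slist_node_t {
  struct slist_node_t* next;
} slist_node_t;

typedef struct {
  _Atomic(slist_node_t*) head;
} atomic_slist_t;

// Multi-producer push: lock-free CAS loop.
static void atomic_slist_push(atomic_slist_t* list, slist_node_t* node) {
  slist_node_t* head = atomic_load_explicit(&list->head, memory_order_relaxed);
  do {
    node->next = head;
  } while (!atomic_compare_exchange_weak_explicit(
      &list->head, &head, node, memory_order_release, memory_order_relaxed));
}

// Consumer: take the entire list at once (in LIFO order) with one exchange.
static slist_node_t* atomic_slist_flush(atomic_slist_t* list) {
  return atomic_exchange_explicit(&list->head, NULL, memory_order_acquire);
}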

@benvanik

The tracy fibers issue is that the fast thread-local queues get disabled and there's one central queue that gets all events from all threads. wolfpd says that we could probably fix this by having fiber enter/leave reset the thread_local ProducerWrapper s_token such that the normal lock-free version could be used.

@allieculp

@benvanik Still open? Still P1?
