
Investigate the 10x multi-threaded perf improvement hiding in distribution/scheduling. #9230

Open · benvanik opened this issue May 26, 2022 · 6 comments

@benvanik

Found a great example of memory system effects between single-threaded (red = 1 core) and multi-threaded (yellow = 32 cores) execution, shown as cumulative time (ignore the total times, as these are two benchmark runs with different iteration counts; the mean/median is what's relevant):
[tracy capture: cumulative dispatch time, 1 core vs 32 cores]
(This is with the allocator zero-page optimizations disabled so that there are no page faults during execution.)

This scales with worker count; here's red = 1 core / yellow = 8 cores:
[tracy capture: cumulative dispatch time, 1 core vs 8 cores]

Without tracy, 1 core / 8 cores / 32 cores:

--------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations
--------------------------------------------------------------------------
BM_learn/process_time/real_time      23854 ms        23849 ms            6   // 1
BM_learn/process_time/real_time       4776 ms        29411 ms           29   // 8
BM_learn/process_time/real_time       2684 ms        46996 ms           52   // 32

I believe this is either contention or a cache locality issue, but I have not dug in to see what exactly is bad (which loads/stores). It could also be memory bandwidth saturation. With 12 MB buffers it's definitely putting pressure on the cache hierarchy:

CPU Caches:
  L1 Data 32 KiB (x32)
  L1 Instruction 32 KiB (x32)
  L2 Unified 512 KiB (x32)
  L3 Unified 16384 KiB (x8)

What this hints at is that with 32x more threads each tile runs ~10x slower, so we end up with only ~3x lower latency overall for this particular dispatch (32 / 10 ≈ 3x), while burning ~10x more CPU than we would if we were scaling perfectly.
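For the whole model (not just this dispatch) the table above works out to roughly a 9x wall-clock speedup for 2x the total CPU time on 32 cores, i.e. ~28% parallel efficiency. A trivial check of that arithmetic, with the numbers plugged straight from the table:

#include <stdio.h>

int main(void) {
  // Whole-model numbers copied from the benchmark table above (ms/iteration).
  const double wall_1 = 23854.0, cpu_1 = 23849.0;   // 1 core
  const double wall_32 = 2684.0, cpu_32 = 46996.0;  // 32 cores
  printf("wall-clock speedup:  %.1fx\n", wall_1 / wall_32);                     // ~8.9x
  printf("CPU-time inflation:  %.1fx\n", cpu_32 / cpu_1);                       // ~2.0x
  printf("parallel efficiency: %.0f%%\n", 100.0 * wall_1 / (32.0 * wall_32));   // ~28%
  return 0;
}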

I picked this dispatch because it's a fairly mundane elementwise + transpose:

#map0 = affine_map<(d0, d1) -> (d0, d1)>
#map9 = affine_map<(d0, d1) -> (d1, d0)>
  stream.executable private @__inference_learn_231830_dispatch_2094 {
    stream.executable.export public @__inference_learn_231830_dispatch_2094
    builtin.module {
      func.func @__inference_learn_231830_dispatch_2094(%arg0: !stream.binding {stream.alignment = 64 : index}, %arg1: !stream.binding {stream.alignment = 64 : index}, %arg2: !stream.binding {stream.alignment = 64 : index}, %arg3: i32, %arg4: i32, %arg5: i32 {stream.values = [36954688 : i32, 237015040 : i32, 327192576 : i32, 417370112 : i32, 507547648 : i32, 597725184 : i32, 687902720 : i32, 778080256 : i32, 868257792 : i32, 958435328 : i32, 1048612864 : i32, 1138790400 : i32, 1228967936 : i32, 1319145472 : i32, 1409323008 : i32, 1499500544 : i32, 1589678080 : i32, 1679855616 : i32, 1770033152 : i32, 1860210688 : i32, 2040565760 : i32, 2130743296 : i32, -2074046464 : i32, -2040492032 : i32]}, %arg6: i32 {stream.values = [212684864 : i32, 246255680 : i32, 279826496 : i32, 313397312 : i32, 346968128 : i32, 380538944 : i32, 414109760 : i32, 447680576 : i32, 481251392 : i32, 514822208 : i32, 548393024 : i32, 581963840 : i32, 615534656 : i32, 649105472 : i32, 682676288 : i32, 716247104 : i32, 749817920 : i32, 783388736 : i32, 816959552 : i32, 850530368 : i32, 884101184 : i32, 917672000 : i32, 951242816 : i32, 984813632 : i32]}) {
        %cst = arith.constant 0.00999999977 : f32
        %c32_i64 = arith.constant 32 : i64
        %0 = arith.extui %arg3 : i32 to i64
        %1 = arith.extui %arg4 : i32 to i64
        %2 = arith.shli %1, %c32_i64 : i64
        %3 = arith.ori %0, %2 : i64
        %4 = arith.index_cast %3 : i64 to index
        %5 = arith.extui %arg5 : i32 to i64
        %6 = arith.index_cast %5 : i64 to index
        %7 = arith.extui %arg6 : i32 to i64
        %8 = arith.index_cast %7 : i64 to index
        %9 = stream.binding.subspan %arg0[%4] : !stream.binding -> !flow.dispatch.tensor<readonly:1024x3072xf32>
        %10 = stream.binding.subspan %arg1[%6] : !stream.binding -> !flow.dispatch.tensor<readonly:3072x1024xf32>
        %11 = stream.binding.subspan %arg2[%8] : !stream.binding -> !flow.dispatch.tensor<writeonly:1024x3072xf32>
        %12 = flow.dispatch.tensor.load %9, offsets = [0, 0], sizes = [1024, 3072], strides = [1, 1] : !flow.dispatch.tensor<readonly:1024x3072xf32> -> tensor<1024x3072xf32>
        %13 = flow.dispatch.tensor.load %10, offsets = [0, 0], sizes = [3072, 1024], strides = [1, 1] : !flow.dispatch.tensor<readonly:3072x1024xf32> -> tensor<3072x1024xf32>
        %14 = linalg.init_tensor [1024, 3072] : tensor<1024x3072xf32>
        %15 = linalg.generic {indexing_maps = [#map0, #map9, #map0], iterator_types = ["parallel", "parallel"]} ins(%12, %13 : tensor<1024x3072xf32>, tensor<3072x1024xf32>) outs(%14 : tensor<1024x3072xf32>) {
        ^bb0(%arg7: f32, %arg8: f32, %arg9: f32):
          %16 = arith.mulf %arg8, %cst : f32
          %17 = arith.subf %arg7, %16 : f32
          linalg.yield %17 : f32
        } -> tensor<1024x3072xf32>
        flow.dispatch.tensor.store %15, %11, offsets = [0, 0], sizes = [1024, 3072], strides = [1, 1] : tensor<1024x3072xf32> -> !flow.dispatch.tensor<writeonly:1024x3072xf32>
        return
      }
    }
  }
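For scale: each of those 1024x3072 f32 tensors is 12 MiB, so the dispatch streams ~36 MiB against the 512 KiB L2s and 16 MiB L3 slices listed above. In scalar C the generic is roughly the following (a hand-written sketch of what the op computes, with names of my choosing; not the actual generated code):

// Scalar sketch of the linalg.generic above (illustrative only): out/lhs use
// #map0 (identity indexing), rhs uses #map9 (transposed indexing).
void dispatch_2094_reference(const float lhs[1024][3072],
                             const float rhs[3072][1024],
                             float out[1024][3072]) {
  const float cst = 0.00999999977f;
  for (int i = 0; i < 1024; ++i) {
    for (int j = 0; j < 3072; ++j) {
      // rhs is read at (j, i): each inner-loop step jumps a full 1024-float
      // (4 KiB) row, so consecutive accesses land on different cache lines.
      out[i][j] = lhs[i][j] - rhs[j][i] * cst;
    }
  }
}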

Input MHLO model: https://storage.googleapis.com/iree-shared-files/nod-perf/bert_large_250M.mlir

Compilation flags:

--iree-stream-resource-index-bits=64
--iree-vm-target-index-bits=64
--iree-hal-target-backends=dylib-llvm-aot
--iree-llvm-target-triple=x86_64
--iree-llvm-target-cpu=host
--iree-llvm-target-cpu-features=host
--iree-codegen-llvm-number-of-threads=32

Run flags:

--task_topology_group_count=32
--function_input=1x512xi32=0
--function_input=1x512xi32=0
--function_input=1x512xi32=0
--function_input=1xi32=0

Added to buffer_heap.c iree_hal_heap_buffer_create:

    memset(buffer->data.data, 0xCD, buffer->data.data_length);  // touch every page up front so no faults occur during execution

Next steps are to get perf counters to verify what's causing the 10x difference (cache misses, pending memory ops, something else spooky, etc.), isolate this into a microbenchmark, and try it on some different systems.
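Something along these lines should be enough for a first pass at the counters on Linux (standard perf events; substitute the actual iree-benchmark-module invocation built from the flags above):

perf stat -e cycles,instructions,cache-references,cache-misses,context-switches,cpu-migrations \
    <iree-benchmark-module invocation with the run flags above>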

benvanik added the performance ⚡ (Performance/optimization related work across the compiler and runtime) and hal/cpu (Runtime Host/CPU-based HAL backend) labels on May 26, 2022
benvanik self-assigned this on May 26, 2022
@powderluv

Are --iree-stream-resource-index-bits=64 and --iree-vm-target-index-bits=64 the default now? Or would/should they become the default?

@benvanik commented Jun 1, 2022

(that's entirely unrelated to this issue?)

@powderluv

comment withdrawn :D

@benvanik

Some of this may be thread migration and lock contention (though definitely not all). On Windows I'm noticing that tracy with fiber tracing enabled causes an excessive number of context switches as the shared tracy lock is contended; it's bad enough that I don't think we can use tracy fibers by default right now if we also want to do any kind of performance investigation.

Even with tracy off entirely, though, I still see far more switches than I expected. Almost all of them come from locks, in particular the one in iree_task_queue_try_steal and the main executor lock in iree_task_executor_coordinate. Nearly every time work is stolen or executor coordination runs on a worker, we switch cores:
[tracy capture: context switches attributed to the task queue/executor locks]

Our workers dancing around cores:
[tracy capture: workers migrating across cores]

By just changing the coordination lock and the try-steal source locks to try-locks, quite a few of these drop out:
[tracy capture: context switches after the try-lock change]

It should look like this (a capture of initialization in a model with fake weights, which requires no coordination as it runs):
[tracy capture: workers staying on their cores]
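The try-lock version of the steal path is roughly this shape (a generic sketch with hypothetical names on plain pthreads, not the actual IREE code):

// Generic sketch of the try-steal change (hypothetical names): instead of
// blocking on a busy victim's lock (and getting context switched), fail fast
// and let the caller move on to the next victim queue.
#include <pthread.h>
#include <stddef.h>

typedef struct {
  pthread_mutex_t mutex;
  void* tasks;  // stand-in for the real task list
} victim_queue_t;

static void* victim_queue_try_steal(victim_queue_t* victim) {
  if (pthread_mutex_trylock(&victim->mutex) != 0) {
    return NULL;  // contended: report "nothing stolen" instead of waiting
  }
  void* stolen = victim->tasks;  // the real code would split the list here
  victim->tasks = NULL;
  pthread_mutex_unlock(&victim->mutex);
  return stolen;
}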

Making iree_task_queue_t wait-free would help a lot, as would better executor coordination behavior. A quick peek at tracy's lock visualization shows the iree_atomic_slist_t (which isn't actually atomic yet 🤦) as the most heavily contended lock, and iree_task_pool_release shows up as fairly contended as well (which makes sense, as shard tasks cycle through it from all threads).
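For reference, the usual shape of an actually-atomic slist (a minimal C11 sketch, not the IREE data structure): producers push with a CAS loop and the consumer detaches the whole list with a single exchange, so there's no lock to contend and no ABA hazard on the pop path:

#include <stdatomic.h>
#include <stddef.h>

typedef struct slist_node_t {
  struct slist_node_t* next;
} slist_node_t;

typedef struct {
  _Atomic(slist_node_t*) head;
} atomic_slist_t;

// Multi-producer push: lock-free CAS loop.
static void atomic_slist_push(atomic_slist_t* list, slist_node_t* node) {
  slist_node_t* head = atomic_load_explicit(&list->head, memory_order_relaxed);
  do {
    node->next = head;
  } while (!atomic_compare_exchange_weak_explicit(
      &list->head, &head, node, memory_order_release, memory_order_relaxed));
}

// Consumer: take the entire list at once (in LIFO order) with one exchange.
static slist_node_t* atomic_slist_flush(atomic_slist_t* list) {
  return atomic_exchange_explicit(&list->head, NULL, memory_order_acquire);
}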

@benvanik

The tracy fibers issue is that the fast thread-local queues get disabled and there's one central queue that gets all events from all threads. wolfpd says that we could probably fix this by having fiber enter/leave reset the thread_local ProducerWrapper s_token such that the normal lock-free version could be used.

@allieculp

@benvanik Still open? Still P1?
