facilitate vectorization for generic build #4223

heshpdx · 2024-04-10T22:08:07Z

In the process of developing tesseract for SPEC CPUv8 with @zdenop 's help, we used the generic code path to future-proof the benchmark for architectures that do not yet exist. Hence we disable the intrinsics, and rather rely on the C++ code directly. Here, we found an opportunity to assist the compiler in the IntSimdMatrix::MatrixDotVector function. We split the loop so it first does all the work in chunks of four, and then clean up with whatever remainder is left over. If there is a guarantee that w.dim1() is always a multiple of four, then the remainder code is not required. So, this offers some perf improvement on the generic code path, depending on your compiler and system.

Unroll loop into chunks of four operations to facilitate compilers recognizing vectorization opportunity.

src/arch/intsimdmatrix.cpp

stweil · 2024-04-11T05:15:01Z

That looks interesting. Did you already run benchmarks which compare the old and the new code?

heshpdx · 2024-04-11T05:21:40Z

That looks interesting. Did you already run benchmarks which compare the old and the new code?

There was a couple percentage points improvement on my Ampere Altra with gcc-13 -O3, however your mileage will vary.

Co-authored-by: Stefan Weil <sw@weilnetz.de>

src/arch/intsimdmatrix.cpp

heshpdx · 2024-04-16T16:13:29Z

Looks like the macos-13 runner failed due to timeout. Should we kick it and try again?

egorpugin · 2024-04-16T16:23:56Z

No need. It happens.

stweil · 2024-04-16T17:29:07Z

Thanks!

zdenop · 2024-04-18T09:39:40Z

Just to give some hint about this patch:
I made the speed test on old linux computer with following results:
Average speed before applying this code: 61 s.
Average speed after applying this code: 37 s.

Tesseract was build with default configuration options.
Interestingly - when I tested applying #pragma omp parallel for for this patch, the speed drop down to 88 s. So using OpemMP should be carefully tested on old systems.

CPU info:

Architecture:           x86_64
  CPU op-mode(s):       32-bit, 64-bit
  Address sizes:        36 bits physical, 48 bits virtual
  Byte Order:           Little Endian
CPU(s):                 1
  On-line CPU(s) list:  0
Vendor ID:              GenuineIntel
  Model name:           Intel(R) Core(TM)2 CPU          4300  @ 1.80GHz
    CPU family:         6
    Model:              15
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s):          1
    Stepping:           2
    CPU max MHz:        1800.0000
    CPU min MHz:        1200.0000
    BogoMIPS:           3591.24
    Flags:              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse ss
                        e2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64
                        monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm pti dtherm
Caches (sum of all):
  L1d:                  32 KiB (1 instance)
  L1i:                  32 KiB (1 instance)
  L2:                   2 MiB (1 instance)
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        KVM: Mitigation: VMX unsupported
  L1tf:                 Mitigation; PTE Inversion
  Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
  Meltdown:             Mitigation; PTI
  Mmio stale data:      Unknown: No mitigations
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Vulnerable
  Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                Not affected
  Tsx async abort:      Not affected

stweil · 2024-04-18T09:53:16Z

Could you please try #pragma omp simd reduction and #pragma omp simd reduction(+:total)?

zdenop · 2024-04-18T11:02:20Z

Do you mean this?

--- a/src/arch/intsimdmatrix.cpp
+++ b/src/arch/intsimdmatrix.cpp
@@ -91,6 +91,7 @@ void IntSimdMatrix::MatrixDotVector(const GENERIC_2D_ARRAY<int8_t> &w,
     int total1 = 0;
     int total2 = 0;
     int total3 = 0;
+#pragma omp simd reduction(+ : total0, total1, total2, total3)
     for (int j = 0; j < num_in; ++j) {
       total0 += wi0[j] * u[j];
       total1 += wi1[j] * u[j];
diff --git a/test b/test
index 27618999..3ea10996 160000

Is it slightly slower by 0.2 s.

Update intsimdmatrix.cpp

ebd70c2

Unroll loop into chunks of four operations to facilitate compilers recognizing vectorization opportunity.

stweil reviewed Apr 11, 2024

View reviewed changes

src/arch/intsimdmatrix.cpp Outdated Show resolved Hide resolved

Apply suggestions from code review

fc7c521

Co-authored-by: Stefan Weil <sw@weilnetz.de>

heshpdx commented Apr 11, 2024

View reviewed changes

src/arch/intsimdmatrix.cpp Show resolved Hide resolved

zdenop approved these changes Apr 12, 2024

View reviewed changes

stweil approved these changes Apr 16, 2024

View reviewed changes

stweil merged commit bae520e into tesseract-ocr:main Apr 16, 2024
6 of 7 checks passed

BrewTestBot mentioned this pull request Jun 6, 2024

tesseract 5.4.0 Homebrew/homebrew-core#173888

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

facilitate vectorization for generic build #4223

facilitate vectorization for generic build #4223

heshpdx commented Apr 10, 2024

stweil commented Apr 11, 2024

heshpdx commented Apr 11, 2024

heshpdx commented Apr 16, 2024

egorpugin commented Apr 16, 2024

stweil commented Apr 16, 2024

zdenop commented Apr 18, 2024

stweil commented Apr 18, 2024

zdenop commented Apr 18, 2024

facilitate vectorization for generic build #4223

facilitate vectorization for generic build #4223

Conversation

heshpdx commented Apr 10, 2024

stweil commented Apr 11, 2024

heshpdx commented Apr 11, 2024

heshpdx commented Apr 16, 2024

egorpugin commented Apr 16, 2024

stweil commented Apr 16, 2024

zdenop commented Apr 18, 2024

stweil commented Apr 18, 2024

zdenop commented Apr 18, 2024