-
Notifications
You must be signed in to change notification settings - Fork 9.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
facilitate vectorization for generic build #4223
Conversation
Unroll loop into chunks of four operations to facilitate compilers recognizing vectorization opportunity.
That looks interesting. Did you already run benchmarks which compare the old and the new code? |
There was a couple percentage points improvement on my Ampere Altra with gcc-13 -O3, however your mileage will vary. |
Co-authored-by: Stefan Weil <sw@weilnetz.de>
Looks like the macos-13 runner failed due to timeout. Should we kick it and try again? |
No need. It happens. |
Thanks! |
Just to give some hint about this patch: Tesseract was build with default configuration options. CPU info: Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 36 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM)2 CPU 4300 @ 1.80GHz
CPU family: 6
Model: 15
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
Stepping: 2
CPU max MHz: 1800.0000
CPU min MHz: 1200.0000
BogoMIPS: 3591.24
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse ss
e2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64
monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm pti dtherm
Caches (sum of all):
L1d: 32 KiB (1 instance)
L1i: 32 KiB (1 instance)
L2: 2 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected |
Could you please try |
Do you mean this? --- a/src/arch/intsimdmatrix.cpp
+++ b/src/arch/intsimdmatrix.cpp
@@ -91,6 +91,7 @@ void IntSimdMatrix::MatrixDotVector(const GENERIC_2D_ARRAY<int8_t> &w,
int total1 = 0;
int total2 = 0;
int total3 = 0;
+#pragma omp simd reduction(+ : total0, total1, total2, total3)
for (int j = 0; j < num_in; ++j) {
total0 += wi0[j] * u[j];
total1 += wi1[j] * u[j];
diff --git a/test b/test
index 27618999..3ea10996 160000 Is it slightly slower by 0.2 s. |
In the process of developing tesseract for SPEC CPUv8 with @zdenop 's help, we used the generic code path to future-proof the benchmark for architectures that do not yet exist. Hence we disable the intrinsics, and rather rely on the C++ code directly. Here, we found an opportunity to assist the compiler in the
IntSimdMatrix::MatrixDotVector
function. We split the loop so it first does all the work in chunks of four, and then clean up with whatever remainder is left over. If there is a guarantee thatw.dim1()
is always a multiple of four, then the remainder code is not required. So, this offers some perf improvement on the generic code path, depending on your compiler and system.