Maximize ALU utilization by avoiding pipeline bubbles #72

PENGUINLIONG · 2020-10-30T04:40:45Z

Profiling data shows that the current implementation reaches only 70% of the full ALU capacity (on Adreno 640), limited by data dependencies between instructions. The provided implementation can reach 100% ALU utilization and so reflects the actual peak performance of an OpenCL platform.

krrishnarraj · 2020-11-08T11:22:16Z

I tried this patch on my local PC. This is the output on the PoCL platform:

Platform: Portable Computing Language
Device: pthread-Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Driver version : 1.5 (Linux x64)
Compute units : 12
Clock frequency : 4100 MHz

Global memory bandwidth (GBPS)
  float   : 27.44
  float2  : 29.97
  float4  : 31.91
  float8  : 31.71
  float16 : 27.86

Single-precision compute (GFLOPS)
  float   : 23.49
  float2  : 47.32
  float4  : 95.59
  float8  : 190.82
  float16 : 343.63

No half precision support! Skipped

Double-precision compute (GFLOPS)
  double   : 23.37
  double2  : 46.87
  double4  : 94.91
  double8  : 175.67
  double16 : 280.08

Integer compute (GIOPS)
  int   : 27183.34
  int2  : 8061.12
  int4  : 5482.94
  int8  : 2874.55
  int16 : 2959.33

Integer compute Fast 24bit (GIOPS)
  int   : 27745.27
  int2  : 8186.09
  int4  : 5211.50
  int8  : 2745.67
  int16 : 3050.98

Transfer bandwidth (GBPS)
  enqueueWriteBuffer              : 15.34
  enqueueReadBuffer               : 15.32
  enqueueWriteBuffer non-blocking : 15.39
  enqueueReadBuffer non-blocking  : 15.37
  enqueueMapBuffer(for read)      : 11645.79
    memcpy from mapped ptr        : 15.22
  enqueueUnmap(after write)       : 16989.59
    memcpy to mapped ptr          : 15.26

Kernel launch latency : 17.60 us

The integer compute numbers look abnormally high. My guess is that the compiler is optimising away the redundant calculations on the right-hand side in
#define MAD_4(x, y, z) z += (y*x) + y; z += (x*y) + x; z += (y*x) + y; z += (x*y) + x;
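To make the guess above concrete, here is a minimal C sketch (not the project's kernel code). It assumes the macro is applied repeatedly inside a loop; `mad_loop` and `mad_folded` are hypothetical names, and unsigned integers are used so that wraparound is well defined. Because x and y are never written, every right-hand side is loop-invariant, and for integer types the whole loop collapses to a closed form that an optimizer is free to emit.

```c
/* The macro as quoted above: x and y are only read, never updated. */
#define MAD_4(x, y, z) z += (y*x) + y; z += (x*y) + x; z += (y*x) + y; z += (x*y) + x;

/* Hypothetical helper: applies MAD_4 `iters` times, as a benchmark loop might. */
static unsigned mad_loop(unsigned x, unsigned y, unsigned iters) {
    unsigned z = 0;
    for (unsigned i = 0; i < iters; ++i) {
        MAD_4(x, y, z)
    }
    return z;
}

/* Equivalent closed form: both right-hand sides are computed once,
   and the loop reduces to a single multiply by the trip count. */
static unsigned mad_folded(unsigned x, unsigned y, unsigned iters) {
    unsigned a = y * x + y;  /* value added by statements 1 and 3 */
    unsigned b = x * y + x;  /* value added by statements 2 and 4 */
    return iters * 2u * (a + b);
}
```

Both functions return the same value for any inputs (modulo 2^32), so a benchmark that counts every `z +=` statement as executed work could report far more operations than the hardware actually performs.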

doe300 · 2020-11-08T12:09:34Z

Wouldn't this also invalidate all previous benchmark results?

krrishnarraj · 2020-11-08T14:57:33Z

No. These results are from this patchset. Introducing the z variable removes the dependency between statements and allows the compiler to optimise out the repeated (y*x) + y calculation.

krrishnarraj · 2020-11-08T15:02:03Z

I understand the intention here. We need a better way to optimise this.
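One possible direction, sketched here in plain C and not taken from this patch, is to keep every statement dependent on the previous one within a chain while running several chains that share no variables. The names `MAD_CHAIN`, `mad_one_chain`, and `mad_two_chains` are hypothetical, and the choice of two chains is an assumption; whether this fills the ALU pipeline on a given device would still have to be measured.

```c
/* Each statement consumes the result of the one before it, so nothing in
   the loop body is loop-invariant and nothing can be hoisted out. */
#define MAD_CHAIN(x, y) x = (y * x) + y; y = (x * y) + x;

/* Reference: a single dependent chain. */
static unsigned mad_one_chain(unsigned x, unsigned y, unsigned iters) {
    for (unsigned i = 0; i < iters; ++i) {
        MAD_CHAIN(x, y)
    }
    return x + y;
}

/* Two chains with disjoint variables, interleaved in one loop body.
   The chains are independent of each other, so their instructions
   can overlap in the pipeline. */
static unsigned mad_two_chains(unsigned seed, unsigned iters) {
    unsigned x0 = seed,      y0 = seed + 1u;
    unsigned x1 = seed + 2u, y1 = seed + 3u;
    for (unsigned i = 0; i < iters; ++i) {
        MAD_CHAIN(x0, y0)
        MAD_CHAIN(x1, y1)
    }
    /* Combine all results so neither chain is dead code. */
    return (x0 + y0) + (x1 + y1);
}
```

The interleaved version computes exactly what two separate single-chain runs would compute, so the operation count stays honest while the scheduler has independent work available to hide latency.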

Clean up

ee4aa77

PENGUINLIONG force-pushed the master branch from 07ff520 to 97a1b41 on April 4, 2021 13:17

Make it easier to compile

f5963b5

PENGUINLIONG force-pushed the master branch from 97a1b41 to f5963b5 on April 4, 2021 13:20
