Maximize ALU utilization by avoiding pipeline bubbles #72

Open
wants to merge 2 commits into
base: master

PENGUINLIONG

Profiling data shows that the current implementation reaches only 70% of full ALU capacity (on Adreno 640) because of data dependencies between instructions. The provided implementation can reach 100% ALU utilization and so reflects the actual maximum performance of an OpenCL platform.

@krrishnarraj
Owner

krrishnarraj commented Nov 8, 2020

I tried this patch on my local PC. This is the output on the PoCL platform:

Platform: Portable Computing Language
Device: pthread-Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Driver version : 1.5 (Linux x64)
Compute units : 12
Clock frequency : 4100 MHz

Global memory bandwidth (GBPS)
  float   : 27.44
  float2  : 29.97
  float4  : 31.91
  float8  : 31.71
  float16 : 27.86

Single-precision compute (GFLOPS)
  float   : 23.49
  float2  : 47.32
  float4  : 95.59
  float8  : 190.82
  float16 : 343.63

No half precision support! Skipped

Double-precision compute (GFLOPS)
  double   : 23.37
  double2  : 46.87
  double4  : 94.91
  double8  : 175.67
  double16 : 280.08

Integer compute (GIOPS)
  int   : 27183.34
  int2  : 8061.12
  int4  : 5482.94
  int8  : 2874.55
  int16 : 2959.33

Integer compute Fast 24bit (GIOPS)
  int   : 27745.27
  int2  : 8186.09
  int4  : 5211.50
  int8  : 2745.67
  int16 : 3050.98

Transfer bandwidth (GBPS)
  enqueueWriteBuffer              : 15.34
  enqueueReadBuffer               : 15.32
  enqueueWriteBuffer non-blocking : 15.39
  enqueueReadBuffer non-blocking  : 15.37
  enqueueMapBuffer(for read)      : 11645.79
    memcpy from mapped ptr        : 15.22
  enqueueUnmap(after write)       : 16989.59
    memcpy to mapped ptr          : 15.26

Kernel launch latency : 17.60 us

The integer compute numbers look abnormally high. My guess is that the compiler is optimising away the redundant calculations on the right-hand side of:
#define MAD_4(x, y, z) z += (y*x) + y; z += (x*y) + x; z += (y*x) + y; z += (x*y) + x;
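
To illustrate the suspected folding, here is a minimal host-side C sketch (not the OpenCL kernel from the patch). The macro is copied from the line above; the helper names `run_macro`, `run_folded`, and `check_equivalence` are hypothetical, and unsigned 32-bit integers are assumed so that wraparound is well defined. Because `x` and `y` are never modified, every right-hand side is a constant, and a whole loop of `MAD_4` calls is equal to a single multiplication.

```c
#include <assert.h>
#include <stdint.h>

/* Macro as quoted in the comment above. */
#define MAD_4(x, y, z) z += (y*x) + y; z += (x*y) + x; z += (y*x) + y; z += (x*y) + x;

/* Hypothetical helper: applies the macro `iters` times, as a benchmark loop might. */
static uint32_t run_macro(uint32_t x, uint32_t y, uint32_t iters) {
    uint32_t z = 0;
    for (uint32_t i = 0; i < iters; ++i) {
        MAD_4(x, y, z)
    }
    return z;
}

/* What an optimizer is allowed to compute instead: x and y never change,
 * so the two distinct right-hand sides can be evaluated once and scaled. */
static uint32_t run_folded(uint32_t x, uint32_t y, uint32_t iters) {
    uint32_t a = y * x + y; /* first and third statement of MAD_4 */
    uint32_t b = x * y + x; /* second and fourth statement of MAD_4 */
    return iters * (2u * a + 2u * b);
}

/* Both versions agree, including when the arithmetic wraps around. */
static void check_equivalence(void) {
    assert(run_macro(3u, 5u, 1000u) == run_folded(3u, 5u, 1000u));
    assert(run_macro(0xFFFFFFFFu, 7u, 10u) == run_folded(0xFFFFFFFFu, 7u, 10u));
}
```

If a compiler performs this transformation, the kernel executes far fewer integer operations than the benchmark counts, which would be consistent with the inflated GIOPS figures above. Whether PoCL actually does so here has not been verified.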

@doe300
Contributor

doe300 commented Nov 8, 2020

Wouldn't this also invalidate all previous benchmark results?

@krrishnarraj
Owner

No. These results are with this patchset applied. Introducing the z variable removes the dependency between statements, and since x and y are never modified, the compiler is free to optimise out the repeated (y*x) + y calculation.

@krrishnarraj
Owner

I understand the intention here. We need a better way to optimise this.
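
One possible direction, sketched here in host-side C rather than as a tested kernel change, is to keep a true data dependency inside each chain while running several independent chains side by side. Each chain's result feeds its own next step, so nothing can be hoisted out of the loop, but the chains do not depend on one another and can overlap in the pipeline. The names `CHAINS`, `single_chain`, and `run_chains`, the chain count of 4, and the seeding scheme are all assumptions made for illustration; this is not the approach taken in the patch or proposed by the maintainers.

```c
#include <assert.h>
#include <stdint.h>

#define CHAINS 4 /* assumed number of independent chains */

/* Reference: one dependent chain, each statement consumes the previous result. */
static uint32_t single_chain(uint32_t x, uint32_t y, uint32_t iters) {
    for (uint32_t i = 0; i < iters; ++i) {
        x = y * x + y;
        y = x * y + x;
    }
    return x + y;
}

/* Interleaved version: CHAINS dependent chains advanced in lockstep.
 * Steps of different chains are independent and can be issued back to back. */
static uint32_t run_chains(uint32_t seed, uint32_t iters) {
    uint32_t x[CHAINS], y[CHAINS];
    for (uint32_t c = 0; c < CHAINS; ++c) {
        x[c] = seed + c;
        y[c] = seed + 2u * c + 1u;
    }
    for (uint32_t i = 0; i < iters; ++i) {
        for (uint32_t c = 0; c < CHAINS; ++c) {
            x[c] = y[c] * x[c] + y[c];
            y[c] = x[c] * y[c] + x[c];
        }
    }
    /* Combine all chains so that none of them is dead code. */
    uint32_t sum = 0;
    for (uint32_t c = 0; c < CHAINS; ++c) {
        sum += x[c] + y[c];
    }
    return sum;
}
```

In a real kernel the seeds would need to come from runtime input and the combined result would need to be written to the output buffer; otherwise the compiler could still fold the computation at build time. How many chains are needed to hide ALU latency is device dependent and has not been measured here.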

3 participants