Updated results with various Intel, AMD and NVidia hardware #75

Merged 1 commit on Feb 10, 2021
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@

Platform: AMD Accelerated Parallel Processing
Device: gfx906
Driver version : 3204.0 (HSA1.1,LC) (Linux x64)
Compute units : 60
Clock frequency : 1725 MHz

Global memory bandwidth (GBPS)
float : 766.24
float2 : 756.53
float4 : 740.95
float8 : 727.71
float16 : 685.31

Single-precision compute (GFLOPS)
float : 12886.15
float2 : 12773.94
float4 : 12636.76
float8 : 12363.97
float16 : 12180.00

Half-precision compute (GFLOPS)
half : 6522.77
half2 : 24971.55
half4 : 24781.20
half8 : 24465.16
half16 : 23955.72

Double-precision compute (GFLOPS)
double : 6350.20
double2 : 6319.02
double4 : 6291.70
double8 : 5880.47
double16 : 6143.47

Integer compute (GIOPS)
int : 4325.27
int2 : 4317.88
int4 : 4307.68
int8 : 4289.82
int16 : 4242.46

Integer compute Fast 24bit (GIOPS)
int : 12395.53
int2 : 12199.22
int4 : 11631.28
int8 : 11757.87
int16 : 11833.97

Transfer bandwidth (GBPS)
enqueueWriteBuffer : 11.86
enqueueReadBuffer : 11.53
enqueueWriteBuffer non-blocking : 11.52
enqueueReadBuffer non-blocking : 11.43
enqueueMapBuffer(for read) : 192599.44
memcpy from mapped ptr : 11.78
enqueueUnmap(after write) : 286331.16
memcpy to mapped ptr : 11.97

Kernel launch latency : 11.44 us

@@ -0,0 +1,61 @@

Platform: AMD Accelerated Parallel Processing
Device: gfx906
Driver version : 3212.0 (HSA1.1,LC) (Linux x64)
Compute units : 60
Clock frequency : 1725 MHz

Global memory bandwidth (GBPS)
float : 767.51
float2 : 748.01
float4 : 675.73
float8 : 717.26
float16 : 585.05

Single-precision compute (GFLOPS)
float : 12713.40
float2 : 12396.88
float4 : 12340.44
float8 : 12001.62
float16 : 11861.35

Half-precision compute (GFLOPS)
half : 6434.60
half2 : 23781.07
half4 : 23540.66
half8 : 23181.37
half16 : 22714.81

Double-precision compute (GFLOPS)
double : 6084.21
double2 : 6160.23
double4 : 5970.83
double8 : 5964.05
double16 : 5833.33

Integer compute (GIOPS)
int : 4241.74
int2 : 4223.93
int4 : 4227.38
int8 : 4198.92
int16 : 4162.48

Integer compute Fast 24bit (GIOPS)
int : 11717.45
int2 : 11599.73
int4 : 11107.29
int8 : 11331.84
int16 : 11263.35

Transfer bandwidth (GBPS)
enqueueWriteBuffer : 15.68
enqueueReadBuffer : 15.39
enqueueWriteBuffer non-blocking : 15.61
enqueueReadBuffer non-blocking : 11.47
enqueueMapBuffer(for read) : 85048.86
memcpy from mapped ptr : 15.67
enqueueUnmap(after write) : 182764.56
memcpy to mapped ptr : 16.07

Kernel launch latency : 10.55 us

@@ -0,0 +1,61 @@

Platform: AMD Accelerated Parallel Processing
Device: gfx906+sram-ecc
Driver version : 3137.0 (HSA1.1,LC) (Linux x64)
Compute units : 60
Clock frequency : 1725 MHz

Global memory bandwidth (GBPS)
float : 765.79
float2 : 655.94
float4 : 645.82
float8 : 652.67
float16 : 582.26

Single-precision compute (GFLOPS)
float : 12710.35
float2 : 12307.32
float4 : 12124.76
float8 : 12007.03
float16 : 11834.00

Half-precision compute (GFLOPS)
half : 6422.43
half2 : 23564.34
half4 : 23395.76
half8 : 23167.34
half16 : 22676.43

Double-precision compute (GFLOPS)
double : 5978.52
double2 : 5953.91
double4 : 5929.22
double8 : 5892.56
double16 : 5814.56

Integer compute (GIOPS)
int : 4238.15
int2 : 4228.25
int4 : 4214.90
int8 : 4198.91
int16 : 4149.22

Integer compute Fast 24bit (GIOPS)
int : 11816.17
int2 : 11582.84
int4 : 11094.79
int8 : 11323.87
int16 : 11321.21

Transfer bandwidth (GBPS)
enqueueWriteBuffer : 15.91
enqueueReadBuffer : 15.35
enqueueWriteBuffer non-blocking : 11.95
enqueueReadBuffer non-blocking : 12.24
enqueueMapBuffer(for read) : 130150.53
memcpy from mapped ptr : 15.90
enqueueUnmap(after write) : 248264.02
memcpy to mapped ptr : 16.02

Kernel launch latency : 15.64 us

61 changes: 61 additions & 0 deletions results/AMD_Accelerated_Parallel_Processing/Radeon-VII-Pro.log
@@ -0,0 +1,61 @@

Platform: AMD Accelerated Parallel Processing
Device: gfx906
Driver version : 3204.0 (HSA1.1,LC) (Linux x64)
Compute units : 60
Clock frequency : 1700 MHz

Global memory bandwidth (GBPS)
float : 783.04
float2 : 741.34
float4 : 723.88
float8 : 732.36
float16 : 679.49

Single-precision compute (GFLOPS)
float : 12727.97
float2 : 12632.55
float4 : 12403.68
float8 : 12147.13
float16 : 11960.99

Half-precision compute (GFLOPS)
half : 6425.83
half2 : 24459.28
half4 : 24278.00
half8 : 23921.18
half16 : 23455.81

Double-precision compute (GFLOPS)
double : 6206.76
double2 : 6176.21
double4 : 6135.32
double8 : 6107.36
double16 : 5924.13

Integer compute (GIOPS)
int : 4186.51
int2 : 4019.41
int4 : 4003.08
int8 : 4029.69
int16 : 3976.25

Integer compute Fast 24bit (GIOPS)
int : 11493.50
int2 : 10816.38
int4 : 10109.61
int8 : 10421.03
int16 : 10354.31

Transfer bandwidth (GBPS)
enqueueWriteBuffer : 16.91
enqueueReadBuffer : 16.85
enqueueWriteBuffer non-blocking : 16.91
enqueueReadBuffer non-blocking : 16.83
enqueueMapBuffer(for read) : 128591.83
memcpy from mapped ptr : 16.77
enqueueUnmap(after write) : 238609.30
memcpy to mapped ptr : 16.91

Kernel launch latency : 14.06 us

56 changes: 56 additions & 0 deletions results/Intel(R)_OpenCL/Intel_Core_i7-10875h.log
@@ -0,0 +1,56 @@

Platform: Intel(R) CPU Runtime for OpenCL(TM) Applications
Device: Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
Driver version : 18.1.0.0920 (Linux x64)
Compute units : 16
Clock frequency : 2300 MHz

Global memory bandwidth (GBPS)
float : 25.86
float2 : 26.54
float4 : 24.25
float8 : 22.53
float16 : 22.85

Single-precision compute (GFLOPS)
float : 160.53
float2 : 319.05
float4 : 573.63
float8 : 593.37
float16 : 320.20

No half precision support! Skipped

Double-precision compute (GFLOPS)
double : 160.23
double2 : 289.43
double4 : 295.46
double8 : 171.52
double16 : 257.58

Integer compute (GIOPS)
int : 58.06
int2 : 112.07
int4 : 206.06
int8 : 133.53
int16 : 257.37

Integer compute Fast 24bit (GIOPS)
int : 49.43
int2 : 89.90
int4 : 139.45
int8 : 151.41
int16 : 88.01

Transfer bandwidth (GBPS)
enqueueWriteBuffer : 10.69
enqueueReadBuffer : 10.74
enqueueWriteBuffer non-blocking : 10.68
enqueueReadBuffer non-blocking : 10.57
enqueueMapBuffer(for read) : 13801.31
memcpy from mapped ptr : 10.75
enqueueUnmap(after write) : 16698.94
memcpy to mapped ptr : 10.85

Kernel launch latency : 3.27 us

56 changes: 56 additions & 0 deletions results/Intel(R)_OpenCL/Xeon_Phi_5110.log
@@ -0,0 +1,56 @@

Platform: Intel(R) OpenCL
Device: Intel(R) Many Integrated Core Acceleration Card
Driver version : 1.2 (Linux x64)
Compute units : 236
Clock frequency : 1052 MHz

Global memory bandwidth (GBPS)
float : 62.52
float2 : 44.56
float4 : 76.55
float8 : 84.92
float16 : 2.15

Single-precision compute (GFLOPS)
float : 1778.74
float2 : 1889.33
float4 : 1884.25
float8 : 1877.49
float16 : 1850.36

No half precision support! Skipped

Double-precision compute (GFLOPS)
double : 967.75
double2 : 966.69
double4 : 964.23
double8 : 958.01
double16 : 295.92

Integer compute (GIOPS)
int : 968.24
int2 : 970.23
int4 : 968.07
int8 : 968.20
int16 : 958.80

Integer compute Fast 24bit (GIOPS)
int : 968.37
int2 : 969.56
int4 : 967.91
int8 : 961.61
int16 : 950.62

Transfer bandwidth (GBPS)
enqueueWriteBuffer : 1.86
enqueueReadBuffer : 3.45
enqueueWriteBuffer non-blocking : 3.34
enqueueReadBuffer non-blocking : 3.46
enqueueMapBuffer(for read) : 137.16
memcpy from mapped ptr : 3.02
enqueueUnmap(after write) : 6.91
memcpy to mapped ptr : 2.97

Kernel launch latency : 77.33 us
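
The logs added here all share the same plain "label : value" layout under unindented section headings, so they are easy to compare programmatically. Below is a minimal sketch of a parser for that layout, in Python. The function name `parse_clpeak_log` and the catch-all `"general"` section are assumptions made for illustration; they are not part of this repository or of the benchmark tool.

```python
import re
from collections import defaultdict

# Matches lines such as "float4 : 740.95" or "Kernel launch latency : 11.44 us".
# Lines whose value is not a single number (Platform, Device, Driver version)
# deliberately do not match and are skipped.
_VALUE_LINE = re.compile(
    r"^(?P<label>.+?)\s*:\s*(?P<value>-?\d+(?:\.\d+)?)(?:\s*(?P<unit>\w+))?\s*$"
)

DEFAULT_SECTION = "general"  # hypothetical name for values outside any heading


def parse_clpeak_log(text):
    """Return {section: {label: float}} for one log in the layout shown above."""
    results = defaultdict(dict)
    section = DEFAULT_SECTION
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current section.
            section = DEFAULT_SECTION
            continue
        match = _VALUE_LINE.match(line)
        if match:
            results[section][match.group("label")] = float(match.group("value"))
        elif ":" not in line:
            # A line without a colon is treated as a section heading,
            # e.g. "Global memory bandwidth (GBPS)".
            section = line
    return dict(results)


if __name__ == "__main__":
    sample = """
Compute units : 60
Clock frequency : 1725 MHz

Global memory bandwidth (GBPS)
float : 766.24
float2 : 756.53

Kernel launch latency : 11.44 us
"""
    for name, values in parse_clpeak_log(sample).items():
        print(name, values)
```

Notices such as "No half precision support! Skipped" contain no colon and are followed by a blank line, so they produce no entries. Very large figures such as the `enqueueMapBuffer(for read)` values are parsed as-is; interpreting them is left to the reader.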
