Tests fail on Debian stretch with beignet #231

Closed
vi opened this issue Dec 21, 2017 · 60 comments

Comments

@vi

vi commented Dec 21, 2017

With beignet 1.3.2-1 and CLBlast v1.2.0 it fails multiple tests:

Total Test time (real) = 397.13 sec

The following tests FAILED:
	  5 - clblast_test_xdot (Failed)
	  6 - clblast_test_xdotu (Failed)
	  7 - clblast_test_xdotc (Failed)
	  8 - clblast_test_xnrm2 (Failed)
	  9 - clblast_test_xasum (Failed)
	 12 - clblast_test_xgbmv (OTHER_FAULT)
	 34 - clblast_test_xgemm (OTHER_FAULT)
	 37 - clblast_test_xsyrk (Failed)
	 38 - clblast_test_xherk (Failed)
	 39 - clblast_test_xsyr2k (Failed)
	 40 - clblast_test_xher2k (Failed)
	 46 - clblast_test_xgemmbatched (OTHER_FAULT)

Additionally, a matmul program built against the Netlib API of CLBlast produces a wrong result once the matrix is big enough:

$ ./matmul_cl -n 191 -a 6
...
Central cell: 42.8968
$ ./matmul_cl -n 192 -a 6
...
Central cell: 0

It also fails on the master branch.

@CNugteren
Owner

Thanks for reporting. Could you give a bit more info though? Which device are you testing on? And can you post the results of the failing test runs?

@vi
Author

vi commented Dec 21, 2017

Lenovo Thinkpad X230. Linux vi-notebook 4.9.33-grsec-64+ #85 SMP PREEMPT Sat Jul 15 00:47:47 +03 2017 x86_64 GNU/Linux 00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)

How do I get those results? Is it just the terminal output?

@CNugteren
Owner

It would be helpful to run clinfo or the included clblast_test_diagnostics tool to get the name of your device (e.g. HD Graphics Haswell Ultrabook GT2 Mobile).

CMake just runs the test executables, but stores the output somewhere else. You can probably find that in subfolders on disk. Consult the CMake/CTest documentation to get more info. Otherwise, you can just manually run the test executables, e.g. ./clblast_test_xaxpy.

@vi
Author

vi commented Dec 21, 2017

Number of platforms                               1
  Platform Name                                   Intel Gen OCL Driver
  Platform Vendor                                 Intel
  Platform Version                                OpenCL 2.0 beignet 1.3
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing
  Platform Extensions function suffix             Intel

  Platform Name                                   Intel Gen OCL Driver
Number of devices                                 1
  Device Name                                     Intel(R) HD Graphics IvyBridge M GT2
  Device Vendor                                   Intel
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 1.2 beignet 1.3
  Driver Version                                  1.3
  Device OpenCL C Version                         OpenCL C 1.2 beignet 1.3
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               16
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None, None, None
  Max work item dimensions                        3
  Max work item sizes                             512x512x512
  Max work group size                             512
  Preferred work group size multiple              16
  Preferred / native vector sizes                 
    char                                                16 / 8       
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 8        (n/a)
    float                                                4 / 4       
    double                                               0 / 2        (n/a)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              2147483648 (2GiB)
  Error Correction support                        No
  Max memory allocation                           1610612736 (1.5GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        8192
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   4096 bytes
    Pitch alignment for 2D image buffers          1 bytes
    Max 2D image size                             8192x8192 pixels
    Max 3D image size                             8192x8192x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Max constant buffer size                        134217728 (128MiB)
  Max number of constant args                     8
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      80ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    SPIR versions                                 1.2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;block_motion_estimate_intel;
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_intel_motion_estimation

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Intel Gen OCL Driver
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [Intel]
  clCreateContext(NULL, ...) [default]            Success [Intel]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Intel Gen OCL Driver
    Device Name                                   Intel(R) HD Graphics IvyBridge M GT2
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Intel Gen OCL Driver
    Device Name                                   Intel(R) HD Graphics IvyBridge M GT2

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1
* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]


 --- OpenCL device naming:
* Device type                   GPU
* Device name                   Intel(R) HD Graphics IvyBridge M GT2
* Platform vendor               Intel
* Platform version              OpenCL 2.0 beignet 1.3

 --- CLBlast device naming:
* Device type                   GPU
* Device name                   Intel(R) HD Graphics IvyBridge M GT2
* Device vendor                 Intel
* Device architecture           

 --- OpenCL device properties:
* Max work group size           512
* Max work item dimensions      3
* - Max work item size #0       512
* - Max work item size #1       512
* - Max work item size #2       512
* Local memory size             65536KB
* Extensions:
cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_intel_motion_estimation

 --- Some OpenCL library benchmarks (functions from clpp11.h):
* queue.GetContext()            0.0003 ms
* queue.GetDevice()             0.0002 ms
* device.Name()                 0.0002 ms
* device.Vendor()               0.0001 ms
* device.Version()              0.0002 ms
* device.Platform()             0.0001 ms
* Buffer<float>(context, 1024)  0.0071 ms
$ DISPLAY= ./clblast_test_xaxpy

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -full_test [false]
    -verbose [false]
    -cblas 1 [=default]

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SAXPY' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
   ::::::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 36 passed / 0 skipped / 0 failed
* Completed all test-cases for this routine. Results:
   36 test(s) passed
   0 test(s) skipped
   0 test(s) failed

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'DAXPY' routine.
* All tests skipped: Unsupported precision

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'CAXPY' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
   ::::::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 36 passed / 0 skipped / 0 failed
* Completed all test-cases for this routine. Results:
   36 test(s) passed
   0 test(s) skipped
   0 test(s) failed

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'ZAXPY' routine.
* All tests skipped: Unsupported precision

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'HAXPY' routine.
* All tests skipped: Unsupported precision
$ DISPLAY= ./clblast_test_xdot

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -full_test [false]
    -verbose [false]
    -cblas 1 [=default]

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SDOT' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
   Error rate 100.00%: n=7 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Pass rate   0.0%: 0 passed / 0 skipped / 36 failed
* Completed all test-cases for this routine. Results:
   0 test(s) passed
   0 test(s) skipped
   36 test(s) failed

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'DDOT' routine.
* All tests skipped: Unsupported precision

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'HDOT' routine.
* All tests skipped: Unsupported precision

@vi
Author

vi commented Dec 21, 2017

Can CLBlast fall back to usual OpenBLAS for unsupported operations by the way?

@CNugteren
Owner

OK thanks, probably Intel(R) HD Graphics IvyBridge M GT2 was sufficient info though ;-) Could you post the output of all failing tests? Either as attachment or as Gist/Pastebin, because otherwise this issue becomes a bit unreadable.

Can CLBlast fall back to usual OpenBLAS for unsupported operations by the way?

What do you mean exactly? And which routines? Are you using the Netlib API by the way instead of the OpenCL API? That's not recommended for speed as you might already know, definitely not on small GPUs such as the Intel GPU you have.

But anyway, let's fix the tests first. One thing you can try is to run the tuners (see README), because perhaps the defaults are not suitable for your particular GPU? I've tested on other Intel GPUs with Beignet with success.

@vi
Author

vi commented Dec 21, 2017

What do you mean exactly? And which routines?

I see messages in the tests about failed operations because of missing GPU features ("Unsupported precision", such as double or half floats). With the Netlib API I expect all GPU details to be fully abstracted from the user application, but this dependence on GPU features, with exceptions when something is missing, is an abstraction leak. A proper way would be to fall back to a CPU implementation if the GPU can't do something. I don't know which exact routines (I have never programmed for BLAS so far), but I expect CLBlast with Netlib API to be drop-in replacement for OpenBLAS (or something like that). It may even include cblas.h or even be ABI-compatible with OpenBLAS, so that existing programs may LD_PRELOAD CLBlast and receive the GPU speedup even without recompilation.


Could you post the output of all failing tests?

for i in clblast_test_*; do echo $i; DISPLAY= ./$i; echo $?; done &> log ?
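
A Python variant of the loop above could write one log file per test, which makes it easier to attach only the failing ones. This is a minimal sketch, not part of CLBlast: the function name `run_tests` and the `test_logs` directory used below are made up for illustration, and it assumes the test executables sit in the build directory.

```python
import os
import subprocess
from pathlib import Path


def run_tests(build_dir, log_dir, pattern="clblast_test_*"):
    """Run each matching executable; store its output in log_dir/<name>.log.

    Returns a dict mapping executable name to exit code.
    """
    build_dir = Path(build_dir).resolve()
    log_dir = Path(log_dir).resolve()
    log_dir.mkdir(parents=True, exist_ok=True)
    # Mirror the `DISPLAY=` prefix of the shell loop.
    env = dict(os.environ, DISPLAY="")
    exit_codes = {}
    for exe in sorted(build_dir.glob(pattern)):
        # Skip directories and non-executable files that match the pattern.
        if not (exe.is_file() and os.access(exe, os.X_OK)):
            continue
        result = subprocess.run(
            [str(exe)],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep stderr interleaved with stdout
            env=env,
            cwd=str(build_dir),
        )
        (log_dir / (exe.name + ".log")).write_bytes(result.stdout)
        exit_codes[exe.name] = result.returncode
    return exit_codes
```

Calling `run_tests(".", "test_logs")` from the build directory would then leave files such as `test_logs/clblast_test_xdot.log`. On POSIX systems a test that is killed by a signal (for example one that ends with "Aborted") shows up with a negative exit code.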

@vi
Author

vi commented Dec 21, 2017

Tried tuning (log and jsons), but the python script fails afterwards:

$ python ../scripts/database/database.py . ..
[database] Downloading database from 'https://raw.githubusercontent.com/CNugteren/CLBlast-database/master/database.json'...
[database] Loading database from '../scripts/database/database.json'
[database] Processing './clblast_copy_32.json' with 128 new items
[database] Processing './clblast_routine_gemm_32.json' with 31 new items
[database] Processing './clblast_xger_32.json' with 108 new items
[database] Processing './clblast_xdot_2_32.json' with 5 new items
[database] Processing './clblast_padtranspose_32.json' with 14 new items
[database] Processing './clblast_xgemm_direct_2_32.json' with 125 new items
[database] Processing './clblast_xaxpy_32.json' with 64 new items
[database] Processing './clblast_xgemm_2_32.json' with 229 new items
[database] Processing './clblast_xgemv_fast_32.json' with 30 new items
[database] Processing './clblast_xdot_1_32.json' with 5 new items
[database] Processing './clblast_xgemv_fast_rot_32.json' with 68 new items
[database] Processing './clblast_xgemm_direct_1_32.json' with 45 new items
[database] Processing './clblast_xgemv_32.json' with 12 new items
[database] Processing './clblast_pad_32.json' with 72 new items
[database] Processing './clblast_transpose_32.json' with 48 new items
[database] Processing './clblast_xgemm_1_32.json' with 560 new items
[database] Saving database to '../scripts/database/database.json'
[database] Calculating the best results per device/kernel...
[database] Calculating the default values...
[database] Producing a C++ database in '../src/database/kernels'...
Traceback (most recent call last):
  File "../scripts/database/database.py", line 154, in <module>
    main(sys.argv[1:])
  File "../scripts/database/database.py", line 148, in main
    clblast.print_cpp_database(database_best_results, cpp_database_path)
  File "/mnt/src/git/CLBlast/scripts/database/database/clblast.py", line 177, in print_cpp_database
    assert len(kernel_database) == 1
AssertionError

@sivagnanamn
Contributor

@vi To get rid of the python error, delete the ../scripts/database/database.json file and re-run the python script. It'll download a new copy of database.json again & should work without any errors.

@CNugteren
Owner

Thanks for sharing the output of the tests! A quick glance shows that it might be just the reduce and matrix-multiplication kernels failing; they are used in quite a few routines. Let's first see if the tuning can fix them.

To get rid of the python error, delete the ../scripts/database/database.json file and re-run the python script. It'll download a new copy of database.json again & should work without any errors.

Don't think that's going to work, since he just got a fresh copy anyway.

Tried tuning (log and jsons), but the python script fails afterwards.

OK, thanks for sharing the JSONs. I'll take a look myself this weekend at what's going wrong and I'll try to fix it and make the error message more meaningful for future cases. I'll report back as soon as I have something for you.

I expect CLBlast with Netlib API to be drop-in replacement for OpenBLAS (or something like that). It may even include cblas.h or even be ABI-compatible with OpenBLAS, so that existing programs may LD_PRELOAD CLBlast and receive the GPU speedup even without recompilation.

I understand your point, but that might be less trivial to implement than you suggest. First of all, you'll have to query OpenCL to see what's supported and what's not. Then, you'll have to call a BLAS routine, which is not trivial to do since CLBlast also uses cblas.h for that case. But then you'll need all the extra logic to make sure this runs on any platform with any CPU BLAS, and you'll need proper testing. However, even after doing all that, the Netlib API of CLBlast is only useful if you think about speed before using it, e.g. an AXPY operation will be slower due to memory copying overhead. Even a GEMM operation might be slower. So ideally you'll want to make the decision per machine/routine/parameters. My conclusion here is that it will take too much effort to make the drop-in Netlib API of CLBlast useful in the sense that you describe it. And since it's not the main/recommended API, I will not be able to work on this in the foreseeable future, but I'll accept pull requests of course.
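
To make the "decision per machine/routine/parameters" idea concrete, here is a rough sketch of what such a dispatcher could look like. Nothing here exists in CLBlast: `DeviceProfile`, `choose_backend`, and the size thresholds are hypothetical, and real thresholds would have to be measured per machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceProfile:
    """Hypothetical per-machine description; not part of CLBlast."""
    supported_precisions: frozenset  # e.g. {"single"} for a GPU without fp64
    min_gpu_size: dict               # routine name -> smallest n worth offloading


def choose_backend(profile, routine, precision, n):
    """Return "gpu" or "cpu" for a single BLAS call."""
    # Unsupported precision: the GPU cannot run the routine at all.
    if precision not in profile.supported_precisions:
        return "cpu"
    # No measured threshold for this routine: stay on the CPU to be safe.
    threshold = profile.min_gpu_size.get(routine)
    if threshold is None or n < threshold:
        return "cpu"
    return "gpu"
```

With a made-up profile such as `DeviceProfile(frozenset({"single"}), {"gemm": 512})`, a double-precision GEMM would go to the CPU, and so would any AXPY, because no threshold was recorded for it. The hard part described above (querying OpenCL, linking a CPU BLAS on every platform, and testing all of it) is not addressed by this sketch.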

@CNugteren
Owner

OK, I've just tested with your JSON files (thanks again for sharing), and I didn't encounter any issue. So it is likely that some things changed in the database in the meantime since the release of CLBlast v1.2.0. So there are two things you could do:

  • Replace scripts/database/database.json with a corresponding v1.2.0 version from the CLBlast-database repository, direct download link here.
  • Or (perhaps better), check out the latest master branch of CLBlast. For your convenience, I have just added the tuning results as well, so with the latest master you can directly re-compile and re-run the tests; hopefully more will now pass. Not sure though, but it is worth trying first before going on.

@vi
Author

vi commented Dec 23, 2017

  1. Checked out the master (7aabeb4)
  2. Re-run the script: [database] All done.
  3. Checked git diff - 26 chunks
  4. Re-compile: ninja - 195 targets
  5. Re-test: ninja test
The following tests FAILED:
	  5 - clblast_test_xdot (Failed)
	  6 - clblast_test_xdotu (Failed)
	  7 - clblast_test_xdotc (Failed)
	  8 - clblast_test_xnrm2 (Failed)
	  9 - clblast_test_xasum (Failed)
	 12 - clblast_test_xgbmv (OTHER_FAULT)
	 34 - clblast_test_xgemm (OTHER_FAULT)
	 37 - clblast_test_xsyrk (Failed)
	 38 - clblast_test_xherk (Failed)
	 39 - clblast_test_xsyr2k (Failed)
	 40 - clblast_test_xher2k (Failed)
	 46 - clblast_test_xgemmbatched (OTHER_FAULT)
	 48 - clblast_test_preprocessor (OTHER_FAULT)

  6. Revert what database.py did: git reset --hard
  7. Re-build: ninja - 53 targets
  8. Re-test: ninja test

Same running time, same Faileds, same OTHER_FAULTs.

@CNugteren
Owner

OK, thanks for testing. Let's first try to resolve the OTHER_FAULT issues, and then later the ones that report regular correctness issues. For three of those (GBMV, GEMM, and GEMMBATCHED), I see quite curious output, e.g.:

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SGBMV' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 112 (transposed)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '102 (col-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '102 (col-major) 112 (transposed)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
  what():  OpenCL error: clGetDeviceIDs: -1
Aborted

This error message is from Beignet, not from CLBlast. It suddenly cannot find your device anymore, which is strange. This seems to suggest something is wrong with your OpenCL set-up, or there is perhaps a bug in beignet? I could not find any issue on the Beignet Bugzilla tracker, but perhaps you can search and file one? First also check if it is reproducible, i.e. does it always fail at exactly the same test?

Then there is another issue with clblast_test_preprocessor it seems, but this seems to be a linker issue. I can't reproduce that myself on a Debian 9 / beignet system, tried with Clang and GCC, but both seem to work fine. Which compiler are you using?

@vi
Author

vi commented Dec 24, 2017

Shall I try downgrading beignet to v1.2?

@vi
Author

vi commented Dec 24, 2017

gcc (Debian 6.3.0-18) 6.3.0 20170516

@vi
Author

vi commented Dec 24, 2017

Is the system supposed to be unused during tuning/testing or is it OK to browse around (ignoring graphics lags)?

@CNugteren
Owner

Shall I try downgrading beignet to v1.2?

Not sure, you could try perhaps. I believe there was at least one other CLBlast user with your GPU, since there were already tuning results, so it must have worked correctly on some system at some point.

gcc (Debian 6.3.0-18) 6.3.0 20170516

On my Debian 9 test system I get exactly the same output when I run g++ --version, and there everything compiles and links correctly. Anyway, let's not spend too much time on that; it's just a small issue with linking one specific, non-essential test.

Is the system supposed to be unused during tuning/testing or is it OK to browse around (ignoring graphics lags)?

When I test with Beignet I also run X at the same time and I don't see any issues.

So I would look for the cause in Beignet or in the GPU drivers. Try other versions perhaps, or otherwise report the issue to the Beignet developers. It could well be that the other failing tests are related to this as well...

@CNugteren
Owner

So, did you have any luck with another version of Beignet? Or did you report this issue to the developers of Beignet?

@vi
Author

vi commented Jan 28, 2018

Not yet. And I'm not sure what steps I should take to report it. Is there some minimal failing case which is supposed to work, but doesn't?

@CNugteren
Owner

I'm not sure we need a minimal failing case here. The error returned from clGetDeviceIDs means that it cannot find your GPU:

beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)

However, it did use your GPU in the tests just moments before that. So it seems to be some time-related instability. First also check if it is reproducible, i.e. does it always fail at exactly the same test instance?

Beignet bugs can be filed here. You could refer to this issue perhaps? First also double-check the list of existing Beignet bugs.

@vi
Author

vi commented Jan 28, 2018

Tried running utest_run, it seems to work...

$ /usr/lib/x86_64-linux-gnu/beignet/utest_run
...
summary:
----------
  total: 1000
  run: 959
  pass: 959
  fail: 0
  pass rate: 1.000000

@vi
Author

vi commented Jan 28, 2018

Notes:

  1. If I don't unset DISPLAY, I often get spammed with a repeated "Maximum number of clients reached:" message, from the test as well as from other unrelated apps.

  2. ./clblast_test_xgbmv seems to always fail at the same place with the same OpenCL error: clGetDeviceIDs: -1.

I thought about reducing clblast_test_xgbmv to something smaller, but the testing system is too complicated and I stopped trying after observing this.

How, for example, do I move the 'regular behaviour' for '102 (col-major) 112 (transposed)' case to the first place? Which test is running when the 156th ":" is printed? What if I duplicate the first test ('regular behaviour' for '101 (row-major) 111 (regular)') 4 times instead of going on to further tests?
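
One way to check whether the crash always happens at the same test instance, without touching the test suite, is to count the progress symbols in a captured log. The sketch below is only an illustration: `last_progress` is a made-up helper, and it assumes the non-verbose output format shown earlier in this thread (a "* Testing '...' for '...':" header followed by indented rows of legend symbols).

```python
import re

# Header such as: * Testing 'regular behaviour' for '102 (col-major) 112 (transposed)':
HEADER = re.compile(r"^\* Testing '[^']*' for '([^']*)':")
# Indented run of legend symbols (: . X / \ o -), possibly followed by crash text.
PROGRESS = re.compile(r"^\s+([:.X/\\o-]+)")


def last_progress(log_text):
    """Return (configuration, done_in_configuration, done_total) for the
    last configuration that printed any progress, or None."""
    config, done, total, last = None, 0, 0, None
    for line in log_text.splitlines():
        if line.startswith("*"):
            # Any other '*' line (legend start, summary) ends the configuration.
            header = HEADER.match(line)
            config = header.group(1) if header else None
            done = 0
            continue
        progress = PROGRESS.match(line)
        if config is not None and progress:
            count = len(progress.group(1))
            done += count
            total += count
            last = (config, done, total)
    return last
```

Running this over the logs of several runs of ./clblast_test_xgbmv and comparing the returned tuples would show whether the abort really lands on the same instance every time. The total counts every symbol, including skipped tests, so it is a position in the output rather than a count of passed tests.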

@CNugteren
Owner

CNugteren commented Jan 29, 2018

OK, thanks for trying. In general I don't think anything can be done from the CLBlast side. If calling clGetDeviceIDs works the first few hundred times and then stops working, something strange is going on. So honestly I believe it is a bug in Beignet.

But you are right that trying to pinpoint whether it is always the same test that fails is a good idea. What you can do first is indeed to test only that particular case. I'll help you out. Let's try two steps:

  1. First run only the col-major & transposed case. For that, remove Transpose::kNo, from testblas.cpp#L26 and remove Layout::kRowMajor, from testblas.hpp#L147.

  2. Then, if it still fails, you can run the test with the -verbose option on the command line. That should show you the values for m,n,kl,ku,lda,incx,incy it is testing for. You can adjust those values around testblas.hpp#L129 by changing kIncrements, kMatrixVectorDims, and kBandSizes.

@vi
Author

vi commented Jan 29, 2018

I instead tried this:

diff --git a/test/correctness/testblas.cpp b/test/correctness/testblas.cpp
index aa4b478..be28ed3 100644
--- a/test/correctness/testblas.cpp
+++ b/test/correctness/testblas.cpp
@@ -23,7 +23,7 @@ namespace clblast {
 
 // The transpose configurations to test with: template parameter dependent
 template <> const std::vector<Transpose> TestBlas<half,half>::kTransposes = {Transpose::kNo, Transpose::kYes};
-template <> const std::vector<Transpose> TestBlas<float,float>::kTransposes = {Transpose::kNo, Transpose::kYes};
+template <> const std::vector<Transpose> TestBlas<float,float>::kTransposes = {Transpose::kNo, Transpose::kNo};
 template <> const std::vector<Transpose> TestBlas<double,double>::kTransposes = {Transpose::kNo, Transpose::kYes};
 template <> const std::vector<Transpose> TestBlas<float2,float2>::kTransposes = {Transpose::kNo, Transpose::kYes, Transpose::kConjugate};
 template <> const std::vector<Transpose> TestBlas<double2,double2>::kTransposes = {Transpose::kNo, Transpose::kYes, Transpose::kConjugate};
diff --git a/test/correctness/testblas.hpp b/test/correctness/testblas.hpp
index 4e02fd2..9c0830b 100644
--- a/test/correctness/testblas.hpp
+++ b/test/correctness/testblas.hpp
@@ -144,7 +144,7 @@ template <typename T, typename U> const std::vector<size_t> TestBlas<T,U>::kMatS
 template <typename T, typename U> const std::vector<size_t> TestBlas<T,U>::kVecSizes = {0, kBufferSize - 1, kBufferSize};
 
 // The layout/triangle options to test with
-template <typename T, typename U> const std::vector<Layout> TestBlas<T,U>::kLayouts = {Layout::kRowMajor, Layout::kColMajor};
+template <typename T, typename U> const std::vector<Layout> TestBlas<T,U>::kLayouts = {Layout::kRowMajor, Layout::kRowMajor};
 template <typename T, typename U> const std::vector<Triangle> TestBlas<T,U>::kTriangles = {Triangle::kUpper, Triangle::kLower};
 template <typename T, typename U> const std::vector<Side> TestBlas<T,U>::kSides = {Side::kLeft, Side::kRight};
 template <typename T, typename U> const std::vector<Diagonal> TestBlas<T,U>::kDiagonals = {Diagonal::kUnit, Diagonal::kNonUnit};

and got this:

* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
  what():  OpenCL error: clGetDeviceIDs: -1
Aborted

Is there a simple program that just calls clGetDeviceIDs in an endless loop? (I'm not familiar with the OpenCL/CUDA/GPU world in general yet.)

@CNugteren
Owner

Hmm, interesting, so you think it would perhaps always happen at the n-th call to that function?

Something like this could help you perhaps:

#include <CL/opencl.h>
#include <cstdio>
#include <vector>

#define NUM_RUNS 50
#define PLATFORM_ID 0

int main() {
  for (int i = 0; i < NUM_RUNS; ++i) {
    printf("Test %d\n", i);

    // Query the number of available OpenCL platforms
    cl_uint num_platforms = 0;
    cl_int status = clGetPlatformIDs(0, NULL, &num_platforms);
    if (status != CL_SUCCESS) { printf("Error in clGetPlatformIDs #1: %d\n", status); return 1; }
    if (num_platforms <= PLATFORM_ID) { printf("Platform %d not found\n", PLATFORM_ID); return 1; }

    // Retrieve the platform IDs (a std::vector cannot leak on an early return)
    std::vector<cl_platform_id> platforms(num_platforms);
    status = clGetPlatformIDs(num_platforms, platforms.data(), NULL);
    if (status != CL_SUCCESS) { printf("Error in clGetPlatformIDs #2: %d\n", status); return 1; }

    // Count the devices on the selected platform: this is the call that fails in the tests
    cl_uint num_devices = 0;
    status = clGetDeviceIDs(platforms[PLATFORM_ID], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
    if (status != CL_SUCCESS) { printf("Error in clGetDeviceIDs: %d\n", status); return 1; }
  }
  return 0;
}

@vi
Author

vi commented Jan 29, 2018

This does not fail (increased NUM_RUNS and removed the main printf), even if I also start clblast_test_xgbmv in parallel.

Running the test under valgrind (snippet):

* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   :::::::::::::::::::::::::::==6005== Warning: invalid file descriptor 1031 in syscall open()
==6005== Warning: invalid file descriptor 1031 in syscall open()
==6005== Warning: invalid file descriptor 1031 in syscall open()
:==6005== Warning: invalid file descriptor 1031 in syscall open()
==6005== Warning: invalid file descriptor 1031 in syscall open()
beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
  what():  OpenCL error: clGetDeviceIDs: -1
==6005== 
$ ulimit -n
1024

There are a lot of open file descriptors for /dev/dri/renderD128.

After ulimit -n 4096 the test succeeds.

$ ulimit -n 10
$ ./clblast_test_xgbmv 
...
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SGBMV' routine. Legend:
...
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
  what():  OpenCL error: clGetDeviceIDs: -1
Aborted

@CNugteren
Owner

In your last example it fails at a different place than before, so the location is not deterministic?

@vi
Author

vi commented Jan 30, 2018

It fails when file descriptors run out. The CLBlast test (or some dependency) opens them, but does not close them properly.

ulimit -n sets the maximum number of open file descriptors, so the fewer open files are allowed, the sooner it fails.
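
A minimal sketch for checking this from inside a process, assuming Linux with /proc mounted. The helper names CountOpenFdsByTarget and CountOpenFds are hypothetical and not part of CLBlast or Beignet. Each entry in /proc/self/fd is a symlink to whatever the descriptor points at, so grouping by link target shows how many descriptors refer to /dev/dri/renderD128:

```cpp
#include <cassert>
#include <cstdlib>
#include <dirent.h>
#include <limits.h>
#include <map>
#include <string>
#include <unistd.h>

// Hypothetical helper: maps each target path (e.g. "/dev/dri/renderD128") to
// the number of descriptors of this process that currently point at it.
std::map<std::string, int> CountOpenFdsByTarget() {
  std::map<std::string, int> counts;
  DIR* dir = opendir("/proc/self/fd");
  if (dir == nullptr) { return counts; }  // no /proc available: report nothing
  const int scan_fd = dirfd(dir);         // descriptor used by this scan itself
  for (struct dirent* entry = readdir(dir); entry != nullptr; entry = readdir(dir)) {
    if (entry->d_name[0] == '.') { continue; }              // skip "." and ".."
    if (std::atoi(entry->d_name) == scan_fd) { continue; }  // skip our own scan
    const std::string link = std::string("/proc/self/fd/") + entry->d_name;
    char target[PATH_MAX];
    const ssize_t length = readlink(link.c_str(), target, sizeof(target));
    if (length < 0) { continue; }                           // closed in the meantime
    assert(static_cast<size_t>(length) <= sizeof(target));
    counts[std::string(target, static_cast<size_t>(length))] += 1;
  }
  closedir(dir);
  return counts;
}

// Hypothetical helper: total number of descriptors found by the scan above.
int CountOpenFds() {
  int total = 0;
  for (const auto& item : CountOpenFdsByTarget()) { total += item.second; }
  return total;
}
```

Logging the count for /dev/dri/renderD128 between test cases would show whether the number keeps growing until it hits the ulimit -n value.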

@CNugteren
Owner

CNugteren commented Jan 30, 2018

OK, never heard of those. CLBlast doesn't open any files while testing. Must be Beignet related then I guess?

Can you then re-run all the tests with that 'fix' applied and report which ones still have open issues?
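
The 'fix' in question is raising the descriptor limit with ulimit -n 4096. If it helps, the same can be done from inside a process with the POSIX getrlimit/setrlimit calls. A minimal sketch follows; RaiseOpenFileLimit is a hypothetical helper name, not something CLBlast provides, and it only postpones the failure rather than fixing the leak:

```cpp
#include <cassert>
#include <sys/resource.h>

// Hypothetical helper: raises the soft limit on open file descriptors to
// `wanted`, capped at the hard limit. Returns the soft limit in effect
// afterwards, or 0 if the limits could not be read or changed.
rlim_t RaiseOpenFileLimit(const rlim_t wanted) {
  assert(wanted > 0);
  struct rlimit limit;
  if (getrlimit(RLIMIT_NOFILE, &limit) != 0) { return 0; }
  if (limit.rlim_cur >= wanted) { return limit.rlim_cur; }  // already high enough
  // An unprivileged process may raise the soft limit only up to the hard limit
  limit.rlim_cur = (wanted < limit.rlim_max) ? wanted : limit.rlim_max;
  if (setrlimit(RLIMIT_NOFILE, &limit) != 0) { return 0; }
  return limit.rlim_cur;
}
```

Calling RaiseOpenFileLimit(4096) at the start of main would mirror the ulimit -n 4096 run above.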

@vi
Author

vi commented Feb 3, 2018

For completeness: the GPU may have felt a little bit sick at the time of the test. At least the graphical scaling glitch is still here.

@CNugteren
Owner

Sorry, I forgot about this issue. Thanks for testing. The tuner result looks OK. Not sure how to continue though, since I can't reproduce the issue myself and therefore can't debug it.

Perhaps this other issue #149 might help you out. It seems it is also the same GPU, but a different OpenCL (not Beignet but Apple OpenCL).

@vi
Author

vi commented Feb 13, 2018

If needed I can run special modified versions for debugging or maybe give access for remote debugging on my laptop.

But maybe I should "play" with Beignet versions first. I've already built one from source code, but not sure yet how to install it into Debian (or can it be used without installation).

@CNugteren
Owner

But maybe I should "play" with Beignet versions first. I've already built one from source code, but not sure yet how to install it into Debian (or can it be used without installation).

You can run make install to install it into the directory specified when you ran CMake (-DCMAKE_INSTALL_PREFIX=/path/to/install). If none is specified, it will just install into your system's path and will overwrite any existing OpenCL afaik. Otherwise you'll have multiple OpenCL platforms and you'll need to select the right one.

@vi
Author

vi commented Feb 13, 2018

select the right one

How do I select the right one while ensuring that no pieces of the wrong one get in the way, and without making disruptive changes to the system as root? Is it just LD_LIBRARY_PATH or LD_PRELOAD, or something trickier?

@CNugteren
Owner

Not sure, I'm not an expert on that... But what I meant was 'select' in the OpenCL platform sense. If you do it right, you might have both Beignets co-existing on your system, and clinfo will show both. But you can of course also try to set the library path.

@CNugteren
Owner

Any updates here? Or should we conclude it is not CLBlast-related?

@vi
Author

vi commented Mar 30, 2018

Haven't experimented with other Beignets yet. Not sure if it is appropriate to report bugs there without trying a fresher build...

Maybe CLBlast is doing things OK, but it could also contain a workaround for broken platforms...

If/when I come back to experimenting with OpenCL in general and Beignet and/or CLBlast in particular, I'll comment.

@CNugteren
Owner

Intel now has a new open-source implementation that is replacing Beignet. Perhaps it is time to try the new Intel NEO?

@vi
Author

vi commented May 17, 2018

Gen8 (Broadwell) and beyond

Is it something new-ish? Unlikely that it would work on my laptop.

@CNugteren
Owner

Indeed, it seems that your hardware is not supported. NEO is indeed new, and Beignet is now discontinued, so that route won't solve this issue either, it seems.

How do you suggest we proceed? Do you still have time to test things? We could also close this issue and say that older hardware is not properly supported in all cases...

@vi
Author

vi commented May 17, 2018

Do you still have time to test things?

Yes, I'm constantly trying various things (CLBlast being a detour from experimenting with various deep learning toys and thinking "what if I can work around missing OpenCL support for ... by using CLBlast instead of the usual BLAS library").

How do you suggest we proceed?

Maybe like previously: me trying an updated Beignet (or just waiting until an updated Beignet eventually comes to Debian Stable), then maybe reporting additional issues to Beignet.

@CNugteren
Owner

Any updates from your side?

@vi
Author

vi commented Jul 29, 2018

Not yet.

Is it something urgent, or do you just not want a dangling open issue? I'll report results here if/when I resume experimentation, regardless of whether this issue is closed.

For now I just treat my laptop as Not Ready For GPU Computing.

@CNugteren
Owner

OK. Yes, I see this as a list of things I have to work on :-) I can also add your setup to the list of known issues, close this issue, and we can follow up later with you and/or Intel when you have time, to see if anything can be fixed?

@vi
Author

vi commented Jul 29, 2018

Got around to installing Beignet from master.

The 3 tests in ./clblast_test_xdot that used to pass are gone; all of them fail now. ./clblast_test_xdotc keeps working, though. The FD leak persists.

Beignet's own tests almost succeed: pass rate: 0.999005.

Using clblast dda1e56 and beignet 591d387327ce35f03a6152d4c823415729e221f2.

@vi
Author

vi commented Jul 29, 2018

Tried beignet 1.2.1 (097365ed1a79cd03dc689b37b03552e455eb3854) and seeing more successful tests.

@vi
Author

vi commented Jul 29, 2018

Tests now look much better:

$ ninja test
[0/1] Running tests...
Test project /home/vi/src/git/CLBlast/build
      Start  1: clblast_test_xswap
 1/51 Test  #1: clblast_test_xswap .................   Passed    1.58 sec
      Start  2: clblast_test_xscal
 2/51 Test  #2: clblast_test_xscal .................   Passed    1.16 sec
      Start  3: clblast_test_xcopy
 3/51 Test  #3: clblast_test_xcopy .................   Passed    1.56 sec
      Start  4: clblast_test_xaxpy
 4/51 Test  #4: clblast_test_xaxpy .................   Passed    1.59 sec
      Start  5: clblast_test_xdot
 5/51 Test  #5: clblast_test_xdot ..................   Passed    0.82 sec
      Start  6: clblast_test_xdotu
 6/51 Test  #6: clblast_test_xdotu .................   Passed    0.88 sec
      Start  7: clblast_test_xdotc
 7/51 Test  #7: clblast_test_xdotc .................   Passed    0.89 sec
      Start  8: clblast_test_xnrm2
 8/51 Test  #8: clblast_test_xnrm2 .................   Passed    1.16 sec
      Start  9: clblast_test_xasum
 9/51 Test  #9: clblast_test_xasum .................   Passed    1.13 sec
      Start 10: clblast_test_xamax
10/51 Test #10: clblast_test_xamax .................   Passed    1.12 sec
      Start 11: clblast_test_xgemv
11/51 Test #11: clblast_test_xgemv .................   Passed    5.75 sec
      Start 12: clblast_test_xgbmv
12/51 Test #12: clblast_test_xgbmv .................   Passed   53.21 sec
      Start 13: clblast_test_xhemv
13/51 Test #13: clblast_test_xhemv .................   Passed    1.98 sec
      Start 14: clblast_test_xhbmv
14/51 Test #14: clblast_test_xhbmv .................   Passed    4.82 sec
      Start 15: clblast_test_xhpmv
15/51 Test #15: clblast_test_xhpmv .................   Passed    1.97 sec
      Start 16: clblast_test_xsymv
16/51 Test #16: clblast_test_xsymv .................   Passed    1.72 sec
      Start 17: clblast_test_xsbmv
17/51 Test #17: clblast_test_xsbmv .................   Passed    3.97 sec
      Start 18: clblast_test_xspmv
18/51 Test #18: clblast_test_xspmv .................   Passed    1.80 sec
      Start 19: clblast_test_xtrmv
19/51 Test #19: clblast_test_xtrmv .................   Passed   28.67 sec
      Start 20: clblast_test_xtbmv
20/51 Test #20: clblast_test_xtbmv .................   Passed   78.94 sec
      Start 21: clblast_test_xtpmv
21/51 Test #21: clblast_test_xtpmv .................   Passed   20.33 sec
      Start 22: clblast_test_xtrsv
22/51 Test #22: clblast_test_xtrsv .................   Passed   30.57 sec
      Start 23: clblast_test_xger
23/51 Test #23: clblast_test_xger ..................   Passed    1.49 sec
      Start 24: clblast_test_xgeru
24/51 Test #24: clblast_test_xgeru .................   Passed    1.73 sec
      Start 25: clblast_test_xgerc
25/51 Test #25: clblast_test_xgerc .................   Passed    1.68 sec
      Start 26: clblast_test_xher
26/51 Test #26: clblast_test_xher ..................   Passed    0.95 sec
      Start 27: clblast_test_xhpr
27/51 Test #27: clblast_test_xhpr ..................   Passed    0.77 sec
      Start 28: clblast_test_xher2
28/51 Test #28: clblast_test_xher2 .................   Passed    1.74 sec
      Start 29: clblast_test_xhpr2
29/51 Test #29: clblast_test_xhpr2 .................   Passed    1.33 sec
      Start 30: clblast_test_xsyr
30/51 Test #30: clblast_test_xsyr ..................   Passed    0.89 sec
      Start 31: clblast_test_xspr
31/51 Test #31: clblast_test_xspr ..................   Passed    0.72 sec
      Start 32: clblast_test_xsyr2
32/51 Test #32: clblast_test_xsyr2 .................   Passed    1.62 sec
      Start 33: clblast_test_xspr2
33/51 Test #33: clblast_test_xspr2 .................   Passed    1.28 sec
      Start 34: clblast_test_xgemm
34/51 Test #34: clblast_test_xgemm .................   Passed   39.85 sec
      Start 35: clblast_test_xsymm
35/51 Test #35: clblast_test_xsymm .................   Passed    5.64 sec
      Start 36: clblast_test_xhemm
36/51 Test #36: clblast_test_xhemm .................   Passed    2.36 sec
      Start 37: clblast_test_xsyrk
37/51 Test #37: clblast_test_xsyrk .................   Passed    3.41 sec
      Start 38: clblast_test_xherk
38/51 Test #38: clblast_test_xherk .................   Passed    1.78 sec
      Start 39: clblast_test_xsyr2k
39/51 Test #39: clblast_test_xsyr2k ................   Passed    5.20 sec
      Start 40: clblast_test_xher2k
40/51 Test #40: clblast_test_xher2k ................   Passed    2.59 sec
      Start 41: clblast_test_xtrmm
41/51 Test #41: clblast_test_xtrmm .................   Passed   59.26 sec
      Start 42: clblast_test_xtrsm
42/51 Test #42: clblast_test_xtrsm .................   Passed   74.42 sec
      Start 43: clblast_test_xhad
43/51 Test #43: clblast_test_xhad ..................   Passed    1.01 sec
      Start 44: clblast_test_xomatcopy
44/51 Test #44: clblast_test_xomatcopy .............   Passed    1.11 sec
      Start 45: clblast_test_xim2col
45/51 Test #45: clblast_test_xim2col ...............   Passed    3.37 sec
      Start 46: clblast_test_xaxpybatched
46/51 Test #46: clblast_test_xaxpybatched ..........   Passed    4.14 sec
      Start 47: clblast_test_xgemmbatched
47/51 Test #47: clblast_test_xgemmbatched ..........   Passed   65.15 sec
      Start 48: clblast_test_xgemmstridedbatched
48/51 Test #48: clblast_test_xgemmstridedbatched ...   Passed   63.62 sec
      Start 49: clblast_test_override_parameters
49/51 Test #49: clblast_test_override_parameters ...   Passed    9.00 sec
      Start 50: clblast_test_retrieve_parameters
50/51 Test #50: clblast_test_retrieve_parameters ...   Passed    0.16 sec
      Start 51: clblast_test_preprocessor
51/51 Test #51: clblast_test_preprocessor ..........***Exception: Other  7.65 sec

98% tests passed, 1 tests failed out of 51

Total Test time (real) = 609.63 sec

The following tests FAILED:
	 51 - clblast_test_preprocessor (OTHER_FAULT)
Errors while running CTest
FAILED: CMakeFiles/test.util 

Open file handles to /dev/dri/renderD128 keep accumulating during the longer-running tests.

Shall I run the tuning process?

@CNugteren
Owner

OK, that is good news, so Beignet 1.2.1 works quite well. One test is still failing, it seems; shall we try to see if we can solve that? It is a bit of a special thing though, not really needed in all cases. But perhaps you can give me the output when running ./clblast_test_preprocessor? And perhaps the gdb result as well?

@vi
Author

vi commented Jul 30, 2018

$ ./clblast_test_preprocessor

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]

* Testing simple OpenCL pre-processor for 'XaxpyFastest'
* Testing simple OpenCL pre-processor for 'Xger'
* Testing simple OpenCL pre-processor for 'XgemvFast'
* Testing simple OpenCL pre-processor for 'CopyMatrixFast'
* Testing simple OpenCL pre-processor for 'CopyPadMatrix'
* Testing simple OpenCL pre-processor for 'TransposeMatrixFast'
* Testing simple OpenCL pre-processor for 'TransposePadMatrix'
* Testing simple OpenCL pre-processor for 'Xgemm'
Warning unknown condition: 1
Warning unknown condition: (0
Warning unknown condition: 0)
Warning unknown condition: 2 != SUBGROUP_SIZE
Warning unknown condition: 8 < SUBGROUP_SIZE
* Testing simple OpenCL pre-processor for 'XgemmDirectTN'

    11 test(s) passed
    0 test(s) failed

* Testing simple OpenCL pre-processor for 'XaxpyFastest'
ASSERTION FAILED: 0
  at file /home/vi/src/git/beignet/backend/src/backend/gen_encoder.cpp, function virtual void gbe::GenEncoder::handleDouble(gbe::GenEncoder*, uint32_t, gbe::GenRegister, gbe::GenRegister, gbe::GenRegister), line 648
Trace/breakpoint trap
$ gdb -args ./clblast_test_preprocessor
...
Reading symbols from ./clblast_test_preprocessor...(no debugging symbols found)...done.
(gdb) r
Starting program: /mnt/src/git/CLBlast/build/clblast_test_preprocessor 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x3fff2513700 (LWP 21776)]
[New Thread 0x3ffefd12700 (LWP 21777)]
[New Thread 0x3ffed511700 (LWP 21778)]

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]

* Testing simple OpenCL pre-processor for 'XaxpyFastest'
* Testing simple OpenCL pre-processor for 'Xger'
* Testing simple OpenCL pre-processor for 'XgemvFast'
* Testing simple OpenCL pre-processor for 'CopyMatrixFast'
* Testing simple OpenCL pre-processor for 'CopyPadMatrix'
* Testing simple OpenCL pre-processor for 'TransposeMatrixFast'
* Testing simple OpenCL pre-processor for 'TransposePadMatrix'
* Testing simple OpenCL pre-processor for 'Xgemm'
Warning unknown condition: 1
Warning unknown condition: (0
Warning unknown condition: 0)
Warning unknown condition: 2 != SUBGROUP_SIZE
Warning unknown condition: 8 < SUBGROUP_SIZE
* Testing simple OpenCL pre-processor for 'XgemmDirectTN'

    11 test(s) passed
    0 test(s) failed

* Testing simple OpenCL pre-processor for 'XaxpyFastest'
ASSERTION FAILED: 0
  at file /home/vi/src/git/beignet/backend/src/backend/gen_encoder.cpp, function virtual void gbe::GenEncoder::handleDouble(gbe::GenEncoder*, uint32_t, gbe::GenRegister, gbe::GenRegister, gbe::GenRegister), line 648

Thread 1 "clblast_test_pr" received signal SIGTRAP, Trace/breakpoint trap.
gbe::onFailedAssertion (msg=<optimized out>, file=<optimized out>, fn=<optimized out>, line=<optimized out>)
    at /home/vi/src/git/beignet/backend/src/sys/assert.cpp:76
76	    _exit(-1);
(gdb) bt
#0  gbe::onFailedAssertion (msg=<optimized out>, file=<optimized out>, fn=<optimized out>, line=<optimized out>)
    at /home/vi/src/git/beignet/backend/src/sys/assert.cpp:76
#1  0x000003ffe57d738e in gbe::GenEncoder::MUL (this=<optimized out>, dest=..., src0=..., src1=...)
    at /home/vi/src/git/beignet/backend/src/backend/gen_encoder.cpp:860
#2  0x000003ffe5790f18 in gbe::GenContext::emitBinaryInstruction (this=0x2aaab401780, insn=...)
    at /home/vi/src/git/beignet/backend/src/backend/gen_context.cpp:767
#3  0x000003ffe57b9a97 in gbe::GenContext::emitInstructionStream (this=this@entry=0x2aaab401780)
    at /home/vi/src/git/beignet/backend/src/./backend/gen_insn_selection.hxx:36
#4  0x000003ffe57b9f5a in gbe::GenContext::emitCode (this=0x2aaab401780) at /home/vi/src/git/beignet/backend/src/backend/gen_context.cpp:3858
#5  0x000003ffe568cd22 in gbe::Context::compileKernel (this=this@entry=0x2aaab401780) at /home/vi/src/git/beignet/backend/src/backend/context.cpp:389
#6  0x000003ffe57cc2d5 in gbe::GenProgram::compileKernel (this=<optimized out>, unit=..., name="Xaxpy", relaxMath=<optimized out>, 
    profiling=<optimized out>) at /home/vi/src/git/beignet/backend/src/backend/gen_program.cpp:212
#7  0x000003ffe56902b6 in gbe::Program::buildFromUnit (this=this@entry=0x2aaaaf00990, unit=..., error="")
    at /home/vi/src/git/beignet/backend/src/backend/program.cpp:188
#8  0x000003ffe5690930 in gbe::Program::buildFromLLVMFile (this=this@entry=0x2aaaaf00990, fileName=fileName@entry=0x0, 
    module=module@entry=0x2aaaaf17bb0, error="", optLevel=optLevel@entry=1) at /home/vi/src/git/beignet/backend/src/backend/program.cpp:163
#9  0x000003ffe57cc985 in gbe::genProgramNewFromLLVM (deviceID=358, fileName=0x0, module=0x2aaaaf17bb0, llvm_ctx=0x2aaab3f8130, 
    asm_file_name=<optimized out>, stringSize=1048576, err=0x2aaaaf20950 "", errSize=0x2aaaaeb8800, optLevel=1, 
    options=0x3ffffffd9b0 " -cl-std=CL1.1") at /home/vi/src/git/beignet/backend/src/backend/gen_program.cpp:456
#10 0x000003ffe569f863 in gbe::programNewFromSource (deviceID=358, source=<optimized out>, stringSize=1048576, 
    options=0x3ffffffd9b0 " -cl-std=CL1.1", err=0x2aaaaf20950 "", errSize=0x2aaaaeb8800)
    at /home/vi/src/git/beignet/backend/src/backend/program.cpp:1027
#11 0x000003ffeaab4f48 in cl_program_build (p=p@entry=0x2aaaaeb8770, options=0x3ffffffd9b0 " -cl-std=CL1.1")
    at /home/vi/src/git/beignet/src/cl_program.c:589
#12 0x000003ffeaaac426 in clBuildProgram (program=0x2aaaaeb8770, num_devices=<optimized out>, device_list=<optimized out>, options=<optimized out>, 
    pfn_notify=0x0, user_data=0x0) at /home/vi/src/git/beignet/src/cl_api.c:957
#13 0x000003fff78befee in clblast::CompileFromSource(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, clblast::Precision, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, clblast::Device const&, clblast::Context const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, unsigned long, bool) () from /mnt/src/git/CLBlast/build/libclblast.so.1
#14 0x000002aaaab23d96 in clblast::TestKernel(clblast::Device const&, clblast::Context const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, clblast::Precision) ()
#15 0x000002aaaab24538 in clblast::RunPreprocessor(int, char**, bool, clblast::Precision) ()
#16 0x000002aaaaacbaac in main ()

(gdb) bt full
#0  gbe::onFailedAssertion (msg=<optimized out>, file=<optimized out>, fn=<optimized out>, line=<optimized out>)
    at /home/vi/src/git/beignet/backend/src/sys/assert.cpp:76
        __PRETTY_FUNCTION__ = "void gbe::onFailedAssertion(const char*, const char*, const char*, int32_t)"
#1  0x000003ffe57d738e in gbe::GenEncoder::MUL (this=<optimized out>, dest=..., src0=..., src1=...)
    at /home/vi/src/git/beignet/backend/src/backend/gen_encoder.cpp:860
        __PRETTY_FUNCTION__ = "void gbe::GenEncoder::MUL(gbe::GenRegister, gbe::GenRegister, gbe::GenRegister)"
#2  0x000003ffe5790f18 in gbe::GenContext::emitBinaryInstruction (this=0x2aaab401780, insn=...)
    at /home/vi/src/git/beignet/backend/src/backend/gen_context.cpp:767
        dst = <optimized out>
        __PRETTY_FUNCTION__ = "virtual void gbe::GenContext::emitBinaryInstruction(const gbe::SelectionInstruction&)"
#3  0x000003ffe57b9a97 in gbe::GenContext::emitInstructionStream (this=this@entry=0x2aaab401780)
    at /home/vi/src/git/beignet/backend/src/./backend/gen_insn_selection.hxx:36
        opcode = <optimized out>
        insn = @0x2aaab7fa550: {<NonCopyable> = {<No data fields>}, <gbe::intrusive_list_node> = {next = 0x2aaab7fa710, prev = 0x2aaab7fa780}, 
          parent = 0x2aaab7e23a0, state = {physicalFlag = 1, flag = 0, subFlag = 0, grfFlag = 1, externFlag = 0, modFlag = 0, flagGen = 0, 
            execWidth = 16, quarterControl = 0, nibControl = 0, accWrEnable = 0, noMask = 0, predicate = 0, inversePredicate = 0, saturate = 0, 
            flagIndex = 0}, extra = {{function = 0, elem = 0}, {width = 0, vstride = 0, hstride = 0, offset = 0}, {scratchOffset = 0, 
              scratchMsgHeader = 0}, {bti = 0, msglen = 0, is3DWrite = 0}, {rdbti = 0, sampler = 0, rdmsglen = 0, isLD = false, isUniform = false}, {
              vme_bti = 0, msg_type = 0, vme_search_path_lut = 0, lut_sub = 0}, barrierType = 0, waitType = 0, longjmp = false, indirect_offset = 0, 
            {pointNum = 0, timestampType = 0}, {profilingType = 0, profilingBTI = 0}, {printfNum = 0, printfBTI = 0, continueFlag = 0, 
              printfSize = 0}, workgroupOp = 0}, opcode = 35 '#', dstNum = 1 '\001', srcNum = 2 '\002', index = 0, index1 = 0, ID = 44, DBGInfo = {
            line = 2877236880, col = 682}, regs = 0x2aaab7fa590}
        __for_range = @0x2aaab7e23b0: {<gbe::intrusive_list_base> = {m_root = {next = 0x2aaab7df7b0, prev = 0x2aaab687ca0}}, <No data fields>}
        block = @0x2aaab7e23a0: {<NonCopyable> = {<No data fields>}, <gbe::intrusive_list_node> = {next = 0x2aaab696800, prev = 0x2aaab7e2340}, 
          insnList = {<gbe::intrusive_list_base> = {m_root = {next = 0x2aaab7df7b0, prev = 0x2aaab687ca0}}, <No data fields>}, 
          vectorList = {<gbe::intrusive_list_base> = {m_root = {next = 0x2aaab299b90, prev = 0x2aaab7f25b0}}, <No data fields>}, 
          tmp = {<std::vector<gbe::ir::Register, gbe::Allocator<gbe::ir::Register> >> = std::vector of length 3, capacity 4 = {{unsafe = 83}, {
                unsafe = 84}, {unsafe = 86}}, <No data fields>}, bb = 0x2aaab802620, isLargeBlock = false, endifLabel = {unsafe = 6}, 
          endifOffset = -1, hasBarrier = false, hasBranch = false, removeSimpleIfEndif = false}
        __for_range = @0x2aaab74d480: {<gbe::intrusive_list_base> = {m_root = {next = 0x2aaab74d600, prev = 0x2aaab696860}}, <No data fields>}
        __PRETTY_FUNCTION__ = "void gbe::GenContext::emitInstructionStream()"
#4  0x000003ffe57b9f5a in gbe::GenContext::emitCode (this=0x2aaab401780) at /home/vi/src/git/beignet/backend/src/backend/gen_context.cpp:3858
        genKernel = 0x2aaab7e2500
#5  0x000003ffe568cd22 in gbe::Context::compileKernel (this=this@entry=0x2aaab401780) at /home/vi/src/git/beignet/backend/src/backend/context.cpp:389
No locals.
#6  0x000003ffe57cc2d5 in gbe::GenProgram::compileKernel (this=<optimized out>, unit=..., name="Xaxpy", relaxMath=<optimized out>, 
    profiling=<optimized out>) at /home/vi/src/git/beignet/backend/src/backend/gen_program.cpp:212
        simdWidth = 16
        limitRegisterPressure = false
        reservedSpillRegs = 0
        simdFn = 0x2aaab7f73d0
        fn = <optimized out>
        __PRETTY_FUNCTION__ = "virtual gbe::Kernel* gbe::GenProgram::compileKernel(const gbe::ir::Unit&, const string&, bool, int)"
        codeGenNum = 4
        codeGen = 0
        ctx = <optimized out>
        kernel = 0x0
#7  0x000003ffe56902b6 in gbe::Program::buildFromUnit (this=this@entry=0x2aaaaf00990, unit=..., error="")
    at /home/vi/src/git/beignet/backend/src/backend/program.cpp:188
        name = "Xaxpy"
        kernel = <optimized out>
        __for_range = @0x2aaab40ba38: {<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, gbe::ir::Function*, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, gbe::Allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, gbe::ir::Function*> > >> = std::map with 4 elements = {["Xaxpy"] = 0x2aaab7f73d0, 
            ["XaxpyBatched"] = 0x2aaab2cbe90, ["XaxpyFaster"] = 0x2aaaaee9db0, 
            ["XaxpyFastest"] = 0x2aaab5d3830}, <NonCopyable> = {<No data fields>}, <No data fields>}
        kernelNum = <optimized out>
        strictMath = <optimized out>
#8  0x000003ffe5690930 in gbe::Program::buildFromLLVMFile (this=this@entry=0x2aaaaf00990, fileName=fileName@entry=0x0, 
    module=module@entry=0x2aaaaf17bb0, error="", optLevel=optLevel@entry=1) at /home/vi/src/git/beignet/backend/src/backend/program.cpp:163
        error2 = ""
        unit = 0x2aaab40ba00
        cloned_module = 0x2aaab0385b0
        ret = false
        strictMath = <optimized out>
#9  0x000003ffe57cc985 in gbe::genProgramNewFromLLVM (deviceID=358, fileName=0x0, module=0x2aaaaf17bb0, llvm_ctx=0x2aaab3f8130, 
    asm_file_name=<optimized out>, stringSize=1048576, err=0x2aaaaf20950 "", errSize=0x2aaaaeb8800, optLevel=1, 
    options=0x3ffffffd9b0 " -cl-std=CL1.1") at /home/vi/src/git/beignet/backend/src/backend/gen_program.cpp:456
        fast_relaxed_math = <optimized out>
        error = ""
#10 0x000003ffe569f863 in gbe::programNewFromSource (deviceID=358, source=<optimized out>, stringSize=1048576, 
    options=0x3ffffffd9b0 " -cl-std=CL1.1", err=0x2aaaaf20950 "", errSize=0x2aaaaeb8800)
    at /home/vi/src/git/beignet/backend/src/backend/program.cpp:1027
        clangErrSize = 0
        optLevel = 1
        clOpt = std::vector of length 5, capacity 8 = {"-I/mnt/src/git/beignet/build/backend/src/libocl//usr/local/lib/beignet//include/", 
          "-D__OPENCL_C_VERSION__=110", "-cl-std=CL1.1", "-include", "ocl.h"}
        dumpLLVMFileName = ""
        dumpASMFileName = ""
        dumpSPIRBinaryName = ""
        p = <optimized out>
        out_module = 0x2aaaaf17bb0
        llvm_ctx = 0x2aaab3f8130
        llvm_mutex = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, 
                __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, <No data fields>}
#11 0x000003ffeaab4f48 in cl_program_build (p=p@entry=0x2aaaaeb8770, options=0x3ffffffd9b0 " -cl-std=CL1.1")
    at /home/vi/src/git/beignet/src/cl_program.c:589
        err = 0
        i = 0
        copyed = 0
#12 0x000003ffeaaac426 in clBuildProgram (program=0x2aaaaeb8770, num_devices=<optimized out>, device_list=<optimized out>, options=<optimized out>, 
    pfn_notify=0x0, user_data=0x0) at /home/vi/src/git/beignet/src/cl_api.c:957
        err = 0
        __PRETTY_FUNCTION__ = "clBuildProgram"
#13 0x000003fff78befee in clblast::CompileFromSource(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, clblast::Precision, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, clblast::Device const&, clblast::Context const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, unsigned long, bool) () from /mnt/src/git/CLBlast/build/libclblast.so.1
No symbol table info available.
#14 0x000002aaaab23d96 in clblast::TestKernel(clblast::Device const&, clblast::Context const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, clblast::Precision) ()
No symbol table info available.
#15 0x000002aaaab24538 in clblast::RunPreprocessor(int, char**, bool, clblast::Precision) ()
No symbol table info available.
#16 0x000002aaaaacbaac in main ()
No symbol table info available.
(gdb) quit
A debugging session is active.

	Inferior 1 [process 21753] will be killed.

Quit anyway? (y or n) y

@vi
Author

vi commented Jul 30, 2018

The first phase of tuning succeeded (there are 40 JSON files), but database.py failed:

$ python ../scripts/database/database.py . ..
[database] Loading database from '../scripts/database/database.json'
[database] Processing './clblast_transpose_3232.json' with 44 new items
[database] Processing './clblast_xgemm_12_3232.json' with 97 new items
[database] Processing './clblast_xgemv_fast_rot_3232.json' with 68 new items
[database] Processing './clblast_copy_32.json' with 128 new items
[database] Processing './clblast_xgemv_fast_3232.json' with 30 new items
[database] Processing './clblast_xdot_2_3232.json' with 5 new items
[database] Processing './clblast_xgemm_direct_1_3232.json' with 45 new items
[database] Processing './clblast_xger_32.json' with 108 new items
[database] Processing './clblast_padtranspose_3232.json' with 14 new items
[database] Processing './clblast_xdot_2_32.json' with 5 new items
[database] Processing './clblast_gemm_routine_32.json' with 31 new items
[database] Processing './clblast_padtranspose_32.json' with 16 new items
[database] Processing './clblast_pad_3232.json' with 72 new items
[database] Processing './clblast_xgemm_11_3232.json' with 374 new items
[database] Processing './clblast_xgemm_direct_2_32.json' with 125 new items
[database] Processing './clblast_copy_3232.json' with 128 new items
[database] Processing './clblast_xdot_1_3232.json' with 5 new items
[database] Processing './clblast_xgemm_12_32.json' with 93 new items
[database] Processing './clblast_xgemv_3232.json' with 12 new items
[database] Processing './clblast_xaxpy_32.json' with 64 new items
[database] Processing './clblast_xgemm_2_32.json' with 229 new items
[database] Processing './clblast_invert_3232.json' with 2 new items
[database] Processing './clblast_xgemv_fast_32.json' with 30 new items
[database] Processing './clblast_xdot_1_32.json' with 5 new items
[database] Processing './clblast_xgemv_fast_rot_32.json' with 68 new items
[database] Processing './clblast_xaxpy_3232.json' with 64 new items
[database] Processing './clblast_routine_xtrsv_32.json' with 4 new items
[database] Processing './clblast_xgemm_2_3232.json' with 155 new items
[database] Processing './clblast_routine_xtrsv_3232.json' with 4 new items
[database] Processing './clblast_xgemm_direct_1_32.json' with 45 new items
[database] Processing './clblast_xgemv_32.json' with 12 new items
[database] Processing './clblast_pad_32.json' with 72 new items
[database] Processing './clblast_xgemm_11_32.json' with 386 new items
[database] Processing './clblast_xger_3232.json' with 108 new items
[database] Processing './clblast_invert_32.json' with 2 new items
[database] Processing './clblast_xgemm_direct_2_3232.json' with 59 new items
[database] Processing './clblast_gemm_routine_3232.json' with 31 new items
[database] Processing './clblast_transpose_32.json' with 52 new items
[database] Processing './clblast_xgemm_1_32.json' with 558 new items
[database] Processing './clblast_xgemm_1_3232.json' with 490 new items
[database] Saving database to '../scripts/database/database.json'
[database] Calculating the best results per device/kernel...
[database] Calculating the default values...
[database] Producing a C++ database in '../src/database/kernels'...
[database] No results found for invert:16, retrieving defaults from invert:32
[database] No results found for invert:64, retrieving defaults from invert:32
[database] No results found for invert:6464, retrieving defaults from invert:32
[database] No results found for trsv_routine:16, retrieving defaults from trsv_routine:32
[database] No results found for trsv_routine:64, retrieving defaults from trsv_routine:32
[database] No results found for trsv_routine:6464, retrieving defaults from trsv_routine:32
Traceback (most recent call last):
  File "../scripts/database/database.py", line 185, in <module>
    main(sys.argv[1:])
  File "../scripts/database/database.py", line 179, in main
    clblast.print_cpp_database(database_best_results, cpp_database_path)
  File "/mnt/src/git/CLBlast/scripts/database/database/clblast.py", line 231, in print_cpp_database
    assert parameter_name == parameter_names[parameter_index]
AssertionError

@vi
Author

vi commented Jul 30, 2018

Results and output of the first phase of tuning: https://vi-server.org/pub/clblast_beignet_gen3_tuning.7z

@CNugteren
Owner

OK, thanks for the feedback.

ASSERTION FAILED: 0
at file /home/vi/src/git/beignet/backend/src/backend/gen_encoder.cpp, function virtual void gbe::GenEncoder::handleDouble(gbe::GenEncoder*, uint32_t, gbe::GenRegister, gbe::GenRegister, gbe::GenRegister), line 648
Trace/breakpoint trap

That's definitely a Beignet bug, so let's set it aside. This 'preprocessor' is not enabled for your GPU anyway, so a failed test won't harm you.

Good to see that the tuning also works! As for the Python script, I tried to reproduce the failure with your results but couldn't trigger it. Perhaps you have an old database on disk? You could try removing scripts/database/database.json and then retrying (the script will download the latest version).
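As an aside on the suggestion above: a cautious variant is to move the possibly stale file aside instead of deleting it, so it can still be inspected later. The following is only a minimal sketch; the helper name `set_aside_stale_database` and the `.bak` naming scheme are made up for illustration and are not part of CLBlast's scripts.

```python
import os
import shutil
import time


def set_aside_stale_database(path):
    """Rename an existing database file so the next run fetches a fresh copy.

    Hypothetical helper, not part of CLBlast. Returns the backup path,
    or None if there was no file to move.
    """
    if not os.path.exists(path):
        return None
    # Timestamped suffix keeps earlier backups from being overwritten.
    backup = "{}.{}.bak".format(path, time.strftime("%Y%m%d-%H%M%S"))
    shutil.move(path, backup)
    return backup
```

Calling it with `../scripts/database/database.json` before re-running `database.py` would have the same effect as the `rm`, assuming the script re-downloads the file whenever it is missing, as described in this thread.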

@vi
Author

vi commented Jul 30, 2018

After rm ../scripts/database/database.json it worked.

If the database format changes without the download URL changing, does that mean old CLBlast versions can no longer be tuned? Maybe it should download the database from the current commit rather than from master?

@CNugteren
Owner

Yes, you are right. Ideally it should be a git submodule or something similar. But it is not super urgent, I guess, because it is mostly power users who do this, and tuning once and then again a few months later is not a common use case.
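To illustrate the idea of pinning the download to a revision rather than a moving branch, here is a rough sketch. It assumes the database file is versioned in a way that a single revision name can select it, which this thread does not confirm; the URL template and both function names are hypothetical placeholders.

```python
import subprocess

# Hypothetical URL template; the real download location is not stated in this thread.
URL_TEMPLATE = "https://example.org/clblast-database/{ref}/database.json"


def current_commit(repo_dir="."):
    """Return the commit hash checked out in repo_dir (requires git on PATH)."""
    output = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, universal_newlines=True
    )
    return output.strip()


def database_url(ref, template=URL_TEMPLATE):
    """Build a download URL pinned to one revision instead of a moving branch."""
    if not ref or any(ch.isspace() for ch in ref):
        raise ValueError("ref must be a non-empty revision name without whitespace")
    return template.format(ref=ref)
```

For example, `database_url(current_commit())` would yield a URL that keeps pointing at the same database contents even after master moves on. A git submodule, as mentioned above, would achieve the same pinning without any URL handling.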

So, what I'll do now is add your new results to the latest master and also make a note that Beignet 1.2.1 is the one to go for with your device. And then we can close this issue, am I right?

@vi
Author

vi commented Jul 31, 2018

After the tuning, the tests seem to run longer:

Total Test time (real) = 709.90 sec
Total Test time (real) = 671.09 sec

What about the connection leak? It seems like CLBlast (or Beignet, or at least the tests and tuners) opens something and does not close it properly.
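One rough way to check for a leak like this, assuming the leaked resources show up as file descriptors of the process (which is not established in this thread), is to compare the descriptor count before and after repeating an operation. This sketch is Linux-only because it reads `/proc`; the function names are illustrative.

```python
import os


def count_open_fds(pid="self"):
    """Count entries in /proc/<pid>/fd (Linux only).

    The directory listing itself uses one transient descriptor, so the
    absolute number is off by one; differences between calls are unaffected.
    """
    return len(os.listdir("/proc/{}/fd".format(pid)))


def fd_growth(action, repetitions=10):
    """Run action repeatedly and return how many descriptors stayed open."""
    before = count_open_fds()
    for _ in range(repetitions):
        action()
    return count_open_fds() - before
```

A result that grows with `repetitions` points to something being opened and never closed; a result of zero suggests the operation cleans up after itself. For a separate process such as a test binary, `count_open_fds(pid)` can be sampled from outside while it runs.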

@vi
Author

vi commented Jul 31, 2018

So, what I'll do now is add your new results to the latest master and also make a note that Beignet 1.2.1 is the one to go for with your device. And then we can close this issue, am I right?

Seems OK. This issue is already a bit long and takes some browser resources to load and render. New issues can be opened for other problems, such as the connection leak.

@CNugteren
Owner

After the tuning, the tests seem to run longer:

Could very well be. The tests typically cover corner cases and very small matrices, so most of the time is actually taken by the CPU reference code and CPU-GPU copies, and only a bit by the (perhaps slower) kernels.

Since the main issue is solved, I'll close this indeed.
