
Compiling kernels at runtime (with caching) #31

Open
katyo opened this issue Feb 25, 2013 · 11 comments

Comments

@katyo
Contributor

katyo commented Feb 25, 2013

In the common case I can't build hashkill compatible with both NVIDIA and AMD on my build host without NVIDIA hardware, and building all kernels for all hardware platforms from source takes too long on my laptop (over three hours). I would like to implement runtime compilation of OpenCL programs, with caching of the compiled code, as a configure option, but I see a few problems here.
First, can the compile flags be moved into the kernel source?
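
For reference, a minimal sketch of how such runtime compilation with a binary cache could look, using only the standard OpenCL host API (the helper name, cache path handling and error handling are illustrative, not hashkill's actual code; a single-device context is assumed):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: load a cached binary if present, otherwise compile
   from source and write the device binary back to the cache file. */
static cl_program build_cached(cl_context ctx, cl_device_id dev,
                               const char *src, const char *cache_path,
                               const char *build_opts)
{
    cl_int err;
    cl_program prog = NULL;

    /* 1. Try the cached binary first. */
    FILE *f = fopen(cache_path, "rb");
    if (f) {
        fseek(f, 0, SEEK_END);
        size_t len = (size_t)ftell(f);
        fseek(f, 0, SEEK_SET);
        unsigned char *bin = malloc(len);
        fread(bin, 1, len, f);
        fclose(f);
        const unsigned char *bins[] = { bin };
        prog = clCreateProgramWithBinary(ctx, 1, &dev, &len, bins, NULL, &err);
        free(bin);
        if (err == CL_SUCCESS &&
            clBuildProgram(prog, 1, &dev, build_opts, NULL, NULL) == CL_SUCCESS)
            return prog;                      /* cache hit */
        if (prog) clReleaseProgram(prog);
    }

    /* 2. Cache miss: compile from source, then store the device binary. */
    prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    if (clBuildProgram(prog, 1, &dev, build_opts, NULL, NULL) != CL_SUCCESS) {
        clReleaseProgram(prog);
        return NULL;
    }
    size_t bin_size = 0;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);
    unsigned char *bin = malloc(bin_size);
    unsigned char *bins[] = { bin };
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
    f = fopen(cache_path, "wb");
    if (f) { fwrite(bin, 1, bin_size, f); fclose(f); }
    free(bin);
    return prog;
}

Per-program compile flags could then be passed through the build_opts argument, or parsed out of the .cl source itself if they were embedded there as a header comment.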

@gat3way
Owner

gat3way commented Feb 25, 2013

No, but that's not the biggest problem. Such a change would be a relatively long effort, because it would involve changing all the ocl_*.c sources, but that's probably for the better. What I am really concerned about is that it would become a problem when I start working on distributed attacks. My idea is to implement something like a lightweight version of VirtualCL, where only the needed subset of the OpenCL functionality is implemented using a custom networking protocol optimized for hashkill's needs (mostly rule attack optimizations).

In that case, managing cached precompiled kernels becomes problematic. Where do we compile the source? The correct answer would be on each node. This, however, would involve transferring the .cl sources, which are much larger in size compared to precompiled, compressed binaries. The time needed to build them on each host would also be high. Another approach would be to build them on the master host and then transfer the binaries to the slaves, but that is harder to implement and still involves some issues.

That's probably the most significant reason I am not yet eager to switch to cached binaries.

@katyo
Contributor Author

katyo commented Feb 26, 2013

Today there are many types of OpenCL architectures and platforms, so we would have to compile every program for each platform. I made some estimates and concluded that the total overhead is too big.

Source (uncompressed):

$ ls src/kernels/*.cl | wc -l
198
$ wc -c src/kernels/*.cl | grep 'total$'
10736247 total

198 sources, ~11 MB

Source (compressed with xz):

$ tar -cJf - src/kernels/*.cl | wc -c
118212

~118 KB

Binary (all types of platforms):

$ ls src/kernels/compiler/*.{bin,ptx} | wc -l
2409
$ wc -c src/kernels/compiler/*.{bin,ptx} | grep 'total$'
28532285 total

2409 binaries, ~29 MB

Besides, on a real workstation where the program actually runs, we usually don't need to compile for all CL platforms; in the common case it's 1-3 different OpenCL platforms, depending on the hardware.
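
For example, enumerating only the platforms and devices actually present is trivial (a generic sketch, not hashkill code):

#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_uint nplat = 0;
    cl_platform_id plats[8];
    clGetPlatformIDs(8, plats, &nplat);          /* rarely more than 1-3 */

    for (cl_uint p = 0; p < nplat && p < 8; p++) {
        char name[256];
        cl_uint ndev = 0;
        clGetPlatformInfo(plats[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_GPU, 0, NULL, &ndev);
        printf("%s: %u GPU device(s) -> only these need compiled kernels\n",
               name, ndev);
    }
    return 0;
}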

@gat3way
Owner

gat3way commented Feb 26, 2013

Are you talking about the disk space overhead or the eventual networking overhead when clustering?

@katyo
Contributor Author

katyo commented Feb 26, 2013

It depends on what clustering means for us. Maybe we are talking about different things…

In my view, each machine in the cluster has its own CL hardware, which in the common case differs from the others'. I think each cluster node may want to compile only the OpenCL programs needed for the hardware platforms it actually has, and only the first time the corresponding programs are required. With caching, that is not a big runtime overhead.
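
Keying the cache per device would make each node cache only what its own hardware needs; a sketch of what I mean (the hash function and path layout are only illustrative):

#include <CL/cl.h>
#include <stdio.h>

/* djb2 string hash, just to keep the sketch self-contained */
static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Hypothetical cache key: device name + driver version + kernel source,
   so a node only ever caches binaries for its own hardware/driver combo. */
static void cache_path_for(cl_device_id dev, const char *kernel_src,
                           char *out, size_t out_len)
{
    char devname[256] = "", driver[128] = "";
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(devname), devname, NULL);
    clGetDeviceInfo(dev, CL_DRIVER_VERSION, sizeof(driver), driver, NULL);
    snprintf(out, out_len, "kernel-cache/%lx-%lx-%lx.bin",
             djb2(devname), djb2(driver), djb2(kernel_src));
}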

@gat3way
Owner

gat3way commented Feb 26, 2013

Correct, and there goes the problem. The VCL approach works by queueing OpenCL commands to remote hosts and then receiving the function output over the network. The master host "sees" the remote hosts' devices as local GPUs, and to hashkill that's transparent (there is a "translation" layer which queues requests to the remote hosts).

Now the first problem is that remote hosts do not share common storage with the master host, so the master host does not "know" whether the kernel binary is cached on the remote host. The protocol needs to be extended to account for that, but whatever you do, hashkill itself would need to be changed to accommodate such a change.

The other problem is related to proper error handling in the context of clustered hashkill. Building from source is always more prone to different OS- and driver-related issues than loading precompiled kernels. Nodes are likely going to run different driver versions, Linux flavors, etc. What strategy should we take if we have a node that failed to compile a kernel from source? This question is hard even if precompiled binaries were sent, though.
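
Whatever policy we end up with, the failure at least has to be detected and surfaced per device; a minimal sketch using the standard build-log query (the skip-and-report behaviour here is only an illustration, not current hashkill behaviour):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static int build_or_skip(cl_program prog, cl_device_id dev, const char *opts)
{
    if (clBuildProgram(prog, 1, &dev, opts, NULL, NULL) == CL_SUCCESS)
        return 0;

    /* Fetch the build log so the master can at least see why this node failed */
    size_t len = 0;
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
    char *log = malloc(len + 1);
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, len, log, NULL);
    log[len] = '\0';
    fprintf(stderr, "kernel build failed on this node:\n%s\n", log);
    free(log);
    return -1;   /* caller marks this device/node as skipped for this kernel */
}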

Having prebuilt binaries on the master node also guarantees that all the nodes will execute one and the same code, which eliminates some issues originating from differing OpenCL runtime implementations. The OpenCL compiler frontend itself (at least with AMD) is notably buggy, with crashes and wrong binaries produced quite often. It is hard to settle on a stable version because there is no stable version. What we can do is focus on the best one and try to work around compiler bugs (which sometimes even requires disabling compiler optimizations or changing the code in funny ways).

Well, unfortunately it gets quite complicated when you try to distribute that. In fact I still have a lot of unresolved design dilemmas regarding that :(

@r3mbr4ndt

There is no gain in "on-demand" compiling.
Compiling once and distributing the kernels is always faster than "on-demand" compiling, even if you have a caching system. Compiling some kernels can take quite a while, and gat3way also explained the drawbacks in the SDKs. I would vote against "on-demand" compiling; it was removed a long time ago. You don't want to wait minutes before hashkill starts just because you'd like to crack a RAR file... (not knowing whether the compiler is stuck or your computer has crashed...).

@peterclemenko

You can apparently support offline compiling for all AMD cards, including ones that aren't plugged in, with a build change.

http://devgurus.amd.com/thread/153189
http://devgurus.amd.com/thread/166543
http://developer.amd.com/resources/documentation-articles/knowledge-base/?ID=115

@gat3way
Owner

gat3way commented Nov 11, 2013

Offline devices compilation is already used - this is how we build for AMD. For NVIDIA, unfortunately, it's not that easy: you can only compile for architectures up to the one already available on the system (e.g. if you have an sm_20 GPU, you can build sm_1x and sm_20 but obviously not sm_21 or sm_30).
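
For reference, this is roughly how the AMD offline-device compilation is enabled, assuming AMD's cl_amd_offline_devices extension and its CL_CONTEXT_OFFLINE_DEVICES_AMD context property from CL/cl_ext.h (a sketch, not the exact hashkill build code):

#include <CL/cl.h>
#include <CL/cl_ext.h>

static cl_context offline_amd_context(cl_platform_id amd_platform)
{
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)amd_platform,
        CL_CONTEXT_OFFLINE_DEVICES_AMD, (cl_context_properties)1,
        0
    };
    cl_int err;
    /* With the offline-devices property set, the context exposes every GPU the
       AMD runtime knows about, not just the ones physically installed. */
    return clCreateContextFromType(props, CL_DEVICE_TYPE_ALL, NULL, NULL, &err);
}

NVIDIA's OpenCL runtime has no equivalent, which is why the sm_2x/sm_3x limitation above applies.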

@peterclemenko

Maybe this will help: https://github.com/ljbade/clcc

@gat3way
Owner

gat3way commented Nov 11, 2013

Nope, it does the same (the limitations are in the NV OpenCL runtime).

@peterclemenko

Damn, that sucks. Is there a way to do it at runtime?
