
Compiling kernels at runtime (with caching) #31

Open
katyo opened this issue Feb 25, 2013 · 11 comments

Comments

@katyo
Contributor

katyo commented Feb 25, 2013

In the common case I can't build hashkill compatible with both NVIDIA and AMD on my build host without NVIDIA hardware, and building all kernels for all hardware platforms from source takes too long on my laptop (over three hours). I would like to implement runtime compilation of OpenCL programs, with caching of the compiled code, as a configure option, but I see a few problems here.
First, can the compile flags be moved into the kernel source?
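
For reference, a minimal sketch of how such runtime compilation with a binary cache could look, using only the standard OpenCL host API (the helper name, cache path handling and error handling are illustrative, not hashkill's actual code; a single-device context is assumed):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: load a cached binary if present, otherwise compile
   from source and write the device binary back to the cache file. */
static cl_program build_cached(cl_context ctx, cl_device_id dev,
                               const char *src, const char *cache_path,
                               const char *build_opts)
{
    cl_int err;
    cl_program prog = NULL;

    /* 1. Try the cached binary first. */
    FILE *f = fopen(cache_path, "rb");
    if (f) {
        fseek(f, 0, SEEK_END);
        size_t len = (size_t)ftell(f);
        fseek(f, 0, SEEK_SET);
        unsigned char *bin = malloc(len);
        fread(bin, 1, len, f);
        fclose(f);
        const unsigned char *bins[] = { bin };
        prog = clCreateProgramWithBinary(ctx, 1, &dev, &len, bins, NULL, &err);
        free(bin);
        if (err == CL_SUCCESS &&
            clBuildProgram(prog, 1, &dev, build_opts, NULL, NULL) == CL_SUCCESS)
            return prog;                      /* cache hit */
        if (prog) clReleaseProgram(prog);
    }

    /* 2. Cache miss: compile from source, then store the device binary. */
    prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS) return NULL;
    if (clBuildProgram(prog, 1, &dev, build_opts, NULL, NULL) != CL_SUCCESS) {
        clReleaseProgram(prog);
        return NULL;
    }
    size_t bin_size = 0;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);
    unsigned char *bin = malloc(bin_size);
    unsigned char *bins[] = { bin };
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
    f = fopen(cache_path, "wb");
    if (f) { fwrite(bin, 1, bin_size, f); fclose(f); }
    free(bin);
    return prog;
}

Per-program compile flags could then be passed through the build_opts argument, or parsed out of the .cl source itself if they were embedded there as a header comment.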

@gat3way
Owner

gat3way commented Feb 25, 2013

No, but that's not the biggest problem. Such a change would be a relatively long effort, because it would involve changing all the ocl_*.c sources, but that's probably for the better. What I am really concerned about is that it would become a problem when I start working on distributed attacks. My idea is to implement something like a lightweight version of VirtualCL, where only the needed subset of the OpenCL functionality is implemented using a custom networking protocol optimized for hashkill's needs (mostly rule attack optimizations).

In that case, managing cached precompiled kernels becomes problematic. Where do we compile the source? The correct answer would be on each node. This, however, would involve transferring the .cl sources, which are much larger in size compared to precompiled, compressed binaries. The time needed to build them on each host would also be high. Another approach would be to build them on the master host and then transfer the binaries to the slaves, but that is harder to implement and still involves some issues.

That's probably the most significant reason I am not yet eager to switch to cached binaries.

@katyo
Contributor Author

katyo commented Feb 26, 2013

Today there are many types of OpenCL architectures and platforms, so we would have to compile every program for each platform. I made some estimates and concluded that the total overhead is too big.

Source (uncompressed):

$ ls src/kernels/*.cl | wc -l
198
$ wc -c src/kernels/*.cl | grep 'total$'
10736247 total

198 sources, ~11 MB

Source (compressed with xz):

$ tar -cJf - src/kernels/*.cl | wc -c
118212

~118 KB

Binary (all types of platforms):

$ ls src/kernels/compiler/*.{bin,ptx} | wc -l
2409
$ wc -c src/kernels/compiler/*.{bin,ptx} | grep 'total$'
28532285 total

2409 binaries, ~29 MB

Besides, on a real workstation where the program actually runs, we usually don't need to compile for all CL platforms; in the common case it's 1-3 different OpenCL platforms, depending on the hardware.
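
For example, enumerating only the platforms and devices actually present is trivial (a generic sketch, not hashkill code):

#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_uint nplat = 0;
    cl_platform_id plats[8];
    clGetPlatformIDs(8, plats, &nplat);          /* rarely more than 1-3 */

    for (cl_uint p = 0; p < nplat && p < 8; p++) {
        char name[256];
        cl_uint ndev = 0;
        clGetPlatformInfo(plats[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_GPU, 0, NULL, &ndev);
        printf("%s: %u GPU device(s) -> only these need compiled kernels\n",
               name, ndev);
    }
    return 0;
}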

@gat3way
Owner

gat3way commented Feb 26, 2013

Are you talking about the disk space overhead or the eventual networking overhead when clustering?

@katyo
Contributor Author

katyo commented Feb 26, 2013

It depends on what clustering means for us. Maybe we are talking about different things…

In my view, each machine in the cluster has its own CL hardware, which in the common case differs from the others'. I think each cluster node may want to compile only the OpenCL programs needed for the hardware platforms it actually has, and only the first time the corresponding programs are required. With caching, that is not a big runtime overhead.
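
Keying the cache per device would make each node cache only what its own hardware needs; a sketch of what I mean (the hash function and path layout are only illustrative):

#include <CL/cl.h>
#include <stdio.h>

/* djb2 string hash, just to keep the sketch self-contained */
static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Hypothetical cache key: device name + driver version + kernel source,
   so a node only ever caches binaries for its own hardware/driver combo. */
static void cache_path_for(cl_device_id dev, const char *kernel_src,
                           char *out, size_t out_len)
{
    char devname[256] = "", driver[128] = "";
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(devname), devname, NULL);
    clGetDeviceInfo(dev, CL_DRIVER_VERSION, sizeof(driver), driver, NULL);
    snprintf(out, out_len, "kernel-cache/%lx-%lx-%lx.bin",
             djb2(devname), djb2(driver), djb2(kernel_src));
}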

@gat3way
Owner

gat3way commented Feb 26, 2013

Correct, and there goes the problem. The VCL approach works by queueing OpenCL commands to remote hosts and then receiving the function output over the network. The master host "sees" the remote hosts' devices as local GPUs, and to hashkill that's transparent (there is a "translation" layer which queues requests to the remote hosts).

Now the first problem is that remote hosts do not share common storage with the master host, so the master host does not "know" whether the kernel binary is cached on the remote host. The protocol needs to be extended to account for that, but whatever you do, hashkill itself would need to be changed to accommodate such a change.

The other problem is related to proper error handling in the context of clustered hashkill. Building from source is always more prone to different OS- and driver-related issues than loading precompiled kernels. Nodes are likely going to run different driver versions, Linux flavors, etc. What strategy should we take if we have a node that failed to compile a kernel from source? This question is hard even if precompiled binaries were sent, though.
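
Whatever policy we end up with, the failure at least has to be detected and surfaced per device; a minimal sketch using the standard build-log query (the skip-and-report behaviour here is only an illustration, not current hashkill behaviour):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static int build_or_skip(cl_program prog, cl_device_id dev, const char *opts)
{
    if (clBuildProgram(prog, 1, &dev, opts, NULL, NULL) == CL_SUCCESS)
        return 0;

    /* Fetch the build log so the master can at least see why this node failed */
    size_t len = 0;
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
    char *log = malloc(len + 1);
    clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, len, log, NULL);
    log[len] = '\0';
    fprintf(stderr, "kernel build failed on this node:\n%s\n", log);
    free(log);
    return -1;   /* caller marks this device/node as skipped for this kernel */
}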

Having prebuilt binaries on the master node also guarantees that all the nodes will execute one and the same code, which eliminates some issues originating from differing OpenCL runtime implementations. The OpenCL compiler frontend itself (at least with AMD) is notably buggy, with crashes and wrong binaries produced quite often. It is hard to settle on a stable version because there is no stable version. What we can do is focus on the best one and try to work around compiler bugs (which sometimes even requires disabling compiler optimizations or changing the code in funny ways).

Well, unfortunately it gets quite complicated when you try to distribute that. In fact I still have a lot of unresolved design dilemmas regarding that :(

@r3mbr4ndt

There is no gain in "on-demand" compiling.
Compiling once and distributing the kernels is always faster than "on-demand" compiling, even if you have a caching system. Compiling some kernels can take quite a while, and gat3way also explained the drawbacks in the SDKs. I would vote against "on-demand" compiling; it was removed a long time ago. You don't want to wait minutes before hashkill starts just because you'd like to crack a RAR file... (not knowing whether the compiler is stuck or your computer has crashed...).

@peterclemenko

You can apparently support offline compiling for all AMD cards, including ones that aren't plugged in, with a build change.

http://devgurus.amd.com/thread/153189
http://devgurus.amd.com/thread/166543
http://developer.amd.com/resources/documentation-articles/knowledge-base/?ID=115

@gat3way
Owner

gat3way commented Nov 11, 2013

Offline devices compilation is already used - this is how we build for AMD. For NVIDIA, unfortunately, it's not that easy: you can only compile for architectures up to the one already available on the system (e.g. if you have an sm_20 GPU, you can build sm_1x and sm_20 but obviously not sm_21 or sm_30).
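
For reference, this is roughly how the AMD offline-device compilation is enabled, assuming AMD's cl_amd_offline_devices extension and its CL_CONTEXT_OFFLINE_DEVICES_AMD context property from CL/cl_ext.h (a sketch, not the exact hashkill build code):

#include <CL/cl.h>
#include <CL/cl_ext.h>

static cl_context offline_amd_context(cl_platform_id amd_platform)
{
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)amd_platform,
        CL_CONTEXT_OFFLINE_DEVICES_AMD, (cl_context_properties)1,
        0
    };
    cl_int err;
    /* With the offline-devices property set, the context exposes every GPU the
       AMD runtime knows about, not just the ones physically installed. */
    return clCreateContextFromType(props, CL_DEVICE_TYPE_ALL, NULL, NULL, &err);
}

NVIDIA's OpenCL runtime has no equivalent, which is why the sm_2x/sm_3x limitation above applies.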

@peterclemenko

Maybe this will help: https://github.com/ljbade/clcc

@gat3way
Owner

gat3way commented Nov 11, 2013

Nope, it does the same (the limitations are in the NV OpenCL runtime).

@peterclemenko

Damn, that sucks. Is there a way to do it at runtime?
