Support is arriving in more and more Windows GPU drivers for rendering to non-window device contexts, such as the 0 (aka Desktop) context. This is good news for 1k democoding, because you no longer have to initialize a window and implement a message pump. However, as with anything, there are pros and cons to this approach:
| Pro | Con |
|---|---|
| Saving hundreds of bytes | Support is very, very sparse |
| A small number of variables makes register re-use more viable | Implementations are unstable and incompatible with most screen configurations |
Kafka was developed on an AMD A6 APU with both an R5 200-series card and an R5 engineering sample. Only one display may be connected, or any external displays must be switched to exclusive mode. This is a walkthrough of how to create the smallest possible framework for clocked GLSL shaders. At the end of this, we'll have an OpenGL context displaying a shader in just under 520 bytes.
Shader examples assume the relaxed AMD GLSL flavor, which usually allows for uninitialized variables and non-explicit constant float expressions.
- A C++ compiler. I'll use MSVC here. Kafka uses a single `*.cpp` file, so Visual Studio (or a VS solution) is not necessary.
- The Windows SDK.
- Crinkler (download and put `crinkler.exe` in your project directory, renamed to `link.exe`).
- Basic C/++ knowledge and quite a bit of Assembly knowledge.
The idea here is to prove that one can create extremely small demos without writing the whole thing in pure Assembly. The basic skill is to know how compilers work in order to write C/++ code that produces the Assembly you want. Inline Assembly is only used when
- we know the compiler will make a mistake¹, or
- it is inevitable, because we need to get rid of one more byte.
1 - "Mistake" means the compiler generates overhead in the Assembly (type checks, overflow handling etc.)
We only need these macros:
#define RUNTIME 3
#define native(t) __declspec(dllimport) t __stdcall
#define GLExt(a) a(wglGetProcAddress(#a))
- `RUNTIME` is the demo runtime in multiples of 64 seconds. Because we need to avoid WinAPI overhead, we abuse some OpenGL functions to pass the current time to the main shader. However, this means we can only pass two bytes, so an unscaled timer would wrap at ~64 seconds. To prevent this, we scale the current time on the CPU side and unscale it in the shader (a worked example of the scaling follows below). This is still smaller than using `uniform`s or GLSL offsets.
- `native()` is just a shorthand to import only the WinAPI functions we need. Never use a standard include file. In fact, never include anything.
- `GLExt()` is a shorthand for accessing GLSL extensions.
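To make the two-byte limitation concrete, here is a small worked example of the scaling. This is illustrative arithmetic only (none of these variables exist in the framework), assuming `RUNTIME` is 3:

#include <cstdio>
#define RUNTIME 3
int main() {
// glColor3us() takes unsigned shorts, so the largest value we can smuggle through
// a color channel is 65535. GetTickCount() deltas are in milliseconds, so an
// unscaled timer would wrap after ~65.5 seconds.
unsigned elapsedMs = 150000;                  // example: 150 s into the demo
unsigned short red = elapsedMs / RUNTIME;     // 50000 -- still fits into 16 bits
float glColor_r = red / 65535.0f;             // what the shader sees in gl_Color.r
float shaderTime = glColor_r * RUNTIME * 64;  // the unscaling done in the shader, ~146.5
printf("time seen by the shader: %.1f s\n", shaderTime);
// The recovered time runs ~2% slow (a factor of 64/65.535), which is fine for a
// demo, and the usable range is now RUNTIME * 64 seconds instead of ~65.5 seconds.
}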
Now define all needed types and import the WinAPI functions:
extern "C" {
native(int) wglGetProcAddress(const char*);
native(int) wglMakeCurrent(int, int);
struct BUTTER_PFDC { int a, b; };
native(int) ChoosePixelFormat(int, BUTTER_PFDC*);
native(int) SetPixelFormat(int, int, BUTTER_PFDC*);
native(int) wglCreateContext(int);
native(int) GetDC(int);
native(int) GetAsyncKeyState(int);
native(int) SetCursorPos(int, int);
native(void) SwapBuffers(int);
native(void) ExitProcess(int);
native(void) glColor3us(unsigned short, unsigned short, unsigned short);
native(void) glRects(short, short, short, short);
native(int) GetTickCount();
typedef void(__stdcall*glUseProgram)(int);
typedef int(__stdcall*glCreateShaderProgramv)(int, int, const char**);
static BUTTER_PFDC pfd = { 0, 37 };
}
`struct BUTTER_PFDC { int a, b; };` is a well-known alignment hack for the `PIXELFORMATDESCRIPTOR` struct. Turns out two `int`s is all you need.
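For reference, here is roughly how those two `int`s line up with the start of the real descriptor. The field layout and flag values are from `wingdi.h`; everything past `dwFlags` is simply whatever memory happens to follow the struct, which is exactly the gamble this hack takes:

// PIXELFORMATDESCRIPTOR begins with:
//   WORD  nSize;    \  together covered by pfd.a
//   WORD  nVersion; /  (left at 0 -- strictly wrong, but tolerated by forgiving drivers)
//   DWORD dwFlags;  -> covered by pfd.b
// 37 == 0x25 == PFD_DOUBLEBUFFER (0x01) | PFD_DRAW_TO_WINDOW (0x04) | PFD_SUPPORT_OPENGL (0x20)
static_assert(37 == (0x01 | 0x04 | 0x20), "dwFlags sanity check");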
Next, declare the entry point:
void entrypoint(void) {
// we are now talking about this part
}
Now the second trick. Instead of creating a Window, we just render directly onto the desktop. Some GPUs do not support this:
const auto hDC = GetDC(0);
SetPixelFormat(hDC, ChoosePixelFormat(hDC, &pfd), &pfd);
Notice how the DC handle is declared as a `const`. That means the actual handle won't be stored as a variable, but rather remains in one of the 32-bit registers the whole time. Compile the above code and inspect the assembly listing. You'll see that `hDC` is stored in `EDI`. So now we know that `EDI` will always contain a positive, non-zero value which is also the DC. This will come in handy.
Using this knowledge, we can spare a few bytes on the actual OpenGL context initialization by re-using the register instead of creating a variable:
_asm {
push edi;
call DWORD PTR wglCreateContext;
push eax;
push edi;
call wglMakeCurrent;
}
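For comparison, the plain C++ version of those two calls is just the line below; it behaves identically, but lets the compiler shuffle the handle through a stack slot or another register instead of re-pushing `EDI`, which is where the extra bytes would come from:

// What the inline assembly above corresponds to in plain C++:
wglMakeCurrent(hDC, wglCreateContext(hDC));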
That's the OpenGL context done. Now it's time to write the shader. Use GLSLSandbox or any compatible GLSL editor to create your shader. Then just replace the `uniform float time` with `float time=gl_Color.r*192`, where `192` is the runtime in seconds (`RUNTIME` * 64). For now, let's stick to this minimal shader:
static auto fragmentShader = "void main(){gl_FragColor=vec4(.5);}";
I.e. a grey solid. Make sure you declare the shader source as a `static` variable, so it'll be stored as data in the assembly. Compiling and selecting the shader is easy:
GLExt(glUseProgram)(GLExt(glCreateShaderProgramv)(0x8B30, 1, &fragmentShader));
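Two things are worth spelling out here: `0x8B30` is the value of `GL_FRAGMENT_SHADER`, and the `GLExt()` macro turns each name into a `wglGetProcAddress` lookup cast to the matching function-pointer typedef. Expanded by hand (purely for illustration), the one-liner becomes:

// Preprocessor expansion of the line above, broken up for readability:
glUseProgram(wglGetProcAddress("glUseProgram"))(
    glCreateShaderProgramv(wglGetProcAddress("glCreateShaderProgramv"))(
        0x8B30 /* GL_FRAGMENT_SHADER */, 1, &fragmentShader));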
That's the "creative" part done. Now on to the timing routine. First, get the current time after the shader init:
const auto startTime = GetTickCount();
Oh look, another `const`! Where this is stored depends on the main loop code. For now, it is stored in `EBX`. Now, since we don't have a window, we can't use `ShowCursor`, because that function only works on the current window's DC.
But we can move the cursor off-screen. We know that `EBX` holds a positive value (the current uptime in milliseconds), which is almost certainly bigger than any screen's vertical resolution. So we use that to clip the cursor to the bottom of the screen, where it is invisible:
_asm {
push ebx;
push 0;
call SetCursorPos;
}
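In plain C++ terms, that block is simply the call below; the inline assembly is there presumably because pushing `EBX` directly is smaller than whatever the compiler would emit to re-materialize `startTime`:

// What the inline assembly above boils down to:
SetCursorPos(0, startTime); // x = 0, y = uptime in ms -> clipped to the bottom edge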
On a 1080p screen, there's only a 1.08 second window every 50 days where this won't work. Now the main loop:
loop:
auto elapsed = GetTickCount() - startTime;
if (RUNTIME * 64000 < elapsed) goto panic;
glColor3us(elapsed / RUNTIME, RUNTIME, RUNTIME);
_asm {
push edi;
push edi;
push -1;
push -1;
call glRects;
}
SwapBuffers(hDC);
if(!GetAsyncKeyState(27)) goto loop;
Let's analyze what happens here:
- We use a simple goto label for the loop to prevent the compiler from doing something stupid.
- `elapsed` holds the elapsed time in ms. This isn't a real variable, it's just stored in `EAX` and re-used in the following statements.
- If the runtime has elapsed, we exit. The exit is clean (more on that later). You can always make a graceful exit by jumping to `panic`. If you don't do this (e.g. by crashing the demo), the desktop DC will be fucked.
- We pass the time (scaled by the runtime) to the shader using the red color component of `glColor3us`. Pay attention to the other color components: even though we don't use them, we pass the "same" value three times. This results in three identical `push` instructions, which will always compress at least one byte better than using other values (e.g. `0`).
- Remember `EDI`? Again, we make redundant `push` instructions by using `glRects(-1, -1, edi, edi)` to lower the entropy. The last two parameters are required to be greater than or equal to `1`, and hDC is basically guaranteed to be exactly that. (A plain-C++ sketch of the whole loop body follows below.)
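For reference, here is the loop body with the inline assembly written out as the call it performs. This sketch is only to make the control flow obvious; the real code keeps the `_asm` block precisely because the redundant pushes compress better:

// Reference-only version of the loop body (the real code uses the _asm block for glRects):
loop:
auto elapsed = GetTickCount() - startTime;
if (RUNTIME * 64000 < elapsed) goto panic;
glColor3us(elapsed / RUNTIME, RUNTIME, RUNTIME); // time enters the shader via the red channel
glRects(-1, -1, hDC, hDC);                       // fullscreen quad; hDC >= 1 in practice
SwapBuffers(hDC);
if (!GetAsyncKeyState(27)) goto loop;            // 27 == VK_ESCAPE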
You're thinking `ExitProcess(0)`, right? But `push 0` is not very common in our code. So what is? `push edi` ;-)
panic:
_asm {
push edi;
call ExitProcess;
}
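In plain C++, the panic block amounts to the call below; any exit code is fine for a demo, and `EDI` happens to be the cheapest value to push:

// What the panic block amounts to:
ExitProcess(hDC); // the exit code is the DC handle -- irrelevant, but push edi compresses best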
Compile:
"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\cl.exe" /c /O2 /Ob1 /Oi /Os /FAs demo.cpp
Every option here is essential, so don't fool around with this. Now link:
link.exe /OUT:".\demo.exe" opengl32.lib glu32.lib winmm.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /SUBSYSTEM:WINDOWS /ENTRY:"entrypoint" /CRINKLER /HASHTRIES:4096 /COMPMODE:SLOW /ORDERTRIES:8192 /TINYHEADER /TINYIMPORT /UNALIGNCODE /REPORT:.\demo.html /SATURATE /UNSAFEIMPORT /LIBPATH:"C:\Program Files (x86)\Windows Kits\8.1\Lib\winv6.3\um\x86" /RANGE:opengl32 .\demo.obj
This will most likely produce the smallest code no matter what you do. In fact, for our solid grey shader above, the final size (runtime = 3) is 516 bytes:
Linking...
Uncompressed size of code: 280
Uncompressed size of data: 115
|-- Estimating models -------------------------------------|
............................................................ 0m01s
Estimated ideal compressed size: 281.64
Reordering sections...
Iteration: 0 Size: 281.64
Iteration: 15 Size: 280.84
Time spent: 0m08s
|-- Reestimating models -----------------------------------|
............................................................ 0m00s
Reestimated ideal compressed size: 280.81
Output file: .\demo.exe
Final file size: 516 (no change)
time spent: 0m20s
When writing a shader for this framework you have two goals:
- Make the GLSL code as short as possible. Read up on the math (and use Wolfram Alpha to optimize formulas using `simplify[...]`). Don't rely on "minifiers".
- At the same time, keep your code as redundant as possible.
Remember to
- Replace `uniform float time;` with `float time=gl_Color.r*<RUNTIME*64>`.
- Replace `uniform vec2 resolution;` with a constant `vec2` (see the sketch below).
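As a concrete sketch of both replacements (assuming `RUNTIME` is 3 and a 1366x768 screen; substitute your own values), a GLSLSandbox-style shader header turns into something like this hypothetical example:

static auto fragmentShader =
    "float time=gl_Color.r*192;"        // was: uniform float time;
    "vec2 resolution=vec2(1366,768);"   // was: uniform vec2 resolution;
    "void main(){gl_FragColor=vec4(sin(time));}";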
The included `demo.cpp` contains this shader (472 bytes):
float b=gl_Color.r*192,g,r,v,q;vec4 s(vec2 v){g=length(v);q=abs(sin((atan(v.g,v.r)-g+b)*9)*.1)+.1;return min(vec4(1),vec4(.05/abs(q-g/3),.04/abs(q-g/2),.03/abs(q-g*.7),1));}float n(vec3 v){return 1-dot(abs(v),vec3(0,1,0))-length(s(v.rb).rgb)/2*sin(b*2)+(sin(5*(v.b+b))+sin(5*(v.r+b)))*.1;}void main(){vec3 m=vec3(-1+2*(gl_FragCoord.rg/vec2(1366,768)),1),a=vec3(0,0,-2);for(;r<60;r++)g=n(a+m*v),v+=g*.125;gl_FragColor=vec4(v/2)*s((v*m+a).rb)+v*.1*vec4(1,2,3,4)/2*n(v*m+a);}
The final size for this (runtime = 3) with `vec2(1366,768)` as the resolution (my crappy screen) is 749 bytes:
Linking...
Uncompressed size of code: 280
Uncompressed size of data: 552
Estimated ideal compressed size: 514.05
Reordering sections...
Iteration: 0 Size: 514.06
Iteration: 15 Size: 513.54
Time spent: 0m21s
Reestimated ideal compressed size: 513.53
Output file: .\demo.exe
Final file size: 749
time spent: 0m34s