Fixing some myths around Kawase blur; WAS: Software (xrender) multistage box blur with configurable stage numbers (essentially implements kawase) #954
-
(I haven't read the whole post yet, but judging from your title:) box blur already supports multistage blur. I am not sure if multistage blur for kawase would make sense at all (cc @tryone144)
-
Also, btw, xrender isn't software rendering for most people. On recent enough hardware and software, xrender is accelerated too (well, for most operations it is). And like @tryone144 said, you are not dealing with pixel data when using xrender. I am interested: what is your end goal with this approach? Is it because blur in xrender is too slow?
-
Summary
Edit since 03.12.2022:
It seems I mistakenly took at face value a description of Kawase blur as a 3x box blur, where the main advantage of the box filter was said to be its high efficiency when implemented as a sliding average. That was in the context of various requests to bring the highly efficient blur that previously existed in xcompmgr back to its later forks.
In summary, the distinguishing feature of the kawase filter is that it exploits hardware features such as sampling between texels for deeper acceleration (the average of 2 or 4 texels is computed right in the fetch operation). Some tricks with scaling additionally multiply the efficiency.
There is a document with an explanation of kawase blur.
As for 3x box filtering: a sliding average filter just can't be implemented with the existing OpenGL / GLSL interfaces, as long as per-pixel fragment shader invocations have no way to coordinate or to write (not only read) pixels other than the current one, and GL does not allow controlling the order in which texels are processed (e.g. requiring sequential processing within some texel range, such as a row, a column or just a block).
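To illustrate the "average of 2 or 4 texels right in the fetch operation" point, here is a minimal CPU emulation of a bilinear texture fetch. It is only a sketch: the function name `sample_bilinear`, the single-channel float image layout and the clamp-to-edge behaviour are assumptions made for the example, not code from any real renderer.

```c
#include <assert.h>

/* Hypothetical helper: bilinear sample of a single-channel float image.
 * Texel (i, j) has its centre at (i + 0.5, j + 0.5), as in OpenGL;
 * coordinates are clamped to the centres of the edge texels. */
static float sample_bilinear(const float *img, int w, int h, float x, float y)
{
    assert(w > 0 && h > 0);

    /* Shift so that texel centres sit on integers, then clamp to the edge. */
    float fx = x - 0.5f, fy = y - 0.5f;
    if (fx < 0.0f) fx = 0.0f;
    if (fy < 0.0f) fy = 0.0f;
    if (fx > (float)(w - 1)) fx = (float)(w - 1);
    if (fy > (float)(h - 1)) fy = (float)(h - 1);

    int x0 = (int)fx, y0 = (int)fy;   /* fx, fy >= 0, so the cast floors */
    int x1 = x0 + 1 < w ? x0 + 1 : x0;
    int y1 = y0 + 1 < h ? y0 + 1 : y0;
    float tx = fx - (float)x0, ty = fy - (float)y0;

    /* Blend horizontally in both rows, then blend the two rows. */
    float top    = img[y0 * w + x0] * (1.0f - tx) + img[y0 * w + x1] * tx;
    float bottom = img[y1 * w + x0] * (1.0f - tx) + img[y1 * w + x1] * tx;
    return top * (1.0f - ty) + bottom * ty;
}
```

Sampling at a corner shared by four texels returns their mean, and sampling at the midpoint between two texel centres returns the mean of those two. On a GPU this blend is done by the texture unit, so one fetch replaces up to four.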
Everything below in this post is just for history.
In my opinion, kawase blur is better suited to a software implementation than to a hardware one.
Multiple box stages can be pipelined, with only the last one writing to the destination texture. Vectorization should be easy with the gcc vector extension. Each vector lane would process its own set of lines or columns, thus reaching maximum utilization compared with the poorly organized auto-vectorization usually done with -march=native, -msse{,2,3,4} and such. I got the best performance using a 32-bit type for samples. I'm also thinking about a multi-threaded approach, though I have never actually written one.
Details (TLDR)
When I first read about the details of kawase blur, I only knew that it is a multistage (exactly 3x) box filter, which is special in that it can be implemented as a sliding window, thus being independent of blur size (only window size and number of stages matter).
After learning GLSL by writing my first own shader for --window-shader-fg, I can't imagine how it could allow anything more efficient than convolution. Even for a box filter, you still have to fetch all the pixels in the box for each output pixel. Per-texel shader invocations are not meant to ever consider texel evaluation order or to communicate with other per-pixel invocations during a pass.
Why a configurable number of stages is a good option.
I tried convolving the same picture in gimp, first with just 2x 5x5 (the maximum size, unfortunately), then with about 15x 3x3. Note: I needed that many passes with 3x3 to make it look just like 2x 5x5. After that, I could not see any visual difference between the 2 results, at least in a detail-rich nature picture (perhaps I would notice one with a synthetic test picture consisting of lines, dots or some shapes on a monotone background).
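As a rough cross-check of how pass count trades against kernel size, one can compare the variances of repeated uniform box kernels; variances add under repeated convolution. This is an idealised model that assumes exactly uniform weights, which may not match the kernels actually entered in gimp.

```latex
% Variance (per axis) of a uniform box of odd width w, and of n passes of it:
\sigma^2_{\text{box}}(w) = \frac{w^2 - 1}{12},
\qquad
\sigma^2_{n \text{ passes}} = n \cdot \frac{w^2 - 1}{12}

% Two passes of 5x5 versus n passes of 3x3:
2 \cdot \frac{5^2 - 1}{12} = 4,
\qquad
n \cdot \frac{3^2 - 1}{12} = \frac{2n}{3}
\;\Rightarrow\; n = 6 \text{ for equal variance}
```

Under this model about 6 uniform 3x3 passes have the same spread as 2 uniform 5x5 passes. The higher count reported above may come from different kernel weights or rounding in the tool, neither of which is specified in the post.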
As a result, if this box blur had a configurable number of passes, it would also make kawase obsolete (I don't even mention dual, as imho it's now a must-have, given how slow unsplit kernels are).
Back to implementations.
When I first read about implementing this in GLSL, I did not know the language yet, so I had no idea how unrealistic this is. Now I have tried my first real shader for the --window-shader-fg option and was shocked. Shaders are deliberately not allowed to deal with pixel evaluation order (neither set nor get it), nor even to share anything between instances for different pixels (sharing is possible only between shader stages). The only workaround I found is to check the coordinates against the input stage number and skip those that don't match (and I'm still not sure whether the evaluation overhead would ever leave this worthwhile).
In comparison, I guess the xrender implementation should be pure CPU code, as generic as e.g. imlib2 (once claimed to be more efficient than xrender). A sliding box looks like a pretty easy thing; also, multiple filter stages could be chained inside one pass, so as not to waste time moving the data around multiple times.
Vectorization is mandatory today. I would do it with the gcc vector extension. I already have good experience with it, so far only with SSE (128-bit vectors). I used it for software image filtering for LCD subpixel layout (as in freetype), where I got a good performance boost by putting an entire processing sequence into one vector lane (rather than trying to vectorize within the sequence, which is probably what the gcc auto-vectorizer does). Of course, there could be non-vectorized code for compilers other than gcc (unless someone is willing to spend time on each vector instruction set for different architectures).
Of course, in this case vectorization can't be done within a processed line or column, but this can be solved by assigning entire lines or columns to separate vector lanes. I also expect data fetching to be especially efficient when processing multiple columns in parallel.
Processing samples as native int, with a size matching the image pixel size, gave me the best performance, both in scalar code and in vectors. A similar approach is also demonstrated (and recommended) by imlib2. Although uint16 fits twice as much data per vector, it was slow enough to cancel out this factor completely (if not make things slightly worse). Attempts to use unions or different pointer types (for different vector types) only made things definitely worse (or I just missed some combinations).
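The sliding box mentioned above can be sketched in plain C as follows. This is a minimal illustration and not picom or imlib2 code: the names `box_row` and `box_row_multi`, the 8-bit single-channel row and the clamp-to-edge handling are all assumptions. The stages here are chained by ping-ponging one row between two small buffers, which is simpler than a true pipeline with one running sum per stage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clamp a possibly out-of-range index to [0, n - 1]. */
static size_t clamp_index(ptrdiff_t i, size_t n)
{
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* One sliding-window box pass over a row of 8-bit samples.
 * Radius r gives a window of 2 * r + 1 samples; the cost per output
 * sample is one add and one subtract, whatever the radius. */
static void box_row(const uint8_t *src, uint8_t *dst, size_t n, size_t r)
{
    assert(n > 0);
    const uint32_t win = (uint32_t)(2 * r + 1);
    uint32_t sum = 0;

    /* Prime the window for output index 0 (indices -r .. r, clamped). */
    for (ptrdiff_t i = -(ptrdiff_t)r; i <= (ptrdiff_t)r; i++)
        sum += src[clamp_index(i, n)];

    for (size_t x = 0; x < n; x++) {
        dst[x] = (uint8_t)((sum + win / 2) / win);   /* rounded mean */
        /* Slide right: add the entering sample, drop the leaving one. */
        sum += src[clamp_index((ptrdiff_t)(x + r + 1), n)];
        sum -= src[clamp_index((ptrdiff_t)x - (ptrdiff_t)r, n)];
    }
}

/* Run `passes` box stages on one row, using tmp (n bytes) as scratch.
 * The result always ends up back in row. */
static void box_row_multi(uint8_t *row, uint8_t *tmp, size_t n,
                          size_t r, unsigned passes)
{
    uint8_t *a = row, *b = tmp;
    for (unsigned p = 0; p < passes; p++) {
        box_row(a, b, n, r);
        uint8_t *t = a; a = b; b = t;   /* newest result is now in a */
    }
    if (a != row)
        memcpy(row, a, n);
}
```

Because each row is fully processed through all stages before the next row is touched, the working set stays small, which is the spirit of "not wasting time moving data around" even without a real pipeline.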
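A minimal sketch of the "one line per vector lane" idea, using the gcc vector extension with four 32-bit lanes: each lane carries the running sum of a different row, so the sequential dependency inside a row does not block vectorization. The names `v4u`, `load_col` and `box_4rows`, the 8-bit single-channel layout and the clamp-to-edge handling are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four 32-bit lanes in one 128-bit vector (gcc/clang vector extension). */
typedef uint32_t v4u __attribute__((vector_size(16)));

static size_t clamp_col(ptrdiff_t i, size_t n)
{
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* Gather column x of four consecutive rows into the four lanes. */
static v4u load_col(const uint8_t *src, size_t stride, size_t x)
{
    v4u v = { src[x], src[stride + x], src[2 * stride + x], src[3 * stride + x] };
    return v;
}

/* Horizontal sliding box over four rows at once; lane k handles row k.
 * src and dst each hold 4 rows of n samples, stride bytes apart. */
static void box_4rows(const uint8_t *src, uint8_t *dst,
                      size_t stride, size_t n, size_t r)
{
    assert(n > 0 && stride >= n);
    const uint32_t w = (uint32_t)(2 * r + 1);
    const v4u win  = { w, w, w, w };
    const v4u half = { w / 2, w / 2, w / 2, w / 2 };
    v4u sum = { 0, 0, 0, 0 };

    /* Prime all four windows for output column 0. */
    for (ptrdiff_t i = -(ptrdiff_t)r; i <= (ptrdiff_t)r; i++)
        sum += load_col(src, stride, clamp_col(i, n));

    for (size_t x = 0; x < n; x++) {
        v4u out = (sum + half) / win;            /* rounded mean per lane */
        for (int k = 0; k < 4; k++)
            dst[(size_t)k * stride + x] = (uint8_t)out[k];
        /* Slide all four windows one column to the right. */
        sum += load_col(src, stride, clamp_col((ptrdiff_t)(x + r + 1), n));
        sum -= load_col(src, stride, clamp_col((ptrdiff_t)x - (ptrdiff_t)r, n));
    }
}
```

The gather in `load_col` is the weak spot for a horizontal pass. For a vertical pass, four adjacent columns sit next to each other in memory and could be loaded in one step, which is likely why fetching is expected to be more efficient when processing columns in parallel.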