Fixing some myths around Kawase blur; WAS: Software (xrender) multistage box blur with configurable stage numbers (essentially implements kawase) #954

nick87720z · 2022-12-02T15:56:15Z

Summary

Edit (03.12.2022):
It seems I blindly accepted a mistaken description of Kawase blur as a 3x box blur, in which the main advantage of the box filter was said to be its high efficiency when implemented as a sliding average. That was in the context of various requests to bring the highly efficient blur that once existed in xcompmgr back to its later forks.

In summary: the defining feature of the Kawase filter is that it exploits hardware features such as sampling between texels for further acceleration (the average of 2 or 4 texels is computed right in the fetch operation). Some tricks with scaling multiply the efficiency further.
There is a document explaining Kawase blur.
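To illustrate the "average of 4 texels in one fetch" point, here is a minimal CPU sketch of what a bilinear texture fetch computes. It is only a model: `bilinear_fetch` is a hypothetical helper, texel centres are placed at integer coordinates for simplicity (real GL puts them at half-integers), and edges are clamped.

```python
def bilinear_fetch(tex, u, v):
    """Sample the 2D list `tex` at texel-space coordinates (u, v), with u, v >= 0.

    Simplifying assumption: texel centres sit at integer coordinates.
    """
    h, w = len(tex), len(tex[0])
    x0, y0 = int(u), int(v)
    # Clamp the neighbouring texel to the edge (an assumption of this sketch).
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0
    # Blend horizontally on both rows, then vertically between the rows.
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bottom = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


tex = [[1, 3],
       [5, 7]]
# Sampling exactly between four texels yields their average in a single fetch.
centre = bilinear_fetch(tex, 0.5, 0.5)
# Sampling between two texels of one row yields the average of those two.
edge = bilinear_fetch(tex, 0.5, 0.0)
```

On a GPU this blending is done by the texture unit, so a shader that places its sample points between texels gets several texels per fetch at no extra cost.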

As for 3x box filtering: a sliding average filter just can't be implemented with the existing OpenGL / GLSL interfaces, as long as per-pixel fragment shader invocations have no way to coordinate or to write (not only read) pixels other than the current one, or unless GL allows controlling the order in which texels are processed (e.g. requiring sequential processing within some texel range, such as a row, a column or just a block).
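For reference, a sliding average in one dimension can be sketched as below. This is a minimal illustration, not code from xcompmgr or picom; the function names and the clamp-to-edge handling are assumptions. The point is that each output reuses the running sum left by the previous output, which is exactly the kind of sequential dependency that independent fragment shader invocations cannot express.

```python
def box_blur_sliding(row, radius):
    """1D box blur via a running sum: two reads per output, whatever the radius."""
    n = len(row)
    width = 2 * radius + 1

    def at(i):
        # Clamp-to-edge addressing, chosen here only for illustration.
        return row[min(max(i, 0), n - 1)]

    # Sum of the window centred on pixel 0.
    acc = sum(at(i) for i in range(-radius, radius + 1))
    out = []
    for x in range(n):
        out.append(acc / width)
        # Slide the window: add the sample entering on the right,
        # subtract the one leaving on the left.
        acc += at(x + radius + 1) - at(x - radius)
    return out


def box_blur_naive(row, radius):
    """Reference version: re-reads the whole window for every output pixel."""
    n = len(row)
    width = 2 * radius + 1
    return [
        sum(row[min(max(i, 0), n - 1)] for i in range(x - radius, x + radius + 1)) / width
        for x in range(n)
    ]
```

The naive version is what a fragment shader effectively has to do for a box kernel: its cost grows with the window, while the sliding version does not.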

Everything below in this post is just for history.

In my opinion Kawase blur is better suited to a software implementation than to a hardware one.
Multiple box stages can be pipelined, with only the last one writing to the destination texture. Vectorization should be easy with the GCC vector extension. Each vector lane would process its own set of lines or columns, thus reaching maximum utilization compared with the poorly organized auto-vectorization usually produced with -march=native, -msse{,2,3,4} and the like. I got the best performance using a 32-bit type for samples. I'm also thinking about a multi-threaded approach, though I have never actually written one.

Details (TLDR)

When I first read about the details of Kawase blur, I only knew that it is a multistage (exactly 3x) box filter, which is special in that it can be implemented as a sliding window, and is thus independent of blur size (only window size and number of stages matter).

After I learned GLSL by writing my first shader for --window-shader-fg, I can't imagine how it could allow anything more efficient than convolution. Even for a box, you still have to fetch all pixels in the box for each output pixel. Per-texel shader invocations are not meant to care about texel evaluation order or to communicate with other per-pixel invocations during a pass.

Why a configurable number of stages is a good option.

In GIMP I tried convolving the same picture first with just 2x 5x5 (the maximum size, unfortunately), then with about 15x 3x3. Note that I needed this many 3x3 passes to make it look just like 2x 5x5. After that I could not see any visual difference between the two results, at least in a detail-rich nature picture (perhaps I would notice one with a synthetic test picture consisting of lines, dots or some shapes on a monotone background).
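A rough way to compare such stacks of passes is to look at the spread (variance) of the combined kernel. The sketch below assumes plain uniform box weights in one dimension, which the post does not state for the GIMP kernels; `convolve`, `repeated_box` and `variance` are hypothetical helper names. Under that assumption two 5-wide passes and six 3-wide passes have the same spread, so the number of 3x3 passes needed in practice may depend on the weights actually used and on how the tool handles intermediate results.

```python
def convolve(a, b):
    """Full discrete convolution of two 1D kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def repeated_box(width, passes):
    """Effective 1D kernel of `passes` uniform box blurs of the given width."""
    box = [1.0 / width] * width
    kernel = [1.0]
    for _ in range(passes):
        kernel = convolve(kernel, box)
    return kernel


def variance(kernel):
    """Second moment of a normalized, symmetric kernel around its centre."""
    centre = (len(kernel) - 1) / 2
    return sum(w * (i - centre) ** 2 for i, w in enumerate(kernel))


# A single box of width w has variance (w*w - 1) / 12; variances add per pass.
v_5x2 = variance(repeated_box(5, 2))
v_3x6 = variance(repeated_box(3, 6))
```

Matching the variance only matches the overall blur radius; the more passes are stacked, the closer the combined kernel gets to a Gaussian shape.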

As a result, if this box blur had a configurable number of passes, it would deprecate Kawase as well (I don't even mention dual, as imho it's a must-have now, given how slow unsplit kernels are).

Back to implementations.

When I first read about implementing this in GLSL, I did not know the language yet, so I had no idea how unrealistic it is. Now I have tried my first real shader for the --window-shader-fg option and was shocked. Shaders are deliberately not allowed to deal with pixel evaluation order (neither to set it nor to query it), nor to share anything between instances for different pixels (sharing is possible only between shader stages). The only workaround I found is to check coordinates against the input stage number and skip those that don't match (and I'm still not sure whether the evaluation overhead would leave this worthwhile).

In comparison, I guess an xrender implementation should be pure CPU code, as generic as e.g. imlib2 (once claimed to be more efficient than xrender). A sliding box looks pretty easy, and multiple filter stages could be chained inside one pass, so as not to waste time moving data around multiple times.

Vectorization is mandatory today. I would do it with the GCC vector extension. I already have good experience with it, so far with SSE only (128-bit vectors). I used it to write software image filtering for LCD subpixel layout (as in FreeType), where I got a good performance boost by putting an entire processing sequence into one vector lane (rather than trying in-sequence vectorization, which is probably what the gcc auto-vectorizer did). Of course, there could be non-vectorized code for compilers other than gcc (unless someone is willing to spend time on each vector instruction set for different architectures).

Of course, in this case vectorization can't be done within the processed line or column, but this can be solved by assigning entire lines or columns to separate vector lanes. I also expect data fetching to be especially efficient when processing multiple columns in parallel.
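The lane assignment can be sketched as follows. Python has no vector types, so this only models the data layout: `acc` stands in for one vector value whose slots each hold the running sum of a different column. The function name, the `lanes` parameter and the clamp-to-edge handling are assumptions for illustration.

```python
def box_blur_columns(img, radius, lanes=4):
    """Vertical sliding box blur; `lanes` adjacent columns advance in lockstep."""
    h, w = len(img), len(img[0])
    width = 2 * radius + 1
    out = [[0.0] * w for _ in range(h)]

    def row(y):
        # Clamp-to-edge addressing, an assumption of this sketch.
        return img[min(max(y, 0), h - 1)]

    for x0 in range(0, w, lanes):
        cols = range(x0, min(x0 + lanes, w))
        # One accumulator per lane; in C this would be a single vector value.
        acc = [sum(row(y)[x] for y in range(-radius, radius + 1)) for x in cols]
        for y in range(h):
            # Adjacent columns sit next to each other in memory, so each step
            # reads one contiguous chunk of the entering and the leaving row.
            enter, leave = row(y + radius + 1), row(y - radius)
            for k, x in enumerate(cols):
                out[y][x] = acc[k] / width
                acc[k] += enter[x] - leave[x]
    return out
```

The inner loop over `k` is the part a vector unit would execute as one instruction per operation, since every lane performs the same add and subtract on its own column.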

Processing samples as native ints, with the size matching the image pixel size, gave me the best performance, both in scalar and in vector code. A similar approach is also demonstrated (and recommended) by imlib2. Although uint16 allows twice as much data per vector, it was slow enough to cancel this factor completely (if not make things slightly worse). Attempts to use unions or different pointer types (for different vector types) only made it definitely worse (or I just missed some combinations).

yshui · 2022-12-02T17:53:22Z

Maintainer

(I haven't read the whole post yet, but judging from your title:) box blur already supports multistage blur. In blur-kern, it's possible to specify multiple kernels. Box blur is already decomposed into two stages (vertical and horizontal) internally.

I am not sure if multistage blur for kawase would make sense at all (cc @tryone144)

6 replies

nick87720z Dec 2, 2022
Author

The title was a little messy :)

tryone144 Dec 2, 2022
Collaborator

> I am not sure if multistage blur for kawase would make sense at all (cc @tryone144)

Well, our implementation of the dual-kawase algorithm is multi-stage by design. Depending on the desired blur strength, an appropriate number of stages is chosen. The forward direction combines 5 samples into a smaller texture for each stage. The backward step then expands back to the original texture size by taking 8 samples. Each stage uses intermediate textures a quarter the size of the previous stage.

dual-kawase is not to be confused with plain kawase algorithm. The former is the algorithm outlined above that uses standard kawase kernels in the forward pass and modified ones in the backwards pass while simultaneously scaling the intermediate texture size. As far as I know, this algorithm was initially published by Marius Bjørge here.

This algorithm makes heavy use of hardware interpolation between pixels at its sampling positions. This, together with the possibly non-contiguous sampling positions, makes the algorithm better suited to parallel hardware than to software implementations. Parallelization using vector instructions might help a little, but I don't have high hopes for a real-time implementation at FullHD+. As far as I know, this algorithm isn't easily linearly separable (if at all).
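To put rough numbers on the stage structure described above, here is a small sketch that lists intermediate texture sizes and counts texture fetches. It assumes that "a quarter the size" means half the width and half the height, and that sizes round down; picom's actual code may differ, and `dual_kawase_plan` is a hypothetical name.

```python
def dual_kawase_plan(width, height, stages):
    """Return (texture sizes per level, total texture fetches) for one blur."""
    sizes = [(width, height)]
    for _ in range(stages):
        w, h = sizes[-1]
        # Assumption: each level halves both dimensions, rounding down.
        sizes.append((max(1, w // 2), max(1, h // 2)))
    # Forward direction: every pixel of each smaller target takes 5 samples.
    down = sum(5 * w * h for w, h in sizes[1:])
    # Backward direction: every pixel of each larger target takes 8 samples.
    up = sum(8 * w * h for w, h in sizes[:-1])
    return sizes, down + up


sizes, fetches = dual_kawase_plan(1920, 1080, 2)
fetches_per_pixel = fetches / (1920 * 1080)
```

Because the textures shrink geometrically, the total stays bounded as stages are added, while each fetch relies on the hardware interpolation mentioned above to cover several texels.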

> I wrote about software (xrender) implementation, not GLSL one.

@nick87720z Just to clarify: You are mainly concerned about software rendering / the xrender backend? If I understand the extension correctly, we never work with the actual pixel data in picom. This is all handled on the side of the X-server and we only specify composition or draw operations with an optional mask. In particular, we instruct the X-server to filter the image with two separate convolutions (the linearly separated passes of a box or gaussian blur). The actual implementation of that filter is internal to the X-server (and may actually be hardware-accelerated).

> > box blur already supports multistage blur
>
> Looks like, both need to be documented and (if not yet) configurable.

Documentation about that feature indeed seems to be missing. --blur-kern accepts multiple kernels separated by a semicolon ;. The actual format of the kernels is described in the manpage.
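As a sketch of what such a multi-kernel value could look like, the helper below builds kernel strings. It assumes the manpage layout of WIDTH,HEIGHT followed by the elements in row-major order with the centre element left out; please verify this against your picom version. `box_kernel` and `blur_kern` are hypothetical helper names, not part of picom.

```python
def box_kernel(width, height):
    """Build one box kernel string, assuming the manpage's
    WIDTH,HEIGHT,ELE1,ELE2,... layout with the centre element omitted."""
    if width % 2 == 0 or height % 2 == 0:
        raise ValueError("kernel dimensions must be odd")
    elements = ["1"] * (width * height - 1)  # centre element is left out
    return ",".join([str(width), str(height)] + elements)


def blur_kern(*kernels):
    """Join several kernels with semicolons, one kernel per blur stage."""
    return ";".join(kernels)


# Two 3x3 box stages as a single --blur-kern value.
value = blur_kern(box_kernel(3, 3), box_kernel(3, 3))
```

On a command line the value needs quoting, since the shell would otherwise treat the semicolon as a command separator.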

nick87720z Dec 2, 2022
Author

> Just to clarify: You are mainly concerned about software rendering / the xrender backend? If I understand the extension correctly, we never work with the actual pixel data in picom. This is all handled on the side of the X-server and we only specify composition- or draw-operations with an optional mask. In particular, instruct the X-server to filter the image with two separate convolutions (the linearly separated passes of a box- or gaussian-blur). The actual implementation of that filter is internal to the X-server (and may actually be hardware-accelerated).

Hm. When I was coding for tint2, I learned, among other things, how it generates window snapshots. It first tries to use XShm to access window images. Though I know that XShm content is a copy of internal X server data, I also noticed some names like pixmap to xshm. Not sure about xrender speed... well, I just wrote about this below, now repeating myself :) .

nick87720z Dec 2, 2022
Author

> As far as I know, this algorithm isn't easily linearly separable (if at all).

Hm, I need to look at what the real algorithm is. Perhaps I confused it with something. 3x sliding averages is definitely nothing like that.

nick87720z Dec 3, 2022
Author

> As far as I know, this algorithm was initially published by Marius Bjørge here.

Thanks for the link, it helped to clear up a misunderstanding. It seems someone in previous discussions around Kawase blur made a mistake, describing it as a 3x box in a context where the box can indeed be sliding. And I just took it as is. It would be great if that document were free for redistribution, e.g. to make sure that the link doesn't expire.

As far as I understand, with regular Kawase (no resampling) it's possible to reduce read bandwidth by 4x without skipping data (or by more with skipping, as in the example, where samples are taken from the corners only).

About the demonstrated mipmapping quality: it looks like the result of scaling up the texture resolution with a bilinear filter. A relatively big box filter applied 2 times (as suggested at the start of the blur section) would produce a fairly proper soft circle rather than a 4-beam star.

P.S. It's a good idea to add warnings to the start of a topic, where possible, about any breaking news during the discussion.

yshui · 2022-12-02T21:11:32Z

Maintainer

Also, btw, xrender isn't software rendering for most people. On recent enough hardware & software xrender is accelerated too. (well, for most operations it is). And like @tryone144 said, you are not dealing with pixel data using xrender.

I am interested: what is your end goal with this approach? Is it because blur in xrender is too slow?

1 reply

nick87720z Dec 2, 2022
Author

Hm, I thought that xrender is just some support for composited RGBA graphics. As for checking whether it has acceleration, perhaps I did not spend enough time to check this properly. While coding tint2, I did some tests, hoping to speed up text shadow drawing. I tried the most low-level xrender call, "composite", which just takes pixmaps. My approach was to use xrender to replicate the rendered text to get a shadow, while the original code just called pango layout drawing each time (to my surprise, pango won, although results differed depending on font/render size).

Tint2 uses XShm to obtain window snapshots to create thumbnails. At least it tries to (failures are still possible) before resorting to pixel get/set calls. I remember that the shm area is rather a copy of X server internal data, but I also vaguely remember some XShm calls mentioning pixmaps (so there may still be a chance to get more direct access to raster data).

Of course, blur in xrender just can't be fast if it's kernel-based. Btw, I also tried to get it working (for the same text shadow), but just could not. Probably it was not even available on my system. It would just be interesting to see how fast a stack of sliding average boxes is. I know it was in xcompmgr, but it disappeared at some point.

Edit: writing an in-place filtering function should not be a problem, as I'm already trying. Though I have never tried XCB (tint2 uses Xlib). Also not sure whether parallel line processing is a good idea (unlike columns); maybe just take one line and distribute it between vector cells :/ .

Although xrender is not my current config. I just got interested in this while trying to learn how GLSL Kawase even works.
