use kitty deflate capability on initial load when using animation protocol #1695
under the principle that it's better to spend extra cycles in notcurses to save transmitted data, perhaps we'd want to convert all frames to PNG upon receipt? this might introduce unacceptable latency, though. given that we would need to know that the input is PNG (which neither of our multimedia engines makes readily available), and that we'd want to use this method for most visuals if it does indeed work well, i think the operative method would be an on-the-fly PNG encoding of our RGBA data. i don't have the full PNG specification memorized, but i think you can do some cheap RLE that might get us decent savings. also, PNG admits palette-indexed graphics, which could be very effective compression for images using fewer than 257 colors...
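The palette idea above can be checked cheaply before committing to it. The following is only a sketch in Python (not notcurses code): `palette_index` is a hypothetical helper that assumes a tightly packed RGBA buffer and reports whether it fits in a one-byte index.

```python
def palette_index(rgba: bytes, max_colors: int = 256):
    """Return (palette, indices) if the image uses at most max_colors
    distinct RGBA values, else None. Assumes 4 bytes per pixel, no padding."""
    if len(rgba) % 4:
        raise ValueError("RGBA buffer length must be a multiple of 4")
    palette = {}            # pixel bytes -> palette index (insertion ordered)
    indices = bytearray()   # one byte per pixel
    for off in range(0, len(rgba), 4):
        px = bytes(rgba[off:off + 4])
        idx = palette.get(px)
        if idx is None:
            if len(palette) == max_colors:
                return None  # too many colors for a one-byte index
            idx = len(palette)
            palette[px] = idx
        indices.append(idx)
    return list(palette), bytes(indices)
```

When this succeeds, the index stream is a quarter the size of the RGBA data before any further compression, which is where the savings for low-color images would come from.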
so if we were to go this route, do we want to rely on multimedia engines to do the encoding, or do we want to add libpng as a base dependency?
good: we could use this fast method even without a multimedia backend, and we wouldn't have to maintain two encode-to-PNG paths.
there's also the idea of a third multimedia engine that is only libpng. vomit.
oh so shit, if we have libpng in notcurses-core, would we then be able to decode png files? hrmmmm. and if so, if we have a multimedia engine, which one would we use for png decoding?
Before doing all this, I'd like to do a simple benchmark. Take an existing PNG and build a kitty graphics escape out of it. Then take the equivalent RGBA-based graphic from Notcurses. Cat them both through kitty, and ensure that the former is significantly faster (since we'll be adding an encode step). If not, there's no point in doing this.
yeah, that's pretty substantial. of course, we're going to be sending PNG data, not a filename, so we need to test with that.
these numbers were only exacerbated by running remotely
times:
so a healthy savings over rgba, but definitely slower than providing a file reference.
ssh over local wireless: k.png:
png.png:
so we retain big wins on remote. how the hell is
so first off, i now know how PNG works, and it's not a super-advanced scheme. there are a few filters, and then things are run through deflate (LZ77 plus Huffman coding). note that this deflate means you have to inflate to do any in-place editing. so i think this is probably not a win for the non-animated case, where we have to edit in-place and rewrite the entire image.
for the animated case, we're only writing the entire image a single time; we never edit it in place. from that point on, it's all sending null cells and auxvec-rebuilt cells. so we might as well get the one-time win from deflate, right? not this way, i think. PNG introduces other overheads in both complexity and bandwidth. i'd just as soon use the kitty protocol's built-in deflate and avoid that.
so this would only be for the animation case, on the initial load. yeah, let's do it. i bet it'll compress long transparent regions really well, and boost frame rates on e.g.
another idea is setting rgb for transparent regions to all 0s, so they deflate better, since we don't need to preserve them any further.
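The last idea (zeroing RGB under fully transparent pixels so they deflate better) is easy to try in isolation. This is a sketch in Python rather than the notcurses C code; `zero_transparent` and `sample_frame` are hypothetical names, and the sample input is synthetic worst-case data (random leftover color under alpha 0), not a real frame.

```python
import random
import zlib

def zero_transparent(rgba: bytes) -> bytes:
    """Zero the RGB channels of every fully transparent pixel (alpha == 0)."""
    if len(rgba) % 4:
        raise ValueError("RGBA buffer length must be a multiple of 4")
    out = bytearray(rgba)
    for off in range(0, len(out), 4):
        if out[off + 3] == 0:
            out[off:off + 3] = b"\x00\x00\x00"
    return bytes(out)

def sample_frame(pixels: int = 4096, seed: int = 0) -> bytes:
    """Synthetic input: transparent pixels that still carry random RGB."""
    rng = random.Random(seed)
    return b"".join(
        bytes([rng.randrange(256), rng.randrange(256), rng.randrange(256), 0])
        for _ in range(pixels)
    )

if __name__ == "__main__":
    frame = sample_frame()
    # Compare deflated sizes before and after zeroing.
    print(len(zlib.compress(frame)), len(zlib.compress(zero_transparent(frame))))
```

Pixels with any nonzero alpha are left alone, so the visible result should not change; only data the terminal never shows is rewritten.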
just about got this working. hey @kovidgoyal i assume that all chunks have to be deflated if any are, correct? since
yes, all chunks must use the same encoding.
hrmm better would be eliminating the whole idea of 768 pixels == chunk when operating in this mode, and instead just stream into the zlib automaton, and take up to 4096 bytes from it at a time. yes, that's the way to do it. it just doesn't go smoothly with the existing
ok, got it implemented and working. good news: it's effective, at least on certain graphics. here's 2X xray before:
and after:
now admittedly this is a pretty compressible bitmap, but still, a 96% reduction in transmitted bytes is a huge fucking win. the problem lies here: before:
after:
so yeah we're writing tremendously less data, but we're taking about half again as much total time, and our FPS has dropped correspondingly. over local wireless, before:
after:
so performance improved in the network case, where bandwidth absolutely dominates delay. i suspect we just have a crappy first implementation, and we can probably speed it up significantly. if so, this ought be a pretty solid win. i went ahead and moved to a chunk-at-end scheme, so we're issuing optimal chunks (i.e. each is exactly 4096 bytes until we get to the end). obviously this has memory cost on the order of the bitmap size, since we're buffering up all the deflate output.
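The chunk-at-end scheme can be sketched as follows. This is an illustration in Python, not the notcurses implementation: `kitty_deflated_rgba` is a hypothetical helper, the `a=T` action is an assumption (the animation path may use different control keys), and the whole deflate output is buffered before chunking, which is the memory cost noted above.

```python
import base64
import zlib

ESC = "\x1b"
CHUNK = 4096  # kitty's maximum payload size per escape

def kitty_deflated_rgba(rgba: bytes, width: int, height: int) -> list[str]:
    """Deflate an RGBA bitmap, base64 it, and split it into kitty escapes."""
    if len(rgba) != width * height * 4:
        raise ValueError("buffer size does not match width * height * 4")
    # o=z tells kitty the payload was zlib-deflated before base64 encoding.
    payload = base64.standard_b64encode(zlib.compress(rgba)).decode("ascii")
    pieces = [payload[i:i + CHUNK] for i in range(0, len(payload), CHUNK)]
    escapes = []
    for n, piece in enumerate(pieces):
        more = 0 if n == len(pieces) - 1 else 1  # m=1: more chunks follow
        if n == 0:
            keys = f"a=T,f=32,s={width},v={height},o=z,m={more}"
        else:
            keys = f"m={more}"  # later chunks carry only the m key
        escapes.append(f"{ESC}_G{keys};{piece}{ESC}\\")
    return escapes
```

Every chunk but the last is exactly 4096 bytes of base64, which is a multiple of 4 as the protocol requires for non-final chunks.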
really not much in terms of zlib, interesting
there it is:
Samples: 113K of event 'cycles', Event count (approx.): 116203165085
Children Self Command Shared Object Symbol
28.29% 28.25% notcurses-demo libswscale.so.5.7.100 [.] yuv2rgba32_full_X_c
- 16.80% 0.00% notcurses-demo [unknown] [k] 0000000000000000
- 0
7.96% kitty_blit_core
- 0.99% 0x7f84496ea0b0
0.98% ff_hscale14to15_X4_ssse3.innerloop
- 0.76% 0x7f84496ea018
0.58% rgbaToA_c
- 12.32% 12.30% notcurses-demo libnotcurses-core.so.2.3.11 [.] kitty_blit_core
- 7.95% 0
kitty_blit_core
1.72% kitty_blit_core
- 1.09% 0x24900000000
kitty_blit_core
- 10.70% 10.68% notcurses-demo libz.so.1.2.11 [.] deflate_slow
10.14% deflate_slow
- 8.59% 8.58% notcurses-demo libswscale.so.5.7.100 [.] ff_hscale14to15_X4_ssse3.inner
- 4.33% 0
- 0.98% 0x7f84496ea0b0
ff_hscale14to15_X4_ssse3.innerloop
- 4.24% 0x616c665f73777300
sws_context_to_name
swscale
ff_hscale14to15_X4_ssse3.innerloop
- 7.84% 7.83% notcurses-demo libz.so.1.2.11 [.] fill_window
fill_window
- 6.77% 6.75% notcurses-demo libz.so.1.2.11 [.] longest_match
longest_match
+ 4.34% 0.03% notcurses-demo libswscale.so.5.7.100 [.] swscale
+ 4.34% 0.00% notcurses-demo [unknown] [.] 0x616c665f73777300
+ 4.34% 0.00% notcurses-demo libswscale.so.5.7.100 [.] sws_context_to_name
+ 2.82% 2.81% notcurses-demo libz.so.1.2.11 [.] adler32_z
+ 2.55% 2.54% notcurses-demo libswscale.so.5.7.100 [.] rgbaToA_c
+ 2.44% 0.00% notcurses-demo libavutil.so.56.51.100 [.] av_default_item_name
ok so this is good; it verifies that the slowdown we're seeing is indeed attributable to zlib. let's see what happens if we tighten up our usage thereof. i definitely want to have this.
i think we've lost a lot of parallelism in i've taken the zlib level down to 2 from
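The cost of a given zlib level is easy to measure outside notcurses. A rough sketch in Python, where `compare_levels` is a hypothetical helper and the set of levels tried is arbitrary apart from 2; real numbers depend heavily on the bitmap being compressed.

```python
import time
import zlib

def compare_levels(data: bytes, levels=(1, 2, 6, 9)):
    """Return (level, compressed_size, seconds) for each zlib level tried."""
    results = []
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results.append((level, len(packed), elapsed))
    return results

if __name__ == "__main__":
    # Stand-in for a frame: 512x512 fully transparent pixels.
    bitmap = bytes(512 * 512 * 4)
    for level, size, secs in compare_levels(bitmap):
        print(f"level {level}: {size} bytes in {secs * 1000:.2f} ms")
```

Lower levels search for shorter matches, so they usually trade a somewhat larger output for less CPU time, which is the direction wanted when deflate dominates the profile.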
Kitty can accept PNG data (in byte form, not from a file, meaning this technique can be used remotely) using the following syntax:
<ESC>_Gf=100;<payload><ESC>\
if we are given a PNG in ncvisual_from_file(), it would probably be best to send it this way -- it ought be well-compressed. with that said, as soon as we make any change to the image (wiping etc), we'd need to send along RGBA unless we intend to rebuild the PNG (we don't). we'd also need to verify that the content is indeed PNG, which we can't determine just based off the filename etc. (and it would be nice to be able to feed a PNG which we received from memory rather than the filesystem).
as an example, rgb-sponge.png is 67MB decoded (4096 * 4096 * 4B, i.e. 64MiB), but its PNG is only 38MB, 57.7% of the decoded form. other PNGs are likely to be far better compressed.
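Verifying the content and building the escape could look roughly like this. It is a sketch in Python, not notcurses code: `is_png` and `kitty_png_escapes` are hypothetical names, the check only inspects the 8-byte PNG signature (it does not validate the rest of the file), and `a=T` is an assumed action.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG stream

def is_png(data: bytes) -> bool:
    """Check content rather than filename: works for in-memory data too."""
    return data[:8] == PNG_SIGNATURE

def kitty_png_escapes(png: bytes, chunk: int = 4096) -> list[str]:
    """Wrap PNG bytes in kitty graphics escapes using f=100."""
    if not is_png(png):
        raise ValueError("data does not start with the PNG signature")
    payload = base64.standard_b64encode(png).decode("ascii")
    pieces = [payload[i:i + chunk] for i in range(0, len(payload), chunk)]
    escapes = []
    for n, piece in enumerate(pieces):
        more = 0 if n == len(pieces) - 1 else 1  # m=1: more chunks follow
        keys = f"a=T,f=100,m={more}" if n == 0 else f"m={more}"
        escapes.append(f"\x1b_G{keys};{piece}\x1b\\")
    return escapes
```

Since the payload is base64 text inside an escape sequence, this works over ssh just as well as locally, unlike passing a file reference.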