Why did the rewriter agree to rewrite/keep images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg even though this path is missing from the ZIM?
Can we imagine warc2zim exploring all src and srcset URLs, replacing the src with the image that is present in the WARC/ZIM, and dropping the whole srcset? We already have a rewrite rule to properly rewrite the URLs in the srcset attribute, but I am starting to think this may have been a mistake: only one URL has really been loaded in most (all?) cases, and we should have fixed this differently.
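To make the proposal concrete, here is a minimal Python sketch of the selection step. It is not warc2zim's actual code: `parse_srcset`, `pick_available_image` and the `zim_paths` set are hypothetical names, the srcset parsing is deliberately simplified (it splits on a comma followed by whitespace, unlike the full HTML parsing algorithm), the 450px/600px candidates are made up for illustration, and "first candidate found in the ZIM" is an arbitrary policy.

```python
from __future__ import annotations

import re
from typing import Optional


def parse_srcset(srcset: str) -> list[tuple[str, str]]:
    """Split a srcset value into (url, descriptor) pairs.

    Simplified assumption: candidates are separated by a comma followed by
    whitespace, and URLs themselves contain no whitespace.
    """
    candidates = []
    for raw in re.split(r",\s+", srcset.strip()):
        parts = raw.strip().strip(",").split()
        if not parts:
            continue
        url = parts[0]
        descriptor = parts[1] if len(parts) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


def pick_available_image(
    src: Optional[str], srcset: Optional[str], zim_paths: set[str]
) -> Optional[str]:
    """Return src if it is in the ZIM, else the first srcset URL that is.

    Returns None when nothing is available, in which case the caller would
    have to decide what to do (keep the tag as-is, drop it, ...).
    """
    if src and src in zim_paths:
        return src
    for url, _descriptor in parse_srcset(srcset or ""):
        if url in zim_paths:
            return url
    return None


# Hypothetical situation: the 300px src is missing, one srcset entry exists.
zim_paths = {"images/thumb/b/b8/Avatar-mujer.jpg/450px-Avatar-mujer.jpg"}
chosen = pick_available_image(
    "images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg",
    "images/thumb/b/b8/Avatar-mujer.jpg/450px-Avatar-mujer.jpg 1.5x, "
    "images/thumb/b/b8/Avatar-mujer.jpg/600px-Avatar-mujer.jpg 2x",
    zim_paths,
)
print(chosen)
```

With this policy the rewritten tag would carry only `src=chosen` and no srcset, which is exactly the kind of alteration of the original markup that the rest of the discussion weighs.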
Why did the rewriter agree to rewrite/keep images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg even though this path is missing from the ZIM?
Because for images we always rewrite the URL, so that we never end up loading "online" images just because a URL was left unrewritten. So this is normal.
Can we imagine warc2zim exploring all src and srcset URLs, replacing the src with the image present in the WARC/ZIM, and dropping the whole srcset?
The issue is a bit more complex.
By default the crawler in zimit runs with the autofetch behavior activated. When this behavior is activated, all srcset images are fetched and saved in the WARC, so that the WARC works well at all resolutions. However, due to what looks like a crawler bug (webrecorder/wombat#176), the src is not always fetched, even with autofetch activated.
So without this upstream bug we would not have noticed the problem, unless the autofetch behavior is deactivated.
I think we should nevertheless fix this issue in order to improve warc2zim support for cases where autofetch is not activated.
I began working on a fix, and I'm no longer sure we want to fix this.
The philosophy so far has been to keep things as close as possible to what they were at the source. Here we are beginning to alter quite a lot of things, with many situations to take into account:
What if the src is missing in the WARC: which srcset image should we choose?
What if some of the srcset images are missing in the WARC?
What if the srcset was already broken online, or the src was?
And what makes it even more complex is that we also have to support <picture> with individual source elements like in
To support these <picture> elements we would need to interpret both the img and source tags, which to me is a clear indicator that we are doing too much.
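As a rough illustration of why <picture> widens the scope, the sketch below collects every candidate URL of each <picture> using only Python's standard `html.parser`. The class name `PictureCandidates` and the sample markup are hypothetical, and the srcset splitting is simplified (plain comma split); this is not how warc2zim parses HTML.

```python
from html.parser import HTMLParser


class PictureCandidates(HTMLParser):
    """Collect, per <picture>, the URLs from <source srcset> and <img src/srcset>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_picture = False
        self.pictures: list[list[str]] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "picture":
            self.in_picture = True
            self.pictures.append([])
        elif self.in_picture and tag in ("source", "img"):
            urls = []
            # Only <img> carries a plain src; <source> inside <picture> uses srcset.
            if tag == "img" and attr.get("src"):
                urls.append(attr["src"])
            # Simplified srcset handling: split on commas, keep the URL part.
            for candidate in (attr.get("srcset") or "").split(","):
                parts = candidate.split()
                if parts:
                    urls.append(parts[0])
            self.pictures[-1].extend(urls)

    def handle_endtag(self, tag):
        if tag == "picture":
            self.in_picture = False


# Hypothetical markup, only meant to show the two tags involved.
sample = (
    '<picture>'
    '<source srcset="a-large.webp 2x, a-small.webp 1x" media="(min-width: 800px)">'
    '<img src="a.jpg" srcset="a-2x.jpg 2x">'
    '</picture>'
)
parser = PictureCandidates()
parser.feed(sample)
print(parser.pictures)
```

Even in this reduced form, choosing a single fallback means reasoning across several tags and attributes at once, in addition to the media conditions that the sketch ignores entirely.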
I propose not to fix this issue, and simply to document clearly that if the autofetch behavior is disabled, we are going to have some issues with "adaptive" img and picture tags. All this assumes that the upstream bug is confirmed and fixed.
In the banrepcultural enciclopedia ZIM (at https://dev.library.kiwix.org/#lang=&q=enciclopedia+banrepcultural), we have an issue with images that do not display unless we use a mobile phone (or maybe a tablet).
Sample page: https://dev.library.kiwix.org/content/banrepcultural.org_es_enciclopedia_2024-09/enciclopedia.banrepcultural.org/index.php%3Ftitle%3DDelcy_Morelos_Sandoval
HTML sample from this page:
What is inside the ZIM related to Avatar-mujer.jpg:
The enciclopedia.banrepcultural.org/images/thumb/b/b8/Avatar-mujer.jpg/105px-Avatar-mujer.jpg entry seems unrelated to this page; it seems to come from https://dev.library.kiwix.org/content/banrepcultural.org_es_enciclopedia_2024-09/enciclopedia.banrepcultural.org/index.php%3Ftitle%3DArchivo%3AAvatar-mujer.jpg (the thumbnail in the "File history" table).
Two questions:
Why did the rewriter agree to rewrite/keep images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg even though this path is missing from the ZIM?