Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<img> tags in HTML document are not working probably due to srcset attribute #403

Open
benoit74 opened this issue Sep 26, 2024 · 3 comments
Assignees
Labels
bug Something isn't working documentation Improvements or additions to documentation upstream
Milestone

Comments

@benoit74
Copy link
Collaborator

In banrepcultural enciclopedia ZIM (at https://dev.library.kiwix.org/#lang=&q=enciclopedia+banrepcultural), we have an issue around images which are not displaying unless we use a mobile phone (or maybe a tablet)

Sample page: https://dev.library.kiwix.org/content/banrepcultural.org_es_enciclopedia_2024-09/enciclopedia.banrepcultural.org/index.php%3Ftitle%3DDelcy_Morelos_Sandoval

HTML sample from this page:

<img src="images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg" decoding="async" width="300" height="341" srcset="images/b/b8/Avatar-mujer.jpg 1.5x" />

What is inside the ZIM related to Avatar-mujer.jpg:

enciclopedia.banrepcultural.org/images/b/b8/Avatar-mujer.jpg
enciclopedia.banrepcultural.org/images/thumb/b/b8/Avatar-mujer.jpg/105px-Avatar-mujer.jpg

The enciclopedia.banrepcultural.org/images/thumb/b/b8/Avatar-mujer.jpg/105px-Avatar-mujer.jpg seems unrelated to this page, it seems to come from https://dev.library.kiwix.org/content/banrepcultural.org_es_enciclopedia_2024-09/enciclopedia.banrepcultural.org/index.php%3Ftitle%3DArchivo%3AAvatar-mujer.jpg (the thumbnail in the table File history)

Two questions:

  • why did the rewriter accepted to rewrite/keep images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg despite this path being missing in the ZIM?
  • can we imagine warc2zim explores all src and srcset URLs and replace the src with the image present in the WARC/ZIM and drop the whole srcset? we already have a rewrite rule to properly rewrite the URLs in the srcset attribute, but I start to consider this might probably have been a mistake, only one URL has been really load in most (all?) cases and we should have fixed this differently
@benoit74 benoit74 added the bug Something isn't working label Sep 26, 2024
@benoit74 benoit74 added this to the 2.2.0 milestone Sep 26, 2024
@benoit74 benoit74 self-assigned this Sep 26, 2024
@kelson42
Copy link
Contributor

Once the version of the image chosen, it seems to me that - indeed - the whole "srcset" should be replaced by a simple "src".

@benoit74
Copy link
Collaborator Author

why did the rewriter accepted to rewrite/keep images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg despite this path being missing in the ZIM?

Because for images we always rewrite, so that we avoid to load "online" images because the URL has not been rewritten. So this is normal

can we imagine warc2zim explores all src and srcset URLs and replace the src with the image present in the WARC/ZIM and drop the whole srcset?

The issue is a bit more complex.

By default the crawler in zimit runs with autofetch behavior activated. When this behavior is activated, all srcset images are fetched and saved in the WARC, so that the WARC works well at all resolution. However, due to what looks like a crawler bug (webrecorder/wombat#176), the src is not always fetched, even with the autofetch activated.

So without this upstream bug we wouldn't have notified the problem, unless autofetch behavior is deactivated.

I think we should however fix this issue in order to enhance warc2zim support of cases where autofetch is not activated.

@benoit74
Copy link
Collaborator Author

I began to work in fixing this, and I'm now not sure anymore we want to fix this.

The philosophy so far has been to try to keep things as much as possible like they were at the source. Here we are beginning to alter quite a lot of things, we many situations to take into account:

  • what if the src is missing in the WARC, which srcset image should we choose?
  • what if some of the srcset images are missing in the WARC?
  • what if the srcset code was already broken online? the src was broken online?

And what makes it even more complex is that we also have to support <picture> with individual source like in


    <picture>
      <source
        srcset="./images/image3-high.png"
        media="all and (min-width: 1280px)"
        type="image/png"
      />
      <source
        srcset="./images/image3-medium.png"
        media="all and (min-width: 600px)"
        type="image/png"
      />
      <source
        srcset="./images/image3-small.png"
        media="all and (min-width: 0px)"
        type="image/png"
      />
      <img src="./images/image3.png" alt="an image" />
    </picture>

or a bit simpler

    <picture>
      <source
        srcset="
          ./images/image2.png,
          ./images/image2-1x.png   1x,
          ./images/image2-2x.png   2x
        "
        type="image/png"
      />
      <img src="./images/image2.png" alt="an image" />
    </picture>

To support these <picture> it means we need to interpret both the img and source tags, which to me is a clear indicator that we are doing to much.

I propose to not fix this issue, and simply clearly document that if autofetch behavior is disabled, we are going to have some issues with "adaptive" img and picture tags. All this assuming that upstream bug is confirmed and fixed.

@benoit74 benoit74 added upstream documentation Improvements or additions to documentation labels Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation upstream
Projects
None yet
Development

No branches or pull requests

2 participants