Wombat never fetches `<img>` source image in `src` when `srcset` is activated #176

benoit74 · 2024-09-30T08:52:38Z

By default (when the autofetch behavior is activated if I'm not mistaken), the crawler automatically fetches images from srcset of <img> tags so that all resolutions are available in the WARC.

However, this does not seem to account for the case where a srcset condition matches: the browser then loads the srcset candidate, and the image in src is never fetched (so it breaks under some conditions).

Sample website: https://enciclopedia.banrepcultural.org/index.php?title=Delcy_Morelos_Sandoval

Sample WARC: crawl-enciclopedia-banrep-onepage-20240930.warc.gz (in this WARC, images are displayed only at DPR 1.5 or above; at DPR 1 all images are broken)

HTML source code causing the issue:

<img src="images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg" decoding="async" width="300" height="341" srcset="images/b/b8/Avatar-mujer.jpg 1.5x">

Since I crawled with --mobileDevice "Pixel 2", images/b/b8/Avatar-mujer.jpg was fetched automatically by the browser, but the autofetch behavior seems never to have fetched images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg.
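To make the expected behavior concrete, here is a minimal sketch of collecting every URL an `<img>` can resolve to, from both src and srcset. The function name `collectImgUrls` is hypothetical and is not taken from wombat or browsertrix-crawler; the srcset parsing is deliberately simplified (it splits on commas, so URLs that themselves contain commas are not handled).

```typescript
// Hypothetical helper for illustration only; not the crawler's actual code.
// Returns the src URL plus every candidate URL listed in srcset, deduplicated.
function collectImgUrls(src: string | null, srcset: string | null): string[] {
  const urls: string[] = [];
  if (src) {
    urls.push(src);
  }
  if (srcset) {
    // Simplified parsing: candidates are assumed to be comma-separated.
    for (const candidate of srcset.split(",")) {
      // Each candidate looks like "<url> [descriptor]", e.g. "a.jpg 1.5x".
      const url = candidate.trim().split(/\s+/)[0];
      if (url && urls.indexOf(url) === -1) {
        urls.push(url);
      }
    }
  }
  return urls;
}

// The <img> tag from this issue: both URLs should end up in the fetch list.
const sampleUrls = collectImgUrls(
  "images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg",
  "images/b/b8/Avatar-mujer.jpg 1.5x",
);
console.log(sampleUrls);
```

With a list like this, the archive would contain both the thumbnail from src and the full-size image from srcset, whatever DPR the page is crawled or replayed at.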

Full crawl command:

docker run -v $PWD/output:/output --name crawlme --rm  webrecorder/browsertrix-crawler:1.3.0 crawl --url "https://enciclopedia.banrepcultural.org/index.php?title=Delcy_Morelos_Sandoval" --cwd /output --combineWARC --depth 0 --mobileDevice "Pixel 2"

Note: I'm not sure this website's HTML is 100% valid per the spec. In general I see the img src repeated in srcset as well, but I couldn't find anything in the spec about this (is it just good practice, to avoid situations like this one, or a requirement?).


benoit74 · 2024-10-04T07:17:03Z

@ikreymer sorry, this is not a wombat issue at all; I don't know what I was thinking when I opened it. Can you move this to webrecorder/browsertrix-crawler and fix the title, which is wrong?

Or should I reopen this?

benoit74 mentioned this issue Sep 30, 2024

<img> tags in HTML document are not working probably due to srcset attribute openzim/warc2zim#403

Open
