Wombat never fetches <img> source image in src when srcset is activated #176

Open
benoit74 opened this issue Sep 30, 2024 · 1 comment

By default (when the autofetch behavior is activated if I'm not mistaken), the crawler automatically fetches images from srcset of <img> tags so that all resolutions are available in the WARC.

However, this does not seem to account for the case where the srcset condition matches during the crawl: the browser then loads the srcset candidate, and the src of the image is never fetched (so the image breaks under some conditions).

Sample website: https://enciclopedia.banrepcultural.org/index.php?title=Delcy_Morelos_Sandoval

Sample WARC: crawl-enciclopedia-banrep-onepage-20240930.warc.gz (in this WARC, images are displayed only at a DPR of 1.5 or above; at DPR 1 all images are broken)

HTML source code causing the issue:

<img src="images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg" decoding="async" width="300" height="341" srcset="images/b/b8/Avatar-mujer.jpg 1.5x">

Since I crawled with --mobileDevice "Pixel 2", images/b/b8/Avatar-mujer.jpg was automatically fetched by the browser, but the autoFetch behavior seems never to have fetched images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg.
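To make the expected behavior concrete, here is a minimal sketch of what I would expect to end up in the WARC. The helper names (`parseSrcset`, `imageUrlsToFetch`, `pickForDpr`) are hypothetical and are not taken from wombat or browsertrix-crawler. The parser is deliberately simplified (density descriptors only), and the selection rule (smallest density that covers the DPR, with src standing in as the 1x candidate) is my assumption about typical browser behavior, not something I verified against the crawler's code.

```typescript
// Hypothetical helpers for illustration; not part of wombat or browsertrix-crawler.
interface Candidate {
  url: string;
  density: number; // pixel density descriptor, e.g. 1.5 for "1.5x"
}

// Simplified srcset parser: supports density descriptors ("1.5x") only and
// assumes candidates are separated by a comma followed by whitespace.
function parseSrcset(srcset: string): Candidate[] {
  const result: Candidate[] = [];
  for (const raw of srcset.split(/,\s+/)) {
    const tokens = raw.trim().split(/\s+/);
    const url = tokens[0] ?? "";
    if (url.length === 0) continue;
    // A candidate without a descriptor is treated as 1x.
    result.push({ url, density: parseFloat(tokens[1] ?? "1x") });
  }
  return result;
}

// Every URL needed for the image to work at any DPR:
// the src fallback plus all srcset candidates, without duplicates.
function imageUrlsToFetch(src: string | null, srcset: string | null): string[] {
  const urls: string[] = [];
  if (src) urls.push(src);
  for (const c of parseSrcset(srcset ?? "")) {
    if (urls.indexOf(c.url) === -1) urls.push(c.url);
  }
  return urls;
}

// Assumed selection rule: pick the smallest density >= dpr, otherwise the largest.
function pickForDpr(src: string, srcset: string, dpr: number): string {
  const candidates = parseSrcset(srcset);
  // src acts as the 1x candidate when srcset does not declare one.
  if (!candidates.some((c) => c.density === 1)) {
    candidates.push({ url: src, density: 1 });
  }
  candidates.sort((a, b) => a.density - b.density);
  let chosen: Candidate = { url: src, density: 1 };
  for (const c of candidates) {
    chosen = c;
    if (c.density >= dpr) break;
  }
  return chosen.url;
}

// The <img> from this issue.
const exampleSrc = "images/thumb/b/b8/Avatar-mujer.jpg/300px-Avatar-mujer.jpg";
const exampleSrcset = "images/b/b8/Avatar-mujer.jpg 1.5x";

console.log(imageUrlsToFetch(exampleSrc, exampleSrcset)); // both URLs
console.log(pickForDpr(exampleSrc, exampleSrcset, 1)); // the 300px thumbnail (src)
console.log(pickForDpr(exampleSrc, exampleSrcset, 2)); // the full-size image (srcset)
```

Under these assumptions, a crawl at a high DPR only ever makes the browser request the 1.5x candidate, so the src URL has to be fetched explicitly for replay at DPR 1 to work.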

Full crawl command:

docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.3.0 crawl --url "https://enciclopedia.banrepcultural.org/index.php?title=Delcy_Morelos_Sandoval" --cwd /output --combineWARC --depth 0 --mobileDevice "Pixel 2"

Note: I'm not sure this website's HTML is 100% valid per the spec. In general I see that the img src is repeated in srcset as well, but I didn't find anything in the spec about this (is it just good practice, to avoid situations like this one, or an actual requirement?).

benoit74 commented Oct 4, 2024

@ikreymer sorry, this is not a wombat issue at all; I don't know what I was thinking when I opened it. Can you move it to webrecorder/browsertrix-crawler and fix the title, which is wrong?

Or should I reopen this?
