
Microsites from the same owner return different results #232

Closed
Popolechien opened this issue Oct 31, 2023 · 10 comments

@Popolechien
Contributor

Popolechien commented Oct 31, 2023

I ran youzim.it on a series of microsites that all belong to the National Bank of Colombia:

  1. Impresiones de un viaje a América: https://www.banrepcultural.org/impresiones-de-un-viaje/index.php
  2. Candelario Obeso: https://www.banrepcultural.org/candelario-obeso/index.html
  3. Ferrocarriles de Colombia: https://www.banrepcultural.org/ferrocarriles/
  4. 90 años del Banco de la República: https://www.banrepcultural.org/ferrocarriles/
  5. Escarabajos: https://www.banrepcultural.org/escarabajos/
  6. Visor de Colecciones: https://www.banrepcultural.org/visor-colecciones/
  7. Alejandro de Humboldt: viajes por Colombia: https://www.banrepcultural.org/humboldt/home.htm
  8. Luis Caballero: https://www.banrepcultural.org/luis-caballero/

[Screenshot: Capture d’écran 2023-10-31 à 08 22 27]

All but one failed and were blocked outright by some captcha security. The one that passed was not the first but the sixth (i.e. it's not as if multiple requests triggered anything).

Would there be any reason for this, and what could we do about it?

(Note: this is a request from the Bank itself, so we may have some wiggle room in asking them to toggle things.)

@benoit74
Collaborator

I had a look and the issue comes from the check_url operation we do at startup (in pure Python) to confirm the URL is correct. The website detects that we are a bot (sic) and blocks us. I tried (locally) removing the check_url operation and the browsertrix crawler behaves correctly.

I'm a bit puzzled about how to move forward on this, because while the check_url operation is important, it has also been a significant source of issues recently.

In the short term, is this for an important client for whom we would like to get the ZIM asap? If so, I can create the ZIMs locally quite easily, especially since we are talking about microsites.
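
For context, here is a minimal sketch of what such a startup check_url operation might look like in pure Python (a hypothetical illustration, not zimit's actual implementation). A bare requests call like this sends none of the headers a real browser would, which is typically what bot-detection systems key on:

```python
import requests

def check_url(url: str, timeout: int = 10) -> str:
    """Fetch the URL once, follow redirects, and fail fast on HTTP errors.

    A plain requests call sends only the default `python-requests/x.y`
    User-Agent and no browser-like headers, so anti-bot protections can
    flag it even though the subsequent browsertrix crawl behaves fine.
    """
    resp = requests.get(url, allow_redirects=True, timeout=timeout)
    resp.raise_for_status()
    return resp.url  # final URL after any redirects
```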

@Popolechien
Contributor Author

A bit surprised the check_url does not trigger a random one, but so be it. Curious to know how much of an effort it would be to remove that parameter.

@benoit74
Collaborator

I ran the scraper without check_url for one of the sites mentioned above, and it allowed the crawler to proceed a bit, but at some point it got detected as a bot and all subsequent requests were redirected to the "captcha" page. The ZIM quality is hence not good at all.

Would it be possible to ask the client to disable the anti-bot protection for (at least one of) our IPs?

Why the sixth one succeeded is very unclear to me; it should have failed as well.

@benoit74
Collaborator

> A bit surprised the check_url does not trigger a random one, but so be it. Curious to know how much of an effort it would be to remove that parameter.

This should not be removed; it was implemented on purpose to check the validity / quality of the URL. Is it really mandatory? I don't know, but usually there is no useless code in our scraper. I agree, though, that it unfortunately has some "side-effects" in our situation.

It is however quick to bypass the check_url code on my machine: just one line of code to comment out, as mentioned above.

@benoit74
Collaborator

Why check_url triggers the bot detection is probably linked to the removal of headers when redirects are followed; see psf/requests#2949 and https://stackoverflow.com/questions/60423439/python-requests-adding-referer-header-to-redirected-requests (not exactly our situation, but close enough that there is a good probability this is the explanation). In any case, we got blocked later on in the crawl as well.
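
As an illustration of that workaround space, here is a hedged sketch (hypothetical, not zimit's code) that follows redirects manually, so every hop re-sends the same browser-like headers instead of letting requests rebuild the redirected request:

```python
import requests

# Hypothetical browser-like headers; any realistic values would do.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/119.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_with_stable_headers(url: str, max_hops: int = 10) -> requests.Response:
    """Follow redirects by hand so our headers are sent on every hop."""
    for _ in range(max_hops):
        resp = requests.get(url, headers=BROWSER_HEADERS,
                            allow_redirects=False, timeout=10)
        if resp.status_code in (301, 302, 303, 307, 308):
            # Location may be relative; resolve it against the current URL.
            url = requests.compat.urljoin(url, resp.headers["Location"])
            continue
        return resp
    raise requests.TooManyRedirects(f"More than {max_hops} redirects")
```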

@benoit74
Collaborator

benoit74 commented Nov 1, 2023

I tried again this morning with a --delay of 5 seconds between pages, but it still triggered bot detection at some point. I will try again with a higher delay once my IP is no longer blocked.
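
For reference, the run being described would look roughly like this (the --delay flag is per the comment above; the URL and --name value are illustrative, not from the actual run):

```sh
# Hypothetical invocation against one of the microsites listed in this issue
zimit --url https://www.banrepcultural.org/escarabajos/ --name escarabajos --delay 5
```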

@kelson42 kelson42 added this to the 1.7.0 milestone Nov 4, 2023
@benoit74
Collaborator

I just checked web.archive.org and they face the same issue: https://web.archive.org/web/20231114112051/http://www.banrepcultural.org/

The captcha comes from the ShieldSquare protection, and I don't think it will be possible to circumvent it (it is their job to stop systems like ours, so they put lots of effort into it, and we would need to put in lots of effort as well). It is even questionable whether we should (after all, it is a way of saying "you are not welcome", even if in the current situation we have a request from the site owner to create the ZIM).

Maybe one solution to explore would be to create the WARC files more or less manually with https://archiveweb.page/, since we are talking about microsites, but even then I'm not sure it is feasible. For instance, I checked https://www.banrepcultural.org/luis-caballero/ and there are many (~250) pages to visit, one for each drawing's fullscreen image. Is there any Chrome extension that could visit all links on a page like a spider / crawler does, but client-side?
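
As a sketch of that "client-side spider" idea (a hypothetical helper using only the Python standard library; the filename and base URL are assumptions): save the gallery page locally, extract its same-site links, then visit them one by one while archiveweb.page records:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed base URL of the locally saved gallery page.
BASE = "https://www.banrepcultural.org/luis-caballero/"

class LinkCollector(HTMLParser):
    """Collect every same-site link found in <a href=...> tags."""
    def __init__(self) -> None:
        super().__init__()
        self.links: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                url = urljoin(BASE, href)
                if urlparse(url).netloc == urlparse(BASE).netloc:
                    self.links.add(url)

collector = LinkCollector()
with open("luis-caballero.html", encoding="utf-8") as f:  # hypothetical filename
    collector.feed(f.read())
for link in sorted(collector.links):
    print(link)
```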

@Popolechien
Contributor Author

OK, so just to sum up, we can:

  • ask them to provide a full dump of their microsites;
  • ask them to turn ShieldSquare off;
  • (semi-)manually crawl and scrape.

Correct?

@benoit74
Collaborator

Yep, correct. Just note that the "semi-" option has not been identified so far and might not exist at all. And as usual, these are just the options for moving forward; there is no guarantee any will work (zimit / warc2zim are known to have limitations, especially for very interactive sites like "Visor de Colecciones").

@Popolechien
Contributor Author

Question answered I guess. Up to the bank to see if they want to do anything about it.

@benoit74 benoit74 modified the milestones: 1.7.0, 1.6.3 Jan 18, 2024