
Microsites from the same owner return different results #232

Closed
Popolechien opened this issue Oct 31, 2023 · 10 comments

@Popolechien
Contributor

Popolechien commented Oct 31, 2023

I ran youzim.it on a series of microsites that all belong to the National Bank of Colombia:

  1. Impresiones de un viaje a América: https://www.banrepcultural.org/impresiones-de-un-viaje/index.php
  2. Candelario Obeso: https://www.banrepcultural.org/candelario-obeso/index.html
  3. Ferrocarriles de Colombia: https://www.banrepcultural.org/ferrocarriles/
  4. 90 años del Banco de la República: https://www.banrepcultural.org/ferrocarriles/
  5. Escarabajos: https://www.banrepcultural.org/escarabajos/
  6. Visor de Colecciones: https://www.banrepcultural.org/visor-colecciones/
  7. Alejandro de Humboldt: viajes por Colombia: https://www.banrepcultural.org/humboldt/home.htm
  8. Luis Caballero: https://www.banrepcultural.org/luis-caballero/

[Screenshot: Capture d’écran 2023-10-31 à 08 22 27]

All but one failed and were blocked outright by some captcha security. The one that passed was not the first but the sixth (i.e. it's not as if multiple requests triggered anything).

Would there be any reason for this, and what could we do about it?

(Note: this is a request from the Bank itself, so we may have some wiggle room in asking them to toggle things.)

@benoit74
Collaborator

I had a look and the issue comes from the check_url operation we do at startup (in pure Python) to confirm the URL is correct. The website detects that we are a bot (sic) and blocks us. I tried (locally) removing the check_url operation and the browsertrix crawler behaves correctly.

I'm a bit puzzled about how to move forward on this, because while the check_url operation is important, it has also been a significant source of issues recently.

In the short term, is this for an important client for whom we would like to get the ZIM asap? If so, I can create the ZIMs locally quite easily, especially since we are talking about microsites.
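
For context, here is a minimal sketch of what such a startup check_url operation might look like in pure Python (a hypothetical illustration, not zimit's actual implementation). A bare requests call like this sends none of the headers a real browser would, which is typically what bot-detection systems key on:

```python
import requests

def check_url(url: str, timeout: int = 10) -> str:
    """Fetch the URL once, follow redirects, and fail fast on HTTP errors.

    A plain requests call sends only the default `python-requests/x.y`
    User-Agent and no browser-like headers, so anti-bot protections can
    flag it even though the subsequent browsertrix crawl behaves fine.
    """
    resp = requests.get(url, allow_redirects=True, timeout=timeout)
    resp.raise_for_status()
    return resp.url  # final URL after any redirects
```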

@Popolechien
Contributor Author

A bit surprised the check_url does not trigger a random one, but so be it. Curious to know how much of an effort it would be to remove that parameter.

@benoit74
Collaborator

I ran the scraper without check_url for one of the sites mentioned above, and it allowed the crawler to proceed a bit, but at some point it got detected as a bot and all subsequent requests were redirected to the "captcha" page. The ZIM quality is hence not good at all.

Would it be possible to ask the client to disable the anti-bot protection for (at least one of) our IPs?

Why the sixth one succeeded is very unclear to me; it should have failed as well.

@benoit74
Collaborator

> A bit surprised the check_url does not trigger a random one, but so be it. Curious to know how much of an effort it would be to remove that parameter.

This should not be removed; it was implemented on purpose to check the validity / quality of the URL. Is it really mandatory? I don't know, but usually there is no useless code in our scraper. I agree, though, that it unfortunately has some "side-effects" in our situation.

It is however quick to bypass the check_url code on my machine: just one line of code to comment out, as mentioned above.

@benoit74
Collaborator

Why check_url triggers the bot detection is probably linked to the removal of headers when redirects are followed; see psf/requests#2949 and https://stackoverflow.com/questions/60423439/python-requests-adding-referer-header-to-redirected-requests (not exactly our situation, but close enough that there is a good probability this is the explanation). In any case, we got blocked later on in the crawl as well.
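
As an illustration of that workaround space, here is a hedged sketch (hypothetical, not zimit's code) that follows redirects manually, so every hop re-sends the same browser-like headers instead of letting requests rebuild the redirected request:

```python
import requests

# Hypothetical browser-like headers; any realistic values would do.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/119.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_with_stable_headers(url: str, max_hops: int = 10) -> requests.Response:
    """Follow redirects by hand so our headers are sent on every hop."""
    for _ in range(max_hops):
        resp = requests.get(url, headers=BROWSER_HEADERS,
                            allow_redirects=False, timeout=10)
        if resp.status_code in (301, 302, 303, 307, 308):
            # Location may be relative; resolve it against the current URL.
            url = requests.compat.urljoin(url, resp.headers["Location"])
            continue
        return resp
    raise requests.TooManyRedirects(f"More than {max_hops} redirects")
```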

@benoit74
Collaborator

benoit74 commented Nov 1, 2023

I tried again this morning with a --delay of 5 seconds between pages, but it still triggered bot detection at some point. I will try again with a higher delay once my IP is no longer blocked.
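
For reference, the run being described would look roughly like this (the --delay flag is per the comment above; the URL and --name value are illustrative, not from the actual run):

```sh
# Hypothetical invocation against one of the microsites listed in this issue
zimit --url https://www.banrepcultural.org/escarabajos/ --name escarabajos --delay 5
```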

@kelson42 kelson42 added this to the 1.7.0 milestone Nov 4, 2023
@benoit74
Collaborator

I just checked web.archive.org and they face the same issue: https://web.archive.org/web/20231114112051/http://www.banrepcultural.org/

The captcha comes from the ShieldSquare protection, and I don't think it will be possible to circumvent it (it is their job to stop systems like ours, so they put lots of effort into it, and we would need to put in lots of effort as well). It is even questionable whether we should (after all, it is a way of saying "you are not welcome", even if in the current situation we have a request from the site owner to create the ZIM).

Maybe one solution to explore would be to create the WARC files more or less manually with https://archiveweb.page/, since we are talking about microsites, but even then I'm not sure it is feasible. For instance, I checked https://www.banrepcultural.org/luis-caballero/ and there are many (~250) pages to visit, one for each drawing's fullscreen image. Is there any Chrome extension that could visit all links on a page like a spider / crawler does, but client-side?
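
As a sketch of that "client-side spider" idea (a hypothetical helper using only the Python standard library; the filename and base URL are assumptions): save the gallery page locally, extract its same-site links, then visit them one by one while archiveweb.page records:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed base URL of the locally saved gallery page.
BASE = "https://www.banrepcultural.org/luis-caballero/"

class LinkCollector(HTMLParser):
    """Collect every same-site link found in <a href=...> tags."""
    def __init__(self) -> None:
        super().__init__()
        self.links: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                url = urljoin(BASE, href)
                if urlparse(url).netloc == urlparse(BASE).netloc:
                    self.links.add(url)

collector = LinkCollector()
with open("luis-caballero.html", encoding="utf-8") as f:  # hypothetical filename
    collector.feed(f.read())
for link in sorted(collector.links):
    print(link)
```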

@Popolechien
Contributor Author

OK, so just to sum up, we can:

  • ask them to provide a full dump of their microsites;
  • ask them to turn ShieldSquare off;
  • (semi-)manually crawl and scrape.

Correct?

@benoit74
Collaborator

Yep, correct. Just note that the "semi-" option has not been identified so far and might not exist at all. And as usual, these are just the options for moving forward; there is no guarantee any will work (zimit / warc2zim are known to have limitations, especially for very interactive sites like "Visor de Colecciones").

@Popolechien
Contributor Author

Question answered I guess. Up to the bank to see if they want to do anything about it.

@benoit74 benoit74 modified the milestones: 1.7.0, 1.6.3 Jan 18, 2024