Microsites from the same owner return different results #232
Comments
I had a look and the issue comes from the I'm a bit puzzled on how to move forward on this, because while the In the short term, is this for an important client for whom we would like to get the ZIM ASAP? If yes, I can create the ZIMs locally quite easily, especially if we are speaking about micro websites. |
A bit surprised the |
I ran the scraper without Would it be possible to ask the client to disable the anti-bot protection for (at least one of) our IP(s)? Why the sixth one succeeded is very unclear to me; it should have failed as well. |
This should not be removed; it was implemented on purpose to check the validity / quality of the URL. Whether it is really mandatory, I don't know, but there is usually no useless code in our scraper. I agree that it unfortunately has some "side effects" in our situation. It is however fast to bypass the |
Why |
I tried again this morning with a |
I just checked web.archive.org and they face the issue as well: https://web.archive.org/web/20231114112051/http://www.banrepcultural.org/ The captcha comes from ShieldSquare protection, and I don't think it will be possible to circumvent it (it is their job to stop systems like ours, so they put lots of effort into it, and we would need to put in lots of effort as well). It is even questionable whether we should (after all, it is a way of saying "you are not welcome", even if in the current situation we have a request from the site owner to create the ZIM). Maybe one solution to explore would be to create the WARC files more or less manually with https://archiveweb.page/ since we are speaking about microsites, but even for that I'm not sure it is feasible. For instance, I checked https://www.banrepcultural.org/luis-caballero/ and there are many (~250) pages to visit, one for each fullscreen drawing image. Is there any Chrome extension which could visit all links on a page like a spider / crawler does, but client-side? |
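To get a feel for how many pages a manual capture would involve, the links of a page saved from a regular browser session can be listed offline. This is only a minimal sketch using the Python standard library; `LinkCollector`, `microsite_links` and the sample paths (`obra-1`, `obra-2`) are hypothetical names made up for illustration, and it does nothing to get past the captcha itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href> tags that stay under a given prefix."""

    def __init__(self, page_url, prefix):
        super().__init__()
        self.page_url = page_url  # URL the saved HTML was originally loaded from
        self.prefix = prefix      # only keep links that start with this prefix
        self.links = []           # ordered, de-duplicated result

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(self.page_url, href))
        if absolute.startswith(self.prefix) and absolute not in self.links:
            self.links.append(absolute)


def microsite_links(html, page_url, prefix):
    """Return the in-microsite links found in an HTML string."""
    collector = LinkCollector(page_url, prefix)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    # Hypothetical sample markup; the real page structure is an assumption.
    sample = (
        '<a href="obra-1">1</a>'
        '<a href="/luis-caballero/obra-2#top">2</a>'
        '<a href="https://example.org/elsewhere">external</a>'
        '<a href="obra-1">duplicate</a>'
    )
    base = "https://www.banrepcultural.org/luis-caballero/"
    for url in microsite_links(sample, base, base):
        print(url)
```

The resulting list could then be walked through by hand while archiveweb.page is recording, which at least makes the size of the manual job visible up front.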
Ok so just to sum up we can:
|
Yep, correct. Just note that the "semi-" option has not been identified so far and might not exist at all. And as usual, these are just options to move forward, with no guarantee that they will work (zimit / warc2zim are known to have limitations, especially for very interactive sites like "Visor de Colecciones"). |
Question answered I guess. Up to the bank to see if they want to do anything about it. |
I ran youzim.it on a series of microsites that all belong to the National Bank of Colombia.
All but one failed, blocked outright by some captcha security. The one that passed was not the first but the sixth one (i.e. it's not as if multiple requests triggered anything).
Would there be any reason for this, and what could we do about it?
(Note: this is a request from the Bank itself, so we may have some wiggle room in asking them to toggle things.)