Can't Zim a Wiki #255

Closed
pfspace opened this issue Dec 3, 2023 · 17 comments
Labels
anti_bot_issue The issue is linked to an anti-bot protection of target website (Cloudflare, Imperva, ...)

Comments

@pfspace

pfspace commented Dec 3, 2023

Hi there,

I'm developing a simple 2D platformer game, and due to its poor performance on low-end hardware I decided to take a break from the engine I was building my game on and build my own engine from scratch in C and SDL.

Then I learned about that community game called Sonic Robo Blast 2, which is based on the Doom Legacy engine and decided to learn more about this project - which is really impressive stuff, by the way - and keep an offline copy of their wiki for studying and reference.

I tried to zim their wiki from youzim.it, but the site fails and tells me to open an issue. Here I am.
Here is the wiki link: wiki.srb2.org

Thank you in advance.

@benoit74
Collaborator

benoit74 commented Dec 3, 2023

We will have a look, sorry about the inconvenience and thank you very much for your interest and support

@benoit74 benoit74 self-assigned this Dec 3, 2023
@benoit74 benoit74 added the scraping_issue Issue occured while using the scraper label Dec 3, 2023
@pfspace
Author

pfspace commented Dec 4, 2023

There is nothing to apologize for.
Thank you very much for your attention and congratulations for your great work on ZimIt.

@benoit74
Collaborator

benoit74 commented Dec 4, 2023

I confirm this is a scraper issue. This is not the first occurrence, so I've opened #256 to track and solve this issue. I will keep you informed here as well once we've made progress.

@pfspace
Author

pfspace commented Dec 5, 2023

Thank you very much.

@benoit74
Collaborator

In fact, the problem is that the wiki is protected by Cloudflare.
And Cloudflare considers us a bot scraping the website (which is not that wrong).
Unless you know the site admins and can ask them to whitelist our IP, there is probably not much we can do.

@benoit74 benoit74 added anti_bot_issue The issue is linked to an anti-bot protection of target website (Cloudflare, Imperva, ...) and removed scraping_issue Issue occured while using the scraper labels Jan 31, 2024
@pfspace
Author

pfspace commented Feb 5, 2024

I don't know them. I'm not a member of the community, just learned of this project recently. Anyway, I understand the situation. Thank you very much for your efforts and attention.

@Popolechien
Contributor

@pfspace Drop them an email explaining the issue? We're actually looking for someone to work with to develop a proper whitelisting procedure, and people who start a wiki are usually collaborative-minded.

@pfspace
Author

pfspace commented Feb 5, 2024

Sure, I can try.

@Logan-A

Logan-A commented Feb 5, 2024

Hello, I am one of the people that run the wiki at https://wiki.srb2.org/

I am looking into our logs

@alama

alama commented Feb 5, 2024

Sorry, but we have blocked AS12876 due to forum spam going to https://mb.srb2.org/ coming from that datacenter, and I am not going to remove this block.

@benoit74
Collaborator

benoit74 commented Feb 6, 2024

@alama @Logan-A

We have various workers, donated by various volunteers across various machines all around the globe (most of these machines are not ours), so it is true that unblocking the whole AS12876 is really not appropriate (we do not control the whole AS at all, and in some cases we do not even have full control of the machine) and not sufficient (next time we might run the task on a different worker, probably on a different AS).

I would prefer that we test (if possible for you, of course) whitelisting one single worker IP for now, on a machine we have full control over (so that I can guarantee you won't get other traffic from this IP), and I will ensure the next job is run on this machine.

Is it correct that this is a configuration you do in Cloudflare? How do you do this, in the WAF? It is not that important for this specific test, but we would like to gain knowledge of what is possible with the various WAF / protection systems on the market (at least the main ones like Cloudflare) so that we have clear procedures for what to do in future cases.

Thank you anyway for your cooperation on this, much appreciated!

@benoit74
Collaborator

benoit74 commented Feb 6, 2024

PS: is it an issue if the IP I give you is in AS12876? This is both a technical question (giving a higher priority to the ALLOW rule, or something like that) and a non-technical one (is it OK for you?). AS12876 is used by a French cloud provider (Scaleway) from which we rent a machine, but for sure you have very varied traffic running on their 475,136 IPv4 addresses (not to mention IPv6 ...).
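
To illustrate the technical half of that question, here is a minimal sketch of "ALLOW takes priority over the AS-wide block", written in plain Python. It is not Cloudflare's rule engine or API, and nothing here reflects how the wiki's firewall is actually configured. The addresses are documentation-only ranges (RFC 5737) standing in for the blocked AS prefixes and the single worker IP, and the names `BLOCKED_NETWORKS`, `ALLOWED_IPS` and `decide` are hypothetical.

```python
from ipaddress import ip_address, ip_network

# Hypothetical stand-ins: documentation-only ranges (RFC 5737), not real
# AS12876 prefixes and not a real worker address.
BLOCKED_NETWORKS = [ip_network("203.0.113.0/24")]   # plays the role of the blocked AS
ALLOWED_IPS = {ip_address("203.0.113.10")}          # plays the role of the one worker IP


def decide(client_ip: str) -> str:
    """Return "allow" or "block", checking the allow list before the block list."""
    ip = ip_address(client_ip)
    # The allow rule is evaluated first, so it wins even inside a blocked range.
    if ip in ALLOWED_IPS:
        return "allow"
    # Everything else in the blocked ranges stays blocked.
    if any(ip in net for net in BLOCKED_NETWORKS):
        return "block"
    # Traffic from outside the blocked ranges is unaffected.
    return "allow"


if __name__ == "__main__":
    for candidate in ("203.0.113.10", "203.0.113.11", "198.51.100.7"):
        print(candidate, decide(candidate))
```

The point of the ordering is that the rest of the blocked range stays blocked while the one whitelisted address gets through. Whether a given protection system lets you express this ordering is exactly what would need to be confirmed with the site admins.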

@Logan-A

Logan-A commented Feb 6, 2024

I have unblocked AS12876, and am now trying to zim wiki.srb2.org via youzim.it

@pfspace
Author

pfspace commented Feb 6, 2024

> I have unblocked AS12876, and am now trying to zim wiki.srb2.org via youzim.it

Thank you very much for your help and attention.

@benoit74
Collaborator

benoit74 commented Feb 8, 2024

@Logan-A
Your youzim.it task has been successful: https://farm.youzim.it/pipeline/7f652142-39fe-4228-9678-8550c726c44d

It took a bit of time because there were quite a lot of jobs in the pipe when you requested the job (the pipe is now empty ATM).

Unfortunately the ZIM is not complete because the crawl was throttled after 2 hours of scraping. Only ~1400 pages have been scraped out of the ~18000 pages discovered by the scraper so far.
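
For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes the scraping rate stays constant and that the ~18000 pages discovered so far is the final count, neither of which is guaranteed. The variable names are illustrative.

```python
# Approximate figures taken from the comment above.
pages_scraped = 1400
pages_discovered = 18000   # "discovered so far", so the real total may be higher
hours_elapsed = 2

# Share of the discovered pages that made it into the ZIM.
coverage = pages_scraped / pages_discovered

# Average scraping rate, assumed constant for the estimate below.
rate_per_hour = pages_scraped / hours_elapsed

# Time needed to scrape every discovered page at that rate.
hours_for_all = pages_discovered / rate_per_hour

print(f"coverage: {coverage:.1%}")
print(f"rate: {rate_per_hour:.0f} pages/hour")
print(f"estimated time for all discovered pages: {hours_for_all:.1f} hours")
```

Under those assumptions, the ZIM covers under 8% of the discovered pages, and a full crawl would take on the order of a day rather than 2 hours.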

You might want to apply for a zim-request (open an Issue in https://github.com/openzim/zim-requests) so that we create the ZIM on our regular workers (but IP and AS will probably change) which have no time or size limit, plus we will update the ZIM regularly. We have some policies around which content we consider for inclusion in our set of ZIMs, but I think you might qualify since we already have done a ZIM of a pokemon wiki.

@pfspace
Author

pfspace commented Feb 8, 2024

Thank you all for your support, efforts and attention.

@benoit74
Collaborator

Nothing left to do on scraper side, closing this


5 participants