Can't Zim a Wiki #255

pfspace · 2023-12-03T05:35:12Z

Hi there,

I'm developing a simple 2D platformer game and, due to its poor performance on low-end hardware, I decided to take a break from the engine I was building my game on and build my own engine from scratch in C and SDL.

Then I learned about a community game called Sonic Robo Blast 2, which is based on the Doom Legacy engine, and decided to learn more about this project - which is really impressive stuff, by the way - and keep an offline copy of their wiki for study and reference.

I tried to zim their wiki from youzim.it, but the site fails and tells me to open an issue. Here I am.
Here is the wiki link: wiki.srb2.org

Thank you in advance.

benoit74 · 2023-12-03T08:31:47Z

We will have a look. Sorry about the inconvenience, and thank you very much for your interest and support.

pfspace · 2023-12-04T01:54:31Z

There is nothing to apologize for.
Thank you very much for your attention and congratulations on your great work on ZimIt.

benoit74 · 2023-12-04T09:05:12Z

I confirm this is a scraper issue. This is not the first occurrence, so I've opened #256 to track and solve this issue. I will keep you informed here as well once we've made progress.

pfspace · 2023-12-05T03:01:10Z

Thank you very much.

benoit74 · 2024-01-31T14:05:15Z

In fact, the problem is that the wiki is protected by Cloudflare.
And Cloudflare considers us a bot scraping the website (which is not that wrong).
Unless you know the site admins and can ask them to whitelist our IP, there is probably not much we can do.
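
For anyone who wants to check whether a failure like this comes from an anti-bot protection rather than from the scraper itself, here is a minimal sketch. It assumes the block shows up as an HTTP 403, 429 or 503 response served by Cloudflare, or as a response carrying Cloudflare's `cf-mitigated: challenge` header. The function name `looks_like_cloudflare_block` is hypothetical and is not part of zimit; the check is a heuristic, not a reliable detector.

```python
def looks_like_cloudflare_block(status: int, headers: dict[str, str]) -> bool:
    """Heuristic: True if a response looks like a Cloudflare block or challenge.

    `status` is the HTTP status code and `headers` the response headers,
    as returned by whatever HTTP client is in use (assumption).
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}

    # Assumption: challenge pages carry this header.
    if lowered.get("cf-mitigated") == "challenge":
        return True

    # Otherwise fall back to "served by Cloudflare and refused".
    served_by_cloudflare = lowered.get("server", "") == "cloudflare"
    return served_by_cloudflare and status in (403, 429, 503)
```

A normal page served through Cloudflare (status 200) is not flagged; only refusals and challenges are.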

pfspace · 2024-02-05T11:36:21Z

I don't know them. I'm not a member of the community, just learned of this project recently. Anyway, I understand the situation. Thank you very much for your efforts and attention.

Popolechien · 2024-02-05T11:38:05Z

@pfspace Drop them an email explaining the issue? We're actually looking for someone to work with to develop a proper whitelisting procedure, and people who start a wiki are usually collaborative-minded.

pfspace · 2024-02-05T11:53:18Z

Sure, I can try.

Logan-A · 2024-02-05T22:28:29Z

Hello, I am one of the people that run the wiki at https://wiki.srb2.org/

I am looking into our logs

alama · 2024-02-05T22:58:30Z

Sorry, but we have blocked AS12876 due to forum spam going to https://mb.srb2.org/ coming from that datacenter, and I am not going to remove this block.

benoit74 · 2024-02-06T07:23:10Z

@alama @Logan-A

We have various workers, donated by volunteers and running on machines all around the globe (most of these machines are not ours). So it is true that removing the block on the whole AS12876 is neither appropriate (we do not control the whole AS at all, and in some cases we do not even have full control of the machine) nor sufficient (next time we might run the task on a different worker, probably on a different AS).

I would prefer that we test (if possible for you, of course) the whitelisting of one single worker IP for now, on a machine we have full control over (so that I can guarantee you won't get other traffic from this IP), and I will ensure the next job is run on this machine.

Is it correct that this is a configuration you do in Cloudflare? How do you do this, in the WAF? It is not that important for this specific test, but we would like to learn what is possible with the various WAF / protection systems on the market (at least the main ones like Cloudflare) so that we have clear procedures for what to do in future cases.

Thank you anyway for your cooperation on this, much appreciated!

benoit74 · 2024-02-06T07:29:41Z

PS: is it an issue if the IP I give you is in AS12876? This is both a technical question (give a higher priority to the ALLOW rule or something like that) and a non-technical one (is it OK for you). AS12876 is used by a French cloud provider (Scaleway) from which we rent a machine, but for sure you have very varied traffic / stuff running on their 475,136 IPv4 addresses (not to mention IPv6 ...).
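
To illustrate the "higher priority to the ALLOW rule" idea, here is a minimal sketch of first-match rule evaluation. It is not Cloudflare's actual rule engine or API: the `RULES` list, the `decide` function, and the IP `203.0.113.7` (taken from a range reserved for documentation) are all hypothetical. The only assumption is that rules are checked in order and the first match wins, so a single-IP allow placed before the AS-wide block takes precedence.

```python
import ipaddress

# Hypothetical rule list, evaluated top to bottom; the first match wins.
RULES = [
    # Allow one worker IP (documentation-range address, not a real worker).
    ("allow", "ip", ipaddress.ip_network("203.0.113.7/32")),
    # Block everything else announced by AS12876.
    ("block", "asn", 12876),
]


def decide(ip: str, asn: int) -> str:
    """Return "allow" or "block" for a request from `ip` announced by `asn`."""
    address = ipaddress.ip_address(ip)
    for action, kind, value in RULES:
        if kind == "ip" and address in value:
            return action
        if kind == "asn" and asn == value:
            return action
    # No rule matched: default to allowing the request.
    return "allow"
```

With this ordering, `decide("203.0.113.7", 12876)` is allowed while any other address in the same AS is still blocked. Swapping the two rules would block the worker too, which is why the priority question matters.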

Logan-A · 2024-02-06T16:21:32Z

I have unblocked AS12876, and am now trying to zim wiki.srb2.org via youzim.it

pfspace · 2024-02-06T21:39:40Z

> I have unblocked AS12876, and am now trying to zim wiki.srb2.org via youzim.it

Thank you very much for your help and attention.

benoit74 · 2024-02-08T07:28:53Z

@Logan-A
Your youzim.it task has been successful: https://farm.youzim.it/pipeline/7f652142-39fe-4228-9678-8550c726c44d

It took a bit of time because there were quite a lot of jobs in the pipe when you requested the job (the pipe is now empty ATM).

Unfortunately the ZIM is not complete because it was throttled after 2 hours of scraping. Only ~1400 pages have been scraped out of the ~18000 pages discovered by the scraper so far.
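
As a rough back-of-the-envelope check of what these numbers imply, here is a small sketch. It assumes the scrape rate stays constant and that ~18000 is the final page count; both are assumptions, since the scraper may still discover more pages and the rate can vary.

```python
# Figures from the youzim.it run above (all approximate).
pages_scraped = 1400
pages_discovered = 18000
hours_spent = 2

# Share of the discovered pages that made it into the ZIM.
coverage = pages_scraped / pages_discovered

# Average scrape rate, assumed constant for the estimate.
pages_per_hour = pages_scraped / hours_spent

# Time a full scrape would need at that rate.
hours_for_full_scrape = pages_discovered / pages_per_hour

print(f"coverage: {coverage:.1%}")
print(f"rate: {pages_per_hour:.0f} pages/hour")
print(f"full scrape: about {hours_for_full_scrape:.1f} hours")
```

Under these assumptions a complete scrape would need on the order of a day, far beyond a 2-hour limit.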

You might want to apply for a zim-request (open an issue in https://github.com/openzim/zim-requests) so that we create the ZIM on our regular workers (but the IP and AS will probably change), which have no time or size limit; plus, we will update the ZIM regularly. We have some policies around which content we consider for inclusion in our set of ZIMs, but I think you might qualify since we have already done a ZIM of a Pokémon wiki.

pfspace · 2024-02-08T08:54:40Z

Thank you all for your support, efforts and attention.

benoit74 · 2024-05-28T12:15:41Z

Nothing left to do on the scraper side, closing this.

benoit74 self-assigned this Dec 3, 2023

benoit74 added the scraping_issue Issue occured while using the scraper label Dec 3, 2023

benoit74 mentioned this issue Dec 4, 2023

Initial check of URL is causing issues #256

Closed

benoit74 added anti_bot_issue The issue is linked to an anti-bot protection of target website (Cloudflare, Imperva, ...) and removed scraping_issue Issue occured while using the scraper labels Jan 31, 2024

benoit74 closed this as completed May 28, 2024
