New request: shamela.ws المكتبة الشاملة (the Comprehensive Library) #1172
Recipe created
We've received an answer from the team that all the books on the website are more than 100 years old. In 20 years (their operation time) they've received only 2 copyright claims on books, and they deleted those books immediately as per their website policy.
Does that mean it won't get crawled?! I have requested a ZIM file related to this website here [#986]; I think it's public domain.
@hamoudak no no, this means we're good. You can follow the task on the link given above.
Thank you, it's a valuable website for reading and studying.
After 3 days, crawler progress is at 3% (100753 / 2859505). 2.8 million links to explore is way too much.
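(As a rough linear extrapolation, an editorial back-of-the-envelope: 100753 / 2859505 ≈ 3.5%, so at this rate a full crawl would take on the order of 3 / 0.035 ≈ 85 days.)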
The idea of dividing the ZIM per category, as on the website, is a good one. And looking a bit more into it, I don't get why we ended up with 2M links. Anyway, I've started a first sub-recipe of category 34: https://farm.openzim.org/recipes/shamela.ws_ar_34 In this new recipe the ZIM name, title and description are very bad; this will have to be fixed, but at least let's see how it goes.
I can give you the names of these categories in Arabic; I know Arabic very well. Why it ended up with millions of pages: because some books have many volumes, like dictionaries, literature works, and explanations and interpretations of the Quran. In addition, there is a category called sunna books (category/6); most of its books have many links, because for a big book each page may hold just one or two lines of text, so it eventually ends up with many links.
OK. If I get you correctly, what we would like is to tell the scraper to crawl, for each category, the book main pages, all of their sub-pages (the volumes), and the author pages.
For instance, for https://shamela.ws/category/4, we want the book https://shamela.ws/book/23622 but also https://shamela.ws/book/23622/1, https://shamela.ws/book/23622/2 and so on, and also https://shamela.ws/author/263; and so on for all other books of the category. Is this correct? Do we have other links / pages which would be needed in each ZIM per category? All that being said, I don't know yet how to do it with zimit, but at least it is important to understand what we would like to achieve ^^
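For what it's worth, here is a minimal sketch of what such a scope could look like, expressed as plain Python regexes. The patterns are editorial assumptions derived only from the example URLs above, not the actual zimit/Browsertrix configuration:

```python
import re

# Hypothetical include patterns inferred from the example URLs above
# (category/4, book/23622, book/23622/1, author/263); the real recipe
# configuration may differ.
CATEGORY = 4  # e.g. https://shamela.ws/category/4

INCLUDE_PATTERNS = [
    rf"https://shamela\.ws/category/{CATEGORY}",
    r"https://shamela\.ws/book/\d+(?:/\d+)?",  # book main page and its volume sub-pages
    r"https://shamela\.ws/author/\d+",         # author pages
]

def in_scope(url: str) -> bool:
    """Return True if the URL fully matches one of the include patterns."""
    return any(re.fullmatch(pattern, url) for pattern in INCLUDE_PATTERNS)

# Quick check against the examples from the thread:
for url in (
    "https://shamela.ws/category/4",
    "https://shamela.ws/book/23622",
    "https://shamela.ws/book/23622/2",
    "https://shamela.ws/author/263",
    "https://example.com/elsewhere",  # should be out of scope
):
    print(in_scope(url), url)
```

In zimit this would presumably map onto its custom scoping options, but the exact flags would need to be checked against the recipe.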
Glad you can help on this, thanks a lot. Once we have a working plan, I will come back to you about what we need precisely.
First, it will only explore the links but not their sub-pages [the books themselves are volumes]. And you are absolutely right in all three points you gave with examples. I have made over 200 (highly important) books of this domain with youzimit. When I did a basic crawl, I got just the titles (the contents of the book), not the sub-pages. So I went to the custom scope and gave it the right parameters for the sub-page links, and then I got it to work.
Here's the ZIM file of a book I made. Another complete one:
I think I've managed to build a pretty good ZIM of category 34. You can see a preview at https://dev.library.kiwix.org/#lang=&q=34 (this is not the final URL, and never guaranteed to work, this is just the dev server). I'm currently running the recipe again to update the icon (which is blank) and to update the CSS of HTML pages inside the ZIM (to hide useless things when offline).

My main concern is that it took 7 hours. This is not that bad given the scraper had to explore 7966 links, which gives us an average of 3 secs per link, but this was for only 25 books. I don't know what this will mean for a huge category like category 6, with 1227 books and very huge ones like https://shamela.ws/book/13174. For the new task which is currently processing, I've increased the number of parallel workers to 4; let's hope it will not trigger something bad on the upstream server. I will also retrieve all main book pages to count the number of links per category and get an estimate of total time per ZIM (a quick back-of-the-envelope on this is sketched below).

Do you have any idea of how often we should update the ZIM? E.g. how often are they adding new books, or how many books are they adding per month / quarter / year?

Since it looks like we will finally have a plan, it is now time to ask for your help regarding ZIM metadata. For every ZIM (and hence, for now, every category), we will need at least a proper ZIM name, title and description (the ones I flagged above as very bad).
Could you propose something for category 34 first? Please do not hesitate to ask friends for feedback on these as well; it is hard work, and good ideas often come from interactions with others.
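On the timing question above, a minimal estimation sketch, assuming crawl time scales linearly with the number of links and inversely with the number of workers (both assumptions from this thread, not measured facts):

```python
def crawl_hours(links: int, secs_per_link: float = 3.0, workers: int = 1) -> float:
    """Rough crawl duration in hours, assuming linear scaling in links
    and an ideal speed-up from parallel workers."""
    return links * secs_per_link / workers / 3600

# Category 34: 7966 links at ~3 s/link with 1 worker
print(f"{crawl_hours(7966):.1f} h")  # ~6.6 h, close to the observed 7 h
```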
You actually made it very good, and a complete one. I downloaded it from the farm before you posted this comment; everything works as intended. As for the things you'll hide, I don't know much about it, but the ZIM file as it is shown on the website is good enough. From a long time spent on the website, I have noticed that they add few books: about 7 books in six weeks. Their uploading of new books is random; one time I saw just two new books after nearly two months. They don't add much, but they work hard to add new ones. Now I think it would be better if you leave the categories without numbers; their names would define them. I am going to work on the "metadata" tomorrow and will update you with anything I see. It's my pleasure to help as much as I can. Note: I find the first ZIM better than the updated one; it is not as lively as the original, and I hope to keep it as it was for offline use.
OK, thank you. Updating the ZIM once per semester is hence probably a bare minimum. But I would like to avoid creating multiple ZIMs in parallel, to avoid overwhelming their server, so I doubt we can update once per quarter, as it would mean almost always having a ZIM update running.
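(A rough editorial estimate from the rate reported above: ~7 books per 6 weeks is about 60 books per year, so an update once per semester would pick up roughly 30 new books.)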
Thank you!
I'm not sure I get you here. Do you mean that the links I hid in the latest version made the ZIM worse? I'm very interested in your feedback on this; we have always considered it better to hide as many "broken" external links as possible, since they usually do not work in offline scenarios and, we assumed so far, bring only frustration.
On this estimate: as expected, category 6 is the biggest one, with approximately 1.1 million links. Given the speed of the last task with 4 parallel workers, it should complete in about 10 days, which is OK.
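(As a rough check, using the same assumptions as the sketch above: 1,100,000 links × 3 s ÷ 4 workers ≈ 825,000 s, i.e. about 9.5 days, consistent with that estimate.)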