New request: shamela.ws المكتبة الشاملة #1172

RavanJAltaie · 2024-09-30T07:59:01Z

Website URL: https://shamela.ws/
License: Open free content
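Desired ZIM Title: المكتبة الشاملة (in English: "The Comprehensive Library")
Desired ZIM Description: مشروع يهدف لجمع ما يحتاجه طالب العلم من كتب وبحوث (in English: "A project that aims to gather the books and research a student of knowledge needs")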
Desired ZIM Icon –png (URL or attach one):
Language (ISO 639-3): ara
Is this a MediaWiki?: no

RavanJAltaie · 2024-09-30T08:04:17Z

Recipe created
https://farm.openzim.org/recipes/shamela.ws_ar_all
I'll update the library link once ready.
I've already emailed them to double-check whether any of the books in the library are under copyright.

RavanJAltaie · 2024-10-01T12:56:52Z

We've received an answer from the team: all the books on the website are more than 100 years old. In 20 years (their operation time) they've received only 2 copyright claims about books, and they deleted those books immediately, as per their website policy.

hamoudak · 2024-10-01T14:15:40Z

Does that mean it won't get crawled?! I requested a ZIM file related to this website in #986; I think it's public domain.
Could this be made?

Popolechien · 2024-10-01T14:16:53Z

@hamoudak no no, this means we're good. You can follow the task on the link given above.

hamoudak · 2024-10-01T14:19:53Z

Thank you, it's a valuable website for reading and studying.
Sorry, I was confused; so their website policy is for the content to be free.

benoit74 · 2024-10-05T12:27:06Z

After 3 days, crawler progress is 3% (100753 / 2859505). 2.8 million links to explore is way too much.
I cancelled the task and disabled the recipe. We need to find another way of ZIMing this website, this is not feasible with zimit, at least as-is.
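For reference, the progress figures above can be extrapolated linearly to see why the crawl was not feasible. This is only a rough sketch: the function name is made up for illustration, and it assumes a constant crawl speed and a fixed link total, whereas the crawler may discover more links as it goes, so the real figure could be higher.

```python
from datetime import timedelta


def estimate_total_crawl_time(done: int, total: int, elapsed: timedelta) -> timedelta:
    """Linearly extrapolate the full crawl duration from the progress so far.

    Hypothetical helper: assumes constant speed and a fixed total.
    """
    if done <= 0:
        raise ValueError("need at least one crawled page to extrapolate")
    return elapsed * (total / done)


# Figures quoted in the comment above: 100753 of 2859505 links after 3 days.
total_time = estimate_total_crawl_time(100_753, 2_859_505, timedelta(days=3))
print(f"Estimated total crawl time: about {total_time.days} days")
```

Under these assumptions the full crawl would take on the order of 85 days.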

hamoudak · 2024-10-05T14:51:20Z

@benoit74 could my request #986 be created? It's one of five archives related to this domain. Or do I have to wait for some reason?

Also, may I suggest that the library keeps being scraped with Zimit, but divided into 40 categories as it is on the website, or divided on your side.

benoit74 · 2024-10-05T20:38:49Z

The idea of dividing the ZIM per category as on the website is a good one.

And looking a bit more into it, I don't get why we ended up with more than 2M links.

Anyway, I've started a first sub-recipe of category 34: https://farm.openzim.org/recipes/shamela.ws_ar_34

In this new recipe, ZIM name, title and description are very bad, this will have to be fixed, but at least let's see how it goes.

hamoudak · 2024-10-05T21:46:57Z

I can give you the names of these categories in Arabic; I know Arabic very well.
This category is called [al-shir-wa-dawawinu], poetry diwans.
Arabic: الشعر ودواوينه

Why it ended up with millions of pages: because some books have many volumes, such as dictionaries, literature, commentaries and Quran interpretation. In addition, there is a category called Sunna books (category/6); most of its books have many links because each page may hold just one or two lines of text from a big book, so it eventually ends up with many links.

benoit74 · 2024-10-06T06:32:41Z

> Why it ended up with millions of pages: because some books have many volumes, such as dictionaries, literature, commentaries and Quran interpretation. In addition, there is a category called Sunna books (category/6); most of its books have many links because each page may hold just one or two lines of text from a big book, so it eventually ends up with many links.

OK, so what I did for [al-shir-wa-dawawinu] poetry diwans is not going to work for all categories. I basically asked the scraper to explore only the links listed on the category page, so if I understand you correctly, it will explore only the books but not their volumes.

If I get you correctly, what we would like is tell the scraper to:

find all books URLs listed on the category page
for all these URLs, explore the book URL and all its sub-URLs
also explore all author URLs (these will probably contain some "external links", because some authors will probably have books in another category, which will hence not end up inside the ZIM).

For instance, for https://shamela.ws/category/4, we want the book https://shamela.ws/book/23622 but also https://shamela.ws/book/23622/1, https://shamela.ws/book/23622/2 and so on, and also https://shamela.ws/author/263 ; and so on for all other books of the category.
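To make the intended scope concrete, here is a minimal sketch of the rule described above, written as a plain Python predicate. It is not zimit configuration: `in_scope` and `book_ids` are hypothetical names, and it assumes that book sub-URLs always end in a numeric page segment, as in the examples.

```python
import re

# Assumed URL shapes, taken from the examples above:
#   https://shamela.ws/book/<id>
#   https://shamela.ws/book/<id>/<page>
#   https://shamela.ws/author/<id>
URL_PATTERN = re.compile(r"https://shamela\.ws/(book|author)/(\d+)(?:/(\d+))?")


def in_scope(url: str, book_ids: set[str]) -> bool:
    """Return True if the URL should end up in the ZIM of one category.

    book_ids holds the IDs of the books listed on the category page.
    """
    match = URL_PATTERN.fullmatch(url)
    if match is None:
        return False
    kind, identifier, page = match.groups()
    if kind == "author":
        # Author pages are kept as-is; they have no sub-pages in this sketch.
        return page is None
    # Book main page or any of its numbered sub-pages.
    return identifier in book_ids


# Example with the single book of category 4 mentioned above.
category_4_books = {"23622"}
for candidate in (
    "https://shamela.ws/book/23622",
    "https://shamela.ws/book/23622/2",
    "https://shamela.ws/author/263",
    "https://shamela.ws/book/13174",
):
    print(candidate, in_scope(candidate, category_4_books))
```

A book from another category (here https://shamela.ws/book/13174) is rejected, which is what produces the "external links" mentioned for author pages.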

Is this correct? Do we have other links / pages which would be needed in each ZIM per category?

All that being said, I don't know yet how to do it with zimit, but at least it is important to understand what we would like to achieve ^^

benoit74 · 2024-10-06T06:50:12Z

> I can give you the names of these categories in Arabic; I know Arabic very well.
> This category is called [al-shir-wa-dawawinu], poetry diwans.
> Arabic: الشعر ودواوينه

Glad you can help on this, thanks a lot. Once we have a working plan, I will come back to you about what we need precisely.

hamoudak · 2024-10-06T11:18:10Z

First, it will only explore the links but not their sub-pages [the books themselves are volumes]. And you are absolutely right on all three points you gave, with the examples. I have made over 200 (highly important) books of this domain with youzimit. When I did a basic crawl, I got just the titles (the contents of the book), not the sub-pages. So I went to the custom scope and gave it the right parameters for the sub-page links, and then it worked.

No, you don't need any other links or pages.

Here's the ZIM file of a book I made.
I was limited in resources for this file, so it's not complete.
https://archive.org/download/sirat-ibn-hisham_904a56f0
sirat-ibn-hisham_log.txt

another complete one:
https://archive.org/download/shan-al-dua_3182b2f0

benoit74 · 2024-10-06T19:51:52Z

I think I've managed to build a pretty good ZIM of category 34. You can see a preview at https://dev.library.kiwix.org/#lang=&q=34 (this is not the final URL, and it is never guaranteed to work; this is just the dev server).

I'm currently running the recipe again to update the icon (which is blank) and to update the CSS of HTML pages inside the ZIM (to hide useless things when offline).
What do you think?

My main concern is that it took 7 hours. That is not that bad given the scraper had to explore 7966 links, which gives us an average of about 3 secs per link, but this was for only 25 books. I don't know what this will mean for a huge category like category 6, with 1227 books and very huge ones like https://shamela.ws/book/13174.

For the new task which is currently processing, I've increased the number of parallel workers to 4; let's hope it will not trigger something bad on the upstream server.

I will also retrieve all main book pages to count the number of links per category and have an estimate of total time per ZIM.

Do you have any idea of how often we should update the ZIM? E.g. how often are they adding new books, or how many books are they adding per month / quarter / year?

Since it looks like we will finally have a plan, it is now time to ask for your help regarding ZIM metadata.

For every ZIM (and hence category for now), we will need:

a selection: this is what will go into the ZIM name, which will be shamela.ws_ar_<selection> (without the < and >). Since we are doing one ZIM per category, the selection should be more or less the category. It should be as short as possible, but also as expressive as possible. It can contain only alphanumeric characters and the dash. If I understand you correctly, I imagine that for category 34 it should be 34-al-shir-wa-dawawinu (I'm not sure adding the category number will help ... it would help us to maintain the ZIM at least ^^)
a title: this is the ZIM title, displayed in all readers. It is limited to 30 characters. It should help to identify which ZIM the user is going to open. It must be in the ZIM's native language, so Arabic, and user-friendly (i.e. it is not used by machines). If it were in English, I would for instance consider using shamela.ws books: category 34, since it is difficult to be more expressive in only 30 characters. I don't really like it to be honest, it is a bit ugly, but at the same time I don't see how to fit more in 30 chars. Maybe shamela poetry diwans? I'm not sure it will be possible for all categories, and it is less precise than the first alternative I proposed
a description: this is the ZIM description, displayed in all readers. It is limited to 80 characters. It should help understand what is inside the ZIM, as a complement to the title. It must be in the ZIM's native language, so Arabic, and user-friendly (i.e. it is not used by machines). If it were in English and I understand you correctly, I would probably use something like Books of shamela.ws collection, category 34 of poetry diwans (maybe adding something about what these books are would be interesting: are they about art, religion, daily life, law, mechanics, technology, ...? I haven't understood this so far)
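To help when drafting proposals, the three constraints above can be checked mechanically. The sketch below is illustrative only: `check_metadata` and `zim_name` are hypothetical helpers, "alphanumeric" is assumed to mean ASCII letters and digits, and lengths are counted with Python's `len()` (code points), which may differ from how readers count characters for Arabic text with diacritics. The `shamela.ws_ar_` prefix follows the recipe names linked earlier in this thread.

```python
import re

# Assumption: "alphanumeric characters and the dash" means ASCII only.
SELECTION_PATTERN = re.compile(r"[A-Za-z0-9-]+")
MAX_TITLE_LENGTH = 30
MAX_DESCRIPTION_LENGTH = 80


def zim_name(selection: str) -> str:
    """Build the ZIM name from the selection."""
    return f"shamela.ws_ar_{selection}"


def check_metadata(selection: str, title: str, description: str) -> list[str]:
    """Return a list of problems; an empty list means the proposal fits."""
    problems = []
    if SELECTION_PATTERN.fullmatch(selection) is None:
        problems.append("selection may contain only alphanumerics and dashes")
    if len(title) > MAX_TITLE_LENGTH:
        problems.append(f"title is {len(title)} chars, limit is {MAX_TITLE_LENGTH}")
    if len(description) > MAX_DESCRIPTION_LENGTH:
        problems.append(
            f"description is {len(description)} chars, limit is {MAX_DESCRIPTION_LENGTH}"
        )
    return problems


# The English placeholders suggested above for category 34.
issues = check_metadata(
    "34-al-shir-wa-dawawinu",
    "shamela.ws books: category 34",
    "Books of shamela.ws collection, category 34 of poetry diwans",
)
print(zim_name("34-al-shir-wa-dawawinu"), issues or "OK")
```

The same check can be run on the Arabic title and description once they are proposed.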

Could you propose something for category 34 first? Please do not hesitate to ask friends for feedback as well on these, it is hard work and often good ideas might come from interactions with others.

hamoudak · 2024-10-06T21:16:20Z

You actually made it very good and complete. I downloaded it from the farm before you posted this comment. Everything works as intended. As for the things you'll hide, I don't know much about that, but the ZIM file as it is shown on the website is good enough.

Having been on the website for a long time, I have noticed that they add few books: about 7 books in six weeks. Their uploading of new books is irregular; one time I saw just two new books after nearly two months. They don't add many, but they work hard to add new ones.

Now I think it will be better if you leave the categories without numbers; their names would define them. I am going to work on the "metadata" tomorrow; I'll update you with anything I see. It's my pleasure to help you as much as I can.

Note: I find the first ZIM better than the updated one; it's not vital like the original, and I hope to keep it as it was for offline use.

benoit74 · 2024-10-07T06:26:53Z

> Having been on the website for a long time, I have noticed that they add few books: about 7 books in six weeks. Their uploading of new books is irregular; one time I saw just two new books after nearly two months. They don't add many, but they work hard to add new ones.

OK, thank you. Updating the ZIM once per semester (every six months) is hence probably a bare minimum. But I would like to avoid creating multiple ZIMs in parallel, so as not to overwhelm their server, so I doubt we can update once per quarter, or it would mean almost always having a ZIM update running.

> Now I think it will be better if you leave the categories without numbers; their names would define them. I am going to work on the "metadata" tomorrow; I'll update you with anything I see. It's my pleasure to help you as much as I can.

Thank you!

> Note: I find the first ZIM better than the updated one; it's not vital like the original, and I hope to keep it as it was for offline use.

I'm not sure I get you here. Do you mean that the links I hid in the latest version made the ZIM worse? I'm very interested in your feedback on this. We have always considered that it is better to hide as many "broken" external links as possible, since they usually do not work in an offline scenario and, so far, we've assumed they bring only frustration.

> I will also retrieve all main book pages to count the number of links per category and have an estimate of total time per ZIM.

On this estimate, as expected category 6 is the biggest one with approximately 1.1 million links. Given the speed of last task with 4 parallel workers, it should complete in about 10 days which is OK.
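The 10-day figure is consistent with the numbers from the category 34 run. The sketch below redoes the arithmetic under two loudly stated assumptions that are not confirmed in this thread: that the 7-hour run behaved like a single worker, and that 4 workers give a perfect 4x speed-up. Function names are hypothetical.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def seconds_per_link(duration_hours: float, links: int) -> float:
    """Average time spent per link in a finished run."""
    return duration_hours * SECONDS_PER_HOUR / links


def estimate_days(links: int, secs_per_link: float, workers: int) -> float:
    """Estimated run time in days, assuming perfect scaling across workers."""
    return links * secs_per_link / workers / SECONDS_PER_DAY


# Category 34 run: 7 hours for 7966 links.
per_link = seconds_per_link(7, 7966)

# Category 6: approximately 1.1 million links, 4 parallel workers.
category_6_days = estimate_days(1_100_000, per_link, workers=4)

print(f"{per_link:.2f} s per link, about {category_6_days:.1f} days for category 6")
```

If the upstream server slows down under 4 parallel workers, the real duration would be longer than this estimate.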

hamoudak · 2024-10-07T11:59:59Z


RavanJAltaie added the labels "Zimit" and "Library" (Collections of books mostly) on Sep 30, 2024

benoit74 added the labels "Bug" (Something isn't working) and "Scraper Needed" (We need to build a dedicated scraper for this website) on Oct 5, 2024

benoit74 mentioned this issue Oct 6, 2024

New request: https://al-maktaba.org/book/31617 #986

Open

benoit74 mentioned this issue Oct 7, 2024

Create a new platform for shamela.ws openzim/zimfarm#1023

Open
