New request: shamela.ws المكتبة الشاملة #1172

Open
RavanJAltaie opened this issue Sep 30, 2024 · 16 comments
Labels
Bug (Something isn't working), Library (Collections of books mostly), Scraper Needed (We need to build a dedicated scraper for this website), Zimit

Comments

@RavanJAltaie
Contributor

RavanJAltaie commented Sep 30, 2024

  • Website URL: https://shamela.ws/
  • License: Open free content
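  • Desired ZIM Title: المكتبة الشاملة (in English: "The Comprehensive Library")
  • Desired ZIM Description: مشروع يهدف لجمع ما يحتاجه طالب العلم من كتب وبحوث (in English: "A project that aims to collect the books and research a student of knowledge needs")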
  • Desired ZIM Icon –png (URL or attach one):
  • Language (ISO 639-3): ara
  • Is this a MediaWiki?: no
@RavanJAltaie added the Zimit and Library labels Sep 30, 2024
@RavanJAltaie
Contributor Author

Recipe created
https://farm.openzim.org/recipes/shamela.ws_ar_all
I'll update the library link once ready.
I have already emailed them to double-check whether any of the books in the library are under copyright.

@RavanJAltaie
Contributor Author

We've received an answer from the team: all the books on the website are more than 100 years old. In 20 years of operation they have received only 2 copyright claims on books, and they deleted those books immediately, as per their website policy.

@hamoudak

hamoudak commented Oct 1, 2024

Does that mean it won't get crawled?! I have requested a ZIM file related to this website here [#986]; I think it's public domain.
Could this be made?

@Popolechien
Collaborator

@hamoudak no no, this means we're good. You can follow the task on the link given above.

@hamoudak

hamoudak commented Oct 1, 2024

Thank you, it's a valuable website for reading and studying.
Sorry, I was confused; so their website policy is for the content to be free.

@benoit74
Contributor

benoit74 commented Oct 5, 2024

After 3 days, crawler progress is 3% (100753 / 2859505). 2.8 million links to explore is way too much.
I cancelled the task and disabled the recipe. We need to find another way of ZIMing this website, this is not feasible with zimit, at least as-is.
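
To make the "not feasible" judgement concrete, here is a minimal sketch that extrapolates the progress figures quoted above. It assumes a constant crawl rate and a queue that stops growing, neither of which is guaranteed for a live crawl; the function name is hypothetical.

```python
# Figures quoted in the comment above.
pages_done = 100_753
pages_queued = 2_859_505
days_elapsed = 3


def projected_total_days(done: int, total: int, elapsed_days: float) -> float:
    """Linear extrapolation: assumes a constant rate and a fixed queue size."""
    return elapsed_days * total / done


fraction = pages_done / pages_queued
total_days = projected_total_days(pages_done, pages_queued, days_elapsed)
print(f"progress so far: {fraction:.1%}")
print(f"projected total duration: {total_days:.0f} days")
```

Under these assumptions the full crawl would take on the order of 85 days, and the real figure could be higher if the crawler keeps discovering links.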

@benoit74 added the Bug and Scraper Needed labels Oct 5, 2024
@hamoudak

hamoudak commented Oct 5, 2024

@benoit74 could my request [#986] be created? It's one of five archives related to this domain. Or do I have to wait for some reason?

Also, I would suggest continuing to scrape the library with Zimit, but divided into the 40 categories used on the website, or divided in some other way on your side.

@benoit74
Contributor

benoit74 commented Oct 5, 2024

The idea of dividing the ZIM per category as on the website is a good one.

And looking a bit more into it, I don't get why we ended up with 2M links.

Anyway, I've started a first sub-recipe of category 34: https://farm.openzim.org/recipes/shamela.ws_ar_34

In this new recipe, ZIM name, title and description are very bad, this will have to be fixed, but at least let's see how it goes.

@hamoudak

hamoudak commented Oct 5, 2024

I can give you the names of these categories in Arabic; I know Arabic very well.
This category is called [al-shir-wa-dawawinu], poetry diwans.
Arabic: الشعر ودواوينه

Why did it end up with millions of pages? Because some books have many volumes, such as dictionaries, literature, commentaries and Quran interpretation. In addition, there is a category called Sunna books (category/6); most books in it have many links, because each page of a big book may hold just one or two lines of text, so it eventually ends up with many links.

@benoit74
Contributor

benoit74 commented Oct 6, 2024

> Why did it end up with millions of pages? Because some books have many volumes, such as dictionaries, literature, commentaries and Quran interpretation. In addition, there is a category called Sunna books (category/6); most books in it have many links, because each page of a big book may hold just one or two lines of text, so it eventually ends up with many links.

OK, so what I did for [al-shir-wa-dawawinu] poetry diwans is not going to work for all categories. I basically asked to explore only the links listed on the category page, so if I understand you well, it will explore only the books but not their volumes.

If I get you correctly, what we would like is tell the scraper to:

  • find all books URLs listed on the category page
  • for all these URLs, explore the book URL and all its sub-URLs
  • also explore all author URLs (an author page will probably contain some "external links", because some authors will have books in another category, which will hence not end up inside the ZIM).

For instance, for https://shamela.ws/category/4, we want the book https://shamela.ws/book/23622 but also https://shamela.ws/book/23622/1, https://shamela.ws/book/23622/2 and so on, and also https://shamela.ws/author/263 ; and so on for all other books of the category.

Is this correct? Do we have other links / pages which would be needed in each ZIM per category?

All that being said, I don't know yet how to do it with zimit, but at least it is important to understand what we would like to achieve ^^
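
The three rules above can be sketched as a URL filter. This is only an illustration of the intended scope, not zimit configuration: `build_scope_pattern` is a hypothetical helper, and the book and author ids are assumed to have been collected from the category page beforehand.

```python
import re


def build_scope_pattern(book_ids, author_ids):
    """Build a regex accepting a book URL, any of its sub-URLs, or an author URL."""
    parts = []
    if book_ids:
        ids = "|".join(str(int(b)) for b in book_ids)
        # The book page itself, optionally followed by /<anything> (its pages).
        parts.append(rf"book/(?:{ids})(?:/.*)?")
    if author_ids:
        ids = "|".join(str(int(a)) for a in author_ids)
        parts.append(rf"author/(?:{ids})")
    if not parts:
        raise ValueError("at least one book or author id is required")
    return re.compile(r"https://shamela\.ws/(?:" + "|".join(parts) + r")")


# Ids taken from the example in the comment above.
scope = build_scope_pattern(book_ids=[23622], author_ids=[263])

for url in (
    "https://shamela.ws/book/23622",
    "https://shamela.ws/book/23622/2",
    "https://shamela.ws/author/263",
    "https://shamela.ws/category/4",
):
    print(url, "in scope" if scope.fullmatch(url) else "out of scope")
```

Using `fullmatch` rather than a prefix test keeps a different book whose id merely starts with the same digits out of scope.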

@benoit74
Contributor

benoit74 commented Oct 6, 2024

> I can give you the names of these categories in Arabic; I know Arabic very well.
> This category is called [al-shir-wa-dawawinu], poetry diwans.
> Arabic: الشعر ودواوينه

Glad you can help on this, thanks a lot. Once we have a working plan, I will come back to you about what we need precisely.

@hamoudak

hamoudak commented Oct 6, 2024

At first it will only explore the links but not their sub-pages [the books themselves are volumes], and you are absolutely right on all three points you gave with examples. I have made ZIMs of over 200 (highly important) books from this domain with youzimit. When I did a basic crawl, I got just the titles (the contents of the book), not the sub-pages. So I went to the custom scope and gave it the right parameters for the sub-page links, and then I got it to work.

  • No, there are no other links or pages needed.

Here's the ZIM file of a book I made.
I was limited on resources for this file, so it's not complete.
https://archive.org/download/sirat-ibn-hisham_904a56f0
sirat-ibn-hisham_log.txt

another complete one:
https://archive.org/download/shan-al-dua_3182b2f0


@benoit74
Contributor

benoit74 commented Oct 6, 2024

I think I've managed to build a pretty good ZIM of category 34. You can see a preview at https://dev.library.kiwix.org/#lang=&q=34 (this is not the final URL, and it is never guaranteed to work, this is just the dev server).

I'm currently re-running the recipe to update the icon (which is blank) and to update the CSS of HTML pages inside the ZIM (to hide useless things when offline).
What do you think?

My main concern is that it took 7 hours, which is not that bad given the scraper had to explore 7966 links, which gives us an average of 3 secs per link, but this was for only 25 books. I don't know what this will mean for a huge category like category 6 with 1227 books and very huge ones like https://shamela.ws/book/13174.

For the new task, which is currently processing, I've increased the number of parallel workers to 4; let's hope it will not trigger something bad on the upstream server.

I will also retrieve all main book pages to count the number of links per category and have an estimate of total time per ZIM.

Do you have any idea of how often we should update the ZIM? E.g. how often are they adding new books, or how many books are they adding per month / quarter / year?

Since it looks like we will finally have a plan, it is now time to ask for your help regarding ZIM metadata.

For every ZIM (and hence category for now), we will need:

  • a selection: this is what will go into the ZIM name, which will be named shamela.ws_ar_<selection> (without the < and >). Since we are doing one ZIM per category, the selection should be more or less the category. It should be as short as possible, but also as expressive as possible. It can contain only alphanumeric characters and the dash. If I understand you well, I imagine that for category 34 it should be 34-al-shir-wa-dawawinu (I'm not sure adding the category number will help ... it would help us to maintain the ZIM at least ^^)
  • a title: this is the ZIM title, displayed in all readers. It is limited to 30 characters. It should help to identify which ZIM the user is going to open. It must be in the ZIM's native language, so Arabic, and user-friendly (i.e. it is not used by machines). If it were in English, I would for instance consider using "shamela.ws books: category 34", since it is difficult to be more expressive in only 30 characters. I don't really like it to be honest, it is a bit ugly, but at the same time I can't see how to fit more in 30 chars. Maybe "shamela poetry diwans"? Not sure it will be possible for all categories, and it is less precise than the first alternative I proposed
  • a description: this is the ZIM description, displayed in all readers. It is limited to 80 characters. It should help understand what is inside the ZIM, as a complement to the title. It must be in the ZIM's native language, so Arabic, and user-friendly (i.e. it is not used by machines). If it were in English, and if I understand you well, I would probably use something like "Books of shamela.ws collection, category 34 of poetry diwans" (maybe adding something about what these books are would be interesting: are they about art, religion, daily life, law, mechanics, technology, ...? I haven't understood this so far)

Could you propose something for category 34 first? Please do not hesitate to ask friends for feedback as well on these, it is hard work and often good ideas might come from interactions with others.
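
For anyone drafting proposals, the three constraints above can be checked with a short script. This is a minimal sketch, not the validation the openZIM tooling performs: `check_metadata` is a hypothetical helper, counting characters with `len()` (Unicode code points) is an assumption, and so is restricting the selection to ASCII letters and digits.

```python
import re

# Limits as stated in the comment above.
TITLE_MAX = 30
DESCRIPTION_MAX = 80
# Assumption: "alphanumeric" means ASCII letters and digits.
SELECTION_RE = re.compile(r"[A-Za-z0-9-]+")


def check_metadata(selection: str, title: str, description: str) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not SELECTION_RE.fullmatch(selection):
        problems.append("selection may contain only alphanumeric characters and dashes")
    if len(title) > TITLE_MAX:
        problems.append(f"title has {len(title)} characters, limit is {TITLE_MAX}")
    if len(description) > DESCRIPTION_MAX:
        problems.append(f"description has {len(description)} characters, limit is {DESCRIPTION_MAX}")
    return problems


if __name__ == "__main__":
    selection = "al-shir-wa-dawawinu"
    # Prefix follows the recipe names used earlier in the thread (shamela.ws_ar_34).
    print("ZIM name:", f"shamela.ws_ar_{selection}")
    # The title is the category name given in the thread; the description is a placeholder.
    print(check_metadata(selection, "الشعر ودواوينه", "placeholder description"))
```

Arabic text with diacritics may count differently depending on how characters are measured, so a proposal close to a limit is worth checking by hand.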

@hamoudak

hamoudak commented Oct 6, 2024

You actually made it very good and complete. I downloaded it from the farm before you posted this comment. Everything works as intended. As for the things you will hide, I don't know much about that, but the ZIM file as it is shown on the website is good enough.

Having been on the website for a long time, I have noticed that they add a few books: about 7 books in six weeks. Their uploading of new books is irregular; one time I saw just two new books after nearly two months. They don't add many, but they are working hard to add new ones.

Now I think it will be better if you leave the categories without numbers; their names would define them. I am going to work on the "metadata" tomorrow; I'll update you with anything I see. It's my pleasure to help you as much as I can.

Note: I find the first ZIM better than the updated one; it's not vital like the original and I hope to keep it as it was for offline use.

@benoit74
Contributor

benoit74 commented Oct 7, 2024

> Having been on the website for a long time, I have noticed that they add a few books: about 7 books in six weeks. Their uploading of new books is irregular; one time I saw just two new books after nearly two months. They don't add many, but they are working hard to add new ones.

OK, thank you. Updating the ZIM once per semester is hence probably a bare minimum. But I would like to avoid creating multiple ZIMs in parallel so as not to overwhelm their server, so I doubt we can update once per quarter, or it would mean almost always having a ZIM update running.

> Now I think it will be better if you leave the categories without numbers; their names would define them. I am going to work on the "metadata" tomorrow; I'll update you with anything I see. It's my pleasure to help you as much as I can.

Thank you!

> Note: I find the first ZIM better than the updated one; it's not vital like the original and I hope to keep it as it was for offline use.

I'm not sure I get you here. Do you mean that the links I hid in the latest version made the ZIM worse? I'm very interested in your feedback on this; we have always considered that it is better to hide as many "broken" external links as possible, since they usually do not work in an offline scenario and we have assumed so far that they only bring frustration.

> I will also retrieve all main book pages to count the number of links per category and have an estimate of total time per ZIM.

On this estimate: as expected, category 6 is the biggest one, with approximately 1.1 million links. Given the speed of the last task with 4 parallel workers, it should complete in about 10 days, which is OK.
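
The 10-day figure can be reproduced from the numbers given earlier in the thread (7 hours for 7966 links in the first category 34 run). The sketch below assumes that the per-link time from that first run carries over to category 6 and that 4 workers give an ideal 4x speed-up; neither assumption is confirmed in the thread, and the function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86_400


def seconds_per_link(hours: float, links: int) -> float:
    """Average wall-clock seconds spent per explored link."""
    return hours * SECONDS_PER_HOUR / links


def estimated_days(links: int, secs_per_link: float, workers: int) -> float:
    """Assumes ideal linear speed-up across workers (an optimistic simplification)."""
    return links * secs_per_link / workers / SECONDS_PER_DAY


rate = seconds_per_link(hours=7, links=7966)   # first run of category 34
days = estimated_days(links=1_100_000, secs_per_link=rate, workers=4)
print(f"average: {rate:.2f} s per link")
print(f"category 6 estimate: {days:.1f} days with 4 workers")
```

With these inputs the average comes out slightly above 3 seconds per link and the category 6 estimate at roughly 10 days, matching the figures quoted in the thread.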

@hamoudak

hamoudak commented Oct 7, 2024

