Zimit does not really support setting the --collection CLI argument #252
Comments
This is not expected, but at the same time we have zimit tasks which started and succeeded today in our infrastructure. Could you please:
I will have a look into it as well, but if the worker parameter is not the problem, and since we do not have some of your settings, I'm afraid it might not be easy to reproduce. In that case, I would suggest that you try on your side to reproduce the issue with a website you are open to disclosing in a GitHub issue.
Thanks for your quick response, Benoit. Version: docker run ghcr.io/openzim/zimit zimit --version (base) Testing warc2zim args Is that the version you asked for? For the website, it happened several times; one I'm sure of is when crawling https://reference.wolfram.com/language/ I just relaunched a crawl with only one worker and I'll let you know when it finishes.
Thank you for these first details. The version I'm looking for is
But given your confirmation of the command you are using, you quite certainly use the Note that for tests you might as well use the |
Problem is that you set the |
--collection
CLI argument
Real issue is that for now zimit expects the collected WARC files to be located under {temp_root_dir}/collections/crawl-*/archive/ while when the We should either:
And we should display a nicer error when no WARCs are found, stating that this is the issue, and search for WARCs everywhere in the tmp folder and display them - but not process them - so that it will be easier to identify the issue without needing to debug.
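A minimal sketch of what that nicer error could look like, assuming only the layout quoted above ({temp_root_dir}/collections/crawl-*/archive/). This is not zimit's actual code: `find_warc_dir` and `NoWarcFoundError` are hypothetical names, and picking the last match in sort order simply mirrors the `[-1]` seen in the traceback.

```python
from pathlib import Path


class NoWarcFoundError(Exception):
    """Hypothetical error raised when the expected WARC directory is missing."""


def find_warc_dir(temp_root_dir: Path) -> Path:
    """Return the last crawl archive directory, or fail with a readable message."""
    # Expected layout, as quoted in the comment above.
    candidates = sorted(
        p for p in temp_root_dir.glob("collections/crawl-*/archive") if p.is_dir()
    )
    if candidates:
        return candidates[-1]

    # Nothing at the expected location: look for WARC files anywhere in the
    # temporary folder and list them for diagnosis, without processing them.
    stray = sorted(
        p
        for p in temp_root_dir.rglob("*")
        if p.is_file() and p.name.endswith((".warc", ".warc.gz"))
    )
    lines = [
        f"No WARC directory found under {temp_root_dir}/collections/crawl-*/archive"
    ]
    if stray:
        lines.append("WARC files found elsewhere (not processed):")
        lines.extend(f"  {p}" for p in stray)
    else:
        lines.append("No WARC files found anywhere under the temporary folder.")
    raise NoWarcFoundError("\n".join(lines))
```

With something along these lines, an empty match would produce a message listing where the WARCs actually ended up, instead of a bare "IndexError: list index out of range".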
Thanks a lot Benoit, you're very helpful. The explanation seems reasonable to me; I will try it as soon as I can and confirm if it works.
As discussed today in Kiwix weekly with @kelson42, we will:
Hello,
I'm trying to get my hands on zimit. I've tried a few times, always ending the same way.
Here are the parameters I use:
docker run -v /output:/output --shm-size=1gb ghcr.io/openzim/zimit zimit --url https://[Website to crawl]/ --name [name for the file to be generated] --title "Title of the zim file" --description "short description" --long-description "Long description" --collection "tag" --zim-lang en --workers 20 --waitUntil domcontentloaded --scopeType prefix
The crawl runs fine until I get the message "Exiting, crawl status: done", and just after that the program crashes with an "IndexError: list index out of range" message.
The full trace reads:
File "/usr/bin/zimit", line 546, in <module>
zimit()
File "/usr/bin/zimit", line 440, in zimit
warc_files = list(temp_root_dir.rglob("collections/crawl-*/archive/"))[-1]
IndexError: list index out of range
Any idea why?
Feel free to ask if you need more info
Have a nice day