
Zimit does not really support setting the --collection CLI argument #252

Closed
dee-sea opened this issue Nov 17, 2023 · 7 comments · Fixed by #254

@dee-sea

dee-sea commented Nov 17, 2023

Hello,

I'm trying to get my hands on zimit. I've tried a few times, always ending the same way.

Here are the parameters I use:
docker run -v /output:/output --shm-size=1gb ghcr.io/openzim/zimit zimit --url https://[Website to crawl]/ --name [name for the file to be generated] --title "Title of the zim file" --description "short description" --long-description "Long description" --collection "tag" --zim-lang en --workers 20 --waitUntil domcontentloaded --scopeType prefix

The crawl runs fine until I get a message "Exiting, crawl status: done", and just after that the program crashes with an "IndexError: list index out of range" message.

The full trace reads:
  File "/usr/bin/zimit", line 546, in <module>
    zimit()
  File "/usr/bin/zimit", line 440, in zimit
    warc_files = list(temp_root_dir.rglob("collections/crawl-*/archive/"))[-1]
IndexError: list index out of range

Any idea why?

Feel free to ask if you need more info

Have a nice day

@dee-sea dee-sea changed the title from "Getting a 'IndexError: list index out of range' when crawlind" to "Getting a 'IndexError: list index out of range' when crawling" Nov 17, 2023
@kelson42 kelson42 added this to the 1.7.0 milestone Nov 17, 2023
@benoit74
Collaborator

This is not expected, but at the same time we have zimit tasks which started and succeeded today in our infrastructure.

Could you please:

  • confirm which zimit version you are using? Unfortunately this is not displayed in the logs; if you don't know, just report the Browsertrix-Crawler version, it will be sufficient to infer the zimit version
  • test with only one worker (this is the most obvious difference I see compared to what we are running)

I will have a look into it as well, but if the worker parameter is not the problem, then since we do not have some of your settings, I'm afraid it might not be easy to reproduce. In that case, I would suggest that you try on your side to reproduce the issue with a website you are open to disclosing in a GitHub issue.

@dee-sea
Author

dee-sea commented Nov 17, 2023

Thanks for your quick response Benoit.

Version: docker run ghcr.io/openzim/zimit zimit --version

Testing warc2zim args
Running: warc2zim --version --output /output
1.5.4

Is that the version you asked for?

For the website, it happened several times; one I'm sure of is when crawling https://reference.wolfram.com/language/

And I just relaunched a crawl with only one worker and I'll let you know when it finishes.

@benoit74
Collaborator

Thank you for these first details.

The version I'm looking for is the Browsertrix-Crawler one; if you search for it at the beginning of the log you will find something like this:

{"timestamp":"2023-11-17T15:16:54.113Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 0.12.3 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}

But given your confirmation of the command you are using, you are quite certainly using the latest version of the zimit Docker image, which is currently 1.6.2.

Note that for tests you might as well use the --limit setting to limit the crawling to only a few pages (e.g. 2, 10, ...); the ZIM will still be created with the crawled pages.
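For example, a quick test run could look like this (just a sketch reusing your command: the --name, --title and --description values are placeholders, and the URL is the Wolfram one you mentioned):

docker run -v /output:/output --shm-size=1gb ghcr.io/openzim/zimit zimit --url https://reference.wolfram.com/language/ --name wolfram-test --title "Test" --description "Test crawl" --limit 10 --workers 1 --waitUntil domcontentloaded --scopeType prefix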

@benoit74
Collaborator

The problem is that you set the --collection flag. Is there a reason you need this? Otherwise just remove it; it conflicts with zimit/warc2zim expectations (it shouldn't, I will have a look into it).
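For reference, that would be your original command with just the --collection "tag" part dropped (everything else kept as in your first comment, placeholders included):

docker run -v /output:/output --shm-size=1gb ghcr.io/openzim/zimit zimit --url https://[Website to crawl]/ --name [name for the file to be generated] --title "Title of the zim file" --description "short description" --long-description "Long description" --zim-lang en --workers 20 --waitUntil domcontentloaded --scopeType prefix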

@benoit74 benoit74 changed the title from "Getting a 'IndexError: list index out of range' when crawling" to "Zimit does not really support setting the --collection CLI argument" Nov 18, 2023
@benoit74
Collaborator

The real issue is that for now zimit expects the collected WARC files to be located under {temp_root_dir}/collections/crawl-*/archive/, while when the --collection setting is used they are placed under {temp_root_dir}/collections/{collection}/archive/.

We should either:

  • remove the support for the --collection setting (current behavior proves that it has never been used)
  • or look for all archives under the collections folder (no matter the folder name); this is very permissive but would avoid other issues should Browsertrix decide to change the default folder name when --collection is not set (see the rough sketch after this list)
  • or use the --collection setting value (when set) to look in the right directory

And we should display a nicer error when no WARCs are found, stating that this is the issue, plus search for WARCs everywhere in the tmp folder and display them (but not process them), so that it will be easier to identify the issue without needing to debug.
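To illustrate the second option, here is a rough sketch (not zimit's actual code; the function name and error message are made up, only the directory layout, the rglob pattern and the nicer-error idea come from the points above):

from pathlib import Path

def find_warc_directory(temp_root_dir: Path) -> Path:
    """Permissive lookup: accept any collection name, not only crawl-*."""
    # Matches collections/<any collection>/archive, so it keeps working
    # when --collection changes the folder name.
    archive_dirs = sorted(temp_root_dir.rglob("collections/*/archive"))
    if not archive_dirs:
        # Nicer error than the bare IndexError from list(...)[-1]:
        # list WARCs found anywhere in the temp folder to help debugging,
        # but do not process them.
        stray_warcs = [str(p) for p in temp_root_dir.rglob("*.warc*")]
        raise RuntimeError(
            "No collections/*/archive folder found under "
            f"{temp_root_dir}; WARC files seen elsewhere: {stray_warcs}"
        )
    return archive_dirs[-1]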

@dee-sea
Author

dee-sea commented Nov 18, 2023

Thanks a lot Benoit, you're very helpful.

The explanation seems reasonable to me, I will try it as soon as I can and confirm whether it works.

@benoit74
Collaborator

As discussed today in the Kiwix weekly with @kelson42, we will:

  • check that it is possible to set the WARCs location in zimit (probably doable via tmp-dir; to check whether this is also used for something else which might need to be created somewhere else)
  • if the check is OK, remove the --collection zimit arg since it has not been used so far (it can't have worked if set) and it is not really useful from an end-user perspective
