Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to output stats file live, i.e. after each page crawled #374

Merged
merged 3 commits into from
Sep 14, 2023

Conversation

benoit74
Copy link
Contributor

Fix #358

Changes

  • new CLI parameter liveStatsFile
  • when CLI parameter is True, stats file is updated after each page crawled

To be discussed

  • do we really need this new CLI parameter ? This is has been done so to ensure backward compatibility but we could also revert to previous crawler behavior where if --statsFilename is set, the stats file is updated after each page crawled. As mentioned in the issue, it is not clear for us @kiwix why this behavior has been changed

@benoit74
Copy link
Contributor Author

@Chickensoupwithrice I hope you did not already started to work on this one, I just realized you assign the issue to yourself, sorry if this is duplicate work

@benoit74
Copy link
Contributor Author

FYI, this has been manually tested locally with success:

--statsFilename --liveStatsFile result
not set not set OK, stats file is not written
set not set OK, stats file is written only once crawling is done
set set OK, stats file is written after each page crawled
not set set OK, stats file is not written

@Chickensoupwithrice
Copy link
Contributor

I had not started work on this, so it's not duplicate! Looks good, will test locally and review! :)

@ikreymer
Copy link
Member

ikreymer commented Sep 13, 2023

do we really need this new CLI parameter ? This is has been done so to ensure backward compatibility but we could also revert to previous crawler behavior where if --statsFilename is set, the stats file is updated after each page crawled. As mentioned in the issue, it is not clear for us https://github.com/kiwix why this behavior has been changed

Hi @benoit74 thanks for the PR and tracking this down! You are right, this was an accidental regression, as we don't use the write stats after every page feature ourselves as much and don't have a proper test for it.
I think we can revert to the previous functionality with --statsFilename and just remove the toFile flag there.
If you have a chance to add a test for this, that would be even better, otherwise just reverting to previous functionality should work. Thanks again!

@benoit74
Copy link
Contributor Author

benoit74 commented Sep 13, 2023 via email

@benoit74
Copy link
Contributor Author

Code is ready, I added a test which checks that the stat file is written and contains proper data.

Unfortunately I wasn't able to write a test that checks that stat file is written after only one page crawled (and not only at the end), I have no idea about how to "pause" the crawl after only some page is crawled (and do not consider that crawl is finished), so that one can check that stat file is there.

While running the tests I also realized that there was a small inversion between expect and test values in extra_hop_depth test suite so I fixed this (test was failing because I did not realized that I must clean the test-crawls folder before starting Jest.

Btw, I don't know if this is intentional or not, and I don't want to be pushy on this, it took us a while to decide on this point for our repos, but your CI workflow is configured to trigger only on "push" events, meaning that it does not trigger on PRs like this one where the push is made on a fork. If you would like to fix this, you could have a look at our convention at https://github.com/openzim/_python-bootstrap/blob/main/.github/workflows/QA.yaml ; workflow is triggered on PRs (because it is usually mandatory in our process) and push on main branch (to "catch them all").

@ikreymer
Copy link
Member

Thanks! Yes, we can enable CI on PR as well, think it was also an oversight on this repo.

For testing, what you could do is use exec instead of execSync, and check the contents of the file to see if it changes while a small crawl is running. If the file changes at least once, while crawl is running, we know its being updated..

We can merge this now, if you'd like additional testing, feel free to open another PR with improved test. (This feature is really only used by Kiwix so up to you if you want more extensive testing but we're having to accept it).

@ikreymer ikreymer merged commit d72443c into webrecorder:main Sep 14, 2023
@benoit74
Copy link
Contributor Author

Great, thank you !
Do you have any idea when the next release including this change will happen? (no real hurry, it is more a planning question for us)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stats file cannot be updated at each page crawled
3 participants