Add option to output stats file live, i.e. after each page crawled #374
@Chickensoupwithrice I hope you had not already started working on this one. I just realized you had assigned the issue to yourself; sorry if this is duplicate work.
FYI, this has been manually tested locally with success:
I had not started work on this, so it's not a duplicate! Looks good, will test locally and review! :)
Hi @benoit74, thanks for the PR and for tracking this down! You are right, this was an accidental regression: we don't use the write-stats-after-every-page feature much ourselves, and we don't have a proper test for it.
Thank you for the feedback, I will do it tomorrow (remove the `toFile` flag, revert the additional CLI flag and add a test).
c17b50c to 9554c0f
The code is ready. I added a test which checks that the stats file is written and contains proper data. Unfortunately, I wasn't able to write a test that checks that the stats file is written after only one page has been crawled (and not only at the end): I have no idea how to "pause" the crawl after only some pages are crawled (without the crawl being considered finished) so that one can check that the stats file is there.
While running the tests I also realized that there was a small inversion between the expected and tested values in …
Btw, I don't know if this is intentional or not, and I don't want to be pushy about it (it took us a while to decide on this point for our repos), but your CI workflow is configured to trigger only on "push" events, meaning that it does not trigger on PRs like this one, where the push is made on a fork. If you would like to fix this, you could have a look at our convention at https://github.com/openzim/_python-bootstrap/blob/main/.github/workflows/QA.yaml. There, the workflow is triggered on PRs (because that is usually mandatory in our process) and on pushes to the main branch (to "catch them all").
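One possible way to check for the stats file while a crawl is still running is to poll for it instead of pausing the crawl. The following is only a sketch under assumptions: `waitForFile` is a hypothetical helper, not part of this repository or of any test library, and the timings are arbitrary.

```typescript
import * as fs from "node:fs";

// Hypothetical helper: poll until `filePath` exists or `timeoutMs` elapses.
// Resolves to true if the file appeared in time, false otherwise.
async function waitForFile(
  filePath: string,
  timeoutMs: number,
  intervalMs = 50,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (fs.existsSync(filePath)) {
      return true;
    }
    // Sleep briefly before checking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  // One last check in case the file appeared during the final sleep.
  return fs.existsSync(filePath);
}
```

A test could start the crawl without waiting for it to finish, await `waitForFile` on the stats path, and then confirm that the crawl process is still running at that moment. Such a test would depend on timing, so it may be flaky on very short crawls.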
Thanks! Yes, we can enable CI on PRs as well; I think that was also an oversight on this repo. For testing, what you could do is use … We can merge this now; if you'd like additional testing, feel free to open another PR with an improved test. (This feature is really only used by Kiwix, so it's up to you if you want more extensive testing, but we're happy to accept it.)
Great, thank you!
Fix #358
Changes
`liveStatsFile`
To be discussed
`--statsFilename` is set, the stats file is updated after each page crawled. As mentioned in the issue, it is not clear to us at @kiwix why this behavior was changed.
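The behavior described here (rewrite the stats file after every page whenever a stats filename is given) could look roughly like the sketch below. This is an illustration only and not the crawler's actual code: `CrawlStats`, `writeStats`, `crawlPages` and the field names are all hypothetical.

```typescript
import * as fs from "node:fs";

// Hypothetical shape of the crawl statistics; the real fields may differ.
interface CrawlStats {
  crawled: number;
  total: number;
  pending: number;
}

// Write the stats as JSON if a stats file was requested; do nothing otherwise.
// Returns true when a file was written.
function writeStats(stats: CrawlStats, statsFilename?: string): boolean {
  if (!statsFilename) {
    return false;
  }
  fs.writeFileSync(statsFilename, JSON.stringify(stats, null, 2));
  return true;
}

// Simulated crawl loop: the stats file is refreshed after every page,
// not only once the whole crawl has finished.
function crawlPages(
  urls: string[],
  statsFilename?: string,
  onPageDone?: (url: string) => void,
): CrawlStats {
  const stats: CrawlStats = {
    crawled: 0,
    total: urls.length,
    pending: urls.length,
  };
  for (const url of urls) {
    // ... the page would be loaded and archived here ...
    stats.crawled += 1;
    stats.pending -= 1;
    writeStats(stats, statsFilename);
    // Optional hook, useful for observing the file in the middle of a crawl.
    if (onPageDone) {
      onPageDone(url);
    }
  }
  return stats;
}
```

With this shape, no extra flag is needed: passing a stats filename is enough to get live updates, and omitting it disables writing entirely.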