Performance issue linked to new "extensible" HTML rewriting rules #370

Closed
benoit74 opened this issue Aug 5, 2024 · 0 comments · Fixed by #375

benoit74 commented Aug 5, 2024

For a very small WARC such as https://github.com/openzim/warc2zim/blob/main/tests/data-special/qsl.net-encoding-alias.warc.gz, building the ZIM takes more than 2 minutes.

A flamegraph shows that most of the time is spent in rewrite_html (expected, since the HTML page in this WARC is huge), but within it most of the time is spent in the inspect.signature function.

[Image: qsl_flame (flamegraph before caching)]

This signature information should be cached, since it does not change during a warc2zim execution.

A quick change (to be confirmed in a PR) confirms that caching this information brings timings back to a reasonable level (less than 20 seconds, with a lot of the time spent parsing the HTML, which is expected since the HTML is huge).

[Image: qsl_flame_cached (flamegraph with signature caching)]
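
For illustration, here is a minimal sketch of the caching idea, not the actual warc2zim change (which is not shown in this issue). It assumes the rewriting rules are plain functions whose accepted arguments are discovered via inspect.signature; the names cached_signature, call_with_supported_args and drop_attr are hypothetical.

```python
import inspect
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_signature(func):
    """Return inspect.signature(func), computed only once per function object."""
    return inspect.signature(func)


def call_with_supported_args(func, **available):
    """Call func with only the keyword arguments its signature declares."""
    params = cached_signature(func).parameters
    supported = {name: value for name, value in available.items() if name in params}
    return func(**supported)


# Hypothetical rewriting rule: it only declares the arguments it needs.
def drop_attr(attr_name, attr_value):
    return None if attr_name == "integrity" else attr_value
```

With this shape, a huge HTML page that triggers the same rule thousands of times pays the inspect.signature cost once per rule instead of once per call. Functions are hashable, so functools.lru_cache can key on the function object directly.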

@benoit74 benoit74 added the bug Something isn't working label Aug 5, 2024
@benoit74 benoit74 added this to the 2.1.0 milestone Aug 5, 2024
@benoit74 benoit74 self-assigned this Aug 5, 2024