Block Parser: Explore a streaming lazy interface #5705
Draft
Augmented, but not entirely replaced, by the Unified Block Parser in #6381.
Alternatively provided by the block-delimiter-finder in #6760.
For a 3 MB document which took 5 seconds and 14 GB of memory to parse, this version of the parser finished in 27 ms using 20 MB.
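The parser itself is PHP, but the core idea of a streaming lazy interface can be sketched in Python (all names here are illustrative, not the actual Gutenberg API): instead of materializing the full list of block delimiters up front, the parser yields one delimiter at a time, so the caller can stop early and peak memory stays bounded regardless of document size.

```python
import re
from typing import Iterator, Tuple

# Hypothetical sketch: Gutenberg blocks are fenced by HTML comments
# such as <!-- wp:paragraph --> ... <!-- /wp:paragraph -->.
DELIMITER = re.compile(
    r'<!--\s+/?wp:([a-z][a-z0-9_-]*(?:/[a-z][a-z0-9_-]*)?)\s+-->'
)

def stream_delimiters(html: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (block_name, start, end) for each delimiter, one at a time.

    Because this is a generator, no list of all matches is ever built;
    the caller pulls delimiters on demand and may stop at any point.
    """
    pos = 0
    while True:
        match = DELIMITER.search(html, pos)
        if match is None:
            return
        yield match.group(1), match.start(), match.end()
        pos = match.end()

doc = '<!-- wp:paragraph --><p>Hi</p><!-- /wp:paragraph -->'
first = next(stream_delimiters(doc))  # only the first delimiter is scanned
```

The lazy shape is what keeps memory flat: work done is proportional to how far the consumer reads, not to the size of the post.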
Initial testing
This version is slower for the home page render of `twentytwentyfour`. While unfortunate, this is not entirely surprising, as this work was designed to fix the catastrophically bad cases. In those catastrophic cases, however, it's wildly better than `trunk`. The following was tested with a 15 KB / 400-line chunk of the 3 MB post mentioned above.

The algorithm has wild complexity, too. For the same post, including only the first 599 lines (just 23 KB of HTML), `trunk` consumes 520 MB of memory while this branch consumes only 10 MB. With no more than 15 samples the data is extremely significant.

Testing Results
This may be slightly slower for a number of normal posts. For the home page render of `twentytwentyfour` it rendered 3.7 ms slower than `trunk`. However, for my catastrophically-broken test post, the impact of the lazy parsing is dramatic and significant after only a single request.

The lazy parser is still slow for really pathological cases, but unlike `trunk` it runs within a mostly bounded memory footprint. The more pathological the post, the more dramatic the improvement in both runtime and memory use becomes. Below is a chart comparing slices of my test file against both parsers; between each test run the database is reset. The number of lines reported is the count of how many of the original 3 MB document's lines were extracted as the test post.

Of particular note is that this lazy parser allows still more control over the performance threshold. Further expansion would allow setting a time limit, an upper bound on memory usage, and a content-length threshold after which the parser could pause and/or collapse the remainder of the post into a single unparsed block, essentially turning everything after the limit into a chunk of raw HTML (the static fallback render).
[Chart: runtime (`trunk` ms vs. branch ms) and memory (`trunk` MB vs. branch MB) across slices of the test post]

With `memory_limit=55G` on a 60 GB system I was unable to create the post via `wp_insert_post()`; it failed after some tens of minutes. On this branch the post inserted after a few seconds with a peak memory usage of 64 MB.
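The pause/collapse idea described above could look something like the following sketch (again Python rather than the real PHP, and `parse_with_budget`, the delimiter regex, and the block shape are all stand-ins, not the actual implementation): parse delimiters until a byte budget is exhausted, then fold everything remaining into one freeform raw-HTML block.

```python
import re

# Simplified delimiter pattern for illustration only.
DELIM = re.compile(r'<!--\s+/?wp:[a-z][a-z0-9/_-]*\s+-->')

def parse_with_budget(html: str, max_bytes: int) -> list:
    """Record block delimiters until max_bytes of input are consumed,
    then collapse the remainder into a single freeform (raw HTML) block.
    """
    blocks = []
    pos = 0
    for m in DELIM.finditer(html):
        if m.start() > max_bytes:
            break  # budget exhausted: stop parsing here
        blocks.append({'delimiter': m.group(0), 'offset': m.start()})
        pos = m.end()
    if pos < len(html):
        # Everything past the budget is kept as one unparsed chunk --
        # the "static fallback render" described above.
        blocks.append({'blockName': None, 'innerHTML': html[pos:]})
    return blocks
```

A time limit or memory ceiling would slot into the same loop: check the constraint before consuming the next delimiter, and bail out to the freeform fallback when it trips.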