Overview of Short-term Parsing Enhancements #8244

mcsf · 2018-07-27T05:53:33Z

One of the foundational pieces of this project, the block-based language of Gutenberg has so far been implemented in the system by a set of parsers generated from a PEG grammar describing the language.

Two motivating factors to back this emerged early on in the development cycle: i) a PEG grammar is easy to understand, review and maintain; ii) the parsers generated from it, one in JS and one in PHP, worked out of the box, and — with caveats — this proved adequate even as the project grew: for instance, the parsing stack easily accommodated block nesting.

This intelligibility and quasi-self-documenting character of the PEG grammar are what led to establishing it as the official specification of the language of Gutenberg in the form of its own package. This is an important step for the long-term success of Gutenberg as not just an editor, but a multi-environment ecosystem bound by the same language.

Opportunities

As Gutenberg braces for merging into core, all the tooling surrounding blocks needs to be as robust and versatile as any other part of WordPress. Among other things, this means that webmasters, plugin authors and platform engineers should have the power to hot-swap parsers to optimize for given requirements — whether requirements concern block analysis, hardware constraints, runtime, etc. Past experiments have confirmed that the PEG grammar is easily tweaked to output specific block information and can be optimized if told to ignore other aspects.

Furthermore, we have a duty to perform as well as possible. Our auto-generated parsers are known not to perform as well as other systems, a trade-off we made for clarity. Client-side parser performance has so far been satisfying, but the server-side parser is a known bottleneck and may not be suitable for demanding posts. The server-side issue was mitigated by altogether eschewing the PHP parser when rendering front-end pages in favor of a simpler parse, but the fact is that obtaining a full parse tree of a post on the server should be performant and robust.

Finally, Gutenberg blocks are to reach well beyond WP-Admin and base WordPress installations. Namely, any third-party clients should be able to work with Gutenberg blocks, from the mobile apps to any community tool. This requirement implies that the language of Gutenberg will need to be parsed in diverse environments and runtimes, well beyond the scope of the base Gutenberg project.

Thus, there is the opportunity to establish the language of Gutenberg as a de facto standard for semantic content formatting. Perhaps choosing parsers will be similar to choosing database systems: most site owners will go with the default, such as MySQL, but competing software abounds for those with different needs.

Parser tooling

Closes #6994.

Before we look at parsing itself, we need to be able to develop, validate and compare implementations.

Develop. Any party should be able to understand the specified language. For this, it should be clear that the PEG grammar is intended to be used as a reference. Tasks:

Offer PEG grammar and its parser as its own spec-parser package. Packages: Create new spec-parser package #7664
Auto-generate human-readable version of block grammar. Docs: Auto-generate human-readable version of Gutenberg block grammar #6116
Increase visibility of this reference, perhaps via a standalone documentation page.

Validate. Developers should be provided official tools that can, as best as possible, test for correctness of parsers. Tasks:

Develop a large and diverse corpus, comprising posts long and short, with many and few blocks, different kinds of content, written in different locales, and with documents full of various kinds of invalid blocks.
- @dmsnell has started this in his repository gutenberg-document-library.
Provide easy-to-use validators that test against the corpus.
- Add a JEST test to build and compare language-agnostic parsers. Parser: Add a JEST test to build and compare language agnostic parsers #6030

Compare. A parser doesn't just need to work, it likely needs to perform well. To help inform development (and instigate a healthy sense of competition?), it should be possible to compare parser performance side-by-side or multilaterally. Here, performance encompasses both space and time. Tasks:

Parser: Build system to compare alternative parser implementations. Parser: Build system to compare alternative parser implementations #6831

Ultimately, all tasks for the above paragraphs should converge towards providing:

A standalone reference Web page providing spec documentation and validation-comparison tools. Tentatively: https://wordpress.github.io/gutenberg/

Related open issues:

Check support for /u flag in installed PCRE library Check support for /u flag in installed PCRE library #4852
Verify parser doesn't cause PCRE fault with Subresource Integrity (SRI) Manager plugin and PHP 5.6.36 PCRE infinite recursion segmentation fault after activating gutenberg plugin #8671

Exploring new parsers

The past months have seen fascinating and encouraging developments in the development of competing parsers.

@Hywan developed a Rust parser that compiles to a number of different targets, and notably to those that core Gutenberg cares about: the server, in the form of a PHP extension; the client, via WebAssembly binary and an ASM.js fallback. Through use of the Hoa compiler, native PHP bindings are also available, including an experimental PCRE parser.
@dmsnell wrote a hand-coded parser Parser: Propose new hand-coded parser #8083, working with the PHP interpreter to seek optimal performance.
@pento asked the question of "if we start with the spec grammar, how much can we gain from fine-tuning it?" in Improve Parser Performance Improve Parser Performance #8044.

These experiments have been benchmarked using gutenberg-document-library, and results have been extremely encouraging.

to do: insert benchmark tables for CPU and memory

The text was updated successfully, but these errors were encountered:

danielbachhuber · 2018-08-07T21:10:06Z

Updated issue description to include:

Verify parser doesn't cause PCRE fault with Subresource Integrity (SRI) Manager plugin and PHP 5.6.36 #8671

Hywan · 2018-08-17T10:17:28Z

Thank you for the very detailed post.

The project I develop is the Rust Gutenberg parser. Before posting the benchmark results, let me quickly talk about the project.

Presentation

The goal of this parser is to provide a single implementation with multiple bindings to reach different environments and languages. Thus, we have only one codebase to maintain and to keep updated. This implementation is written in Rust, with the help of the nom library. It is one of the fastest parser (combinator) in the world, with a strong community, and is very popular (823k downloads at the time of writing).

Bindings are very light. For instance, the WebAssembly binding is only 300 LOC (with comments). The ASM.js binding is only 30 LOC (with comments). The C binding is free (no code to write). The PHP binding is 300 LOC (with comments). The NodeJS binding is 150 LOC (with comments). They are tiny surfaces of code to review.

As you can see, Rust can be embedded almost everywhere: On the Web through WASM and ASM.js, in PHP as an extension, in NodeJS as a native module or through WASM/ASM.js, in Java through FFI (as a POC now) etc. This is the motivation behind Rust: A single implementation usable in multiple environments and languages. Also Rust is very safe and fast, safest to any solution we came up with today. Finally, Rust memory usage is predictable and efficient, which also motivates the choice for this language.

Rust provides solid types. The parser then produces an AST that is a single type named Node. It is memory compact and strict.

Documentation and specification

Rust has rustdoc, a tool to test the documentation and all the examples, but also to publish the documentation as HTML files. Here is a (outdated) version of the Rust Gutenberg parser.

Here is the documentation for the root function, which is the entry point of the parser. We can see a description, an example, and a code.

I would like to point out that the grammar is specified with the EBNF format. So we have the specification here.

But more importantly, just like the root function, each rule of the parser is documented and contains an example, e.g. with the block rule. We are sure that this example always runs, and is always valid thanks to rustdoc. Thus it increases the onboarding for new users or contributors.

Tests

Since the parser is written with a parser combinator framework, each parser is unit testable. There are also integration tests. And finally, each bindings have its own unit and integration tests.

Bindings

As I said, there is a single implementation written in Rust, and then we embed this implementation into multiple environments and languages. So far, the following bindings are implemented, tested, documented and benched:

WebAssembly (as a binary): To use the parser mostly in a Web environment,
ASM.js (as a JavaScript code): As a fallback of the WebAssembly binary in a Web environment (for instance for Internet Explorer),
C (as a static library + a header file): To include the Rust parser in every C compatible project,
PHP (as a native extension): To use the parser in PHP natively,
NodeJS (as a native module): To use the parser in NodeJS natively (note that NodeJS supports WebAssembly and ASM.js too).

They have been extensively tested. The PHP extension has been checked against many tools (like valgrind) to verify there is no memory leaks (and there isn't).

Benchmarks

We use benchmarks as a factor of the success of this project. The problems that need to be addressed are:

The PEG.js parser is slow when dealing with large inputs (aka blog posts), like 2.5s to parse moby-dick-parsed.html,
The PEG.php parser is slow with medium inputs, and quickly hits the default memory limit, like 5.4s to parse moby-dick-parsed.html.

Benchmarking with `gutenberg-parser-comparator`

PEG.js parser vs. Rust parser as a WebAssembly binary

I used gutenberg-parser-comparator to compare the PEG.js parser against the WebAssembly binary in Firefox nightly:

file	Javascript parser (ms)	Rust parser as a WASM binary (ms)	speedup
`demo-post.html`	13.167	0.137	× 96
`shortcode-shortcomings.html`	26.784	0.225	× 119
`redesigning-chrome-desktop.html`	75.500	0.905	× 83
`web-at-maximum-fps.html`	88.118	0.698	× 126
`early-adopting-the-future.html`	201.011	0.927	× 217
`pygmalian-raw-html.html`	311.416	1.016	× 307
`moby-dick-parsed.html`	2,466.533	14.673	× 168

The WASM binary of the Rust parser is in average 159 times faster than
the actual Javascript implementation. The median of the speedup is 126.

PEG.js parser vs. Rust parser as an ASM.js module

I used gutenberg-parser-comparator to compare the PEG.js parser against the ASM.js fallback module (fallback of the WASM binary) in Firefox nightly:

file	Javascript parser (ms)	Rust parser as an ASM.js module (ms)	speedup
`demo-post.html`	15.368	2.718	× 6
`shortcode-shortcomings.html`	31.022	8.004	× 4
`redesigning-chrome-desktop.html`	106.416	19.223	× 6
`web-at-maximum-fps.html`	82.920	27.197	× 3
`early-adopting-the-future.html`	119.880	38.321	× 3
`pygmalian-raw-html.html`	349.075	23.656	× 15
`moby-dick-parsed.html`	2,543.750	361.423	× 7

The ASM.js module version of the Rust parser is in average 6 times faster than the actual Javascript implementation. The median of the speedup is 6.

Benchmarking with `benchmark.js`

I used benchmark.js to get high-resolution timers and statistically significant results.

WebAssembly

I used benchmark.js on the WebAssembly binary to check its own performance (with no comparison) in NodeJS latest stable:

file	throughput/time	number of runs sampled
autoclosing_block	118,970 ops/sec (0.008ms) ±1.58%	78
early_adopting_the_future	380 ops/sec (2.633ms) ±14.04%	54
gutenberg_demo	2,777 ops/sec (0.360ms) ±2.83%	74
moby_dick_parsed	37 ops/sec (22.292ms) ±5.32%	57
pygmalian_raw_html	700 ops/sec (1.492ms) ±1.33%	79
redesigning_chrome_desktop	370 ops/sec (2.705ms) ±18.71%	64
shortcode_shortcomings	1,680 ops/sec (0.595ms) ±2.57%	73
web_at_maximum_fps	681 ops/sec (1.469ms) ±5.50%	72

Benchmarking with `PHPBench`

I used PHPBench to get statistically signifiant results.

PHP native extension

I used PHPBench on the PHP native extension to check its own performance (with no comparison) in PHP 7.2 with other default extensions enabled:

params	revs	its	mem_peak	mean	mode	best	worst	rstdev
{"subject":"autoclosing-block"}	1000	10	472,240b	1.164μs	1.174μs	0.974μs	1.259μs	6.26%
{"subject":"early-adopting-the-future"}	1000	10	760,888b	305.396μs	303.148μs	296.700μs	328.691μs	2.81%
{"subject":"gutenberg-demo"}	1000	10	487,928b	58.186μs	57.660μs	55.948μs	63.120μs	3.20%
{"subject":"moby-dick-parsed"}	1000	10	5,826,088b	5,058.416μs	4,962.539μs	4,726.858μs	5,474.889μs	4.46%
{"subject":"pygmalian-raw-html"}	1000	10	925,256b	51.074μs	50.928μs	48.920μs	54.716μs	2.96%
{"subject":"redesigning-chrome-desktop"}	1000	10	790,712b	387.499μs	383.050μs	379.184μs	409.578μs	2.34%
{"subject":"shortcode-shortcomings"}	1000	10	532,488b	92.731μs	93.002μs	89.908μs	95.436μs	1.68%
{"subject":"web-at-maximum-fps"}	1000	10	686,784b	264.698μs	266.231μs	256.919μs	269.058μs	1.44%

Note that times are reported as µs (10e-6), not ms (10e-3). There is a difference of 3 order of magnitude.

Benchmarking with `Criterion`

I used Criterion to get statistically signifiant results.

Rust parser

Because the Rust parser needs to be benchmarked too!

subject	time (mean)
autoclosing_block	0.182µs
early_adopting_the_future	92.498µs
gutenberg_demo	15.513µs
moby_dick_parsed	1348.9µs
pygmalian_raw_html	34.027µs
redesigning_chrome_desktop	117.630µs
shortcode_shortcomings	27.605µs
web_at_maximum_fps	80.677µs

Note that times are reported as µs (10e-6). Also note that autoclosing_block is parsed in 0.182µs, so 182ns (yep, nanosecond 😛), and that moby_dick_parsed is parsed in 1.3ms.

This is roughly the C performance I guess, since it uses the same code.

Conclusion

The Rust parser provides a single, safe, and very fast implementation.

We have successfully embeded the Rust parser in multiple environments and languages through bindings, like WebAssembly, ASM.js, C, PHP, and NodeJS.

The WebAssembly binding is safer and faster than the PEG.js implementation. It shows a speedup of 159. The binding is written in 300 LOC.

A large post like moby-dick-parsed.html takes 2466ms to parse with PEG.js, while it takes 14ms to parse with WebAssembly.
A medium post like early-adopting-the-future.html takes 201ms to parse with PEG.js, while it takes 0.9ms to parse with WebAssembly.
A post with no Gutenberg blocks like pygmalian-raw-html.html takes 311ms to parse with PEG.js, while it takes 1ms to parse with WebAssembly.

The ASM.js binding is a fallback of the WebAssembly binary for environments that do not support WebAssembly. It is faster on all documents, and it shows a speedup of 6. The binding is written in 30 LOC.

The C binding is almost free. It is as fast as the Rust parser.

The PHP binding is safer, faster and radically less memory gourmand than the PEG.php parser. It shows a speedup of 5230. The binding is written in 300 LOC.

A large post like moby-dick-parsed.html takes 5437ms to parse with PEG.php, while it takes 5ms to parse with the PHP extension,
A medium post like early-adopting-the-future.html takes 280ms to parse with PEG.php, while it takes 0.2ms to parse with the PHP extension.
A post with no Gutenberg blocks like pygmalian-raw-html.html takes 377ms to parse with PEG.php, while it takes 0.05ms to parse with the PHP extension.

The NodeJS binding is working, but I didn't had time to benchmark it yet. The WebAssembly benchmark was done in NodeJS, and also in Firefox. We can hope that the NodeJS native module is faster than WebAssembly for now (because WebAssembly does not support SIMD, nor tail calls etc yet). So in all cases, it is faster and safer than the PEG.js parser, since the binding is written in Rust and does not involve any C code. The binding is written in 150 LOC.

We can conclude than this project addresses the limitations of memory usages and speed, and provides a neat way to maintain a single implementation that can be used in many environments almost for free compared to a situation where we have to re-implement the parser for every environment. Rust provides the tooling to write, to test and to publish the documentation, and so the specification of the Gutenberg block language, which ease the onboarding of new users or contributors.

Finally, the WebAssembly and ASM.js bindings can be distributed in a NPM package. It does not break the way we want to distribute Gutenberg packages. Rust is not required to use the NPM package, only the WASM binary (which is not platform dependent) and the ASM.js file are distributed as is. The PHP extension has already been installed in 2 environments (macOS and Debian) successfully. It follows the classical extension compilation and installation process. It is possible to automate everything with Pickle. Rust is required to install the PHP extension.

Thank you for your reading 🙂.

Edit: I've found another easy low-hanging fruit to optimise the WASM binding. Then I've updated the WASM and ASM.js benchmark results (resp. 73 to 86 times faster, and 4 to 6 times faster 🎉).

Edit 2: I've found another optimisation for the WASM binding. Then I've updated the WASM benchmark results, it is 159 times faster (instead of 86 times) 🎉 🎉!

dmsnell · 2018-08-25T17:02:58Z

@Hywan would be great to get this into a state where we can easily build and test the Rust-based parser and also to where we have a PR for Gutenberg that uses it.

@mcsf, @aduth, and @mtias I have updated #8083 and all tests are passing there. since I'm not familiar with the packing issues I'd appreciate some help from someone who does - maybe @gziolo or @youknowriad.

Performance-wise the hand-written parser seems to be performing well enough that we may be able to lift some demands that we have so-far placed on the project, such as not performing the parse on the server. For example, just as @Hywan used the moby-dick-parsed.html file as a test this parser on my laptop compares at 6.03s for the auto-generated parser and 16ms for the hand-written one. 16ms may have some room for improvement but clearly it's a different ballgame for server-side parsing than five or six seconds is.

There's parallel work going on to make the parse asynchronous but personally I'd love to see us get a "default" parser in sooner than that work so that we can get past the performance issues that have been creeping up. I wanted to setup some mechanism to load custom parsers in #8083 but I realized it would be too much for a single PR and wanted to wait until we had two actual choices in the project before making the code to pick one.

That is, I'd love to see us push out a parser that's at least ready for production performance issues, continue making the parse asynchronous in the meantime, add the code to allow overwriting the parse implementation via a plugin, and fill up the tooling to compare parsers.

Previously I pushed harder for the tooling but a couple things have changed since then:

the merge is coming soon
we have primitive "stupid" tooling available from the work I did in Parser: Build system to compare alternative parser implementations #6831
performance appears to be becoming a bigger issue, especially on larger more complicated documents

Hywan · 2018-08-29T12:03:42Z

@Hywan would be great to get this into a state where we can easily build and test the Rust-based parser and also to where we have a PR for Gutenberg that uses it.

Sure. I wait on #7970 to be merged before starting a PR. If someone wants to try the WASM binary, artefacts are attached to releases (https://github.com/Hywan/gutenberg-parser-rs/releases). I'm working on a Docker image, even if the justfile file guides the user to install all the required tools.

the moby-dick-parsed.html file as a test [for] this parser on my laptop […] 16ms for the hand-written one.

I think you're using Chrome to run the tests. I tried with Safari (because I don't have Chrome nor Chromium), and the moby-dick-parsed.html test runs in 19ms 😄.

dmsnell · 2018-08-29T17:42:59Z

I think you're using Chrome to run the tests. I tried with Safari (because I don't have Chrome nor Chromium), and the moby-dick-parsed.html test runs in 19ms 😄.

actually I reran the tests while I had the heat sink off of my CPU and it took 418ms but then when I chilled it with liquid nitrogen it only took 7ms 🙃

I'm working on a Docker image

awesome!

Hywan · 2018-09-04T10:59:55Z

Side note: I've found another optimisation for the WASM binding. The average speedup is now 159 instead of 86 🎉. And I achieved my goal with moby-dick-parsed.html which is parsed in 14ms, and pygmalian-raw-html in 1ms.

My detailed comment has been updated accordingly.

mcsf · 2019-11-21T21:16:14Z

This issue was about short-term enhancements, and it's been a year and a half. :)

Since then, we've made great strides both in this issue and in other spaces, even though some of the plans — e.g. a system to compare competing parsers, integrating asynchronous parsers — didn't come to fruition. And I think that's fine. It's a reflection of evolving priorities.

Ultimately, parsing has greatly improved thanks to all those involved. And for this I'll close this issue in favour of the broader #8352, which also encompasses parsing-related ideas such as the exciting #18414. Any new specific parser-related tasks can be tracked in new issues. Thank you, everyone.

pento mentioned this issue Jul 30, 2018

Pass content to render callback #5760

Closed

mtias mentioned this issue Aug 1, 2018

PHP APIs Overview #8352

Closed

16 tasks

mcsf mentioned this issue Aug 6, 2018

Ability to swap parsers #6994

Closed

mtias assigned mcsf and mtias Aug 6, 2018

danielbachhuber mentioned this issue Aug 7, 2018

PCRE infinite recursion segmentation fault after activating gutenberg plugin #8671

Closed

swissspidy mentioned this issue Oct 21, 2018

Extensive articles with over 1700 words and various blocks make Gutenberg very slow. #10418

Closed

mtias added the [Type] Overview Comprehensive, high level view of an area of focus often with multiple tracking issues label Nov 17, 2019

mcsf closed this as completed Nov 21, 2019

mcsf mentioned this issue May 30, 2023

Consider removing "Block Grammar" documentation page #51067

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overview of Short-term Parsing Enhancements #8244

Overview of Short-term Parsing Enhancements #8244

mcsf commented Jul 27, 2018 •

edited

Loading

danielbachhuber commented Aug 7, 2018

Hywan commented Aug 17, 2018 •

edited

Loading

dmsnell commented Aug 25, 2018

Hywan commented Aug 29, 2018

dmsnell commented Aug 29, 2018 •

edited

Loading

Hywan commented Sep 4, 2018

mcsf commented Nov 21, 2019

Overview of Short-term Parsing Enhancements #8244

Overview of Short-term Parsing Enhancements #8244

Comments

mcsf commented Jul 27, 2018 • edited Loading

Opportunities

Parser tooling

Exploring new parsers

danielbachhuber commented Aug 7, 2018

Hywan commented Aug 17, 2018 • edited Loading

Presentation

Documentation and specification

Tests

Bindings

Benchmarks

Benchmarking with gutenberg-parser-comparator

PEG.js parser vs. Rust parser as a WebAssembly binary

PEG.js parser vs. Rust parser as an ASM.js module

Benchmarking with benchmark.js

WebAssembly

Benchmarking with PHPBench

PHP native extension

Benchmarking with Criterion

Rust parser

Conclusion

dmsnell commented Aug 25, 2018

Hywan commented Aug 29, 2018

dmsnell commented Aug 29, 2018 • edited Loading

Hywan commented Sep 4, 2018

mcsf commented Nov 21, 2019

mcsf commented Jul 27, 2018 •

edited

Loading

Hywan commented Aug 17, 2018 •

edited

Loading

Benchmarking with `gutenberg-parser-comparator`

Benchmarking with `benchmark.js`

Benchmarking with `PHPBench`

Benchmarking with `Criterion`

dmsnell commented Aug 29, 2018 •

edited

Loading