Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overview of Short-term Parsing Enhancements #8244

Closed
5 of 11 tasks
mcsf opened this issue Jul 27, 2018 · 7 comments
Closed
5 of 11 tasks

Overview of Short-term Parsing Enhancements #8244

mcsf opened this issue Jul 27, 2018 · 7 comments
Assignees
Labels
[Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f Framework Issues related to broader framework topics, especially as it relates to javascript [Type] Overview Comprehensive, high level view of an area of focus often with multiple tracking issues [Type] Task Issues or PRs that have been broken down into an individual action to take

Comments

@mcsf
Copy link
Contributor

mcsf commented Jul 27, 2018

One of the foundational pieces of this project, the block-based language of Gutenberg has so far been implemented in the system by a set of parsers generated from a PEG grammar describing the language.

Two motivating factors to back this emerged early on in the development cycle: i) a PEG grammar is easy to understand, review and maintain; ii) the parsers generated from it, one in JS and one in PHP, worked out of the box, and — with caveats — this proved adequate even as the project grew: for instance, the parsing stack easily accommodated block nesting.

This intelligibility and quasi-self-documenting character of the PEG grammar are what led to establishing it as the official specification of the language of Gutenberg in the form of its own package. This is an important step for the long-term success of Gutenberg as not just an editor, but a multi-environment ecosystem bound by the same language.

Opportunities

As Gutenberg braces for merging into core, all the tooling surrounding blocks needs to be as robust and versatile as any other part of WordPress. Among other things, this means that webmasters, plugin authors and platform engineers should have the power to hot-swap parsers to optimize for given requirements — whether requirements concern block analysis, hardware constraints, runtime, etc. Past experiments have confirmed that the PEG grammar is easily tweaked to output specific block information and can be optimized if told to ignore other aspects.

Furthermore, we have a duty to perform as well as possible. Our auto-generated parsers are known not to perform as well as other systems, a trade-off we made for clarity. Client-side parser performance has so far been satisfying, but the server-side parser is a known bottleneck and may not be suitable for demanding posts. The server-side issue was mitigated by altogether eschewing the PHP parser when rendering front-end pages in favor of a simpler parse, but the fact is that obtaining a full parse tree of a post on the server should be performant and robust.

Finally, Gutenberg blocks are to reach well beyond WP-Admin and base WordPress installations. Namely, any third-party clients should be able to work with Gutenberg blocks, from the mobile apps to any community tool. This requirement implies that the language of Gutenberg will need to be parsed in diverse environments and runtimes, well beyond the scope of the base Gutenberg project.

Thus, there is the opportunity to establish the language of Gutenberg as a de facto standard for semantic content formatting. Perhaps choosing parsers will be similar to choosing database systems: most site owners will go with the default, such as MySQL, but competing software abounds for those with different needs.

Parser tooling

Closes #6994.

Before we look at parsing itself, we need to be able to develop, validate and compare implementations.

Develop. Any party should be able to understand the specified language. For this, it should be clear that the PEG grammar is intended to be used as a reference. Tasks:

Validate. Developers should be provided official tools that can, as best as possible, test for correctness of parsers. Tasks:

Compare. A parser doesn't just need to work, it likely needs to perform well. To help inform development (and instigate a healthy sense of competition?), it should be possible to compare parser performance side-by-side or multilaterally. Here, performance encompasses both space and time. Tasks:

Ultimately, all tasks for the above paragraphs should converge towards providing:

Related open issues:

Exploring new parsers

The past months have seen fascinating and encouraging developments in the development of competing parsers.

  • @Hywan developed a Rust parser that compiles to a number of different targets, and notably to those that core Gutenberg cares about: the server, in the form of a PHP extension; the client, via WebAssembly binary and an ASM.js fallback. Through use of the Hoa compiler, native PHP bindings are also available, including an experimental PCRE parser.
  • @dmsnell wrote a hand-coded parser Parser: Propose new hand-coded parser #8083, working with the PHP interpreter to seek optimal performance.
  • @pento asked the question of "if we start with the spec grammar, how much can we gain from fine-tuning it?" in Improve Parser Performance Improve Parser Performance #8044.

These experiments have been benchmarked using gutenberg-document-library, and results have been extremely encouraging.

to do: insert benchmark tables for CPU and memory

@mcsf mcsf added [Type] Task Issues or PRs that have been broken down into an individual action to take Framework Issues related to broader framework topics, especially as it relates to javascript [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f labels Jul 27, 2018
@mtias mtias mentioned this issue Aug 1, 2018
16 tasks
@mtias mtias assigned mcsf and mtias Aug 6, 2018
@danielbachhuber
Copy link
Member

Updated issue description to include:

Verify parser doesn't cause PCRE fault with Subresource Integrity (SRI) Manager plugin and PHP 5.6.36 #8671

@Hywan
Copy link

Hywan commented Aug 17, 2018

Thank you for the very detailed post.

The project I develop is the Rust Gutenberg parser. Before posting the benchmark results, let me quickly talk about the project.

Presentation

The goal of this parser is to provide a single implementation with multiple bindings to reach different environments and languages. Thus, we have only one codebase to maintain and to keep updated. This implementation is written in Rust, with the help of the nom library. It is one of the fastest parser (combinator) in the world, with a strong community, and is very popular (823k downloads at the time of writing).

Bindings are very light. For instance, the WebAssembly binding is only 300 LOC (with comments). The ASM.js binding is only 30 LOC (with comments). The C binding is free (no code to write). The PHP binding is 300 LOC (with comments). The NodeJS binding is 150 LOC (with comments). They are tiny surfaces of code to review.

As you can see, Rust can be embedded almost everywhere: On the Web through WASM and ASM.js, in PHP as an extension, in NodeJS as a native module or through WASM/ASM.js, in Java through FFI (as a POC now) etc. This is the motivation behind Rust: A single implementation usable in multiple environments and languages. Also Rust is very safe and fast, safest to any solution we came up with today. Finally, Rust memory usage is predictable and efficient, which also motivates the choice for this language.

Rust provides solid types. The parser then produces an AST that is a single type named Node. It is memory compact and strict.

Documentation and specification

Rust has rustdoc, a tool to test the documentation and all the examples, but also to publish the documentation as HTML files. Here is a (outdated) version of the Rust Gutenberg parser.

Here is the documentation for the root function, which is the entry point of the parser. We can see a description, an example, and a code.

I would like to point out that the grammar is specified with the EBNF format. So we have the specification here.

But more importantly, just like the root function, each rule of the parser is documented and contains an example, e.g. with the block rule. We are sure that this example always runs, and is always valid thanks to rustdoc. Thus it increases the onboarding for new users or contributors.

Tests

Since the parser is written with a parser combinator framework, each parser is unit testable. There are also integration tests. And finally, each bindings have its own unit and integration tests.

Bindings

As I said, there is a single implementation written in Rust, and then we embed this implementation into multiple environments and languages. So far, the following bindings are implemented, tested, documented and benched:

  • WebAssembly (as a binary): To use the parser mostly in a Web environment,
  • ASM.js (as a JavaScript code): As a fallback of the WebAssembly binary in a Web environment (for instance for Internet Explorer),
  • C (as a static library + a header file): To include the Rust parser in every C compatible project,
  • PHP (as a native extension): To use the parser in PHP natively,
  • NodeJS (as a native module): To use the parser in NodeJS natively (note that NodeJS supports WebAssembly and ASM.js too).

They have been extensively tested. The PHP extension has been checked against many tools (like valgrind) to verify there is no memory leaks (and there isn't).

Benchmarks

We use benchmarks as a factor of the success of this project. The problems that need to be addressed are:

  • The PEG.js parser is slow when dealing with large inputs (aka blog posts), like 2.5s to parse moby-dick-parsed.html,
  • The PEG.php parser is slow with medium inputs, and quickly hits the default memory limit, like 5.4s to parse moby-dick-parsed.html.

Benchmarking with gutenberg-parser-comparator

PEG.js parser vs. Rust parser as a WebAssembly binary

I used gutenberg-parser-comparator to compare the PEG.js parser against the WebAssembly binary in Firefox nightly:

file Javascript parser (ms) Rust parser as a WASM binary (ms) speedup
demo-post.html 13.167 0.137 × 96
shortcode-shortcomings.html 26.784 0.225 × 119
redesigning-chrome-desktop.html 75.500 0.905 × 83
web-at-maximum-fps.html 88.118 0.698 × 126
early-adopting-the-future.html 201.011 0.927 × 217
pygmalian-raw-html.html 311.416 1.016 × 307
moby-dick-parsed.html 2,466.533 14.673 × 168

The WASM binary of the Rust parser is in average 159 times faster than
the actual Javascript implementation. The median of the speedup is 126.

PEG.js parser vs. Rust parser as an ASM.js module

I used gutenberg-parser-comparator to compare the PEG.js parser against the ASM.js fallback module (fallback of the WASM binary) in Firefox nightly:

file Javascript parser (ms) Rust parser as an ASM.js module (ms) speedup
demo-post.html 15.368 2.718 × 6
shortcode-shortcomings.html 31.022 8.004 × 4
redesigning-chrome-desktop.html 106.416 19.223 × 6
web-at-maximum-fps.html 82.920 27.197 × 3
early-adopting-the-future.html 119.880 38.321 × 3
pygmalian-raw-html.html 349.075 23.656 × 15
moby-dick-parsed.html 2,543.750 361.423 × 7

The ASM.js module version of the Rust parser is in average 6 times faster than the actual Javascript implementation. The median of the speedup is 6.

Benchmarking with benchmark.js

I used benchmark.js to get high-resolution timers and statistically significant results.

WebAssembly

I used benchmark.js on the WebAssembly binary to check its own performance (with no comparison) in NodeJS latest stable:

file throughput/time number of runs sampled
autoclosing_block 118,970 ops/sec (0.008ms) ±1.58% 78
early_adopting_the_future 380 ops/sec (2.633ms) ±14.04% 54
gutenberg_demo 2,777 ops/sec (0.360ms) ±2.83% 74
moby_dick_parsed 37 ops/sec (22.292ms) ±5.32% 57
pygmalian_raw_html 700 ops/sec (1.492ms) ±1.33% 79
redesigning_chrome_desktop 370 ops/sec (2.705ms) ±18.71% 64
shortcode_shortcomings 1,680 ops/sec (0.595ms) ±2.57% 73
web_at_maximum_fps 681 ops/sec (1.469ms) ±5.50% 72

Benchmarking with PHPBench

I used PHPBench to get statistically signifiant results.

PHP native extension

I used PHPBench on the PHP native extension to check its own performance (with no comparison) in PHP 7.2 with other default extensions enabled:

params revs its mem_peak mean mode best worst rstdev
{"subject":"autoclosing-block"} 1000 10 472,240b 1.164μs 1.174μs 0.974μs 1.259μs 6.26%
{"subject":"early-adopting-the-future"} 1000 10 760,888b 305.396μs 303.148μs 296.700μs 328.691μs 2.81%
{"subject":"gutenberg-demo"} 1000 10 487,928b 58.186μs 57.660μs 55.948μs 63.120μs 3.20%
{"subject":"moby-dick-parsed"} 1000 10 5,826,088b 5,058.416μs 4,962.539μs 4,726.858μs 5,474.889μs 4.46%
{"subject":"pygmalian-raw-html"} 1000 10 925,256b 51.074μs 50.928μs 48.920μs 54.716μs 2.96%
{"subject":"redesigning-chrome-desktop"} 1000 10 790,712b 387.499μs 383.050μs 379.184μs 409.578μs 2.34%
{"subject":"shortcode-shortcomings"} 1000 10 532,488b 92.731μs 93.002μs 89.908μs 95.436μs 1.68%
{"subject":"web-at-maximum-fps"} 1000 10 686,784b 264.698μs 266.231μs 256.919μs 269.058μs 1.44%

Note that times are reported as µs (10e-6), not ms (10e-3). There is a difference of 3 order of magnitude.

Benchmarking with Criterion

I used Criterion to get statistically signifiant results.

Rust parser

Because the Rust parser needs to be benchmarked too!

subject time (mean)
autoclosing_block 0.182µs
early_adopting_the_future 92.498µs
gutenberg_demo 15.513µs
moby_dick_parsed 1348.9µs
pygmalian_raw_html 34.027µs
redesigning_chrome_desktop 117.630µs
shortcode_shortcomings 27.605µs
web_at_maximum_fps 80.677µs

Note that times are reported as µs (10e-6). Also note that autoclosing_block is parsed in 0.182µs, so 182ns (yep, nanosecond 😛), and that moby_dick_parsed is parsed in 1.3ms.

This is roughly the C performance I guess, since it uses the same code.

Conclusion

The Rust parser provides a single, safe, and very fast implementation.

We have successfully embeded the Rust parser in multiple environments and languages through bindings, like WebAssembly, ASM.js, C, PHP, and NodeJS.

The WebAssembly binding is safer and faster than the PEG.js implementation. It shows a speedup of 159. The binding is written in 300 LOC.

  • A large post like moby-dick-parsed.html takes 2466ms to parse with PEG.js, while it takes 14ms to parse with WebAssembly.
  • A medium post like early-adopting-the-future.html takes 201ms to parse with PEG.js, while it takes 0.9ms to parse with WebAssembly.
  • A post with no Gutenberg blocks like pygmalian-raw-html.html takes 311ms to parse with PEG.js, while it takes 1ms to parse with WebAssembly.

The ASM.js binding is a fallback of the WebAssembly binary for environments that do not support WebAssembly. It is faster on all documents, and it shows a speedup of 6. The binding is written in 30 LOC.

The C binding is almost free. It is as fast as the Rust parser.

The PHP binding is safer, faster and radically less memory gourmand than the PEG.php parser. It shows a speedup of 5230. The binding is written in 300 LOC.

  • A large post like moby-dick-parsed.html takes 5437ms to parse with PEG.php, while it takes 5ms to parse with the PHP extension,
  • A medium post like early-adopting-the-future.html takes 280ms to parse with PEG.php, while it takes 0.2ms to parse with the PHP extension.
  • A post with no Gutenberg blocks like pygmalian-raw-html.html takes 377ms to parse with PEG.php, while it takes 0.05ms to parse with the PHP extension.

The NodeJS binding is working, but I didn't had time to benchmark it yet. The WebAssembly benchmark was done in NodeJS, and also in Firefox. We can hope that the NodeJS native module is faster than WebAssembly for now (because WebAssembly does not support SIMD, nor tail calls etc yet). So in all cases, it is faster and safer than the PEG.js parser, since the binding is written in Rust and does not involve any C code. The binding is written in 150 LOC.

We can conclude than this project addresses the limitations of memory usages and speed, and provides a neat way to maintain a single implementation that can be used in many environments almost for free compared to a situation where we have to re-implement the parser for every environment. Rust provides the tooling to write, to test and to publish the documentation, and so the specification of the Gutenberg block language, which ease the onboarding of new users or contributors.

Finally, the WebAssembly and ASM.js bindings can be distributed in a NPM package. It does not break the way we want to distribute Gutenberg packages. Rust is not required to use the NPM package, only the WASM binary (which is not platform dependent) and the ASM.js file are distributed as is. The PHP extension has already been installed in 2 environments (macOS and Debian) successfully. It follows the classical extension compilation and installation process. It is possible to automate everything with Pickle. Rust is required to install the PHP extension.

Thank you for your reading 🙂.

Edit: I've found another easy low-hanging fruit to optimise the WASM binding. Then I've updated the WASM and ASM.js benchmark results (resp. 73 to 86 times faster, and 4 to 6 times faster 🎉).

Edit 2: I've found another optimisation for the WASM binding. Then I've updated the WASM benchmark results, it is 159 times faster (instead of 86 times) 🎉 🎉!

@dmsnell
Copy link
Member

dmsnell commented Aug 25, 2018

@Hywan would be great to get this into a state where we can easily build and test the Rust-based parser and also to where we have a PR for Gutenberg that uses it.

@mcsf, @aduth, and @mtias I have updated #8083 and all tests are passing there. since I'm not familiar with the packing issues I'd appreciate some help from someone who does - maybe @gziolo or @youknowriad.

Performance-wise the hand-written parser seems to be performing well enough that we may be able to lift some demands that we have so-far placed on the project, such as not performing the parse on the server. For example, just as @Hywan used the moby-dick-parsed.html file as a test this parser on my laptop compares at 6.03s for the auto-generated parser and 16ms for the hand-written one. 16ms may have some room for improvement but clearly it's a different ballgame for server-side parsing than five or six seconds is.

There's parallel work going on to make the parse asynchronous but personally I'd love to see us get a "default" parser in sooner than that work so that we can get past the performance issues that have been creeping up. I wanted to setup some mechanism to load custom parsers in #8083 but I realized it would be too much for a single PR and wanted to wait until we had two actual choices in the project before making the code to pick one.

That is, I'd love to see us push out a parser that's at least ready for production performance issues, continue making the parse asynchronous in the meantime, add the code to allow overwriting the parse implementation via a plugin, and fill up the tooling to compare parsers.

Previously I pushed harder for the tooling but a couple things have changed since then:

@Hywan
Copy link

Hywan commented Aug 29, 2018

@Hywan would be great to get this into a state where we can easily build and test the Rust-based parser and also to where we have a PR for Gutenberg that uses it.

Sure. I wait on #7970 to be merged before starting a PR. If someone wants to try the WASM binary, artefacts are attached to releases (https://github.com/Hywan/gutenberg-parser-rs/releases). I'm working on a Docker image, even if the justfile file guides the user to install all the required tools.

the moby-dick-parsed.html file as a test [for] this parser on my laptop […] 16ms for the hand-written one.

I think you're using Chrome to run the tests. I tried with Safari (because I don't have Chrome nor Chromium), and the moby-dick-parsed.html test runs in 19ms 😄.

@dmsnell
Copy link
Member

dmsnell commented Aug 29, 2018

I think you're using Chrome to run the tests. I tried with Safari (because I don't have Chrome nor Chromium), and the moby-dick-parsed.html test runs in 19ms 😄.

actually I reran the tests while I had the heat sink off of my CPU and it took 418ms but then when I chilled it with liquid nitrogen it only took 7ms 🙃

I'm working on a Docker image

awesome!

@Hywan
Copy link

Hywan commented Sep 4, 2018

Side note: I've found another optimisation for the WASM binding. The average speedup is now 159 instead of 86 🎉. And I achieved my goal with moby-dick-parsed.html which is parsed in 14ms, and pygmalian-raw-html in 1ms.

My detailed comment has been updated accordingly.

@mtias mtias added the [Type] Overview Comprehensive, high level view of an area of focus often with multiple tracking issues label Nov 17, 2019
@mcsf
Copy link
Contributor Author

mcsf commented Nov 21, 2019

This issue was about short-term enhancements, and it's been a year and a half. :)

Since then, we've made great strides both in this issue and in other spaces, even though some of the plans — e.g. a system to compare competing parsers, integrating asynchronous parsers — didn't come to fruition. And I think that's fine. It's a reflection of evolving priorities.

Ultimately, parsing has greatly improved thanks to all those involved. And for this I'll close this issue in favour of the broader #8352, which also encompasses parsing-related ideas such as the exciting #18414. Any new specific parser-related tasks can be tracked in new issues. Thank you, everyone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f Framework Issues related to broader framework topics, especially as it relates to javascript [Type] Overview Comprehensive, high level view of an area of focus often with multiple tracking issues [Type] Task Issues or PRs that have been broken down into an individual action to take
Projects
None yet
Development

No branches or pull requests

5 participants