
Validation: Detect block validity by tokenizing content #2210

Merged
merged 5 commits into master from update/block-validation on Aug 4, 2017

Conversation

@aduth (Member) commented Aug 4, 2017

Related: #2197, #2132, #1929

This pull request tries an alternate strategy for block validation. Previously we would compare a "beautified" copy of the block's markup from post content against the result of running it back through the block's serializer with the parsed attributes. The approach here instead tries to determine whether there is an effective difference between the two HTML strings by running each through an HTML tokenizer and comparing the token objects generated for each.

In general, this improves the leniency of the comparison, particularly around whitespace, self-closing tags, whitespace within tag names, class attribute value ordering, and style attribute values and value ordering.

Some false negatives can still occur, largely around whitespace, because whitespace handling in HTML is complicated.

The tokenizer is not a spec-compliant validator, nor do I think we need one. It is relatively light-weight (especially compared to other options like parse5), and can be extended to observe events while tokenizing. Future refactoring could introduce optimizations such as early bailing on mismatched tokens (i.e. fail early).
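
As a rough illustration of the idea only (the tokenizer import, token shape, and helper names below are assumptions, not the exact implementation in blocks/api/validation.js), the comparison boils down to something like:

```js
// Rough sketch only — assumes a tokenizer exposing `tokenize( html )` that
// returns flat token objects of the shape used by simple-html-tokenizer
// ({ type, tagName, attributes, chars }); the real validation code may differ.
import { tokenize } from 'simple-html-tokenizer';

// Collapse runs of whitespace so formatting-only differences are ignored.
const normalizeText = ( chars ) => chars.replace( /\s+/g, ' ' ).trim();

// Whitespace-only text tokens carry no meaning for the comparison.
const isRelevantToken = ( token ) =>
	token.type !== 'Chars' || normalizeText( token.chars ) !== '';

// Compare attribute sets ignoring order, and ignoring class name order.
function isEqualAttributes( attributesA = [], attributesB = [] ) {
	if ( attributesA.length !== attributesB.length ) {
		return false;
	}

	const normalize = ( [ name, value ] ) => [
		name,
		'class' === name ? value.trim().split( /\s+/ ).sort().join( ' ' ) : value,
	];

	const mapB = new Map( attributesB.map( normalize ) );

	return attributesA
		.map( normalize )
		.every( ( [ name, value ] ) => mapB.get( name ) === value );
}

function isEqualTokens( tokenA, tokenB ) {
	if ( tokenA.type !== tokenB.type ) {
		return false;
	}

	switch ( tokenA.type ) {
		case 'Chars':
			return normalizeText( tokenA.chars ) === normalizeText( tokenB.chars );
		case 'StartTag':
			return (
				tokenA.tagName === tokenB.tagName &&
				isEqualAttributes( tokenA.attributes, tokenB.attributes )
			);
		default:
			// Other token types (including closing tags) pass through
			// unchecked in this sketch.
			return true;
	}
}

// Two HTML strings are "effectively equal" when their relevant tokens match.
export function isEquivalentHTML( actual, expected ) {
	const [ tokensA, tokensB ] = [ actual, expected ].map( ( html ) =>
		tokenize( html ).filter( isRelevantToken )
	);

	return (
		tokensA.length === tokensB.length &&
		tokensA.every( ( token, i ) => isEqualTokens( token, tokensB[ i ] ) )
	);
}
```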

Testing instructions:

Verify that all blocks in demo content are valid.

Verify that modifying post-content.js to cause a block to become invalid (e.g. adding or removing a class or attribute) still causes the "Invalid Block" warning to appear. See also #1929.

Verify that some previously erroneously-invalid content is now considered valid:

Ensure unit tests pass:

npm test

Caveats:

Differences in content caused by removep (autop) behavior are still not accounted for. We may want to consider building this into the parser?

@aduth aduth added the [Feature] Blocks, Framework, and [Feature] Parsing labels Aug 4, 2017
@aduth aduth requested a review from nylen August 4, 2017 02:17
@aduth aduth added this to the Beta 0.7.0 milestone Aug 4, 2017
@youknowriad (Contributor) left a comment

Tested this branch.

The first two use cases are fixed.

For the third one, there are missing p tags; technically the HTML is different.

Let's ship it

@aduth (Member, Author) commented Aug 4, 2017

For the third one, there are missing p tags; technically the HTML is different.

Yep, noted as well in Caveats.

codecov bot commented Aug 4, 2017

Codecov Report

Merging #2210 into master will increase coverage by 0.79%.
The diff coverage is 94.33%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #2210      +/-   ##
=========================================
+ Coverage   23.11%   23.9%   +0.79%     
=========================================
  Files         141     142       +1     
  Lines        4405    4484      +79     
  Branches      747     763      +16     
=========================================
+ Hits         1018    1072      +54     
- Misses       2856    2877      +21     
- Partials      531     535       +4
Impacted Files Coverage Δ
blocks/api/parser.js 100% <ø> (+2.85%) ⬆️
blocks/api/validation.js 94.33% <94.33%> (ø)
editor/sidebar/post-taxonomies/tags-selector.js 0% <0%> (ø) ⬆️
editor/sidebar/post-visibility/index.js 0% <0%> (ø) ⬆️
editor/modes/visual-editor/block-list.js 0% <0%> (ø) ⬆️
editor/modes/visual-editor/block.js 0% <0%> (ø) ⬆️
editor/sidebar/table-of-contents/index.js 0% <0%> (ø) ⬆️
...tor/sidebar/post-taxonomies/categories-selector.js 0% <0%> (ø) ⬆️
editor/settings/provider.js 0% <0%> (ø) ⬆️
editor/block-mover/index.js 0% <0%> (ø) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad030a1...e54dd1c. Read the comment docs.

@aduth aduth merged commit af6b374 into master Aug 4, 2017
@aduth aduth deleted the update/block-validation branch August 4, 2017 16:43
@nylen nylen mentioned this pull request Aug 4, 2017
@dmsnell (Member) commented Aug 7, 2017

@aduth did you compare this against lexing the HTML tags in the parser?

@aduth (Member, Author) commented Aug 7, 2017

@dmsnell This was on my mind over the weekend. I think there's certainly some overlap in effort between the parser and this, and hopefully we can make some strides in simplifying the two. Your advice is welcome! (This is an area where I've generally let others with a better understanding of the fundamentals take the helm 😉)

Speaking strictly of the requirements of this implementation:

  • It needs to find a difference between inputs (actual post content, post content reserialized from parsed attributes)
    • Notably, a parse needs to have already occurred by the time this step is reached
  • It needs to be relatively forgiving (whitespace, attribute order, class order, style order)
    • We care only if differences have a meaningful impact on browser interpretation
  • Not implemented yet, but ideally it could fail fast (abandon validation as soon as it identifies an issue)

To the last point, it seemed that in order to achieve this, validation would need to occur very early in a parse, during lexical analysis / tokenization ("give me the next relevant token of each input, and let's compare"). My assumption was that we would not need the full syntax tree (children, balanced nodes), and that in fact this could be a hindrance to comparison (instead comparing tokens in a flat array). Although in retrospect, I'm curious whether there are cases that should fail validation but won't, because we don't perform any checks on the closing tag.
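
Roughly, the fail-fast version I have in mind would pull one token at a time from each input and bail on the first mismatch. Purely illustrative, since the tokenizer we use today is eager rather than streaming:

```js
import { tokenize } from 'simple-html-tokenizer';

// Illustrative only: wrap the (currently eager) tokenizer behind a generator
// so the comparison reads like a streaming, token-at-a-time check. A real
// fail-fast implementation would need an incremental tokenizer.
function* relevantTokens( html ) {
	for ( const token of tokenize( html ) ) {
		// Skip whitespace-only text tokens; they carry no meaning here.
		if ( token.type === 'Chars' && ! token.chars.trim() ) {
			continue;
		}

		yield token;
	}
}

// `isEqualTokens` is a pairwise comparison like the one sketched in the
// pull request description (type, tag name, attributes, normalized text).
export function isEquivalentHTML( actual, expected, isEqualTokens ) {
	const iteratorA = relevantTokens( actual );
	const iteratorB = relevantTokens( expected );

	while ( true ) {
		const a = iteratorA.next();
		const b = iteratorB.next();

		// Both streams exhausted together: no effective difference found.
		if ( a.done && b.done ) {
			return true;
		}

		// One stream ended early, or the next tokens differ: bail right away.
		if ( a.done !== b.done || ! isEqualTokens( a.value, b.value ) ) {
			return false;
		}
	}
}
```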

Do you envision any flaws with this approach? Any ways we could leverage the existing grammar to achieve these goals?

@dmsnell (Member) commented Aug 7, 2017

Well, the PEG parser doesn't have separate lexing and parsing stages, so it would all come together.

The original parser did this lexing, for this reason among others. I think if we build it there, we automatically get this notion of equivalence: syntax changes are irrelevant, while text nodes and attribute values constitute actual changes.

One of the other reasons I've wanted to put it back in for so long is that we can finally break out of text-only storage and choose to "serialize" by saving the JSON data directly in some other mechanism. For example, maybe we write a plugin to store content in Simperium, or the mobile apps don't want to mess with HTML.

Validation in the app can look at the structure directly and know that if we get an extra space, it was a space in a "text node" and not spacing between HTML tags, etc. Also, because of this lexing, all of the self-closing/self-closable tags can be normalized in the Gutenberg data structure and not introduce odd cases to validate, e.g., is "<br>" equal to "<br/>"?
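
To sketch the idea (the node shape here is invented for illustration, not an actual Gutenberg data structure): once content is parsed into structured data, "<br>" and "<br/>" produce the same node, and equivalence becomes a structural comparison rather than a string comparison.

```js
// Invented shape for illustration: elements are { name, attributes, children },
// text nodes are plain strings. Parsing erases syntax-only differences, so
// "<br>" and "<br/>" both become { name: 'br', attributes: {}, children: [] }.
function isEqualAttributes( a, b ) {
	const names = Object.keys( a );

	return (
		names.length === Object.keys( b ).length &&
		names.every( ( name ) => a[ name ] === b[ name ] )
	);
}

function isEqualNode( a, b ) {
	// Text nodes compare by value; meaningful whitespace lives here, not in
	// the spacing between tags.
	if ( typeof a === 'string' || typeof b === 'string' ) {
		return a === b;
	}

	return (
		a.name === b.name &&
		isEqualAttributes( a.attributes, b.attributes ) &&
		a.children.length === b.children.length &&
		a.children.every( ( child, i ) => isEqualNode( child, b.children[ i ] ) )
	);
}

// Storage no longer has to be text: the same structure can be serialized as
// JSON for, say, a Simperium-backed plugin or the mobile apps.
const stored = JSON.stringify( {
	name: 'p',
	attributes: { class: 'intro' },
	children: [ 'Hello world', { name: 'br', attributes: {}, children: [] } ],
} );
```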
