Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CommonMark mixed-marker list compliance #439

Merged
merged 23 commits into from
Mar 27, 2018

Conversation

aidantwoods
Copy link
Collaborator

These changes aren't fit for merge, nor do they work correctly (yet)

Refer to #437 for context on the purpose of these changes.

The current changes echo out the following:
screen shot 2016-10-11 at 12 06 09

Given the following data:

* unordered, item 1 (new list 1)
  * sub-unordered item 1 (new list 1.1)
  * sub-unordered item 2 
  + sub-unordered item 1 (new list 1.2)
  - sub-unordered item 1 (new list 1.3)
  1. sub-ordered item 1 (new list 1.4)
  2. sub-ordered item 2
  * sub-unordered item 4 (new list 1.5)
    1. sub-sub ordered item 1 (new list 1.5.1)
* unordered, item 2

@PhrozenByte

I may be wrong in assuming that a return null; statement is the correct thing to do in order to start a new list block at the current nest depth. But I'm getting some strange behaviour, where lists of different depth end up having identical indentation (according to $Block['indent'] === $Line['indent'])

Also don't worry too much about the ugly looking number of variables under Block['data'] – they're there for debugging really – so I can see more info about the 'parent' list when the elseif is triggered. Most of that data shouldn't be too necessary, and can certainly be cleaned up when moving towards an actual solution here.

These changes aren't fit for merge, nor do they work correctly (yet)
@aidantwoods aidantwoods changed the title (beginning to) improve commonmark compliance:lists (Beginning to) improve CommonMark compliance: Lists Oct 11, 2016
Parsedown.php Outdated
$Block['indent'] === $Line['indent']
and $Block['data']['type'] === 'ul'
and preg_match('/^'.preg_quote($Block['data']['matchText']).'(?:[ ]+(.*)|$)/', $Line['text'], $matches)
)
Copy link
Collaborator Author

@aidantwoods aidantwoods Oct 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be summarised as the following:

  • If either of the following renders true, then add the item as normal

    • If the list is the same indent as the current block
    • and the type is ordered
    • check to see if the line matches the expected pattern (Parsedown doesn't yet support multiple ordered list markers, so comparing the pattern is the only thing we can do here)

    or

    • If the list is the same indent as the current block
    • and the type is un-ordered
    • check to see if the line has the same list marker (in accordance with CommonMark spec)
  • otherwise return null if the following is true

    • the line has the same indent as the current block

@PhrozenByte
Copy link
Contributor

PhrozenByte commented Oct 11, 2016

You can find the issue on line 588: This line is responsible for trimming the list item (i.e. removing the marker or matching indents from the beginning of the line) and re-issuing it to Parsedown::lines() (via Parsedown::li()) for deeper inspect (including parsing sublists). However, Parsedown doesn't actually compare the indent with the list item's width (-·text is 2 chars long, -···text is 4 chars long, 1.·text is 3 chars long, 123)·text is 5 chars long (btw: ) is a valid ordered list marker); · are whitespaces, GitHub removes them otherwise...), but trims always up to four whitespaces. Thus the following works:

* unordered, item 1 (new list 1)
    * sub-unordered item 1 (new list 1.1)
    * sub-unordered item 2
    + sub-unordered item 1 (new list 1.2)
    - sub-unordered item 1 (new list 1.3)
    1. sub-ordered item 1 (new list 1.4)
    2. sub-ordered item 2
    * sub-unordered item 4 (new list 1.5)
        1. sub-sub ordered item 1 (new list 1.5.1)
* unordered, item 2

Have a look at the CommonMark specs about how to parse list items: http://spec.commonmark.org/0.26/#list-items

@aidantwoods
Copy link
Collaborator Author

I see, I wonder how appropriate just swapping out 4 for the current indent (plus one) is in this scenario.
I'll have a proper look to see if that's likely to break anything, but let's see if it passes the tests

@aidantwoods
Copy link
Collaborator Author

@PhrozenByte okay cool, so I'd still like to add support for the ) list marker, and then add a proper check on for that.
I've updated the lazy list test with mixed markers to comply with CommonMark, but I'll add some more on that for ordered list markers too.

As I said, I'll take a better look through the code to try assess that indentation stripping change. Anything that you can spot immediately that might cause trouble? (I'm still yet to clean up the code to remove the data I've passed on for debugging).

@PhrozenByte
Copy link
Contributor

At the moment not, no. As said, you can use test/CommonMarkTest.php or test/CommonMarkTestWeak.php from #423 to test whether your changes comply to the CommonMark specs or not.

@aidantwoods
Copy link
Collaborator Author

@PhrozenByte yup, I'll run it through those tests once I've added support for the alternate ordered list marker :)

@aidantwoods aidantwoods changed the title (Beginning to) improve CommonMark compliance: Lists Improve CommonMark compliance: Lists Oct 11, 2016
@aidantwoods aidantwoods changed the title Improve CommonMark compliance: Lists Improve CommonMark list compliance Oct 11, 2016
Also added marker check to ordered list case when deciding to continue the current list
@aidantwoods
Copy link
Collaborator Author

aidantwoods commented Oct 11, 2016

@PhrozenByte I've run it through the common mark test, and we get 207 failures on both the current master, and this commit.

FFFFFFFFFFFFFFFF..FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  63 / 209 ( 30%)
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 126 / 209 ( 60%)
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 189 / 209 ( 90%)
FFFFFFFFFFFFFFFFFFFF                                            209 / 209 (100%)

Thought that was a bit strange, so I tried running an old master through the test (prior to my adding the list start attribute in #431)

Looking at the test log:

76) CommonMarkTest::test_ with data set #77 ('List items', '<ol start="123456789">\n<li>ok...ot ok\n', '<p>1234567890. not ok</p>\n```...\n0. ok')
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'<p>1234567890. not ok</p>
-````````````````````````````````
-A start number may begin with 0s:
-```````````````````````````````` example
-0. ok'
+'<ol start="123456789">
+<li>ok</li></ol>
+<pre><code>```````````````````````````````` example
+1234567890. not ok</code></pre>'

Apparently Parsedown is able to produce the the start attribute prior to getting the functionality to support it.

It looks like the CommonMark test is broken, and is in-fact giving Parsedown some of the expected HTML (which would be the only way Parsedown could echo back that start attribute in an old version), and causing quite a few failures, where there perhaps shouldn't be?

@PhrozenByte
Copy link
Contributor

You must use test/CommonMarkTest.php or test/CommonMarkTestWeak.php from #423, not from the master branch - master is broken, that's the reason why #423 exists.

@aidantwoods
Copy link
Collaborator Author

Ah I see – apologies, completely read over that!

@aidantwoods
Copy link
Collaborator Author

aidantwoods commented Oct 11, 2016

@PhrozenByte Yeah these are looking a little more sane now ;p

Base line (at master) and this commit are identical in tests at 328 failures

FFF.FFFFF........FF.....F.F..F...F....FF....FF..F....FFF.F..FFF  63 / 621 ( 10%)
....FFFF....F..FFF.FFFFFFFFFFF.FFF.FFFFFFFFFFFFFF.F.FFFFFFFFF.. 126 / 621 ( 20%)
FF.FFFF...FF...F.FFFFFF.F.F.FF.FFFF.F.F.F.FF..FFF.FF......FF... 189 / 621 ( 30%)
.F..FFFFF...F...F..F..FFFF.F.....FF.F.F.FFFFF.F.FFFFFFF.FFFFF.. 252 / 621 ( 40%)
.FFFF..FFF.F.FF.FFFFFFFFFFF.FF.F.FFF.F.FFFFFFFFFFF.FFFFF.F...FF 315 / 621 ( 50%)
...FFF..FF..FFF...FFF.F...FFFF.F.F..F..FF..FFFFFFF.FF....FF.FFF 378 / 621 ( 60%)
....FFF.....FF.....FFF...............F..FF.......F..FF......F.. 441 / 621 ( 71%)
.FFFFFFFFFFFF..FF..FFF.FFFF.FFFFFF.FF....F..FFFF..FFF..F..FFF.F 504 / 621 ( 81%)
FF.FFFFF.FFF.......F...F...F..F.....FFFFF.FF....F.F.FF.......FF 567 / 621 ( 91%)
.F.F..FF.FF....FF.F..FF.....FFF..F.F..F.F.....F.......          621 / 621 (100%)

I'm just skimming over the list tests, and I can't actually see a test for mixed markers.

So I guess no improvement should be expected in that case?

@aidantwoods
Copy link
Collaborator Author

I've added some items to the test that do a little more to verify mixed markers indeed do start a new list

@aidantwoods aidantwoods changed the title Improve CommonMark list compliance Improve CommonMark mixed-marker list compliance Oct 11, 2016
@PhrozenByte
Copy link
Contributor

PhrozenByte commented Oct 11, 2016

Yeah, as said in #437 (comment), there's afaik no example for that at the moment (one of the reasons why passing all CommonMark tests doesn't necessarily mean that the parser is fully CommonMark compliant). However, you can (and should) add a new (!) test to Parsedown's own test suite (8965c78). My intention by referencing test/CommonMarkTestWeak.php is, that you ensure, that any change you make doesn't break a previously passed test (and it might give some incitation to fix other similar/related issues).

btw: I recommend using test/CommonMarkTestWeak.php instead of test/CommonMarkTest.php for now (see #423).

@aidantwoods
Copy link
Collaborator Author

aidantwoods commented Oct 12, 2016

@PhrozenByte I've got this to a state where (except data set #77), new failed tests don't look too hard to fix: e.g.

87) CommonMarkTestWeak::testExample with data set #222 ('List items', '1.  foo\n\n    ```\n    bar\n    ... > bam', '<ol>\n<li>\n<p>foo</p>\n<pre><co...\n</ol>')
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
 '<ol><li><p>foo</p>
-<pre><code>bar</code></pre>
+<pre><code>    bar</code></pre>
 <p>baz</p>
 <blockquote><p>bam</p></blockquote></li></ol>'

I'll try get #77 working next, which should solve the failed Parsedown checks too, and then this commit will have passed a few more, while not degrading compliance (hopefully).

Here's the git diff of the master vs. this commit in CommonMarkWeak (take care reading the double git diff notation (+ and -), since the test outputs the differences in git style).

(Also note I've only included the additional tests that my commit has caused to fail here – it did actually improve results in some places too! ;-) )

screen shot 2016-10-12 at 17 58 23

screen shot 2016-10-12 at 17 58 45

screen shot 2016-10-12 at 17 59 10

screen shot 2016-10-12 at 18 00 36

screen shot 2016-10-12 at 18 00 53

@aidantwoods
Copy link
Collaborator Author

Just to talk a little bit about the reasoning behind the change I made in the latest commit.

The root issue that this PR solves, is the ability to create a new type of list, within a list, without nesting within an li.
That type of behaviour requires that the lines handler be informed that the current block type is not continuable.
When this information is passed to the lines handler, it treats the current line as a new block (which is what we want). Unfortunately there are some very specific CommonMark specifications that are violated when this new type of behaviour plays out.
In particular, we have:

screen shot 2016-10-13 at 14 33 33

This test enforces that a list item take precedence over code when there is ambiguity – which is fine a change I already solved when we're talking about a list item of the same depth.
See changes on lines 199–213 (https://github.com/erusev/parsedown/pull/439/files?diff=unified#diff-ed0d3da57330712681e64f838087ea47R199)
The lines handler will 'prefer' the current block type.

The code requires a little more editing for nested lists though, we want to prefer to start a list over code, but only if the line is a list. If it isn't we want the code priority to sit before everything else to avoid breaking cases like:

screen shot 2016-10-13 at 14 49 19

In the above, a Rule must super-seed a List, and Code must super-seed both a List and a Rule. However, not if we're already in a list. In that case List must super-seed Code.

Okay, but again: we're okay, and can keep all the priorities in place, except if the current line is a list, and the current block is a list.

However, passing the tests get a bit more complicated when encounting tests like this one, and also simultaneously having to prioritise a nested list that is within a list:

screen shot 2016-10-13 at 14 33 03

This behaviour is the complete opposite to that of the previous. With the exception that it can only occur at the "root" block. Parsedown didn't currently store information about the 'parent' block when invoking the lines handler from the function li.
So I've added it.

Parsedown can now differentiate between the two cases and assign priorities accordingly.

(Let me know if you can see a nicer way of doing it!)

@aidantwoods
Copy link
Collaborator Author

Okay, so this should conclude the fixes for any tests that were previously broken by the added functionality.

Here's the git diff of master vs. this commit for the CommonMarkWeak test, showing the failures my commit removes (i.e. improvements). This commit no longer breaks any previously passed CommonMarkWeak tests 😃 (cc: @PhrozenByte @erusev )

screen shot 2016-10-13 at 15 50 45

screen shot 2016-10-13 at 15 50 59

screen shot 2016-10-13 at 15 52 45

screen shot 2016-10-13 at 15 51 19

screen shot 2016-10-13 at 15 51 30

@aidantwoods
Copy link
Collaborator Author

aidantwoods commented Oct 13, 2016

Here might be a good point to note that the code I've added to prevent breakages is far from ideal. There's nothing wrong with the actual code, per say – more that the method in which it has been added could be better (but would require code changes in all the handler functions' input structure to facilitate).

Simply put: because the CommonMark spec requires awareness of parent blocks and their properties, (or lack of any parent blocks) in some places (as discussed above), Parsedown should also retain this information so that the handler functions may use information from previous blocks to decide what to do with the current one. In these changes I've only caused awareness of the parent block type in a few cases, but there are cases in which knowing the parent's indent would be useful too. And certainly there may be other properties that a handler may require about the parent when trying to be CommonMark compliant that I haven't run into here.

@PhrozenByte
Copy link
Contributor

To be honest, I'd a bit trouble to track this down... Code comments would have been helpful 😆

The parsing strategy of Parsedown is to let the list handler take responsibility of all lines which are supposed to be part of a list item, collecting them as li and finally independently parsing them with decremented indent - and this is exactly what we need for CommonMark. As far as I understand this whole issue right, the only problem you've is that you're not decrementing the indent properly. That's what I was trying to say with

However, Parsedown doesn't actually compare the indent with the list item's width (-·text is 2 chars long, -···text is 4 chars long, 1.·text is 3 chars long, 123)·text is 5 chars long […]; · are whitespaces, GitHub removes them otherwise...), but trims always up to four whitespaces.

However, you're still making (wrong) assumptions about the indent with

preg_replace('/^[ ]{0,'.min(4, $Block['indent'] + 1).'}/', '', $Line['body'])

The CommonMark specs are very clear about this: Don't assume anything, the indent must match exactly the list item's width. Any additional whitespace is treated the same way as when it is at the beginning of any other line. A missing whitespace leads to not recognizing the line as part of the list item. The only exception is a multi-line paragraph.

I highly recommend you to work on the basis of the CommonMark specs. Don't just look at a failing example and try fix it, read through the specs to find out why the result should be like this and then try to implement it that way, completely independent of the failed tests you were looking at.

So, instead of your massive changes in 2db3199, 0a43799 and 6973302, you just need some minor tweaks to Parsedown::blockList() and Parsedown::blockListContinue().

I've opened a PR on your fork (see aidantwoods#2; here's the diff excluding the above three commits) to fix this. Simply merge my PR and the commits will automatically show up in this PR. Since I was working on it anyway, I've also fixed various other small issues. The only "real" list-related CommonMark example that still fails is issue #427, everything else is actually related to something else (like code block or blockquote parsing) or are crazy exception rules which IMHO don't make much sense and appear as if they have been added to match their reference implementation.

@aidantwoods aidantwoods merged commit 2f291e0 into erusev:master Mar 27, 2018
@aidantwoods aidantwoods added this to the 1.8.0 milestone Mar 28, 2018
@aidantwoods aidantwoods deleted the patch-4 branch March 28, 2018 11:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants