Improve CommonMark mixed-marker list compliance #439

aidantwoods · 2016-10-11T11:13:30Z

These changes aren't fit for merge, nor do they work correctly (yet)

Refer to #437 for context on the purpose of these changes.

The current changes echo out the following:

Given the following data:

* unordered, item 1 (new list 1)
  * sub-unordered item 1 (new list 1.1)
  * sub-unordered item 2 
  + sub-unordered item 1 (new list 1.2)
  - sub-unordered item 1 (new list 1.3)
  1. sub-ordered item 1 (new list 1.4)
  2. sub-ordered item 2
  * sub-unordered item 4 (new list 1.5)
    1. sub-sub ordered item 1 (new list 1.5.1)
* unordered, item 2

@PhrozenByte

I may be wrong in assuming that a return null; statement is the correct thing to do in order to start a new list block at the current nest depth. But I'm getting some strange behaviour, where lists of different depth end up having identical indentation (according to $Block['indent'] === $Line['indent'])

Also don't worry too much about the ugly looking number of variables under Block['data'] – they're there for debugging really – so I can see more info about the 'parent' list when the elseif is triggered. Most of that data shouldn't be too necessary, and can certainly be cleaned up when moving towards an actual solution here.

These changes aren't fit for merge, nor do they work correctly (yet)

aidantwoods · 2016-10-11T11:29:08Z

Parsedown.php

+                    $Block['indent'] === $Line['indent']
+                and $Block['data']['type'] === 'ul'
+                and preg_match('/^'.preg_quote($Block['data']['matchText']).'(?:[ ]+(.*)|$)/', $Line['text'], $matches)
+            )


This can be summarised as the following:

If either of the following renders true, then add the item as normal

If the list is the same indent as the current block

and the type is ordered

check to see if the line matches the expected pattern (Parsedown doesn't yet support multiple ordered list markers, so comparing the pattern is the only thing we can do here)

or

If the list is the same indent as the current block

and the type is un-ordered

check to see if the line has the same list marker (in accordance with CommonMark spec)

otherwise return null if the following is true

the line has the same indent as the current block

PhrozenByte · 2016-10-11T12:17:27Z

You can find the issue on line 588: This line is responsible for trimming the list item (i.e. removing the marker or matching indents from the beginning of the line) and re-issuing it to Parsedown::lines() (via Parsedown::li()) for deeper inspect (including parsing sublists). However, Parsedown doesn't actually compare the indent with the list item's width (-·text is 2 chars long, -···text is 4 chars long, 1.·text is 3 chars long, 123)·text is 5 chars long (btw: ) is a valid ordered list marker); · are whitespaces, GitHub removes them otherwise...), but trims always up to four whitespaces. Thus the following works:

* unordered, item 1 (new list 1)
    * sub-unordered item 1 (new list 1.1)
    * sub-unordered item 2
    + sub-unordered item 1 (new list 1.2)
    - sub-unordered item 1 (new list 1.3)
    1. sub-ordered item 1 (new list 1.4)
    2. sub-ordered item 2
    * sub-unordered item 4 (new list 1.5)
        1. sub-sub ordered item 1 (new list 1.5.1)
* unordered, item 2

Have a look at the CommonMark specs about how to parse list items: http://spec.commonmark.org/0.26/#list-items

aidantwoods · 2016-10-11T12:40:34Z

I see, I wonder how appropriate just swapping out 4 for the current indent (plus one) is in this scenario.
I'll have a proper look to see if that's likely to break anything, but let's see if it passes the tests

aidantwoods · 2016-10-11T13:01:40Z

@PhrozenByte okay cool, so I'd still like to add support for the ) list marker, and then add a proper check on for that.
I've updated the lazy list test with mixed markers to comply with CommonMark, but I'll add some more on that for ordered list markers too.

As I said, I'll take a better look through the code to try assess that indentation stripping change. Anything that you can spot immediately that might cause trouble? (I'm still yet to clean up the code to remove the data I've passed on for debugging).

PhrozenByte · 2016-10-11T13:14:17Z

At the moment not, no. As said, you can use test/CommonMarkTest.php or test/CommonMarkTestWeak.php from #423 to test whether your changes comply to the CommonMark specs or not.

aidantwoods · 2016-10-11T13:39:27Z

@PhrozenByte yup, I'll run it through those tests once I've added support for the alternate ordered list marker :)

Also added marker check to ordered list case when deciding to continue the current list

aidantwoods · 2016-10-11T19:36:26Z

@PhrozenByte I've run it through the common mark test, and we get 207 failures on both the current master, and this commit.

FFFFFFFFFFFFFFFF..FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  63 / 209 ( 30%)
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 126 / 209 ( 60%)
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 189 / 209 ( 90%)
FFFFFFFFFFFFFFFFFFFF                                            209 / 209 (100%)

Thought that was a bit strange, so I tried running an old master through the test (prior to my adding the list start attribute in #431)

Looking at the test log:

76) CommonMarkTest::test_ with data set #77 ('List items', '<ol start="123456789">\n<li>ok...ot ok\n', '<p>1234567890. not ok</p>\n```...\n0. ok')
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'<p>1234567890. not ok</p>
-````````````````````````````````
-A start number may begin with 0s:
-```````````````````````````````` example
-0. ok'
+'<ol start="123456789">
+<li>ok</li></ol>
+<pre><code>```````````````````````````````` example
+1234567890. not ok</code></pre>'

Apparently Parsedown is able to produce the the start attribute prior to getting the functionality to support it.

It looks like the CommonMark test is broken, and is in-fact giving Parsedown some of the expected HTML (which would be the only way Parsedown could echo back that start attribute in an old version), and causing quite a few failures, where there perhaps shouldn't be?

PhrozenByte · 2016-10-11T19:40:08Z

You must use test/CommonMarkTest.php or test/CommonMarkTestWeak.php from #423, not from the master branch - master is broken, that's the reason why #423 exists.

aidantwoods · 2016-10-11T19:41:28Z

Ah I see – apologies, completely read over that!

aidantwoods · 2016-10-11T19:51:46Z

@PhrozenByte Yeah these are looking a little more sane now ;p

Base line (at master) and this commit are identical in tests at 328 failures

FFF.FFFFF........FF.....F.F..F...F....FF....FF..F....FFF.F..FFF  63 / 621 ( 10%)
....FFFF....F..FFF.FFFFFFFFFFF.FFF.FFFFFFFFFFFFFF.F.FFFFFFFFF.. 126 / 621 ( 20%)
FF.FFFF...FF...F.FFFFFF.F.F.FF.FFFF.F.F.F.FF..FFF.FF......FF... 189 / 621 ( 30%)
.F..FFFFF...F...F..F..FFFF.F.....FF.F.F.FFFFF.F.FFFFFFF.FFFFF.. 252 / 621 ( 40%)
.FFFF..FFF.F.FF.FFFFFFFFFFF.FF.F.FFF.F.FFFFFFFFFFF.FFFFF.F...FF 315 / 621 ( 50%)
...FFF..FF..FFF...FFF.F...FFFF.F.F..F..FF..FFFFFFF.FF....FF.FFF 378 / 621 ( 60%)
....FFF.....FF.....FFF...............F..FF.......F..FF......F.. 441 / 621 ( 71%)
.FFFFFFFFFFFF..FF..FFF.FFFF.FFFFFF.FF....F..FFFF..FFF..F..FFF.F 504 / 621 ( 81%)
FF.FFFFF.FFF.......F...F...F..F.....FFFFF.FF....F.F.FF.......FF 567 / 621 ( 91%)
.F.F..FF.FF....FF.F..FF.....FFF..F.F..F.F.....F.......          621 / 621 (100%)

I'm just skimming over the list tests, and I can't actually see a test for mixed markers.

So I guess no improvement should be expected in that case?

aidantwoods · 2016-10-11T20:00:55Z

I've added some items to the test that do a little more to verify mixed markers indeed do start a new list

PhrozenByte · 2016-10-11T20:08:28Z

Yeah, as said in #437 (comment), there's afaik no example for that at the moment (one of the reasons why passing all CommonMark tests doesn't necessarily mean that the parser is fully CommonMark compliant). However, you can (and should) add a new (!) test to Parsedown's own test suite (8965c78). My intention by referencing test/CommonMarkTestWeak.php is, that you ensure, that any change you make doesn't break a previously passed test (and it might give some incitation to fix other similar/related issues).

btw: I recommend using test/CommonMarkTestWeak.php instead of test/CommonMarkTest.php for now (see #423).

aidantwoods · 2016-10-12T17:25:11Z

@PhrozenByte I've got this to a state where (except data set #77), new failed tests don't look too hard to fix: e.g.

87) CommonMarkTestWeak::testExample with data set #222 ('List items', '1.  foo\n\n    ```\n    bar\n    ... > bam', '<ol>\n<li>\n<p>foo</p>\n<pre><co...\n</ol>')
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
 '<ol><li><p>foo</p>
-<pre><code>bar</code></pre>
+<pre><code>    bar</code></pre>
 <p>baz</p>
 <blockquote><p>bam</p></blockquote></li></ol>'

I'll try get #77 working next, which should solve the failed Parsedown checks too, and then this commit will have passed a few more, while not degrading compliance (hopefully).

Here's the git diff of the master vs. this commit in CommonMarkWeak (take care reading the double git diff notation (+ and -), since the test outputs the differences in git style).

(Also note I've only included the additional tests that my commit has caused to fail here – it did actually improve results in some places too! ;-) )

aidantwoods · 2016-10-13T14:00:28Z

Just to talk a little bit about the reasoning behind the change I made in the latest commit.

The root issue that this PR solves, is the ability to create a new type of list, within a list, without nesting within an li.
That type of behaviour requires that the lines handler be informed that the current block type is not continuable.
When this information is passed to the lines handler, it treats the current line as a new block (which is what we want). Unfortunately there are some very specific CommonMark specifications that are violated when this new type of behaviour plays out.
In particular, we have:

This test enforces that a list item take precedence over code when there is ambiguity – which is fine a change I already solved when we're talking about a list item of the same depth.
See changes on lines 199–213 (https://github.com/erusev/parsedown/pull/439/files?diff=unified#diff-ed0d3da57330712681e64f838087ea47R199)
The lines handler will 'prefer' the current block type.

The code requires a little more editing for nested lists though, we want to prefer to start a list over code, but only if the line is a list. If it isn't we want the code priority to sit before everything else to avoid breaking cases like:

In the above, a Rule must super-seed a List, and Code must super-seed both a List and a Rule. However, not if we're already in a list. In that case List must super-seed Code.

Okay, but again: we're okay, and can keep all the priorities in place, except if the current line is a list, and the current block is a list.

However, passing the tests get a bit more complicated when encounting tests like this one, and also simultaneously having to prioritise a nested list that is within a list:

This behaviour is the complete opposite to that of the previous. With the exception that it can only occur at the "root" block. Parsedown didn't currently store information about the 'parent' block when invoking the lines handler from the function li.
So I've added it.

Parsedown can now differentiate between the two cases and assign priorities accordingly.

(Let me know if you can see a nicer way of doing it!)

aidantwoods · 2016-10-13T15:01:59Z

Okay, so this should conclude the fixes for any tests that were previously broken by the added functionality.

Here's the git diff of master vs. this commit for the CommonMarkWeak test, showing the failures my commit removes (i.e. improvements). This commit no longer breaks any previously passed CommonMarkWeak tests 😃 (cc: @PhrozenByte @erusev )

aidantwoods · 2016-10-13T15:09:43Z

Here might be a good point to note that the code I've added to prevent breakages is far from ideal. There's nothing wrong with the actual code, per say – more that the method in which it has been added could be better (but would require code changes in all the handler functions' input structure to facilitate).

Simply put: because the CommonMark spec requires awareness of parent blocks and their properties, (or lack of any parent blocks) in some places (as discussed above), Parsedown should also retain this information so that the handler functions may use information from previous blocks to decide what to do with the current one. In these changes I've only caused awareness of the parent block type in a few cases, but there are cases in which knowing the parent's indent would be useful too. And certainly there may be other properties that a handler may require about the parent when trying to be CommonMark compliant that I haven't run into here.

…k tests" This reverts commit 6973302.

This reverts commit 0a43799.

This reverts commit 2db3199.

See CommonMark spec examples [erusev#226](http://spec.commonmark.org/0.26/#example-226) to erusev#229

According to the CommonMark specs ([list items](http://spec.commonmark.org/0.26/#list-items), rule 3), list items starting with a blank line basically behave like as if the \n doesn't exist. Also see example [erusev#241](http://spec.commonmark.org/0.26/#example-241).

This basically represents [list item parsing](http://spec.commonmark.org/0.26/#list-items), rule 1 of the CommonMark specs.

Subsequent list items which aren't indented sufficiently are treated as part of the original list, see CommonMark spec example [erusev#256](http://spec.commonmark.org/0.26/#example-256).

PhrozenByte · 2016-10-13T19:09:44Z

To be honest, I'd a bit trouble to track this down... Code comments would have been helpful 😆

The parsing strategy of Parsedown is to let the list handler take responsibility of all lines which are supposed to be part of a list item, collecting them as li and finally independently parsing them with decremented indent - and this is exactly what we need for CommonMark. As far as I understand this whole issue right, the only problem you've is that you're not decrementing the indent properly. That's what I was trying to say with

However, Parsedown doesn't actually compare the indent with the list item's width (-·text is 2 chars long, -···text is 4 chars long, 1.·text is 3 chars long, 123)·text is 5 chars long […]; · are whitespaces, GitHub removes them otherwise...), but trims always up to four whitespaces.

However, you're still making (wrong) assumptions about the indent with

preg_replace('/^[ ]{0,'.min(4, $Block['indent'] + 1).'}/', '', $Line['body'])

The CommonMark specs are very clear about this: Don't assume anything, the indent must match exactly the list item's width. Any additional whitespace is treated the same way as when it is at the beginning of any other line. A missing whitespace leads to not recognizing the line as part of the list item. The only exception is a multi-line paragraph.

I highly recommend you to work on the basis of the CommonMark specs. Don't just look at a failing example and try fix it, read through the specs to find out why the result should be like this and then try to implement it that way, completely independent of the failed tests you were looking at.

So, instead of your massive changes in 2db3199, 0a43799 and 6973302, you just need some minor tweaks to Parsedown::blockList() and Parsedown::blockListContinue().

I've opened a PR on your fork (see aidantwoods#2; here's the diff excluding the above three commits) to fix this. Simply merge my PR and the commits will automatically show up in this PR. Since I was working on it anyway, I've also fixed various other small issues. The only "real" list-related CommonMark example that still fails is issue #427, everything else is actually related to something else (like code block or blockquote parsing) or are crazy exception rules which IMHO don't make much sense and appear as if they have been added to match their reference implementation.

Use the list marker width to determine whether a list item is continued

(beginning to) improve commonmark compliance:lists

c4d4a68

These changes aren't fit for merge, nor do they work correctly (yet)

aidantwoods changed the title ~~(beginning to) improve commonmark compliance:lists~~ (Beginning to) improve CommonMark compliance: Lists Oct 11, 2016

aidantwoods commented Oct 11, 2016

View reviewed changes

Contextual limits on indentation stripping

06c4344

aidantwoods added 2 commits October 11, 2016 13:48

Check that the current line is a list before starting a new one

0bd61a7

Update test to comply with CommonMark

d967914

aidantwoods changed the title ~~(Beginning to) improve CommonMark compliance: Lists~~ Improve CommonMark compliance: Lists Oct 11, 2016

aidantwoods changed the title ~~Improve CommonMark compliance: Lists~~ Improve CommonMark list compliance Oct 11, 2016

Add ) as an ordered list marker

d26b33c

Also added marker check to ordered list case when deciding to continue the current list

More appropriate tests for these changes

8965c78

aidantwoods changed the title ~~Improve CommonMark list compliance~~ Improve CommonMark mixed-marker list compliance Oct 11, 2016

PhrozenByte mentioned this pull request Oct 12, 2016

Test Parsedown against cached expect-to-pass CommonMark spec examples #440

Closed

Break less previously passed CommonMarkWeak tests

2db3199

Prevent failure with data set 77 in CommonMarkWeak

0a43799

Prevent breaking remaining previously compliant CommonMarkWeak tests

6973302

PhrozenByte added 10 commits October 13, 2016 19:25

Revert "Prevent breaking remaining previously compliant CommonMarkWea…

eb853da

…k tests" This reverts commit 6973302.

Revert "Prevent failure with data set 77 in CommonMarkWeak"

e691034

This reverts commit 0a43799.

Revert "Break less previously passed CommonMarkWeak tests"

81025cd

This reverts commit 2db3199.

Fix ordered list start argument

bdf537e

See CommonMark spec examples [erusev#226](http://spec.commonmark.org/0.26/#example-226) to erusev#229

Remove unused $placeholder variable

30ff5c6

Support list items starting with indented code

1d61f90

Use the list marker width to determine whether a list item is continued

7b1529f

This basically represents [list item parsing](http://spec.commonmark.org/0.26/#list-items), rule 1 of the CommonMark specs.

Fix code formatting

a9e1163

Handle subsequent list items which aren't indented sufficiently

a3836b1

Subsequent list items which aren't indented sufficiently are treated as part of the original list, see CommonMark spec example [erusev#256](http://spec.commonmark.org/0.26/#example-256).

PhrozenByte mentioned this pull request Oct 13, 2016

Use the list marker width to determine whether a list item is continued aidantwoods/parsedown#2

Merged

aidantwoods added 4 commits October 14, 2016 08:29

Merge pull request #2 from PhrozenByte/aidantwoods/patch-4

67e454e

Use the list marker width to determine whether a list item is continued

Merge branch 'master' into patch-4

21cdd8a

Add more tests for CommonMark compliance

f594d4c

Trim whitespace

d2a73f9

aidantwoods merged commit 2f291e0 into erusev:master Mar 27, 2018

This was referenced Mar 27, 2018

List indentation problem #581

Closed

Incorrect behaviour for mixed list types? #437

Closed

aidantwoods added this to the 1.8.0 milestone Mar 28, 2018

aidantwoods deleted the patch-4 branch March 28, 2018 11:42

aidantwoods mentioned this pull request Apr 5, 2018

1.8.0 release notes [draft] #601

Closed

BenjaminHoegh mentioned this pull request Nov 10, 2024

Merge 1.8.x-beta into master #881

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve CommonMark mixed-marker list compliance #439

Improve CommonMark mixed-marker list compliance #439

aidantwoods commented Oct 11, 2016

aidantwoods Oct 11, 2016 •

edited

Loading

PhrozenByte commented Oct 11, 2016 •

edited

Loading

aidantwoods commented Oct 11, 2016

aidantwoods commented Oct 11, 2016

PhrozenByte commented Oct 11, 2016

aidantwoods commented Oct 11, 2016

aidantwoods commented Oct 11, 2016 •

edited

Loading

PhrozenByte commented Oct 11, 2016

aidantwoods commented Oct 11, 2016

aidantwoods commented Oct 11, 2016 •

edited

Loading

aidantwoods commented Oct 11, 2016

PhrozenByte commented Oct 11, 2016 •

edited

Loading

aidantwoods commented Oct 12, 2016 •

edited

Loading

aidantwoods commented Oct 13, 2016

aidantwoods commented Oct 13, 2016

aidantwoods commented Oct 13, 2016 •

edited

Loading

PhrozenByte commented Oct 13, 2016

Improve CommonMark mixed-marker list compliance #439

Improve CommonMark mixed-marker list compliance #439

Conversation

aidantwoods commented Oct 11, 2016

aidantwoods Oct 11, 2016 • edited Loading

Choose a reason for hiding this comment

PhrozenByte commented Oct 11, 2016 • edited Loading

aidantwoods commented Oct 11, 2016

aidantwoods commented Oct 11, 2016

PhrozenByte commented Oct 11, 2016

aidantwoods commented Oct 11, 2016

aidantwoods commented Oct 11, 2016 • edited Loading

PhrozenByte commented Oct 11, 2016

aidantwoods commented Oct 11, 2016

aidantwoods commented Oct 11, 2016 • edited Loading

aidantwoods commented Oct 11, 2016

PhrozenByte commented Oct 11, 2016 • edited Loading

aidantwoods commented Oct 12, 2016 • edited Loading

aidantwoods commented Oct 13, 2016

aidantwoods commented Oct 13, 2016

aidantwoods commented Oct 13, 2016 • edited Loading

PhrozenByte commented Oct 13, 2016

aidantwoods Oct 11, 2016 •

edited

Loading

PhrozenByte commented Oct 11, 2016 •

edited

Loading

aidantwoods commented Oct 11, 2016 •

edited

Loading

aidantwoods commented Oct 11, 2016 •

edited

Loading

PhrozenByte commented Oct 11, 2016 •

edited

Loading

aidantwoods commented Oct 12, 2016 •

edited

Loading

aidantwoods commented Oct 13, 2016 •

edited

Loading