CRLF support #18

btrask · 2015-03-22T22:10:42Z

Fixes #14.

Note that this has some failing tests due to #17. However I don't think any new problems were introduced with these changes.

nwellnhof · 2015-03-23T14:18:11Z

src/blocks.c

+		if (b->end_column && parser->curline->ptr[b->end_column-1] == '\n')
+			b->end_column -= 1;
+		if (b->end_column && parser->curline->ptr[b->end_column-1] == '\r')
+			b->end_column -= 1;


Shouldn't the second if statement only be executed if the first condition is true?

btrask · 2015-03-24T14:42:09Z

The checks are carefully set up to handle CR, LF, and CRLF at the same time. Basically you check for CR first, then increment, then check for LF. So if it's CRLF you get both. In some cases it's scanning backwards so it checks LF first.

CR-only line endings are used on some systems (not as common any more, according to Wikipedia). I didn't see any harm in supporting them. (LFCR is unsupported since I've never seen it used.)

I think you're right about it not handling buffer boundaries though.

nwellnhof · 2015-03-24T15:15:21Z

OK, so the intention is to support single CRs as well. I don't have a problem with that.

@jgm I think the spec could also be improved in that regard. It currently says:

A line ending is, depending on the platform, a newline (U+000A), carriage return (U+000D), or carriage return + newline.

I don't like the "depending on the platform" part. This basically means that CommonMark is interpreted differently depending on the platform. Essentially, this makes it impossible to reliably transfer files between platforms. Also, what is the platform in the context of a web application? Can't we just drop that part and have a single definition for newlines across all platforms?

jgm · 2015-03-27T19:19:30Z

+++ Nick Wellnhofer [Mar 24 15 08:15 ]:

[1]@jgm I think the spec could also be improved in that regard. It
currently says:
A line ending is, depending on the platform, a newline (U+000A),
carriage return (U+000D), or carriage return + newline.
I don't like the "depending on the platform" part. This basically means
that CommonMark is interpreted differently depending on the platform.
Essentially, this makes it impossible to reliably transfer files
between platforms. Also, what is the platform in the context of a web
application? Can't we just drop that part and have a single definition
for newlines across all platforms?

I'm sympathetic with this. To explain what my thinking was: the parser operates on a sequence of characters separated into lines. How these characters and line breaks are encoded in bytes is a concrete implementation detail that the parser does not specify. So, you might have latin1 encoding and LFs as line breaks, or UTF8 encoding and CRLF line breaks.

But there are really three levels of abstraction we could aim for:

Define the spec over a sequence of bytes, specifying everything, including what counts as a line break and how characters are encoded.
Define the spec over a sequence of unicode code points, specifying what counts as a line break in terms of these, but not specifying how characters are encoded as bytes.
Define the spec over a sequence of lines, each of which is a sequence of unicode code points, not specifying how characters are encoded as bytes or how a stream of characters is divided into lines.

I think what I had in mind was roughly 3, but the spec didn't make this too clear. We could switch to 2 or even 1. This may need further discussion on talk.commonmark.org.

Further question: if we said explicitly that any of these combinations counted as a line ending, would we allow these styles to be mixed in the same document?

nwellnhof · 2015-03-28T14:19:58Z

@jgm My proposal is to aim for the second approach.

I think the first approach doesn't make much sense. It limits the CommonMark format to a single encoding (UTF-8).

And yes, I would make any of CR (not followed by LF), LF (not preceded by CR) and CR+LF count as line ending, also allowing different line endings in the same document. This seems like the only workable approach to cross-platform line endings. It should match the behavior this pull request implements. Not sure about the JS implementation, though.

btrask · 2015-03-28T17:40:51Z

I found some bugs with underlined headers. I think what's happening is that \rs and \ns are getting counted as two separate line breaks, and then getting merged. Fixing this might take some more effort because cmark generally assumes that line breaks are a single byte.

I looked into getting the tests running with different line endings but I haven't figured it out yet. Without tests it's hard to tell how much is broken.

jgm · 2015-03-28T18:21:10Z

Tests are essential. For something quick and dirty, can't you just do a regex replace on the markdown content just before it's passed to the program (line 39 of spec_test.py)?

btrask · 2015-03-28T21:40:13Z

I'll look into it, thanks.

btrask · 2015-04-07T13:35:35Z

Fixed three bugs, now all the tests pass!

The tests themselves are just hacked up, right now they are CRLF only. I also tested with CR and that works too.

I'm not sure why the Travis build is failing...

btrask · 2015-04-07T13:44:10Z

Fixed :)

btrask · 2015-04-21T21:59:16Z

@jgm have you had a chance to look at this yet?

I can try to work on integrating the tests better if that's what you're waiting for.

jgm · 2015-04-21T22:59:01Z

+++ Ben Trask [Apr 21 15 14:59 ]:

@jgm have you had a chance to look at this yet?

No, I've just been busy. I will get to it, though.

jgm · 2015-05-26T17:53:53Z

@btrask I've had a look and added some line comments. Aside from those things, it looks good. Thanks for doing this.

@nwellnhof I also checked commonmark.js, and it seems to behave the same way as cmark does with this change. So, we just need to revise the spec. Can you add an issue on that to jgm/CommonMark?

btrask · 2015-05-26T21:55:44Z

Thanks for getting back to this!

Regarding the calls to cmark_strbuf_truncate, it's a plain function call and the data is already going to be cached. I think the overhead is probably trivial and writing it more straightforwardly reduces the risk of bugs. But I'll try to come up with an alternate version that you might like better.

Regarding the -= 1 style, I wasn't sure. I'll change it back.

You're right that the tests currently only check CRLF input. That is a hack that I mentioned above but I should've put a comment in the code too. I honestly wasn't sure the best way to restructure the tests to check all three modes.

jgm · 2015-05-27T20:42:52Z

+++ Ben Trask [May 26 15 14:55 ]:

Thanks for getting back to this!

Regarding the calls to cmark_strbuf_truncate, it's a plain function call and the data is already going to be cached. I think the overhead is probably trivial and writing it more straightforwardly reduces the risk of bugs. But I'll try to come up with an alternate version that you might like better.

We're really going for maximum performance, so avoiding even one
unnecessary function call that gets called every line might help.

Regarding the -= 1 style, I wasn't sure. I'll change it back.

It's not that important. I only commented because your reason for
changing it was to match existing style.

You're right that the tests currently only check CRLF input. That is a hack that I mentioned above but I should've put a comment in the code too. I honestly wasn't sure the best way to restructure the tests to check all three modes.

You can let me take care of that. I was thinking we'd run all
the tests with all three line ending patterns. But now I'm thinking
that will take too long, and that it isn't necessary. Instead, I
may add a separate test for line ending handling.

btrask · 2015-05-28T20:18:13Z

Here is an alternate version.

int trim = 0;
bool cr = false;
bool lf = false;

[...]

// Add a newline to the end if not present:
// TODO this breaks abstraction:
if (parser->curline->size > trim &&
    parser->curline->ptr[parser->curline->size - 1 - trim] == '\n') {
    trim += 1;
    lf = true;
}
if (parser->curline->size > trim &&
    parser->curline->ptr[parser->curline->size - 1 - trim] == '\r') {
    trim += 1;
    cr = true;
}
if (cr) {
    cmark_strbuf_truncate(parser->curline, parser->curline->size - trim);
}
if (cr || !lf) {
    cmark_strbuf_putc(parser->curline, '\n');
}

For the average case with LF input, it saves two function calls per line. For CRLF input, it saves one function call per line. I haven't done any benchmarks but I strongly suspect any performance difference will be drowned by noise (maybe one microsecond for a large document).

I'll leave the tests to you. Thanks.

jgm · 2015-06-03T09:50:10Z

Note to self: I have merged this and pushed a branch line-endings.
I still need to look carefully at the changes and write proper tests before merging into master.

From btrask's alternate code in the comment on #18. Note: this gives a 1-2% performance boot in our benchmark, probably enough to make it worth while.

btrask · 2015-06-05T22:18:36Z

Great work, thank you!

merge from upstream

From btrask's alternate code in the comment on commonmark#18. Note: this gives a 1-2% performance boot in our benchmark, probably enough to make it worth while.

nwellnhof reviewed Mar 23, 2015
View reviewed changes

btrask mentioned this pull request Mar 24, 2015

Rebuild intermediate dependencies [broken] #17

Closed

btrask force-pushed the master branch from 1f11808 to c5ea6bd Compare April 7, 2015 13:00

btrask added 6 commits April 7, 2015 09:26

Support for CRLF and CR line endings.

5a3241c

Add CRLF/CR handling to inlines.c.

209ac5d

Fix regression in remove_trailing_blank_lines().

fbb0836

Hacks to spec_tests.py to test CRLF support.

6f99ff7

Bug fixes for CRLF support.

b8aa649

Try to match existing style better.

271d8a1

btrask force-pushed the master branch from c5ea6bd to 271d8a1 Compare April 7, 2015 13:29

Check length before reading.

60d8ded

nwellnhof mentioned this pull request May 28, 2015

Harmonize newline handling across platforms commonmark/commonmark-spec#326

Closed

jgm added a commit that referenced this pull request Jun 3, 2015

Revised "add newline to end if missing" for performance.

65056ea

From btrask's alternate code in the comment on #18. Note: this gives a 1-2% performance boot in our benchmark, probably enough to make it worth while.

jgm merged commit 60d8ded into commonmark:master Jun 3, 2015

mbernson pushed a commit to mbernson/cmark that referenced this pull request Jul 6, 2020

Merge pull request commonmark#18 from compnerd/20200123

caac851

merge from upstream

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CRLF support #18

CRLF support #18

btrask commented Mar 22, 2015

nwellnhof Mar 23, 2015

btrask commented Mar 24, 2015

nwellnhof commented Mar 24, 2015

jgm commented Mar 27, 2015

nwellnhof commented Mar 28, 2015

btrask commented Mar 28, 2015

jgm commented Mar 28, 2015 via email

btrask commented Mar 28, 2015

btrask commented Apr 7, 2015

btrask commented Apr 7, 2015

btrask commented Apr 21, 2015

jgm commented Apr 21, 2015

jgm commented May 26, 2015

btrask commented May 26, 2015

jgm commented May 27, 2015

btrask commented May 28, 2015

jgm commented Jun 3, 2015

btrask commented Jun 5, 2015

CRLF support #18

CRLF support #18

Conversation

btrask commented Mar 22, 2015

nwellnhof Mar 23, 2015

Choose a reason for hiding this comment

btrask commented Mar 24, 2015

nwellnhof commented Mar 24, 2015

jgm commented Mar 27, 2015

nwellnhof commented Mar 28, 2015

btrask commented Mar 28, 2015

jgm commented Mar 28, 2015 via email

btrask commented Mar 28, 2015

btrask commented Apr 7, 2015

btrask commented Apr 7, 2015

btrask commented Apr 21, 2015

jgm commented Apr 21, 2015

jgm commented May 26, 2015

btrask commented May 26, 2015

jgm commented May 27, 2015

btrask commented May 28, 2015

jgm commented Jun 3, 2015

btrask commented Jun 5, 2015