Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRLF support #18

Merged
merged 7 commits into from
Jun 3, 2015
Merged

CRLF support #18

merged 7 commits into from
Jun 3, 2015

Conversation

btrask
Copy link
Contributor

@btrask btrask commented Mar 22, 2015

Fixes #14.

Note that this has some failing tests due to #17. However I don't think any new problems were introduced with these changes.

if (b->end_column && parser->curline->ptr[b->end_column-1] == '\n')
b->end_column -= 1;
if (b->end_column && parser->curline->ptr[b->end_column-1] == '\r')
b->end_column -= 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the second if statement only be executed if the first condition is true?

@btrask
Copy link
Contributor Author

btrask commented Mar 24, 2015

The checks are carefully set up to handle CR, LF, and CRLF at the same time. Basically you check for CR first, then increment, then check for LF. So if it's CRLF you get both. In some cases it's scanning backwards so it checks LF first.

CR-only line endings are used on some systems (not as common any more, according to Wikipedia). I didn't see any harm in supporting them. (LFCR is unsupported since I've never seen it used.)

I think you're right about it not handling buffer boundaries though.

@nwellnhof
Copy link
Contributor

OK, so the intention is to support single CRs as well. I don't have a problem with that.

@jgm I think the spec could also be improved in that regard. It currently says:

A line ending is, depending on the platform, a newline (U+000A), carriage return (U+000D), or carriage return + newline.

I don't like the "depending on the platform" part. This basically means that CommonMark is interpreted differently depending on the platform. Essentially, this makes it impossible to reliably transfer files between platforms. Also, what is the platform in the context of a web application? Can't we just drop that part and have a single definition for newlines across all platforms?

@jgm
Copy link
Member

jgm commented Mar 27, 2015

+++ Nick Wellnhofer [Mar 24 15 08:15 ]:

[1]@jgm I think the spec could also be improved in that regard. It
currently says:

A line ending is, depending on the platform, a newline (U+000A),
carriage return (U+000D), or carriage return + newline.

I don't like the "depending on the platform" part. This basically means
that CommonMark is interpreted differently depending on the platform.
Essentially, this makes it impossible to reliably transfer files
between platforms. Also, what is the platform in the context of a web
application? Can't we just drop that part and have a single definition
for newlines across all platforms?

I'm sympathetic with this. To explain what my thinking was: the parser operates on a sequence of characters separated into lines. How these characters and line breaks are encoded in bytes is a concrete implementation detail that the parser does not specify. So, you might have latin1 encoding and LFs as line breaks, or UTF8 encoding and CRLF line breaks.

But there are really three levels of abstraction we could aim for:

  1. Define the spec over a sequence of bytes, specifying everything, including what counts as a line break and how characters are encoded.
  2. Define the spec over a sequence of unicode code points, specifying what counts as a line break in terms of these, but not specifying how characters are encoded as bytes.
  3. Define the spec over a sequence of lines, each of which is a sequence of unicode code points, not specifying how characters are encoded as bytes or how a stream of characters is divided into lines.

I think what I had in mind was roughly 3, but the spec didn't make this too clear. We could switch to 2 or even 1. This may need further discussion on talk.commonmark.org.

Further question: if we said explicitly that any of these combinations counted as a line ending, would we allow these styles to be mixed in the same document?

@nwellnhof
Copy link
Contributor

@jgm My proposal is to aim for the second approach.

I think the first approach doesn't make much sense. It limits the CommonMark format to a single encoding (UTF-8).

And yes, I would make any of CR (not followed by LF), LF (not preceded by CR) and CR+LF count as line ending, also allowing different line endings in the same document. This seems like the only workable approach to cross-platform line endings. It should match the behavior this pull request implements. Not sure about the JS implementation, though.

@btrask
Copy link
Contributor Author

btrask commented Mar 28, 2015

I found some bugs with underlined headers. I think what's happening is that \rs and \ns are getting counted as two separate line breaks, and then getting merged. Fixing this might take some more effort because cmark generally assumes that line breaks are a single byte.

I looked into getting the tests running with different line endings but I haven't figured it out yet. Without tests it's hard to tell how much is broken.

@jgm
Copy link
Member

jgm commented Mar 28, 2015 via email

@btrask
Copy link
Contributor Author

btrask commented Mar 28, 2015

I'll look into it, thanks.

@btrask
Copy link
Contributor Author

btrask commented Apr 7, 2015

Fixed three bugs, now all the tests pass!

The tests themselves are just hacked up, right now they are CRLF only. I also tested with CR and that works too.

I'm not sure why the Travis build is failing...

@btrask
Copy link
Contributor Author

btrask commented Apr 7, 2015

Fixed :)

@btrask
Copy link
Contributor Author

btrask commented Apr 21, 2015

@jgm have you had a chance to look at this yet?

I can try to work on integrating the tests better if that's what you're waiting for.

@jgm
Copy link
Member

jgm commented Apr 21, 2015

+++ Ben Trask [Apr 21 15 14:59 ]:

@jgm have you had a chance to look at this yet?

No, I've just been busy. I will get to it, though.

@jgm
Copy link
Member

jgm commented May 26, 2015

@btrask I've had a look and added some line comments. Aside from those things, it looks good. Thanks for doing this.

@nwellnhof I also checked commonmark.js, and it seems to behave the same way as cmark does with this change. So, we just need to revise the spec. Can you add an issue on that to jgm/CommonMark?

@btrask
Copy link
Contributor Author

btrask commented May 26, 2015

Thanks for getting back to this!

Regarding the calls to cmark_strbuf_truncate, it's a plain function call and the data is already going to be cached. I think the overhead is probably trivial and writing it more straightforwardly reduces the risk of bugs. But I'll try to come up with an alternate version that you might like better.

Regarding the -= 1 style, I wasn't sure. I'll change it back.

You're right that the tests currently only check CRLF input. That is a hack that I mentioned above but I should've put a comment in the code too. I honestly wasn't sure the best way to restructure the tests to check all three modes.

@jgm
Copy link
Member

jgm commented May 27, 2015

+++ Ben Trask [May 26 15 14:55 ]:

Thanks for getting back to this!

Regarding the calls to cmark_strbuf_truncate, it's a plain function call and the data is already going to be cached. I think the overhead is probably trivial and writing it more straightforwardly reduces the risk of bugs. But I'll try to come up with an alternate version that you might like better.

We're really going for maximum performance, so avoiding even one
unnecessary function call that gets called every line might help.

Regarding the -= 1 style, I wasn't sure. I'll change it back.

It's not that important. I only commented because your reason for
changing it was to match existing style.

You're right that the tests currently only check CRLF input. That is a hack that I mentioned above but I should've put a comment in the code too. I honestly wasn't sure the best way to restructure the tests to check all three modes.

You can let me take care of that. I was thinking we'd run all
the tests with all three line ending patterns. But now I'm thinking
that will take too long, and that it isn't necessary. Instead, I
may add a separate test for line ending handling.

@btrask
Copy link
Contributor Author

btrask commented May 28, 2015

Here is an alternate version.

int trim = 0;
bool cr = false;
bool lf = false;

[...]

// Add a newline to the end if not present:
// TODO this breaks abstraction:
if (parser->curline->size > trim &&
    parser->curline->ptr[parser->curline->size - 1 - trim] == '\n') {
    trim += 1;
    lf = true;
}
if (parser->curline->size > trim &&
    parser->curline->ptr[parser->curline->size - 1 - trim] == '\r') {
    trim += 1;
    cr = true;
}
if (cr) {
    cmark_strbuf_truncate(parser->curline, parser->curline->size - trim);
}
if (cr || !lf) {
    cmark_strbuf_putc(parser->curline, '\n');
}

For the average case with LF input, it saves two function calls per line. For CRLF input, it saves one function call per line. I haven't done any benchmarks but I strongly suspect any performance difference will be drowned by noise (maybe one microsecond for a large document).

I'll leave the tests to you. Thanks.

@jgm
Copy link
Member

jgm commented Jun 3, 2015

Note to self: I have merged this and pushed a branch line-endings.
I still need to look carefully at the changes and write proper tests before merging into master.

jgm added a commit that referenced this pull request Jun 3, 2015
From btrask's alternate code in the comment on
#18.

Note:  this gives a 1-2% performance boot in our benchmark,
probably enough to make it worth while.
@jgm jgm merged commit 60d8ded into commonmark:master Jun 3, 2015
@btrask
Copy link
Contributor Author

btrask commented Jun 5, 2015

Great work, thank you!

mbernson pushed a commit to mbernson/cmark that referenced this pull request Jul 6, 2020
talum pushed a commit to github/cmark-gfm that referenced this pull request Sep 14, 2021
From btrask's alternate code in the comment on
commonmark#18.

Note:  this gives a 1-2% performance boot in our benchmark,
probably enough to make it worth while.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CRLF support
3 participants