-
-
Notifications
You must be signed in to change notification settings - Fork 558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CRLF support #18
CRLF support #18
Conversation
if (b->end_column && parser->curline->ptr[b->end_column-1] == '\n') | ||
b->end_column -= 1; | ||
if (b->end_column && parser->curline->ptr[b->end_column-1] == '\r') | ||
b->end_column -= 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the second if
statement only be executed if the first condition is true?
The checks are carefully set up to handle CR, LF, and CRLF at the same time. Basically you check for CR first, then increment, then check for LF. So if it's CRLF you get both. In some cases it's scanning backwards so it checks LF first. CR-only line endings are used on some systems (not as common any more, according to Wikipedia). I didn't see any harm in supporting them. (LFCR is unsupported since I've never seen it used.) I think you're right about it not handling buffer boundaries though. |
OK, so the intention is to support single CRs as well. I don't have a problem with that. @jgm I think the spec could also be improved in that regard. It currently says:
I don't like the "depending on the platform" part. This basically means that CommonMark is interpreted differently depending on the platform. Essentially, this makes it impossible to reliably transfer files between platforms. Also, what is the platform in the context of a web application? Can't we just drop that part and have a single definition for newlines across all platforms? |
+++ Nick Wellnhofer [Mar 24 15 08:15 ]:
I'm sympathetic with this. To explain what my thinking was: the parser operates on a sequence of characters separated into lines. How these characters and line breaks are encoded in bytes is a concrete implementation detail that the parser does not specify. So, you might have latin1 encoding and LFs as line breaks, or UTF8 encoding and CRLF line breaks. But there are really three levels of abstraction we could aim for:
I think what I had in mind was roughly 3, but the spec didn't make this too clear. We could switch to 2 or even 1. This may need further discussion on talk.commonmark.org. Further question: if we said explicitly that any of these combinations counted as a line ending, would we allow these styles to be mixed in the same document? |
@jgm My proposal is to aim for the second approach. I think the first approach doesn't make much sense. It limits the CommonMark format to a single encoding (UTF-8). And yes, I would make any of CR (not followed by LF), LF (not preceded by CR) and CR+LF count as line ending, also allowing different line endings in the same document. This seems like the only workable approach to cross-platform line endings. It should match the behavior this pull request implements. Not sure about the JS implementation, though. |
I found some bugs with underlined headers. I think what's happening is that I looked into getting the tests running with different line endings but I haven't figured it out yet. Without tests it's hard to tell how much is broken. |
Tests are essential. For something quick and dirty,
can't you just do a regex replace on the markdown content
just before it's passed to the program (line 39 of spec_test.py)?
|
I'll look into it, thanks. |
Fixed three bugs, now all the tests pass! The tests themselves are just hacked up, right now they are CRLF only. I also tested with CR and that works too. I'm not sure why the Travis build is failing... |
Fixed :) |
@jgm have you had a chance to look at this yet? I can try to work on integrating the tests better if that's what you're waiting for. |
+++ Ben Trask [Apr 21 15 14:59 ]:
No, I've just been busy. I will get to it, though. |
@btrask I've had a look and added some line comments. Aside from those things, it looks good. Thanks for doing this. @nwellnhof I also checked commonmark.js, and it seems to behave the same way as cmark does with this change. So, we just need to revise the spec. Can you add an issue on that to jgm/CommonMark? |
Thanks for getting back to this! Regarding the calls to Regarding the You're right that the tests currently only check CRLF input. That is a hack that I mentioned above but I should've put a comment in the code too. I honestly wasn't sure the best way to restructure the tests to check all three modes. |
+++ Ben Trask [May 26 15 14:55 ]:
We're really going for maximum performance, so avoiding even one
It's not that important. I only commented because your reason for
You can let me take care of that. I was thinking we'd run all |
Here is an alternate version. int trim = 0;
bool cr = false;
bool lf = false;
[...]
// Add a newline to the end if not present:
// TODO this breaks abstraction:
if (parser->curline->size > trim &&
parser->curline->ptr[parser->curline->size - 1 - trim] == '\n') {
trim += 1;
lf = true;
}
if (parser->curline->size > trim &&
parser->curline->ptr[parser->curline->size - 1 - trim] == '\r') {
trim += 1;
cr = true;
}
if (cr) {
cmark_strbuf_truncate(parser->curline, parser->curline->size - trim);
}
if (cr || !lf) {
cmark_strbuf_putc(parser->curline, '\n');
} For the average case with LF input, it saves two function calls per line. For CRLF input, it saves one function call per line. I haven't done any benchmarks but I strongly suspect any performance difference will be drowned by noise (maybe one microsecond for a large document). I'll leave the tests to you. Thanks. |
Note to self: I have merged this and pushed a branch |
From btrask's alternate code in the comment on #18. Note: this gives a 1-2% performance boot in our benchmark, probably enough to make it worth while.
Great work, thank you! |
merge from upstream
From btrask's alternate code in the comment on commonmark#18. Note: this gives a 1-2% performance boot in our benchmark, probably enough to make it worth while.
Fixes #14.
Note that this has some failing tests due to #17. However I don't think any new problems were introduced with these changes.