-
Notifications
You must be signed in to change notification settings - Fork 525
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Do not call tolower on non-ASCII chars because it would otherwise insert invalid UTF-8 bytes into the HTML output. (tolower is not locale-aware) Invalid UTF-8 bytes will cause various errors, e.g. "ArgumentError (invalid byte sequence in UTF-8)", when rendering the generated HTML in Rails. Signed-off-by: Clemens Gruber <clemensgru@gmail.com>
- Loading branch information
Showing
3 changed files
with
14 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
154c318
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @clemensg , this change will prevent redcarpet rendering correct header id for unicode characters(for example, Chinese, Japanese etc). And will result in generating the incorrect toc.
Since you mentioned in the commit description that non-ascii caused invalid UTF-8 bytes, can we use some more general condition check(such as isutf8() or similar) instead of isascii()? It will help a lot for non-english users.
Thanks in advance!
154c318
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lengerfulluse , the problem is that you can't just call
tolower()
on UTF-8 characters. Changing the case of UTF-8 characters is nontrivial, we would need to link with ICU, which is a rather big dependency.But maybe we don't need to convert them to lowercase if
!isascii()
and just keep all non-ASCII characters as they are?(AFAIK it's not a requirement for the header IDs to be lowercase. They need to be unique and without spaces, but the case does not matter)
@robin850 What do you think about skipping
tolower()
for all non-ASCII characters in header anchors to allow UTF-8 ids?This would change the expected result of my test case above to
<h1 id=\"glühlampe\">Glühlampe</h1>\n
.Should I send a PR?
154c318
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clemensg i don't think it's the problem of tolower(). the tolower() implementation in ctype.c will not convert them to lowercase if the character is non-ASCII, it just echo back. you can refer the code snap here(line#320)
And the
else if (!isascii(a[i]) || strchr(STRIPPED, a[i])) {
condition check will just stripped non-ascii characters, which causes the empty html header ids. Similar issues are already opened:As you suggested, we can just keep the original non-ascii characters but not stripped. Or we can use url encode the non-ascill characters if the non-ascill could potentially cause other problems.
154c318
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lengerfulluse , if I remember correctly there were indeed problems with non-ASCII characters. Ruby threw an
ArgumentError (invalid byte sequence in UTF-8)
exception. I assumed thattolower()
was responsible for creating those invalid byte sequences but that is not possible, as you have shown.What about hashing the strings and using those hashes as header IDs? SHA1?
But URL encoding would be fine as well. I don't have much time at the moment, so please go ahead and create a PR.
If you want me to create the PR, I can look into it next week.
Please let me know how you want to proceed.
154c318
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #539, which already does that. Would be nice to get that reviewed and landed...
154c318
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reedloden cool! i think that fix works for the problem.
Another thought from the perspective of i18n, since the users of redcarpet are from all over the world with their own locale/LANG setting, we can just keep the same render strategy for both english and non-english users.
So i do think if we already keep the english word, numbers in head id, we should also keep other languages' characters(with valid UTF-8 encoding for example) in the head id too. Or else, we can just use the hash for both of them and prefix with toc_ , which is more efficiency and with little complex logic.