Skip to content

Commit

Permalink
Fix header anchor normalization
Browse files Browse the repository at this point in the history
Do not call tolower on non-ASCII chars because it would otherwise
insert invalid UTF-8 bytes into the HTML output. (tolower is not
locale-aware)

Invalid UTF-8 bytes will cause various errors, e.g. "ArgumentError
(invalid byte sequence in UTF-8)", when rendering the generated HTML in
Rails.

Signed-off-by: Clemens Gruber <clemensgru@gmail.com>
  • Loading branch information
clemensg committed Nov 20, 2015
1 parent da6d95b commit 154c318
Show file tree
Hide file tree
Showing 3 changed files with 14 additions and 1 deletion.
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,11 @@
# Changelog

* Fix the header anchor normalization by skipping non-ASCII chars
and not calling tolower because this leads to invalid UTF-8 byte
sequences in the HTML output. (tolower is not locale-aware)

*Clemens Gruber*

## Version 3.3.3

* Fix a memory leak instantiating a `Redcarpet::Render::Base` object.
Expand Down
2 changes: 1 addition & 1 deletion ext/redcarpet/html.c
Original file line number Diff line number Diff line change
Expand Up @@ -285,7 +285,7 @@ rndr_header_anchor(struct buf *out, const struct buf *anchor)
while (i < size && a[i] != '>')
i++;
}
else if (strchr(STRIPPED, a[i])) {
else if (!isascii(a[i]) || strchr(STRIPPED, a[i])) {
if (inserted && !stripped)
bufputc(out, '-');
stripped = 1;
Expand Down
7 changes: 7 additions & 0 deletions test/html_render_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -238,4 +238,11 @@ def test_no_styles_inside_html_block_rendering

assert_no_match %r{<style>}, output
end

def test_non_ascii_removal_in_header_anchors
markdown = "# Glühlampe"
html = "<h1 id=\"gl-hlampe\">Glühlampe</h1>\n"

assert_equal html, render(markdown, with: [:with_toc_data])
end
end

6 comments on commit 154c318

@lengerfulluse
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @clemensg , this change will prevent redcarpet rendering correct header id for unicode characters(for example, Chinese, Japanese etc). And will result in generating the incorrect toc.
Since you mentioned in the commit description that non-ascii caused invalid UTF-8 bytes, can we use some more general condition check(such as isutf8() or similar) instead of isascii()? It will help a lot for non-english users.
Thanks in advance!

@clemensg
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lengerfulluse , the problem is that you can't just call tolower() on UTF-8 characters. Changing the case of UTF-8 characters is nontrivial, we would need to link with ICU, which is a rather big dependency.
But maybe we don't need to convert them to lowercase if !isascii() and just keep all non-ASCII characters as they are?
(AFAIK it's not a requirement for the header IDs to be lowercase. They need to be unique and without spaces, but the case does not matter)

@robin850 What do you think about skipping tolower() for all non-ASCII characters in header anchors to allow UTF-8 ids?

This would change the expected result of my test case above to <h1 id=\"glühlampe\">Glühlampe</h1>\n.

Should I send a PR?

@lengerfulluse
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clemensg i don't think it's the problem of tolower(). the tolower() implementation in ctype.c will not convert them to lowercase if the character is non-ASCII, it just echo back. you can refer the code snap here(line#320)
And the else if (!isascii(a[i]) || strchr(STRIPPED, a[i])) { condition check will just stripped non-ascii characters, which causes the empty html header ids. Similar issues are already opened:

As you suggested, we can just keep the original non-ascii characters but not stripped. Or we can use url encode the non-ascill characters if the non-ascill could potentially cause other problems.

@clemensg
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lengerfulluse , if I remember correctly there were indeed problems with non-ASCII characters. Ruby threw an ArgumentError (invalid byte sequence in UTF-8) exception. I assumed that tolower() was responsible for creating those invalid byte sequences but that is not possible, as you have shown.

What about hashing the strings and using those hashes as header IDs? SHA1?
But URL encoding would be fine as well. I don't have much time at the moment, so please go ahead and create a PR.
If you want me to create the PR, I can look into it next week.

Please let me know how you want to proceed.

@reedloden
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #539, which already does that. Would be nice to get that reviewed and landed...

@lengerfulluse
Copy link

@lengerfulluse lengerfulluse commented on 154c318 Jul 14, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reedloden cool! i think that fix works for the problem.

Another thought from the perspective of i18n, since the users of redcarpet are from all over the world with their own locale/LANG setting, we can just keep the same render strategy for both english and non-english users.

So i do think if we already keep the english word, numbers in head id, we should also keep other languages' characters(with valid UTF-8 encoding for example) in the head id too. Or else, we can just use the hash for both of them and prefix with toc_ , which is more efficiency and with little complex logic.

Please sign in to comment.