Fix header anchor normalization

Do not call tolower on non-ASCII chars because it would otherwise insert invalid UTF-8 bytes into the HTML output. (tolower is not locale-aware)

Invalid UTF-8 bytes will cause various errors, e.g. "ArgumentError (invalid byte sequence in UTF-8)", when rendering the generated HTML in Rails.

Signed-off-by: Clemens Gruber <clemensgru@gmail.com>
vmg · Nov 20, 2015 · 154c318 · lengerfulluse · Jul 7, 2016
1 parent da6d95b
commit 154c318
Showing 3 changed files with 14 additions and 1 deletion.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,11 @@
 # Changelog
 
+* Fix the header anchor normalization by skipping non-ASCII chars
+  and not calling tolower because this leads to invalid UTF-8 byte
+  sequences in the HTML output. (tolower is not locale-aware)
+
+  *Clemens Gruber*
+
 ## Version 3.3.3
 
 * Fix a memory leak instantiating a `Redcarpet::Render::Base` object.

diff --git a/ext/redcarpet/html.c b/ext/redcarpet/html.c
@@ -285,7 +285,7 @@ rndr_header_anchor(struct buf *out, const struct buf *anchor)
 			while (i < size && a[i] != '>')
 				i++;
 		}
-		else if (strchr(STRIPPED, a[i])) {
+		else if (!isascii(a[i]) || strchr(STRIPPED, a[i])) {
 			if (inserted && !stripped)
 				bufputc(out, '-');
 			stripped = 1;

diff --git a/test/html_render_test.rb b/test/html_render_test.rb
@@ -238,4 +238,11 @@ def test_no_styles_inside_html_block_rendering
 
     assert_no_match %r{<style>}, output
   end
+
+  def test_non_ascii_removal_in_header_anchors
+    markdown = "# Glühlampe"
+    html = "<h1 id=\"gl-hlampe\">Glühlampe</h1>\n"
+
+    assert_equal html, render(markdown, with: [:with_toc_data])
+  end
 end
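
The effect of the one-line change in `rndr_header_anchor` can be reproduced outside Redcarpet with a small standalone sketch. It is not the library's code: `sketch_anchor` and `SKETCH_STRIPPED` are hypothetical names, the stripped-character set is an assumed stand-in (the real `STRIPPED` definition is not shown in this diff), and a plain `char` array replaces `struct buf`. The idea it illustrates is the one from the commit: every byte with the high bit set belongs to a multi-byte UTF-8 sequence, so it is treated like a stripped character and never reaches `tolower`, which works on single bytes only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumed stand-in for the STRIPPED set in html.c (not shown in the diff). */
#define SKETCH_STRIPPED " -&+$,/:;=?@\"#{}|^~[]`\\*()%.!'"

/* Hypothetical helper: writes a lowercase, ASCII-only anchor for `text`
 * into `out` (at most out_size - 1 bytes plus the terminating NUL). */
static void sketch_anchor(const char *text, char *out, size_t out_size)
{
    size_t n = 0;
    int stripped = 0;   /* last input byte was dropped */
    int inserted = 0;   /* at least one byte has been emitted */

    assert(text != NULL && out != NULL && out_size > 0);

    for (size_t i = 0; text[i] != '\0' && n + 1 < out_size; i++) {
        unsigned char c = (unsigned char)text[i];

        /* Bytes >= 0x80 are parts of multi-byte UTF-8 sequences: skip
         * them instead of lowercasing them one byte at a time. */
        if (c >= 0x80 || strchr(SKETCH_STRIPPED, c)) {
            if (inserted && !stripped)
                out[n++] = '-';     /* one dash per run of dropped bytes */
            stripped = 1;
        } else {
            out[n++] = (char)tolower(c);  /* safe: c is plain ASCII here */
            stripped = 0;
            inserted = 1;
        }
    }

    /* Drop a trailing dash left by dropped bytes at the end of the text. */
    if (n > 0 && stripped && out[n - 1] == '-')
        n--;
    out[n] = '\0';
}
```

Called with the UTF-8 bytes of "Glühlampe", the sketch yields `gl-hlampe`, the same id the new test expects: the two bytes of "ü" collapse into a single dash. The cost of this approach is that non-ASCII letters disappear from the anchor entirely rather than being lowercased or transliterated.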