Bugfix unescape entities #119

mogsie · 2018-06-15T13:16:47Z

Hi! Thanks for this fast XML parser!

I hit this bug and thought I'd swap out with sax-js but a microbenchmark made it slow down XML parsing by a factor of 3, so I thought my time was better spent fixing the issue I have.

The problem is exposed in the tests. Essentially, if an XML file contains { or 𒎫 it should treat these as character references meaning that it should take it as character 123 or the character parseInt('ABC12', 16) respectively.
I hope I followed the coding standards; yarn test warned me about changing == to === so I did that :-)

Even though the change is simple, I split it into three commits, 40ab541 is the real "work" of the PR.

Character references are defined here: https://www.w3.org/TR/xml/#dt-charref CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';' This test includes the use of the snake emoji U+1F40D, which is encoded as two characters in UTF-16 (used by JavaScript).

This allows unescapeXML to correctly parse strings like &xmppjs#64; (@) and complex sequences like 🐍 (U+1F40D, Snake).

The normal numeric entity escaping now takes care of these characters.

mogsie · 2018-06-15T14:28:15Z

The choice to strip illegal characters might not be the best choice; it might be better to throw an error; I'll gladly join in on the discussion. I can't find any processing instructions in the XML spec on what to do when it encounters invalid character references, but I know that XML 1.1 (not so widely used) expanded on the list. The test for  is invalid in both XML 1.0 and XML 1.1 though.

Causing a parsing error on invalid entities seems like the norm, though.

sonnyp · 2018-09-13T07:42:12Z

Throwing an error on illegal characters sounds saner to me too. Could you update your PR?

Regarding XML 1.1, ltx doesn't support it, I'm not entirely against the idea but i'd like to move that to a separate conversation if it's of interest to you.

Thanks!

When a character reference is outside the bounds defined by XML v1.0, throw an error, seeing as the characters are not legal, and the document is invalid. Not doing this would probably open up attack vectors allowing the entry of binary data and control sequences into data structures.

mogsie · 2018-09-14T09:15:37Z

Done. Throwing an error feels a lot safer, since it keeps the data clean. I threw a generic Error instance, I didn't know if there's a better type of error to throw. The test was changed so that it expects this error on e.g. a  character entity. This is in line with other tools like xmllint:

$ echo '<a>&#0;</a>' | xmllint --format -
-:1: parser error : xmlParseCharRef: invalid xmlChar value 0
<a>&#0;</a>
       ^

sonnyp · 2018-09-15T16:19:15Z

@mogsie thanks a lot, rebased here #123

sonnyp · 2018-09-22T11:32:10Z

@mogsie released as 2.7.2 thanks a lot and sorry it took so long

mogsie added 3 commits June 15, 2018 15:09

unescapeXML decodes any numeric entity, not just a whitelisted set

40ab541

This allows unescapeXML to correctly parse strings like &xmppjs#64; (@) and complex sequences like 🐍 (U+1F40D, Snake).

Remove unnecessary whitelisted numeric characters

9cc1867

The normal numeric entity escaping now takes care of these characters.

mogsie force-pushed the bugfix-unescape-entities branch from ff29990 to 8027dbe Compare September 13, 2018 12:57

sonnyp mentioned this pull request Sep 15, 2018

Bugfix unescape entities #123

Merged

sonnyp closed this Sep 15, 2018

mogsie deleted the bugfix-unescape-entities branch September 17, 2018 07:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfix unescape entities #119

Bugfix unescape entities #119

mogsie commented Jun 15, 2018 •

edited

Loading

mogsie commented Jun 15, 2018

sonnyp commented Sep 13, 2018 •

edited

Loading

mogsie commented Sep 14, 2018

sonnyp commented Sep 15, 2018

sonnyp commented Sep 22, 2018

Bugfix unescape entities #119

Bugfix unescape entities #119

Conversation

mogsie commented Jun 15, 2018 • edited Loading

mogsie commented Jun 15, 2018

sonnyp commented Sep 13, 2018 • edited Loading

mogsie commented Sep 14, 2018

sonnyp commented Sep 15, 2018

sonnyp commented Sep 22, 2018

mogsie commented Jun 15, 2018 •

edited

Loading

sonnyp commented Sep 13, 2018 •

edited

Loading