Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix unescape entities #119

Closed
wants to merge 4 commits into from

Conversation

mogsie
Copy link
Contributor

@mogsie mogsie commented Jun 15, 2018

Hi! Thanks for this fast XML parser!

I hit this bug and thought I'd swap out with sax-js but a microbenchmark made it slow down XML parsing by a factor of 3, so I thought my time was better spent fixing the issue I have.

The problem is exposed in the tests. Essentially, if an XML file contains { or 𒎫 it should treat these as character references meaning that it should take it as character 123 or the character parseInt('ABC12', 16) respectively.
I hope I followed the coding standards; yarn test warned me about changing == to === so I did that :-)

Even though the change is simple, I split it into three commits, 40ab541 is the real "work" of the PR.

mogsie added 3 commits June 15, 2018 15:09
Character references are defined here:
  https://www.w3.org/TR/xml/#dt-charref

    CharRef   ::=     '&#' [0-9]+ ';'
                      | '&#x' [0-9a-fA-F]+ ';'

This test includes the use of the snake emoji U+1F40D, which is
encoded as two characters in UTF-16 (used by JavaScript).
This allows unescapeXML to correctly parse strings like &xmppjs#64; (@) and
complex sequences like 🐍 (U+1F40D, Snake).
The normal numeric entity escaping now takes care of these characters.
@mogsie
Copy link
Contributor Author

mogsie commented Jun 15, 2018

The choice to strip illegal characters might not be the best choice; it might be better to throw an error; I'll gladly join in on the discussion. I can't find any processing instructions in the XML spec on what to do when it encounters invalid character references, but I know that XML 1.1 (not so widely used) expanded on the list. The test for � is invalid in both XML 1.0 and XML 1.1 though.

Causing a parsing error on invalid entities seems like the norm, though.

@sonnyp
Copy link
Member

sonnyp commented Sep 13, 2018

Throwing an error on illegal characters sounds saner to me too. Could you update your PR?

Regarding XML 1.1, ltx doesn't support it, I'm not entirely against the idea but i'd like to move that to a separate conversation if it's of interest to you.

Thanks!

When a character reference is outside the bounds defined by XML v1.0,
throw an error, seeing as the characters are not legal, and the
document is invalid.

Not doing this would probably open up attack vectors allowing the
entry of binary data and control sequences into data structures.
@mogsie mogsie force-pushed the bugfix-unescape-entities branch from ff29990 to 8027dbe Compare September 13, 2018 12:57
@mogsie
Copy link
Contributor Author

mogsie commented Sep 14, 2018

Done. Throwing an error feels a lot safer, since it keeps the data clean. I threw a generic Error instance, I didn't know if there's a better type of error to throw. The test was changed so that it expects this error on e.g. a � character entity. This is in line with other tools like xmllint:

$ echo '<a>&#0;</a>' | xmllint --format -
-:1: parser error : xmlParseCharRef: invalid xmlChar value 0
<a>&#0;</a>
       ^

@sonnyp sonnyp mentioned this pull request Sep 15, 2018
@sonnyp
Copy link
Member

sonnyp commented Sep 15, 2018

@mogsie thanks a lot, rebased here #123

@sonnyp sonnyp closed this Sep 15, 2018
@mogsie mogsie deleted the bugfix-unescape-entities branch September 17, 2018 07:54
@sonnyp
Copy link
Member

sonnyp commented Sep 22, 2018

@mogsie released as 2.7.2 thanks a lot and sorry it took so long

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants