-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix unescape entities #119
Conversation
Character references are defined here: https://www.w3.org/TR/xml/#dt-charref CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';' This test includes the use of the snake emoji U+1F40D, which is encoded as two characters in UTF-16 (used by JavaScript).
This allows unescapeXML to correctly parse strings like &xmppjs#64; (@) and complex sequences like 🐍 (U+1F40D, Snake).
The normal numeric entity escaping now takes care of these characters.
The choice to strip illegal characters might not be the best choice; it might be better to throw an error; I'll gladly join in on the discussion. I can't find any processing instructions in the XML spec on what to do when it encounters invalid character references, but I know that XML 1.1 (not so widely used) expanded on the list. The test for Causing a parsing error on invalid entities seems like the norm, though. |
Throwing an error on illegal characters sounds saner to me too. Could you update your PR? Regarding XML 1.1, ltx doesn't support it, I'm not entirely against the idea but i'd like to move that to a separate conversation if it's of interest to you. Thanks! |
When a character reference is outside the bounds defined by XML v1.0, throw an error, seeing as the characters are not legal, and the document is invalid. Not doing this would probably open up attack vectors allowing the entry of binary data and control sequences into data structures.
ff29990
to
8027dbe
Compare
Done. Throwing an error feels a lot safer, since it keeps the data clean. I threw a generic Error instance, I didn't know if there's a better type of error to throw. The test was changed so that it expects this error on e.g. a
|
@mogsie released as 2.7.2 thanks a lot and sorry it took so long |
Hi! Thanks for this fast XML parser!
I hit this bug and thought I'd swap out with sax-js but a microbenchmark made it slow down XML parsing by a factor of 3, so I thought my time was better spent fixing the issue I have.
The problem is exposed in the tests. Essentially, if an XML file contains
{
or𒎫
it should treat these as character references meaning that it should take it as character 123 or the character parseInt('ABC12', 16) respectively.I hope I followed the coding standards; yarn test warned me about changing == to === so I did that :-)
Even though the change is simple, I split it into three commits, 40ab541 is the real "work" of the PR.