-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Encoding ampersands on HTML entities when parser.decodeEntities = false #249
Comments
Not all parser options can be safely combined with sanitize-html, which has
to set some in a predictable way in order to work.
…On Wed, Aug 29, 2018 at 11:01 AM, WillGibson ***@***.***> wrote:
E.g. Code like...
const text = 'This & that ®';
const sanitizeHtmlOptions = {
parser: {
decodeEntities: false
}
};
demand(sanitizeHtml(text, sanitizeHtmlOptions)).equal(text);
...results in...
AssertionError: "This & that &reg" must equal "This & that ®"
+ expected - actual
-This & that &reg
+This & that ®
I'm guessing that this behaviour is not intended?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#249>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAB9fak9B_pK04UZo8A63dmMYjDCazABks5uVqzVgaJpZM4WRvGx>
.
--
*THOMAS BOUTELL, CHIEF SOFTWARE ARCHITECT*
P'UNK AVENUE | (215) 755-1330 | punkave.com
|
I would expect it not to do the extra encode on the ampersands. I could have a look at whether I can make a PR that makes it behave like that if you set |
It would have to pass all of the unit tests and introduce no XSS
vulnerabilities.
…On Wed, Aug 29, 2018 at 11:12 AM, WillGibson ***@***.***> wrote:
I would expect it not to do the extra encode on the ampersands.
I could have a look at whether I can make a PR that makes it behave like
that if you set parser.decodeEntities = false in the morning if you think
that would be desirable?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#249 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAB9fSBgtRiJN3t_P2SjXLwKKIqDcBLAks5uVq9xgaJpZM4WRvGx>
.
--
*THOMAS BOUTELL, CHIEF SOFTWARE ARCHITECT*
P'UNK AVENUE | (215) 755-1330 | punkave.com
|
On it |
I think this change caused some regression.
produces |
@WillGibson did you miss a |
Or is it the presence of the |
I'm on holiday now with no laptop. If no-one has addressed it before I get back I'll add this example to the tests and code away until they all go green again. |
The regex will be picking up |
A fix for the regression with the module's default parser settings is under
internal code review. I also added a commented-out test demonstrating
Timo's issue but bear in mind, it must be a rigorous validation of actually
valid entities only to really meet the requirement. So for now
decodeEntities: false is not recommended. (It has never been the default or
a suggested configuration though.)
…On Fri, Sep 28, 2018 at 8:04 AM WillGibson ***@***.***> wrote:
The regex will be picking up &0; as if it's an ampersand that's already
part of an HTML entity and therefore not encoding it.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#249 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAB9fdnpTPfuEW7inUU3w8PgoWBvrZ1Eks5ufhBTgaJpZM4WRvGx>
.
--
*Thomas Boutell, Chief Software Architect*
P'unk Avenue | (215) 755-1330 | punkave.com
|
The fix for the default case has been published to npm. decodeEntities: false is still broken as described. |
@boutell Re: "I can see that it may not be as simple as we thought to detect the valid cases for leaving the & alone.". I agree. It's a good challenge though, I'm going to have a go at it! |
Hmmm, encoding ampersands without double encoding them on HTML entities is indeed hard given that the range of strings that can make up a valid HTML entity is so wide. There are too many edge cases. In our use case, we want to strip all the stuff that isn't text, but keep HTML entities encoded. After all this, I'm wondering if I should not have left sanitise-html alone and just used something else to encode the HTML entities after the text comes out clean from sanitise-html. |
Any follow up on this? |
Trying to make a regex to pick up Short of maintaining a full list of HTML entities, or using another package which already does that, I'm not sure how to deal with it. I'm open to suggestions. |
any update on this?
|
ahan now I see, the concern the library author is having is that if decodeEntities is false < and > are allowed as is which means a vulnerability, isnt there some way to say opt out for some tags but not for the others, & can probably be exempted but <> are encoded |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
E.g. Code like...
...results in...
I'm guessing that this behaviour is not intended?
The text was updated successfully, but these errors were encountered: