Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Antisamy 1.7.5 version - <body> tag issue #453

Closed
jeetu22 opened this issue May 24, 2024 · 34 comments
Closed

Antisamy 1.7.5 version - <body> tag issue #453

jeetu22 opened this issue May 24, 2024 · 34 comments

Comments

@jeetu22
Copy link

jeetu22 commented May 24, 2024

The Antisamy library versions above 1.7.2 require a <body> tag in the HTML page; otherwise, it causes the HTML to break. Here's an example of the input HTML:

<html lang="en">
<head>
</head> 
<table class="container" cellpadding="0" cellspacing="0" align="center">    
<SELECT NAME="Lang">
<OPTION VALUE="da">Dansk</OPTION>
<OPTION VALUE="en" selected=selected>English</OPTION>
</SELECT>
</table>
</html>

The output produced is:

<html lang="en">
<head>
    <table class="container" cellpadding="0" cellspacing="0" align="center"></table>
    **<select name="Lang"></select>**
    <option value="da">Dansk</option>
    <option value="en" selected="selected">English</option>
</head>
</html>

As you can see, the <select> tag closes on the same line, causing the dropdown to malfunction and breaking the HTML page. This issue does not occur in Antisamy version 1.7.2 and earlier but appears in versions after 1.7.2. We are upgrading Antisamy in our project to version 1.7.5, but this issue is causing the complete HTML page to become distorted.

@jeetu22 jeetu22 changed the title Antisamy 1.7.5 version - <body> tag issue Antisamy 1.7.5 version - issue May 24, 2024
@rbri
Copy link
Contributor

rbri commented May 24, 2024

Will have a look...

@jeetu22 jeetu22 changed the title Antisamy 1.7.5 version - issue Antisamy 1.7.5 version - <body> tag issue May 24, 2024
rbri added a commit to HtmlUnit/htmlunit-neko that referenced this issue May 24, 2024
rbri added a commit to HtmlUnit/htmlunit-neko that referenced this issue May 24, 2024
rbri added a commit to HtmlUnit/htmlunit that referenced this issue May 24, 2024
@rbri
Copy link
Contributor

rbri commented May 24, 2024

Had a quick look and have added a test case to neko. From this first look it seems like neko works ok. Maybe the tag is closed by some cleanup?

@jeetu22
Copy link
Author

jeetu22 commented May 24, 2024

Had a quick look and have added a test case to neko. From this first look it seems like neko works ok. Maybe the tag is closed by some cleanup?

can antisamy version 1.7.5 adds <body > tag if its missed or not added ?

@rbri
Copy link
Contributor

rbri commented May 24, 2024

@jeetu22 i do not know so much about the inner workings of antisamy but i'm responsible for the neko-htmlunit parser (https://github.com/HtmlUnit/htmlunit-neko) used by antisamy to parse the html file and convert it into a dom tree. During this process some cleanup is done to form a valid dom (or emit valid sax events).

And yes missing body (start) elements are added for from valid dom trees. Proving this for your case was exactly the reason to write the additional test case for the parser.

@jeetu22
Copy link
Author

jeetu22 commented May 24, 2024

@jeetu22 i do not know so much about the inner workings of antisamy but i'm responsible for the neko-htmlunit parser (https://github.com/HtmlUnit/htmlunit-neko) used by antisamy to parse the html file and convert it into a dom tree. During this process some cleanup is done to form a valid dom (or emit valid sax events).

And yes missing body (start) elements are added for from valid dom trees. Proving this for your case was exactly the reason to write the additional test case for the parser.

Thank you very much!!. as antisamy uses neko internally , anyone from Antisamy who can guide us in this scenario.i m suspecting HTMLScanner.java is modifying DOM

@rbri
Copy link
Contributor

rbri commented May 24, 2024

Maybe the org.owasp.validator.html.scan.MagicSAXFilter is the one - but only a guess.

@jeetu22
Copy link
Author

jeetu22 commented May 24, 2024

Maybe the org.owasp.validator.html.scan.MagicSAXFilter is the one - but only a guess.

@rbri

public void selectInsideEmptyTable() throws Exception {
       final String html = "<html><head></head><body>\n"
               + "<table><select name='Lang'><option value='da'>Dansk</option></select></table>\n"
               + "<script>\n"
               + LOG_TITLE_FUNCTION
               + "log(document.body.childNodes.length);\n"
               + "log(document.body.children.length);\n"
               + "log(document.body.children[0]);\n"
               + "log(document.body.children[1]);\n"
               + "log(document.body.children[2]);\n"
               + "</script>\n"
               + "</body></html>";

       expandExpectedAlertsVariables(URL_FIRST);

       loadPageVerifyTitle2(html);
   }

can you please remove body tag from this Junit test case and assert that output HTML should contains <body> tag.

@davewichers
Copy link
Collaborator

@spassarop - Can you look into this with @rbri?

@rbri
Copy link
Contributor

rbri commented May 26, 2024

i guess i found the reason - will analyze this a bit more

@rbri
Copy link
Contributor

rbri commented May 26, 2024

Ok, antisamy is using the fragment parser instead of the document parser; with the fragment parser i can reproduce the problem.
Will require some time to fix that.

@spassarop
Copy link
Collaborator

Thanks @rbri for being so proactive with this.

@jeetu22, even though @rbri seem to have reproduced the problem to debug, it would be useful if you provide how are you calling AntiSamy and what policy you are using. These factors make AntiSamy decide if it should use DOM or SAX parser, o which tags to preserve.

@jeetu22
Copy link
Author

jeetu22 commented May 27, 2024

Thanks @rbri for being so proactive with this.

@jeetu22, even though @rbri seem to have reproduced the problem to debug, it would be useful if you provide how are you calling AntiSamy and what policy you are using. These factors make AntiSamy decide if it should use DOM or SAX parser, o which tags to preserve.

we are using SAX parser.
parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);

@rbri
Copy link
Contributor

rbri commented May 27, 2024

parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);

@jeetu22, setting this feature for the parser changes the behavior in some ways. One of the effects is the one you are facing - the tag balancer no longer adds missing body tags. But there are some others also.

As promised i will have a look at all that - at the moment i'm thinking about why antisamy should use the fragment way of parsing at all. Because i'm working on all this in my spare time and i have some other private things on my todo list, please be a bit patient to do not see a fix in the next hours ;-)

@jeetu22
Copy link
Author

jeetu22 commented May 27, 2024

parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);

@jeetu22, setting this feature for the parser changes the behavior in some ways. One of the effects is the one you are facing - the tag balancer no longer adds missing body tags. But there are some others also.

As promised i will have a look at all that - at the moment i'm thinking about why antisamy should use the fragment way of parsing at all. Because i'm working on all this in my spare time and i have some other private things on my todo list, please be a bit patient to do not see a fix in the next hours ;-)

Thank you for the update! I appreciate you looking into the issue.Given your busy schedule, I completely understand that a fix might take some time.

Please take the time you need, and I look forward to your findings.

Thanks again for your efforts!

@spassarop
Copy link
Collaborator

spassarop commented May 27, 2024 via email

rbri added a commit to rbri/antisamy that referenced this issue May 27, 2024
use 4.2.0-SNAPSHOT
@rbri
Copy link
Contributor

rbri commented May 27, 2024

After some thinking

  • i have an idea why the fragment parser is used - at least form the tests it looks like antisamy also can clean html snippets, not only complete html pages
  • i have improved the fragment parser in a way that the parser now takes care of an existing html tag - if this was passed before the automatic generation of head and body tags is enabled also in the fragment mode
  • have added the issue as test case

@kwwall
Copy link
Contributor

kwwall commented May 27, 2024

@rbri wrote:

  • i have an idea why the fragment parser is used - at least form the tests it looks like antisamy also can clean html snippets, not only complete html pages

That is exactly why! In fact, I think that is the most common use case for HTML sanitizers in general. There's generally some user input that you might capture that only allows some specific mark-up (and which mark up may be vary from one use to another) and you want to sanitize that to make it safe to use it in a broader context of an application generated page. I think it's rare that AntiSamy or the OWASP HTML Sanitizer project would get a complete HTML page to sanitize. That's certainly a valid use case too, but just not one that is as common. If AntiSamy ditched the fragment parser, then I think that ESAPI would have to ditch AntiSamy because dealing with HTML fragments is what Validator.getValidSafeHTML is generally expecting.

@spassarop
Copy link
Collaborator

spassarop commented May 27, 2024 via email

@rbri
Copy link
Contributor

rbri commented Jun 1, 2024

@davewichers @spassarop fix is ready in PR #454

@davewichers
Copy link
Collaborator

@spassarop - can you create test case for this situation that fails, and then verify that it now passes with his snapshot version?

@rbri
Copy link
Contributor

rbri commented Jun 1, 2024

My PR includes such an test case...

@spassarop
Copy link
Collaborator

Hahaha yeah, our man here is one step ahead ;)

@rbri
Copy link
Contributor

rbri commented Jun 4, 2024

neko 4.2.0 released

@jeetu22
Copy link
Author

jeetu22 commented Jun 5, 2024

Ok, antisamy is using the fragment parser instead of the document parser; with the fragment parser i can reproduce the problem. Will require some time to fix that.

@rbri , can you confirm if the Neko 4.2.0 release resolves the above issue?

@rbri
Copy link
Contributor

rbri commented Jun 5, 2024

@jeetu22 thats the goal - but i guess there will be a new release of antisamy itself soon (see #454 for more details)

@jeetu22
Copy link
Author

jeetu22 commented Jun 5, 2024

@jeetu22 thats the goal - but i guess there will be a new release of antisamy itself soon (see #454 for more details)

i tried with Antisamy:1.7.5 and neko-4.2.0 many testcases are failing in AntisamyTest.java
one such example:
image

@rbri
Copy link
Contributor

rbri commented Jun 5, 2024

@jeetu22 strange - have done this right now

  • checkout the current code of the antisamy project
  • change the version of neko in the pom
  • run 'mvn clean test'

image

@rbri
Copy link
Contributor

rbri commented Jun 5, 2024

for me this looks like you still have an old version of neko somewhere in your class path... can you please provide the whole stack trace...

davewichers pushed a commit that referenced this issue Jun 6, 2024
* add test for issue #453
use 4.2.0-SNAPSHOT

* code style

* add neko-htmlunit snapshot repo

* use neko 4.2.0 release

* neko-htmlunit version 4.2.1

* remove property
@jeetu22
Copy link
Author

jeetu22 commented Jun 13, 2024

checking , will update you @rbri .

@jeetu22
Copy link
Author

jeetu22 commented Jul 3, 2024

Hi @rbri,

We've thoroughly tested Antisamy 1.7.6-SNAPSHOT and found that both the workflow and UI are working fine. It would be great if you could provide a tentative release date for the non-snapshot version.

Thank you!

@davewichers
Copy link
Collaborator

@rbri - we are working on that right now. We are trying to figure out how to address issue #456, which is turning out a bit more complicated than we thought. As soon as we get this addressed, we'll do a release. Hopefully in the next week or so.

@rbri
Copy link
Contributor

rbri commented Jul 3, 2024

In between neko 4.3.0 is out (https://github.com/HtmlUnit/htmlunit-neko/releases) - you should be able to safely switch to this one.

@rbri
Copy link
Contributor

rbri commented Jul 3, 2024

oh you are already at 4.3.0...

@davewichers
Copy link
Collaborator

This was fixed in release 1.7.6 which went out yesterday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants