Parser.xmlParser() no longer corrects tags that may not be self-closing #2040

Closed
hnljp opened this issue Nov 7, 2023 · 2 comments

hnljp commented Nov 7, 2023

With 1.16.1 and earlier, the following test passed:

        String testIframe = "<iframe src=\"https://example.com\"/>";
        String expectedResult = "<iframe src=\"https://example.com\"></iframe>";

        Document document = Jsoup.parse(testIframe, "", Parser.xmlParser());
        document.outputSettings().indentAmount(0).prettyPrint(false);
        String result = document.html();

        assertEquals(expectedResult, result);

With 1.16.2, <iframe src="https://example.com"/> remains <iframe src="https://example.com"/>.
Is it intentional that tags that are not allowed to be self-closing now only get fixed when using the htmlParser?

Owner

jhy commented Nov 10, 2023

Hi there,

The intention of the XML parser is to be a generic parser, and not follow the specific rules of the HTML parser.

Your snippet worked originally by a fluke of the implementation - the Tag object which holds these formatting rules was intended for the HTML parser and not the XML parser. In HTML we know what an iframe element is and how it should behave. But in the XML parser we shouldn't assume that.

I didn't really intend for the XML parser to be used as a kind of semi-html parser, using some of the rules from HTML and ignoring others.

So this changed in #2008 when I added basic namespace support, to enable Math and SVG tags. Now the Tag set is namespaced, and in the XML parser, the HTML namespace is not set and therefore there is no matching iframe tag to return, and so there's no setting to disable self-closing.
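
To make the lookup behaviour described above concrete, here is a minimal toy model in plain Java. It is not jsoup's code: the class NamespacedTagLookup, its methods, and the GENERIC_XML_NS value are all hypothetical names made up for illustration. It only shows the general idea that a rule registered under the HTML namespace is not found when the lookup is made under a different namespace, so the serializer falls back to leaving the self-closing form alone.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a namespaced registry of "may this tag self-close?" rules.
// All names here are hypothetical and do not mirror jsoup internals.
class NamespacedTagLookup {
    static final String HTML_NS = "http://www.w3.org/1999/xhtml";
    // Assumption: a made-up namespace standing in for "generic XML, no HTML rules".
    static final String GENERIC_XML_NS = "urn:example:generic-xml";

    // Key is namespace + "|" + tag name; value is whether <tag/> is allowed.
    private final Map<String, Boolean> canSelfClose = new HashMap<>();

    void register(String namespace, String tagName, boolean selfClosingAllowed) {
        canSelfClose.put(namespace + "|" + tagName, selfClosingAllowed);
    }

    // Serialize an empty element. Unknown tags get no correction at all.
    String serializeEmpty(String namespace, String tagName) {
        boolean allowed = canSelfClose.getOrDefault(namespace + "|" + tagName, true);
        return allowed
                ? "<" + tagName + "/>"
                : "<" + tagName + "></" + tagName + ">";
    }

    public static void main(String[] args) {
        NamespacedTagLookup tags = new NamespacedTagLookup();
        tags.register(HTML_NS, "iframe", false);

        // Looked up in the HTML namespace: the rule is found and applied.
        System.out.println(tags.serializeEmpty(HTML_NS, "iframe"));        // prints <iframe></iframe>
        // Looked up in another namespace: no rule, so the tag stays self-closing.
        System.out.println(tags.serializeEmpty(GENERIC_XML_NS, "iframe")); // prints <iframe/>
    }
}
```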

A couple of ways I could think of changing this:

  1. implement a namespace stack in the XML parser (there is an impl in the W3CDom which we might shift out to XML parser), and detect the namespace from an xmlns attribute, and lookup tags by that
  2. add a default namespace parameter to the XML parser
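
The two options above could fit together roughly as in the following sketch, again in plain Java with hypothetical names (NamespaceStack, open, close). It is an assumption about the general shape of such a design, not the W3CDom implementation mentioned above and not a proposed jsoup API. The constructor argument plays the role of the default namespace parameter, and only the unprefixed xmlns attribute is handled; prefixed declarations such as xmlns:svg are ignored.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical sketch: track the in-scope default namespace while parsing.
class NamespaceStack {
    private final Deque<String> stack = new ArrayDeque<>();

    // Option 2: the caller supplies the namespace used when no xmlns is declared.
    NamespaceStack(String defaultNamespace) {
        stack.push(defaultNamespace);
    }

    // Option 1: on a start tag, an xmlns attribute overrides the inherited namespace.
    // Returns the namespace the element resolves to, which would key the tag lookup.
    String open(Map<String, String> attributes) {
        String namespace = attributes.getOrDefault("xmlns", stack.peek());
        stack.push(namespace);
        return namespace;
    }

    // On an end tag, restore the parent's namespace; the default entry is never popped.
    void close() {
        if (stack.size() > 1) {
            stack.pop();
        }
    }

    String current() {
        return stack.peek();
    }

    public static void main(String[] args) {
        NamespaceStack ns = new NamespaceStack("urn:example:default");
        String html = "http://www.w3.org/1999/xhtml";

        System.out.println(ns.open(Map.of("xmlns", html))); // prints http://www.w3.org/1999/xhtml
        System.out.println(ns.open(Map.of()));              // child inherits: same namespace
        ns.close();
        ns.close();
        System.out.println(ns.current());                   // prints urn:example:default
    }
}
```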

Questions for you:

  1. Can you tell me more about your use case, and why you're not using the HTML parser?
  2. Can you provide me a full sample of the HTML (or the source URL) so I can review implementation options?

Author

hnljp commented Nov 13, 2023

There is no problem in using the HTML parser.
I just wanted to make sure that it was not a mistake.

Thank you very much for your answer.
