-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Put W3C DOM elements in HTML namespace by default for jhy/jsoup#1837. #1848
Conversation
Element el = namespace == null && tagName.contains(":") ? | ||
doc.createElementNS("", tagName) : // doesn't have a real namespace defined | ||
doc.createElementNS(namespace, tagName); | ||
// use an empty namespace if none is present but the tag name has a prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't change any functionality here. The modification makes it clear that the only thing happening is that you're "imputing" a namespace of the empty string under certain conditions — you're not calling a separate method. The original code duplicated the call to doc.createElementNS()
, which primarily was confusing and obscured the purpose of the logic.
I don't know why some of the integration tests are failing on macOS, but from the output it doesn't appear to be related to any changes here. My changes shouldn't have anything to do with integration-level issues, and the unit tests are all working. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Please see a couple notes inline and let me know if you have any thoughts.
LMK if you want to proceed with the changes, otherwise I am happy to apply them also.
@@ -348,6 +387,7 @@ protected static class W3CBuilder implements NodeVisitor { | |||
public W3CBuilder(Document doc) { | |||
this.doc = doc; | |||
namespacesStack.push(new HashMap<>()); | |||
namespacesStack.peek().put("", "http://www.w3.org/1999/xhtml"); // TODO document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that we should only be setting this default namespace if the input document is HTML. If it's XML, then the HTML -> XML compat note doesn't apply, and we shouldn't be applying it.
The impl could be something like
if (inDoc.parser().getTreeBuilder() instanceof HtmlTreeBuilder)
Would need to move this to the appropriate convert()
method. Or, add that flag as user data to the W3C Document.
I'm not sure if checking the treebuilder is the best way to see if it's HTML vs XML. The output syntax is how we normally check that, but that doesn't tell us particularly if the input was HTML, and people may be setting the syntax to XML before calling these methods. So am leaning towards checking treebuilder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that we should only be setting this default namespace if the input document is HTML.
That makes sense. I don't know enough about your library's API to know when it thinks it's parsing XML. I had assumed it only parsed things considered to be non-XML HTML. (Otherwise I would think you could just use an off-the-shelf XML parser because the input would not be "soup".)
So am leaning towards checking treebuilder.
You know best here. All I care about is when I parse an HTML document, the HTML namespace gets imputed as per the space. If the tree builder is always set to HtmlTreeBuilder
in those cases, that works. Whatever you want to do.
Would need to move this to the appropriate
convert()
method. Or, add that flag as user data to the W3C Document.
That sounds like a lot of shuffling. I note that the W3CBuilder
does seem to get passed the org.jsoup.nodes.Element
being converted via the W3C document user data context property. In normal cases that element would know its owner org.jsoup.nodes.Document
would it not? (If not, can we assume this is HTML input? I would imagine it to be a rare case anyway for the jsoup element not to have an owner document.)
So what about this inside the W3CBuilder
constructor (where I made the change already)?
final org.jsoup.nodes.Document inDoc = contextElement.ownerDocument();
if (inDoc == null || inDoc.parser().getTreeBuilder() instanceof HtmlTreeBuilder) {
namespacesStack.peek().put("", "http://www.w3.org/1999/xhtml"); // TODO document
}
(I see that I had forgotten to document the actual namespace imputation line. I'll add that once I get an OK from you on how to proceed.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhy could you respond here and let me know which way to go, so that I can finish this PR and get it accepted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a lot of shuffling. I note that the
W3CBuilder
does seem to get passed theorg.jsoup.nodes.Element
being converted via the W3C document user data context property. In normal cases that element would know its ownerorg.jsoup.nodes.Document
would it not? (If not, can we assume this is HTML input? I would imagine it to be a rare case anyway for the jsoup element not to have an owner document.)
OK, using the contextElement is a good idea. And yes let's go with checking if the builder was HtmlTreeBuilder. That will work for all cases in the core library (vs anyone's own extensions, which is kind of up them).
I would not assume that an Element without an ownerDocument is HTML though, so we should not add the namespace if it is null. There will be no ownerDoc for Elements constructed via new
and not attached to another doc, which is a supported use case. And in that case we don't know either way.
I plan on making this namespace inference a defaulted-on option in the W3C DOM. This will allow existing uses who have working code without a namespace (for e.g. simple xpath queries) a migration path, or for new uses that prefer it otherwise.
static String removeDefaultHtmlNamespaceDeclaration(String html) { | ||
Matcher matcher = HTML_DEFAULT_NAMESPACE_PATTERN.matcher(html); | ||
if (matcher.find()) { | ||
html = html.substring(0, matcher.start(1)) + html.substring(matcher.end(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to leave the introduced namespace in the output. It makes the output XML clearer as to what's happened, and may help other downstream parses. So the regex components etc can be removed.
We can leave it like that or play tricks to get rid of it, but it depends on the purpose and usage of W3CDom.asString(). Is it just used for testing? Or is it part of the jsoup API for turning DOM into a string?
It's part of the API.
jsoup is about parsing, not pretty-printing
I consider jsoup to be also about pretty-printing. And we do introduce other tokens throughout jsoup as a convenience (e.g. xml declarations, meta charsets, the HTML document structure, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to leave the introduced namespace in the output.
For my purposes I don't care one way or another, and it's certainly easier just to leave it in. At the time, due to lack of response to the ticket, I had to make a choice and do what I thought would raise the chance of the PR being requested. (You can see all the time and painful research on Stack Overflow to remove the declaration.) I'll remove the code that removed the declaration in the output.
Yep don't worry about those - likely to have just been GitHub not having enough Mac resources on hand when the tests ran to complete some of the timing tests within the limit. |
@jhy I've made the changes you've requested. Sorry for the delay. I merged in your latest changes from Can you can review this and merge it soon? I really don't want for it to get stale, and each time I come back to it after a couple of months it's harder to re-orient myself to where I left it. 😅 Let me know if you need further changes. |
Thanks, merged! Appreciate your perseverance in getting it completed. I added a test that the converter is in namespace aware mode before applying it, so that it can be optionally disabled. |
Woohoo!! Thanks; this is exciting. |
Fixes #1837.