W3CDom attribute names case sensitivity #981

Closed
cigaly opened this issue Nov 24, 2017 · 3 comments

cigaly commented Nov 24, 2017

According to the HTML specification (http://w3c.github.io/html-reference/documents.html#case-insensitivity), both tag and attribute names are case-insensitive. However, in the current implementation, tag names are converted to lower case but attribute names are left as-is.

Example HTML:

<html lang=en>
<body>
<img src="firstImage.jpg" alt="Alt one" />
<IMG SRC="secondImage.jpg" AlT="Alt two" />
</body>
</html>

This input causes the following test case to fail:

public void checkElementsAttributesCaseSensitivity() throws IOException {
    File in = ParseTest.getFile("/htmltests/attributes-case-sensitivity-test.html");
    org.jsoup.nodes.Document jsoupDoc;
    jsoupDoc = Jsoup.parse(in, "UTF-8");

    org.jsoup.helper.W3CDom jDom = new org.jsoup.helper.W3CDom();
    Document doc = jDom.fromJsoup(jsoupDoc);

    final org.w3c.dom.Element body = (org.w3c.dom.Element) doc.getDocumentElement().getElementsByTagName("body").item(0);

    final NodeList imgs = body.getElementsByTagName("img");
    assertEquals(2, imgs.getLength());

    final org.w3c.dom.Element first = (org.w3c.dom.Element) imgs.item(0);
    assertEquals(2, first.getAttributes().getLength());
    final String img1 = first.getAttribute("src");
    assertEquals("firstImage.jpg", img1);
    final String alt1 = first.getAttribute("alt");
    assertEquals("Alt one", alt1);

    final org.w3c.dom.Element second = (org.w3c.dom.Element) imgs.item(1);
    assertEquals(2, second.getAttributes().getLength());
    final String img2 = second.getAttribute("src");
    assertEquals("secondImage.jpg", img2);
    final String alt2 = second.getAttribute("alt");
    assertEquals("Alt two", alt2);
}
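
To illustrate why the assertions on the second image fail: attribute lookups on a plain W3C DOM element match names exactly, so an attribute stored as `SRC` is not found under `src`. The following is a minimal JDK-only sketch of that behavior; it does not use jsoup, and the class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of jsoup.
final class DomAttributeCaseDemo {

    /**
     * Stores an attribute under storedName on a fresh <img> element,
     * then reads it back under queriedName.
     */
    static String lookup(String storedName, String queriedName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element img = doc.createElement("img");
            img.setAttribute(storedName, "secondImage.jpg");
            // getAttribute returns an empty string when no attribute has that exact name.
            return img.getAttribute(queriedName);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DOM document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + lookup("SRC", "src") + "]"); // prints "[]"
        System.out.println("[" + lookup("src", "src") + "]"); // prints "[secondImage.jpg]"
    }
}
```

The first lookup comes back empty because the names differ only in case, which is the same situation the test hits with `SRC` and `AlT`.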

cigaly (Author) commented Nov 24, 2017

The change that fixes this is quite simple:

index 81ac932..281e3d7 100644
--- a/src/main/java/org/jsoup/helper/W3CDom.java
+++ b/src/main/java/org/jsoup/helper/W3CDom.java
@@ -124,7 +124,7 @@ public class W3CDom {
                 // valid xml attribute names are: ^[a-zA-Z_:][-a-zA-Z0-9_:.]
                 String key = attribute.getKey().replaceAll("[^-a-zA-Z0-9_:.]", "");
                 if (key.matches("[a-zA-Z_:][-a-zA-Z0-9_:.]*"))
-                    el.setAttribute(key, attribute.getValue());
+                    el.setAttribute(key.toLowerCase(), attribute.getValue());
             }
         }
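
The key handling in the patch can also be tried in isolation. The sketch below reuses the two regular expressions from the diff and lowercases the result; the class name, the method name, and the `Optional` return type are hypothetical. Using `Locale.ROOT` is an assumption added here rather than part of the patch: it keeps the result independent of the default locale (in a Turkish locale, `"I".toLowerCase()` yields a dotless `ı`).

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of jsoup; mirrors the logic shown in the diff.
final class AttributeKeyNormalizer {

    /**
     * Strips characters that are not allowed in the attribute name, checks the
     * remainder against the name pattern from the diff, and lowercases it.
     * Returns an empty Optional when no usable name remains.
     */
    static Optional<String> normalize(String rawKey) {
        String key = rawKey.replaceAll("[^-a-zA-Z0-9_:.]", "");
        if (!key.matches("[a-zA-Z_:][-a-zA-Z0-9_:.]*")) {
            return Optional.empty();
        }
        // Locale.ROOT is an assumption of this sketch; the patch calls toLowerCase() without a locale.
        return Optional.of(key.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(normalize("AlT"));  // prints "Optional[alt]"
        System.out.println(normalize("SRC"));  // prints "Optional[src]"
        System.out.println(normalize("1abc")); // prints "Optional.empty"
    }
}
```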

jhy (Owner) commented Aug 27, 2024

Thanks - this was fixed many, many moons ago! Apologies for the late reply.

jhy closed this as not planned Aug 27, 2024
jhy closed this as completed Aug 27, 2024
jhy added the fixed label Aug 27, 2024

jhy (Owner) commented Aug 27, 2024

Also, if the XML parser is used instead of the HTML parser, attribute names and tag names are output in their original case.
