SelectorParseException when parsing a class name with invalid characters. #1874

RonaldRuckus · 2022-12-28T19:17:42Z

Hi, I am new to GitHub. I have made a workaround but don't know how to actually push it or whatever.
Anyways.

A very easy problem to replicate is a parser error when class names contain characters that are invalid in a selector, such as ":" or "=".
These can be found, for educational purposes, on very popular websites.

An example class attribute, taken from a div on a famous electronics retailer's site: "acl-display--flex acl-flex--row acl-flex--nowrap acl-pb--small lg:acl-pb--large lg:acl-px--medium"

The exact issue can be found in QueryParser.java under findElements.

A simple workaround is below. It doesn't mess up the actual element locations, as long as the elements have other unique identifiers. It simply removes the invalid classes from the actual HTML, and can most likely be refactored better. I'm fully aware of safelists, but each time I tried one it would destroy the actual location of elements. I wrote this workaround at about 2am last night after extensive hours trying to understand the issue.

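    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document

    // Characters that the class selector cannot parse
    val regex = """[:=]""".toRegex()

    // Wrapper function name is illustrative; the original snippet showed only the body
    fun removeInvalidClasses(html: String): Document {
        return Jsoup.parse(html).apply {
            // Find and iterate through all elements whose class attribute matches the regex
            select("[class~=[:=]]").forEach { element ->
                // Keep only the class names that contain an invalid character
                element.classNames().filter { className ->
                    className.contains(regex)
                }.forEach { invalidClass ->
                    // Remove the class from the element
                    element.removeClass(invalidClass)
                }
            }
        }
    }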

An actual solution would be to include a more robust regex match for classes with these invalid characters. I honestly just don't know how to edit or override imported files.

Thank you.

jhy · 2023-01-05T04:21:32Z

Hi, and welcome!

This is not directly a bug, in that a CSS class query must not contain a : or ., as they will conflict with subsequent query components (pseudo-classes and classes).

The best fix would be to add support for escaping those characters in the queries, as in the partially implemented #1442.

A workaround for now would be to select using the attribute selector on the class. E.g. div[class*="lg:acl-pb--large"]. See https://try.jsoup.org/~1cI7An6Uen4imRCybNSizsLrCsA
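
As a rough illustration of that attribute-selector workaround, here is a minimal Kotlin sketch. The helper name divsWithClass is hypothetical (it is not part of jsoup), and the extra hasClass check is an assumption about what a caller might want: [class*=...] is a substring match on the raw attribute value, so it can also hit longer class names that merely contain the text.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

// Hypothetical helper, not part of jsoup: finds <div> elements carrying a class
// name that the plain class selector (.name) cannot express, e.g. "lg:acl-pb--large".
// Assumes className contains no double quotes, since it is pasted into the query.
fun divsWithClass(html: String, className: String): List<Element> =
    Jsoup.parse(html)
        // Substring match on the raw class attribute, as suggested in the reply above
        .select("div[class*=\"$className\"]")
        // hasClass compares whole class tokens, dropping partial matches
        .filter { it.hasClass(className) }
```

Calling divsWithClass(html, "lg:acl-pb--large") on a page should return only the divs that carry that exact class, without removing anything from the document, so element locations stay intact.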

BTW for the future, your report would be improved with a small testcase (what code are you actually running and what do you expect it to do). And there's no need to obfuscate what you're trying to scrape, IMHO. It makes it simpler to check and understand.

(Close as dupe of #1442)

jhy closed this as completed Jan 5, 2023

fcolin-odigo mentioned this issue Apr 12, 2023

Double URLencoding since 1.15.4 due to #1873 "fix" #1936

Closed
