SelectorParseException when parsing a class name with invalid characters. #1874

Closed
RonaldRuckus opened this issue Dec 28, 2022 · 1 comment


RonaldRuckus commented Dec 28, 2022

Hi, I am new to GitHub. I have made a workaround but don't know how to actually submit it.

A very easy problem to reproduce is a parser error when class names contain invalid characters such as ":" or "=".
Such class names can be found on very popular websites (I am looking at them for educational purposes).

An example class attribute, taken from a div on a famous electronics retailer's site: "acl-display--flex acl-flex--row acl-flex--nowrap acl-pb--small lg:acl-pb--large lg:acl-px--medium"

The exact issue can be found in QueryParser.java under findElements.

Here is a simple workaround that does not disturb the actual element location, as long as the element has other unique identifiers. It can most likely be refactored better. It simply removes the invalid classes from the parsed HTML. I'm fully aware of safelists, but each time I used one it destroyed the actual location of the elements. I wrote this workaround at about 2am last night after extensive hours trying to understand the issue.

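    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document

    // Wrapper name is illustrative; the original snippet was the body of an unnamed function.
    fun parseWithoutInvalidClasses(html: String): Document {
        // Characters that make the class selector fail to parse
        val regex = """[:=]""".toRegex()
        return Jsoup.parse(html).apply {
            // Find and iterate through all elements whose class attribute contains ':' or '='
            select("[class~=[:=]]").forEach { element ->
                // Keep only the class names that contain an invalid character
                element.classNames().filter { className ->
                    className.contains(regex)
                }.forEach { invalidClass ->
                    // Remove the class from the element
                    element.removeClass(invalidClass)
                }
            }
        }
    }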

An actual solution would be to include a more robust regex match for classes with these invalid characters. I honestly just don't know how to edit or override imported files.

Thank you.

Owner

jhy commented Jan 5, 2023

Hi, and welcome!

This is not directly a bug, in that a CSS class query must not contain a : or ., as they will conflict with subsequent query components (pseudo-classes and classes).

The best fix would be to add support for escaping those characters in the queries, as in the partially implemented #1442.
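For illustration only, here is a rough sketch of what an escaped class query could look like on the caller's side, using the CSS convention of putting a backslash before a special character. The helper `escapeCssClass` is hypothetical and not part of jsoup, it ignores corner cases such as a leading digit, and whether a given jsoup version accepts the escaped form depends on the state of #1442.

```kotlin
// Hypothetical helper (not a jsoup API): backslash-escapes every character
// that is not a letter, digit, '-' or '_' so it can sit in a class selector.
fun escapeCssClass(name: String): String =
    buildString {
        for (ch in name) {
            if (ch.isLetterOrDigit() || ch == '-' || ch == '_') {
                append(ch)
            } else {
                // e.g. ':' becomes "\:" so it is not read as a pseudo-class
                append('\\').append(ch)
            }
        }
    }

fun main() {
    // Builds the query string only; it is not run against jsoup here.
    println("div." + escapeCssClass("lg:acl-pb--large")) // prints div.lg\:acl-pb--large
}
```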

A workaround for now would be to select using the attribute selector on the class. E.g. div[class*="lg:acl-pb--large"]. See https://try.jsoup.org/~1cI7An6Uen4imRCybNSizsLrCsA
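A minimal Kotlin sketch of that attribute-selector workaround follows, assuming jsoup is on the classpath. The function name `selectByClassToken` and the sample markup are illustrative only. Because `[class*=...]` is a substring match on the whole attribute, the sketch adds a `hasClass` filter to keep only elements that carry the exact class name; it also assumes the class name contains no double quote.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

// Illustrative helper: finds elements carrying a class name that the
// ".class" selector syntax cannot express (for example one containing ':').
fun selectByClassToken(html: String, className: String): List<Element> {
    val doc = Jsoup.parse(html)
    // The quoted attribute value is matched as plain text, not parsed as a class selector
    val candidates = doc.select("[class*=\"$className\"]")
    // Narrow the substring matches down to elements with that exact class name
    return candidates.filter { it.hasClass(className) }
}

fun main() {
    val html = """
        <div class="acl-pb--small lg:acl-pb--large lg:acl-px--medium">target</div>
        <div class="acl-pb--small">other</div>
    """.trimIndent()
    selectByClassToken(html, "lg:acl-pb--large").forEach { println(it.text()) }
}
```

The elements stay in the document untouched, so their position and their other classes are preserved.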

BTW for the future, your report would be improved with a small testcase (what code are you actually running and what do you expect it to do). And there's no need to obfuscate what you're trying to scrape, IMHO. It makes it simpler to check and understand.
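A small test case along those lines might look like the sketch below. It assumes jsoup is on the classpath; the helper name `failsToParse` and the shortened markup are made up for illustration and are not taken from the original report.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.select.Selector

// Illustrative helper: returns true if jsoup rejects the query
// with a SelectorParseException, false if the query parses and runs.
fun failsToParse(html: String, query: String): Boolean =
    try {
        Jsoup.parse(html).select(query)
        false
    } catch (e: Selector.SelectorParseException) {
        true
    }

fun main() {
    val html = """<div class="acl-pb--small lg:acl-pb--large">content</div>"""
    // Intent: select the div by its "lg:acl-pb--large" class.
    // The ':' is read as the start of a pseudo-class, which is what the report describes.
    println(failsToParse(html, "div.lg:acl-pb--large"))
}
```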

(Close as dupe of #1442)
