Hi, I am new to GitHub. I have made a workaround but don't know how to actually push it or whatever.
Anyways.
A very easy problem to replicate is a selector parse error when class names contain characters such as ":" or "=".
These can be found, for educational purposes, on very popular websites.
An example, taken from a div on a famous electronics retailer's site: class="acl-display--flex acl-flex--row acl-flex--nowrap acl-pb--small lg:acl-pb--large lg:acl-px--medium"
The exact issue can be found in QueryParser.java, under findElements.
Here is a simple workaround. It just removes the invalid classes from the parsed HTML and leaves each element where it is in the document, so the element can still be selected as long as it has other unique identifiers. It can most likely be refactored. I'm fully aware of safelists, but each time I tried one it destroyed the actual location of the elements. I wrote this workaround at about 2am last night after many hours of trying to understand the issue.
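import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Characters in class names that the selector parser trips over
val invalidChars = """[:=]""".toRegex()

// The function name is illustrative; the original snippet was the body of an unnamed function
fun stripInvalidClasses(html: String): Document =
    Jsoup.parse(html).apply {
        // Find and iterate through all elements whose class attribute contains ':' or '='
        select("[class~=[:=]]").forEach { element ->
            element.classNames()
                // Keep only the class names that contain an invalid character
                .filter { className -> className.contains(invalidChars) }
                // Remove each of those classes from the element
                .forEach { invalidClass -> element.removeClass(invalidClass) }
        }
    }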
An actual solution would be to include a more robust regex match for classes with these invalid characters. I just honestly don't know how to edit or override imported files.
Thank you.
This is not directly a bug, in that a CSS class query must not contain a : or ., as they will conflict with subsequent query components (pseudo-classes and classes).
The best fix would be to add support for escaping those characters in the queries, as in the partially implemented #1442.
BTW for the future, your report would be improved with a small testcase (what code are you actually running and what do you expect it to do). And there's no need to obfuscate what you're trying to scrape, IMHO. It makes it simpler to check and understand.
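As a rough idea of what such a small testcase could look like, here is a sketch in Kotlin. The sample HTML and the helper names (`SAMPLE_HTML`, `selectorFailsToParse`, `findByClassName`) are made up for illustration and are not part of jsoup. It assumes that an unescaped ":" in a class query is rejected with `Selector.SelectorParseException`, which may differ between jsoup versions. It also shows a lookup through `getElementsByClass`, which takes the class name directly rather than as part of a selector query, so the document does not need to be modified.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.select.Elements
import org.jsoup.select.Selector

// Hypothetical sample markup, modelled on the class list quoted in the report.
const val SAMPLE_HTML =
    """<div class="acl-display--flex acl-pb--small lg:acl-pb--large lg:acl-px--medium">hit</div><div class="acl-pb--small">miss</div>"""

// Returns true when jsoup rejects the query while parsing it.
// Assumption: the failure surfaces as Selector.SelectorParseException.
fun selectorFailsToParse(html: String, query: String): Boolean =
    try {
        Jsoup.parse(html).select(query)
        false
    } catch (e: Selector.SelectorParseException) {
        true
    }

// Finds elements by class name without building a selector query,
// leaving the parsed document untouched.
fun findByClassName(html: String, className: String): Elements =
    Jsoup.parse(html).getElementsByClass(className)

fun main() {
    // What is being run: a class query containing an unescaped ':'
    println(selectorFailsToParse(SAMPLE_HTML, "div.lg:acl-pb--large"))
    // What is expected instead: the one div carrying that class
    println(findByClassName(SAMPLE_HTML, "lg:acl-pb--large").size)
}
```

A report that states the query being run and the element it is expected to return, as above, makes the problem much quicker to check.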