Allow backslash-escaping within ElementSelector and CssIdentifier #1442

hannibal218bc
Contributor

There has been previous discussion about element names and CSS tokens that currently cannot be selected through jsoup's selector syntax, because they contain jsoup selector operators (such as #, ., and so on).

In #1055 it was suggested to use escape characters within the selector syntax.

This (simple) implementation modifies TokenQueue so that consumeElementSelector and consumeCSSIdentifier continue matching if the preceding character is a backslash (\). Finally, all backslashes are removed from the result.

It is a simple approach, and will not allow escaping backslashes themselves - but there should never be a backslash in an element name or CSS identifier?
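To make the idea concrete, here is a minimal standalone sketch of the scanning rule described above. It is not the code in this PR and not jsoup's TokenQueue: the class name, the method name, and the set of characters treated as identifier characters (letters, digits, - and _) are assumptions for illustration only.

```java
public class EscapedIdentifierSketch {

    // Hypothetical helper: consumes an identifier from the start of the query.
    // A character is accepted if it is an identifier character (assumed here:
    // letter, digit, '-' or '_'), a backslash, or directly preceded by a backslash.
    static String consumeIdentifier(String query) {
        int pos = 0;
        boolean sawEscape = false;
        while (pos < query.length()) {
            char c = query.charAt(pos);
            boolean escaped = pos > 0 && query.charAt(pos - 1) == '\\';
            if (c == '\\') {
                sawEscape = true;   // remember that the result needs cleaning
                pos++;
            } else if (Character.isLetterOrDigit(c) || c == '-' || c == '_' || escaped) {
                pos++;
            } else {
                break;              // unescaped operator such as '.' or '#'
            }
        }
        String raw = query.substring(0, pos);
        // Simple approach: drop every backslash, only if one was seen.
        return sawEscape ? raw.replace("\\", "") : raw;
    }

    public static void main(String[] args) {
        System.out.println(consumeIdentifier("e.dot"));    // e
        System.out.println(consumeIdentifier("e\\.dot"));  // e.dot
        System.out.println(consumeIdentifier("a\\\\b"));   // ab (the backslash is lost)
    }
}
```

The last call shows the limitation mentioned above: because every backslash is stripped, an escaped backslash cannot survive into the result.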

Here's a simple test case:

	@Test public void testSelectorEscapes() {
		// Parse as XML so that the element name "e.dot" is kept as-is
		Document doc = Jsoup.parse(
			"<root><e.dot>e with dot</e.dot><e class=\"dot\">e with class dot</e>"
			+ "<f id=\"i.d\">f with i.d</f><x id=\"i\" class=\"d\">id i with class d</x></root>",
			"", Parser.xmlParser());

		// An unescaped dot is the class operator; an escaped dot is part of the name or id
		assertEquals("e with class dot", doc.select("e.dot").text());
		assertEquals("e with dot", doc.select("e\\.dot").text());
		assertEquals("id i with class d", doc.select("#i.d").text());
		assertEquals("f with i.d", doc.select("#i\\.d").text());
	}

This PR would close issue #1441.

@hannibal218bc
Contributor Author

Updated the PR to use TokenQueue.unescape to remove the escape characters, and added a flag to only do it when an escape char was detected in the loop.

@jhy
Owner

jhy commented Jan 11, 2021

Hi Hannes,

Thanks for this, and sorry for the late reply. This update looks great. One suggestion:

> It is a simple approach, and will not allow escaping backslashes themselves - but there should never be a backslash in an element name or CSS identifier?

[Un]fortunately, HTML does indeed allow backslashes, both in tag names and in attribute names:

<p\p attri\bute>Foo</p\p>

is valid and parses as-is (both in jsoup and in Chrome)

I think that we should enable the backslashes to be escaped.

LMK if you want to do that, or if I should.

Finally, it would be great if you could update CHANGES and most importantly, add a test-case for each case, to make sure it works and we never regress.
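As a rough illustration of what escaping the backslash itself could mean, here is a standalone sketch in which a backslash makes the following character literal, so that a doubled backslash yields a single one. The class and method names are hypothetical, and this is not a claim about how jsoup's TokenQueue.unescape behaves.

```java
public class EscapeAwareUnescapeSketch {

    // Hypothetical helper: a backslash makes the next character literal,
    // so "\\" (two backslashes) becomes one backslash and "\." becomes ".".
    // A lone trailing backslash has nothing to escape and is kept unchanged.
    static String unescape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '\\' && i + 1 < in.length()) {
                out.append(in.charAt(++i)); // skip the escape, keep the next char
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("e\\.dot"));  // e.dot
        System.out.println(unescape("p\\\\p"));   // p\p
    }
}
```

Under this rule, a selector for the p\p tag from the example above would be written with a doubled backslash in the query, which is four backslashes inside a Java string literal.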

@jhy
Owner

jhy commented Jan 11, 2021

(Somewhat amusingly, <table(╯°□°)╯>Hello!</table(╯°□°)╯> is also a valid tag name, although I wasn't able to create a selector for it, either with jsoup or Chrome. And I don't think that's a requirement for this PR, but a nice i18n stretch goal :) )
