XPath selector based on Jsoup.
@Test
public void testSelect() {
    String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
            "<table><tr><td>a</td><td>b</td></tr></table></html>";

    Document document = Jsoup.parse(html);

    // Select a single attribute value
    String result = Xsoup.compile("//a/@href").evaluate(document).get();
    Assert.assertEquals("https://github.com", result);

    // Select the text of every matching node
    List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
    Assert.assertEquals("a", list.get(0));
    Assert.assertEquals("b", list.get(1));
}
Xsoup uses Jsoup as its HTML parser.
Compared with HtmlCleaner, another widely used XPath selector for HTML, Xsoup is much faster:
Input: normal HTML, size 44 KB
XPath: "//a"
Runs: 2000
Environment: MacBook Air MD231CH/A
CPU: 1.8 GHz Intel Core i5
Operation | Xsoup (ms) | HtmlCleaner (ms) |
parse | 3,207 | 7,999 |
select | 95 | 380 |
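The benchmark code itself is not shown here. As a rough sketch of how totals like these could be measured, the hypothetical helper below runs an operation a fixed number of times and reports the total elapsed milliseconds. The class and method names are illustrative and not part of Xsoup or HtmlCleaner.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical timing helper; not part of Xsoup or HtmlCleaner.
class BenchmarkSketch {

    // Runs the operation `runs` times and returns the total elapsed time in milliseconds.
    static long timeMillis(int runs, Runnable operation) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            operation.run();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger();
        // Placeholder workload; replace it with a parse or select call to measure that step.
        long elapsed = timeMillis(2000, counter::incrementAndGet);
        System.out.println("2000 runs took " + elapsed + " ms");
    }
}
```

Passing `() -> Xsoup.compile("//a").evaluate(document)` as the operation would time the select step for an already parsed `document`. A fair comparison would also need warm-up runs before timing, which this sketch omits.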
Name | Expression | Support |
nodename | nodename | yes |
immediate parent | / | yes |
parent | // | yes |
attribute | [@key=value] | yes |
nth child | tag[n] | yes |
attribute | /@key | yes |
wildcard in tagname | /* | yes |
wildcard in attribute | /[@*] | yes |
function | function() | partial |
or | a \| b | no |
parent in path | . or .. | no |
predicates | price>35 | no |
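Since the union operator is listed as unsupported, one possible workaround is to evaluate each expression separately and merge the results. The sketch below assumes a hypothetical helper, `selectAny`, that takes any function mapping an XPath string to a list of strings; neither the helper nor the class is part of Xsoup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper; emulates "a | b" by evaluating each expression separately.
class UnionSketch {

    // Evaluates every expression with the supplied evaluator and merges the results,
    // keeping first-seen order and dropping duplicate strings.
    static List<String> selectAny(Function<String, List<String>> evaluator, String... xpaths) {
        Set<String> merged = new LinkedHashSet<>();
        for (String xpath : xpaths) {
            merged.addAll(evaluator.apply(xpath));
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        // Stand-in evaluator with canned results, so the sketch runs without any library.
        Function<String, List<String>> fake = xpath ->
                xpath.equals("//td/text()") ? Arrays.asList("a", "b") : Arrays.asList("b", "c");
        System.out.println(selectAny(fake, "//td/text()", "//th/text()"));
    }
}
```

With Xsoup, the evaluator could be `xp -> Xsoup.compile(xp).evaluate(document).list()`. Note that this merges string values in evaluation order, whereas a standard XPath union yields a node set in document order, so the two are not strictly equivalent.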
Xsoup provides some functions that may not be part of standard XPath 1.0:
Expression | Description | Standard XPath |
text(n) | nth text content of the element (0 for all) | text() only |
allText() | text including children | not supported |
tidyText() | text including children, well formatted | not supported |
html() | inner HTML of the element | not supported |
outerHtml() | outer HTML of the element | not supported |
regex(@attr,expr,group) | use a regex to extract content | not supported |
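To make the three arguments of `regex(@attr,expr,group)` concrete, the sketch below approximates the described behavior in plain Java with `java.util.regex`. It is an illustration only, not Xsoup's implementation: the class name, the first-match semantics, and the `null` result for no match are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: approximates what regex(@attr, expr, group) is described as doing.
class RegexFunctionSketch {

    // Returns the requested capture group of the first match, or null when nothing matches.
    // Assumption: first-match (find) semantics; Xsoup's actual behavior may differ.
    static String extract(String attributeValue, String expr, int group) {
        Matcher matcher = Pattern.compile(expr).matcher(attributeValue);
        return matcher.find() ? matcher.group(group) : null;
    }

    public static void main(String[] args) {
        // Pulls the host out of the href value used in the first example of this README.
        System.out.println(extract("https://github.com", "https://(.+)", 1));
    }
}
```

Here `attributeValue` plays the role of `@attr`, `expr` is the pattern, and `group` selects the capture group, with 0 meaning the whole match.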
The following XPath syntax extensions exist only in Xsoup (added for convenience when extracting from HTML; they follow the Jsoup CSS selector):
Name | Expression | Support |
attribute value does not equal | [@key!=value] | yes |
attribute value starts with | [@key^=value] | yes |
attribute value ends with | [@key$=value] | yes |
attribute value contains | [@key*=value] | yes |
attribute value matches regex | [@key~=value] | yes |
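The intended meaning of each operator can be sketched with plain Java string operations. This assumes the semantics mirror the Jsoup CSS attribute selectors, including `^=` as the prefix operator, and it ignores any case-insensitive matching the real implementation may apply. The class and method are hypothetical and not part of Xsoup.

```java
import java.util.regex.Pattern;

// Illustration only: plain-Java equivalents of the extended attribute operators.
class AttributeOperatorSketch {

    // Returns whether `actual` (an attribute value) satisfies `operator` against `expected`.
    static boolean matches(String actual, String operator, String expected) {
        switch (operator) {
            case "=":  return actual.equals(expected);
            case "!=": return !actual.equals(expected);
            case "^=": return actual.startsWith(expected);
            case "$=": return actual.endsWith(expected);
            case "*=": return actual.contains(expected);
            // Assumption: the regex may match anywhere in the value (find, not full match).
            case "~=": return Pattern.compile(expected).matcher(actual).find();
            default:   throw new IllegalArgumentException("Unknown operator: " + operator);
        }
    }

    public static void main(String[] args) {
        String href = "https://github.com";
        System.out.println(matches(href, "^=", "https://"));
        System.out.println(matches(href, "$=", ".com"));
        System.out.println(matches(href, "~=", "git\\w+"));
    }
}
```

In an expression, these operators appear inside a predicate, for example `//a[@href*=github]` to keep only links whose `href` contains `github`.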
MIT License, see file LICENSE