It lets you parse web pages, or any other XML-based content, using a predefined template.
The finder, HtmlXPathTemplateFinder, is based on XPath selectors and uses the HtmlAgilityPack library under the hood.
You need to provide three things:
- the HTML content (string)
- a template reader (IHtmlTemplateReader<HtmlXPathTemplate>)
- an entity type (any class or struct with a few string properties)
The default reader, HtmlXPathTemplateReader, supports custom XPath-based templates such as:
//*[@class='row']
    .//*[@row-type='photo']
        .//img[@src=$img]
    .//*[@row-type='title']
        .//a[@href=$url]/$title
    .//*[@row-type='price']
        .//span/$price
    .//*[@row-type='date']/$date
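
To make the intent of such a template concrete, the sketch below applies the same selectors by hand to a small sample document, using Python's standard xml.etree.ElementTree. It is only an illustration, not the library's implementation (the library is a .NET package built on HtmlAgilityPack). The sample markup, the Row entity, and the extract_rows and text_of helpers are hypothetical, and it assumes that each child selector is evaluated relative to its parent match and that each variable ($img, $url, $title, $price, $date) fills the entity property of the same name.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Hypothetical sample markup (well-formed XML so ElementTree can parse it).
SAMPLE = """
<div>
  <div class="row">
    <div row-type="photo"><img src="/img/1.jpg"/></div>
    <div row-type="title"><a href="/item/1">Red bicycle</a></div>
    <div row-type="price"><span>100</span></div>
    <div row-type="date">2020-01-01</div>
  </div>
  <div class="row">
    <div row-type="photo"><img src="/img/2.jpg"/></div>
    <div row-type="title"><a href="/item/2">Blue kettle</a></div>
    <div row-type="price"><span>15</span></div>
    <div row-type="date">2020-01-02</div>
  </div>
</div>
"""


@dataclass
class Row:
    """Hypothetical entity: one string field per template variable."""
    img: str
    url: str
    title: str
    price: str
    date: str


def text_of(node: ET.Element) -> str:
    """All text inside the node, similar in spirit to an innerText variable."""
    return "".join(node.itertext()).strip()


def extract_rows(markup: str) -> List[Row]:
    root = ET.fromstring(markup)
    rows = []
    # ElementTree rejects absolute paths on elements, so the root selector
    # //*[@class='row'] is written as .//*[@class='row'] here.
    for row in root.findall(".//*[@class='row']"):
        img = row.find(".//*[@row-type='photo']").find(".//img[@src]")
        link = row.find(".//*[@row-type='title']").find(".//a[@href]")
        span = row.find(".//*[@row-type='price']").find(".//span")
        date = row.find(".//*[@row-type='date']")
        rows.append(Row(
            img=img.get("src"),      # attribute variable $img
            url=link.get("href"),    # attribute variable $url
            title=text_of(link),     # innerText variable $title
            price=text_of(span),     # innerText variable $price
            date=text_of(date),      # innerText variable $date
        ))
    return rows
```

In this sketch the mapping from variables to fields is hard-coded; the point of a template is that the finder derives that mapping from the template text instead.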
The XPath selector format is fixed as long as you use HtmlXPathTemplateFinder.
Everything else about how a template is stored can be changed by implementing your own template reader based on IHtmlTemplateReader<out TTemplate>.
For example, the template could be stored as JSON:
{
  "RootNodeXPath": "//*[@class='row']",
  "Patterns": [
    {
      "XPathSelector": ".//*[@row-type='photo']",
      "Children": [ { "XPathSelector": ".//img[@src=$img]" } ]
    },
    ...
  ]
}
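
As a rough illustration of what a reader for such a JSON format has to do, the following sketch loads a shortened template into a small tree of patterns. It is written in Python for brevity; a real custom reader would be a C# class implementing IHtmlTemplateReader<out TTemplate>, whose members are not shown here. The Template and Pattern classes and the parse_template function are hypothetical names, and only the JSON property names (RootNodeXPath, Patterns, XPathSelector, Children) come from the example above.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pattern:
    """One selector plus the selectors nested under it (hypothetical shape)."""
    xpath_selector: str
    children: List["Pattern"] = field(default_factory=list)


@dataclass
class Template:
    root_node_xpath: str
    patterns: List[Pattern]


def parse_pattern(raw: dict) -> Pattern:
    # "Children" is treated as optional: leaf patterns simply omit it.
    return Pattern(
        xpath_selector=raw["XPathSelector"],
        children=[parse_pattern(child) for child in raw.get("Children", [])],
    )


def parse_template(text: str) -> Template:
    raw = json.loads(text)
    return Template(
        root_node_xpath=raw["RootNodeXPath"],
        patterns=[parse_pattern(p) for p in raw["Patterns"]],
    )


# A shortened template using two of the patterns from this document.
EXAMPLE_JSON = """
{
  "RootNodeXPath": "//*[@class='row']",
  "Patterns": [
    {
      "XPathSelector": ".//*[@row-type='photo']",
      "Children": [ { "XPathSelector": ".//img[@src=$img]" } ]
    },
    { "XPathSelector": ".//*[@row-type='date']/$date" }
  ]
}
"""

template = parse_template(EXAMPLE_JSON)
```

Because the selectors are stored as plain strings, the XPath and variable syntax stays the same as in the default format; only the container around them changes.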
There are two types of variables in a template:
- attribute variables
- innerText variables
You can set multiple attribute variables in a single XPath selector:
.//a[@href=$url and @title=$title]
An innerText variable captures all the text inside the specified tag and can be combined with attribute variables in a single XPath selector:
...
        .//a[@href=$url]/$title
    .//*[@row-type='price']
        .//span/$price
    .//*[@row-type='date']/$date
...
Keep in mind:
- all variables are removed from the selectors once the template is read
- the variable format is parsed with a regex
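
The two notes above can be illustrated with a minimal sketch of how variables might be pulled out of a selector with regular expressions, leaving a plain XPath behind. The library's actual regex and the exact form of the cleaned selector are not documented here, so the ATTR_VAR and INNER_TEXT_VAR patterns, the split_selector function, and the choice to turn @href=$url into an existence test @href are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed patterns; the library's real regex may differ.
ATTR_VAR = re.compile(r"@([\w:-]+)\s*=\s*\$(\w+)")   # e.g. @href=$url
INNER_TEXT_VAR = re.compile(r"/\$(\w+)$")            # e.g. trailing /$title


def split_selector(selector: str) -> Tuple[str, Dict[str, str], Optional[str]]:
    """Return (plain XPath, {attribute name: variable}, innerText variable)."""
    inner = INNER_TEXT_VAR.search(selector)
    inner_var = inner.group(1) if inner else None
    if inner:
        # Drop the trailing /$name so that only real XPath remains.
        selector = selector[: inner.start()]
    attr_vars = {m.group(1): m.group(2) for m in ATTR_VAR.finditer(selector)}
    # Replace each @attr=$var with a bare @attr existence test.
    plain = ATTR_VAR.sub(lambda m: "@" + m.group(1), selector)
    return plain, attr_vars, inner_var


print(split_selector(".//a[@href=$url and @title=$title]"))
# ('.//a[@href and @title]', {'href': 'url', 'title': 'title'}, None)
print(split_selector(".//a[@href=$url]/$title"))
# ('.//a[@href]', {'href': 'url'}, 'title')
```

Selectors without variables, such as .//*[@row-type='photo'], pass through unchanged, because a quoted literal after = does not match the attribute-variable pattern.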
AvitoHtmlXPathTemplateFinderFixture.cs
- https://www.nuget.org/packages/Html.XPath.Template.Finder/
- Install-Package Html.XPath.Template.Finder
- HtmlAgilityPack library
- Developed for educational purposes only
- The avito.ru, e-katalog.ru, and market.yandex.ru websites are used only as examples
- Don't forget to check their policies