Html.Template.Finder library (C#, .NET Standard)

It lets you parse web pages, or any other XML-like content, using a predefined template.

Basics

The finder, HtmlXPathTemplateFinder, is based on XPath selectors and uses the HtmlAgilityPack library under the hood.

You need to provide three things:

the HTML content (string)
a template reader (IHtmlTemplateReader<HtmlXPathTemplate>)
an entity type (any class/struct with a few string properties)

Template format

The default reader, HtmlXPathTemplateReader, supports custom XPath-based templates such as:

//*[@class='row']
    .//*[@row-type='photo']
        .//img[@src=$img]
    .//*[@row-type='title']
        .//a[@href=$url]/$title
    .//*[@row-type='price']
        .//span/$price
    .//*[@row-type='date']/$date
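To make the template above more concrete, here is a minimal sketch of the lookups it describes, written by hand with plain XPath. It does not use this library's API. As an assumption made to keep the example dependency-free, it uses System.Xml.Linq and System.Xml.XPath on a well-formed XML fragment rather than HtmlAgilityPack. The Row class, the ManualTemplateSketch class, and the sample markup are all hypothetical.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical entity: one string property per template variable.
public class Row
{
    public string Img { get; set; }
    public string Url { get; set; }
    public string Title { get; set; }
    public string Price { get; set; }
    public string Date { get; set; }
}

public static class ManualTemplateSketch
{
    // Hypothetical, well-formed markup shaped like the template expects.
    public const string Sample = @"<div>
  <div class='row'>
    <div row-type='photo'><img src='/img/1.png' /></div>
    <div row-type='title'><a href='/item/1'>First item</a></div>
    <div row-type='price'><span>100</span></div>
    <div row-type='date'>2020-01-01</div>
  </div>
</div>";

    public static List<Row> Find(string xml)
    {
        var rows = new List<Row>();
        // Root selector: every node with class='row' yields one entity.
        foreach (var row in XDocument.Parse(xml).XPathSelectElements("//*[@class='row']"))
        {
            // $img comes from the src attribute of the photo image.
            var img = row.XPathSelectElement(".//*[@row-type='photo']//img[@src]");
            // $url comes from href, $title from the link's inner text.
            var link = row.XPathSelectElement(".//*[@row-type='title']//a[@href]");
            // $price and $date come from inner text.
            var price = row.XPathSelectElement(".//*[@row-type='price']//span");
            var date = row.XPathSelectElement(".//*[@row-type='date']");

            rows.Add(new Row
            {
                Img = img?.Attribute("src")?.Value,
                Url = link?.Attribute("href")?.Value,
                Title = link?.Value,
                Price = price?.Value,
                Date = date?.Value
            });
        }
        return rows;
    }
}
```

Calling ManualTemplateSketch.Find(ManualTemplateSketch.Sample) returns a single Row built from the one class='row' node in the sample. The template expresses the same selectors and variable bindings declaratively, so this code does not have to be written per site.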

The XPath selector syntax itself can't be changed while you are using HtmlXPathTemplateFinder.

However, everything else about the template format can be changed by implementing your own template reader based on IHtmlTemplateReader<out TTemplate>.

For example, the template could be stored as JSON:

{
  "RootNodeXPath": "//*[@class='row']",
  "Patterns": [
    {
      "XPathSelector": ".//*[@row-type='photo']",
      "Children": [ { "XPathSelector": ".//img[@src=$img]" } ]
    },
    ...
  ]
}
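As a rough sketch of the first step such a JSON reader might take, the snippet below deserializes that shape into plain classes with System.Text.Json. The JsonTemplate, JsonPattern, and JsonTemplateSketch names are hypothetical. The members of IHtmlTemplateReader<out TTemplate> are not shown in this README, so the interface implementation and the conversion into the finder's own template type are deliberately left out. It also assumes System.Text.Json is available (built into .NET Core 3.0 and later, a NuGet package on .NET Standard 2.0).

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical POCOs mirroring the JSON example; not part of the library.
public class JsonPattern
{
    public string XPathSelector { get; set; }
    // Nested selectors evaluated relative to this pattern's node.
    public List<JsonPattern> Children { get; set; } = new List<JsonPattern>();
}

public class JsonTemplate
{
    public string RootNodeXPath { get; set; }
    public List<JsonPattern> Patterns { get; set; } = new List<JsonPattern>();
}

public static class JsonTemplateSketch
{
    // Property names in the JSON match the class members exactly,
    // so the default (case-sensitive) serializer options are enough.
    public static JsonTemplate Parse(string json)
    {
        return JsonSerializer.Deserialize<JsonTemplate>(json);
    }
}
```

A custom reader would then translate the JsonTemplate tree into whatever template type the finder consumes.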

A template can contain two types of variables:

attribute variable, such as @src=$img
innerText variable, such as /$title

You can set multiple attribute variables in a single XPath selector:

.//a[@href=$url and @title=$title]

An innerText variable captures all the text inside the specified tag and can be combined with attribute variables in a single XPath selector:

    ...
        .//a[@href=$url]/$title
    .//*[@row-type='price']
        .//span/$price
    .//*[@row-type='date']/$date
    ...

Keep in mind:

all variables are removed from the selectors once the template is read
the template format is parsed with regular expressions
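The two notes above can be illustrated with a small regex-based sketch. It is not the library's actual parser: the patterns, the ParsedSelector and VariableSketch names, and the choice to rewrite @attr=$var into a plain @attr existence test are all assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical result type: the cleaned selector plus the variables found in it.
public class ParsedSelector
{
    public string XPath { get; set; }
    // Maps attribute name to variable name, e.g. "href" -> "url".
    public Dictionary<string, string> AttributeVariables { get; } = new Dictionary<string, string>();
    public string InnerTextVariable { get; set; }
}

public static class VariableSketch
{
    // Matches @attr=$var inside a predicate (assumed syntax).
    static readonly Regex AttributeVariable = new Regex(@"@([\w:-]+)\s*=\s*\$(\w+)");
    // Matches a trailing /$var at the end of a selector (assumed syntax).
    static readonly Regex InnerTextVariable = new Regex(@"/\$(\w+)\s*$");

    public static ParsedSelector Parse(string selector)
    {
        var result = new ParsedSelector();
        var text = selector.Trim();

        // Cut off the trailing innerText variable, remembering its name.
        var inner = InnerTextVariable.Match(text);
        if (inner.Success)
        {
            result.InnerTextVariable = inner.Groups[1].Value;
            text = text.Substring(0, inner.Index);
        }

        // Record each attribute variable.
        foreach (Match m in AttributeVariable.Matches(text))
        {
            result.AttributeVariables[m.Groups[1].Value] = m.Groups[2].Value;
        }

        // Replace @attr=$var with @attr so the remainder is ordinary XPath.
        result.XPath = AttributeVariable.Replace(text, "@$1");
        return result;
    }
}
```

With these assumptions, VariableSketch.Parse(".//a[@href=$url]/$title") yields the XPath .//a[@href], the attribute variable href mapped to url, and the innerText variable title. Literal predicates such as @row-type='photo' are left untouched because they contain no $ variable.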

Code examples

See AvitoHtmlXPathTemplateFinderFixture.cs

NuGet

https://www.nuget.org/packages/Html.XPath.Template.Finder/
Install-Package Html.XPath.Template.Finder

References

HtmlAgilityPack library
- MIT License

Disclaimer

Developed for educational purposes only.
The avito.ru, e-katalog.ru, and market.yandex.ru websites are used only as examples.
- Don't forget to check their policies
  - https://www.avito.ru/info/polzovatelskoe_soglashenie
  - https://yandex.ru/legal/market_termsofuse/?target=terms-of-use
