This is an example of how to crawl a website with the HtmlAgilityPack NuGet package and save the results to a text file.
To install the HtmlAgilityPack package in your project, do the following in Visual Studio:
- Go to Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution...
- Search for HtmlAgilityPack in the search bar.
- Once it is found, select your project and click Install.
- In the Review Changes window, click on OK.
The above steps will install HtmlAgilityPack in your project and display a confirmation readme.txt. To be completely sure that the package was added, check your project's References.
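Alternatively, the same package can be installed from the Package Manager Console (Tools -> NuGet Package Manager -> Package Manager Console) with:
Install-Package HtmlAgilityPack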
For this project, the website http://www.espn.com/nba/statistics was crawled as an example. At the time this project was developed, the site had the following tables:
- Offensive Leaders,
- Defensive Leaders,
- Assists,
- Blocks,
- Field Goal, and
- Steals
with the ranking, name, hyperlink and points of each player. After analyzing the HTML, the structures that contain this data need to be identified. For this project, the structures:
<div class="mod-container mod-table mod-no-footer">
and
<div class="mod-container mod-table mod-no-footer mod-no-header">
were found. These two structures contain the players' data.
After identifying the needed sections, the website crawl can be done in three easy steps.
The selected website has 4 sections of type
<div class="mod-container mod-table mod-no-footer">
and another 4 sections of type
<div class="mod-container mod-table mod-no-footer mod-no-header">
but we will only use the first two of the mod-no-footer type, because the last two belong to a different part of the page and can be ignored.
The first step is to crawl the main site. In order to do this, the code
string mainSite = "http://www.espn.com/nba/statistics";
HtmlWeb site = new HtmlWeb();
HtmlDocument htmlDocument = site.Load(mainSite);
is needed. Please note that for these commands to work, you need the following using directive at the top of the file:
using HtmlAgilityPack;
Now that the main site has been crawled, we have to access the individual sections that contain the data. For this project, we use the code
HtmlNodeCollection leaderBoards_01 = htmlDocument.DocumentNode.SelectNodes("//div[@class='mod-container mod-table mod-no-footer']"); //We need only the first two.
HtmlNodeCollection leaderBoards_02 = htmlDocument.DocumentNode.SelectNodes("//div[@class='mod-container mod-table mod-no-footer mod-no-header']"); //We will use all of them
This code creates two HtmlNodeCollection objects: one containing all the
<div class="mod-container mod-table mod-no-footer">
sections, and one containing all the
<div class="mod-container mod-table mod-no-footer mod-no-header">
sections.
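As noted above, only the first two nodes of leaderBoards_01 are relevant. A minimal sketch of how they could be filtered (Take requires 'using System.Linq;'):
var relevantBoards = leaderBoards_01.Take(2); //keep only the Offensive and Defensive Leaders tables
foreach (HtmlNode board in relevantBoards)
{
    //parse each leaderboard here, as shown further below
}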
About the HtmlNodeCollection objects:
- If you prefix the XPath expression with a double slash (//), the search starts at the root of the document, no matter which node SelectNodes is called on.
- Without the //, the search is relative to the node it is called on.
So, for example,
HtmlNodeCollection collection = Parent.SelectNodes("//div[@class='foo bar']");
searches the whole document (and will probably find divs that are not inside Parent), while
HtmlNodeCollection collection = Parent.SelectNodes("div[@class='foo bar']");
only looks for div elements that are direct children of Parent (one level down).
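If you need to search every descendant of Parent, at any depth, rather than only its direct children, XPath's .// prefix does that:
HtmlNodeCollection collection = Parent.SelectNodes(".//div[@class='foo bar']"); //relative search through all descendants of Parent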
After reaching the HTML level that contains the needed data, it can be read as plain values through the Attributes collection or the InnerText property, as in
HtmlNode cell_name_aTag = cell_name.SelectNodes("a").FirstOrDefault(); //FirstOrDefault requires 'using System.Linq;'
string link = cell_name_aTag.Attributes["href"].Value; //Value is already a string, so no ToString() is needed
string name = cell_name_aTag.InnerText;
With these values, plain C# objects can be constructed.
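A minimal sketch of such an object and of how one table row could be turned into it; the Player fields match the ones used in SaveToFile below, but the row layout (which cell holds the name, which holds the points) is an assumption about ESPN's markup at the time, and board is one node from the loop sketched earlier:
public class Player
{
    public string name;
    public string link;
    public string points;
}

List<Player> listOfPlayers = new List<Player>();
foreach (HtmlNode row in board.SelectNodes(".//tr[td]")) //rows that contain data cells
{
    HtmlNode cell_name = row.SelectNodes("td")[1]; //assumed: the second column holds the name
    HtmlNode cell_name_aTag = cell_name.SelectNodes("a").FirstOrDefault();
    if (cell_name_aTag == null) continue; //skip rows without a player link
    listOfPlayers.Add(new Player
    {
        name = cell_name_aTag.InnerText,
        link = cell_name_aTag.Attributes["href"].Value,
        points = row.SelectNodes("td").Last().InnerText //assumed: the last column holds the points
    });
}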
After the data has been crawled, it can be saved in different places (a database, for example). For this project, it is saved to a text file by the following code
public static bool SaveToFile(List<Player> listOfPlayers, string title)
{
    //Requires a reference to System.Configuration and 'using System.Configuration;'.
    //ConfigurationSettings.AppSettings is obsolete; ConfigurationManager is the current API.
    string fileName = ConfigurationManager.AppSettings["File.location"];
    try
    {
        //'using' automatically flushes and closes the stream; it also calls IDisposable.Dispose on the stream object.
        using (System.IO.StreamWriter file = new System.IO.StreamWriter(fileName, true))
        {
            file.WriteLine("---------------------");
            file.WriteLine(title);
            file.WriteLine("");
            for (int i = 0; i < listOfPlayers.Count; i++)
            {
                int counter = i + 1; //ranking is 1-based
                file.WriteLine(counter + ", " + listOfPlayers[i].name + ", " + listOfPlayers[i].points + ", " + listOfPlayers[i].link);
            }
            file.WriteLine("---------------------");
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
        return false;
    }
    return true;
}
which returns true on success and false on failure.
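A hypothetical call site, assuming listOfPlayers was filled as sketched above:
bool saved = SaveToFile(listOfPlayers, "Offensive Leaders");
Console.WriteLine(saved ? "Leaderboard saved." : "Saving the leaderboard failed.");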
For this function:
- The file path is read from ConfigurationManager.AppSettings["File.location"] (see the note in the code about the obsolete ConfigurationSettings class). Please check the project's App.config to verify its value.
- The statement
using (System.IO.StreamWriter file = new System.IO.StreamWriter(fileName, true)){
...
}
is used because it automatically flushes and closes the stream and calls IDisposable.Dispose on the stream object.
When running the project, each step's status is output to console.
After running the program, a text file is generated at C:\webcrawler_results.txt. (To change the path or the file name, modify the File.location value in the App.config file.)
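For reference, the relevant App.config entry would look roughly like this (only the appSettings key matters here):
<configuration>
  <appSettings>
    <add key="File.location" value="C:\webcrawler_results.txt"/>
  </appSettings>
</configuration>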