Skip to content

sropelato/Fetchino

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Fetchino

Build Status

Fetchino is a lightweight web scraper, helping you get structured information from websites. It uses configuration files to describe how the data you're looking for can be retrieved and makes it super easy to access the data in an API-like way.

Example: Who played in the first Lord of the Rings movie?

The following configuration will tell Fetchino to open the imdb.com page for "The Lord of the Rings: The Fellowship of the Ring" and find all actors.

Config
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <data>
        <list name="actors" type="string" />
    </data>
    <workflow>
        <request url="http://www.imdb.com/title/tt0120737/fullcredits" />
        <addToList list="actors" path="//table[@class='cast_list']/tbody/tr[@class]/td[2]/a" />
    </workflow>
</config>

The names of the actors are stored in a list "actors" which can be accessed as shown below:

Java Code
Fetchino fetchino = Fetchino.fromConfig("./LordOfTheRingsActors.xml"));
fetchino.fetch();
fetchino.getContext().getList("actors").forEach(System.out::println);