Spider

Multithreaded web crawler written in Rust.

Dependencies

On Debian or other DEB-based distributions:

$ sudo apt install openssl libssl-dev

On Fedora and other RPM-based distributions:

$ sudo dnf install openssl-devel

Usage

Add this dependency to your Cargo.toml file.

[dependencies]
spider = "1.2.1"

Then you'll be able to use the library. Here is a simple example:

extern crate spider;

use spider::website::Website;

fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Crawl every page reachable from the root URL.
    website.crawl();

    // Print the URL of each page that was visited.
    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}
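
As the example shows, crawl() blocks until the crawl has finished, after which get_pages() exposes the collected results. As a minimal sketch using only the calls shown above, you can count the crawled pages instead of printing them:

extern crate spider;

use spider::website::Website;

fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl(); // returns once the crawl completes

    // Tally the pages collected during the crawl.
    let mut count = 0;
    for _page in website.get_pages() {
        count += 1;
    }
    println!("crawled {} pages", count);
}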

You can use the Configuration object to configure your crawler:

// ..
let mut website: Website = Website::new("https://choosealicense.com");
website.configuration.blacklist_url.push("https://choosealicense.com/licenses/".to_string());
website.configuration.respect_robots_txt = true;
website.configuration.verbose = true; // Defaults to false
website.configuration.delay = 2000; // Defaults to 250 ms
website.configuration.concurrency = 10; // Defaults to 4
website.configuration.user_agent = "myapp/version".to_string(); // Defaults to spider/x.y.z, where x.y.z is the library version
website.crawl();
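
Putting the two examples together, a polite, filtered crawl might look like the sketch below. The blacklist entry, delay, and user agent string are illustrative values, not recommendations:

extern crate spider;

use spider::website::Website;

fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");

    // Skip the licenses index and honor the site's robots.txt.
    website.configuration.blacklist_url.push("https://choosealicense.com/licenses/".to_string());
    website.configuration.respect_robots_txt = true;

    // Wait 2 seconds between requests and identify this client.
    website.configuration.delay = 2000;
    website.configuration.user_agent = "myapp/0.1.0".to_string();

    website.crawl();

    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}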

TODO

  • multi-threaded system
  • respect robots.txt file
  • add configuration object for polite delay, etc.
  • add polite delay
  • parse command line arguments

Contribute

I am open to any contribution. Just fork the repository and commit your changes on a separate branch.
