A tool for getting information from external websites. Powered by PhantomJS and the malahierba.cl dev team.
Add to your composer.json:
{
    "require": {
        "malahierba-lab/web-harvester": "1.*"
    }
}
Then run the composer update command.
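For example, from your project root:

composer update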
After installation you must configure the Service Provider. Simply add it to the providers array in config/app.php:
Malahierba\WebHarvester\WebHarvesterServiceProvider::class
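For example, the providers array in config/app.php would then include (other entries omitted):

'providers' => [

    // ... other service providers ...

    Malahierba\WebHarvester\WebHarvesterServiceProvider::class,

],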
Now you need to publish the config file. Simply execute php artisan vendor:publish.
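If you prefer to publish only this package's files, vendor:publish also accepts a --provider option (using the provider class registered above):

php artisan vendor:publish --provider="Malahierba\WebHarvester\WebHarvesterServiceProvider"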
Laravel Web Harvester runs using the PhantomJS headless WebKit browser. This tool is included as a binary, so before you can use this package you need to specify your OS. This is done in the config file config/webharvester.php.
You need to set the environment option to one of the supported values:
- linux-i686-32
- linux-i686-64
- macosx
- windows
Example: 'environment' => 'macosx'
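A minimal sketch of the published config/webharvester.php, assuming environment is the only option you need to change (the published file may contain additional settings):

<?php

return [

    // OS used to select the bundled PhantomJS binary:
    // linux-i686-32, linux-i686-64, macosx or windows
    'environment' => 'macosx',

];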
Important: for documentation purposes, the examples below always assume that you import the class into your namespace with use Malahierba\WebHarvester\WebHarvester;
Get basic information from a URL:

$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check if we can process the URL and load it
if ($webharvester->load($url)) {

    // Page title
    $title = $webharvester->getTitle();

    // Page description
    $description = $webharvester->getDescription();

    // Status code (if the URL redirects to another page, this returns the status code of the final page)
    $status_code = $webharvester->getStatusCode();

    // Featured image as URL
    $featured_image_url = $webharvester->getFeaturedImage();

    // Featured image as base64
    $featured_image_base_64 = $webharvester->getFeaturedImage('base64');

    // Real URL (if $url redirects to another page, this returns the final URL)
    $real_url = $webharvester->getRealURL();

    // Site name
    $sitename = $webharvester->getSiteName();
}
Check whether a page allows indexing and link following:

$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check if we can process the URL and load it
if ($webharvester->load($url)) {

    // Check if the page allows indexing
    if ($webharvester->isIndexable()) {
        // ... some code
    }

    // Check if the page's links may be followed
    if ($webharvester->isFollowable()) {
        // ... some code
    }
}
Get the links found on a page:

$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check if we can process the URL and load it
if ($webharvester->load($url)) {

    // All full links, as an array
    $links = $webharvester->getLinks();

    // All links as an array, with the query component removed (from the "?" character onwards)
    $links = $webharvester->getLinks([
        'remove' => ['query'],
    ]);

    // Links as an array of objects (properties: url, follow).
    // follow is false when the source website marks the link as nofollow (rel="nofollow").
    $links = $webharvester->getLinks(['only_urls' => false]); // default: true
}
Important: for security reasons, links with embedded JavaScript are not included in the output array.
Get the raw content of a page:

$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check if we can process the URL and load it
if ($webharvester->load($url)) {
    $raw = $webharvester->content();
}
Take a screenshot of a page:

$url = 'http://someurl';
$webharvester = new WebHarvester;

// Check if we can process the URL and take a screenshot of it
if ($webharvester->takeScreenshot($url)) {
    $image_base_64 = $webharvester->content(); // returns a base64 string
}
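As a usage note, the base64 string can be decoded and written to disk. The line below assumes the screenshot is encoded as a PNG; check the package to confirm the actual image format.

// Decode the base64 string and save it to a file (format assumed to be PNG)
file_put_contents('screenshot.png', base64_decode($image_base_64));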
You can customize the WebHarvester with the following methods:
$webharvester = new WebHarvester;

// Custom user agent
$webharvester->setUserAgent('your user agent');

// Ignore SSL errors
$webharvester->setIgnoreSSLErrors(true);

// Resource timeout (in milliseconds)
$webharvester->setResourceTimeout(3000);

// Wait after load (in milliseconds) - useful for getting async content
$webharvester->setWaitAfterLoad(3000);
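For example, a minimal sketch combining these options with a load call (the URL and user-agent string are placeholders):

$webharvester = new WebHarvester;
$webharvester->setUserAgent('My Harvester Bot'); // placeholder user agent
$webharvester->setIgnoreSSLErrors(true);         // accept sites with SSL problems
$webharvester->setWaitAfterLoad(3000);           // give async content 3 seconds to render

if ($webharvester->load('http://someurl')) {
    $title = $webharvester->getTitle();
}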
This project is released under the MIT licence. For more information, please read the LICENCE file.