An open-source web crawler that extracts internal link information for SEO auditing and optimization purposes. The project builds upon the Puppeteer headless browser and is inspired by the Arachnid PHP library.
- Simple NodeJS library with asynchronous crawling capability.
- Crawls site pages, bounded by maximum depth or maximum result count.
- Implements the BFS (Breadth-First Search) algorithm, traversing pages level by level.
- Event-driven implementation that enables users of the library to consume output in real time (crawling started/completed/skipped/failed, etc.).
- Extracts the following SEO-related information for each page in a site:
  - Page title, main heading (H1), and subheading (H2) tag contents.
  - Page status code/text, enabling detection of broken links (4xx/5xx).
  - Meta tag information, including description, keywords, author, and robots tags.
  - Broken image resources and images with missing alt attributes.
  - Page indexability status and, if the page is not indexable, the reason (e.g. blocked by robots.txt, client error, canonicalized).
  - Information about page resources (document, stylesheet, JavaScript, and image files, etc. requested by a page).
- More SEO-oriented information will be added soon...
NodeJS v10.0.0+ is required.
npm install @web-extractors/arachnid-seo
const Arachnid = require('@web-extractors/arachnid-seo').default;
const crawler = new Arachnid('https://www.example.com');
crawler.setCrawlDepth(2)
  .traverse()
  .then((results) => console.log(results)); // page results
// or using async/await:
// const results = await crawler.traverse();
results output:
Map(3) {
  "https://www.example.com/" => {
    "url": "https://www.example.com/",
    "urlEncoded": "https://www.example.com/",
    "isInternal": true,
    "statusCode": 200,
    "statusText": "",
    "contentType": "text/html; charset=UTF-8",
    "depth": 1,
    "resourceInfo": [
      {
        "type": "document",
        "count": 1,
        "broken": []
      }
    ],
    "responseTimeMs": 340,
    "DOMInfo": {
      "title": "Example Domain",
      "h1": [
        "Example Domain"
      ],
      "h2": [],
      "meta": [],
      "images": {
        "missingAlt": []
      },
      "canonicalUrl": "",
      "uniqueOutLinks": 1
    },
    "isIndexable": true,
    "indexabilityStatus": ""
  },
  "https://www.iana.org/domains/example" => {
    "url": "https://www.iana.org/domains/example",
    "urlEncoded": "https://www.iana.org/domains/example",
    "statusCode": 301,
    "statusText": "",
    "contentType": "text/html; charset=iso-8859-1",
    "isInternal": false,
    "robotsHeader": null,
    "depth": 2,
    "redirectUrl": "https://www.iana.org/domains/reserved",
    "isIndexable": false,
    "indexabilityStatus": "Redirected"
  },
  "https://www.iana.org/domains/reserved" => {
    "url": "https://www.iana.org/domains/reserved",
    "urlEncoded": "https://www.iana.org/domains/reserved",
    "isInternal": false,
    "statusCode": 200,
    "statusText": "",
    "contentType": "text/html; charset=UTF-8",
    "depth": 2,
    "isIndexable": true,
    "indexabilityStatus": ""
  }
}
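Since traverse() resolves to a Map keyed by URL (as shown above), the results can be post-processed directly. A minimal sketch that scans the result fields above to report broken links and non-indexable pages:

const Arachnid = require('@web-extractors/arachnid-seo').default;

(async () => {
  const crawler = new Arachnid('https://www.example.com');
  const results = await crawler.setCrawlDepth(2).traverse();

  for (const [url, info] of results) {
    // 4xx/5xx status codes indicate broken links
    if (info.statusCode >= 400) {
      console.log(`Broken link (${info.statusCode}): ${url}`);
    }
    // indexabilityStatus carries the reason when a page is not indexable
    if (!info.isIndexable) {
      console.log(`Not indexable (${info.indexabilityStatus || 'unknown'}): ${url}`);
    }
  }
})();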
The library is designed using the Builder pattern to construct the Arachnid-SEO crawling options flexibly, as follows:
To specify the maximum link depth to crawl, the setCrawlDepth method can be used.
A depth equal to 1 is used by default if neither setCrawlDepth nor setMaxResultsNum is set.
crawler.setCrawlDepth(3);
To specify the maximum number of results to crawl, the setMaxResultsNum method can be used.
setMaxResultsNum takes precedence over setCrawlDepth when both methods are used.
crawler.setMaxResultsNum(100);
To improve crawl speed, the package crawls 5 URLs concurrently by default. To change that concurrency value, the setConcurrency method can be used.
This modifies the number of pages/tabs created by Puppeteer at the same time; increasing it to a large number may have a memory impact.
crawler.setConcurrency(10);
To pass additional arguments to the Puppeteer browser instance, the setPuppeteerOptions method can be used.
Refer to the Puppeteer documentation for more information about the available options.
The sample below runs Arachnid-SEO on UNIX without the need to install extra dependencies:
crawler.setPuppeteerOptions({
  args: [
    '--disable-gpu',
    '--disable-dev-shm-usage',
    '--disable-setuid-sandbox',
    '--no-first-run',
    '--no-sandbox',
    '--no-zygote',
    '--single-process'
  ]
});
By default, only internal links on the same domain are crawled and their information extracted.
To also follow subdomain links, the shouldFollowSubdomains method can be used:
crawler.shouldFollowSubdomains(true);
By default, the crawler respects robots.txt allow/disallow rules. To ignore robots rules, the ignoreRobots method can be used:
crawler.ignoreRobots();
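Putting it together: as the chained calls in the examples above suggest, each setter returns the crawler instance, so options can be combined in builder style. A minimal sketch using the methods documented above (the URL and values are illustrative):

const Arachnid = require('@web-extractors/arachnid-seo').default;

// Illustrative configuration; tune depth and concurrency for the target site.
const crawler = new Arachnid('https://www.example.com')
  .setCrawlDepth(3)
  .setConcurrency(10)
  .shouldFollowSubdomains(true)
  .ignoreRobots();

crawler.traverse().then((results) => console.log(`${results.size} pages crawled`));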
Arachnid-SEO provides a way to track crawling activity and progress by emitting various events, as below:
const Arachnid = require('@web-extractors/arachnid-seo').default;
const crawler = new Arachnid('https://www.example.com/')
  .setConcurrency(5)
  .setCrawlDepth(2);

crawler.on('results', response => console.log(response))
  .on('pageCrawlingSuccessed', pageResponse => processResponsePerPage(pageResponse))
  .on('pageCrawlingFailed', pageFailed => handleFailedCrawling(pageFailed));
// See https://github.com/web-extractors/arachnid-seo-js#using-events for full list of events emitted
crawler.traverse();
See Full examples for the complete list of events emitted.
- info
  - Emitted when a general activity takes place, like getting the next page batch to process.
  - Payload: <InformativeMessage(String)>
- error
  - Emitted when an error occurs while processing a link or batch of links, e.g. a URL with an invalid hostname.
  - Payload: <ErrorMessage(String)>
- pageCrawlingStarted
  - Emitted when crawling of a page starts (Puppeteer opens a tab for the page URL).
  - Payload: <{url(String), depth(int)}>
- pageCrawlingSuccessed
  - Emitted when a success response is received for a URL (2xx/3xx).
  - Payload: <{url(String), statusCode(int)}>
- pageCrawlingFailed
  - Emitted when a failure response is received for a URL (4xx/5xx).
  - Payload: <{url(String), statusCode(int)}>
- pageCrawlingFinished
  - Emitted when the page URL is marked as processed, after extracting all information and adding it to the results map.
  - Payload: <{url(String), ResultInfo}>
- pageCrawlingSkipped
  - Emitted when crawling or extracting page info is skipped due to non-HTML content or an invalid or external link.
  - Payload: <{url(String), reason(String)}>
- results
  - Emitted when crawling of all links matching the parameters is completed, returning all links' information.
  - Payload: <Map<{url(String), ResultInfo}>>
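For instance, the event payloads above can be wired into a simple real-time progress log. A minimal sketch using the documented events and payload fields:

const Arachnid = require('@web-extractors/arachnid-seo').default;

const crawler = new Arachnid('https://www.example.com/').setCrawlDepth(2);

crawler
  .on('pageCrawlingStarted', ({ url, depth }) => console.log(`[started] depth=${depth} ${url}`))
  .on('pageCrawlingFailed', ({ url, statusCode }) => console.warn(`[failed] ${statusCode} ${url}`))
  .on('pageCrawlingSkipped', ({ url, reason }) => console.log(`[skipped] ${url}: ${reason}`))
  .on('results', (results) => console.log(`Done: ${results.size} pages crawled`));

crawler.traverse();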
We are still in Beta version 🌑
Feel free to raise a ticket under the Issues tab or submit PRs for any new bug fix, feature, or enhancement.
- Zeid Rashwani http://zrashwani.com
- Ahmad Khasawneh https://github.com/AhmadKhasanweh
MIT License