Most hidden wikis are just scam directories. The sights that do not work are most likely just forgotten about sites that have went down. If you want to actually explore onions, find Daniel's directory link. I would give it to you, but I don't have it. -- reddit, ewna843
The crawler gathers onions from Daniel's directory and puts them in a stack. While the stack is not empty, the crawler pops a website from the stack and visits some of its pages. From these pages it gathers links to other websites and pushes them onto the stack, but only if they have not already been visited.
The crawler also takes a screenshot of each visited website and replaces all NSFW images using a classifier, to prevent any harmful material from being shown. Be aware that this classifier is not perfect; it uses nsfw.js under the hood. The crawler only accepts HTML, stylesheets, images, and fonts; requests for other resources, such as scripts, are intercepted and aborted to prevent any unwanted exposure.
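As an illustration, here is a minimal Puppeteer sketch of that request filtering, using request interception. The allowed resource types follow the list above; the function name and URL are hypothetical, not the project's actual code:

```typescript
import puppeteer from 'puppeteer';

// Resource types the crawler lets through; everything else is aborted.
const ALLOWED = new Set(['document', 'stylesheet', 'image', 'font']);

// Hypothetical helper, for illustration only.
async function openFilteredPage(url: string): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', (req) => {
    // Abort scripts and any other disallowed resource to avoid
    // unwanted exposure; NSFW image replacement happens separately,
    // via the classifier service.
    if (ALLOWED.has(req.resourceType())) req.continue();
    else req.abort();
  });
  await page.goto(url);
  await page.screenshot({ path: 'page.png' });
  await browser.close();
}

openFilteredPage('https://example.com'); // placeholder URL
```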
Here is the pseudo-code:
use_proxy tor
stack = daniel_directory
visited = empty set
while stack not empty
    website = pop stack
    add website to visited
    visit website
    remove_nsfw_images website
    screenshot website
    stack = stack + (extract_links website - visited)
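For concreteness, the loop can be fleshed out as a TypeScript sketch with Puppeteer. Everything specific here is an assumption: the seed URL, the SOCKS5 proxy address, and the screens/ output directory are placeholders, not the project's actual configuration.

```typescript
import fs from 'node:fs';
import puppeteer, { Browser } from 'puppeteer';

const DANIEL_DIRECTORY = 'http://example.onion'; // placeholder seed, not the real URL
const TOR_PROXY = 'socks5://127.0.0.1:9050';     // assumed Tor SOCKS5 address

async function crawl(): Promise<void> {
  fs.mkdirSync('screens', { recursive: true }); // assumed output directory
  const browser: Browser = await puppeteer.launch({
    args: [`--proxy-server=${TOR_PROXY}`],
  });
  const stack: string[] = [DANIEL_DIRECTORY];
  const visited = new Set<string>();

  while (stack.length > 0) {
    const website = stack.pop()!;
    if (visited.has(website)) continue;
    visited.add(website);

    const page = await browser.newPage();
    try {
      await page.goto(website, { waitUntil: 'networkidle2', timeout: 60_000 });
      // remove_nsfw_images would run here, before the screenshot.
      await page.screenshot({ path: `screens/${visited.size}.png` });
      // Push links to other onion sites that have not been visited yet.
      const links = await page.$$eval('a[href]', (anchors) =>
        anchors.map((a) => (a as HTMLAnchorElement).href),
      );
      for (const link of links) {
        if (link.includes('.onion') && !visited.has(link)) stack.push(link);
      }
    } catch {
      // Onion services go down constantly; skip unreachable ones.
    } finally {
      await page.close();
    }
  }
  await browser.close();
}

crawl();
```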
To speed up crawling, multiple instances of the crawler can be launched; this is done with a single browser and multiple pages (tabs), as sketched below.
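A minimal sketch of that pattern, assuming the workers share the stack and visited set from the loop above; worker, CONCURRENCY, and the seed URL are illustrative names, not the project's:

```typescript
import puppeteer, { Browser } from 'puppeteer';

const CONCURRENCY = 4; // number of parallel pages; an arbitrary choice

async function worker(browser: Browser, stack: string[], visited: Set<string>) {
  // Each worker drives its own page but shares the work queue.
  while (stack.length > 0) {
    const website = stack.pop()!;
    if (visited.has(website)) continue;
    visited.add(website);
    const page = await browser.newPage();
    try {
      await page.goto(website, { timeout: 60_000 });
      // ...screenshot, extract links, push them onto the shared stack...
    } catch {
      // Skip unreachable onions.
    } finally {
      await page.close();
    }
  }
  // Note: a worker quits when the stack is momentarily empty; a fuller
  // implementation would wait for pages still in flight.
}

async function main(): Promise<void> {
  const browser = await puppeteer.launch();
  const stack = ['http://example.onion']; // placeholder seed
  const visited = new Set<string>();
  // One browser, CONCURRENCY pages working off the same stack.
  await Promise.all(
    Array.from({ length: CONCURRENCY }, () => worker(browser, stack, visited)),
  );
  await browser.close();
}

main();
```

Because Node.js runs JavaScript on a single thread, the synchronous pop and Set operations need no locking; the workers only interleave at await points.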
The crawler is decomposed into four services, orchestrated using docker-compose; a sketch of a possible compose file follows the list.
- Tor SOCKS5 proxy: Provides a Tor proxy to be used by the other services
- NSFW classifier: An API that classifies whether the image at a given URL is safe for work, using nsfw.js
- Chrome browser (Puppeteer): Crawls the web starting from Daniel's directory
- Autoheal: Restarts any unhealthy service, especially the Tor proxy when the circuit seems down
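A hypothetical compose file tying these together might look as follows; every image name, path, port, and healthcheck here is an assumption rather than the project's actual configuration:

```yaml
version: "3"
services:
  tor:
    image: dperson/torproxy        # assumed image providing a Tor SOCKS5 proxy
    labels:
      - autoheal=true              # mark for autoheal to watch
    healthcheck:
      # Healthy only if a request can actually exit through Tor.
      test: ["CMD", "curl", "-s", "--socks5-hostname", "localhost:9050", "https://check.torproject.org"]
      interval: 60s
      retries: 3

  classifier:
    build: ./classifier            # assumed path to the nsfw.js API service
    ports:
      - "3000:3000"                # assumed API port

  crawler:
    build: ./crawler               # assumed path to the Puppeteer service
    depends_on:
      - tor
      - classifier

  autoheal:
    image: willfarrell/autoheal
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

Autoheal only restarts containers that both carry the label and define a healthcheck, which is why the Tor service declares one.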
After installing Docker, go to the dark-crawler folder and execute this command:
docker-compose up -d
Then serve the website, and you will see the dark crawler in action.
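To confirm the services came up and watch the crawler work, the standard docker-compose commands apply (the crawler service name follows the hypothetical compose file above):

```sh
docker-compose ps            # all four services should be Up / healthy
docker-compose logs -f crawler
```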