WebCrawler

I didn't make this fully automatic.
Expect 80-260 Mbps of download traffic on an i7-2660, and abuse reports coming to your ISP. You will also be overloading BIND (or whichever DNS server you run).

How to start?

  1. Get the latest dump from http://rdf.dmoz.org/ (file: content.rdf.u8);
    use urlServer/content2urlList.php to build the first URL file.
  2. Run the server providing URLs:
    urlServer/urlServer.php data
  3. Create a "dl" ramdrive folder in the downloader directory. Its size depends on how fast the server can parse files;
    on an i7-2660 parsing keeps up in near real time, so 2 GB is enough.
  4. Run parserServer/parserServer.php.
  5. Run parser9; it will scan the dl directory.
  6. Run downloader_remote.php; it will start downloading.

parserServer.php produces 2 types of files:
.data -> backlink data
.queue -> URLs to download
To create a unique list of URLs to download, use urlServer/uniq.c.

urlServer/urlServer.php -> feeds the downloaders with the URL list
downloader/downloader_remote.php -> uses the curl multi interface; curl has to be compiled with the threaded resolver
downloader/downloader_threads_remote.php -> PHP downloader using pthreads; PHP has issues with multiple threads, so it often crashed (internal PHP error)
uniq.c -> BST-based duplicate remover; when the crawler produces a mass of URLs, the unique list for urlServer has to be created somehow

FreeBSD ramdrive setup:
mdconfig -a -t malloc -s 2048m -u 1
newfs -U md1
mount /dev/md1 /root/dl2/dl/

FreeBSD UDP tuning (BIND sends lots of packets):
sysctl net.inet.udp.recvspace=168320
sysctl net.inet.udp.maxdgram=36864
