I didn't make this work fully automatically.
Expect 80-260 Mbps of download traffic on an i7-2660 and abuse complaints coming to your ISP. You will also be overloading Bind (or whatever other DNS server you use).
How to start?
- Get the latest dump from http://rdf.dmoz.org/ (file: content.rdf.u8)
  use: urlServer/content2urlList.php to build the first url file
- Run the server providing urls:
  urlServer/urlServer.php data
- Create a ramdrive "dl" folder in the downloader directory. Size depends on how fast the server can parse files.
  On an i7-2660 parsing is near real time, so 2 GB is enough.
- Run parserServer/parserServer.php
- Run parser9. It will scan the dl directory.
- Run downloader_remote.php. It will start downloading.
parserServer.php produces 2 types of files:
.data -> this is backlink data
.queue -> these are urls to download
To create a unique list of urls to download use: urlServer/uniq.c
urlServer/urlServer.php -> feeds downloaders with a list of urls
downloader/downloader_remote.php -> downloader using curl multithread; curl has to be compiled with the threaded resolver
downloader/downloader_threads_remote.php -> php downloader using pthread; php has some issues with multiple threads, so it often crashes (internal php error)
uniq.c -> BST (binary search tree) duplicate remover; when the crawler produces a massive amount of urls, the list for urlServer has to be created somehow
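
For readers who have not seen a BST used for deduplication, here is a minimal sketch of the idea behind the uniq.c entry above. It is not the code from uniq.c. The names Node, bst_insert, bst_free and dedup_stream are made up for illustration, and the 4096-byte line buffer is an assumption (longer lines would be split).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One tree node per unique url (hypothetical layout, not taken from uniq.c). */
typedef struct Node {
    char *key;
    struct Node *left, *right;
} Node;

/* Insert key into the tree.
 * Returns 1 if the key was new, 0 if it was a duplicate, -1 if malloc failed.
 * Iterative on purpose: an unbalanced BST fed sorted input degrades into a
 * long chain, and recursion could then exhaust the stack. */
int bst_insert(Node **root, const char *key) {
    assert(root != NULL && key != NULL);
    Node **cur = root;
    while (*cur != NULL) {
        int c = strcmp(key, (*cur)->key);
        if (c == 0)
            return 0; /* already seen */
        cur = (c < 0) ? &(*cur)->left : &(*cur)->right;
    }
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    size_t len = strlen(key) + 1;
    n->key = malloc(len);
    if (n->key == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->key, key, len);
    n->left = n->right = NULL;
    *cur = n;
    return 1;
}

/* Free the whole tree without recursion by rotating left children up. */
void bst_free(Node *root) {
    while (root != NULL) {
        if (root->left != NULL) {
            Node *l = root->left;
            root->left = l->right;
            l->right = root;
            root = l;
        } else {
            Node *r = root->right;
            free(root->key);
            free(root);
            root = r;
        }
    }
}

/* Copy lines from in to out, keeping only the first occurrence of each.
 * Returns the number of unique lines written. Stops early if memory runs out. */
long dedup_stream(FILE *in, FILE *out) {
    char line[4096]; /* assumed maximum url length */
    Node *root = NULL;
    long unique = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0'; /* strip the line ending */
        if (line[0] == '\0')
            continue; /* skip empty lines */
        int r = bst_insert(&root, line);
        if (r < 0)
            break;
        if (r == 1) {
            fprintf(out, "%s\n", line);
            unique++;
        }
    }
    bst_free(root);
    return unique;
}
```

To use it as a filter, call dedup_stream(stdin, stdout) from main and pipe the concatenated .queue files through it. Every unique url stays in memory, so RAM use grows with the size of the list.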
FreeBSD ramdrive setup:
mdconfig -a -t malloc -s 2048m -u 1
newfs -U md1
mount /dev/md1 /root/dl2/dl/
FreeBSD UDP tuning (Bind sends lots of packets):
sysctl net.inet.udp.recvspace=168320
sysctl net.inet.udp.maxdgram=36864