WebCrawler

I didn't make this fully automatic.
Expect 80-260 Mbps of download traffic on an i7-2660, and abuse reports coming to your ISP. You will also be overloading BIND (or whichever DNS server you run).

How to start?

  1. Get the latest dump from http://rdf.dmoz.org/ (file: content.rdf.u8);
    use urlServer/content2urlList.php to build the first URL file.
  2. Run the server providing URLs:
    urlServer/urlServer.php data
  3. Create a "dl" ramdrive folder in the downloader directory. Its size depends on how fast the server can parse files;
    on an i7-2660 parsing keeps up in near real time, so 2 GB is enough.
  4. Run parserServer/parserServer.php.
  5. Run parser9; it will scan the dl directory.
  6. Run downloader_remote.php; it will start downloading.

parserServer.php produces 2 types of files:
.data -> backlink data
.queue -> URLs to download
To create a unique list of URLs to download, use urlServer/uniq.c.

urlServer/urlServer.php -> feeds the downloaders with the URL list
downloader/downloader_remote.php -> uses the curl multi interface; curl has to be compiled with the threaded resolver
downloader/downloader_threads_remote.php -> PHP downloader using pthreads; PHP has issues with multiple threads, so it often crashed (internal PHP error)
uniq.c -> BST-based duplicate remover; when the crawler produces a mass of URLs, the unique list for urlServer has to be created somehow

FreeBSD ramdrive setup:
mdconfig -a -t malloc -s 2048m -u 1
newfs -U md1
mount /dev/md1 /root/dl2/dl/

FreeBSD UDP tuning (BIND sends lots of packets):
sysctl net.inet.udp.recvspace=168320
sysctl net.inet.udp.maxdgram=36864
