Spiderz

* Distributed crawling of Wikipedia using Apache Storm

* Store inverted index of keywords and link counts in Redis

* Handle search queries from web clients using Node.js

More details to come soon!
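The inverted index described above maps keywords to the articles that contain them, along with counts for ranking; in Redis this layout maps naturally onto one sorted set per keyword (e.g. `ZINCRBY`). A minimal in-memory sketch of that layout (the `addDocument` helper and sample articles are illustrative, not from this repo):

```javascript
// Sketch of an inverted index: keyword -> Map(articleTitle -> count).
// With real Redis, each keyword could be a sorted set updated via
// ZINCRBY <keyword> <count> <articleTitle>.
const index = new Map();

function addDocument(title, text) {
  // Index both the title and the body text of the article.
  const words = (title + " " + text).toLowerCase().split(/\W+/).filter(Boolean);
  for (const word of words) {
    if (!index.has(word)) index.set(word, new Map());
    const postings = index.get(word);
    postings.set(title, (postings.get(title) || 0) + 1);
  }
}

addDocument("Apache Storm", "Storm is a distributed realtime computation system");
addDocument("Redis", "Redis is an in-memory data structure store");

// Postings for "distributed": which articles mention it, and how often.
console.log(index.get("distributed"));
```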

To run

  1. Update `conf/redis.conf` with the IP and port Redis should bind to
  2. Start the Redis data store: `sudo redis-server conf/redis.conf`
  3. Start the Storm crawler topology: `storm jar target/spiderz-1.0-SNAPSHOT-jar-with-dependencies.jar edu.ncsu.spiderz.WikiCrawlerTopology REDIS_IP REDIS_PORT`
  4. Start the Node.js search engine: `nodejs wikiSearch/app.js REDIS_IP REDIS_PORT`
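Once the crawler has populated the index, the query side reduces to intersecting the postings for each query keyword and ranking results by their summed counts (roughly what `ZINTERSTORE` does over keyword sorted sets in Redis). A hedged sketch of that query step, with in-memory Maps standing in for Redis and a `search` helper that is illustrative, not this repo's API:

```javascript
// Query side: keep titles present in every keyword's postings,
// rank by summed keyword counts (highest first).
function search(index, query) {
  const keywords = query.toLowerCase().split(/\W+/).filter(Boolean);
  const scores = new Map();
  for (const title of index.get(keywords[0])?.keys() || []) {
    if (keywords.every((k) => index.get(k)?.has(title))) {
      const score = keywords.reduce((s, k) => s + index.get(k).get(title), 0);
      scores.set(title, score);
    }
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([t]) => t);
}

// Toy index: keyword -> Map(articleTitle -> count).
const index = new Map([
  ["storm", new Map([["Apache Storm", 3]])],
  ["distributed", new Map([["Apache Storm", 1], ["Hadoop", 2]])],
]);

// Only "Apache Storm" matches both keywords.
console.log(search(index, "distributed storm"));
```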

TODO

  1. Use the first paragraph of each article to improve indexing and search quality
  2. Support queries for multiple topics at a time
  3. Test in fully distributed mode

About

Distributed web crawl, index, query - Apache Storm, Redis, Node.js
