dshaw edited this page Sep 14, 2010 · 2 revisions

So, why write a spider in Node.js?

Well, really there might not be a good reason to do this. I woke up one Saturday morning and decided it might be fun to explore what was possible. Node.js might lend itself to some interesting new emergent behavior in web crawling …and then again it might not.

Caveat Emptor

I write web applications and do not have a background in data mining. I did not start this adventure by researching web crawler implementations. I simply defined what I thought might be useful and went about implementing that functionality.

Phase 1

Node Spider should accept a URL and identify all of the links on that page.

Known Deficiencies

  1. Spider does not send a User-Agent string.
  2. Spider does not respect robots.txt.