Most open-source Web crawlers (e.g. Apache Nutch) handle focused crawling by relying on a keyword or document list compiled by subject matter experts, combined with relevance measures such as cosine similarity or a Naïve Bayes classifier. This work extends Nutch with a semi-supervised method for creating the keyword list and a relevance model that considers both text content and hyperlink structure. It was developed for the Planetary Defense Framework Gateway project, a NASA-funded effort to develop a cyberinfrastructure for scientific collaboration across different organizations. Please refer to the slides here for more detail.
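To illustrate the general idea of scoring a page on both its text and its links, here is a minimal sketch. It is not the code used in this project: the function names, the use of inbound anchor text as a stand-in for hyperlink structure, and the weight `alpha` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance(page_text, anchor_texts, keywords, alpha=0.7):
    """Blend a content score with a link score (hypothetical weighting).

    page_text    -- visible text of the candidate page
    anchor_texts -- anchor texts of links pointing to the page (assumed
                    here as a simple proxy for hyperlink structure)
    keywords     -- the topic keyword list
    alpha        -- weight of the content score, between 0 and 1
    """
    kw = Counter(tokenize(" ".join(keywords)))
    content_score = cosine_similarity(Counter(tokenize(page_text)), kw)
    link_score = cosine_similarity(Counter(tokenize(" ".join(anchor_texts))), kw)
    return alpha * content_score + (1 - alpha) * link_score


keywords = ["asteroid", "impact", "deflection"]
on_topic = relevance("Near-Earth asteroid impact mitigation",
                     ["asteroid deflection"], keywords)
off_topic = relevance("Recipe for chocolate cake", ["baking tips"], keywords)
print(on_topic > off_topic)
```

A crawler could use such a score to decide which outlinks to fetch first. In Nutch itself, logic of this kind would typically live in a scoring plugin written in Java rather than in a standalone script.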
For the latest information about Nutch, please visit our website at:
and our wiki, at:
To get started using Nutch, read the tutorial: