simpleNutchSolrSetup
This tutorial will create a simple Nutch 2.2.1 + Solr 4.3.1 setup.
Since 2.x Nutch uses Apache Gora as a datastore backend. You will have to choose a specific Gora datastore. In this tutorial we use HBase 0.90.4.
Be careful: only certain versions of these tools work together seamlessly, so don't automatically choose the latest version of each program.
Create a new directory, download these files, and extract them. We will call this directory trynutch in this tutorial.
- Nutch. This tutorial uses Nutch 2.2.1.
- HBase. This tutorial uses HBase 0.90.4.
- Solr. This tutorial uses Solr 4.3.1.
You will need to set the HBase and ZooKeeper storage directories. Edit trynutch/hbase-0.90.4/conf/hbase-site.xml:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///path/to/trynutch/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/path/to/trynutch/zookeeper</value>
  </property>
</configuration>
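Because both values must be absolute paths, it can be convenient to generate the file rather than type it. The following is only a sketch: `write_hbase_site` is a hypothetical helper name, it overwrites the target file, and it assumes the base directory contains no characters that need XML escaping.

```shell
# Hypothetical helper: write a minimal hbase-site.xml whose two storage
# directories live under the given base directory.
write_hbase_site() {
    base_dir=$1      # absolute path to trynutch, e.g. /path/to/trynutch
    conf_file=$2     # e.g. "$base_dir/hbase-0.90.4/conf/hbase-site.xml"

    # base_dir starts with "/", so "file://$base_dir" yields "file:///...".
    cat > "$conf_file" <<EOF
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://$base_dir/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$base_dir/zookeeper</value>
  </property>
</configuration>
EOF
}
```

Run from inside trynutch, it could be called as `write_hbase_site "$(pwd)" hbase-0.90.4/conf/hbase-site.xml`.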
After this you should be able to start HBase with the following command:
$ ./trynutch/hbase-0.90.4/bin/start-hbase.sh
Once HBase is running, you can open the HBase shell from the trynutch/hbase-0.90.4 directory:
$ ./bin/hbase shell
You can stop HBase again with this command:
$ ./trynutch/hbase-0.90.4/bin/stop-hbase.sh
(On my machine, stop-hbase.sh sometimes takes forever. Deleting trynutch/hbase and trynutch/zookeeper, clearing /tmp, and restarting a couple of times seems to fix this.)
If you have trouble running HBase on an Ubuntu system, look at /etc/hosts and check whether your hostname and localhost have the same IP address (127.0.0.1). On Ubuntu systems the hostname is nowadays mapped to 127.0.1.1, which can cause this problem.
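A quick way to check for this mapping is to scan the hosts file. This is a rough sketch; `has_loopback_alias` is a hypothetical helper name, and it only looks at lines whose first field is exactly 127.0.1.1.

```shell
# Hypothetical helper: succeed (exit 0) if the hosts file maps the given
# hostname to 127.0.1.1, and fail (exit 1) otherwise.
has_loopback_alias() {
    hosts_file=$1
    host_name=$2
    awk -v h="$host_name" '
        $1 == "127.0.1.1" {
            # Fields after the address are the names assigned to it.
            for (i = 2; i <= NF; i++) if ($i == h) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$hosts_file"
}
```

For example, `has_loopback_alias /etc/hosts "$(hostname)" && echo "hostname is mapped to 127.0.1.1"` reports whether the current machine is affected.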
We need to set up a name for our web crawler. We also need to tell Nutch that we use HBase as the Gora datastore backend. Edit trynutch/apache-nutch-2.2.1/conf/nutch-site.xml:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>your-crawler-name</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
Change this line in trynutch/apache-nutch-2.2.1/conf/gora.properties:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
Open trynutch/apache-nutch-2.2.1/ivy/ivy.xml. Scroll down to the Gora artifacts section and uncomment this line:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
Now we need to compile Nutch (since 2.x, only source archives are available).
$ cd trynutch/apache-nutch-2.2.1/
$ ant runtime
(This might take a long time on the first run. On my machine it took 25 minutes.)
The database schema that comes with Nutch is outdated. Download this schema and save it as trynutch/solr-4.3.1/example/solr/collection1/conf/schema.xml.
Start Solr.
$ cd trynutch/solr-4.3.1/example/
$ java -jar start.jar
If Solr is running you should be able to access the following site:
http://localhost:8983/solr/admin/
Make sure HBase and Solr are running.
To limit the crawl range for this tutorial, edit trynutch/apache-nutch-2.2.1/runtime/local/conf/regex-urlfilter.txt and change the last line to:
+^http://work-at-google.com
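If you prefer to script this edit, the last line can be swapped out from the shell. This is a sketch under two assumptions: `set_last_filter_rule` is a hypothetical helper name, and the file's final line really is the catch-all rule you want to replace (check for a trailing blank line first).

```shell
# Hypothetical helper: replace the last line of a URL filter file with a new rule.
set_last_filter_rule() {
    filter_file=$1
    rule=$2
    tmp_file=$(mktemp) || return 1

    # Copy everything except the last line, append the new rule,
    # then write the result back over the original file.
    sed '$d' "$filter_file" > "$tmp_file" &&
        printf '%s\n' "$rule" >> "$tmp_file" &&
        cat "$tmp_file" > "$filter_file"
    status=$?

    rm -f "$tmp_file"
    return $status
}
```

From trynutch/apache-nutch-2.2.1/runtime/local/ this could be called as `set_last_filter_rule conf/regex-urlfilter.txt '+^http://work-at-google.com'`.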
Crawl with Nutch.
$ cd trynutch/apache-nutch-2.2.1/runtime/local/
$ mkdir urls
$ echo "http://work-at-google.com" > urls/seed.txt
$ bin/nutch inject urls
$ bin/nutch generate -topN 5
$ bin/nutch fetch -all
$ bin/nutch parse -all
$ bin/nutch updatedb
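The generate, fetch, parse, and updatedb steps above form one crawl round. To follow links found in earlier rounds, the same four steps can be repeated. The sketch below wraps them in a loop; `crawl_round` and `crawl_rounds` are hypothetical helper names, and the `NUTCH` variable is an assumption added so the command can be overridden.

```shell
# Command used to invoke Nutch; override NUTCH to point elsewhere.
NUTCH=${NUTCH:-bin/nutch}

# One crawl round: the same four steps as above, stopping at the first failure.
crawl_round() {
    top_n=$1
    $NUTCH generate -topN "$top_n" &&
        $NUTCH fetch -all &&
        $NUTCH parse -all &&
        $NUTCH updatedb
}

# Repeat crawl_round a fixed number of times.
crawl_rounds() {
    rounds=$1
    top_n=$2
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "round $i of $rounds"
        crawl_round "$top_n" || return 1
        i=$((i + 1))
    done
}
```

After `bin/nutch inject urls`, running `crawl_rounds 3 5` from runtime/local/ would perform three rounds with up to five URLs each.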
Now feed this data to Solr.
$ bin/nutch solrindex http://localhost:8983/solr/ -all
You can now search your data in Solr at http://localhost:8983/solr/#/collection1/query.
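The index can also be queried from the command line through Solr's select handler. This sketch only builds the request URL; `solr_select_url` is a hypothetical helper name, and it does no URL encoding, so it is only suitable for simple queries such as `*:*`.

```shell
# Hypothetical helper: build a select URL for a Solr core.
solr_select_url() {
    base_url=${1%/}   # strip one trailing slash, if present
    core=$2
    query=$3
    printf '%s/%s/select?q=%s&wt=json\n' "$base_url" "$core" "$query"
}
```

With Solr running, `curl "$(solr_select_url http://localhost:8983/solr/ collection1 '*:*')"` should return the indexed documents as JSON.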