imfht/setup.md

## setup.md

Info

This guide sets up a non-clustered Nutch crawler that stores its data in HBase. It does not cover how to set up Hadoop et al., only the bare minimum needed to crawl and index websites on a single machine.
Terms


Nutch - the crawler (fetches and parses websites)
HBase - filesystem storage for Nutch (Hadoop component, basically)
Gora - filesystem abstraction, used by Nutch (HBase is one of the possible implementations)
ElasticSearch - index/search engine, searching the data created by Nutch (does not use HBase, but its own data structures and storage)

Requirements


OpenJDK 7 & ant
Nutch 2.3 RC (yes, you need 2.3, 2.2 will not work)
HBase 0.94.26 (HBase 0.98 won't work)
ElasticSearch 1.4.2

Install OpenJDK, ant and ElasticSearch via your package manager of choice (ES can also be installed from the .deb linked above, if needed).
Extract Nutch and HBase somewhere. From now on, we will refer to the Nutch root directory by $NUTCH_ROOT and the HBase root by $HBASE_ROOT.
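Since every later step depends on these two variables, it can help to verify them before continuing. The following is only a minimal sketch: `check_roots` is a hypothetical helper, and the two files it looks for are simply ones this guide touches later (`ivy/ivy.xml` for Nutch, `bin/start-hbase.sh` for HBase).

```shell
# Hypothetical helper: return 0 if both directories look like unpacked
# Nutch and HBase trees, judged by files this guide uses later on.
check_roots() {
  nutch_root=$1
  hbase_root=$2
  if [ ! -f "$nutch_root/ivy/ivy.xml" ]; then
    echo "no ivy/ivy.xml under $nutch_root" >&2
    return 1
  fi
  if [ ! -f "$hbase_root/bin/start-hbase.sh" ]; then
    echo "no bin/start-hbase.sh under $hbase_root" >&2
    return 1
  fi
  return 0
}
```

It would be called as `check_roots "$NUTCH_ROOT" "$HBASE_ROOT"`.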
Setting up HBase


edit $HBASE_ROOT/conf/hbase-site.xml and add

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

edit $HBASE_ROOT/conf/hbase-env.sh and enable JAVA_HOME and set it to the proper path:

-# export JAVA_HOME=/usr/java/jdk1.6.0/
+export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
This step might seem redundant, but even with JAVA_HOME being set in my shell, HBase just didn't recognize it.

kick off HBase:

$HBASE_ROOT/bin/start-hbase.sh
Setting up Nutch


enable the HBase dependency in $NUTCH_ROOT/ivy/ivy.xml by uncommenting the line

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />

configure the HBase adapter by editing the $NUTCH_ROOT/conf/gora.properties:

-#gora.datastore.default=org.apache.gora.mock.store.MockDataStore
+gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

build Nutch

$ cd $NUTCH_ROOT
$ ant clean
$ ant runtime
This can take a while and creates $NUTCH_ROOT/runtime/local.

configure Nutch by editing $NUTCH_ROOT/runtime/local/conf/nutch-site.xml:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mycrawlername</value> <!-- this can be changed to something more sane if you like -->
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files -->
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value> <!-- do not leave the seeded domains (optional) -->
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value> <!-- where is ElasticSearch listening -->
  </property>
</configuration>

configure HBase integration by editing $NUTCH_ROOT/runtime/local/conf/hbase-site.xml:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value> <!-- same path as you've given for HBase above -->
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
That's it. Everything is now set up to crawl websites.
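Because the two hbase-site.xml files must name the same `hbase.rootdir`, a quick comparison can catch a copy-paste mistake. This is a rough sketch, not a real XML parser: `site_property` and `same_rootdir` are hypothetical helpers, and they assume `<name>` and `<value>` sit on consecutive lines, as in the snippets above.

```shell
# Hypothetical helper: print the value of one property from a *-site.xml file.
# Assumes the <value> line directly follows the matching <name> line.
site_property() {
  file=$1
  name=$2
  # On the <name> line, step to the next line and keep only the value text.
  sed -n "/<name>$name<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}" "$file"
}

# Return 0 if both files define the same, non-empty hbase.rootdir.
same_rootdir() {
  a=$(site_property "$1" hbase.rootdir)
  b=$(site_property "$2" hbase.rootdir)
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For the layout used in this guide, the call would be `same_rootdir "$HBASE_ROOT/conf/hbase-site.xml" "$NUTCH_ROOT/runtime/local/conf/hbase-site.xml"`.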
Adding new Domains to crawl with Nutch


create an empty directory and add a text file containing a list of seed URLs, one per line.

$ mkdir seed
$ echo "https://www.website.com" >> seed/urls.txt
$ echo "https://www.another.com" >> seed/urls.txt
$ echo "https://www.example.com" >> seed/urls.txt

inject them into Nutch by giving a file URL (!)

$ $NUTCH_ROOT/runtime/local/bin/nutch inject file:///path/to/seed/
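When seeds are added over time, plain `echo ... >>` can leave duplicate lines in the seed file. A small wrapper can avoid that; `add_seed` is a hypothetical helper shown only as a sketch, and it compares URLs as exact strings without any normalization.

```shell
# Hypothetical helper: append a URL to the seed file unless that exact
# line is already present. Creates the directory and file if needed.
add_seed() {
  seed_file=$1
  url=$2
  mkdir -p "$(dirname "$seed_file")"
  touch "$seed_file"
  # -x: whole-line match, -F: fixed string (no regex), -q: no output
  grep -qxF -- "$url" "$seed_file" || printf '%s\n' "$url" >> "$seed_file"
}
```

For example, `add_seed seed/urls.txt "https://www.example.com"` can be run repeatedly without growing the file; the inject step afterwards is unchanged.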
Actual Crawling Procedure


Generate a new set of URLs to fetch. This is based on both the injected URLs and the outdated URLs in the Nutch crawl db.

$ $NUTCH_ROOT/runtime/local/bin/nutch generate -topN 10
The above command will create job batches for 10 URLs.

Fetch the URLs. We are not clustering, so we can simply fetch all batches:

$ $NUTCH_ROOT/runtime/local/bin/nutch fetch -all

Now we parse all fetched pages:

$ $NUTCH_ROOT/runtime/local/bin/nutch parse -all

Last step: Update Nutch's internal database:

$ $NUTCH_ROOT/runtime/local/bin/nutch updatedb -all
On the first run, this will only crawl the injected URLs. The procedure above is supposed to be repeated regularly to keep the index up to date.
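The four steps above can be wrapped in a loop for repeated runs. This is a minimal sketch under stated assumptions: `crawl_rounds` and the `NUTCH_BIN` override are hypothetical names, and the loop assumes each nutch step exits non-zero on failure.

```shell
# Hypothetical helper: run N rounds of generate -> fetch -> parse -> updatedb.
# Stops at the first failing step instead of carrying on with a broken round.
crawl_rounds() {
  rounds=$1
  top_n=$2
  # NUTCH_BIN is an assumed override; the default is the path used in this guide.
  nutch_bin=${NUTCH_BIN:-${NUTCH_ROOT:-.}/runtime/local/bin/nutch}
  i=1
  while [ "$i" -le "$rounds" ]; do
    echo "round $i of $rounds"
    "$nutch_bin" generate -topN "$top_n" || return 1
    "$nutch_bin" fetch -all || return 1
    "$nutch_bin" parse -all || return 1
    "$nutch_bin" updatedb -all || return 1
    i=$((i + 1))
  done
}
```

`crawl_rounds 3 10` would run three rounds with batches of 10 URLs each; scheduling it (for example from cron) is left to the reader.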
Putting Documents into ElasticSearch

Easy peasy:
$ $NUTCH_ROOT/runtime/local/bin/nutch index -all
Query for Documents

The usual ElasticSearch way:
$ curl -X GET "http://localhost:9200/_search?q=my%20term"
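ElasticSearch's URI search reads the search term from the `q` parameter, and the term has to be URL-encoded. Rather than encoding by hand, curl can do it with `-G` and `--data-urlencode`. The sketch below wraps this in a hypothetical `es_search` function; `ES_URL` and `CURL` are assumed override variables, not anything Nutch or ElasticSearch defines.

```shell
# Hypothetical helper: free-text search across all indices via URI search.
# ES_URL defaults to the local instance used in this guide.
# CURL can be set to "echo" for a dry run that prints the arguments.
es_search() {
  term=$1
  # -G turns the data into a query string; --data-urlencode escapes the value.
  ${CURL:-curl} -s -G "${ES_URL:-http://localhost:9200}/_search" \
    --data-urlencode "q=$term" \
    --data-urlencode "pretty=true"
}
```

`es_search "my term"` then sends the same request as the curl line above, with the space encoded by curl itself.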