@xrstf
Last active October 3, 2022 13:30
Nutch 2.3 + ElasticSearch 1.4 + HBase 0.94 Setup

Info

This guide sets up a non-clustered Nutch crawler, which stores its data via HBase. We will not learn how to set up Hadoop et al., but just the bare minimum to crawl and index websites on a single machine.

Terms

  • Nutch - the crawler (fetches and parses websites)
  • HBase - data store for Nutch (a column-oriented database from the Hadoop ecosystem)
  • Gora - storage abstraction used by Nutch (HBase is one of the possible backends)
  • ElasticSearch - index/search engine, searching on data created by Nutch (does not use HBase, but its own data structure and storage)

Requirements

Install OpenJDK, ant and ElasticSearch via your package manager of choice (ES can also be installed by using the .deb linked above, if needed).

Extract Nutch and HBase somewhere. From now on, we will refer to the Nutch root directory by $NUTCH_ROOT and the HBase root by $HBASE_ROOT.
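To make the later commands copy-pasteable, the two variables can be exported in the shell. This is only a sketch: the directory names below are hypothetical placeholders, so point them at wherever the archives were actually extracted.

```shell
# Hypothetical install locations (assumption); adjust to your own paths.
# An already exported value is kept as-is.
export NUTCH_ROOT="${NUTCH_ROOT:-$HOME/apache-nutch-2.3}"
export HBASE_ROOT="${HBASE_ROOT:-$HOME/hbase-0.94}"

echo "Nutch root: $NUTCH_ROOT"
echo "HBase root: $HBASE_ROOT"
```

Adding the two export lines to your shell profile keeps them available in new terminals.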

Setting up HBase

  1. edit $HBASE_ROOT/conf/hbase-site.xml and add
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
  2. edit $HBASE_ROOT/conf/hbase-env.sh, uncomment JAVA_HOME and set it to the proper path:
-# export JAVA_HOME=/usr/java/jdk1.6.0/
+export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

This step might seem redundant, but even with JAVA_HOME being set in my shell, HBase just didn't recognize it.

  3. kick off HBase:
$HBASE_ROOT/bin/start-hbase.sh
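
Two small helpers can make the steps above easier to verify. This is a sketch under assumptions, not part of HBase: the function names are made up for illustration, `detect_java_home` assumes `java` is on the PATH and that `readlink -f` is available, and `has_hmaster` assumes the standalone HBase JVM shows up as `HMaster` in the output of `jps` (a tool shipped with the JDK).

```shell
# Print a JAVA_HOME candidate derived from the `java` binary found on PATH.
detect_java_home() {
  java_bin=$(command -v java) || return 1
  java_real=$(readlink -f "$java_bin") || return 1
  # Strip the trailing /bin/java to get the installation directory.
  printf '%s\n' "${java_real%/bin/java}"
}

# Succeed if the process listing read from stdin contains an HMaster JVM.
has_hmaster() {
  grep -q 'HMaster'
}

if candidate=$(detect_java_home); then
  echo "JAVA_HOME candidate for hbase-env.sh: $candidate"
fi

# Only inspect the JVM list if jps is installed.
if command -v jps >/dev/null 2>&1; then
  if jps | has_hmaster; then
    echo "HBase master is running"
  else
    echo "HMaster not found; check the files under \$HBASE_ROOT/logs" >&2
  fi
fi
```

On some distributions the resolved path ends in a `jre` subdirectory; that is usually still acceptable as JAVA_HOME for running HBase, but treat the printed value as a suggestion to check by hand.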

Setting up Nutch

  1. enable the HBase dependency in $NUTCH_ROOT/ivy/ivy.xml by uncommenting the line
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
  2. configure the HBase adapter by editing $NUTCH_ROOT/conf/gora.properties:
-#gora.datastore.default=org.apache.gora.mock.store.MockDataStore
+gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
  3. build Nutch
$ cd $NUTCH_ROOT
$ ant clean
$ ant runtime

This can take a while and creates $NUTCH_ROOT/runtime/local.

  4. configure Nutch by editing $NUTCH_ROOT/runtime/local/conf/nutch-site.xml:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mycrawlername</value> <!-- this can be changed to something more sane if you like -->
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files -->
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value> <!-- do not leave the seeded domains (optional) -->
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value> <!-- where is ElasticSearch listening -->
  </property>
</configuration>
  5. configure HBase integration by editing $NUTCH_ROOT/runtime/local/conf/hbase-site.xml:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value> <!-- same path as you've given for HBase above -->
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

That's it. Everything is now set up to crawl websites.
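
Since hbase.rootdir has to be identical in both hbase-site.xml files, a quick comparison can catch copy-and-paste mistakes. The helper names below are hypothetical, and the sketch assumes the `<value>` element sits on the line directly after its `<name>` element, as in the snippets in this guide.

```shell
# Print the value of a property from a *-site.xml file.
# $1 = property name, $2 = path to the XML file
site_property() {
  grep -A1 "<name>$1</name>" "$2" | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Succeed only if both files define the same, non-empty hbase.rootdir.
same_rootdir() {
  a=$(site_property hbase.rootdir "$1")
  b=$(site_property hbase.rootdir "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

It could be called as `same_rootdir "$HBASE_ROOT/conf/hbase-site.xml" "$NUTCH_ROOT/runtime/local/conf/hbase-site.xml" && echo "rootdir matches"`.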

Adding new Domains to crawl with Nutch

  1. create an empty directory and add a text file containing a list of seed URLs.
$ mkdir seed
$ echo "https://www.website.com" >> seed/urls.txt
$ echo "https://www.another.com" >> seed/urls.txt
$ echo "https://www.example.com" >> seed/urls.txt
  2. inject them into Nutch by giving a file URL (!)
$ $NUTCH_ROOT/runtime/local/bin/nutch inject file:///path/to/seed/

Actual Crawling Procedure

  1. Generate a new set of URLs to fetch. This is based on both the injected URLs and the outdated URLs in the Nutch crawl db.
$ $NUTCH_ROOT/runtime/local/bin/nutch generate -topN 10

The above command will create job batches for 10 URLs.

  2. Fetch the URLs. We are not clustering, so we can simply fetch all batches:
$ $NUTCH_ROOT/runtime/local/bin/nutch fetch -all
  3. Now we parse all fetched pages:
$ $NUTCH_ROOT/runtime/local/bin/nutch parse -all
  4. Last step: Update Nutch's internal database:
$ $NUTCH_ROOT/runtime/local/bin/nutch updatedb -all

On the first run, this will only crawl the injected URLs. The procedure above is supposed to be repeated regularly to keep the index up to date.
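
Repeating the four steps by hand gets tedious, so they can be wrapped in a small loop. This is a minimal sketch that only chains the commands shown above; the function names and the NUTCH_BIN variable are made up for illustration, and it assumes each nutch command exits with a non-zero status when it fails.

```shell
# Path to the nutch launcher; can be overridden via the environment.
NUTCH_BIN="${NUTCH_BIN:-$NUTCH_ROOT/runtime/local/bin/nutch}"

# One crawl round: generate -> fetch -> parse -> updatedb.
# $1 = number of URLs to select per round (passed to -topN)
crawl_round() {
  "$NUTCH_BIN" generate -topN "$1" &&
  "$NUTCH_BIN" fetch -all &&
  "$NUTCH_BIN" parse -all &&
  "$NUTCH_BIN" updatedb -all
}

# Run several rounds in a row, stopping at the first failing step.
# $1 = number of rounds, $2 = topN per round
crawl_rounds() {
  rounds="$1"
  top_n="$2"
  i=1
  while [ "$i" -le "$rounds" ]; do
    echo "crawl round $i/$rounds"
    crawl_round "$top_n" || return 1
    i=$((i + 1))
  done
}
```

For example, `crawl_rounds 3 10` would run three rounds of up to 10 URLs each; each round can pick up links discovered in the previous one. Scheduling the same call from cron is one way to repeat it regularly.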

Putting Documents into ElasticSearch

Easy peasy:

$ $NUTCH_ROOT/runtime/local/bin/nutch index -all
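
To check whether the indexing run actually put anything into ElasticSearch, the document count can be queried afterwards. A sketch, assuming ElasticSearch listens on localhost:9200 and that its `_count` endpoint answers with JSON of the form `{"count":N,...}`; the ES_URL variable and the function name are hypothetical.

```shell
# Base URL of the ElasticSearch node (assumption: default local setup).
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the number of documents ElasticSearch reports across all indices.
es_doc_count() {
  curl -s "$ES_URL/_count" | sed -n 's/.*"count":\([0-9][0-9]*\).*/\1/p'
}
```

Running `es_doc_count` before and after `nutch index -all` should show the number go up if documents were indexed.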

Query for Documents

The usual ElasticSearch way:

$ curl -X GET "http://localhost:9200/_search?q=my%20term"
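
Encoding search terms by hand is error-prone, so curl can do it instead. The sketch below assumes the same local ElasticSearch node; ES_URL and the function name are hypothetical, while `-G` and `--data-urlencode` are regular curl options and `q` and `pretty` are ElasticSearch URI search parameters.

```shell
# Base URL of the ElasticSearch node (assumption: default local setup).
ES_URL="${ES_URL:-http://localhost:9200}"

# Search all indices for the given term.
# -G makes curl append the --data-urlencode pairs to the URL as a query string.
es_search() {
  curl -s -G "$ES_URL/_search" \
    --data-urlencode "q=$1" \
    --data-urlencode "pretty=true"
}
```

Called as `es_search "my term"`, this sends the same kind of request as the curl line above, with the space encoded automatically.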
@chocolim

Try with HBASE 9.8
Look at this

@searchandanalytics

Nice article.
I am using Nutch 2.3.1 and have many URLs in my seed.txt file, e.g. http://www.foodurl1.com, http://www.foofurl2.com etc., and want to index all URLs in ES under a single index, e.g. foodindex, with each URL as a separate type, e.g. foodindex/foodurl1, foodindex/foodurl2, so I can search each URL individually based on type. Is there any out-of-the-box way to pass the type in elasticindexwriter.java in the ES indexer plugin (by default it uses the "doc" type)?
public void write(NutchDocument doc) throws IOException {
    String id = (String) doc.getFieldValue("id");
    String type = doc.getDocumentMeta().get("type");
    if (type == null)
        type = "doc";
    IndexRequestBuilder request = client.prepareIndex(defaultIndex, type, id);

or any other suggestion to achieve this.

Thanks in advance, Mra

@purplewall1206

hbase-0.94 didn't work well on my server; Nutch throws exceptions like

java.lang.ClassNotFoundException: org.apache.gora.hbase.store.HBaseStore
As suggested on the Nutch wiki, HBase 0.98.8-hadoop2 works well on my server (CentOS 6).

@AcroSaizin

This is a good tutorial.
But I have a problem when injecting the URL file.
$ bin/nutch inject seed/urls.txt

The following error occurs:
bash nutch command not found

Could you please help me solve this?
Regards.

@jainruchika

Hi,

Does anyone know how to integrate Tika with Nutch 2.3 + HBase 0.98.80 + Solr 4.1.0?

@PammyS

PammyS commented Jun 12, 2017

Everything works fine apart from indexing. I get this output, and there is nothing in Elasticsearch:

Elastic Version: 1.7.2
Nutch 1.13

Indexer: starting at 2017-06-12 13:42:24
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)
elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)
elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)
elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)

Indexer: number of documents indexed, deleted, or skipped:
Indexer: finished at 2017-06-12 13:42:41, elapsed: 00:00:17

@RohitDhankar

I followed the official Nutch 2.x tutorial and got as far as injecting URLs into Nutch, but that hangs forever.

I have documented my errors and process here. I have read a lot on SO and other blogs, but nothing seems to resolve the issue. Kindly suggest what changes are required.

https://github.com/RohitDhankar/Nutch-Solr-HBase-Ant-Gora-InitialConfig

@kikekr

kikekr commented Oct 12, 2017

I'm trying to deploy this, but when I start the URL injection the process takes forever. I describe it in this post. Can anyone help me?

@shivakyasaram

Thank You

@jacksonp2008

I'm looking for a solution that will work with Nutch 2.3 and Elastic 5.x. Everything works fine but the indexer to elastic. I tried this one: https://github.com/futureweb/indexer-elastic/blob/master/README.md
But it doesn't work either, and the documentation is clearly wrong.

Has anyone got this combination working?

@carlnc

carlnc commented Dec 14, 2017

Spidering websites, converting pages and documents into JSON, and pushing JSON to Elasticsearch ... should not be this hard.

I'm yet to come across an ant/maven project that isn't a nightmare to set up and use.
