@xrstf /setup.md
Last active Oct 21, 2018

Nutch 2.3 + ElasticSearch 1.4 + HBase 0.94 Setup

Info

This guide sets up a non-clustered Nutch crawler, which stores its data via HBase. We will not cover setting up Hadoop et al., just the bare minimum needed to crawl and index websites on a single machine.

Terms

  • Nutch - the crawler (fetches and parses websites)
  • HBase - filesystem storage for Nutch (Hadoop component, basically)
  • Gora - filesystem abstraction, used by Nutch (HBase is one of the possible implementations)
  • ElasticSearch - index/search engine, searching on data created by Nutch (does not use HBase, but its own data structures and storage)

Requirements

Install OpenJDK, ant and ElasticSearch via your package manager of choice (ES can also be installed from the .deb linked above, if you prefer).

Extract Nutch and HBase somewhere. From now on, we will refer to the Nutch root directory by $NUTCH_ROOT and the HBase root by $HBASE_ROOT.
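
For example, assuming both tarballs were extracted into your home directory (the version numbers and paths here are placeholders for whatever you downloaded):

```shell
# Hypothetical paths; point these at wherever you actually extracted the archives.
export NUTCH_ROOT="$HOME/apache-nutch-2.3"
export HBASE_ROOT="$HOME/hbase-0.94.27"
echo "Nutch: $NUTCH_ROOT"
echo "HBase: $HBASE_ROOT"
```

Putting these exports into your shell profile saves retyping the paths in the later steps.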

Setting up HBase

  1. edit $HBASE_ROOT/conf/hbase-site.xml and add
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
  2. edit $HBASE_ROOT/conf/hbase-env.sh, uncomment JAVA_HOME and set it to the proper path:
-# export JAVA_HOME=/usr/java/jdk1.6.0/
+export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/

This step might seem redundant, but even with JAVA_HOME being set in my shell, HBase just didn't recognize it.

  3. kick off HBase:
$HBASE_ROOT/bin/start-hbase.sh
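
To check that HBase actually came up, you can look for its master process. This is a rough sanity check and assumes a JDK whose jps tool is on the PATH:

```shell
# The standalone HBase master runs as a Java process named HMaster.
# If it is missing shortly after start-hbase.sh, check $HBASE_ROOT/logs/.
jps 2>/dev/null | grep HMaster || echo "HMaster not running; check \$HBASE_ROOT/logs/"
```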

Setting up Nutch

  1. enable the HBase dependency in $NUTCH_ROOT/ivy/ivy.xml by uncommenting the line
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
  2. configure the HBase adapter by editing $NUTCH_ROOT/conf/gora.properties:
-#gora.datastore.default=org.apache.gora.mock.store.MockDataStore
+gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
  3. build Nutch:
$ cd $NUTCH_ROOT
$ ant clean
$ ant runtime

This can take a while and creates $NUTCH_ROOT/runtime/local.

  4. configure Nutch by editing $NUTCH_ROOT/runtime/local/conf/nutch-site.xml:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mycrawlername</value> <!-- this can be changed to something more sane if you like -->
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files -->
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value> <!-- do not leave the seeded domains (optional) -->
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value> <!-- where is ElasticSearch listening -->
  </property>
</configuration>
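
If ElasticSearch is not running with default settings, the indexer plugin reads a few more elastic.* properties, which can be added inside the same <configuration> block. A sketch with placeholder values (the property names match what `bin/nutch index` itself prints for the ElasticIndexWriter):

```xml
<!-- optional: only needed when ES is not on its defaults -->
<property>
  <name>elastic.port</name>
  <value>9300</value> <!-- the transport port, not the 9200 HTTP port -->
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value> <!-- must match cluster.name in elasticsearch.yml -->
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value> <!-- keep it lowercase; ES rejects uppercase index names -->
</property>
```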
  5. configure HBase integration by editing $NUTCH_ROOT/runtime/local/conf/hbase-site.xml:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///full/path/to/where/the/data/should/be/stored</value> <!-- same path as you've given for HBase above -->
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

That's it. Everything is now set up to crawl websites.

Adding new Domains to crawl with Nutch

  1. create an empty directory. Add a text file containing a list of seed URLs:
$ mkdir seed
$ echo "https://www.website.com" >> seed/urls.txt
$ echo "https://www.another.com" >> seed/urls.txt
$ echo "https://www.example.com" >> seed/urls.txt
  2. inject them into Nutch by giving a file URL (note the file:// scheme):
$ $NUTCH_ROOT/runtime/local/bin/nutch inject file:///path/to/seed/

Actual Crawling Procedure

  1. Generate a new set of URLs to fetch. This is based on both the injected URLs and outdated URLs in the Nutch crawl db.
$ $NUTCH_ROOT/runtime/local/bin/nutch generate -topN 10

The above command will create job batches for 10 URLs.

  2. Fetch the URLs. We are not clustering, so we can simply fetch all batches:
$ $NUTCH_ROOT/runtime/local/bin/nutch fetch -all
  3. Now we parse all fetched pages:
$ $NUTCH_ROOT/runtime/local/bin/nutch parse -all
  4. Last step: update Nutch's internal database:
$ $NUTCH_ROOT/runtime/local/bin/nutch updatedb -all
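
In practice the four steps above are wrapped in a small script and re-run on a schedule (e.g. from cron). A minimal sketch; NUTCH deliberately defaults to echo here so the script can be dry-run, point it at the real binary for an actual crawl:

```shell
#!/bin/sh
# One crawl iteration: generate -> fetch -> parse -> updatedb.
# NUTCH defaults to `echo` so the sketch only prints the sub-commands;
# set NUTCH="$NUTCH_ROOT/runtime/local/bin/nutch" for a real crawl.
NUTCH="${NUTCH:-echo}"
set -e
"$NUTCH" generate -topN 10   # select the next batch of URLs
"$NUTCH" fetch -all          # download the selected pages
"$NUTCH" parse -all          # extract text and outlinks
"$NUTCH" updatedb -all       # fold the results back into the crawl db
```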

On the first run, this will only crawl the injected URLs. The procedure above is supposed to be repeated regularly to keep the index up to date.

Putting Documents into ElasticSearch

Easy peasy:

$ $NUTCH_ROOT/runtime/local/bin/nutch index -all

Query for Documents

The usual ElasticSearch way:

$ curl -X GET "http://localhost:9200/_search?q=my%20term"

@xentnex commented Mar 17, 2015

Superrr guide on Nutch + Elasticsearch couple !

@ovidiubuligan commented Mar 20, 2015

sorry for posting this here, but I can't build Nutch (why don't they post binaries?). I can't get Ivy to work behind a proxy with username and password authentication. Can you post a built version of Nutch? (In return I will create a Dockerfile and image on the Docker repo.) I have tried what seems to be the most complete option http://www.midvision.com/community/code-blog-for-developers/bid/275503/Allow-access-from-Ivy-to-the-internet-through-a-corporate-firewall-that-requires-authentication but I still get WARN: Your proxy requires authentication.

@thevman commented Apr 7, 2015

Hello,

I got everything else, but my elasticsearch is remote. when I run the "$NUTCH_ROOT/runtime/local/bin/nutch index -all" command, I get "SolrIndexerJob: java.lang.RuntimeException: job failed: name=Indexer, jobid=job_local495526100_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)"

when I run the "$NUTCH_ROOT/runtime/local/bin/nutch elasticindex" command, i get "Error: Could not find or load main class org.apache.nutch.indexer.elastic.ElasticIndexerJob".

Also, I'm using mongo instead of HBase...I don't think that should make a difference. I can verify that the documents were inserted into mongo.

@chris-gunawardena commented Apr 19, 2015

Thanks for the excellent write up, the most uptodate guide I've seen so far.
@ovidiubuligan One way to bypass the corporate proxy is to tether your phones internet to your laptop :D

@hezila commented Apr 29, 2015

Hi @thevman, I get the same problem. Have you fixed it?

@fkostadinov commented May 10, 2015

Great tutorial, thanks! Just a minor point: The tutorial does not mention that one should first make sure Elastic Search is started, i.e. sudo /etc/init.d/elasticsearch start. For complete beginners like me this is not obvious.

@attithur commented Jun 10, 2015

Any solution for the RuntimeException?

@meodorewan commented Jun 16, 2015

[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = /home/fx/Abivin/apache-nutch-2.3/ivy/ivysettings.xml
[ivy:resolve]
[ivy:resolve] :: problems summary ::
[ivy:resolve] :::: WARNINGS
[ivy:resolve] problem while downloading module descriptor: http://repo1.maven.org/maven2/org/codehaus/jettison/jettison/1.3.1/jettison-1.3.1.pom: invalid sha1: expected=8e27f87ba4be16b9c643b09afbb991f903ff81f6 computed=70d363c5b5ac5baa51f98f9ec59b5b47899cf6b6 (1141ms)
[ivy:resolve] module not found: org.codehaus.jettison#jettison;1.3.1
[ivy:resolve] ==== local: tried
[ivy:resolve] /home/fx/.ivy2/local/org.codehaus.jettison/jettison/1.3.1/ivys/ivy.xml
[ivy:resolve] -- artifact org.codehaus.jettison#jettison;1.3.1!jettison.jar:
[ivy:resolve] /home/fx/.ivy2/local/org.codehaus.jettison/jettison/1.3.1/jars/jettison.jar
[ivy:resolve] ==== maven2: tried
[ivy:resolve] http://repo1.maven.org/maven2/org/codehaus/jettison/jettison/1.3.1/jettison-1.3.1.pom
[ivy:resolve] ==== sonatype: tried
[ivy:resolve] http://oss.sonatype.org/content/repositories/releases/org/codehaus/jettison/jettison/1.3.1/jettison-1.3.1.pom
[ivy:resolve] -- artifact org.codehaus.jettison#jettison;1.3.1!jettison.jar:
[ivy:resolve] http://oss.sonatype.org/content/repositories/releases/org/codehaus/jettison/jettison/1.3.1/jettison-1.3.1.jar
[ivy:resolve] ==== apache-snapshot: tried
[ivy:resolve] https://repository.apache.org/content/repositories/snapshots/org/codehaus/jettison/jettison/1.3.1/jettison-1.3.1.pom
[ivy:resolve] -- artifact org.codehaus.jettison#jettison;1.3.1!jettison.jar:
[ivy:resolve] https://repository.apache.org/content/repositories/snapshots/org/codehaus/jettison/jettison/1.3.1/jettison-1.3.1.jar
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: org.codehaus.jettison#jettison;1.3.1: not found
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

BUILD FAILED

@rabinnh commented Jun 28, 2015

The correct indexing command is:

$NUTCH_ROOT/bin/nutch index elasticsearch -all

Or else it may try to use Solr

Also, in conf/nutch-site.xml, check that the properties named "elastic.*" match your host, port, cluster, and the index that you want to save to.

@vampire1306 commented Jul 3, 2015

@numb3r3, @thevman, the problem may be caused by your Elasticsearch config. Mine had the same problem: I had set the index for Nutch to "Article"; after changing it to "nutch", everything was okay. I guess Elasticsearch does not accept uppercase in index names.

@pradumnapanditrao commented Jul 6, 2015

I want to ask about nutch plugin. I am trying to fetch specific data from website. I made few changes in DOMContentUtilities.java file to get specific data. Please guide me for the same.

@anurag2050 commented Jul 30, 2015

Thank you for this great tutorial.
After following this tutorial everything is working fine and the data is stored in HBase in the webpage table, but I am not able to fetch data in Elasticsearch. Can anyone tell me how to display data in Elasticsearch?

@tru3d3v commented Oct 1, 2015

Hi.

Can anyone help me with a re-crawling script?

Thx

@minervax commented Nov 6, 2015

Awesome. Thanks

@javinc commented Nov 9, 2015

will it work on ES 2.0?

@SyedSulaimanM commented Nov 26, 2015

Data is saved in HBase, but it is not indexed into Elasticsearch, and I am not sure why. I tried ES 1.4.2 and 1.7.3; both failed.
I am giving the command 'bin/nutch index elasticsearch -all' for indexing documents.
There are no errors in the log.

Any help pls ??

@SyedSulaimanM commented Nov 26, 2015

@javinc did you try with ES 2.0? Is it working?

@suresh88 commented Dec 4, 2015

Built Nutch successfully. While injecting the seed URLs I am getting the following exception:

~$ /usr/local/nutch-2.3/runtime/local/bin/nutch inject seed/
InjectorJob: starting at 2015-12-04 23:15:31
InjectorJob: Injecting urlDir: seed
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/exceptions/TimeoutIOException
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:115)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:104)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:163)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:137)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.exceptions.TimeoutIOException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 10 more

Any help is highly appreciated

@gkd141 commented Dec 8, 2015

How to use "$ $NUTCH_ROOT/runtime/local/bin/nutch inject file:///path/to/seed/" command.
I am using command prompt to enter these commands and presently at path "C:\Users\Gaurav Kandpal\Desktop\elastic\apache-nutch-2.3-src\apache-nutch-2.3\runtime\local\bin". There is only one file inside bin folder named nutch, when I enter the command I get an error message "'bin' is not recognized as an internal or external command,
operable program or batch file."
If I do cd nutch, then I get error no directory nutch exists.
please tell me the proper way to use the inject command.

@ghost commented Jan 6, 2016

This is a fantastic tutorial!

How can we remove domains/urls?

@dgrene commented Jan 11, 2016

Hi, can you please help me? I am using Elasticsearch 2.x and it is not working in that case.

@vrkansagara commented Jan 13, 2016

Guys, I also tried hard with Elasticsearch 2.1.1, but had no luck with Apache Nutch 2.3.

@wallena3 commented Apr 10, 2016

Has anyone solved the problem of not finding the data in Elasticsearch? I used Elasticsearch 1.4.1 and 1.4.4, but neither could find data.
I also tried ES 2.3.1, but it gives me lots of errors.
Any help will be much appreciated.

@pranipat commented May 10, 2016

The injector goes forever and getting the below error in $NUTCH_ROOT/runtime/local/logs/hadoop.log file

java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2016-05-10 12:42:15,239 WARN zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect

@pranipat commented May 11, 2016

after doing runtime/local/bin/nutch index elasticsearch -all

and to check if we got indices on elastic
curl 'localhost:9200/_cat/indices?v'
gives the below result, which simply means we have no indexes yet in the cluster.

@pranipat commented May 11, 2016

result of runtime/local/bin/nutch index -all or runtime/local/bin/nutch index elasticsearch -all

IndexingJob: starting
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9300)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

But nothing in elastic . Can anyone plz let me know how to solve this?

@idda commented May 18, 2016

great tutorial.

@pranipat, I met the same problem as you. For me, it was because I used "bin/nutch index elasticsearch -all -crawlId 100". I found that if I add the elasticsearch parameter, it creates a default 'webpage' table and indexes from that table, so nothing is put into Elasticsearch. In the log:
Processing remaining requests [docs= 0 , length = 0, total docs = 0]
In hbase shell > list:
"100_webpage", "web_page"
The "web_page" table was created by the index command.

The following commands work for me:
bin/nutch inject conf/urls -crawlId 100
bin/nutch generate -crawlId 100
bin/nutch fetch -all -crawlId 100
bin/nutch parse -all -crawlId 100
bin/nutch updatedb -all -crawlId 100
bin/nutch index -all -crawlId 100

curl 'localhost:9200/_cat/indices?v'
... pri rep ...
... 5 1 ...

hope it is useful for you.

@narendrakadari commented May 27, 2016

Hi everyone, could anyone help me out with this error?
I am using Solr 4.10.3, HBase 0.98.19-hadoop and Nutch 2.3.1.
I am getting data into HBase, but when I do the indexing I get this error in the Nutch logs:

IndexingJob: starting SolrIndexerJob: java.lang.RuntimeException: job failed: name=
[TestCrawl5]Indexer, jobid=job_local112769475_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

Error running:
/usr/local/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/#/collection1 -all -crawlId TestCrawl5
Failed with exit value 255.

@HarshaSuranjith commented Jul 4, 2016

I am getting the following error, Any ideas ??

InjectorJob: starting at 2016-07-04 14:53:20
InjectorJob: Injecting urlDir: /home/ubuntu/seed/urls.txt
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:114)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 10 more

@chocolim commented Aug 29, 2016

Try with HBase 0.98.

@searchandanalytics commented Sep 4, 2016

Nice article,
I am using Nutch 2.3.1 and have many urls in seed.txt file, e.g. http://www.foodurl1.com, http://www.foofurl2.com etc.. and want to index all URL in ES under single index e.g. foodindex and each url as separate type e.g foodindex/foodurl1, foodindex/foodurl2. So i can search each url individually based on type. Is there any out of the box way to pass type in elasticindexwriter.java in ES indexer plugin(by default its using "doc" type).
public void write(NutchDocument doc) throws IOException {
String id = (String) doc.getFieldValue("id");
String type = doc.getDocumentMeta().get("type");
if (type == null)
type = "doc";
IndexRequestBuilder request = client.prepareIndex(defaultIndex, type, id);

or any other suggestion to achieve this.

Thanks in advance, Mra

@purplewall1206 commented Sep 12, 2016

hbase-0.94 didn't work well on my server; Nutch threw exceptions like

java.lang.ClassNotFoundException: org.apache.gora.hbase.store.HBaseStore

So, following the suggestion on the Nutch wiki, I moved to HBase 0.98.8-hadoop2, which works well on my server (CentOS 6).

@AcroSaizin commented Dec 16, 2016

This is a good tutorial.
But I have a problem when injecting the URL file:
$ bin/nutch inject seed/urls.txt

At that time the following error occurred:
bash: nutch: command not found

Could you please help me solve this?
Regards.

@jainruchika commented Dec 24, 2016

Hi,

does anyone know how to integrate Tika with Nutch 2.3 + HBase 0.98.80 + Solr 4.1.0?

@PammyS commented Jun 12, 2017

Everything works fine apart from indexing: this output comes and there is nothing in Elasticsearch.

Elastic Version: 1.7.2
Nutch 1.13

Indexer: starting at 2017-06-12 13:42:24
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)
elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)
elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)
elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)

Indexer: number of documents indexed, deleted, or skipped:
Indexer: finished at 2017-06-12 13:42:41, elapsed: 00:00:17

@RohitDhankar commented Jul 8, 2017

I followed the official Nutch 2.x tutorial and got as far as injecting URLs into Nutch, but that hangs forever.

I have documented my errors and process here; I have read a lot on SO and other blogs, but nothing seems to resolve the issue. Kindly suggest what changes are required:

https://github.com/RohitDhankar/Nutch-Solr-HBase-Ant-Gora-InitialConfig

@kikekr commented Oct 12, 2017

I'm trying to deploy this, but when I start the url injection the process takes forever.. I describe it in this post. Anyone can help me?

@shivakyasaram commented Oct 16, 2017

Thank You

@jacksonp2008 commented Nov 2, 2017

I'm looking for a solution that will work with Nutch 2.3 and Elastic 5.x. Everything works fine but the indexer to elastic. I tried this one: https://github.com/futureweb/indexer-elastic/blob/master/README.md
But it doesn't work either, and the documentation is clearly wrong.

Has anyone got this combination working?

@carlnc commented Dec 14, 2017

Spidering websites, converting pages and documents into JSON, and pushing JSON to Elasticsearch ... should not be this hard.

I'm yet to come across an ant/maven project that isn't a nightmare to setup and use.
