Skip to content

Instantly share code, notes, and snippets.

View robints's full-sized avatar
🎯
Focusing

Robin robints

🎯
Focusing
View GitHub Profile
@robints
robints / setup.md
Created November 23, 2016 08:15 — forked from xrstf/setup.md
Nutch 2.3 + ElasticSearch 1.4 + HBase 0.94 Setup

Info

This guide sets up a non-clustered Nutch crawler, which stores its data via HBase. We will not learn how to setup Hadoop et al., but just the bare minimum to crawl and index websites on a single machine.

Terms

  • Nutch - the crawler (fetches and parses websites)
  • HBase - filesystem storage for Nutch (Hadoop component, basically)

Intro

The plan is to create a pair of executables (ngrok and ngrokd) that are connected with a self-signed SSL cert. Since the client and server executables are paired, you won't be able to use any other ngrok to connect to this ngrokd, and vice versa.

DNS

Add two DNS records: one for the base domain and one for the wildcard domain. For example, if your base domain is domain.com, you'll need a record for that and for *.domain.com.

Different Operating Systems

cd ~
sudo apt-get update
sudo apt-get install openjdk-7-jre-headless -y
### Check http://www.elasticsearch.org/download/ for latest version of ElasticSearch and replace wget link below
# NEW WAY / EASY WAY
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.0.1.deb
sudo dpkg -i elasticsearch-1.0.1.deb