mikeatlas/GeoMesa-CDH-5.3.md

## GeoMesa-CDH-5.3.md

      
Getting GeoMesa 1.0 to work on Cloudera CDH 5.3 with Accumulo 1.6

by @mikeatlas
Thanks go out to @manasdebashiskar for helping me work through all these steps!
Getting GeoMesa to work on Accumulo 1.6 using Cloudera's CDH 5.3 is no harder than getting it to work on the officially supported Accumulo 1.5.x, but it does take a few specific steps, listed here.
First, you will need to set up an Accumulo 1.6 cluster in CDH. This requires that you create a Zookeeper cluster, an HDFS cluster, and a Hadoop MapReduce cluster. For my purposes, I have the following setup (yours may differ as you see fit):

3-host Zookeeper cluster, each host running Ubuntu 12.04 (AMI ami-018dd631) on a t2.medium instance
3-host Accumulo-Tablet, HDFS, Hadoop cluster, each host running Ubuntu 12.04 (AMI ami-03dfa333) on an m1.xlarge instance
1 host for Accumulo-Aux Services, HDFS-NameNode and YARN, running Ubuntu 12.04 (AMI ami-03dfa333) on an m1.xlarge instance

At this point, Cloudera Manager should show that your cluster is set up and running smoothly, and you should be able to connect to your Accumulo 1.6 instance monitor UI, e.g., http://acc6.mycluster.com:7180/, and see that all zookeepers are known and that the 3 tablet servers are available and ready. Check the log area to resolve any errors.
Make sure you know your root Accumulo user and password; we'll use them later to run the GeoMesa Quickstart example.
Next, I chose to build the geomesa accumulo6 branch on my Accumulo master instance. You'll need to install a few things first to build it successfully:

sudo apt-get install git maven openjdk-7-jdk -y
sudo update-alternatives --config java   # pick OpenJDK 1.7
sudo update-alternatives --config javac  # pick OpenJDK 1.7
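
Before starting the build, it can save time to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch, not part of the original setup; the function name check_tools is made up for illustration.

```shell
#!/bin/sh
# Report which of the named tools can be found on PATH.
# check_tools is a hypothetical helper name, not part of GeoMesa or CDH.
# Usage: check_tools git mvn java javac
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Non-zero exit status if at least one tool was not found.
    return "$missing"
}
```

Running check_tools git mvn java javac followed by java -version and javac -version should show whether the OpenJDK 1.7 alternatives were the ones selected.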

Next, we'll build GeoMesa from its Accumulo 1.6 branch:

git clone git@github.com:locationtech/geomesa.git
cd geomesa
git checkout accumulo6
Now you need to update the pom.xml in the geomesa project slightly so that the build finds the right CDH dependencies. Here's that file in its entirety. The changes point to the correct Hadoop client jar and the Cloudera repository. Then build:
mvn clean install

After this builds successfully, copy the output geomesa-distributed-runtime-accumulo1.5-1.0.0-rc.3-SNAPSHOT.jar to the Accumulo lib/ext directory on each of your tablet servers (you can ignore the 1.5 in the file name; the jar was in fact compiled against Accumulo 1.6). This jar contains the GeoMesa Accumulo iterators.

sudo chmod 777 /opt/cloudera/parcels/ACCUMULO/lib/accumulo/lib/ext  # I am not sure whether this step is necessary, or very secure
sudo cp ./geomesa-distributed-runtime/geomesa-distributed-runtime-accumulo1.5-1.0.0-rc.3-SNAPSHOT.jar /opt/cloudera/parcels/ACCUMULO/lib/accumulo/lib/ext
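
If the jar is not at the exact path used in the cp command above (Maven normally writes build output under each module's target/ directory), a search from the top of the checkout can locate it. This is a small sketch; find_runtime_jar is a hypothetical helper name, and the exclusion of sources and javadoc jars is an assumption about what the build may also produce.

```shell
#!/bin/sh
# Print the path(s) of the built GeoMesa distributed-runtime jar under a directory.
# find_runtime_jar is a hypothetical helper name used only for illustration.
# Usage: find_runtime_jar [checkout-dir]   (defaults to the current directory)
find_runtime_jar() {
    find "${1:-.}" -type f -name 'geomesa-distributed-runtime-*.jar' \
        ! -name '*-sources.jar' ! -name '*-javadoc.jar'
}
```

From inside the geomesa checkout, find_runtime_jar . prints the path to pass to sudo cp.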

You'll also need to copy a specific dependency, joda-time 2.3, into Accumulo's lib directory. It can be found in your local Maven repository after building GeoMesa:

sudo cp ~/.m2/repository/joda-time/joda-time/2.3/joda-time-2.3.jar /opt/cloudera/parcels/ACCUMULO/lib/accumulo/lib

Make sure the joda-time dependency and the GeoMesa iterators are installed in the same relative places (lib and lib/ext, respectively) on all your Accumulo tablet servers. You don't need to build them again on each machine - just copy them over using whatever method you prefer (scp, etc.).
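
One way to do that copy is a small loop over the tablet server hosts. The sketch below assumes the SSH user can write to both target directories on every host; if not, copy to a temporary directory and finish with sudo cp as in the commands above. The function names and any hostnames you pass in are placeholders, and setting DRY_RUN=1 prints the scp commands instead of running them.

```shell
#!/bin/sh
# Copy the GeoMesa runtime jar and the joda-time jar to each tablet server.
# Assumption: the SSH user may write to lib/ and lib/ext/ on the remote hosts.
ACCUMULO_LIB=/opt/cloudera/parcels/ACCUMULO/lib/accumulo/lib

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: distribute_jars <runtime-jar> <joda-jar> <host>...
distribute_jars() {
    runtime_jar=$1
    joda_jar=$2
    shift 2
    for host in "$@"; do
        run scp "$runtime_jar" "$host:$ACCUMULO_LIB/ext/"
        run scp "$joda_jar" "$host:$ACCUMULO_LIB/"
    done
}
```

A preview such as DRY_RUN=1 distribute_jars runtime.jar joda-time-2.3.jar tablet-1 tablet-2 lists the copies that would be made, which is a cheap way to check the paths before touching the cluster.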
Now go back into your Cloudera Manager UI, pick your cluster, pick your Accumulo role, and choose to restart Accumulo from the Actions dropdown menu. This is necessary so that the Accumulo runtime picks up the joda-time jar and your GeoMesa iterators on every tablet server.
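
Before triggering that restart, it may be worth confirming on each tablet server that both jars are where Accumulo expects them. This is a rough sketch under the directory layout used in this guide; verify_geomesa_jars is a hypothetical helper name.

```shell
#!/bin/sh
# Check that joda-time-2.3.jar is in <lib-dir> and that a GeoMesa
# distributed-runtime jar is in <lib-dir>/ext.
# verify_geomesa_jars is a hypothetical helper name used for illustration.
# Usage: verify_geomesa_jars /opt/cloudera/parcels/ACCUMULO/lib/accumulo/lib
verify_geomesa_jars() {
    lib=$1
    rc=0
    if [ -f "$lib/joda-time-2.3.jar" ]; then
        echo "ok: joda-time-2.3.jar"
    else
        echo "MISSING: $lib/joda-time-2.3.jar"
        rc=1
    fi
    # If nothing matches, the pattern is left unexpanded, so test the first result.
    set -- "$lib"/ext/geomesa-distributed-runtime-*.jar
    if [ -f "$1" ]; then
        echo "ok: ${1##*/}"
    else
        echo "MISSING: GeoMesa runtime jar in $lib/ext"
        rc=1
    fi
    return "$rc"
}
```

The function returns a non-zero status when either jar is missing, so it can be chained in a loop over hosts via ssh if you prefer to check them all from one machine.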
GeoMesa Quickstart example

Okay, back again on your Accumulo master machine:

cd ~
git clone https://github.com/geomesa/geomesa-quickstart.git
cd geomesa-quickstart
Edit the pom.xml file, as we did when building GeoMesa, to fix up some dependencies. Here's that file in its entirety as well.
mvn clean install

Okay, the quickstart is compiled, so it's time to run the GeoMesa Quickstart example. Be sure to update the parameters to match your specific environment:

java -cp ./target/geomesa-quickstart-1.0-SNAPSHOT.jar org.geomesa.QuickStart -instanceId accumulo -zookeepers "zk-1.test.com:2181,zk-2.test.com:2181,zk-3.test.com:2181" -user root -password yourpassword -tableName geomesaQS
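
The -zookeepers value is a comma-separated list of host:port pairs, and it is easy to mistype by hand. As a small sketch, the helper below builds the string from a port and a list of hosts so that each host appears exactly once; the function name and the hostnames in the usage note are placeholders.

```shell
#!/bin/sh
# Build a "host:port,host:port,..." string for the -zookeepers parameter.
# zk_connect_string is a hypothetical helper name used for illustration.
# Usage: zk_connect_string <port> <host>...
zk_connect_string() {
    port=$1
    shift
    result=""
    for host in "$@"; do
        # Prepend a comma only when result already holds an entry.
        result="${result:+$result,}$host:$port"
    done
    echo "$result"
}
```

For example, ZK=$(zk_connect_string 2181 zk-1.test.com zk-2.test.com zk-3.test.com) stores the string in a variable that can then be passed as -zookeepers "$ZK".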

You should see the following output:
Creating feature-type (schema):  QuickStart
Creating new features
Inserting new features
Submitting query
1.  Bierce|322|Tue Jul 15 21:09:42 UTC 2014|POINT (-77.01760098223343 -37.30933767159561)|null
2.  Bierce|343|Wed Aug 06 08:59:22 UTC 2014|POINT (-76.66826220670282 -37.44503877750368)|null
3.  Bierce|589|Sat Jul 05 06:02:15 UTC 2014|POINT (-76.88146600670152 -37.40156607152168)|null
4.  Bierce|925|Mon Aug 18 03:28:33 UTC 2014|POINT (-76.5621106573523 -37.34321201566148)|null
5.  Bierce|394|Fri Aug 01 23:55:05 UTC 2014|POINT (-77.42555615743139 -37.26710898726304)|null
6.  Bierce|259|Thu Aug 28 19:59:30 UTC 2014|POINT (-76.90122194030118 -37.148525741002466)|null
7.  Bierce|640|Sun Sep 14 19:48:25 UTC 2014|POINT (-77.36222958792739 -37.13013846773835)|null
8.  Bierce|931|Fri Jul 04 22:25:38 UTC 2014|POINT (-76.51304097832912 -37.49406125975311)|null
9.  Bierce|886|Tue Jul 22 18:12:36 UTC 2014|POINT (-76.59795732474399 -37.18420917493149)|null
When you see that magic output of Bierce, you'll know that GeoMesa is up and running! If you don't see that output in its entirety, you'll need to check your Accumulo logs to work out what went wrong.
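
A further sanity check is to look for the quickstart's tables from the Accumulo shell. The sketch below assumes that GeoMesa creates one or more tables whose names begin with the -tableName value (an assumption about its naming, not something confirmed in this guide) and that accumulo shell -e "tables" prints one table name per line. The helper name has_table_prefix is made up for illustration.

```shell
#!/bin/sh
# Read table names from stdin, print those starting with the given prefix,
# and succeed only if at least one matched.
# has_table_prefix is a hypothetical helper name used for illustration.
# Usage: accumulo shell -u root -p yourpassword -e "tables" | has_table_prefix geomesaQS
has_table_prefix() {
    prefix=$1
    found=1
    while IFS= read -r table; do
        case "$table" in
            "$prefix"*) echo "$table"; found=0 ;;
        esac
    done
    return "$found"
}
```

If nothing is printed, the quickstart most likely never reached the point of creating its schema, which narrows down where to look in the logs.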
--
GDELT Ingestion Example

You should already have loaded the raw GDELT zip set into HDFS. After that:

Update pom.xml to include joda-time 2.3 as a dependency (full file here)
Add maven relocation directive for joda-time 2.3 in pom.xml (full file here)
Update pom.xml with the CDH version of Hadoop and Accumulo. (full file here)
Update GDELTIngest.java on lines 145 and 151 to match the name of the project's output jar (in this case geomesa-gdelt-accumulo1.5-1.0-SNAPSHOT.jar, which follows the naming above since I didn't update the artifact id to 1.6)
mvn clean install
Grant HDFS permissions on the /user directory: sudo -u hdfs hadoop fs -chmod 777 /user
Run the MapReduce job:
hadoop jar /path/to/geomesa-gdelt-accumulo1.5-1.0-SNAPSHOT.jar \
  geomesa.gdelt.GDELTIngest \
  -instanceId <accumulo-instance-id> \
  -zookeepers <zookeeper-hosts-string> \
  -user <username> -password <password> \
  -auths <comma-separated-authorization-string> \
  -tableName gdelt -featureName event \
  -ingestFile hdfs:///gdelt/uncompressed/gdelt.tsv

After that, you should see the ingest working:
15/02/03 15:32:38 INFO mapreduce.Job: Job job_000000000_0000 running in uber mode : false
15/02/03 15:32:38 INFO mapreduce.Job:  map 0% reduce 0%
15/02/03 15:36:08 INFO mapreduce.Job:  map 1% reduce 0%
...
Also check your Accumulo Monitor UI; you should see the ingestion rate go up and hold steady while the number of entries climbs.
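
For reference, the prerequisite at the start of this section (GDELT data already in HDFS) could be handled along these lines. This is only a sketch: it assumes you already have a single uncompressed TSV file on local disk, and it uploads that file to the path given to -ingestFile above. The function name put_gdelt is made up, and HADOOP can be overridden (for example HADOOP=echo) to preview the commands without a cluster.

```shell
#!/bin/sh
# Upload an uncompressed GDELT TSV file to the HDFS location read by the ingest job.
# put_gdelt is a hypothetical helper name used for illustration.
HADOOP=${HADOOP:-hadoop}

# Usage: put_gdelt <local-tsv> [hdfs-dir]
put_gdelt() {
    local_tsv=$1
    hdfs_dir=${2:-/gdelt/uncompressed}
    # Create the target directory (and parents) if needed, then upload the file.
    $HADOOP fs -mkdir -p "$hdfs_dir" &&
    $HADOOP fs -put "$local_tsv" "$hdfs_dir/gdelt.tsv"
}
```

The local TSV can be produced from the raw zip files in whatever way you prefer, for example by extracting each archive with unzip -p and appending the output to one file.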
--
Notes


The accumulo shell on each machine is already in the path and need not be launched from $ACCUMULO_HOME/bin as most of the Accumulo guides state. Simply run accumulo shell -u root from any directory and you should be able to connect without trouble.