This lab is part of a 'SQL on Hadoop' webinar. The recording and slides can be found here
How/when to use Hive vs Phoenix vs SparkSQL
- Download the HDP 2.3 sandbox from here
- After it boots up, find the IP address of the VM and add an entry to your machine's hosts file, e.g.
192.168.191.241 sandbox.hortonworks.com sandbox
- Connect to the VM via SSH (root/hadoop) and correct the /etc/hosts entry
ssh root@sandbox.hortonworks.com
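- For reference, a minimal sketch of what the corrected /etc/hosts entry on the VM might look like (the IP is just the example from above; substitute your VM's actual address):
# ensure the VM's own /etc/hosts maps the sandbox hostname to its real IP (example values)
echo "192.168.191.241 sandbox.hortonworks.com sandbox" >> /etc/hosts
cat /etc/hosts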
- Install the Ambari service definitions for the Zeppelin and NiFi services
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/abajwa-hw/ambari-nifi-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/NIFI
sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN
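- Optionally, confirm the HDP version was detected and both service definitions are in place before restarting Ambari:
echo $VERSION    # should print the HDP stack version, e.g. 2.3
ls /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/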
- Restart Ambari
# on the sandbox
service ambari restart
# on a non-sandbox cluster
sudo service ambari-server restart
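- Optionally, confirm Ambari is back up before continuing:
ambari-server status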
- In Ambari, open the 'Add Service' wizard and follow it to install both services with default settings
- Change the below in Hive config and restart Hive
- under General -> change hive.tez.java.opts to
-server -Xmx1000m -Djava.net.preferIPv4Stack=true
- under General -> ensure hive.exec.dynamic.partition.mode = nonstrict
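- After the restart, you can quickly confirm the values took effect from the Hive CLI (an optional check):
hive -e "set hive.tez.java.opts; set hive.exec.dynamic.partition.mode;"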
- On the sandbox, Solr is installed as part of HDPsearch. Run the below to fix a permissions bug with the Solr setup on the sandbox
chown -R solr:solr /opt/lucidworks-hdpsearch/solr #current sandbox version has files owned by root here which causes problems
- If running on an Ambari-installed HDP 2.3 cluster (instead of the sandbox), run the below to install HDPsearch first
yum install -y lucidworks-hdpsearch
sudo -u hdfs hadoop fs -mkdir /user/solr
sudo -u hdfs hadoop fs -chown solr /user/solr
- Set up the Banana and Solr configs
su solr
#set up the Banana dashboard
cd /opt/lucidworks-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/
mv default.json default.json.orig
wget https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/default.json
- Edit solrconfig.xml by adding
<str>EEE MMM d HH:mm:ss Z yyyy</str>
under ParseDateFieldUpdateProcessorFactory
so it looks like the below. This is done so Solr can recognize the timestamp format of tweets.
vi /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
  <arr name="format">
    <str>EEE MMM d HH:mm:ss Z yyyy</str>
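- A quick way to confirm the edit was saved (optional):
grep -A 2 'ParseDateFieldUpdateProcessorFactory' \
  /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml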
- Start Solr in cloud mode and create tweets collection
/opt/lucidworks-hdpsearch/solr/bin/solr start -c -z localhost:2181
/opt/lucidworks-hdpsearch/solr/bin/solr create -c tweets \
-d data_driven_schema_configs \
-s 1 \
-rf 1
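- Optionally verify the collection was created using the Solr Collections API:
curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"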
- Solr setup is complete. Return to the root user
exit
- Fix time on sandbox VM
yum install -y ntp
service ntpd stop
ntpdate pool.ntp.org
service ntpd start
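- Optionally, keep ntpd enabled across reboots (assuming the CentOS 6 based sandbox image, where chkconfig applies):
chkconfig ntpd on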
- Download the NiFi template called Twitter_Dashboard.xml onto your laptop's local filesystem
wget https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/Twitter_Dashboard.xml
- Open NiFi at http://sandbox.hortonworks.com:9090 and import the template
- Edit the 'GetTwitter' processor and enter your Twitter keys/secrets
- Start the NiFi flow to push tweets into HDFS, Solr and local disk
- Verify data in Solr at: http://sandbox.hortonworks.com:8983/solr/tweets_shard1_replica1/select?q=*%3A*&wt=json&indent=true
- Verify data in HDFS under /tmp/tweets_staging at http://sandbox.hortonworks.com:8080/#/main/views/FILES/1.0.0/Files
- Verify data in Banana dashboard under http://sandbox.hortonworks.com:8983/solr/banana/index.html#/dashboard
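- The same checks can be run from the VM shell if preferred (a sketch):
# number of tweets indexed so far in Solr
curl "http://localhost:8983/solr/tweets_shard1_replica1/select?q=*:*&rows=0&wt=json"
# files landed in HDFS by the flow
sudo -u hdfs hadoop fs -ls /tmp/tweets_staging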
- See here or here for more details and screenshots on collecting tweets using NiFi if needed
- Install the Zeppelin notebook for the workshop by unzipping the below under /opt/incubator-zeppelin/notebook/ and restarting Zeppelin (see the restart sketch after the commands)
su zeppelin
cd /opt/incubator-zeppelin/notebook/
wget https://www.dropbox.com/s/dxjc0ugj4lhurcf/2AY3B5WDV.zip
unzip 2AY3B5WDV.zip
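- Then restart Zeppelin, either from Ambari (it was installed as an Ambari service) or, as a sketch assuming the default install path, via its daemon script:
/opt/incubator-zeppelin/bin/zeppelin-daemon.sh restart
exit    # return to the root user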
- Follow the instructions in the notebook to use Zeppelin to run reports using Hive, Phoenix or SparkSQL depending on the scenario/workload.
- Make sure Hive and HBase are started. If running on a non-sandbox cluster, you will need to enable Phoenix first as well.
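- To sanity-check that Phoenix can reach HBase, you can try opening sqlline (a sketch, assuming the default unsecured ZooKeeper znode used by HDP):
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure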