This lab is part of a 'SQL on Hadoop' webinar. The recording and slides can be found here
How/when to use Hive vs Phoenix vs SparkSQL
- Download the HDP 2.3 sandbox from here
- After it boots up, find the IP address of the VM and add an entry to your machine's hosts file, e.g.
192.168.191.241 sandbox.hortonworks.com sandbox
- Connect to the VM via SSH (root/hadoop) and correct the /etc/hosts entry
ssh root@sandbox.hortonworks.com
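- For reference, a minimal sketch of what the corrected /etc/hosts entry on the VM might look like (the IP is just the example from above; substitute your VM's actual address):
# ensure the VM's own /etc/hosts maps the sandbox hostname to its real IP (example values)
echo "192.168.191.241 sandbox.hortonworks.com sandbox" >> /etc/hosts
cat /etc/hosts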
- Install the Ambari service definitions for the Zeppelin and NiFi services
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/abajwa-hw/ambari-nifi-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/NIFI
sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN
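- Optionally, confirm the HDP version was detected and both service definitions are in place before restarting Ambari:
echo $VERSION    # should print the HDP stack version, e.g. 2.3
ls /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/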
- Restart Ambari
# on the sandbox
service ambari restart
# on a non-sandbox cluster
sudo service ambari-server restart
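- Optionally, confirm Ambari is back up before continuing:
ambari-server status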
- In Ambari, open the 'Add Service' wizard and follow it to install both services with default settings
- Change the below in Hive config and restart Hive
- under General -> change hive.tez.java.opts to
-server -Xmx1000m -Djava.net.preferIPv4Stack=true
- under General -> ensure hive.exec.dynamic.partition.mode = nonstrict
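- After the restart, you can quickly confirm the values took effect from the Hive CLI (an optional check):
hive -e "set hive.tez.java.opts; set hive.exec.dynamic.partition.mode;"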
- On the sandbox, Solr is installed as part of HDPsearch. Run the below to fix a permissions bug with the Solr setup on the sandbox
chown -R solr:solr /opt/lucidworks-hdpsearch/solr #current sandbox version has files owned by root here which causes problems
- If running on an Ambari-installed HDP 2.3 cluster (instead of the sandbox), run the below to install HDPsearch first
yum install -y lucidworks-hdpsearch
sudo -u hdfs hadoop fs -mkdir /user/solr
sudo -u hdfs hadoop fs -chown solr /user/solr
- Set up the Banana and Solr configs
su solr
#set up the Banana dashboard
cd /opt/lucidworks-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/
mv default.json default.json.orig
wget https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/default.json
- Edit solrconfig.xml by adding
<str>EEE MMM d HH:mm:ss Z yyyy</str>
under ParseDateFieldUpdateProcessorFactory
so it looks like the below. This is done so Solr can recognize the timestamp format of tweets.
vi /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
  <arr name="format">
    <str>EEE MMM d HH:mm:ss Z yyyy</str>
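- A quick way to confirm the edit was saved (optional):
grep -A 2 'ParseDateFieldUpdateProcessorFactory' \
  /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml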
- Start Solr in cloud mode and create tweets collection
/opt/lucidworks-hdpsearch/solr/bin/solr start -c -z localhost:2181
/opt/lucidworks-hdpsearch/solr/bin/solr create -c tweets \
-d data_driven_schema_configs \
-s 1 \
-rf 1
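- Optionally verify the collection was created using the Solr Collections API:
curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"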
- Solr setup is complete. Return to the root user
exit
- Fix time on sandbox VM
yum install -y ntp
service ntpd stop
ntpdate pool.ntp.org
service ntpd start
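- Optionally, keep ntpd enabled across reboots (assuming the CentOS 6 based sandbox image, where chkconfig applies):
chkconfig ntpd on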
- Download the NiFi template called Twitter_Dashboard.xml onto your laptop's local filesystem
wget https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/Twitter_Dashboard.xml
- Open NiFi at http://sandbox.hortonworks.com:9090 and import the template
- Edit the 'GetTwitter' processor and enter your Twitter keys/secrets
- Start the NiFi flow to push tweets into HDFS, Solr and local disk
- Verify data in Solr at: http://sandbox.hortonworks.com:8983/solr/tweets_shard1_replica1/select?q=*%3A*&wt=json&indent=true
- Verify data in HDFS under /tmp/tweets_staging at http://sandbox.hortonworks.com:8080/#/main/views/FILES/1.0.0/Files
- Verify data in Banana dashboard under http://sandbox.hortonworks.com:8983/solr/banana/index.html#/dashboard
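- The same checks can be run from the VM shell if preferred (a sketch):
# number of tweets indexed so far in Solr
curl "http://localhost:8983/solr/tweets_shard1_replica1/select?q=*:*&rows=0&wt=json"
# files landed in HDFS by the flow
sudo -u hdfs hadoop fs -ls /tmp/tweets_staging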
- See here or here for more details and screenshots on collecting tweets using NiFi if needed
- Install the Zeppelin notebook for the workshop by unzipping the below under /opt/incubator-zeppelin/notebook/ and restarting Zeppelin (see the restart sketch after the commands)
su zeppelin
cd /opt/incubator-zeppelin/notebook/
wget https://www.dropbox.com/s/dxjc0ugj4lhurcf/2AY3B5WDV.zip
unzip 2AY3B5WDV.zip
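- Then restart Zeppelin, either from Ambari (it was installed as an Ambari service) or, as a sketch assuming the default install path, via its daemon script:
/opt/incubator-zeppelin/bin/zeppelin-daemon.sh restart
exit    # return to the root user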
- Follow the instructions in the notebook to use Zeppelin to run reports using Hive, Phoenix or SparkSQL depending on the scenario/workload.
- Make sure Hive and HBase are started. If running on a non-sandbox cluster, you will need to enable Phoenix first as well.
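- To sanity-check that Phoenix can reach HBase, you can try opening sqlline (a sketch, assuming the default unsecured ZooKeeper znode used by HDP):
/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure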