Add HDB (HAWQ) to HDP 2.4.2 with Zeppelin

Goals:

  • Install a 4-node cluster running HDP 2.4.2 and Ambari 2.2.2.0 (including Zeppelin and HDB), either via Ambari bootstrap with blueprints or via the Ambari install wizard
  • Configure HAWQ for Zeppelin
  • Configure Zeppelin for HAWQ
  • Run HAWQ queries via Zeppelin

Notes:

  • HDB managed via Ambari is only supported from Ambari 2.2.2.0 onwards. Do not attempt this with older versions of Ambari

Install Ambari 2.2.2.0 and HDB service definitions

  • Bring up 4 VMs imaged with RHEL/CentOS 6.x (e.g. node1-4 in this case)

  • On the non-Ambari nodes (node2-4 in this case), install ambari-agent and point it at the Ambari node (node1 in this case)

export ambari_server=node1
curl -sSL https://raw.githubusercontent.com/seanorama/ambari-bootstrap/master/ambari-bootstrap.sh | sudo -E sh
  • On Ambari node (e.g. node1), install ambari-server
export install_ambari_server=true
curl -sSL https://raw.githubusercontent.com/seanorama/ambari-bootstrap/master/ambari-bootstrap.sh | sudo -E sh
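  • Optionally, sanity-check that ambari-server came up before proceeding:
ambari-server status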
  • Install Zeppelin service definition
yum install -y git
git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/2.4/services/ZEPPELIN
sed -i.bak '/dependencies for all/a \    "ZEPPELIN_MASTER-START": ["NAMENODE-START", "DATANODE-START"],' /var/lib/ambari-server/resources/stacks/HDP/2.4/role_command_order.json
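  • Optionally, verify the edited role_command_order.json is still valid JSON (a quick check using the stock python on the node):
python -m json.tool /var/lib/ambari-server/resources/stacks/HDP/2.4/role_command_order.json > /dev/null && echo "role_command_order.json OK"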
  • Install Pivotal service definition and repo per HDB doc

    • Create staging dir:
    mkdir /staging
    chmod a+rx /staging
    
    • Copy hdb-2.0.0.0-22126.tar.gz and hdb-ambari-plugin-2.0.0-448.tar.gz to /staging

    • Setup HDB repo and Ambari service definition:

    tar -xvzf /staging/hdb-2.0.0.0-*.tar.gz -C /staging/
    tar -xvzf /staging/hdb-ambari-plugin-2.0.0-*.tar.gz -C /staging/  
    yum install -y httpd
    service httpd start
    cd /staging/hdb*
    ./setup_repo.sh
    cd /staging/hdb-ambari-plugin*
    ./setup_repo.sh  
    yum install -y hdb-ambari-plugin
    
    • At this point you should see a local repo up at http://node1/HDB/
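    • To verify, hit the repo from the Ambari node (a quick check that the repo index is being served):

    curl -sL http://node1/HDB/ | head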

    • Restart Ambari so it recognizes the Zeppelin, HAWQ, and PXF services

    service ambari-server restart
    service ambari-agent restart
    
  • Confirm that all 4 agents registered and that the agent is up

curl -u admin:admin -H "X-Requested-By: ambari" http://localhost:8080/api/v1/hosts
service ambari-agent status

Deploy vanilla HDP + Zeppelin + HDB

  • Deploy a cluster running the latest HDP, including Zeppelin, HAWQ, and PXF. You can either:
    • Option 1: login to Ambari UI and use Install Wizard. In this case:
      • You will need to set the 'HAWQ System User Password' to any value you like
      • Make sure to manually adjust the HDFS settings mentioned in HDB doc
      • Make sure that the port specified in 'HAWQ master port' (by default, 5432) is not in use on the host where you will install the HAWQ master
        • If you are installing on a single node, or in any other scenario where the HAWQ master needs to be installed on a node where a PostgreSQL setup already exists (e.g. installing the HAWQ master on the same host as Ambari), you will need to change the master port from the default value (5432)
        • On a single-node setup, the 'HAWQ standby master' will not be installed
      • Refer to HDB doc for full details
    • OR
    • Option 2: generate/deploy a customized blueprint using ambari-bootstrap that takes care of the HDFS configurations as below:
yum install -y python-argparse
cd
git clone https://github.com/seanorama/ambari-bootstrap.git

#decide which services to deploy and set the number of nodes in the cluster
export ambari_services="HDFS MAPREDUCE2 YARN ZOOKEEPER HIVE ZEPPELIN SPARK HAWQ PXF"
export host_count=4
 
cd ./ambari-bootstrap/deploy/

#add HDFS config customizations for HAWQ and any others you may want
cat << EOF > configuration-custom.json
{
  "configurations" : {
    "hdfs-site": {
        "dfs.allow.truncate": "true",
        "dfs.block.access.token.enable": "false",
        "dfs.block.local-path-access.user": "gpadmin",
        "dfs.client.read.shortcircuit": "true",
        "dfs.client.socket-timeout": "300000000",
        "dfs.client.use.legacy.blockreader.local": "false",
        "dfs.datanode.handler.count": "60",
        "dfs.datanode.socket.write.timeout": "7200000",                                
        "dfs.namenode.handler.count": "600",
        "dfs.support.append": "true"               
    },
    "hawq-env":{
        "hawq_password":"gpadmin"
      },
    "core-site": {
        "ipc.client.connection.maxidletime": "3600000",
        "ipc.client.connect.timeout": "300000",
        "ipc.server.listen.queue.size": "3300"
    }
  }
}
EOF
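
#optional - sanity-check the customization file is valid JSON before deploying
python -m json.tool configuration-custom.json > /dev/null && echo "configuration-custom.json OK"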

#optional - if you want to review the BP before deploying it
#export deploy=false
#./deploy-recommended-cluster.bash
#more temp*/blueprint.json

#generate BP including customizations and start cluster deployment
export deploy=true
./deploy-recommended-cluster.bash
  • This kicks off the HDP cluster install, including Zeppelin, HAWQ, and PXF. You can monitor progress via Ambari at http://node1:8080
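  • Alternatively, you can poll deployment progress via the Ambari REST API (the cluster name is generated by ambari-bootstrap, so look it up first):
#find the generated cluster name
curl -u admin:admin -H "X-Requested-By: ambari" http://localhost:8080/api/v1/clusters
#then poll the install request progress (replace CLUSTERNAME with the name returned above)
curl -u admin:admin -H "X-Requested-By: ambari" "http://localhost:8080/api/v1/clusters/CLUSTERNAME/requests?fields=Requests/progress_percent,Requests/request_status"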

Configure HAWQ for Zeppelin

  • On the HAWQ master node:
    • SSH in
    • connect to HAWQ
    • create a new DB
    • add a user for Zeppelin
    • give the zeppelin user access to the DB
su - gpadmin
source /usr/local/hawq/greenplum_path.sh
export PGPORT=5432
psql -d postgres

CREATE DATABASE contoso;
CREATE USER zeppelin WITH PASSWORD 'zeppelin';
GRANT ALL PRIVILEGES ON DATABASE contoso TO zeppelin;
\q
  • Note: you only need to export PGPORT if the HAWQ master was not installed on the default port (5432); if you specified a different port, set PGPORT accordingly.
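  • To confirm the database and user were created, run the below as gpadmin:
psql -d postgres -c '\l' | grep contoso
psql -d postgres -c '\du' | grep zeppelin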

  • On the HAWQ master node, run the below to add the IP of the Zeppelin node to HAWQ's pg_hba.conf. This allows Zeppelin to access HAWQ from a different node

    • Make sure to replace 172.17.0.2 below with the IP of the host running Zeppelin
echo "host all all 172.17.0.2/32 trust" >> /data/hawq/master/pg_hba.conf
  • Restart HAWQ via Ambari
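  • To confirm the pg_hba change took effect, you can test the connection from the Zeppelin host (assuming a psql client is available there, and that node3 is the HAWQ master as in the JDBC URL below):
psql -h node3 -p 5432 -U zeppelin -d contoso -c 'select 1;'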

Configure Zeppelin for HAWQ

  • Open the Zeppelin interpreter settings, scroll down to the psql section, and make the below changes so the zeppelin user is used to connect to the contoso DB:
    • postgresql.url = jdbc:postgresql://node3:5432/contoso
    • postgresql.user = zeppelin
    • postgresql.password = zeppelin

Run HAWQ queries via Zeppelin

  • Create a new note in Zeppelin with the below cells to create/populate a test table and calculate the average of a subset:
%psql.sql
create table tt (i int);
insert into tt select generate_series(1,1000000);
%psql.sql
select avg(i) from tt where i>5000;
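  • Optionally, add a third cell to inspect the query plan (EXPLAIN shows how HAWQ parallelizes the aggregate across segments):
%psql.sql
explain select avg(i) from tt where i>5000;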