nsabharwal / gist:600bef5a0454e0738a93
Created April 29, 2015 06:12
Ingesting into Kafka via the Flume syslog source
yum -y install syslog-ng
# /etc/flume/conf/flume.conf
agent.sources=syslogsource-1
agent.channels=mem-channel-1
agent.sinks=kafka-sink-1
# syslog TCP source listening on port 13073, wired to the memory channel
agent.sources.syslogsource-1.type=syslogtcp
agent.sources.syslogsource-1.port=13073
agent.sources.syslogsource-1.channels=mem-channel-1
agent.channels.mem-channel-1.type=memory
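The gist preview stops before the sink definition. A minimal sketch of the missing piece, assuming the Flume 1.6 built-in Kafka sink and a broker at localhost:6667 (the broker address and topic name are assumptions):
# Kafka sink: events arriving on the memory channel are written to the syslog topic
agent.sinks.kafka-sink-1.type=org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafka-sink-1.brokerList=localhost:6667
agent.sinks.kafka-sink-1.topic=syslog
agent.sinks.kafka-sink-1.channel=mem-channel-1
Once the agent is up, a quick end-to-end check is to send a TCP syslog message at the source port, e.g. logger -n 127.0.0.1 -T -P 13073 "flume test" (requires a logger build with network options).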
Please see the following details on Apache Phoenix, a "SQL skin" for HBase.
Phoenix
The following details are based on a test done in one of my lab environments. You can see that Phoenix lets us run SQL, create secondary indexes, view explain plans, and perform data loads and bulk loads.
Table definition
drop table if exists crime;
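The gist preview truncates here; the full table definition appears further down. To illustrate the secondary-index and explain-plan steps mentioned above, a sketch against the crime table (the index name and the literal value are made up):
-- secondary index on the description column
create index crime_desc_idx on crime (description);
-- show the plan Phoenix picks for a query the index can serve
explain select caseid from crime where description = 'THEFT';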
1. Check out the source code from https://github.com/apache/incubator-zeppelin
2. Build it against Spark 1.3 and the matching Hadoop version:
mvn clean package -Pspark-1.3 -Dhadoop.version=2.6.0 -Phadoop-2.6 -DskipTests
3. Place the following JARs on the Spark interpreter classpath by putting them in $ZEPPELIN_HOME/interpreter/spark (see the sketch after this list):
a. hbase-client.jar
b. hbase-protocol.jar
c. hbase-common.jar
d. phoenix-4.4.x-client-without-hbase.jar
4. Start Zeppelin
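A command-level sketch of steps 3 and 4, assuming an HDP-style layout under /usr/hdp/current (the JAR paths are assumptions; match them to your install):
# copy the HBase and Phoenix client JARs onto the Spark interpreter classpath
cp /usr/hdp/current/hbase-client/lib/hbase-client*.jar $ZEPPELIN_HOME/interpreter/spark/
cp /usr/hdp/current/hbase-client/lib/hbase-protocol*.jar $ZEPPELIN_HOME/interpreter/spark/
cp /usr/hdp/current/hbase-client/lib/hbase-common*.jar $ZEPPELIN_HOME/interpreter/spark/
cp /usr/hdp/current/phoenix-client/phoenix-4.4.*-client-without-hbase.jar $ZEPPELIN_HOME/interpreter/spark/
# start the Zeppelin daemon
$ZEPPELIN_HOME/bin/zeppelin-daemon.sh start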
nsabharwal / Big Data product list and short description
Last active October 13, 2015 13:42
Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Spark: a fast, general engine for large-scale data processing. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Couchbase: an open-source, distributed NoSQL document-oriented database. It exposes a fast key-value store with managed cache for sub-millisecond data operations, purpose-built indexers for fast queries, and a query engine for executing SQL queries.
Jupyter: a web application that lets you create and share documents containing live code, equations, visualizations, and explanatory text. Use case: data cleaning, transformation, numerical simulation, statistical modeling, ML, and more.
H2O: an open-source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning. Use case: ad optimization, fraud detection, predictive modeling, customer intelligence.
Tachyon: a memory-centric distributed storage system that enables reliable data sharing at memory speed across cluster frameworks such as Spark and MapReduce.
drop table if exists crime;
create table crime (
    caseid varchar,
    Date varchar,
    block varchar,
    description varchar,
    sdesc varchar,
    ldesc varchar,
    arrest char(2),
    domestic char(2),
    -- the remaining columns are cut off in the gist preview; Phoenix requires a
    -- primary key, so the DDL must close with something like (an assumption):
    constraint pk primary key (caseid)
);
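To exercise the data-load path mentioned above, Phoenix ships a psql.py loader. A sketch, assuming a CSV file at /tmp/crime.csv and ZooKeeper on localhost (both assumptions):
# load the CSV into the CRIME table
/usr/hdp/current/phoenix-client/bin/psql.py -t CRIME localhost /tmp/crime.csv
For larger files, the MapReduce-based CsvBulkLoadTool covers the bulk-load case.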
yum -y install expect
#!/usr/bin/expect
# answer the ambari-server sync-ldap credential prompts non-interactively
spawn ambari-server sync-ldap --existing
expect "Enter Ambari Admin login:"
send "admin\r"
expect "Enter Ambari Admin password:"
send "admin\r"
expect eof
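To run it, save the expect script to a file (the name here is a stand-in) and execute it:
chmod +x /tmp/ambari-ldap-sync.exp
/tmp/ambari-ldap-sync.exp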
read -p "enter HS2 hostname: " HS2
read -p "enter username: " username
echo "enter password"
read -s passwd
read -p "enter filename: " filename
beeline -u jdbc:hive2://$HS2:10000/default -n $username -p $passwd -f $filename
# generate a SHOW CREATE TABLE statement for every table registered in the Hive
# metastore (MySQL); -N drops the header row so the output is valid HiveQL
mysql -N -u hive -p -e "select concat('show create table ', TBL_NAME, ';') from TBLS" hive > /tmp/file.sql
hive -f /tmp/file.sql
HDFS test
Make sure the Google Cloud Storage connector is on the Hadoop CLASSPATH, as described in the blog; a sketch follows.
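A minimal sketch of that classpath entry, assuming the connector JAR sits in /usr/lib/hadoop/lib (the path and version are assumptions; match your install):
# e.g. in hadoop-env.sh
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hadoop/lib/gcs-connector-latest-hadoop2.jar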
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -ls gs://hivetest/
15/06/28 21:15:32 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/28 21:15:33 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://hivetest/'
nsabharwal / control.sh
Last active January 19, 2016 19:01 — forked from randerzander/control.sh
Ambari Service Start/Stop script
# Ambari credentials, cluster name, and API endpoint
USER='admin'
PASS='admin'
CLUSTER='dev'
HOST=$(hostname -f):8080

# start a service by PUTting its desired state as STARTED
function start(){
curl -u $USER:$PASS -i -H 'X-Requested-By: ambari' -X PUT -d \
'{"RequestInfo": {"context" :"Start '"$1"' via REST"}, "Body": {"ServiceInfo": {"state": "STARTED"}}}' \
http://$HOST/api/v1/clusters/$CLUSTER/services/$1
}
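The preview ends before the matching stop() function. Ambari stops a service by setting its state back to INSTALLED, so by symmetry with start() a sketch looks like:
# stop a service by PUTting its desired state as INSTALLED
function stop(){
curl -u $USER:$PASS -i -H 'X-Requested-By: ambari' -X PUT -d \
'{"RequestInfo": {"context" :"Stop '"$1"' via REST"}, "Body": {"ServiceInfo": {"state": "INSTALLED"}}}' \
http://$HOST/api/v1/clusters/$CLUSTER/services/$1
}
Usage: start HDFS or stop HDFS (service names as Ambari knows them, uppercase).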