@tomthetrainer
Last active August 29, 2015 14:06
Fellow instructors,
I am teaching the DataAnalyst class, so I thought I would write up a demo and share it. Where possible I made it fit the labs and demo collections the DA class already has. I hope you find it useful.
A student in class today asked me to provide a demo of Pig over HBase.
It took me a few tries to get things working, so my session history got a little muddy. I am including it anyhow; I think the important steps are recorded in this email.
Here are the steps.
The VM does not start HBase by default; you will have to start it manually. I believe that once you start it, it will continue to start when the VM is rebooted.
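Before going further, it can help to confirm HBase actually came up. A minimal check, assuming the DA VM layout used below (the /root/start_hbase.sh script, and the standard HBase daemon names):

```shell
# as root on the DA VM
/root/start_hbase.sh
# the master and region server JVMs should show up in jps
jps | grep -E 'HMaster|HRegionServer'
```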
### Pig History
This is my grunt session, cleaned up (the typos and failed filter attempts are removed). Note that cd, ls, and cat here are grunt commands operating on HDFS, and that Pig equality is ==, not =.
cd /apps/hive/warehouse/emp
ls
cat emp.txt
emp = load 'emp.txt' as (name:chararray, state:chararray);
ca_emp = filter emp by (state == 'ca');
store ca_emp into 'ca_emp';
cd ca_emp
ls
cat part-m-00000
quit
A = LOAD '/user/root/whitehouse/' USING TextLoader();
describe A;
A_limit = limit A 10;
dump A_limit;
B = LOAD '/user/root/whitehouse/visits.txt' USING
PigStorage(',') AS (
lname:chararray,
fname:chararray,
mname:chararray,
id:chararray,
status:chararray,
state:chararray,
arrival:chararray
);
store B into 'whouse_tab' using PigStorage('\t');
ls whouse_tab
cat whouse_tab/part-m-00000
### Create Hive table
set hbase.zookeeper.quorum=localhost;
set hive.zookeeper.client.port=2181;
set zookeeper.znode.parent=/hbase-unsecure;
create external table pigdemo(id int, name string, state string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,a:name,a:state')
TBLPROPERTIES ('hbase.table.name' = 'pigdemo');
### Pig script to read from HBase
REGISTER /usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-client-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-server-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-protocol-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/htrace-core-2.01.jar
REGISTER /usr/lib/hbase/lib/zookeeper.jar
REGISTER /usr/lib/hbase/lib/guava-12.0.1.jar
REGISTER /usr/lib/hbase/lib/hbase-*.jar
REGISTER /usr/lib/hadoop/hadoop*.jar
REGISTER /usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar
emp = load 'hbase://pigdemo' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('a:state,a:name','-loadKey true')
AS(id:bytearray, state:chararray,name:chararray);
emp_ca = filter emp by (id == 1);
dump emp_ca;
### Hive table
create external table pigdemo(id int, name string, state string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,a:name,a:state')
TBLPROPERTIES ('hbase.table.name' = 'pigdemo');
show tables;
select * from pigdemo;
I dislike running things as root, but that is the only user available in the DA VM, so I will run with it.
As root, run:
/root/start_hbase.sh
You will need to create a table in HBase.
Launch the HBase shell. The HBase shell is a JRuby shell, so a single quit (no quotes, no semicolon) is how you leave it; a newline is the command delimiter.
# hbase shell
create 'pigdemo', 'a'
List table to confirm.
hbase(main):001:0> list
TABLE
ambarismoketest
emp
pigdemo
3 row(s) in 1.4410 seconds
=> ["ambarismoketest", "emp", "pigdemo"]
That is the first HBase part done.
Summary: an HBase table needs nothing more than a table name and a single column family. In this case, the table name is pigdemo and the column family is "a".
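If you want to double-check that, the HBase shell can describe the table; this is a standard shell command, though the exact output format varies by HBase version:

```
hbase(main):002:0> describe 'pigdemo'
```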
Now the Pig part. Modify pigdemo.txt so we have a good row key.
1 SD Rich
2 NV Barry
3 CO George
4 CA Ulf
5 IL Danielle
6 OH Tom
7 CA manish
8 CA Brian
8 CO Mark
Put the ID,state,name dataset into hdfs.
hadoop fs -put pigdemo.txt pigdemo.txt
*Note: I changed the original pigdemo.txt to have the id field.
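You can confirm the file landed in HDFS by reading it back (assuming it went to your HDFS home directory, as in the put above):

```shell
hadoop fs -cat pigdemo.txt
```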
Write a Pig script that looks like this:
REGISTER /usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-client-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-server-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-protocol-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/htrace-core-2.01.jar
REGISTER /usr/lib/hbase/lib/zookeeper.jar
REGISTER /usr/lib/hbase/lib/guava-12.0.1.jar
REGISTER /usr/lib/hbase/lib/hbase-*.jar
REGISTER /usr/lib/hadoop/hadoop*.jar
REGISTER /usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar
emp = load 'pigdemo.txt' as(id:chararray, state:chararray, name:chararray);
store emp into 'hbase://pigdemo' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('a:state,a:name');
Run the Pig script:
pig pigscript.txt
Verify the data in HBase. Note that row key 8 appears twice in the input, so the second write (Mark) overwrote the first (Brian); the scan returns eight row keys, not nine.
hbase(main):002:0> scan 'pigdemo'
ROW COLUMN+CELL
1 column=a:name, timestamp=1403065549589, value=Rich
1 column=a:state, timestamp=1403065549589, value=SD
2 column=a:name, timestamp=1403065549590, value=Barry
2 column=a:state, timestamp=1403065549590, value=NV
3 column=a:name, timestamp=1403065549591, value=George
3 column=a:state, timestamp=1403065549591, value=CO
4 column=a:name, timestamp=1403065549591, value=Ulf
4 column=a:state, timestamp=1403065549591, value=CA
5 column=a:name, timestamp=1403065549591, value=Danielle
5 column=a:state, timestamp=1403065549591, value=IL
6 column=a:name, timestamp=1403065549591, value=Tom
6 column=a:state, timestamp=1403065549591, value=OH
7 column=a:name, timestamp=1403065549591, value=manish
7 column=a:state, timestamp=1403065549591, value=CA
8 column=a:name, timestamp=1403065549591, value=Mark
8 column=a:state, timestamp=1403065549591, value=CO
To read from HBase into Pig:
REGISTER /usr/lib/hbase/lib/hbase-common-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-client-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-server-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/hbase-protocol-0.96.0.2.0.6.0-76-hadoop2.jar
REGISTER /usr/lib/hbase/lib/htrace-core-2.01.jar
REGISTER /usr/lib/hbase/lib/zookeeper.jar
REGISTER /usr/lib/hbase/lib/guava-12.0.1.jar
REGISTER /usr/lib/hbase/lib/hbase-*.jar
REGISTER /usr/lib/hadoop/hadoop*.jar
REGISTER /usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar
emp = load 'hbase://pigdemo' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('a:state,a:name','-loadKey true')
AS(id:bytearray, state:chararray,name:chararray);
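With the relation loaded, a quick sanity check in grunt is to filter on one of the columns and dump the result (state values here are as loaded earlier, e.g. 'CA'):

```
emp_ca = filter emp by (state == 'CA');
dump emp_ca;
```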
Some other settings that might be needed. Only make these changes if the above example fails.
/etc/hadoop/conf/hadoop_env.sh
## added these lines.
HBASE_JARS=
for f in $HBASE_HOME/lib/*.jar; do
HBASE_JARS=${HBASE_JARS}:$f;
done
export HADOOP_CLASSPATH=$HBASE_JARS:$HADOOP_CLASSPATH
/usr/lib/pig/conf/pig-env.sh
## added
export HBASE_HOME=/usr/lib/hbase
To get Hive working I did the following:
mkdir /usr/lib/hive/auxlib
cp /usr/lib/hive/lib/hive-hbase-handler-0.12.0.2.0.6.0-76.jar /usr/lib/hive/auxlib
cp /usr/lib/zookeeper/zookeeper-3.4.5.2.0.6.0-76.jar /usr/lib/hive/auxlib
cp /usr/lib/hbase/lib/hbase*.jar /usr/lib/hive/auxlib
cp /usr/lib/hbase/lib/guava-12.0.1.jar /usr/lib/hive/auxlib
Create the Hive table:
set hbase.zookeeper.quorum=localhost;
set hive.zookeeper.client.port=2181;
set zookeeper.znode.parent=/hbase-unsecure;
create external table pigdemo(id int, name string, state string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,a:name,a:state')
TBLPROPERTIES ('hbase.table.name' = 'pigdemo');