HBase basics
ubuntu@ubuntu:~$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop2/logs/hadoop-ubuntu-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop2/logs/hadoop-ubuntu-datanode-ubuntu.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop2/logs/hadoop-ubuntu-secondarynamenode-ubuntu.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop2/logs/yarn-ubuntu-resourcemanager-ubuntu.out
localhost: starting nodemanager, logging to /usr/local/hadoop2/logs/yarn-ubuntu-nodemanager-ubuntu.out
ubuntu@ubuntu:~$ jps
3714 Jps
2531 NameNode
3091 ResourceManager
2692 DataNode
3256 NodeManager
2890 SecondaryNameNode
ubuntu@ubuntu:~$ start-hbase.sh
localhost: starting zookeeper, logging to /home/ubuntu/hbase-1.0.1.1/bin/../logs/hbase-ubuntu-zookeeper-ubuntu.out
starting master, logging to /home/ubuntu/hbase-1.0.1.1/logs/hbase-ubuntu-master-ubuntu.out
starting regionserver, logging to /home/ubuntu/hbase-1.0.1.1/logs/hbase-ubuntu-1-regionserver-ubuntu.out
ubuntu@ubuntu:~$ jps
8704 HRegionServer
7489 NodeManager
8514 HQuorumPeer
6757 NameNode
8581 HMaster
6919 DataNode
7145 SecondaryNameNode
8925 Jps
7325 ResourceManager
ubuntu@ubuntu:~$ hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ubuntu/hbase-1.0.1.1/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.0.1.1, re1dbf4df30d214fca14908df71d038081577ea46, Sun May 17 12:34:26 PDT 2015
hbase(main):001:0> create 'emp', 'personal data', 'professional data'
0 row(s) in 0.5120 seconds
=> Hbase::Table - emp
hbase(main):002:0> put 'emp','1','personal data:name','sriram'
0 row(s) in 0.0830 seconds
hbase(main):003:0> put 'emp','1','personal data:location','ayodhya'
0 row(s) in 0.0140 seconds
hbase(main):004:0> put 'emp','1','personal data:mobile','7204437072'
0 row(s) in 0.0080 seconds
hbase(main):005:0> put 'emp','1','professional data:job title','lead'
0 row(s) in 0.0210 seconds
hbase(main):006:0> put 'emp','1','professional data:location','sri lanka'
0 row(s) in 0.0080 seconds
hbase(main):007:0> put 'emp','1','professional data:mobile','7204437072'
0 row(s) in 0.0090 seconds
hbase(main):009:0> list
TABLE
emp
test
2 row(s) in 0.0250 seconds
=> ["emp", "test"]
hbase(main):010:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal data:location, timestamp=1454498924898, value=ayodhya
1 column=personal data:mobile, timestamp=1454498936233, value=7204437072
1 column=personal data:name, timestamp=1454498917042, value=sriram
1 column=professional data:job title, timestamp=1454498941627, value=lead
1 column=professional data:location, timestamp=1454498947357, value=sri lanka
1 column=professional data:mobile, timestamp=1454498957809, value=7204437072
1 row(s) in 0.0360 seconds
hbase(main):011:0> put 'emp','2','personal data:name','seeta'
0 row(s) in 0.0140 seconds
hbase(main):012:0> put 'emp','2','personal data:location','midhila'
0 row(s) in 0.0080 seconds
hbase(main):013:0> put 'emp','2','personal data:mobile','9742681255'
0 row(s) in 0.0050 seconds
hbase(main):014:0> put 'emp','2','professional data:job title','sse'
0 row(s) in 0.0060 seconds
hbase(main):015:0> put 'emp','2','professional data:location','sri lanka'
0 row(s) in 0.0100 seconds
hbase(main):016:0> put 'emp','2','professional data:mobile','9742681255'
0 row(s) in 0.0040 seconds
hbase(main):017:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal data:location, timestamp=1454498924898, value=ayodhya
1 column=personal data:mobile, timestamp=1454498936233, value=7204437072
1 column=personal data:name, timestamp=1454498917042, value=sriram
1 column=professional data:job title, timestamp=1454498941627, value=lead
1 column=professional data:location, timestamp=1454498947357, value=sri lanka
1 column=professional data:mobile, timestamp=1454498957809, value=7204437072
2 column=personal data:location, timestamp=1454499143334, value=midhila
2 column=personal data:mobile, timestamp=1454499152049, value=9742681255
2 column=personal data:name, timestamp=1454499136556, value=seeta
2 column=professional data:job title, timestamp=1454499165866, value=sse
2 column=professional data:location, timestamp=1454499173236, value=sri lanka
2 column=professional data:mobile, timestamp=1454499180039, value=9742681255
2 row(s) in 0.0350 seconds
hbase(main):019:0> get 'emp','2'
COLUMN CELL
personal data:location timestamp=1454499143334, value=midhila
personal data:mobile timestamp=1454499152049, value=9742681255
personal data:name timestamp=1454499136556, value=seeta
professional data:job title timestamp=1454499165866, value=sse
professional data:location timestamp=1454499173236, value=sri lanka
professional data:mobile timestamp=1454499180039, value=9742681255
6 row(s) in 0.0190 seconds
hbase(main):006:0> get 'emp', '1', 'personal data:name'
COLUMN CELL
personal data:name timestamp=1454498917042, value=sriram
1 row(s) in 0.0070 seconds
hbase(main):007:0> get 'emp', '1', 'personal data'
COLUMN CELL
personal data:location timestamp=1454498924898, value=ayodhya
personal data:mobile timestamp=1454498936233, value=7204437072
personal data:name timestamp=1454498917042, value=sriram
3 row(s) in 0.0070 seconds
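The same create/put/get flow is also available from the HBase 1.0 Java client API. Below is a minimal sketch, assuming the HBase client jars and the cluster's hbase-site.xml are on the classpath; the class name is just for illustration:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class EmpExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();                 // picks up hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      // create 'emp', 'personal data', 'professional data'
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("emp"));
      desc.addFamily(new HColumnDescriptor("personal data"));
      desc.addFamily(new HColumnDescriptor("professional data"));
      admin.createTable(desc);

      try (Table emp = connection.getTable(TableName.valueOf("emp"))) {
        // put 'emp','1','personal data:name','sriram'
        Put put = new Put(Bytes.toBytes("1"));
        put.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"), Bytes.toBytes("sriram"));
        emp.put(put);

        // get 'emp', '1', 'personal data:name'
        Get get = new Get(Bytes.toBytes("1"));
        get.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
        Result result = emp.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"))));
      }
    }
  }
}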
### Updating and deleting
hbase(main):002:0> put 'emp','1','personal data:city','Delhi'
hbase(main):012:0> delete 'emp','1','personal data:city'
0 row(s) in 0.0180 seconds
hbase(main):013:0> get 'emp', '1', 'personal data:city'
COLUMN CELL
0 row(s) in 0.0050 seconds
hbase(main):014:0> get 'emp', '1', 'personal data'
COLUMN CELL
personal data:location timestamp=1454498924898, value=ayodhya
personal data:mobile timestamp=1454498936233, value=7204437072
personal data:name timestamp=1454498917042, value=sriram
3 row(s) in 0.0090 seconds
hbase(main):016:0> put 'emp','3','personal data:name','hanuma'
0 row(s) in 0.0070 seconds
hbase(main):017:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal data:location, timestamp=1454498924898, value=ayodhya
1 column=personal data:mobile, timestamp=1454498936233, value=7204437072
1 column=personal data:name, timestamp=1454498917042, value=sriram
1 column=professional data:job title, timestamp=1454498941627, value=lead
1 column=professional data:location, timestamp=1454498947357, value=sri lanka
1 column=professional data:mobile, timestamp=1454498957809, value=7204437072
2 column=personal data:location, timestamp=1454499143334, value=midhila
2 column=personal data:mobile, timestamp=1454499152049, value=9742681255
2 column=personal data:name, timestamp=1454500493425, value=seeta
2 column=professional data:job title, timestamp=1454499165866, value=sse
2 column=professional data:location, timestamp=1454499173236, value=sri lanka
2 column=professional data:mobile, timestamp=1454499180039, value=9742681255
3 column=personal data:name, timestamp=1454500510145, value=hanuma
3 row(s) in 0.0240 seconds
hbase(main):018:0> deleteall 'emp','3'
0 row(s) in 0.0130 seconds
hbase(main):019:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal data:location, timestamp=1454498924898, value=ayodhya
1 column=personal data:mobile, timestamp=1454498936233, value=7204437072
1 column=personal data:name, timestamp=1454498917042, value=sriram
1 column=professional data:job title, timestamp=1454498941627, value=lead
1 column=professional data:location, timestamp=1454498947357, value=sri lanka
1 column=professional data:mobile, timestamp=1454498957809, value=7204437072
2 column=personal data:location, timestamp=1454499143334, value=midhila
2 column=personal data:mobile, timestamp=1454499152049, value=9742681255
2 column=personal data:name, timestamp=1454500493425, value=seeta
2 column=professional data:job title, timestamp=1454499165866, value=sse
2 column=professional data:location, timestamp=1454499173236, value=sri lanka
2 column=professional data:mobile, timestamp=1454499180039, value=9742681255
2 row(s) in 0.0250 seconds
hbase(main):020:0> count 'emp'
2 row(s) in 0.0350 seconds
=> 2
hbase(main):021:0> create 'emptemp', 'personal data', 'professional data'
0 row(s) in 0.2370 seconds
=> Hbase::Table - emptemp
hbase(main):022:0> truncate 'emptemp'
Truncating 'emptemp' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 1.4410 seconds
hbase(main):023:0> describe 'emptemp'
Table emptemp is ENABLED
emptemp
COLUMN FAMILIES DESCRIPTION
{NAME => 'personal data', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'professional data', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0190 seconds
hbase(main):024:0> scan 'emptemp'
ROW COLUMN+CELL
0 row(s) in 0.0060 seconds
hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds
hbase(main):019:0> drop 'emp'
0 row(s) in 0.3060 seconds
hbase> drop_all 't.*'
Note: Before dropping a table, you must disable it.
ubuntu@ubuntu:~$ stop-hbase.sh
stopping hbase..........................
localhost: stopping zookeeper.
ubuntu@ubuntu:~$ hadoop fs -ls /hbase
Found 6 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/.tmp
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/WALs
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data
-rw-r--r-- 1 ubuntu supergroup 42 2016-02-04 16:20 /hbase/hbase.id
-rw-r--r-- 1 ubuntu supergroup 7 2016-02-04 16:20 /hbase/hbase.version
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:30 /hbase/oldWALs
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/WALs
Found 2 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/WALs/hregion-04717635
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/WALs/ubuntu,16201,1454583002554
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/data
Found 2 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:33 /hbase/data/default
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/data/hbase
Found 2 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/meta
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/namespace
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/data/hbase/meta
Found 3 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/meta/.tabledesc
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/meta/.tmp
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/meta/1588230740
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/data/hbase/namespace
Found 3 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/namespace/.tabledesc
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/namespace/.tmp
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/namespace/f5b52c99ade0ca0d46213dfe3f1da63a
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/hbase.id
-rw-r--r-- 1 ubuntu supergroup 42 2016-02-04 16:20 /hbase/hbase.id
ubuntu@ubuntu:~$ hadoop fs -cat /hbase/hbase.id
PBUF
ubuntu@ubuntu:~$ hadoop fs -cat /hbase/hbase.version
PBUF
A {row, column, version} tuple exactly specifies a cell in HBase. It's possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.
While rows and column keys are expressed as bytes, the version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: “the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC”.
The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.
There is a lot of confusion over the semantics of cell versions in HBase. In particular, a couple of questions often come up:
If multiple writes to a cell have the same version, are all versions maintained or just the last?
- Currently, only the last written is fetchable.
Is it OK to write cells in a non-increasing version order?
- Yes
Below we describe how the version dimension in HBase currently works.
##Gets are implemented on top of Scans.
By default, i.e. if you specify no explicit version, when doing a get, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:
to return more than one version, see Get.setMaxVersions()
to return versions other than the latest, see Get.setTimeRange()
Get get = new Get(Bytes.toBytes("row1"));
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value
Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3); // will return last 3 versions of row
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value
List<Cell> cells = r.getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns all stored versions of this column
##Put
Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server's currentTimeMillis.
Put put = new Put(Bytes.toBytes(row));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes(data));
htable.put(put);
Put put = new Put(Bytes.toBytes(row));
long explicitTimeInMs = 555; // just an example
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), explicitTimeInMs, Bytes.toBytes(data));
htable.put(put);
##Delete
There are three different types of internal delete markers :
Delete: for a specific version of a column.
Delete column: for all versions of a column.
Delete family: for all columns of a particular ColumnFamily
When deleting an entire row, HBase will internally create a tombstone for each ColumnFamily (i.e., not each individual column).
Deletes work by creating tombstone markers. For example, let's suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is “delete all cells where the version is less than or equal to this version”. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted.
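As a minimal Java sketch of the three marker types (assuming an open Connection as in the earlier example; the row keys and column names are placeholders):
Table emp = connection.getTable(TableName.valueOf("emp"));
Delete d = new Delete(Bytes.toBytes("1"));
d.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("city"));    // Delete: only the latest version of one column
d.addColumns(Bytes.toBytes("personal data"), Bytes.toBytes("city"));   // Delete column: all versions of one column
d.addFamily(Bytes.toBytes("professional data"));                       // Delete family: every column in the family
emp.delete(d);
// The shell's deleteall 'emp','3' is simply a Delete carrying only the row key,
// which writes one family tombstone per column family:
emp.delete(new Delete(Bytes.toBytes("3")));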
The maximum number of row versions to store is configured per column family via HColumnDescriptor. In this HBase version the default for max versions is 1 (as the describe output above shows). This is an important parameter because, as described in the Data Model chapter, HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of max versions may need to be increased or decreased depending on application needs.
It is not recommended to set the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are very dear to you, because this will greatly increase StoreFile size.
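From the Java Admin API the same setting looks roughly like this (a sketch assuming an open Connection; the shell alter below achieves the same thing):
Admin admin = connection.getAdmin();
HColumnDescriptor cf = new HColumnDescriptor("personal details");
cf.setMaxVersions(3);                                  // keep up to 3 versions per cell in this family
admin.modifyColumn(TableName.valueOf("rawdocs"), cf);  // equivalent to: alter 'rawdocs', NAME=>'personal details', VERSIONS=>3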
hbase(main):026:0> alter 'rawdocs', NAME=>'personal details', VERSIONS =>3
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 3.3390 seconds
hbase(main):027:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>10}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027779087, value=raghuram
personal details:name timestamp=1454939129933, value=sriram
3 row(s) in 0.0480 seconds
hbase(main):028:0> put 'rawdocs', '1', 'personal details:name', 'seetaram'
0 row(s) in 0.0320 seconds
hbase(main):029:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>10}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027956861, value=seetaram
personal details:name timestamp=1455027779087, value=raghuram
personal details:name timestamp=1454939129933, value=sriram
4 row(s) in 0.0470 seconds
hbase(main):030:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>3}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027956861, value=seetaram
personal details:name timestamp=1455027779087, value=raghuram
personal details:name timestamp=1454939129933, value=sriram
4 row(s) in 0.0770 seconds
hbase(main):031:0> put 'rawdocs', '1', 'personal details:name', 'ram'
0 row(s) in 0.0260 seconds
hbase(main):032:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>3}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027989430, value=ram
personal details:name timestamp=1455027956861, value=seetaram
personal details:name timestamp=1455027779087, value=raghuram
4 row(s) in 0.0630 seconds
hbase(main):033:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>10}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027989430, value=ram
personal details:name timestamp=1455027956861, value=seetaram
personal details:name timestamp=1455027779087, value=raghuram
4 row(s) in 0.0800 seconds
hbase(main):059:0> create 'emp', {NAME=>'personal details', VERSIONS=>3, KEEP_DELETED_CELLS => true}, {NAME=>'professional details', VERSIONS=>5}
0 row(s) in 1.2830 seconds
=> Hbase::Table - emp
hbase(main):062:0> put 'emp','1','personal details:name','sriram'
0 row(s) in 0.0170 seconds
hbase(main):063:0> put 'emp','1','professional details:job title','lead'
0 row(s) in 0.0080 seconds
hbase(main):086:0> get 'emp', 1, {COLUMN => 'professional details', VERSIONS=>5}
COLUMN CELL
professional details:job title timestamp=1455029722801, value=ceo
professional details:job title timestamp=1455029709250, value=sr.mgr
professional details:job title timestamp=1455029706301, value=mgr
professional details:job title timestamp=1455029696376, value=sse
professional details:job title timestamp=1455029291946, value=lead
5 row(s) in 0.0300 seconds
hbase(main):087:0> get 'emp', 1, {COLUMN => 'professional details', VERSIONS=>3}
COLUMN CELL
professional details:job title timestamp=1455029722801, value=ceo
professional details:job title timestamp=1455029709250, value=sr.mgr
professional details:job title timestamp=1455029706301, value=mgr
3 row(s) in 0.0080 seconds
hbase(main):088:0> get 'emp', 1, {COLUMN => 'personal details', VERSIONS=>3}
COLUMN CELL
personal details:name timestamp=1455029437968, value=ram
personal details:name timestamp=1455029377211, value=seetaram
personal details:name timestamp=1455029368125, value=raghuram
3 row(s) in 0.0220 seconds
hbase(main):089:0> get 'emp', 1, {COLUMN => 'personal details', VERSIONS=>2}
COLUMN CELL
personal details:name timestamp=1455029437968, value=ram
personal details:name timestamp=1455029377211, value=seetaram
2 row(s) in 0.0070 seconds
hbase(main):103:0> put 'emp','1','personal details:name','ram'
0 row(s) in 0.0300 seconds
hbase(main):104:0> put 'emp','1','personal details:location','ayodhya'
0 row(s) in 0.0060 seconds
hbase(main):105:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal details:location, timestamp=1455030553278, value=ayodhya
1 column=personal details:name, timestamp=1455030537916, value=ram
1 column=professional details:job title, timestamp=1455029722801, value=ceo
1 row(s) in 0.0160 seconds
hbase(main):106:0> get 'emp', 1, {COLUMN => 'personal details', VERSIONS=>3}
COLUMN CELL
personal details:location timestamp=1455030553278, value=ayodhya
personal details:name timestamp=1455030537916, value=ram
2 row(s) in 0.0100 seconds
hbase(main):107:0> delete 'emp', 1, 'personal details:name'
0 row(s) in 0.0070 seconds
hbase(main):108:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal details:location, timestamp=1455030553278, value=ayodhya
1 column=professional details:job title, timestamp=1455029722801, value=ceo
1 row(s) in 0.0110 seconds
Hive Vs HBase
--------------
# Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows for querying data stored on HDFS for analysis via HQL, an SQL-like language that gets translated to MapReduce jobs. Despite providing SQL functionality, Hive does not provide interactive querying yet - it only runs batch processes on Hadoop.
# Apache HBase is a NoSQL key/value store which runs on top of HDFS. Unlike Hive, HBase operations run in real-time on its database rather than as MapReduce jobs. HBase is partitioned into tables, and tables are further split into column families. Column families, which must be declared in the schema, group together a certain set of columns (columns don't require a schema definition). For example, the "message" column family may include the columns: "to", "from", "date", "subject", and "body". Each key/value pair in HBase is defined as a cell, and each key consists of row-key, column family, column, and timestamp. A row in HBase is a grouping of key/value mappings identified by the row-key. HBase leverages Hadoop's infrastructure and scales horizontally using off-the-shelf servers.
# Features
Hive can help the SQL savvy run MapReduce jobs. Since it's JDBC compliant, it also integrates with existing SQL-based tools. Running Hive queries can take a while since, by default, they go over all of the data in the table. Nonetheless, the amount of data can be limited via Hive's partitioning feature. Partitioning allows a filter query to run over data stored in separate folders and only read the data that matches the query. It could be used, for example, to only process files created between certain dates, if the files include the date format as part of their name.
HBase works by storing data as key/value pairs. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns, or column versions from the table. Versioning is available so that previous values of the data can be fetched (the history can be pruned periodically to clear space via HBase compactions). Although HBase includes tables, a schema is only required for tables and column families, not for columns, and it includes increment/counter functionality.
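For instance, a range scan through the Java client looks roughly like this (a sketch assuming an open Connection; the start/stop keys and column family are placeholders):
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("1"));            // inclusive
scan.setStopRow(Bytes.toBytes("3"));             // exclusive
scan.addFamily(Bytes.toBytes("personal data"));
try (Table emp = connection.getTable(TableName.valueOf("emp"));
     ResultScanner scanner = emp.getScanner(scan)) {
    for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()));
    }
}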
# Limitations
Hive does not currently support update statements. Additionally, since it runs batch processing on Hadoop, it can take minutes or even hours to get back results for queries. Hive must also be provided with a predefined schema to map files and directories into columns and it is not ACID compliant.
HBase queries are written in a custom language that needs to be learned. SQL-like functionality can be achieved via Apache Phoenix, though it comes at the price of maintaining a schema. Furthermore, HBase isn’t fully ACID compliant, although it does support certain properties. Last but not least - in order to run HBase, ZooKeeper is required - a server for distributed coordination such as configuration, maintenance, and naming.
# Use Cases
Hive should be used for analytical querying of data collected over a period of time - for instance, to calculate trends or summarize website logs. Hive should not be used for real-time querying since it could take a while before any results are returned.
HBase is a good fit for real-time querying of Big Data. Facebook uses it for messaging and real-time analytics. They may even be using it to count Facebook likes.
# Summary
Hive and HBase are two different Hadoop-based technologies - Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. But why not use them both? Just as Google can be used for search and Facebook for social networking, Hive can be used for analytical queries while HBase serves real-time querying. Data can even be read and written from Hive to HBase and back again.
MapReduce is just a computing framework; HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using other HBase APIs, such as the Java client, to put or fetch the data. But we use Hadoop, HBase, etc. to deal with gigantic amounts of data, so that doesn't make much sense: normal sequential programs are highly inefficient when the data is very large.
Coming back to the first part of the question, Hadoop is basically two things: a distributed file system (HDFS) plus a computation or processing framework (MapReduce). Like any other file system, HDFS provides storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). However, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture. It is a distributed, scalable, big data store, modelled after Google's BigTable, and it stores data as key/value pairs.
Coming to Hive: it provides data warehousing facilities on top of an existing Hadoop cluster, along with an SQL-like interface that makes your work easier if you are coming from an SQL background. You can create tables in Hive and store data there, and you can even map your existing HBase tables to Hive and operate on them.
Pig, meanwhile, is a dataflow language that lets us process enormous amounts of data easily and quickly. Pig has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them with the Pig interpreter. Pig makes life a lot easier, because writing raw MapReduce is not always easy; in some cases it can be a real pain.
Both Hive and Pig queries get converted into MapReduce jobs under the hood.
# Objective
To make the daily web log files collected from 350+ servers queryable through an SQL-like language
To replace the daily aggregation data generated through MySQL with Hive
To build custom reports through queries in Hive
# Architecture Options
I benchmarked the following options: 1. Hive+HDFS 2. Hive+HBase - queries were too slow, so this option was dropped
# Design
Daily log files were transported to HDFS
MR jobs parsed these log files and wrote output files to HDFS
Create Hive tables with partitions and locations pointing to the HDFS locations
Create Hive query scripts (call it HQL if you like, to distinguish it from SQL) that in turn ran MR jobs in the background and generated aggregation data
Put all these steps into an Oozie workflow - scheduled with a daily Oozie coordinator
# Summary
HBase is like a Map: if you know the key, you can get the value instantly. But if you want to know, say, how many integer keys in HBase are between 1000000 and 2000000, HBase alone is not suitable. If you have data that needs to be aggregated, rolled up, or analyzed across rows, consider Hive.
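To make that concrete: counting keys in a range means scanning every row in the range client-side, since there is no index-only count. A rough sketch, assuming an open Table handle and row keys stored as 8-byte big-endian longs:
Scan scan = new Scan(Bytes.toBytes(1000000L), Bytes.toBytes(2000000L)); // [start, stop)
scan.setFilter(new FirstKeyOnlyFilter()); // return only the first cell per row; we only need to count rows
long count = 0;
try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result r : scanner) {
        count++;
    }
}
System.out.println("keys between 1000000 and 2000000: " + count);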
#Real Time Processing
Apache Storm: stream data processing, rule application
HBase: datastore for serving the real-time dashboard
#Batch Processing
Hadoop: crunching huge chunks of data; building a 360-degree overview or adding context to events. Interfaces and frameworks like Pig, MR, Spark, Hive, and Shark help with the computation. This layer needs a scheduler, for which Oozie is a good option.
#Event Handling layer
Apache Kafka was the first layer, consuming high-velocity events from the sensors. Kafka serves both real-time and batch analytics data flows through LinkedIn connectors.
# Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
# There are four main modules in Hadoop:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
# Before going further, let's note that we have three different types of data.
Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL Server, etc.
Unstructured: Data does not have any structure and can be in any form - web server logs, e-mail, images, etc.
Semi-structured: Data is not strictly structured but has some structure, e.g. XML files.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
Pig: A high-level data-flow language and execution framework for parallel computation.
# Hive Vs PIG comparison can be found at this article and my other post at this SE question.
HBase won't replace MapReduce. HBase is a scalable distributed database and MapReduce is a programming model for distributed processing of data; MapReduce may act on data stored in HBase as part of that processing.
You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce
You can use Sqoop to import structured data from traditional RDBMS databases such as Oracle and SQL Server, and process it with Hadoop MapReduce
You can use Flume to ingest unstructured data and process it with Hadoop MapReduce
Hive should be used for analytical querying of data collected over a period of time, e.g. calculating trends or summarizing website logs, but it can't be used for real-time queries.
HBase fits real-time querying of Big Data. Facebook uses it for messaging and real-time analytics.
Pig can be used to construct dataflows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it, and store it into relational database systems. Good for ad-hoc analysis.
Hive can be used for ad-hoc data analysis, but unlike Pig it can't support all unstructured data formats.
hive>CREATE TABLE thanooj.hbase_docs(ID INT, name STRING, dt STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personal details:name,personal details:dt") TBLPROPERTIES ("hbase.table.name" = "rawdocs");
hive>CREATE TABLE thanooj.hbase_docs_raw (ID INT, name STRING, dt STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
hive>LOAD DATA LOCAL INPATH '/home/ubuntu/input/raw.txt' OVERWRITE INTO TABLE thanooj.hbase_docs_raw;
hive>INSERT OVERWRITE TABLE thanooj.hbase_docs SELECT * FROM thanooj.hbase_docs_raw;
hive> select * from hbase_docs;
OK
1 sriram 2015-10-12
2 seeta 2015-09-12
3 lakshman 2015-11-12
Time taken: 0.377 seconds, Fetched: 3 row(s)
hbase(main):003:0> scan 'rawdocs'
ROW COLUMN+CELL
1 column=personal details:dt, timestamp=1454939129933, value=2015-10-12
1 column=personal details:name, timestamp=1454939129933, value=sriram
2 column=personal details:dt, timestamp=1454939129933, value=2015-09-12
2 column=personal details:name, timestamp=1454939129933, value=seeta
3 column=personal details:dt, timestamp=1454939129933, value=2015-11-12
3 column=personal details:name, timestamp=1454939129933, value=lakshman
3 row(s) in 0.3250 seconds
ubuntu@ubuntu:~$ hadoop fs -ls /home/
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:15 /home/ubuntu
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:15 /home/ubuntu/softwares
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:15 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:15 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:30 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db
Found 3 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:30 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/_tmp.hbase_docs
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:45 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:29 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs_raw
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs_raw;
Found 1 items
-rwxr-xr-x 1 ubuntu supergroup 61 2016-02-08 05:29 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs_raw/raw.txt
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs
hive> CREATE EXTERNAL TABLE thanooj.hbase_docs_09(ID INT, name STRING, dt STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personal details:name,personal details:dt") TBLPROPERTIES ("hbase.table.name" = "rawdocs");
OK
Time taken: 1.669 seconds
hive> select * from hbase_docs_09;
OK
1 sriram 2015-10-12
2 seeta 2015-09-12
3 lakshman 2015-11-12
Time taken: 2.514 seconds, Fetched: 3 row(s)
hive>
Reference: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
Inserting large amounts of data may be slow due to WAL overhead; if you would like to disable this, make sure you have HIVE-1383 (as of Hive 0.6), and then issue this command before the INSERT:
set hive.hbase.wal.enabled=false;
Warning: disabling WAL may lead to data loss if an HBase failure occurs, so only use this if you have some other recovery strategy available.
If you want to give Hive access to an existing HBase table, use CREATE EXTERNAL TABLE:
CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
TBLPROPERTIES("hbase.table.name" = "some_existing_table");
Again, hbase.columns.mapping is required (and will be validated against the existing HBase table's column families), whereas hbase.table.name is optional.
hbase-site.xml
--------------
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:54310/hbase</value>
  </property>
  <!--
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/ubuntu</value>
  </property>
  -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>