HBase basics
ubuntu@ubuntu:~$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop2/logs/hadoop-ubuntu-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop2/logs/hadoop-ubuntu-datanode-ubuntu.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop2/logs/hadoop-ubuntu-secondarynamenode-ubuntu.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop2/logs/yarn-ubuntu-resourcemanager-ubuntu.out
localhost: starting nodemanager, logging to /usr/local/hadoop2/logs/yarn-ubuntu-nodemanager-ubuntu.out
ubuntu@ubuntu:~$ jps
3714 Jps
2531 NameNode
3091 ResourceManager
2692 DataNode
3256 NodeManager
2890 SecondaryNameNode
ubuntu@ubuntu:~$ start-hbase.sh
localhost: starting zookeeper, logging to /home/ubuntu/hbase-1.0.1.1/bin/../logs/hbase-ubuntu-zookeeper-ubuntu.out
starting master, logging to /home/ubuntu/hbase-1.0.1.1/logs/hbase-ubuntu-master-ubuntu.out
starting regionserver, logging to /home/ubuntu/hbase-1.0.1.1/logs/hbase-ubuntu-1-regionserver-ubuntu.out
ubuntu@ubuntu:~$ jps
8704 HRegionServer
7489 NodeManager
8514 HQuorumPeer
6757 NameNode
8581 HMaster
6919 DataNode
7145 SecondaryNameNode
8925 Jps
7325 ResourceManager
ubuntu@ubuntu:~$ hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ubuntu/hbase-1.0.1.1/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.0.1.1, re1dbf4df30d214fca14908df71d038081577ea46, Sun May 17 12:34:26 PDT 2015
hbase(main):001:0> create 'emp', 'personal data', 'professional data'
0 row(s) in 0.5120 seconds
=> Hbase::Table - emp
hbase(main):002:0> put 'emp','1','personal data:name','sriram'
0 row(s) in 0.0830 seconds
hbase(main):003:0> put 'emp','1','personal data:location','ayodhya'
0 row(s) in 0.0140 seconds
hbase(main):004:0> put 'emp','1','personal data:mobile','7204437072'
0 row(s) in 0.0080 seconds
hbase(main):005:0> put 'emp','1','professional data:job title','lead'
0 row(s) in 0.0210 seconds
hbase(main):006:0> put 'emp','1','professional data:location','sri lanka'
0 row(s) in 0.0080 seconds
hbase(main):007:0> put 'emp','1','professional data:mobile','7204437072'
0 row(s) in 0.0090 seconds
hbase(main):009:0> list
TABLE
emp
test
2 row(s) in 0.0250 seconds
=> ["emp", "test"]
hbase(main):010:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal data:location, timestamp=1454498924898, value=ayodhya
1 column=personal data:mobile, timestamp=1454498936233, value=7204437072
1 column=personal data:name, timestamp=1454498917042, value=sriram
1 column=professional data:job title, timestamp=1454498941627, value=lead
1 column=professional data:location, timestamp=1454498947357, value=sri lanka
1 column=professional data:mobile, timestamp=1454498957809, value=7204437072
1 row(s) in 0.0360 seconds
hbase(main):011:0> put 'emp','2','personal data:name','seeta'
0 row(s) in 0.0140 seconds
hbase(main):012:0> put 'emp','2','personal data:location','midhila'
0 row(s) in 0.0080 seconds
hbase(main):013:0> put 'emp','2','personal data:mobile','9742681255'
0 row(s) in 0.0050 seconds
hbase(main):014:0> put 'emp','2','professional data:job title','sse'
0 row(s) in 0.0060 seconds
hbase(main):015:0> put 'emp','2','professional data:location','sri lanka'
0 row(s) in 0.0100 seconds
hbase(main):016:0> put 'emp','2','professional data:mobile','9742681255'
0 row(s) in 0.0040 seconds
hbase(main):017:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal data:location, timestamp=1454498924898, value=ayodhya
1 column=personal data:mobile, timestamp=1454498936233, value=7204437072
1 column=personal data:name, timestamp=1454498917042, value=sriram
1 column=professional data:job title, timestamp=1454498941627, value=lead
1 column=professional data:location, timestamp=1454498947357, value=sri lanka
1 column=professional data:mobile, timestamp=1454498957809, value=7204437072
2 column=personal data:location, timestamp=1454499143334, value=midhila
2 column=personal data:mobile, timestamp=1454499152049, value=9742681255
2 column=personal data:name, timestamp=1454499136556, value=seeta
2 column=professional data:job title, timestamp=1454499165866, value=sse
2 column=professional data:location, timestamp=1454499173236, value=sri lanka
2 column=professional data:mobile, timestamp=1454499180039, value=9742681255
2 row(s) in 0.0350 seconds
hbase(main):019:0> get 'emp','2'
COLUMN CELL
personal data:location timestamp=1454499143334, value=midhila
personal data:mobile timestamp=1454499152049, value=9742681255
personal data:name timestamp=1454499136556, value=seeta
professional data:job title timestamp=1454499165866, value=sse
professional data:location timestamp=1454499173236, value=sri lanka
professional data:mobile timestamp=1454499180039, value=9742681255
6 row(s) in 0.0190 seconds
hbase(main):006:0> get 'emp', '1', 'personal data:name'
COLUMN CELL
personal data:name timestamp=1454498917042, value=sriram
1 row(s) in 0.0070 seconds
hbase(main):007:0> get 'emp', '1', 'personal data'
COLUMN CELL
personal data:location timestamp=1454498924898, value=ayodhya
personal data:mobile timestamp=1454498936233, value=7204437072
personal data:name timestamp=1454498917042, value=sriram
3 row(s) in 0.0070 seconds
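The same create/put/get flow is also available from the HBase 1.0 Java client API. Below is a minimal sketch, assuming the HBase client jars and the cluster's hbase-site.xml are on the classpath; the class name is just for illustration:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class EmpExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();                 // picks up hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      // create 'emp', 'personal data', 'professional data'
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("emp"));
      desc.addFamily(new HColumnDescriptor("personal data"));
      desc.addFamily(new HColumnDescriptor("professional data"));
      admin.createTable(desc);

      try (Table emp = connection.getTable(TableName.valueOf("emp"))) {
        // put 'emp','1','personal data:name','sriram'
        Put put = new Put(Bytes.toBytes("1"));
        put.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"), Bytes.toBytes("sriram"));
        emp.put(put);

        // get 'emp', '1', 'personal data:name'
        Get get = new Get(Bytes.toBytes("1"));
        get.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"));
        Result result = emp.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"))));
      }
    }
  }
}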
### Updating and deleting
hbase(main):002:0> put 'emp','1','personal data:city','Delhi'
hbase(main):012:0> delete 'emp','1','personal data:city'
0 row(s) in 0.0180 seconds
hbase(main):013:0> get 'emp', '1', 'personal data:city'
COLUMN CELL
0 row(s) in 0.0050 seconds
hbase(main):014:0> get 'emp', '1', 'personal data'
COLUMN CELL
personal data:location timestamp=1454498924898, value=ayodhya
personal data:mobile timestamp=1454498936233, value=7204437072
personal data:name timestamp=1454498917042, value=sriram
3 row(s) in 0.0090 seconds
hbase(main):016:0> put 'emp','3','personal data:name','hanuma'
0 row(s) in 0.0070 seconds
hbase(main):017:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal data:location, timestamp=1454498924898, value=ayodhya
1 column=personal data:mobile, timestamp=1454498936233, value=7204437072
1 column=personal data:name, timestamp=1454498917042, value=sriram
1 column=professional data:job title, timestamp=1454498941627, value=lead
1 column=professional data:location, timestamp=1454498947357, value=sri lanka
1 column=professional data:mobile, timestamp=1454498957809, value=7204437072
2 column=personal data:location, timestamp=1454499143334, value=midhila
2 column=personal data:mobile, timestamp=1454499152049, value=9742681255
2 column=personal data:name, timestamp=1454500493425, value=seeta
2 column=professional data:job title, timestamp=1454499165866, value=sse
2 column=professional data:location, timestamp=1454499173236, value=sri lanka
2 column=professional data:mobile, timestamp=1454499180039, value=9742681255
3 column=personal data:name, timestamp=1454500510145, value=hanuma
3 row(s) in 0.0240 seconds
hbase(main):018:0> deleteall 'emp','3'
0 row(s) in 0.0130 seconds
hbase(main):019:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal data:location, timestamp=1454498924898, value=ayodhya
1 column=personal data:mobile, timestamp=1454498936233, value=7204437072
1 column=personal data:name, timestamp=1454498917042, value=sriram
1 column=professional data:job title, timestamp=1454498941627, value=lead
1 column=professional data:location, timestamp=1454498947357, value=sri lanka
1 column=professional data:mobile, timestamp=1454498957809, value=7204437072
2 column=personal data:location, timestamp=1454499143334, value=midhila
2 column=personal data:mobile, timestamp=1454499152049, value=9742681255
2 column=personal data:name, timestamp=1454500493425, value=seeta
2 column=professional data:job title, timestamp=1454499165866, value=sse
2 column=professional data:location, timestamp=1454499173236, value=sri lanka
2 column=professional data:mobile, timestamp=1454499180039, value=9742681255
2 row(s) in 0.0250 seconds
hbase(main):020:0> count 'emp'
2 row(s) in 0.0350 seconds
=> 2
hbase(main):021:0> create 'emptemp', 'personal data', 'professional data'
0 row(s) in 0.2370 seconds
=> Hbase::Table - emptemp
hbase(main):022:0> truncate 'emptemp'
Truncating 'emptemp' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 1.4410 seconds
hbase(main):023:0> describe 'emptemp'
Table emptemp is ENABLED
emptemp
COLUMN FAMILIES DESCRIPTION
{NAME => 'personal data', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'professional data', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0190 seconds
hbase(main):024:0> scan 'emptemp'
ROW COLUMN+CELL
0 row(s) in 0.0060 seconds
hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds
hbase(main):019:0> drop 'emp'
0 row(s) in 0.3060 seconds
hbase> drop_all 't.*'
Note: Before dropping a table, you must disable it.
ubuntu@ubuntu:~$ stop-hbase.sh
stopping hbase..........................
localhost: stopping zookeeper.
ubuntu@ubuntu:~$ hadoop fs -ls /hbase
Found 6 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/.tmp
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/WALs
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data
-rw-r--r-- 1 ubuntu supergroup 42 2016-02-04 16:20 /hbase/hbase.id
-rw-r--r-- 1 ubuntu supergroup 7 2016-02-04 16:20 /hbase/hbase.version
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:30 /hbase/oldWALs
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/WALs
Found 2 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/WALs/hregion-04717635
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/WALs/ubuntu,16201,1454583002554
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/data
Found 2 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:33 /hbase/data/default
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/data/hbase
Found 2 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/meta
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/namespace
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/data/hbase/meta
Found 3 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/meta/.tabledesc
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/meta/.tmp
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/meta/1588230740
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/data/hbase/namespace
Found 3 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/namespace/.tabledesc
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/namespace/.tmp
drwxr-xr-x - ubuntu supergroup 0 2016-02-04 16:20 /hbase/data/hbase/namespace/f5b52c99ade0ca0d46213dfe3f1da63a
ubuntu@ubuntu:~$ hadoop fs -ls /hbase/hbase.id
-rw-r--r-- 1 ubuntu supergroup 42 2016-02-04 16:20 /hbase/hbase.id
ubuntu@ubuntu:~$ hadoop fs -cat /hbase/hbase.id
PBUF
ubuntu@ubuntu:~$ hadoop fs -cat /hbase/hbase.version
PBUF
A {row, column, version} tuple exactly specifies a cell in HBase. It's possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.
While rows and column keys are expressed as bytes, the version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: “the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC”.
The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.
There is a lot of confusion over the semantics of cell versions in HBase. In particular, a couple of questions often come up:
If multiple writes to a cell have the same version, are all versions maintained or just the last?
- Currently, only the last written is fetchable.
Is it OK to write cells in a non-increasing version order?
- Yes
Below we describe how the version dimension in HBase currently works.
##Gets are implemented on top of Scans.
By default, i.e. if you specify no explicit version, when doing a get, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:
to return more than one version, see Get.setMaxVersions()
to return versions other than the latest, see Get.setTimeRange()
Get get = new Get(Bytes.toBytes("row1"));
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value
Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3); // will return last 3 versions of row
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value
List<Cell> cells = r.getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns all stored versions of this column
##Put
Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server's currentTimeMillis.
Put put = new Put(Bytes.toBytes(row));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes(data));
htable.put(put);
Put put = new Put(Bytes.toBytes(row));
long explicitTimeInMs = 555; // just an example
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), explicitTimeInMs, Bytes.toBytes(data));
htable.put(put);
##Delete
There are three different types of internal delete markers :
Delete: for a specific version of a column.
Delete column: for all versions of a column.
Delete family: for all columns of a particular ColumnFamily
When deleting an entire row, HBase will internally create a tombstone for each ColumnFamily (i.e., not each individual column).
Deletes work by creating tombstone markers. For example, let's suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is “delete all cells where the version is less than or equal to this version”. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted.
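As a minimal Java sketch of the three marker types (assuming an open Connection as in the earlier example; the row keys and column names are placeholders):
Table emp = connection.getTable(TableName.valueOf("emp"));
Delete d = new Delete(Bytes.toBytes("1"));
d.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("city"));    // Delete: only the latest version of one column
d.addColumns(Bytes.toBytes("personal data"), Bytes.toBytes("city"));   // Delete column: all versions of one column
d.addFamily(Bytes.toBytes("professional data"));                       // Delete family: every column in the family
emp.delete(d);
// The shell's deleteall 'emp','3' is simply a Delete carrying only the row key,
// which writes one family tombstone per column family:
emp.delete(new Delete(Bytes.toBytes("3")));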
The maximum number of row versions to store is configured per column family via HColumnDescriptor. In this HBase version the default for max versions is 1 (as the describe output above shows). This is an important parameter because, as described in the Data Model chapter, HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of max versions may need to be increased or decreased depending on application needs.
It is not recommended to set the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are very dear to you, because this will greatly increase StoreFile size.
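From the Java Admin API the same setting looks roughly like this (a sketch assuming an open Connection; the shell alter below achieves the same thing):
Admin admin = connection.getAdmin();
HColumnDescriptor cf = new HColumnDescriptor("personal details");
cf.setMaxVersions(3);                                  // keep up to 3 versions per cell in this family
admin.modifyColumn(TableName.valueOf("rawdocs"), cf);  // equivalent to: alter 'rawdocs', NAME=>'personal details', VERSIONS=>3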
hbase(main):026:0> alter 'rawdocs', NAME=>'personal details', VERSIONS =>3
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 3.3390 seconds
hbase(main):027:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>10}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027779087, value=raghuram
personal details:name timestamp=1454939129933, value=sriram
3 row(s) in 0.0480 seconds
hbase(main):028:0> put 'rawdocs', '1', 'personal details:name', 'seetaram'
0 row(s) in 0.0320 seconds
hbase(main):029:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>10}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027956861, value=seetaram
personal details:name timestamp=1455027779087, value=raghuram
personal details:name timestamp=1454939129933, value=sriram
4 row(s) in 0.0470 seconds
hbase(main):030:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>3}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027956861, value=seetaram
personal details:name timestamp=1455027779087, value=raghuram
personal details:name timestamp=1454939129933, value=sriram
4 row(s) in 0.0770 seconds
hbase(main):031:0> put 'rawdocs', '1', 'personal details:name', 'ram'
0 row(s) in 0.0260 seconds
hbase(main):032:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>3}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027989430, value=ram
personal details:name timestamp=1455027956861, value=seetaram
personal details:name timestamp=1455027779087, value=raghuram
4 row(s) in 0.0630 seconds
hbase(main):033:0> get 'rawdocs', 1, {COLUMN => 'personal details', VERSIONS=>10}
COLUMN CELL
personal details:dt timestamp=1454939129933, value=2015-10-12
personal details:name timestamp=1455027989430, value=ram
personal details:name timestamp=1455027956861, value=seetaram
personal details:name timestamp=1455027779087, value=raghuram
4 row(s) in 0.0800 seconds
hbase(main):059:0> create 'emp', {NAME=>'personal details', VERSIONS=>3, KEEP_DELETED_CELLS => true}, {NAME=>'professional details', VERSIONS=>5}
0 row(s) in 1.2830 seconds
=> Hbase::Table - emp
hbase(main):062:0> put 'emp','1','personal details:name','sriram'
0 row(s) in 0.0170 seconds
hbase(main):063:0> put 'emp','1','professional details:job title','lead'
0 row(s) in 0.0080 seconds
hbase(main):086:0> get 'emp', 1, {COLUMN => 'professional details', VERSIONS=>5}
COLUMN CELL
professional details:job title timestamp=1455029722801, value=ceo
professional details:job title timestamp=1455029709250, value=sr.mgr
professional details:job title timestamp=1455029706301, value=mgr
professional details:job title timestamp=1455029696376, value=sse
professional details:job title timestamp=1455029291946, value=lead
5 row(s) in 0.0300 seconds
hbase(main):087:0> get 'emp', 1, {COLUMN => 'professional details', VERSIONS=>3}
COLUMN CELL
professional details:job title timestamp=1455029722801, value=ceo
professional details:job title timestamp=1455029709250, value=sr.mgr
professional details:job title timestamp=1455029706301, value=mgr
3 row(s) in 0.0080 seconds
hbase(main):088:0> get 'emp', 1, {COLUMN => 'personal details', VERSIONS=>3}
COLUMN CELL
personal details:name timestamp=1455029437968, value=ram
personal details:name timestamp=1455029377211, value=seetaram
personal details:name timestamp=1455029368125, value=raghuram
3 row(s) in 0.0220 seconds
hbase(main):089:0> get 'emp', 1, {COLUMN => 'personal details', VERSIONS=>2}
COLUMN CELL
personal details:name timestamp=1455029437968, value=ram
personal details:name timestamp=1455029377211, value=seetaram
2 row(s) in 0.0070 seconds
hbase(main):103:0> put 'emp','1','personal details:name','ram'
0 row(s) in 0.0300 seconds
hbase(main):104:0> put 'emp','1','personal details:location','ayodhya'
0 row(s) in 0.0060 seconds
hbase(main):105:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal details:location, timestamp=1455030553278, value=ayodhya
1 column=personal details:name, timestamp=1455030537916, value=ram
1 column=professional details:job title, timestamp=1455029722801, value=ceo
1 row(s) in 0.0160 seconds
hbase(main):106:0> get 'emp', 1, {COLUMN => 'personal details', VERSIONS=>3}
COLUMN CELL
personal details:location timestamp=1455030553278, value=ayodhya
personal details:name timestamp=1455030537916, value=ram
2 row(s) in 0.0100 seconds
hbase(main):107:0> delete 'emp', 1, 'personal details:name'
0 row(s) in 0.0070 seconds
hbase(main):108:0> scan 'emp'
ROW COLUMN+CELL
1 column=personal details:location, timestamp=1455030553278, value=ayodhya
1 column=professional details:job title, timestamp=1455029722801, value=ceo
1 row(s) in 0.0110 seconds
Hive Vs HBase
--------------
# Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows for querying data stored on HDFS for analysis via HQL, an SQL-like language that gets translated to MapReduce jobs. Despite providing SQL functionality, Hive does not provide interactive querying yet - it only runs batch processes on Hadoop.
# Apache HBase is a NoSQL key/value store which runs on top of HDFS. Unlike Hive, HBase operations run in real-time on its database rather than as MapReduce jobs. HBase is partitioned into tables, and tables are further split into column families. Column families, which must be declared in the schema, group together a certain set of columns (columns don't require a schema definition). For example, the "message" column family may include the columns: "to", "from", "date", "subject", and "body". Each key/value pair in HBase is defined as a cell, and each key consists of row-key, column family, column, and timestamp. A row in HBase is a grouping of key/value mappings identified by the row-key. HBase leverages Hadoop's infrastructure and scales horizontally using off-the-shelf servers.
# Features
Hive can help the SQL savvy run MapReduce jobs. Since it's JDBC compliant, it also integrates with existing SQL-based tools. Running Hive queries can take a while since, by default, they go over all of the data in the table. Nonetheless, the amount of data can be limited via Hive's partitioning feature. Partitioning allows a filter query to run over data stored in separate folders and only read the data that matches the query. It could be used, for example, to only process files created between certain dates, if the files include the date format as part of their name.
HBase works by storing data as key/value pairs. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns, or column versions from the table. Versioning is available so that previous values of the data can be fetched (the history can be pruned periodically to clear space via HBase compactions). Although HBase includes tables, a schema is only required for tables and column families, not for columns, and it includes increment/counter functionality.
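For instance, a range scan through the Java client looks roughly like this (a sketch assuming an open Connection; the start/stop keys and column family are placeholders):
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("1"));            // inclusive
scan.setStopRow(Bytes.toBytes("3"));             // exclusive
scan.addFamily(Bytes.toBytes("personal data"));
try (Table emp = connection.getTable(TableName.valueOf("emp"));
     ResultScanner scanner = emp.getScanner(scan)) {
    for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()));
    }
}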
# Limitations
Hive does not currently support update statements. Additionally, since it runs batch processing on Hadoop, it can take minutes or even hours to get back results for queries. Hive must also be provided with a predefined schema to map files and directories into columns and it is not ACID compliant.
HBase queries are written in a custom language that needs to be learned. SQL-like functionality can be achieved via Apache Phoenix, though it comes at the price of maintaining a schema. Furthermore, HBase isn’t fully ACID compliant, although it does support certain properties. Last but not least - in order to run HBase, ZooKeeper is required - a server for distributed coordination such as configuration, maintenance, and naming.
# Use Cases
Hive should be used for analytical querying of data collected over a period of time - for instance, to calculate trends or summarize website logs. Hive should not be used for real-time querying since it could take a while before any results are returned.
HBase is a good fit for real-time querying of Big Data. Facebook uses it for messaging and real-time analytics. They may even be using it to count Facebook likes.
# Summary
Hive and HBase are two different Hadoop-based technologies - Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. But why not use them both? Just as Google can be used for search and Facebook for social networking, Hive can be used for analytical queries while HBase serves real-time querying. Data can even be read and written from Hive to HBase and back again.
MapReduce is just a computing framework; HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using other HBase APIs, such as the Java client, to put or fetch the data. But we use Hadoop, HBase, etc. to deal with gigantic amounts of data, so that doesn't make much sense: normal sequential programs are highly inefficient when the data is very large.
Coming back to the first part of the question, Hadoop is basically two things: a distributed file system (HDFS) plus a computation or processing framework (MapReduce). Like any other file system, HDFS provides storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). However, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture. It is a distributed, scalable, big data store, modelled after Google's BigTable, and it stores data as key/value pairs.
Coming to Hive: it provides data warehousing facilities on top of an existing Hadoop cluster, along with an SQL-like interface that makes your work easier if you are coming from an SQL background. You can create tables in Hive and store data there, and you can even map your existing HBase tables to Hive and operate on them.
Pig, meanwhile, is a dataflow language that lets us process enormous amounts of data easily and quickly. Pig has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them with the Pig interpreter. Pig makes life a lot easier, because writing raw MapReduce is not always easy; in some cases it can be a real pain.
Both Hive and Pig queries get converted into MapReduce jobs under the hood.
# Objective
To make the daily web log files collected from 350+ servers queryable through an SQL-like language
To replace the daily aggregation data generated through MySQL with Hive
To build custom reports through queries in Hive
# Architecture Options
I benchmarked the following options: 1. Hive+HDFS 2. Hive+HBase - queries were too slow, so this option was dropped
# Design
Daily log files were transported to HDFS
MR jobs parsed these log files and wrote output files to HDFS
Create Hive tables with partitions and locations pointing to the HDFS locations
Create Hive query scripts (call it HQL if you like, to distinguish it from SQL) that in turn ran MR jobs in the background and generated aggregation data
Put all these steps into an Oozie workflow - scheduled with a daily Oozie coordinator
# Summary
HBase is like a Map: if you know the key, you can get the value instantly. But if you want to know, say, how many integer keys in HBase are between 1000000 and 2000000, HBase alone is not suitable. If you have data that needs to be aggregated, rolled up, or analyzed across rows, consider Hive.
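To make that concrete: counting keys in a range means scanning every row in the range client-side, since there is no index-only count. A rough sketch, assuming an open Table handle and row keys stored as 8-byte big-endian longs:
Scan scan = new Scan(Bytes.toBytes(1000000L), Bytes.toBytes(2000000L)); // [start, stop)
scan.setFilter(new FirstKeyOnlyFilter()); // return only the first cell per row; we only need to count rows
long count = 0;
try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result r : scanner) {
        count++;
    }
}
System.out.println("keys between 1000000 and 2000000: " + count);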
#Real Time Processing
Apache Storm: stream data processing, rule application
HBase: datastore for serving the real-time dashboard
#Batch Processing
Hadoop: crunching huge chunks of data; building a 360-degree overview or adding context to events. Interfaces and frameworks like Pig, MR, Spark, Hive, and Shark help with the computation. This layer needs a scheduler, for which Oozie is a good option.
#Event Handling layer
Apache Kafka was the first layer, consuming high-velocity events from the sensors. Kafka serves both real-time and batch analytics data flows through LinkedIn connectors.
# Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
# There are four main modules in Hadoop:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
# Before going further, let's note that we have three different types of data.
Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL Server, etc.
Unstructured: Data does not have any structure and can be in any form - web server logs, e-mail, images, etc.
Semi-structured: Data is not strictly structured but has some structure, e.g. XML files.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
Pig: A high-level data-flow language and execution framework for parallel computation.
# Hive Vs PIG comparison can be found at this article and my other post at this SE question.
HBase won't replace MapReduce. HBase is a scalable distributed database and MapReduce is a programming model for distributed processing of data; MapReduce may act on data stored in HBase as part of that processing.
You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce
You can use Sqoop to import structured data from traditional RDBMS databases such as Oracle and SQL Server, and process it with Hadoop MapReduce
You can use Flume to ingest unstructured data and process it with Hadoop MapReduce
Hive should be used for analytical querying of data collected over a period of time, e.g. calculating trends or summarizing website logs, but it can't be used for real-time queries.
HBase fits real-time querying of Big Data. Facebook uses it for messaging and real-time analytics.
Pig can be used to construct dataflows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it, and store it into relational database systems. Good for ad-hoc analysis.
Hive can be used for ad-hoc data analysis, but unlike Pig it can't support all unstructured data formats.
hive>CREATE TABLE thanooj.hbase_docs(ID INT, name STRING, dt STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personal details:name,personal details:dt") TBLPROPERTIES ("hbase.table.name" = "rawdocs");
hive>CREATE TABLE thanooj.hbase_docs_raw (ID INT, name STRING, dt STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
hive>LOAD DATA LOCAL INPATH '/home/ubuntu/input/raw.txt' OVERWRITE INTO TABLE thanooj.hbase_docs_raw;
hive>INSERT OVERWRITE TABLE thanooj.hbase_docs SELECT * FROM thanooj.hbase_docs_raw;
hive> select * from hbase_docs;
OK
1 sriram 2015-10-12
2 seeta 2015-09-12
3 lakshman 2015-11-12
Time taken: 0.377 seconds, Fetched: 3 row(s)
hbase(main):003:0> scan 'rawdocs'
ROW COLUMN+CELL
1 column=personal details:dt, timestamp=1454939129933, value=2015-10-12
1 column=personal details:name, timestamp=1454939129933, value=sriram
2 column=personal details:dt, timestamp=1454939129933, value=2015-09-12
2 column=personal details:name, timestamp=1454939129933, value=seeta
3 column=personal details:dt, timestamp=1454939129933, value=2015-11-12
3 column=personal details:name, timestamp=1454939129933, value=lakshman
3 row(s) in 0.3250 seconds
ubuntu@ubuntu:~$ hadoop fs -ls /home/
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:15 /home/ubuntu
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:15 /home/ubuntu/softwares
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:15 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:15 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:30 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db
Found 3 items
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:30 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/_tmp.hbase_docs
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:45 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs
drwxr-xr-x - ubuntu supergroup 0 2016-02-08 05:29 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs_raw
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs_raw;
Found 1 items
-rwxr-xr-x 1 ubuntu supergroup 61 2016-02-08 05:29 /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs_raw/raw.txt
ubuntu@ubuntu:~$ hadoop fs -ls /home/ubuntu/softwares/apache-hive-2.1.0-SNAPSHOT-bin/warehouse/thanooj.db/hbase_docs
hive> CREATE EXTERNAL TABLE thanooj.hbase_docs_09(ID INT, name STRING, dt STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personal details:name,personal details:dt") TBLPROPERTIES ("hbase.table.name" = "rawdocs");
OK
Time taken: 1.669 seconds
hive> select * from hbase_docs_09;
OK
1 sriram 2015-10-12
2 seeta 2015-09-12
3 lakshman 2015-11-12
Time taken: 2.514 seconds, Fetched: 3 row(s)
hive>
Reference: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
Inserting large amounts of data may be slow due to WAL overhead; if you would like to disable this, make sure you have HIVE-1383 (as of Hive 0.6), and then issue this command before the INSERT:
set hive.hbase.wal.enabled=false;
Warning: disabling WAL may lead to data loss if an HBase failure occurs, so only use this if you have some other recovery strategy available.
If you want to give Hive access to an existing HBase table, use CREATE EXTERNAL TABLE:
CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
TBLPROPERTIES("hbase.table.name" = "some_existing_table");
Again, hbase.columns.mapping is required (and will be validated against the existing HBase table's column families), whereas hbase.table.name is optional.
hbase-site.xml
--------------
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:54310/hbase</value>
  </property>
  <!--
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/ubuntu</value>
  </property>
  -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>