Hive is designed to enable easy data summarization and ad-hoc analysis of large volumes of data. It uses a SQL-like query language called HiveQL.
In this tutorial, we will explore the following:
- Load a data file into a Hive table
- Create a table using the RCFile format
- Query tables
- Managed tables vs. external tables
- The ORC file format
- Partitioning a table
- Bucketing a table
You need a working HDP cluster – the easiest way to get one is to download the Hortonworks Sandbox.
First of all, download the data file from here and name it Twitterdata.txt.
As the file is small, you can also simply open it, copy its contents, and create a local file in the sandbox manually.
We will use the Ambari UI here.
Open http://localhost:8080 in your browser.
Now click on the HDFS Files button in the off-canvas menu at the top, and you will see the following screen.
Navigate to the /tmp folder.
Now click on the Upload option and select the file Twitterdata.txt from your computer.
Here is a sample syntax to create a table and load a data file into it.
Let’s create this table and load the data. For this we can use either Ambari or the command line.
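If you prefer the command line, the upload step can be sketched as follows (a sketch, assuming you are logged in to the sandbox shell and Twitterdata.txt is in your current directory; the target path matches the Ambari steps above):

```shell
# Copy the local file into HDFS
hdfs dfs -put Twitterdata.txt /tmp/Twitterdata.txt

# Verify it arrived
hdfs dfs -ls /tmp
```

From there, the CREATE TABLE and LOAD DATA statements below can be run from the Hive CLI instead of the Hive view.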
Open the Hive view by clicking the Hive button in the views menu, and type the following query into the composition area:
CREATE TABLE TwitterExampletextexample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/Twitterdata.txt' OVERWRITE INTO TABLE TwitterExampletextexample;
To query sample data from the table you just created, click the icon next to the table name; it executes a SELECT query.
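Clicking the icon is equivalent to running a simple SELECT yourself; a sketch (the row limit here is an assumption – the view may use a different one):

```
SELECT * FROM TwitterExampletextexample LIMIT 10;
```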
The Record Columnar File (RCFile) format determines how relational tables are stored on distributed clusters. With this format, you get the advantages of a column-oriented layout over a row-oriented layout of a record.
Here is a sample Create RC file format table syntax:
CREATE TABLE TwitterExampleRCtable(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING COMMENT 'Location of user',
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS RCFILE;
Here is how to load data into the RCFile table. Please execute it and see the results.
INSERT OVERWRITE TABLE TwitterExampleRCtable select * from TwitterExampletextexample;
Run the sample select query again.
Let’s find the top 10 locations with the most tweets, using TwitterExampleRCtable.
SELECT profileLocation, COUNT(txt) AS count1
FROM TwitterExampleRCtable
GROUP BY profileLocation
ORDER BY count1 DESC LIMIT 10;
Managed tables are what CREATE TABLE statements produce by default. External tables are used when you want a table to point to data files that stay in place; the location you point to must be a directory.
Here is the syntax for creating these tables.
Managed:
CREATE TABLE ManagedExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
For the external table we need to create a new directory /tmp/admin and upload Twitterdata.txt there.
Before we create the table and load the data into it, we have to change the permissions on the /tmp/admin folder.
Open the HDFS view and navigate to /tmp.
Right-click on admin and select Permissions.
Now check the Write boxes, select Modify recursively, and press Save.
Verify that the permissions look now like this:
External:
CREATE EXTERNAL TABLE IF NOT EXISTS ExternalExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/tmp/admin';
Also, when you drop a managed table, Hive deletes both the metadata and the data.
When you drop an external table, Hive deletes only the metadata.
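A quick sketch of the difference, using the two tables created in this section (note that dropping them really removes them, so recreate them if you want to keep following along):

```
-- Removes the metadata AND deletes the data files under /apps/hive/warehouse
DROP TABLE ManagedExample;

-- Removes only the metadata; /tmp/admin/Twitterdata.txt stays in HDFS
DROP TABLE ExternalExample;
```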
When you create a managed table, the file you load is moved to /apps/hive/warehouse, which means the data is controlled by Hive.
The external table, by contrast, points to the /tmp/admin directory into which we put Twitterdata.txt. If you run the sample query, you should see the data from this file.
As a next step, you could describe the tables as below and compare the output for managed vs. external tables.
DESCRIBE FORMATTED ManagedExample;
DESCRIBE FORMATTED ExternalExample;
The Optimized Row Columnar (ORC) file format is used because it compresses data files further. This can cost a little write performance, but yields a large gain in read performance.
Let’s try it out. Note in the statement below that the table is stored as ORC.
CREATE TABLE ORCFileFormatExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING COMMENT 'Location of user',
favc INT, retweet STRING, retcount INT, followerscount INT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
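The statement above creates an empty table; to populate it you can copy rows from the text table, just as we did for the RCFile table (a sketch, assuming TwitterExampletextexample already holds the data):

```
INSERT OVERWRITE TABLE ORCFileFormatExample
SELECT * FROM TwitterExampletextexample;
```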
Partitions are horizontal slices of data that allow large data sets to be segmented into more manageable blocks. Here is the sample syntax to create a partitioned table and load data into a partition.
CREATE TABLE PARTITIONEDExample(
tweetId BIGINT, username STRING, txt STRING, favc BIGINT, retweet STRING,
retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
PARTITIONED BY(CreatedAt STRING, profileLocation STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE PARTITIONEDExample
PARTITION (CreatedAt="26 04:50:56 UTC 2014",profileLocation="Chicago")
SELECT tweetId,username,txt,favc,retweet,retcount,followerscount
FROM twitterexampletextexample
where profileLocation='Chicago' limit 100;
Let's execute a sample query to check that the partition was loaded correctly.
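Two ways to check (a sketch; the partition values match the INSERT statement above):

```
-- List the partitions Hive created
SHOW PARTITIONS PARTITIONEDExample;

-- Read back only the partition we loaded
SELECT tweetId, username, txt
FROM PARTITIONEDExample
WHERE CreatedAt = '26 04:50:56 UTC 2014' AND profileLocation = 'Chicago'
LIMIT 10;
```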
Bucketing is a technique that allows you to cluster or segment large data sets to optimize query performance.
Here is an example for creating a table with buckets and load data into it.
CREATE TABLE BucketingExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING, favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
PARTITIONED BY( profileLocation STRING)
CLUSTERED BY(tweetId) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE BucketingExample PARTITION (profileLocation="Chicago")
SELECT tweetId,username,txt,CreatedAt,favc,retweet,retcount,followerscount
FROM twitterexampletextexample
where profileLocation='Chicago' limit 100;
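One payoff of bucketing is efficient sampling: TABLESAMPLE can read a single bucket instead of scanning the whole table. A sketch against the table created above:

```
-- Read only the first of the two buckets
SELECT tweetId, username
FROM BucketingExample TABLESAMPLE(BUCKET 1 OUT OF 2 ON tweetId)
LIMIT 10;
```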
Run a sample query against the bucketed table to verify the results.
You can browse the HDFS folders to see the directory structure behind the Hive tables you have just created. This can help you design your tables and file distributions, which is very important when designing your warehouse.
We hope this was helpful and simple enough to give you a glimpse of the Hive world.