In this tutorial, we’ll focus on taking advantage of improvements to Apache Hive and Apache Tez through the work completed by the community as part of the Stinger initiative. 

In this tutorial, we are going to cover:

  • Performance improvements of Hive on Tez (see the JDBC sketch after this list)
  • Performance improvements of vectorized query execution
  • Cost-based optimization of query plans
  • Multi-tenancy with HiveServer2
  • SQL compliance improvements
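
To make the first three items concrete, here is a minimal sketch, assuming a HiveServer2 endpoint on a sandbox host and the sample_07 table that ships with the HDP sandbox, of enabling the Stinger-era settings on a JDBC session. The host, port, and credentials are placeholders, not values from this tutorial.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTezSession {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver, bundled with the Hive client libraries.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder endpoint; HiveServer2 listens on port 10000 by default.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://sandbox:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        // The Stinger-era switches: Tez engine, vectorization, cost-based optimizer.
        stmt.execute("SET hive.execution.engine=tez");
        stmt.execute("SET hive.vectorized.execution.enabled=true");
        stmt.execute("SET hive.cbo.enable=true");
        // Any query issued on this session now runs with the settings above.
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sample_07");
        while (rs.next()) {
            System.out.println("rows: " + rs.getLong(1));
        }
        conn.close();
    }
}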

Overview

In this tutorial we will walk through the process of:

  • cleaning and aggregating 10 years of raw stock ticker data from NYSE (a sample aggregation is sketched after this list)
  • enriching the data model by looking up additional attributes from Wikipedia
  • creating an interactive visualization on the model
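
As an illustration of the cleaning-and-aggregating step, the sketch below computes per-symbol yearly statistics over JDBC. The stock_prices table and its symbol, trade_date, and close_price columns are hypothetical placeholders; the tutorial's actual schema may differ.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StockAggregation {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://sandbox:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        // Hypothetical schema: one row per symbol per trading day;
        // trade_date is assumed to be a Hive DATE (or 'yyyy-MM-dd' string).
        ResultSet rs = stmt.executeQuery(
                "SELECT symbol, year(trade_date) AS yr, "
                + "MAX(close_price) AS high, MIN(close_price) AS low, "
                + "AVG(close_price) AS avg_close "
                + "FROM stock_prices "
                + "WHERE close_price > 0 "   // drop bad ticks: a stand-in cleaning rule
                + "GROUP BY symbol, year(trade_date)");
        while (rs.next()) {
            System.out.printf("%s %d high=%.2f low=%.2f avg=%.2f%n",
                    rs.getString(1), rs.getInt(2),
                    rs.getDouble(3), rs.getDouble(4), rs.getDouble(5));
        }
        conn.close();
    }
}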

Prerequisites:

Data processing with Hive

Hive is a component of the Hortonworks Data Platform (HDP) that provides a SQL-like interface to data stored in HDP; in effect, it is a database query interface to Apache Hadoop. In the previous tutorial we used Pig, which is a scripting language with a focus on dataflows.

People often ask why Pig and Hive both exist when they seem to do much of the same thing. Because of its SQL-like query language, Hive is often used as the interface to an Apache Hadoop based data warehouse, and it is considered friendlier and more familiar to users who are used to querying data with SQL. Pig fits in through its data-flow strengths: it takes on the tasks of bringing data into Apache Hadoop and working it into the form needed for querying. A good overview of how this works is in Alan Gates' posting on the Yahoo Developer blog titled Pig and Hive at Yahoo! From a technical point of view, both Pig and Hive are feature complete.

What is Pig?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages like JRuby, Jython, and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
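
As a minimal sketch of the Java side of the UDF facility, the hypothetical function below upper-cases a chararray field; it mirrors the canonical example from the Pig documentation rather than anything specific to this tutorial.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial eval function: Pig calls exec() once per input tuple.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once the jar is REGISTERed in a Pig script, the function can be invoked like a built-in, e.g. b = FOREACH a GENERATE UpperCase(name); (assuming the class sits in the default package).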

A good example of a Pig application is the ETL transaction model, which describes how a process extracts data from a source, transforms it according to a rule set, and then loads it into a datastore. Pig can ingest data from files, streams, or other sources using User Defined Functions (UDFs). Once it has the data, it can perform select, iteration, and other transforms over it. Again, the UDF feature allows passing the data to more complex algorithms for the transform, and Pig can finally store the results back into HDFS.
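
To make that extract-transform-load shape concrete, here is a minimal sketch that drives Pig from Java through the PigServer API. The input path, schema, and filter rule are hypothetical placeholders.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlSketch {
    public static void main(String[] args) throws Exception {
        // Submit to the cluster; ExecType.LOCAL is handy for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Extract: load hypothetical comma-separated ticker data from HDFS.
        pig.registerQuery("raw = LOAD '/data/ticker.csv' USING PigStorage(',') "
                + "AS (symbol:chararray, trade_date:chararray, close_price:double);");
        // Transform: a trivial cleaning rule standing in for a real rule set.
        pig.registerQuery("clean = FILTER raw BY close_price > 0.0;");
        // Load: write the cleaned relation back to HDFS.
        pig.store("clean", "/data/ticker_clean");
    }
}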


Introduction

In this tutorial we will be analyzing geolocation and truck data. We will import this data into HDFS, build derived tables in Hive, and then process the data using Pig and Hive. The processed data is then imported into Microsoft Excel, where it can be visualized.
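
As a minimal sketch of the import step, assuming a local geolocation.csv and a hypothetical target directory, the Hadoop FileSystem API can push the file into HDFS (the hdfs dfs -put command does the same from a shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical local file and HDFS target path.
        fs.copyFromLocalFile(new Path("geolocation.csv"),
                new Path("/user/hive/geolocation/geolocation.csv"));
        fs.close();
    }
}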

Prerequisite:

Goals of the Tutorial

In this section we are going to walk through the process of using Apache Zeppelin and Apache Spark to interactively analyze data on an Apache Hadoop cluster.

By the end of this tutorial, you will have learned:

  1. How to interact with Apache Spark from Apache Zeppelin
  2. How to read a text file from HDFS and create an RDD (a minimal sketch follows this list)
  3. How to interactively analyze a data set through a rich set of Spark API operations
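
In Zeppelin these steps are typically typed into a notebook paragraph; as a minimal standalone sketch of the same Spark API, assuming a hypothetical HDFS path:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHdfsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ZeppelinStyleAnalysis");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Read a text file from HDFS into an RDD; the path is a placeholder.
        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/example.txt");
        // Two of the interactive-style operations: filter, then count.
        long nonEmpty = lines.filter(line -> !line.trim().isEmpty()).count();
        System.out.println("Non-empty lines: " + nonEmpty);
        sc.stop();
    }
}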

Getting started

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public class WeatherDataParser {
    /**
     * Given a string of the form returned by the api call:
     * http://api.openweathermap.org/data/2.5/forecast/daily?q=94043&mode=json&units=metric&cnt=7
     * retrieve the maximum temperature for the day indicated by dayIndex.
     */
    public static double getMaxTemperatureForDay(String weatherJsonStr, int dayIndex) throws JSONException {
        // Assumes OpenWeatherMap's daily layout: a "list" array of days, each with a "temp" object.
        JSONObject forecast = new JSONObject(weatherJsonStr);
        JSONArray days = forecast.getJSONArray("list");
        return days.getJSONObject(dayIndex).getJSONObject("temp").getDouble("max");
    }
}
Announcing the release of the Dell EMC Ready Bundle for Hortonworks Hadoop

Dell EMC and Hortonworks bring together industry-leading solutions for enterprise-ready open data platforms and modern data applications. Together they help customers modernize, automate, and transform how they deliver IT services to critical business applications, while realizing cost savings that free them to invest in the new technologies, methodologies, and skills needed to succeed in the emerging digital economy. Empower your organization with deeper insights and enhanced data-driven decision making by using the right infrastructure for the right data. With solutions that integrate, store, manage, and protect your data, you can rapidly deploy Big Data analytics applications or start to develop your own.

As a Select member of the Dell EMC Technology Connect Partner Program, Dell EMC is able to resell Hortonworks Data Platform (HDP™), giving customers a simple way to procure Open Enterprise Hadoop as a complementary component of their data architectures and to enable a broad range of new applications.