saptak / hello.md (last active November 5, 2015 11:29)

Introduction

In this tutorial we will analyze geolocation and truck data. We will import this data into HDFS and build derived tables in Hive, then process the data using Pig and Hive. Finally, we will import the processed data into Microsoft Excel, where it can be visualized.
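
As a sketch of the import step, data copied into HDFS can be exposed to Hive as an external table. The table name, columns, and HDFS path below are illustrative placeholders, not the tutorial's actual schema:

    CREATE EXTERNAL TABLE geolocation_raw (
        truckid   STRING,
        event     STRING,
        latitude  DOUBLE,
        longitude DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hdfs/geolocation';  -- hypothetical HDFS directory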

Prerequisite:

Goals of the Tutorial

Tutorial Overview

In this tutorial you will gain a working knowledge of Pig through hands-on experience creating Pig scripts to carry out essential data operations and tasks.

We will first read in two data files that contain New York Stock Exchange dividend prices and stock prices, and then use these files to perform a number of Pig operations, including the following (a short sketch of these operations appears after the list):

  • Define a relation with and without a schema
  • Define a new relation from an existing relation
  • Select specific columns from within a relation
  • Join two relations
  • Sort the data using ‘ORDER BY’
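
A minimal Pig Latin sketch of these operations might look like the following; the file names, column names, and types are assumptions for illustration rather than the tutorial's exact data:

    -- define a relation without a schema
    stocks_raw = LOAD 'NYSE_daily_prices.csv' USING PigStorage(',');

    -- define a relation with a schema
    stocks = LOAD 'NYSE_daily_prices.csv' USING PigStorage(',')
        AS (exchange:chararray, symbol:chararray, trade_date:chararray, closing:float);
    dividends = LOAD 'NYSE_dividends.csv' USING PigStorage(',')
        AS (exchange:chararray, symbol:chararray, trade_date:chararray, dividend:float);

    -- define a new relation from an existing relation, selecting specific columns
    close_prices = FOREACH stocks GENERATE symbol, trade_date, closing;

    -- join two relations on symbol and date
    joined = JOIN close_prices BY (symbol, trade_date), dividends BY (symbol, trade_date);

    -- sort the data using ORDER BY (fields carry a relation prefix after a join)
    sorted_prices = ORDER joined BY close_prices::symbol ASC;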

Introduction

This tutorial describes how to refine data for a Trucking IoT Data Discovery (aka IoT Discovery) use case using the Hortonworks Data Platform. The IoT Discovery use case involves vehicles, devices, and people moving across a map or similar surface. The analysis is interested in tying together location information with the analytic data.

Hello World is often used by developers to familiarize themselves with new concepts by building a simple program. This tutorial aims to achieve a similar purpose by getting practitioners started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application.

For our tutorial we are looking at a use case where we have a truck fleet. Each truck has been equipped to log location and event data. These events are streamed back to a datacenter, where we will process the data. The company wants to use this data to better understand risk.

What is Pig?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete, in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages such as JRuby, Jython, and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

A good example of a Pig application is the ETL transaction model, which describes how a process will extract data from a source, transform it according to a rule set, and then load it into a datastore. Pig can ingest data from files, streams, or other sources using User Defined Functions (UDF). Once it has the data, it can perform selects, iterations, and other transforms over it. Again, the UDF feature allows passing the data to more complex algorithms for the transform step, and Pig can store the results back into the Hadoop Distributed File System (HDFS).
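
As a rough illustration of that ETL pattern in Pig Latin (the file paths, schema, and transform rule are assumptions for this sketch):

    -- extract: read raw events from a source file
    raw = LOAD 'events/raw_events.csv' USING PigStorage(',')
        AS (id:chararray, event:chararray, value:float);

    -- transform: apply a rule set to clean and reshape the data
    valid_events = FILTER raw BY value IS NOT NULL;
    shaped = FOREACH valid_events GENERATE id, UPPER(event) AS event, value;

    -- load: store the result into HDFS for downstream use
    STORE shaped INTO 'events/cleaned' USING PigStorage(',');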

Data processing with Hive

Hive is a component of Hortonworks Data Platform (HDP). Hive provides a SQL-like interface to data stored in HDP. In the previous tutorial we used Pig, which is a scripting language with a focus on dataflows. Hive provides a database query interface to Apache Hadoop.

People often ask why Pig and Hive both exist when they seem to do much the same thing. Hive, because of its SQL-like query language, is often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are used to using SQL for querying data. Pig fits in through its data flow strengths, where it takes on the tasks of bringing data into Apache Hadoop and working it into the form needed for querying. A good overview of how this works is in Alan Gates' post on the Yahoo Developer blog titled Pig and Hive at Yahoo! From a technical point of view, both Pig and Hive are feature complete, so you can do tasks in either tool.
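
To make the contrast concrete, this is the kind of SQL-like statement Hive accepts; the table and column names here are hypothetical:

    -- top ten symbols by average closing price
    SELECT symbol, AVG(closing_price) AS avg_close
    FROM stocks               -- a hypothetical Hive table over files in HDFS
    GROUP BY symbol
    ORDER BY avg_close DESC
    LIMIT 10;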

Overview

In this tutorial we will walk through the process of:

  • cleaning and aggregating 10 years of raw stock ticker data from NYSE (a sketch of this step follows the list)
  • enriching the data model by looking up additional attributes from Wikipedia
  • creating an interactive visualization on the model
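
A sketch of the kind of Hive query the cleaning and aggregation step involves; the table and column names are placeholders rather than the tutorial's actual model:

    -- drop bad rows and aggregate per symbol per year
    SELECT symbol,
           year(trade_date)   AS trade_year,
           AVG(closing_price) AS avg_close,
           MAX(high_price)    AS yearly_high
    FROM nyse_ticker_raw       -- hypothetical raw ticker table
    WHERE closing_price IS NOT NULL
    GROUP BY symbol, year(trade_date);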

Prerequisites:

In this tutorial, we’ll focus on taking advantage of improvements to Apache Hive and Apache Tez through the work completed by the community as part of the Stinger initiative. 

In this tutorial, we are going to cover the following (a sketch of the related Hive settings appears after the list):

  • Performance improvements of Hive on Tez
  • Performance improvements of vectorized query execution
  • Cost-based Optimization Plans
  • Multi-tenancy with HiveServer2
  • SQL Compliance Improvements
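
Several of these features are switched on through Hive configuration properties. A sketch of the relevant settings as they might be issued from a Hive session (values shown are illustrative):

    -- run queries on the Tez execution engine instead of MapReduce
    set hive.execution.engine=tez;

    -- enable vectorized query execution
    set hive.vectorized.execution.enabled=true;

    -- enable the cost-based optimizer (requires table/column statistics)
    set hive.cbo.enable=true;
    set hive.compute.query.using.stats=true;
    set hive.stats.fetch.column.stats=true;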


saptak / frequency.py (last active October 5, 2015 16:37)
Twitter Sentiment Python

    import sys
    import json

    def main():
        tweet_file = open(sys.argv[1])  # tweets: one JSON object per line
        terms_freq = {}
        totterm = 0.0
        for line in tweet_file:
            tweet = json.loads(line)
            # The gist is truncated here; a plausible continuation (an
            # assumption) tallies each whitespace-delimited term in the text.
            for term in tweet.get('text', '').split():
                terms_freq[term] = terms_freq.get(term, 0) + 1
                totterm += 1
        for term, count in terms_freq.items():
            print(term, count / totterm)  # relative frequency of each term

    if __name__ == '__main__':
        main()

saptak / indexing-documents-with-apache-sol.md (last active October 1, 2015 16:01)
Indexing documents with Apache Solr

In this tutorial, we will learn to:

  • Configure Solr to store indexes in HDFS (see the configuration sketch after this list)
  • Create a Solr cluster of two Solr instances running on ports 8983 and 8984
  • Index documents in HDFS using the Hadoop connectors
  • Use Solr to search documents
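
For the first item, storing indexes in HDFS is configured in solrconfig.xml by swapping in the HDFS directory factory. A minimal sketch (the HDFS URI and path are placeholders, not values from the tutorial):

    <!-- solrconfig.xml: keep the index in HDFS instead of on local disk -->
    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/user/solr</str> <!-- placeholder URI -->
    </directoryFactory>

    <!-- use the HDFS-aware lock type together with the factory above -->
    <lockType>hdfs</lockType>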

Prerequisites: