saptak / hello.md (last active November 5, 2015 11:29)

Introduction

In this tutorial we will analyze geolocation and truck data. We will import this data into HDFS and build derived tables in Hive, then process the data using Pig and Hive. Finally, we will import the processed data into Microsoft Excel, where it can be visualized.
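
As a sketch of the import step, data copied into HDFS can be exposed to Hive as an external table. The table name, columns, and HDFS path below are illustrative placeholders, not the tutorial's actual schema:

    CREATE EXTERNAL TABLE geolocation_raw (
        truckid   STRING,
        event     STRING,
        latitude  DOUBLE,
        longitude DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hdfs/geolocation';  -- hypothetical HDFS directory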

Prerequisite:

Goals of the Tutorial

Tutorial Overview

In this tutorial you will gain a working knowledge of Pig through hands-on experience creating Pig scripts to carry out essential data operations and tasks.

We will first read in two data files that contain New York Stock Exchange dividend prices and stock prices, and then use these files to perform a number of Pig operations, including the following (a short sketch of these operations appears after the list):

  • Define a relation with and without a schema
  • Define a new relation from an existing relation
  • Select specific columns from within a relation
  • Join two relations
  • Sort the data using ‘ORDER BY’
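
A minimal Pig Latin sketch of these operations might look like the following; the file names, column names, and types are assumptions for illustration rather than the tutorial's exact data:

    -- define a relation without a schema
    stocks_raw = LOAD 'NYSE_daily_prices.csv' USING PigStorage(',');

    -- define a relation with a schema
    stocks = LOAD 'NYSE_daily_prices.csv' USING PigStorage(',')
        AS (exchange:chararray, symbol:chararray, trade_date:chararray, closing:float);
    dividends = LOAD 'NYSE_dividends.csv' USING PigStorage(',')
        AS (exchange:chararray, symbol:chararray, trade_date:chararray, dividend:float);

    -- define a new relation from an existing relation, selecting specific columns
    close_prices = FOREACH stocks GENERATE symbol, trade_date, closing;

    -- join two relations on symbol and date
    joined = JOIN close_prices BY (symbol, trade_date), dividends BY (symbol, trade_date);

    -- sort the data using ORDER BY (fields carry a relation prefix after a join)
    sorted_prices = ORDER joined BY close_prices::symbol ASC;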

Introduction

This tutorial describes how to refine data for a Trucking IoT Data Discovery (aka IoT Discovery) use case using the Hortonworks Data Platform. The IoT Discovery use case involves vehicles, devices, and people moving across a map or similar surface. The analysis is interested in tying together location information with the analytic data.

Hello World is often used by developers to familiarize themselves with new concepts by building a simple program. This tutorial aims to achieve a similar purpose by getting practitioners started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application.

For our tutorial we are looking at a use case where we have a truck fleet. Each truck has been equipped to log location and event data. These events are streamed back to a datacenter, where we will process the data. The company wants to use this data to better understand risk.

What is Pig?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete, in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages such as JRuby, Jython, and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

A good example of a Pig application is the ETL transaction model, which describes how a process will extract data from a source, transform it according to a rule set, and then load it into a datastore. Pig can ingest data from files, streams, or other sources using User Defined Functions (UDF). Once it has the data, it can perform selects, iterations, and other transforms over it. Again, the UDF feature allows passing the data to more complex algorithms for the transform step, and Pig can store the results back into the Hadoop Distributed File System (HDFS).
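
As a rough illustration of that ETL pattern in Pig Latin (the file paths, schema, and transform rule are assumptions for this sketch):

    -- extract: read raw events from a source file
    raw = LOAD 'events/raw_events.csv' USING PigStorage(',')
        AS (id:chararray, event:chararray, value:float);

    -- transform: apply a rule set to clean and reshape the data
    valid_events = FILTER raw BY value IS NOT NULL;
    shaped = FOREACH valid_events GENERATE id, UPPER(event) AS event, value;

    -- load: store the result into HDFS for downstream use
    STORE shaped INTO 'events/cleaned' USING PigStorage(',');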

Data processing with Hive

Hive is a component of Hortonworks Data Platform (HDP). Hive provides a SQL-like interface to data stored in HDP. In the previous tutorial we used Pig, which is a scripting language with a focus on dataflows. Hive provides a database query interface to Apache Hadoop.

People often ask why Pig and Hive both exist when they seem to do much the same thing. Hive, because of its SQL-like query language, is often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are used to using SQL for querying data. Pig fits in through its data flow strengths, where it takes on the tasks of bringing data into Apache Hadoop and working it into the form needed for querying. A good overview of how this works is in Alan Gates' post on the Yahoo Developer blog titled Pig and Hive at Yahoo! From a technical point of view, both Pig and Hive are feature complete, so you can do tasks in either tool.
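
To make the contrast concrete, this is the kind of SQL-like statement Hive accepts; the table and column names here are hypothetical:

    -- top ten symbols by average closing price
    SELECT symbol, AVG(closing_price) AS avg_close
    FROM stocks               -- a hypothetical Hive table over files in HDFS
    GROUP BY symbol
    ORDER BY avg_close DESC
    LIMIT 10;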

Overview

In this tutorial we will walk through the process of:

  • cleaning and aggregating 10 years of raw stock ticker data from NYSE (a sketch of this step follows the list)
  • enriching the data model by looking up additional attributes from Wikipedia
  • creating an interactive visualization on the model
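
A sketch of the kind of Hive query the cleaning and aggregation step involves; the table and column names are placeholders rather than the tutorial's actual model:

    -- drop bad rows and aggregate per symbol per year
    SELECT symbol,
           year(trade_date)   AS trade_year,
           AVG(closing_price) AS avg_close,
           MAX(high_price)    AS yearly_high
    FROM nyse_ticker_raw       -- hypothetical raw ticker table
    WHERE closing_price IS NOT NULL
    GROUP BY symbol, year(trade_date);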

Prerequisites:

In this tutorial, we’ll focus on taking advantage of improvements to Apache Hive and Apache Tez through the work completed by the community as part of the Stinger initiative. 

In this tutorial, we are going to cover the following (a sketch of the related Hive settings appears after the list):

  • Performance improvements of Hive on Tez
  • Performance improvements of vectorized query execution
  • Cost-based Optimization Plans
  • Multi-tenancy with HiveServer2
  • SQL Compliance Improvements
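
Several of these features are switched on through Hive configuration properties. A sketch of the relevant settings as they might be issued from a Hive session (values shown are illustrative):

    -- run queries on the Tez execution engine instead of MapReduce
    set hive.execution.engine=tez;

    -- enable vectorized query execution
    set hive.vectorized.execution.enabled=true;

    -- enable the cost-based optimizer (requires table/column statistics)
    set hive.cbo.enable=true;
    set hive.compute.query.using.stats=true;
    set hive.stats.fetch.column.stats=true;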


saptak / frequency.py (last active October 5, 2015 16:37)
Twitter Sentiment Python

    import sys
    import json

    def main():
        tweet_file = open(sys.argv[1])  # tweets: one JSON object per line
        terms_freq = {}
        totterm = 0.0
        for line in tweet_file:
            tweet = json.loads(line)
            # The gist is truncated here; a plausible continuation (an
            # assumption) tallies each whitespace-delimited term in the text.
            for term in tweet.get('text', '').split():
                terms_freq[term] = terms_freq.get(term, 0) + 1
                totterm += 1
        for term, count in terms_freq.items():
            print(term, count / totterm)  # relative frequency of each term

    if __name__ == '__main__':
        main()

saptak / indexing-documents-with-apache-sol.md (last active October 1, 2015 16:01)
Indexing documents with Apache Solr

In this tutorial, we will learn to:

  • Configure Solr to store indexes in HDFS (see the configuration sketch after this list)
  • Create a Solr cluster of two Solr instances running on ports 8983 and 8984
  • Index documents in HDFS using the Hadoop connectors
  • Use Solr to search documents
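
For the first item, storing indexes in HDFS is configured in solrconfig.xml by swapping in the HDFS directory factory. A minimal sketch (the HDFS URI and path are placeholders, not values from the tutorial):

    <!-- solrconfig.xml: keep the index in HDFS instead of on local disk -->
    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/user/solr</str> <!-- placeholder URI -->
    </directoryFactory>

    <!-- use the HDFS-aware lock type together with the factory above -->
    <lockType>hdfs</lockType>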

Prerequisites: