In this tutorial, we’ll focus on taking advantage of improvements to Apache Hive and Apache Tez through the work completed by the community as part of the Stinger initiative. 

In this tutorial, we are going to cover:

  • Performance improvements of Hive on Tez (see the JDBC sketch after this list)
  • Performance improvements of vectorized query execution
  • Cost-based optimization of query plans
  • Multi-tenancy with HiveServer2
  • SQL compliance improvements
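
To make the first three items concrete, here is a minimal sketch, assuming a HiveServer2 endpoint on a sandbox host and the sample_07 table that ships with the HDP sandbox, of enabling the Stinger-era settings on a JDBC session. The host, port, and credentials are placeholders, not values from this tutorial.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTezSession {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver, bundled with the Hive client libraries.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder endpoint; HiveServer2 listens on port 10000 by default.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://sandbox:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        // The Stinger-era switches: Tez engine, vectorization, cost-based optimizer.
        stmt.execute("SET hive.execution.engine=tez");
        stmt.execute("SET hive.vectorized.execution.enabled=true");
        stmt.execute("SET hive.cbo.enable=true");
        // Any query issued on this session now runs with the settings above.
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sample_07");
        while (rs.next()) {
            System.out.println("rows: " + rs.getLong(1));
        }
        conn.close();
    }
}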

Overview

In this tutorial we will walk through the process of:

  • cleaning and aggregating 10 years of raw stock ticker data from NYSE (a sample aggregation is sketched after this list)
  • enriching the data model by looking up additional attributes from Wikipedia
  • creating an interactive visualization on the model
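
As an illustration of the cleaning-and-aggregating step, the sketch below computes per-symbol yearly statistics over JDBC. The stock_prices table and its symbol, trade_date, and close_price columns are hypothetical placeholders; the tutorial's actual schema may differ.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StockAggregation {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://sandbox:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        // Hypothetical schema: one row per symbol per trading day;
        // trade_date is assumed to be a Hive DATE (or 'yyyy-MM-dd' string).
        ResultSet rs = stmt.executeQuery(
                "SELECT symbol, year(trade_date) AS yr, "
                + "MAX(close_price) AS high, MIN(close_price) AS low, "
                + "AVG(close_price) AS avg_close "
                + "FROM stock_prices "
                + "WHERE close_price > 0 "   // drop bad ticks: a stand-in cleaning rule
                + "GROUP BY symbol, year(trade_date)");
        while (rs.next()) {
            System.out.printf("%s %d high=%.2f low=%.2f avg=%.2f%n",
                    rs.getString(1), rs.getInt(2),
                    rs.getDouble(3), rs.getDouble(4), rs.getDouble(5));
        }
        conn.close();
    }
}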

Prerequisites:

Data processing with Hive

Hive is a component of the Hortonworks Data Platform (HDP) that provides a SQL-like interface to data stored in HDP; in effect, it is a database query interface to Apache Hadoop. In the previous tutorial we used Pig, which is a scripting language with a focus on dataflows.

People often ask why Pig and Hive both exist when they seem to do much of the same thing. Because of its SQL-like query language, Hive is often used as the interface to an Apache Hadoop based data warehouse, and it is considered friendlier and more familiar to users who are used to querying data with SQL. Pig fits in through its data-flow strengths: it takes on the tasks of bringing data into Apache Hadoop and working it into the form needed for querying. A good overview of how this works is in Alan Gates' posting on the Yahoo Developer blog titled Pig and Hive at Yahoo! From a technical point of view, both Pig and Hive are feature complete.

What is Pig?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages like JRuby, Jython, and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
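
As a minimal sketch of the Java side of the UDF facility, the hypothetical function below upper-cases a chararray field; it mirrors the canonical example from the Pig documentation rather than anything specific to this tutorial.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial eval function: Pig calls exec() once per input tuple.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once the jar is REGISTERed in a Pig script, the function can be invoked like a built-in, e.g. b = FOREACH a GENERATE UpperCase(name); (assuming the class sits in the default package).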

A good example of a Pig application is the ETL transaction model, which describes how a process extracts data from a source, transforms it according to a rule set, and then loads it into a datastore. Pig can ingest data from files, streams, or other sources using User Defined Functions (UDFs). Once it has the data, it can perform select, iteration, and other transforms over it. Again, the UDF feature allows passing the data to more complex algorithms for the transform, and Pig can finally store the results back into HDFS.
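
To make that extract-transform-load shape concrete, here is a minimal sketch that drives Pig from Java through the PigServer API. The input path, schema, and filter rule are hypothetical placeholders.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlSketch {
    public static void main(String[] args) throws Exception {
        // Submit to the cluster; ExecType.LOCAL is handy for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Extract: load hypothetical comma-separated ticker data from HDFS.
        pig.registerQuery("raw = LOAD '/data/ticker.csv' USING PigStorage(',') "
                + "AS (symbol:chararray, trade_date:chararray, close_price:double);");
        // Transform: a trivial cleaning rule standing in for a real rule set.
        pig.registerQuery("clean = FILTER raw BY close_price > 0.0;");
        // Load: write the cleaned relation back to HDFS.
        pig.store("clean", "/data/ticker_clean");
    }
}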


Introduction

In this tutorial we will be analyzing geolocation and truck data. We will import this data into HDFS, build derived tables in Hive, and then process the data using Pig and Hive. The processed data is then imported into Microsoft Excel, where it can be visualized.
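
As a minimal sketch of the import step, assuming a local geolocation.csv and a hypothetical target directory, the Hadoop FileSystem API can push the file into HDFS (the hdfs dfs -put command does the same from a shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical local file and HDFS target path.
        fs.copyFromLocalFile(new Path("geolocation.csv"),
                new Path("/user/hive/geolocation/geolocation.csv"));
        fs.close();
    }
}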

Prerequisite:

Goals of the Tutorial

In this section we are going to walk through the process of using Apache Zeppelin and Apache Spark to interactively analyze data on an Apache Hadoop cluster.

By the end of this tutorial, you will have learned:

  1. How to interact with Apache Spark from Apache Zeppelin
  2. How to read a text file from HDFS and create an RDD (a minimal sketch follows this list)
  3. How to interactively analyze a data set through a rich set of Spark API operations
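
In Zeppelin these steps are typically typed into a notebook paragraph; as a minimal standalone sketch of the same Spark API, assuming a hypothetical HDFS path:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHdfsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ZeppelinStyleAnalysis");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Read a text file from HDFS into an RDD; the path is a placeholder.
        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/example.txt");
        // Two of the interactive-style operations: filter, then count.
        long nonEmpty = lines.filter(line -> !line.trim().isEmpty()).count();
        System.out.println("Non-empty lines: " + nonEmpty);
        sc.stop();
    }
}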

Getting started

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public class WeatherDataParser {
    /**
     * Given a string of the form returned by the api call:
     * http://api.openweathermap.org/data/2.5/forecast/daily?q=94043&mode=json&units=metric&cnt=7
     * retrieve the maximum temperature for the day indicated by dayIndex.
     */
    public static double getMaxTemperatureForDay(String weatherJsonStr, int dayIndex) throws JSONException {
        // Assumes OpenWeatherMap's daily layout: a "list" array of days, each with a "temp" object.
        JSONObject forecast = new JSONObject(weatherJsonStr);
        JSONArray days = forecast.getJSONArray("list");
        return days.getJSONObject(dayIndex).getJSONObject("temp").getDouble("max");
    }
}
Announcing the release of the Dell EMC Ready Bundle for Hortonworks Hadoop

Dell EMC and Hortonworks bring together industry-leading solutions for enterprise-ready open data platforms and modern data applications. Together they help customers modernize, automate, and transform how they deliver IT services to critical business applications, while realizing cost savings that free them to invest in the new technologies, methodologies, and skills needed to succeed in the emerging digital economy. Empower your organization with deeper insights and enhanced data-driven decision making by using the right infrastructure for the right data. With solutions that integrate, store, manage, and protect your data, you can rapidly deploy Big Data analytics applications or start to develop your own.

As a Select member of the Dell EMC Technology Connect Partner Program, Dell EMC is able to resell Hortonworks Data Platform (HDP™), giving customers a simple way to procure Open Enterprise Hadoop as a complementary component of their data architectures and to enable a broad range of new applications.