
Tutorial Overview

In this tutorial you will gain a working knowledge of Pig through hands-on experience writing Pig scripts that carry out essential data operations and tasks.

We will first read in two data files that contain New York Stock Exchange dividend prices and stock prices, and then use these files to perform a number of Pig operations (a short Pig Latin sketch follows the list), including:

  • Define a relation with and without a schema
  • Define a new relation from an existing relation
  • Select specific columns from within a relation
  • Join two relations
  • Sort the data using ‘ORDER BY’
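
As a preview, a minimal Pig Latin sketch of these operations might look like the following. The file names (NYSE_dividends, NYSE_daily) and column layouts are assumptions for illustration, not necessarily the tutorial's exact data:

```pig
-- Define a relation without a schema, then with a schema
-- (file names and columns are illustrative assumptions)
dividends_raw = LOAD 'NYSE_dividends';
dividends = LOAD 'NYSE_dividends'
    AS (exchange:chararray, symbol:chararray, date:chararray, dividend:float);
daily = LOAD 'NYSE_daily'
    AS (exchange:chararray, symbol:chararray, date:chararray, price:float);

-- Define a new relation from an existing relation,
-- selecting specific columns
symbols = FOREACH dividends GENERATE symbol, dividend;

-- Join two relations on their common keys
joined = JOIN daily BY (symbol, date), dividends BY (symbol, date);

-- Sort the data using ORDER BY
sorted = ORDER joined BY dividend DESC;

DUMP sorted;
```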

Overview

Hive is designed to enable easy data summarization and ad-hoc analysis of large volumes of data. It uses a query language called HiveQL, which is similar to SQL.

In this tutorial, we will explore the following (a short HiveQL sketch follows the list):

  1. Load a data file into a Hive table
  2. Create a table using the RCFile format
  3. Query tables
  4. Compare managed and external tables
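
As a preview, a minimal HiveQL sketch of these steps might look like the following. The table names, columns, and file path are illustrative assumptions, not the tutorial's exact data set:

```sql
-- 1. Load a data file into a Hive table
CREATE TABLE stocks (symbol STRING, price FLOAT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/user/hue/stocks.tsv' OVERWRITE INTO TABLE stocks;

-- 2. Create a table stored in the RCFile format
CREATE TABLE stocks_rc STORED AS RCFILE
  AS SELECT * FROM stocks;

-- 3. Query tables
SELECT symbol, AVG(price) AS avg_price
FROM stocks_rc
GROUP BY symbol;

-- 4. Managed vs. external: dropping an external table removes only the
--    table definition, not the underlying files; dropping a managed
--    table deletes the data as well.
CREATE EXTERNAL TABLE stocks_ext (symbol STRING, price FLOAT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/hue/stocks_ext';
```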

Processing a real-time event stream with Apache Storm

Introduction

In this tutorial, we will explore Apache Storm and use it with Apache Kafka to develop a multi-stage event processing pipeline.


In an event processing pipeline, each stage is a purpose-built step that performs some real-time processing on upstream event streams for downstream analysis. This produces progressively richer event streams as data flows through the pipeline.
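
As a rough sketch of what a multi-stage topology looks like in code, the following Java example wires a stand-in spout to two processing stages. The spout and bolts here are hypothetical placeholders; in the pipeline built in this tutorial, a Kafka spout would sit at the front instead of the synthetic event source shown:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class EventPipelineTopology {

    // Stand-in event source; a Kafka spout would take this place in practice.
    public static class EventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) {
            collector = c;
        }
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("truck-7,speeding"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("raw"));
        }
    }

    // Stage 1: parse raw CSV events into structured tuples.
    public static class ParseBolt extends BaseBasicBolt {
        public void execute(Tuple t, BasicOutputCollector out) {
            String[] parts = t.getStringByField("raw").split(",");
            out.emit(new Values(parts[0], parts[1]));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("truckId", "eventType"));
        }
    }

    // Stage 2: count events per truck, a richer stream for downstream analysis.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();
        public void execute(Tuple t, BasicOutputCollector out) {
            String truckId = t.getStringByField("truckId");
            out.emit(new Values(truckId, counts.merge(truckId, 1, Integer::sum)));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("truckId", "eventCount"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout());
        builder.setBolt("parse", new ParseBolt(), 4).shuffleGrouping("events");
        builder.setBolt("count", new CountBolt(), 2)
               .fieldsGrouping("parse", new Fields("truckId"));
        new LocalCluster().submitTopology("pipeline", new Config(),
                builder.createTopology());
    }
}
```

Each bolt consumes the stream of the stage before it, so the tuples grow richer at every hop, which mirrors the pattern this tutorial builds out with Kafka.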

Introduction

This tutorial describes how to refine data for a Trucking IoT Data Discovery (aka IoT Discovery) use case using the Hortonworks Data Platform. The IoT Discovery use case involves vehicles, devices, and people moving across a map or similar surface. The analysis is interested in tying location information together with your analytic data.

Hello World is often used by developers to familiarize themselves with new concepts by building a simple program. This tutorial aims to achieve a similar purpose by getting practitioners started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to help you build your first HDP application.

For our tutorial, we are looking at a use case where we have a truck fleet. Each truck has been equipped to log location and event data. These events are streamed back to a data center, where we will process the data. The company wants to use this data to better understand risk.

Summary

This tutorial describes how to load data into the Hortonworks sandbox.

The Hortonworks sandbox is a fully contained Hortonworks Data Platform (HDP) environment. The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all the tools needed for data ingestion and processing. You can access and analyze sandbox data with many Business Intelligence (BI) applications.

In this tutorial, we will load and review data for a fictitious web retail store in what has become an established use case for Hadoop: deriving insights from large data sources such as web logs. By combining web logs with more traditional customer data, we can better understand our customers, and also understand how to optimize future promotions and advertising.
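
As a preview, a minimal HiveQL sketch of that workflow might look like the following. The weblogs and customers table layouts and the file path are assumptions for illustration:

```sql
-- Load raw web logs into a Hive table
CREATE TABLE weblogs (ip STRING, ts STRING, url STRING, referrer STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/user/hue/weblogs.tsv' OVERWRITE INTO TABLE weblogs;

-- Combine web logs with traditional customer data
SELECT c.segment, COUNT(*) AS visits
FROM weblogs w
JOIN customers c ON (w.ip = c.ip)
GROUP BY c.segment;
```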

Prerequisites:

  • Hortonworks Sandbox 2.3 (installed and running)

Democratizing Big Data with Azure HDInsight

by Saptak Sen

Azure HDInsight is an enterprise-grade cloud platform for the industry's leading open-source big data technologies.

The best way to explain big data is to look at how customers are leveraging it on Azure HDInsight to be more productive.

Case Study

AccuWeather is a global technology firm that is leveraging the Microsoft cloud to build predictive analytics into its solutions. With the power of the Microsoft cloud and Azure HDInsight, AccuWeather has been able to scale to billions of requests a day and to petabytes of data.

The following HiveQL snippet (truncated in the preview) creates ORC-backed copies of TPC-H text tables in an S3-hosted `llap` database:

```sql
create database if not exists llap
location 's3a://<your_S3_bucket>/llap.db';

drop table if exists llap.customer;
create table llap.customer
stored as orc
as select * from tpch_text_2.customer;

drop table if exists llap.lineitem;
create table llap.lineitem
```
Announcing the release of the Dell EMC Ready Bundle for Hortonworks Hadoop

Dell EMC and Hortonworks bring together industry-leading solutions for enterprise-ready open data platforms and modern data applications. They help customers modernize, automate, and transform how they deliver IT services to their critical business applications, while realizing cost savings that allow them to fund and invest in the new technologies, methodologies, and skills needed to succeed in the emerging digital economy. Empower your organization with deeper insights and enhanced data-driven decision making by using the right infrastructure for the right data. With solutions that integrate, store, manage, and protect your data, you can rapidly deploy Big Data analytics applications or start to develop your own.

As a Select member of the Dell EMC Technology Connect Partner Program, Dell EMC is able to resell Hortonworks Data Platform (HDP™), giving customers a simple way to procure Open Enterprise Hadoop as a complementary component of their data architectures to enable a broad range of new applications.
