
@saptak
saptak / 0_reuse_code.js
Last active August 29, 2015 14:15
Here are some things you can do with Gists in GistBox.
// Use Gists to store code you would like to remember later on
console.log(window); // log the "window" object to the console
@saptak
saptak / gist:5f1de4173f9d6e85e122
Created April 22, 2015 18:06
find location of the port number in a URL
var str1 = "http://127.0.0.1:5000";
var str2 = "http://blah.blah.com";
var i1 = str1.search(/:\d/);
console.log(i1); // 16: index of the ":" that precedes the port number
i1 = str2.search(/:\d/);
console.log(i1); // -1: no port number in this URL
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.lang.reflect.Array;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;
public class MP1 {
    Random generator;
@saptak
saptak / jreport.md
Last active September 2, 2015 22:22

Using JReport to visualize data with the Hortonworks Data Platform

### Introduction

JReport is an embedded BI reporting tool that can easily extract and visualize data from the Hortonworks Data Platform 2.3 using the Apache Hive JDBC driver. You can then create reports, dashboards, and data analyses, which can be embedded into your own applications.

In this tutorial we are going to walk through the following steps to demonstrate Apache Hive with JReport:

  1. Install the Apache Hive JDBC driver with JReport.
  2. Create a new JReport Catalog to manage the Hive connection.
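As a sketch of step 2, the Hive connection that a JReport catalog manages boils down to a JDBC URL for HiveServer2. The helper below only assembles such a URL; the function name and the sandbox hostname are illustrative, not part of JReport:

```python
# Sketch: assemble a HiveServer2 JDBC URL of the kind a JReport catalog
# would store. Host and database names here are illustrative.
def hive_jdbc_url(host, port=10000, database="default"):
    # 10000 is the default HiveServer2 port.
    return "jdbc:hive2://{}:{}/{}".format(host, port, database)

url = hive_jdbc_url("sandbox.hortonworks.com")
print(url)  # jdbc:hive2://sandbox.hortonworks.com:10000/default
```

JReport would pair this URL with the Hive JDBC driver class when you register the data source in the catalog.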

### Introduction

Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, auditing, and data protection.

Apache Ranger already extends baseline features for coordinated enforcement across Hadoop workloads, from batch and interactive SQL to real-time processing in Hadoop.

In this tutorial, we cover using Apache Ranger for HDP 2.3 to secure your Hadoop environment. We will walk through the following topics:

  1. Support for Knox authorization and audit
  2. Command line policies in Hive
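To make the policy idea concrete: a Ranger authorization policy for Hive pairs a resource (database, table, column) with the users or groups allowed to act on it. The dictionary and check below are a minimal hypothetical sketch of that shape, not the exact Ranger REST payload:

```python
# Hypothetical sketch of a Ranger-style Hive authorization policy:
# a resource plus the users and access types it grants.
policy = {
    "service": "hive",
    "name": "customer_table_select",
    "resources": {"database": "default", "table": "customers", "column": "*"},
    "policyItems": [
        {"users": ["maria_dev"], "accesses": ["select"], "audit": True}
    ],
}

def is_allowed(policy, user, access):
    # Grant only if some policy item names both the user and the access type.
    return any(
        user in item["users"] and access in item["accesses"]
        for item in policy["policyItems"]
    )

print(is_allowed(policy, "maria_dev", "select"))  # True
print(is_allowed(policy, "maria_dev", "drop"))    # False
```

Ranger's central administration UI manages policies of roughly this shape and pushes them to each plugin (Hive, Knox, HDFS) for enforcement and auditing.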
@saptak
saptak / nifi.md
Last active September 20, 2015 20:42

### Introduction to Apache NiFi

A very common scenario in many large organizations is the need to define, operationalize, and manage complex dataflows between myriad distributed systems that often speak different protocols and understand different data formats. Messaging-based solutions are a popular answer these days, but they don't address many of the fundamental challenges of enterprise dataflow.

### Data Workflow Scenario

Let's dive deeper into the dataflow requirement. On one end we have systems that acquire data, whether they are sensors, business applications, or organizations gathering data for your business. The information collected needs to be sent to processing and analytics systems such as Hadoop, Storm, and Spark, and then ultimately persisted into a backing store where business users can apply analytics to the data at rest to derive business value.

Let's consider the scenario of IoT or remote sensor delivery. As the data gets collected by remote sensors on factory floors, oil rigs or travelling

A very common request from many customers is to be able to index text in image files, for example, text in scanned PNG files. In this tutorial we are going to walk through how to do this with Solr.

### Prerequisite

Step-by-step guide

  • **Install dependencies** - this will provide support for processing PNGs, JPEGs, and TIFFs
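Before handing scanned files to Solr for extraction, it helps to select only the formats the installed dependencies can process. The helper below is an illustrative sketch; the extension list simply mirrors the formats named above:

```python
import os

# Keep only files in formats the image-processing dependencies can handle.
SUPPORTED = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def ocr_candidates(filenames):
    # Compare extensions case-insensitively so "scan1.PNG" also matches.
    return [f for f in filenames if os.path.splitext(f)[1].lower() in SUPPORTED]

print(ocr_candidates(["scan1.PNG", "notes.txt", "fax.tiff"]))
# ['scan1.PNG', 'fax.tiff']
```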
@saptak
saptak / indexing-documents-with-apache-sol.md
Last active October 1, 2015 16:01
Indexing documents with Apache Solr

In this tutorial, we will learn to:

  • Configure Solr to store indexes in HDFS
  • Create a Solr cluster of two Solr instances running on ports 8983 and 8984
  • Index documents in HDFS using the Hadoop connectors
  • Use Solr to search documents
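As a sketch of the last step, searching Solr is an HTTP GET against a collection's select handler. The snippet below only builds the query URL for one of the two instances described above rather than contacting a live cluster; the collection name and field are illustrative:

```python
from urllib.parse import urlencode

# Build a Solr select URL; collection and field names are hypothetical.
def solr_query_url(host, port, collection, query, rows=10):
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return "http://{}:{}/solr/{}/select?{}".format(host, port, collection, params)

url = solr_query_url("localhost", 8983, "hdfs_docs", "text:hadoop")
print(url)
# http://localhost:8983/solr/hdfs_docs/select?q=text%3Ahadoop&rows=10&wt=json
```

The same URL shape works against the second instance by swapping in port 8984.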

Prerequisite

@saptak
saptak / frequency.py
Last active October 5, 2015 16:37
Twitter Sentiment Python
import sys
import json

def main():
    tweet_file = open(sys.argv[1])
    terms_freq = {}
    totterm = 0.0
    for line in tweet_file:
        tweet = json.loads(line)

Data processing with Hive

Hive is a component of the Hortonworks Data Platform (HDP). Hive provides a SQL-like interface to data stored in HDP. In the previous tutorial we used Pig, which is a scripting language with a focus on dataflows. Hive provides a database query interface to Apache Hadoop.

People often ask why Pig and Hive both exist when they seem to do much of the same thing. Because of its SQL-like query language, Hive is often used as the interface to an Apache Hadoop based data warehouse, and it is considered friendlier and more familiar to users who are used to querying data with SQL. Pig fits in through its dataflow strengths: it takes on the tasks of bringing data into Apache Hadoop and working it into the form needed for querying. A good overview of how this works is Alan Gates's post on the Yahoo Developer blog, titled Pig and Hive at Yahoo! From a technical point of view, both Pig and Hive are feature complete.