Prateek Rungta (prateek)
//
// NSObject+setValuesForKeysWithJSONDictionary.h
// SafeSetDemo
//
// Created by Tom Harrington on 12/29/11.
// Copyright (c) 2011 Atomic Bird, LLC. All rights reserved.
//
#import <Foundation/Foundation.h>
/*
* Copyright 2013 Cloudera Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
@prateek
prateek / GiraphInstall
Last active December 29, 2015 05:29
Instructions to get Giraph up and running
# Giraph does not have a central repository as of this writing,
# so we build a local version and store it in our local Maven repository
git clone https://github.com/apache/giraph.git
# And retrieve a patch for `GIRAPH-442`
wget http://www.mail-archive.com/user@giraph.apache.org/msg00945/check.diff
# We also revert to version 1.0.0 for this codebase
cd giraph
git checkout release-1.0.0-RC3
# Assumed continuation (the preview is truncated here):
# apply the patch and install Giraph into the local Maven repository
git apply ../check.diff
mvn -DskipTests clean install
@prateek
prateek / oozie-java-driver-hook
Created March 10, 2014 13:49
Code Snippet for Hooking Java Drivers in Oozie
import org.apache.hadoop.mapreduce.Job;

// Runnable shutdown hook that holds a reference to the driver's Job.
public class DriverHook implements Runnable {
    private final Job job;

    private DriverHook(Job job) {
        this.job = job;
    }

    public static DriverHook create(Job job) {
        return new DriverHook(job);
    }

    public void run() {
        System.out.println("Hello from MyHook");
        if (job == null)
            throw new NullPointerException("err msg");
    }
}
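For context, here is a minimal sketch of how such a hook might be registered from a driver's main() so that it runs when the launcher JVM shuts down (MyDriver and the job name are hypothetical; addShutdownHook and Job.getInstance are standard JDK/Hadoop APIs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job"); // hypothetical job name
        // Register the hook so it fires when the driver JVM exits,
        // e.g. if the Oozie launcher is killed.
        Runtime.getRuntime().addShutdownHook(new Thread(DriverHook.create(job)));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}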
#!/usr/bin/python
import csv
import sys
import argparse
import io
csv.field_size_limit(sys.maxsize)
parser = argparse.ArgumentParser(description='Clean csv of in-line newlines')
parser.add_argument('infile', help='Path to input CSV file')
args = parser.parse_args()
# Assumed continuation (the preview is truncated here): re-emit each row
# with embedded newlines replaced by spaces
writer = csv.writer(sys.stdout)
with io.open(args.infile, newline='') as f:
    for row in csv.reader(f):
        writer.writerow([col.replace('\n', ' ') for col in row])

Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.

This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.

It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.

HyperLogLog hashes each value it sees, then assigns it to a bucket based on the low-order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by taking a bitwise & with 1023:

select fnv_hash(mycolumn) & 1023 as bucket
from mytable;  -- assumed continuation; "mytable" stands in for the table being queried
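To make the construction concrete, here is a minimal HyperLogLog sketch in Java that mirrors what the SQL builds up: bucket by the low 10 bits of the hash, keep each bucket's maximum "rank" (the position of the lowest set bit of the remaining bits), and combine the buckets with a harmonic mean. This is a sketch under stated assumptions, not Impala's implementation: hash64 (a splitmix64 finalizer) stands in for fnv_hash, 0.721 is the standard bias-correction constant for 1024 buckets, and the small-range correction from the HyperLogLog paper is omitted.

import java.util.stream.IntStream;

public class HllSketch {
    static final int M = 1024;          // number of buckets
    static final double ALPHA = 0.721;  // bias correction for m = 1024

    // Stand-in for Impala's fnv_hash(): the splitmix64 finalizer, a well-mixed 64-bit hash.
    static long hash64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    public static void main(String[] args) {
        int[] buckets = new int[M];
        for (long v = 1; v <= 1_000_000; v++) {
            long h = hash64(v);
            int bucket = (int) (h & (M - 1));  // bitwise & with 1023, as above
            long rest = h >>> 10;              // the remaining 54 bits
            // Rank: 1-based position of the lowest set bit of the remaining bits.
            int rank = rest == 0 ? 55 : Long.numberOfTrailingZeros(rest) + 1;
            buckets[bucket] = Math.max(buckets[bucket], rank);
        }
        // Harmonic-mean combination of the per-bucket maxima.
        double sum = IntStream.of(buckets).mapToDouble(r -> Math.pow(2, -r)).sum();
        System.out.printf("estimate: %.0f%n", ALPHA * M * M / sum);
    }
}

Run over one million distinct values, this should print an estimate within a few percent of 1,000,000, in line with HyperLogLog's expected relative error of about 1.04/sqrt(1024), roughly 3%.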

Hadoop Administration Resources

  • Official Docs - overwhelming, and invaluable - link

  • Kathleen Ting: 7 Deadly Hadoop Misconfigurations - video, slides

  • Philip Zeyliger: The guy whose day job is writing CM, presented Debugging Distributed Systems - slides

  • Gwen Shapira: Scaling ETL with Hadoop slides

  • Cloudera's Resource Library link

Usage Instructions for Viz-Oozie

    # install graphviz
    $ sudo yum install -y graphviz

    # install vizoozie
    $ git clone https://github.com/iprovalo/vizoozie
    $ cd vizoozie
    $ sudo python setup.py install

CDH5 Cluster Setup

These are the steps I followed to set up a CentOS 6.5 VM and install CDH5 and CM5 on it. All these commands should be run on a single node; if running on a cluster, that node will serve as the master node.

Caveats:

  • This setup uses the embedded PostgreSQL database for the services, which is a TERRIBLE idea if the environment is anything but a short-lived POC.
  • This was done for a 4-node POC cluster where a single instance was going to be the dedicated master - all the CM management and Hadoop master daemons would run on it - and the 3 remaining nodes would be data nodes.

Steps to follow

  1. Create a new local CentOS 6.5 image based on this ISO.