using System;
using System.Text;
using Microsoft.ServiceBus.Messaging;
using System.Net;
using System.IO;
namespace StreamingAnalyticsEventPublisher
{
    class MeetupRSVPEventSender
    {
@airawat
airawat / 00-OozieWorkflowCallWithJavaAPI
Last active September 4, 2016 17:59
Oozie workflow - invoked from Java using Oozie Java API
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;
public class myOozieWorkflowJavaAPICall {
    public static void main(String[] args) {
        OozieClient wc = new OozieClient("http://cdh-dev01:11000/oozie");
airawat / 00-LogParser-JavaMapReduce-Regex
Last active September 18, 2016 09:36
00-JavaMapperReducerUsingRegex
This gist includes a mapper, reducer, and driver in Java that parse log files using
regex; the combiner code is the same as the reducer.
Use case: Count the number of occurrences of processes that got logged, inception to date.
Includes:
---------
Sample data and scripts for download: 01-ScriptAndDataDownload
Sample data and structure: 02-SampleDataAndStructure
Mapper: 03-LogEventCountMapper.java
Reducer: 04-LogEventCountReducer.java
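The core of such a mapper is the regex that pulls the process name out of each syslog line. A minimal plain-Java sketch of that extraction (the class name and pattern here are illustrative, not the gist's exact code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Core of a syslog-parsing mapper: extract the logging process name from a line.
// The real mapper would emit (processName, 1) for each match.
public class LogEventExtractor {
    // Skips the timestamp and host fields, then captures the process name,
    // which may be followed by an optional "[pid]" before the colon,
    // e.g. "sshd[1208]:" or "init:".
    private static final Pattern PROCESS =
        Pattern.compile("^\\w+\\s+\\d+\\s+\\S+\\s+(\\S+)\\s+([\\w./-]+)(\\[\\d+\\])?:");

    public static String processName(String logLine) {
        Matcher m = PROCESS.matcher(logLine);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(
            processName("May  3 11:52:54 cdh-dev01 sshd[1208]: Accepted password for user"));
    }
}
```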
airawat / 00-MapSideJoinDistCacheThruGenericOptionsParser
Last active September 25, 2016 14:31
Map-side join example - Java code for joining two datasets - one large (tsv format), and one with reference data (txt file), made available through DistributedCache via command line (GenericOptionsParser)
This gist is part of a series of gists related to map-side joins in Java MapReduce.
In the gist at https://gist.github.com/airawat/6597557, we added reference data available
in HDFS to the distributed cache from the driver code.
This gist demonstrates adding a local file to the distributed cache via the command line.
Refer to the gist at https://gist.github.com/airawat/6597557 for:
1. Data samples and structure
2. Expected results
3. Commands to load data to HDFS
Kerberos
Kerberos is a network authentication protocol. It is designed to provide strong authentication for client/server applications by using secret-key cryptography.
Kerberos Principals
A user in Kerberos is called a principal, which is made up of three distinct components: the primary, instance, and realm.
A Kerberos principal is used in a Kerberos-secured system to represent a unique identity.
The first component of the principal is called the primary, or sometimes the user component.
The primary component is an arbitrary string and may be the operating system username of the user or the name of a service.
The primary component is followed by an optional section called the instance, which is used to create principals that are used by users in special roles or to define the host on which a service runs, for example.
An instance, if present, is separated from the primary by a slash and disambiguates multiple principals for a single user or service.
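Putting the pieces together, a principal string has the form primary[/instance]@REALM. A small sketch that splits a principal into its three components (the example principals are illustrative):

```java
// Split a Kerberos principal of the form primary[/instance]@REALM
// into its three components. Example principals below are illustrative:
//   "alice@EXAMPLE.COM"                   -> user principal, no instance
//   "alice/admin@EXAMPLE.COM"             -> same user in an admin role
//   "hdfs/node1.example.com@EXAMPLE.COM"  -> service principal, host instance
public class PrincipalParts {
    public final String primary;
    public final String instance; // null when the principal has no instance
    public final String realm;

    public PrincipalParts(String principal) {
        int at = principal.lastIndexOf('@');
        String name = principal.substring(0, at);
        this.realm = principal.substring(at + 1);
        int slash = name.indexOf('/');
        this.primary = slash < 0 ? name : name.substring(0, slash);
        this.instance = slash < 0 ? null : name.substring(slash + 1);
    }

    public static void main(String[] args) {
        PrincipalParts p = new PrincipalParts("hdfs/node1.example.com@EXAMPLE.COM");
        System.out.println(p.primary + " | " + p.instance + " | " + p.realm);
    }
}
```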
spark-submit --class com.khanolkar.bda.util.CompactRawLogs \
............
MyJar-1.0.jar \
"/user/akhanolk/data/raw/streaming/to-be-compacted/" \
"/user/akhanolk/data/raw/compacted/" \
"2" "128" "oozie-124"
package com.khanolkar.bda.util
/**
* @author Anagha Khanolkar
*/
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs.{ FileSystem, Path }
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql._
import com.databricks.spark.avro._
airawat / 00-OozieCoordinatorJobWithTimeAsTrigger
Last active October 21, 2017 15:40
Oozie coordinator job example with time as trigger
This gist includes the components of an Oozie (time-initiated) coordinator application: scripts/code, sample data,
and commands. Oozie actions covered: HDFS action, email action, Java main action,
Hive action. Oozie controls covered: decision, fork-join. The workflow includes a
sub-workflow that runs two Hive actions concurrently. The Hive table is partitioned;
parsing uses the Hive regex SerDe and Java regex. Also, the Java mapper gets the input
directory path and includes part of it in the key.
Use case: Parse syslog-generated log files to generate reports.
Pictorial overview of job:
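The time trigger itself lives in the coordinator definition. A minimal sketch of a coordinator.xml fragment (app name, dates, and property names are illustrative, not the gist's exact values):

```xml
<!-- Fires the workflow once a day between start and end; names/paths illustrative -->
<coordinator-app name="logParserCoordinator" frequency="${coord:days(1)}"
                 start="2013-07-09T00:00Z" end="2013-07-10T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
```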
#!/bin/sh
set -x
# Create the input file based on size (you can get the size pattern by running fdisk -l as root).
# Be sure to exclude the root disk if it is part of your config; you must edit this file to do so.
size=$1
shift
fdisk -l | grep "$size" | awk '{print $2}' | sed -e 's/:$//g' > foo
airawat / 00-ReduceSideJoin
Last active December 21, 2017 19:06
ReduceSideJoin - Sample Java MapReduce program for joining datasets with cardinality of 1..1 and 1..many on the join key
My blog has an introduction to reduce-side joins in Java MapReduce:
http://hadooped.blogspot.com/2013/09/reduce-side-join-options-in-java-map.html
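The heart of a reduce-side join is what happens per join key in the reducer: the single record from the 1-side is cached and paired with every record from the many-side. A plain-Java sketch of that step (the method and record formats are illustrative, not the gist's exact code):

```java
import java.util.ArrayList;
import java.util.List;

// Per-key step of a reduce-side join. In the real job, a secondary sort
// guarantees the 1-side record arrives first in the reducer's value list;
// here that record is passed in explicitly for clarity.
public class ReduceSideJoinSketch {
    public static List<String> joinForKey(String oneSideRecord, List<String> manySideRecords) {
        List<String> joined = new ArrayList<>();
        for (String m : manySideRecords) {
            // emit one joined record per many-side value
            joined.add(oneSideRecord + "\t" + m);
        }
        return joined;
    }

    public static void main(String[] args) {
        // e.g. an employee record joined to each of its yearly salary records
        System.out.println(joinForKey("E01,Alice",
                List.of("2012,90000", "2013,95000")));
    }
}
```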