using System;
using System.Text;
using Microsoft.ServiceBus.Messaging;
using System.Net;
using System.IO;
namespace StreamingAnalyticsEventPublisher
{
    class MeetupRSVPEventSender
    {
@airawat
airawat / 00-OozieWorkflowCallWithJavaAPI
Last active September 4, 2016 17:59
Oozie workflow - invoked from Java using Oozie Java API
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;
public class myOozieWorkflowJavaAPICall {
    public static void main(String[] args) {
        OozieClient wc = new OozieClient("http://cdh-dev01:11000/oozie");
airawat / 00-LogParser-JavaMapReduce-Regex
Last active September 18, 2016 09:36
00-JavaMapperReducerUsingRegex
This gist includes a mapper, reducer, and driver in Java that parse log files using
regex; the combiner code is the same as the reducer.
Use case: Count the number of occurrences of processes that got logged, inception to date.
Includes:
---------
Sample data and scripts for download: 01-ScriptAndDataDownload
Sample data and structure: 02-SampleDataAndStructure
Mapper: 03-LogEventCountMapper.java
Reducer: 04-LogEventCountReducer.java
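The core of such a mapper is the regex that pulls the process name out of each syslog line. A minimal plain-Java sketch of that extraction (the class name and pattern here are illustrative, not the gist's exact code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Core of a syslog-parsing mapper: extract the logging process name from a line.
// The real mapper would emit (processName, 1) for each match.
public class LogEventExtractor {
    // Skips the timestamp and host fields, then captures the process name,
    // which may be followed by an optional "[pid]" before the colon,
    // e.g. "sshd[1208]:" or "init:".
    private static final Pattern PROCESS =
        Pattern.compile("^\\w+\\s+\\d+\\s+\\S+\\s+(\\S+)\\s+([\\w./-]+)(\\[\\d+\\])?:");

    public static String processName(String logLine) {
        Matcher m = PROCESS.matcher(logLine);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(
            processName("May  3 11:52:54 cdh-dev01 sshd[1208]: Accepted password for user"));
    }
}
```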
airawat / 00-MapSideJoinDistCacheThruGenericOptionsParser
Last active September 25, 2016 14:31
Map-side join example - Java code for joining two datasets - one large (tsv format), and one with reference data (txt file), made available through DistributedCache via command line (GenericOptionsParser)
This gist is part of a series of gists related to map-side joins in Java MapReduce.
In the gist at https://gist.github.com/airawat/6597557, we added reference data available
in HDFS to the distributed cache from the driver code.
This gist demonstrates adding a local file to the distributed cache via the command line.
Refer to the gist at https://gist.github.com/airawat/6597557 for:
1. Data samples and structure
2. Expected results
3. Commands to load data to HDFS
Kerberos
Kerberos is a network authentication protocol. It is designed to provide strong authentication for client/server applications by using secret-key cryptography.
Kerberos Principals
A user in Kerberos is called a principal, which is made up of three distinct components: the primary, instance, and realm.
A Kerberos principal is used in a Kerberos-secured system to represent a unique identity.
The first component of the principal is called the primary, or sometimes the user component.
The primary component is an arbitrary string and may be the operating system username of the user or the name of a service.
The primary component is followed by an optional section called the instance, which is used to create principals that are used by users in special roles or to define the host on which a service runs, for example.
An instance, if present, is separated from the primary by a slash and disambiguates multiple principals for a single user or service.
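Putting the pieces together, a principal string has the form primary[/instance]@REALM. A small sketch that splits a principal into its three components (the example principals are illustrative):

```java
// Split a Kerberos principal of the form primary[/instance]@REALM
// into its three components. Example principals below are illustrative:
//   "alice@EXAMPLE.COM"                   -> user principal, no instance
//   "alice/admin@EXAMPLE.COM"             -> same user in an admin role
//   "hdfs/node1.example.com@EXAMPLE.COM"  -> service principal, host instance
public class PrincipalParts {
    public final String primary;
    public final String instance; // null when the principal has no instance
    public final String realm;

    public PrincipalParts(String principal) {
        int at = principal.lastIndexOf('@');
        String name = principal.substring(0, at);
        this.realm = principal.substring(at + 1);
        int slash = name.indexOf('/');
        this.primary = slash < 0 ? name : name.substring(0, slash);
        this.instance = slash < 0 ? null : name.substring(slash + 1);
    }

    public static void main(String[] args) {
        PrincipalParts p = new PrincipalParts("hdfs/node1.example.com@EXAMPLE.COM");
        System.out.println(p.primary + " | " + p.instance + " | " + p.realm);
    }
}
```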
spark-submit --class com.khanolkar.bda.util.CompactRawLogs \
............
MyJar-1.0.jar \
"/user/akhanolk/data/raw/streaming/to-be-compacted/" \
"/user/akhanolk/data/raw/compacted/" \
"2" "128" "oozie-124"
package com.khanolkar.bda.util
/**
* @author Anagha Khanolkar
*/
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs.{ FileSystem, Path }
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql._
import com.databricks.spark.avro._
airawat / 00-OozieCoordinatorJobWithTimeAsTrigger
Last active October 21, 2017 15:40
Oozie coordinator job example with time as trigger
This gist includes the components of an Oozie (time-initiated) coordinator application: scripts/code, sample data,
and commands. Oozie actions covered: HDFS action, email action, Java main action,
Hive action. Oozie controls covered: decision, fork-join. The workflow includes a
sub-workflow that runs two Hive actions concurrently. The Hive table is partitioned;
parsing uses the Hive regex SerDe and Java regex. Also, the Java mapper gets the input
directory path and includes part of it in the key.
Use case: Parse syslog-generated log files to generate reports.
Pictorial overview of job:
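The time trigger itself lives in the coordinator definition. A minimal sketch of a coordinator.xml fragment (app name, dates, and property names are illustrative, not the gist's exact values):

```xml
<!-- Fires the workflow once a day between start and end; names/paths illustrative -->
<coordinator-app name="logParserCoordinator" frequency="${coord:days(1)}"
                 start="2013-07-09T00:00Z" end="2013-07-10T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
```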
#!/bin/sh
set -x
# Create the input file based on size (you can get the size pattern by running fdisk -l as root).
# Be sure to exclude the root disk if it is part of your config; you must edit this file to do so.
size=$1
shift
fdisk -l | grep "$size" | awk '{print $2}' | sed -e 's/:$//g' > foo
airawat / 00-ReduceSideJoin
Last active December 21, 2017 19:06
ReduceSideJoin - Sample Java MapReduce program for joining datasets with cardinality of 1..1 and 1..many on the join key
My blog has an introduction to reduce-side joins in Java MapReduce:
http://hadooped.blogspot.com/2013/09/reduce-side-join-options-in-java-map.html
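The heart of a reduce-side join is what happens per join key in the reducer: the single record from the 1-side is cached and paired with every record from the many-side. A plain-Java sketch of that step (the method and record formats are illustrative, not the gist's exact code):

```java
import java.util.ArrayList;
import java.util.List;

// Per-key step of a reduce-side join. In the real job, a secondary sort
// guarantees the 1-side record arrives first in the reducer's value list;
// here that record is passed in explicitly for clarity.
public class ReduceSideJoinSketch {
    public static List<String> joinForKey(String oneSideRecord, List<String> manySideRecords) {
        List<String> joined = new ArrayList<>();
        for (String m : manySideRecords) {
            // emit one joined record per many-side value
            joined.add(oneSideRecord + "\t" + m);
        }
        return joined;
    }

    public static void main(String[] args) {
        // e.g. an employee record joined to each of its yearly salary records
        System.out.println(joinForKey("E01,Alice",
                List.of("2012,90000", "2013,95000")));
    }
}
```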