
Ian Raphael (ianrapha)

  • São Paulo, Brazil
@sap1ens
sap1ens / dedup.scala
Created July 17, 2023 15:54
Flink DataStream API deduplication in Scala
val dedupColumn = "..." // column name to use as a key for deduplication
val ttl = Some(Time.minutes(60)) // state TTL

stream
  .keyBy(row => row.getField(dedupColumn))
  .flatMap(new RichFlatMapFunction[Row, Row] {
    @transient
    private var seen: ValueState[Boolean] = _

    override def open(parameters: Configuration): Unit = {
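      // (gist preview cuts off above; the rest of the operator is sketched here as a
      //  hedged guess at the usual ValueState + TTL pattern, not the gist's exact code;
      //  assumes org.apache.flink.api.common.state._, a TTL built from
      //  org.apache.flink.api.common.time.Time, and org.apache.flink.util.Collector)
      val descriptor = new ValueStateDescriptor[Boolean]("seen", classOf[Boolean])
      ttl.foreach(t => descriptor.enableTimeToLive(StateTtlConfig.newBuilder(t).build()))
      seen = getRuntimeContext.getState(descriptor)
    }

    override def flatMap(row: Row, out: Collector[Row]): Unit = {
      // emit a row only the first time its key is seen (until the TTL expires the state)
      if (!seen.value()) {
        seen.update(true)
        out.collect(row)
      }
    }
  })
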
@davideicardi
davideicardi / README.md
Last active May 4, 2025 21:48
Write and read Avro records from a byte array

Avro serialization

There are 4 possible serialization formats when using Avro:
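
The list of formats is cut off in this preview. For the simplest case the gist's title describes — writing a GenericRecord to a raw binary byte array and reading it back — a minimal, illustrative sketch using the standard Avro Java API from Scala could look like this; the User schema below is a made-up example, not one taken from the gist:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// hypothetical schema, for illustration only
val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}""")

// serialize a record to a byte array (raw binary encoding, no schema embedded)
def toBytes(record: GenericRecord): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
  encoder.flush()
  out.toByteArray
}

// deserialize it back, using the same schema for both writer and reader
def fromBytes(bytes: Array[Byte]): GenericRecord = {
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  new GenericDatumReader[GenericRecord](schema).read(null, decoder)
}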

@bernhardschaefer
bernhardschaefer / spark-submit-streaming-yarn.sh
Last active March 21, 2022 05:04
spark-submit template for running Spark Streaming on YARN (referenced in https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-production/)
#!/bin/bash
# Minimum TODOs on a per job basis:
# 1. define name, application jar path, main class, queue and log4j-yarn.properties path
# 2. remove properties not applicable to your Spark version (Spark 1.x vs. Spark 2.x)
# 3. tweak num_executors, executor_memory (+ overhead), and backpressure settings
# the two most important settings:
num_executors=6
executor_memory=3g
@marianogappa
marianogappa / ordered_parallel.go
Last active February 12, 2024 09:27
Parallel processing with ordered output in Go
/*
Parallel processing with ordered output in Go
(you can use this pattern by importing https://github.com/MarianoGappa/parseq)
This example implementation is useful when the following 3 conditions are true:
1) the rate of input is higher than the rate of output on the system (i.e. it queues up)
2) the processing of input can be parallelised, and overall throughput increases by doing so
3) the order of output of the system needs to respect order of input
- if 1 is false, KISS!
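
The gist's implementation (and the parseq library) is Go and is not shown in this preview. Purely as a much-simplified, batch-style illustration of the same idea — process items in parallel but deliver results in input order — here is a small Scala sketch using Future.traverse, which preserves the order of its inputs; process() is a hypothetical workload:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// hypothetical slow processing step that can safely run in parallel
def process(n: Int): Int = { Thread.sleep(10); n * 2 }

val inputs = (1 to 100).toList

// Future.traverse runs process() concurrently on the global pool,
// but the resulting list keeps the original input order
val ordered: List[Int] =
  Await.result(Future.traverse(inputs)(n => Future(process(n))), Duration.Inf)
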
@patmandenver
patmandenver / install_java_scala_sbt.sh
Last active April 15, 2021 16:30
Script to install Oracle Java 1.8, Scala 2.11.7 and sbt 0.13.9 on Ubuntu 14.04
#!/bin/bash
#
# Script to Install
# Oracle Java 1.8
# Scala 2.11.7
# sbt 0.13.9
#
##############################
#Script must be run as root/sudo
@piyasde
piyasde / Hadoop 2.6.0 Multi Node Installation and Test Run in Ubuntu 14.04
Created March 19, 2015 05:14
Hadoop 2.6.0 Multi Node Installation and Test Run in Ubuntu 14.04
1. You can follow https://gist.github.com/piyasde/17d4f7bc97c0f0820d40 for the single-node setup on every machine in the cluster.
2. Create a user group for Hadoop users.
3. Create the same hadoop user on every machine where Hadoop will be set up.
4. Select the machine that will be the master node.
On that machine, edit /etc/hosts and map its IP address to the name master (not localhost):
vi /etc/hosts
#127.0.0.1 localhost
#127.0.1.1 ubupc1
@Seraf
Seraf / etc_sensu_conf.d_extensions_mailer-ses.json
Last active January 9, 2017 16:50
mailer-ses.rb extension for sensu to avoid fork bomb
{
"handlers": {
"mailer": {
"command": null,
"type": "extension",
"severities": [
"ok",
"warning",
"critical",
"unknown"
@mhausenblas
mhausenblas / SparkGrep.scala
Created February 8, 2015 16:07
Scala Spark skeleton implementing grep
package spark.example

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkGrep {
  def main(args: Array[String]) {
    if (args.length < 3) {
      System.err.println("Usage: SparkGrep <host> <input_file> <match_term>")
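      // (preview truncated above; a hedged sketch of how the skeleton plausibly
      //  continues, following the usage string — Spark 1.x API, args(0) = master URL)
      System.exit(1)
    }

    val conf = new SparkConf().setAppName("SparkGrep").setMaster(args(0))
    val sc = new SparkContext(conf)

    // grep: count the input lines that contain the match term
    val matchCount = sc.textFile(args(1)).filter(_.contains(args(2))).count()
    println(s"Lines matching '${args(2)}': $matchCount")

    sc.stop()
  }
}
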
@basharam
basharam / hadoop_multinode_cluster_setup
Created December 29, 2014 10:39
Hadoop 2.6.0 Multinode cluster Setup
#Hadoop 2.6.0 Multinode cluster Setup
From Blog http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
### Machine 1 (master)
Prerequisite:
Check the installed Java version:
java -version
@cridenour
cridenour / gist:74e7635275331d5afa6b
Last active June 17, 2025 16:47
Setting up Vim as your Go IDE

Setting up Vim as your Go IDE

The final IDE

Intro

I've been wanting to do a serious project in Go. One thing holding me back has been my working environment. As a huge PyCharm user, I was hoping the Go IDE plugin for IntelliJ IDEA would fit my needs. However, it never felt quite right. After a previous experiment a few years ago using Vim, I knew how powerful it could be if I put in the time to make it so. Luckily there are plugins for almost anything you need to do with Go or would expect from an IDE. While this is nowhere near comprehensive, it will get you writing code, building and testing with the power you would expect from Vim.

Getting Started

I'm assuming you're coming with a clean slate. For me this was OS X, so I used MacVim. There is nothing in my config files that assumes this is the case.