tangoAnkur
Learning and Executing knowledge in the World of Curious Data and its Analysis

tangoAnkur / setup.sh
Created September 6, 2019 08:56 — forked from n3tr/setup.sh
Install Spark + Zeppelin on EC2
# Scala install
wget https://www.scala-lang.org/files/archive/scala-2.11.7.deb
sudo dpkg -i scala-2.11.7.deb

# sbt install: register the sbt apt repository, trust its signing key, then install
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
sudo apt-get update
sudo apt-get install -y sbt
#!/usr/bin/python
# -*- coding: utf-8 -*-
from itertools import islice

from pyspark import SparkContext
# SQLContext lives in pyspark.sql, not the top-level pyspark package
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
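For context, a minimal sketch of how these imports typically fit together; the input path and column name below are placeholders, not taken from the gist:

# Build (or reuse) a SparkSession; the SparkContext hangs off it.
spark = SparkSession.builder.appName("example").getOrCreate()
sc = spark.sparkContext

# Read a CSV into a DataFrame and cast one column to string.
df = spark.read.csv("data.csv", header=True)              # placeholder path
df = df.withColumn("id", col("id").cast(StringType()))    # placeholder column

# islice is the usual trick for skipping a header row when reading the
# same file as a raw RDD instead of a DataFrame.
rdd = sc.textFile("data.csv").mapPartitionsWithIndex(
    lambda i, it: islice(it, 1, None) if i == 0 else it
)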
tangoAnkur / 00-MultipleOutputs
Created July 17, 2019 10:29 — forked from airawat/00-MultipleOutputs
MultipleOutputs sample program - A program that demonstrates how to generate an output file for each key
********************************
Gist
********************************
Motivation
-----------
A typical MapReduce job writes files named with the prefix "part-", then "m" or "r" depending
on whether the file is map or reduce output, and then the part number (e.g. part-r-00000).
There are scenarios where we may want to create separate files based on criteria such as
data keys and/or values. Enter the "MultipleOutputs" functionality.
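
The gist itself demonstrates this with Hadoop's Java MultipleOutputs API. As a rough sketch of
the same idea in PySpark (an analogue, not the gist's code), partitioning a write by the key
column yields a separate output directory, and hence separate part files, per key value:

# Assumes the SparkSession `spark` from the sketch above; data and path are hypothetical.
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# partitionBy writes one subdirectory per distinct key instead of a single part-* series:
#   /tmp/by-key/key=a/part-...   /tmp/by-key/key=b/part-...
df.write.partitionBy("key").csv("/tmp/by-key", mode="overwrite")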