💭
Constraints Liberate. Liberties Constrain.

Aravind Yarram yaravind

import java.time.LocalDate

// LocalDate renders as yyyy-MM-dd when interpolated into the path
val today = LocalDate.now
val todayTransactions = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .json(s"s3n://bucket-name/${today}/transaction.json")
val yesterdayTransactions = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
yaravind / spark-duplicates.scala
Created May 31, 2017 14:39 — forked from crocker/spark-duplicates.scala
Find duplicates in a Spark DataFrame
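import spark.implicits._ // enables the $"count" column syntax outside spark-shell

// "header" and "inferSchema" are CSV reader options; the JSON reader ignores them
val transactions = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .json("s3n://bucket-name/transaction.json")
transactions.groupBy("id", "organization").count.sort($"count".desc).show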
package com.databricks.spark.jira
import scala.io.Source
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.sources.{TableScan, BaseRelation, RelationProvider}
yaravind / 00-LogParser-Hive-Regex
Created May 18, 2018 02:41 — forked from airawat/00-LogParser-Hive-Regex
Log parser in Hive using regex serde
This gist includes Hive QL scripts to create an external partitioned table for
Syslog-generated log files using the regex SerDe.
Use case: Count the number of occurrences of processes that got logged, by year, month,
day and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data download: 02-DataDownload
Data load commands: 03-DataLoadCommands
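The use case above (parse each Syslog line with a regex, then count entries per day and process) can be sketched outside Hive in plain Java. This is only an illustration under stated assumptions, not the gist's actual scripts: the class name `SyslogProcessCount`, the pattern, the host name `host01`, and the sample lines are all hypothetical. Classic Syslog lines carry no year, so the sketch takes the year as a parameter.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyslogProcessCount {
    // Hypothetical pattern for lines shaped like
    // "May  3 11:52:54 host01 process[pid]: message" (the [pid] part is optional).
    private static final Pattern LINE = Pattern.compile(
        "^(\\w{3})\\s+(\\d{1,2})\\s+(\\d{2}:\\d{2}:\\d{2})\\s+(\\S+)\\s+"
            + "([^:\\[\\s]+)(?:\\[\\d+\\])?:\\s+(.*)$");

    // Counts log entries per "year-month-day process" key.
    // The year is supplied by the caller (assumption: the log line has none).
    public static Map<String, Integer> countByDayAndProcess(int year, List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // skip lines that do not fit the assumed layout
            }
            String key = year + "-" + m.group(1) + "-" + m.group(2) + " " + m.group(5);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only.
        List<String> sample = List.of(
            "May  3 11:52:54 host01 init: tty (/dev/tty6) main process (1208) killed by TERM signal",
            "May  3 11:53:31 host01 kernel: registered taskstats version 1",
            "May  3 11:53:35 host01 kernel: sr0: scsi3-mmc drive",
            "not a syslog line");
        countByDayAndProcess(2013, sample)
            .forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

Grouping on a composite string key keeps the sketch short; a Hive table would instead expose month, day and process as separate columns and group on those.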
yaravind / install-docker-ce-on-elementaryos-juno.sh
Last active December 28, 2018 04:32 — forked from BeerOnBeard/install-docker-ce-on-elementaryos-loki.sh
Install Docker CE on ElementaryOS 0.4.1 Loki
#!/bin/bash
set -e
##########################################################
# Install script for Docker-CE on ElementaryOS 0.4.1 Loki
# Had to update the repository to point to xenial instead
# of using 'lsb_release -cs' because there's no loki
# repository at download.docker.com.
##########################################################
yaravind / .gitconfig
Created December 29, 2018 01:15 — forked from johnpolacek/.gitconfig
My current .gitconfig aliases
[alias]
co = checkout
cob = checkout -b
coo = !git fetch && git checkout
br = branch
brd = branch -d
brD = branch -D
merged = branch --merged
dmerged = "!git branch --merged | grep -v '\\*' | xargs -n 1 git branch -d"
st = status
# Custom history configuration
# Run script using:
# chmod u+x better_history.sh
# sudo su
# ./better_history.sh
echo ">>> Starting"
echo ">>> Loading configuration into /etc/bash.bashrc"
echo "HISTTIMEFORMAT='%F %T '" >> /etc/bash.bashrc
echo 'HISTFILESIZE=-1' >> /etc/bash.bashrc
yaravind / DataFrameWithFileName.scala
Created April 15, 2020 03:22 — forked from satendrakumar/DataFrameWithFileName.scala
Add file name as Spark DataFrame column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
object DataFrameWithFileNameApp extends App {
val spark: SparkSession =
SparkSession
.builder()
.appName("DataFrameApp")
.config("spark.master", "local[*]")
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.yarn.webapp.hamlet.HamletSpec.P;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
yaravind / WikiPageClustering.java
Created April 28, 2020 18:04 — forked from Jeffwan/WikiPageClustering.java
Machine Learning Pipeline
package com.diorsding.spark.ml;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;