Jai Prakash (prakashrd)

🏠
Working from home
View GitHub Profile
# Find duplicate values in a column and extract those rows
awk 'BEGIN { FS="," } { c[$2]++; l[$2,c[$2]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }' file.csv
# replace $2 with whichever column you want to check for duplicates
# The same code as above, with more comments
BEGIN { FS = ";" }
{
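# For comparison, a hedged pandas equivalent of the duplicate-row extraction above
# (the column name "col2" and the file name are assumptions, not from the gist):
import pandas as pd

df = pd.read_csv("file.csv")
# keep=False marks every row whose "col2" value occurs more than once
dups = df[df.duplicated(subset="col2", keep=False)]
print(dups)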
/**
 * Null out the columns specified in the metadata.
 *
 * @param inputDataframe The input dataframe to apply nulling out on
 * @param sparkSession   An active Spark session
 * @param sourceEntity   The source entity name
 * @param targetMetaData The target metadata object
 * @return A dataframe with the specified fields nulled out
 */
public static Dataset<Row> applyNullingOut(Dataset<Row> inputDataframe, SparkSession sparkSession,
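The Java preview above cuts off mid-signature; as a rough, hedged sketch of the same idea (nulling out a list of columns) in PySpark, where the column list and function name are my own assumptions rather than the gist's code:
from pyspark.sql.functions import lit

def apply_nulling_out(df, columns_to_null):
    # Replace each listed column with NULL, preserving its original data type
    for c in columns_to_null:
        df = df.withColumn(c, lit(None).cast(df.schema[c].dataType))
    return df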
@prakashrd
prakashrd / scala_java8.java
Created April 2, 2019 11:44
Java8 Snippets
import scala.Tuple2;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
// Find "__"-prefixed fields whose unprefixed name matches an existing field, so they can be dropped
List<String> fieldNames = Arrays.asList(inputDF.columns());
List<Tuple2<String, String>> fieldList = fieldNames.stream()
.filter(fieldName -> fieldName.trim().startsWith("__"))
.map(fieldName -> Tuple2.apply(fieldName, fieldName.substring(2)))
.filter(tuple2 -> fieldNames.contains(tuple2._2))
.collect(Collectors.toList());
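A hedged Python analog of the stream above, for readers more familiar with PySpark (df stands in for the same input dataframe; this is my own sketch, not from the gist):
# Pairs of ("__name", "name") where the unprefixed name also exists as a column
field_list = [(c, c[2:]) for c in df.columns
              if c.startswith("__") and c[2:] in df.columns]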
@prakashrd
prakashrd / pyspark_two_files.py
Created March 16, 2019 13:02
PySpark: read two files, join on a column, and print the resulting df
import sys
from pyspark.sql import SparkSession
# Import data types
from pyspark.sql.types import *
from pyspark.sql.functions import when, lit, col, udf
spark = SparkSession.builder.appName("Python spark read two files").getOrCreate()
accounts_file = sys.argv[1]
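The preview stops after reading the first argument; a hedged sketch of how the rest of the script presumably proceeds (the file format, the second argument, and the join column "account_id" are my assumptions):
transactions_file = sys.argv[2]

# Read both files as CSV with headers (assumed format)
accounts_df = spark.read.csv(accounts_file, header=True, inferSchema=True)
transactions_df = spark.read.csv(transactions_file, header=True, inferSchema=True)

# Join the two dataframes on a common column and print the result
result_df = accounts_df.join(transactions_df, on="account_id", how="inner")
result_df.show()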
@prakashrd
prakashrd / spark-join.scala
Created March 7, 2019 11:57
spark-joining-datasets
scala> val left = Seq((0), (1)).toDF("id")
left: org.apache.spark.sql.DataFrame = [id: int]
scala> left.join(right, "id").show
+---+-----+
| id|right|
+---+-----+
| 0| zero|
| 0| four|
+---+-----+
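The `right` dataframe never appears in the preview above; a hedged PySpark version of the same inner-join pattern, with `right` filled in from my own sample data rather than the gist's:
left = spark.createDataFrame([(0,), (1,)], ["id"])
right = spark.createDataFrame([(0, "zero"), (1, "one")], ["id", "right"])  # sample data, my assumption
left.join(right, "id").show()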
@prakashrd
prakashrd / regex.scala
Created July 5, 2018 08:45
Scala Regex
scala> val s = """(\d+)-(\d+)-(\d+).*""".r
s: scala.util.matching.Regex = (\d+)-(\d+)-(\d+).*
scala> val s(a,b,c) = "20-30-04 jfa"
a: String = 20
b: String = 30
c: String = 04
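A hedged Python equivalent of the Scala regex extractor above, using the re module:
import re

m = re.match(r"(\d+)-(\d+)-(\d+).*", "20-30-04 jfa")
a, b, c = m.groups()  # a='20', b='30', c='04'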
# List unique values in a DataFrame column
# h/t @makmanalp for the updated syntax!
df['Column Name'].unique()
# Convert Series datatype to numeric (will error if column has non-numeric values)
# h/t @makmanalp
pd.to_numeric(df['Column Name'])
# Convert Series datatype to numeric, changing non-numeric values to NaN
# h/t @makmanalp for the updated syntax!
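# The preview cuts off before the actual call; the coerce variant the comment
# above describes is presumably:
pd.to_numeric(df['Column Name'], errors='coerce')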
@prakashrd
prakashrd / squash_commits_after_push.sh
Last active May 18, 2018 07:39
Git: Squash commits after push
git checkout my_branch
git reset --soft HEAD~4
git commit
git push --force origin my_branch
## The above resets the last four commits you have pushed. Though this can be done on any branch,
## it is good practice to do it only on a feature branch.
@prakashrd
prakashrd / read_encoded_file.py
Created April 11, 2018 02:17
Read a UTF-16 encoded file with Python
import codecs
import json
json_data = json.load(codecs.open('url_entities.json', 'r', 'utf-16'))
json_rows = [r for r in json_data]
@prakashrd
prakashrd / 00-OozieWorkflowShellAction
Created July 5, 2017 07:34 — forked from airawat/00-OozieWorkflowShellAction
Oozie workflow with a shell action with CaptureOutput: counts the lines in a provided glob and writes the count to standard output. A subsequent email action emails the output of the shell action.
This gist includes the components of an Oozie workflow: scripts/code, sample data,
and commands. Oozie actions covered: shell action, email action.
Action 1: The shell action executes a shell script that does a line count for files in the
provided glob, and writes the line count to standard output
Action 2: The email action emails the output of action 1
Pictorial overview of job:
--------------------------