@prakashrd
prakashrd / pyspark_two_files.py
Created March 16, 2019 13:02
PySpark read two files join on a column and print the result df
import sys
from pyspark.sql import SparkSession
# Import data types
from pyspark.sql.types import *
from pyspark.sql.functions import when, lit, col, udf
spark = SparkSession.builder.appName("Python spark read two files").getOrCreate()
accounts_file = sys.argv[1]
other_file = sys.argv[2]  # second input file, assumed to be passed as the next argument
# Read both files as CSV with headers (the format is an assumption; adjust to the real inputs)
accounts_df = spark.read.csv(accounts_file, header=True, inferSchema=True)
other_df = spark.read.csv(other_file, header=True, inferSchema=True)
# Join on a shared column ("account_id" is a placeholder name) and print the result
result_df = accounts_df.join(other_df, on="account_id", how="inner")
result_df.show()
@prakashrd
prakashrd / scala_java8.java
Created April 2, 2019 11:44
Java8 Snippets
import scala.Tuple2;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
// Drop fields with the same name as the expression: collect every "__"-prefixed column
// of inputDF whose un-prefixed counterpart also exists among the dataframe's columns
List<String> fieldNames = Arrays.asList(inputDF.columns());
List<Tuple2<String, String>> fieldList = fieldNames.stream()
        .filter(fieldName -> fieldName.trim().startsWith("__"))
        .map(fieldName -> Tuple2.apply(fieldName, fieldName.substring(2)))
        .filter(tuple2 -> fieldNames.contains(tuple2._2()))
        .collect(Collectors.toList());
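A possible follow-up, not part of the gist: once fieldList is built, the "__"-prefixed duplicates can be dropped from the dataframe. Dataset, Row, and inputDF are assumed to come from the surrounding Spark Java code.
// Hypothetical usage sketch: drop each "__"-prefixed column whose un-prefixed
// counterpart already exists, so only one copy of the field remains.
Dataset<Row> cleanedDF = inputDF;
for (Tuple2<String, String> pair : fieldList) {
    cleanedDF = cleanedDF.drop(pair._1());
}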
/**
* Null out the columns specified in the meta data
*
* @param inputDataframe The input dataframe to apply nulling out on
* @param sparkSession An active spark session
* @param sourceEntity The source entity name.
* @param targetMetaData The target meta data object
* @return A dataframe after applying nulling out on fields specified
*/
public static Dataset<Row> applyNullingOut(Dataset<Row> inputDataframe, SparkSession sparkSession,
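The method signature above is cut off in the gist. As a rough illustration of the nulling-out idea its Javadoc describes, a minimal sketch might look like the following; the helper name, the plain List<String> of column names, and the typed null are assumptions, not the gist's actual implementation.
import static org.apache.spark.sql.functions.lit;
// Minimal sketch (assumed): replace the values of the listed columns with a typed null,
// keeping each column's original data type so the schema stays unchanged.
public static Dataset<Row> nullOutColumns(Dataset<Row> inputDataframe, List<String> columnsToNull) {
    Dataset<Row> result = inputDataframe;
    for (String columnName : columnsToNull) {
        if (Arrays.asList(result.columns()).contains(columnName)) {
            String dataType = result.schema().apply(columnName).dataType().catalogString();
            result = result.withColumn(columnName, lit(null).cast(dataType));
        }
    }
    return result;
}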
# To figure out duplicate values in a column and extract those rows
awk 'BEGIN { FS="," } { c[$2]++; l[$2,c[$2]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }' file.csv
# replace $2 with whichever column you want to check for duplicates
# The same code as above, expanded with comments
BEGIN { FS = "," }            # fields are comma separated
{
    c[$2]++                   # count occurrences of the column-2 value
    l[$2, c[$2]] = $0         # remember each full line, keyed by value and occurrence count
}
END {
    for (i in c)
        if (c[i] > 1)         # only values seen more than once are duplicates
            for (j = 1; j <= c[i]; j++)
                print l[i, j] # print every stored line for that value
}