@rainsunny
rainsunny / bash_snipts.sh
Last active August 11, 2017 09:40
bash snippets
##### get the script's directory as an absolute path
# Add `> /dev/null 2>&1` to silence any output from `cd` (e.g. when CDPATH is set)
base_dir="$(cd "$(dirname "$0")" > /dev/null 2>&1 && pwd)"
# other options
current_dir=$(pwd)        # current working directory
base_dir=$(dirname "$0")  # the script's directory, as a relative path
@rainsunny
rainsunny / AppShutdown.java
Created August 11, 2017 12:38
Java application graceful shutdown
ESRunner runner = new ESRunner();
// Gracefully shut down on JVM exit (SIGTERM, Ctrl-C, ...)
final Thread mainThread = Thread.currentThread();
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    try {
        runner.shutdown();    // clean-up work
        mainThread.join(200); // wait up to 200 ms for the main thread to stop
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // join() requires this catch; preview was truncated here
    }
}));
@rainsunny
rainsunny / expand_column.py
Created September 5, 2017 03:57
pandas: expand data frame column into multiple rows
"""
Turn this
days name
[1,3,5,7] John
into this
days name
1 John
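The preview cuts off before the implementation. A minimal sketch of the same transformation; since pandas 0.25, DataFrame.explode does exactly this (the sample frame is an assumption):

import pandas as pd

df = pd.DataFrame({"days": [[1, 3, 5, 7]], "name": ["John"]})

# explode() emits one row per list element, repeating the scalar columns
expanded = df.explode("days").reset_index(drop=True)
print(expanded)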
@rainsunny
rainsunny / pandas_melt.py
Created September 5, 2017 09:19
Turn certain columns into key-value rows (and the reverse)
"""
Turn this
location name Jan-2010 Feb-2010 March-2010
A "test" 12 20 30
B "foo" 18 20 25
into this
location name Date Value
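The preview stops before the code. A minimal sketch of both directions in pandas (the frame mirrors the docstring; all names are illustrative):

import pandas as pd

df = pd.DataFrame({
    "location": ["A", "B"],
    "name": ["test", "foo"],
    "Jan-2010": [12, 18],
    "Feb-2010": [20, 20],
    "March-2010": [30, 25],
})

# wide -> long: month columns become Date/Value rows
long_df = df.melt(id_vars=["location", "name"], var_name="Date", value_name="Value")

# long -> wide: the reverse, via pivot_table
wide_df = (long_df
           .pivot_table(index=["location", "name"], columns="Date", values="Value")
           .reset_index())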
@rainsunny
rainsunny / spark_dataframe_explode.md
Last active October 19, 2017 07:39
Derive multiple columns from a single column in a Spark DataFrame

A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:

Return a column of complex type.

The most general solution is a StructType, but you can consider ArrayType or MapType as well.

import org.apache.spark.sql.functions.udf

val df = Seq(
  (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
).toDF("x", "y", "z") // completion of the truncated preview; assumes spark.implicits._ in scope
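The preview ends here. A PySpark rendering of the complex-type idea, sketched under the assumption of a running SparkSession (field and column names are illustrative):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 3.0, "a"), (2, -1.0, "b"), (3, 0.0, "c")], ["x", "y", "z"])

# The UDF returns a single struct column; selecting "d.*" then flattens it into real columns.
schema = StructType([
    StructField("doubled", DoubleType(), False),
    StructField("sign", StringType(), False),
])
derive = F.udf(lambda y: (y * 2, "pos" if y >= 0 else "neg"), schema)

df.withColumn("d", derive("y")).select("x", "d.*").show()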
@rainsunny
rainsunny / spark_withColumns.md
Created December 4, 2017 12:17
Spark/Scala repeated calls to withColumn() using the same function on multiple columns [foldLeft]

Suppose you need to apply the same function to multiple columns in one DataFrame. One straightforward way looks like this:

val newDF = oldDF
  .withColumn("colA", func("colA"))
  .withColumn("colB", func("colB"))
  .withColumn("colC", func("colC"))

If you want to save some typing, you can try one of these:

  1. Use select with varargs including *:
import spark.implicits._
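The preview truncates inside option 1. The title points at foldLeft; a sketch of that approach in PySpark, using functools.reduce as the foldLeft analogue (assumes `func` maps a Column to a Column, and `old_df` is the input frame):

from functools import reduce
from pyspark.sql import functions as F

cols = ["colA", "colB", "colC"]
# Fold over the column names, threading the DataFrame through withColumn at each step
new_df = reduce(lambda acc, c: acc.withColumn(c, func(F.col(c))), cols, old_df)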
@rainsunny
rainsunny / helper.py
Last active August 5, 2020 06:32
Plot histograms using Python: how to draw histograms and 2-D histograms (heat maps) to inspect the distribution of a variable
# Also includes a way to pull data back from Spark to the local machine for plotting
import numpy as np

def toArr(df, col, dtype=np.int32):
    """
    Collect one column of a Spark DataFrame to the driver as a numpy.ndarray.
    df: the source DataFrame
    col: the target column name
    dtype: the numpy dtype of the result
    return: the column's data, as an np.ndarray
    """
    # Body reconstructed from the docstring (the preview is truncated)
    return np.array([row[col] for row in df.select(col).collect()], dtype=dtype)
@rainsunny
rainsunny / apply.py
Last active June 7, 2018 02:52 — forked from rjurney/apply.py
Plot a pyspark.RDD.histogram as a pyplot histogram (via bar)
%matplotlib inline
buckets = [-87.0, -15, 0, 30, 120]
rdd_histogram_data = ml_bucketized_features \
    .select("ArrDelay") \
    .rdd \
    .flatMap(lambda x: x) \
    .histogram(buckets)
create_hist(rdd_histogram_data)
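The preview omits create_hist itself. A minimal sketch consistent with the title, rendering the (bucket_edges, counts) pair returned by RDD.histogram via pyplot's bar:

import matplotlib.pyplot as plt

def create_hist(rdd_histogram_data):
    # RDD.histogram returns (bucket_edges, counts) with len(counts) == len(edges) - 1
    edges, counts = rdd_histogram_data
    widths = [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]
    plt.bar(edges[:-1], counts, width=widths, align="edge", edgecolor="black")
    plt.xlabel("ArrDelay")
    plt.ylabel("count")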
@rainsunny
rainsunny / spark_pivot.md
Last active July 5, 2018 06:17
Pivot function: turn the values of a DataFrame column into columns

Spark DataFrame pivot functions

Turning the distinct values of one column into separate columns

Before

ntfLog.groupby("auth_method", "auth_result").agg(F.count("*").alias("cnt")) \
    .sort("auth_method", "auth_result").show(20, False)
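The preview ends before the "After" step. A hedged sketch of the pivot itself, assuming the ntfLog DataFrame above and `from pyspark.sql import functions as F`:

# Distinct auth_result values become columns, with one row per auth_method
ntfLog.groupby("auth_method").pivot("auth_result").agg(F.count("*")) \
    .sort("auth_method").show(20, False)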