
@rainsunny
rainsunny / amortize_over.py
Created November 6, 2019 01:51 — forked from wzyboy/amortize_over.py
amortize_over beancount plugin
# Copyright (c) 2017 Cary Kempston
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# Used to flatten json object while using pandas
from pandas.io.json import json_normalize

def flatten_json(y):
    out = {}

    def __flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                __flatten(x[a], name + a + '_')
        elif type(x) is list:
            for i, a in enumerate(x):
                __flatten(a, name + str(i) + '_')
        else:
            out[name[:-1]] = x

    __flatten(y)
    return out
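A quick usage sketch (the nested dict and key names here are invented for illustration): flatten a nested object, then hand the flat dict to pandas.

import pandas as pd

nested = {"id": 1, "meta": {"name": "a", "tags": ["x", "y"]}}
flat = flatten_json(nested)
# -> {'id': 1, 'meta_name': 'a', 'meta_tags_0': 'x', 'meta_tags_1': 'y'}
df = pd.DataFrame([flat])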
@rainsunny
rainsunny / spark_pivot.md
Last active July 5, 2018 06:17
Pivot function: turn DataFrame column into rows

Spark DataFrame pivot functions

Turning column values into rows

Before

ntfLog.groupby("auth_method", "auth_result").agg(F.count("*").alias("cnt")) \
    .sort("auth_method", "auth_result").show(20, False)
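The preview cuts off before the "after" step. A minimal sketch of the pivot itself, assuming the same ntfLog DataFrame as above: group by one key and pivot the other, so each distinct auth_result value becomes its own column.

from pyspark.sql import functions as F

ntfLog.groupBy("auth_method") \
    .pivot("auth_result") \
    .agg(F.count("*")) \
    .sort("auth_method") \
    .show(20, False)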
@rainsunny
rainsunny / apply.py
Last active June 7, 2018 02:52 — forked from rjurney/apply.py
Plot a pyspark.RDD.histogram as a pyplot histogram (via bar)
%matplotlib inline
buckets = [-87.0, -15, 0, 30, 120]
rdd_histogram_data = ml_bucketized_features\
.select("ArrDelay")\
.rdd\
.flatMap(lambda x: x)\
.histogram(buckets)
create_hist(rdd_histogram_data)
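create_hist is not shown in the preview. A minimal sketch of what it has to do: RDD.histogram returns a (bucket_boundaries, counts) pair, so draw one bar per bucket with widths taken from the boundaries.

import matplotlib.pyplot as plt

def create_hist(rdd_histogram_data):
    """Render the (buckets, counts) pair from RDD.histogram as a bar chart."""
    buckets, counts = rdd_histogram_data
    widths = [buckets[i + 1] - buckets[i] for i in range(len(buckets) - 1)]
    return plt.bar(buckets[:-1], counts, width=widths, align='edge')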
@rainsunny
rainsunny / helper.py
Last active August 5, 2020 06:32
Plot histograms using Python: how to draw histograms, and 2-D histograms (heat maps), for inspecting the distribution of variables
# Also includes a helper for pulling data back from Spark to the driver for plotting
import numpy as np

def toArr(df, col, dtype=np.int32):
    """
    Collect one column of a Spark DataFrame back to the driver as a numpy.ndarray.
    df: the source DataFrame
    col: the target column name
    dtype: the dtype of the column
    return: the column's data as an np.ndarray
    """
    # collect the single column locally and convert to the requested dtype
    return np.array(df.select(col).rdd.flatMap(lambda x: x).collect(), dtype=dtype)
@rainsunny
rainsunny / spark_withColumns.md
Created December 4, 2017 12:17
Spark/Scala repeated calls to withColumn() using the same function on multiple columns [foldLeft]

Suppose you need to apply the same function to multiple columns in one DataFrame. One straightforward way is like this:

val newDF = oldDF.withColumn("colA", func("colA")).withColumn("colB", func("colB")).withColumn("colC", func("colC"))

If you want to save some typing, you can try this:

  1. Use select with varargs including *:
import spark.implicits._
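The gist itself continues in Scala; as a hedged PySpark rendering of the foldLeft idea from the title, fold withColumn over the column list with functools.reduce (func, oldDF, and the column names are placeholders):

from functools import reduce
from pyspark.sql import functions as F

def func(c):
    return F.upper(F.col(c))   # stand-in for whatever `func` actually does

cols = ["colA", "colB", "colC"]
# one withColumn call per column, threaded through the fold
newDF = reduce(lambda df, c: df.withColumn(c, func(c)), cols, oldDF)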
@rainsunny
rainsunny / spark_dataframe_explode.md
Last active October 19, 2017 07:39
Derive multiple columns from a single column in a Spark DataFrame

A UDF can return only a single column at a time. There are two different ways you can overcome this limitation:

Return a column of complex type.

The most general solution is a StructType, but you can consider ArrayType or MapType as well.

import org.apache.spark.sql.functions.udf

val df = Seq(
  (1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")
).toDF("x", "y", "z")
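The preview truncates before the UDF itself. A sketch of the same struct-returning technique in PySpark (the foo/bar fields and the doubling/tripling are invented for illustration): a UDF that returns a struct, expanded into separate columns with select("vals.*").

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([
    StructField("foo", DoubleType()),
    StructField("bar", DoubleType()),
])
make_struct = F.udf(lambda y: (y * 2.0, y * 3.0), schema)

# one UDF call yields a struct column; ".*" fans it out into real columns
df.withColumn("vals", make_struct("y")) \
  .select("x", "vals.*") \
  .show()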
@rainsunny
rainsunny / pandas_melt.py
Created September 5, 2017 09:19
Turn certain columns to key-value rows ( and the reverse )
"""
Turn this
location name Jan-2010 Feb-2010 March-2010
A "test" 12 20 30
B "foo" 18 20 25
into this
location name Date Value
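A runnable sketch of the transformation described above, using pandas.melt (and pivot_table for the reverse); the data is the sample from the docstring.

import pandas as pd

df = pd.DataFrame({
    "location": ["A", "B"],
    "name": ["test", "foo"],
    "Jan-2010": [12, 18],
    "Feb-2010": [20, 20],
    "March-2010": [30, 25],
})

# wide -> long: keep location/name as identifiers, fold month columns into rows
long_df = df.melt(id_vars=["location", "name"], var_name="Date", value_name="Value")

# the reverse: back to one column per date
wide_df = (long_df.pivot_table(index=["location", "name"], columns="Date", values="Value")
           .reset_index())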
@rainsunny
rainsunny / expand_column.py
Created September 5, 2017 03:57
pandas: expand data frame column into multiple rows
"""
Turn this
days name
[1,3,5,7] John
into this
days name
1 John
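On pandas 0.25+ this is a one-liner with DataFrame.explode (the gist predates it, so its own approach likely differs); a sketch with the docstring's sample data:

import pandas as pd

df = pd.DataFrame({"days": [[1, 3, 5, 7]], "name": ["John"]})
out = df.explode("days").reset_index(drop=True)
#   days  name
# 0    1  John
# 1    3  John
# 2    5  John
# 3    7  John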