

@tomron
tomron / parquet_to_json.py
Created Nov 17, 2016
Converts a Parquet file to JSON using Spark
# import spark, set spark context
from pyspark import SparkContext, SparkConf
from pyspark.sql.context import SQLContext
import sys
import os

if len(sys.argv) == 1:
    sys.stderr.write("Must enter input file to convert\n")
    sys.exit(1)
input_file = sys.argv[1]
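The preview above cuts off before the conversion itself. A minimal sketch of the full flow, using the modern `SparkSession` entry point instead of the gist's `SQLContext` (the `json_output_path` helper and the app name are hypothetical additions, not part of the original gist):

```python
import os
import sys


def json_output_path(parquet_path):
    """Derive an output path by swapping the .parquet suffix for .json."""
    base, _ = os.path.splitext(parquet_path)
    return base + ".json"


def convert(parquet_path):
    # pyspark imported lazily so the helper above works without Spark installed
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet_to_json").getOrCreate()
    df = spark.read.parquet(parquet_path)
    # each output part file contains one JSON object per line
    df.write.json(json_output_path(parquet_path))
    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) == 1:
        sys.stderr.write("Must enter input file to convert\n")
        sys.exit(1)
    convert(sys.argv[1])
```

Note that `df.write.json` produces a directory of newline-delimited JSON part files rather than a single document.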
@tomron
tomron / seasonal_decompose_plotly.py
Last active Jul 3, 2021
A nicer seasonal decompose chart using Plotly.
from statsmodels.tsa.seasonal import seasonal_decompose
import plotly.tools as tls
def plotSeasonalDecompose(
        x,
        model='additive',
        filt=None,
        period=None,
        two_sided=True,
        extrapolate_trend=0,
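The preview stops mid-signature. A hedged sketch of what the finished helper plausibly does, drawing each decomposition component in its own Plotly subplot (the function body and subplot layout are a guess, not the gist's actual code):

```python
# Attribute names on statsmodels' DecomposeResult
COMPONENTS = ["observed", "trend", "seasonal", "resid"]


def plot_seasonal_decompose(x, model="additive", period=None):
    # heavy imports kept inside the function so the constant above
    # is usable without statsmodels/plotly installed
    from statsmodels.tsa.seasonal import seasonal_decompose
    from plotly.subplots import make_subplots
    import plotly.graph_objects as go

    result = seasonal_decompose(x, model=model, period=period)
    fig = make_subplots(rows=len(COMPONENTS), cols=1, subplot_titles=COMPONENTS)
    for i, name in enumerate(COMPONENTS, start=1):
        series = getattr(result, name)
        fig.add_trace(go.Scatter(x=series.index, y=series, name=name),
                      row=i, col=1)
    return fig
```

Stacking the four panels in one figure with shared titles is what makes this "nicer" than the default matplotlib `result.plot()` output.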
@tomron
tomron / spark_aws_lambda.py
Created Feb 27, 2016
Example Python code that submits a Spark job as an EMR step to an AWS EMR cluster from an AWS Lambda function
import sys
import time
import boto3
def lambda_handler(event, context):
    conn = boto3.client("emr")
    # choose the first cluster that is Running or Waiting;
    # alternatively, select by name or use a known cluster id
    clusters = conn.list_clusters()
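The preview ends before the step is built. A sketch of the remaining pieces, using the standard `add_job_flow_steps` call with `command-runner.jar` (the script path and step name below are made-up examples, and this is an assumed shape, not the gist's exact code):

```python
def spark_step(script_s3_path, name="spark-step"):
    """Build an EMR step dict that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", script_s3_path],
        },
    }


def submit(script_s3_path):
    import boto3  # imported here so the pure step builder works without AWS deps

    conn = boto3.client("emr")
    clusters = conn.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]
    if not clusters:
        raise RuntimeError("no running or waiting EMR cluster found")
    cluster_id = clusters[0]["Id"]
    return conn.add_job_flow_steps(JobFlowId=cluster_id,
                                   Steps=[spark_step(script_s3_path)])
```

Filtering with `ClusterStates` in `list_clusters` avoids scanning terminated clusters client-side.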
@tomron
tomron / plotly_back_to_back_chart.py
Last active May 31, 2021
Back-to-back bar chart with Plotly
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
women_pop = np.array([5., 30., 45., 22.])
men_pop = np.array([5., 25., 50., 20.])
y = list(range(len(women_pop)))

fig = go.Figure(data=[
    go.Bar(y=y, x=women_pop, orientation='h', name="women", base=0),
@tomron
tomron / plotly_bar_chart_links.py
Created Nov 17, 2020
Add links to Plotly bar chart
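This gist's code isn't shown in this capture. A common way to get clickable labels on a Plotly bar chart is to wrap the tick labels in the `<a>` pseudo-HTML that Plotly text supports; a sketch of that technique (the labels and URLs below are made-up examples, and this may not be how the gist does it):

```python
labels = ["pandas", "numpy"]
urls = ["https://pandas.pydata.org", "https://numpy.org"]
# Plotly renders a limited HTML subset in text, including <a href="...">
ticktext = ['<a href="{}">{}</a>'.format(u, t) for t, u in zip(labels, urls)]


def build_figure(values):
    import plotly.graph_objects as go  # lazy import; plotly optional here

    fig = go.Figure(go.Bar(x=list(range(len(labels))), y=values))
    fig.update_xaxes(tickvals=list(range(len(labels))), ticktext=ticktext)
    return fig
```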
@tomron
tomron / spark_knn_approximation.py
Created Nov 19, 2015
A naive approximation of the k-nn algorithm (k-nearest neighbors) in PySpark. Approximation quality can be controlled by the number of repartitions.
from __future__ import print_function
import sys
from math import sqrt
import argparse
from collections import defaultdict
from random import randint
from pyspark import SparkContext
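Only the imports survive in this preview. A sketch of the core idea the description hints at: find exact neighbors within each partition, reshuffle, and repeat, so that more repartitions raise the chance every true neighbor is examined (this is a guess at the approach, not the gist's actual code; `approximate_knn` and its parameters are hypothetical):

```python
from math import sqrt


def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def local_knn(points, query, k):
    """Exact k-nn within one partition's points."""
    return sorted(points, key=lambda p: euclidean(p, query))[:k]


def approximate_knn(rdd, query, k, num_repartitions):
    # collect per-partition winners across several random reshuffles,
    # then take the best k of all candidates
    candidates = []
    for _ in range(num_repartitions):
        parts = rdd.repartition(rdd.getNumPartitions()) \
                   .mapPartitions(lambda it: local_knn(list(it), query, k))
        candidates.extend(parts.collect())
    return local_knn(candidates, query, k)
```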
@tomron
tomron / networkx_post.py
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Multigraph example
G = nx.MultiGraph()
G.add_nodes_from([1, 2, 3])
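The preview stops right after the nodes are added. A short continuation showing what makes `MultiGraph` different from `Graph`: parallel edges between the same pair of nodes are kept, each with its own data (the specific edges and weights are illustrative, not from the gist):

```python
import networkx as nx

G = nx.MultiGraph()
G.add_nodes_from([1, 2, 3])
G.add_edge(1, 2, weight=3)
G.add_edge(1, 2, weight=5)  # a second, parallel edge between 1 and 2
G.add_edge(2, 3, weight=1)

parallel = G.number_of_edges(1, 2)  # both 1-2 edges are retained
total = G.number_of_edges()
```

In a plain `nx.Graph`, the second `add_edge(1, 2, ...)` would overwrite the first edge's data instead of adding a new edge.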
@tomron
tomron / sprt.py
Last active Apr 30, 2019
Sequential probability ratio test implementation (https://en.wikipedia.org/wiki/Sequential_probability_ratio_test) for exponential distribution. Usage - `t = sprt.SPRT(0.05, 0.8, 1, 2); t.test([1, 2, 3, 4, 5])`
import numpy as np
"""
Implements Sequential probability ratio test
https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
"""
class SPRT:
    def __init__(self, alpha, beta, mu0, mu1):
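The class body is truncated in this preview. A hedged completion for the exponential case, using Wald's standard boundaries on the cumulative log-likelihood ratio (here `beta` is read as the type II error probability, which may differ from the gist's convention, and the return values are my own labels):

```python
from math import log


class SPRT:
    def __init__(self, alpha, beta, mu0, mu1):
        # Wald's decision boundaries on the cumulative log-likelihood ratio
        self.upper = log((1 - beta) / alpha)   # cross it -> accept H1
        self.lower = log(beta / (1 - alpha))   # cross it -> accept H0
        self.mu0, self.mu1 = mu0, mu1

    def test(self, xs):
        llr = 0.0
        for x in xs:
            # log f1(x)/f0(x) for Exp with mean mu: f(x) = exp(-x/mu) / mu
            llr += log(self.mu0 / self.mu1) + x * (1 / self.mu0 - 1 / self.mu1)
            if llr >= self.upper:
                return "H1"
            if llr <= self.lower:
                return "H0"
        return "continue"
```

The test stops at the first observation whose cumulative ratio crosses a boundary; if neither boundary is crossed, more data is needed.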
@tomron
tomron / sprt.py
Created Apr 30, 2019
Sequential probability ratio test
import numpy as np
"""
Implements Sequential probability ratio test
https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
"""
class SPRT:
    def __init__(self, alpha, beta, mu0, mu1):
@tomron
tomron / welchtest.py
Created Aug 6, 2018
welchtest.py - based on the lazy programmer ttest implementation (https://github.com/lazyprogrammer/machine_learning_examples/blob/master/ab_testing/ttest.py). The numbers are not exactly the same, but I suspect it has to do with rounding issues
import pandas as pd
import numpy as np
from scipy import stats
input_file = 'advertisement_clicks.csv'
df = pd.read_csv(input_file)
a = df[df['advertisement_id'] == 'A']['action'].tolist()
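The preview cuts off before the test statistic. A sketch of the Welch computation on two synthetic samples, checked against SciPy's `ttest_ind(..., equal_var=False)` (in the gist the lists come from advertisement_clicks.csv, which isn't reproduced here):

```python
import numpy as np
from scipy import stats

a = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
b = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0])

# Welch's t statistic: unpooled sample variances (ddof=1)
va, vb = a.var(ddof=1), b.var(ddof=1)
na, nb = len(a), len(b)
t_manual = (a.mean() - b.mean()) / np.sqrt(va / na + vb / nb)

# Welch-Satterthwaite degrees of freedom
df_welch = (va / na + vb / nb) ** 2 / (
    (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

# SciPy computes the same quantity when equal_var=False
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
```

Small discrepancies between a hand-rolled version and SciPy usually trace back to the degrees-of-freedom formula or to pooling the variances, not to rounding alone.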