Maria Karanasou (mkaranasou) - GitHub Gists
@mkaranasou
mkaranasou / seq_id_explode_test.py
Created May 13, 2021 07:48
Testing explode as a way to get sequential ids in a spark dataframe
if __name__ == '__main__':
    from pyspark import SparkConf
    from pyspark.sql import SparkSession, functions as F

    conf = SparkConf()
    spark = SparkSession.builder \
        .config(conf=conf) \
        .appName('Dataframe with Indexes') \
        .getOrCreate()
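The preview stops at the session setup. A minimal sketch of the explode idea itself (an assumption, not necessarily the gist's exact benchmark; requires Spark 2.4+ for F.sequence): build a single row holding the whole 0..n-1 range, then explode it into one sequential id per row.

# assumption: a small demo dataframe; explode turns the one-row
# range array into n rows with consecutive ids
df = spark.createDataFrame([('a',), ('b',), ('c',)], ['value'])
n = df.count()
ids = spark.range(1).select(
    F.explode(F.sequence(F.lit(0), F.lit(n - 1))).alias('seq_id')
)
ids.show()  # seq_id: 0, 1, 2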
@mkaranasou
mkaranasou / pyspark_shapley_calculation_cli.py
Last active March 20, 2021 11:17
CLI Shapley calculation with pyspark
import operator
import os
import time
import warnings
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F, SparkSession, types as T, Window
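The preview shows only the imports. A hypothetical entry point for such a CLI (the argument names are assumptions; the real script may differ):

import argparse

def main():
    # hypothetical arguments, for illustration only
    parser = argparse.ArgumentParser(
        description='Shapley values for a dataset-model pair'
    )
    parser.add_argument('--data-path', required=True,
                        help='path to the input data, e.g. a parquet file')
    parser.add_argument('--row-id', required=True,
                        help='id of the row of interest')
    args = parser.parse_args()
    ...

if __name__ == '__main__':
    main()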
@mkaranasou
mkaranasou / z_marginal_contribution.py
Created March 20, 2021 10:59
Output for the calculation of each z's marginal contribution
+----+-------------------------------------+----------+---------------------+
|id  |features                             |prediction|marginal_contribution|
+----+-------------------------------------+----------+---------------------+
|1677|[0.349,0.141,0.162,0.162,0.162,0.349]|0.0       |null                 |
|1677|[0.886,0.141,0.162,0.162,0.162,0.349]|0.0       |0.0                  |
|2250|[0.106,0.423,0.777,0.777,0.777,0.886]|0.0       |null                 |
|2250|[0.886,0.423,0.777,0.777,0.777,0.886]|0.0       |0.0                  |
|2453|[0.801,0.423,0.777,0.777,0.87,0.886] |0.0       |null                 |
+----+-------------------------------------+----------+---------------------+
only showing top 5 rows
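Each id appears twice: once with the feature of interest taken from the sampled row z (x-j) and once taken from the row of interest (x+j), and the marginal contribution is the difference between the two predictions. A minimal sketch of that step (the column names and window spec are assumptions based on the output above):

from pyspark.sql import functions as F, Window

# pair the two rows per id in their original order; the second prediction
# minus the first gives f(x+j) - f(x-j), null on each pair's first row
win = Window.partitionBy('id').orderBy(F.monotonically_increasing_id())
df = df.withColumn(
    'marginal_contribution',
    F.col('prediction') - F.lag('prediction', 1).over(win)
)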
@mkaranasou
mkaranasou / output.py
Created March 20, 2021 10:51
Sample output of x-j/x+j and the exploded dataframe for the Shapley values calculation algorithm
Row: Row(id=964, features=DenseVector([0.886, 0.423, 0.777, 0.777, 0.777, 0.886]))
Calculating SHAP values for "f0"...
+----+-----+-----+-----+-----+-----+-----+-------------------------------------+-----+-----------+------------------------+------------------------------------------------------------------------------+
|id |f0 |f1 |f2 |f3 |f4 |f5 |features |label|is_selected|features_permutations |x |
+----+-----+-----+-----+-----+-----+-----+-------------------------------------+-----+-----------+------------------------+------------------------------------------------------------------------------+
|1677|0.349|0.141|0.162|0.162|0.162|0.349|[0.349,0.141,0.162,0.162,0.162,0.349]|1 |false |[f5, f2, f1, f4, f3, f0]|[[0.349,0.141,0.162,0.162,0.162,0.349], [0.886,0.141,0.162,0.162,0.162,0.349]]|
|2250|0.106|0.938|0.434|0.434|0.434|0.106|[0.106,0.938,0.434,0.434,0.434,0.106]|0 |false
@mkaranasou
mkaranasou / pyspark_shapley_values_full_example_random_data.py
Last active March 20, 2021 11:17
A full example of Shapley values calculation with pyspark on random data, including their benefit to the model
import random
import numpy as np
import pyspark
from shapley_spark_calculation import \
calculate_shapley_values, select_row
from pyspark.ml.classification import RandomForestClassifier, LinearSVC, \
DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
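The preview shows only the imports. A sketch of how the example might continue (assumptions: six features f0..f5, a random binary label, and a session from get_spark_session, which lives in shapley_spark_calculation.py below; not necessarily the gist's exact setup):

spark = get_spark_session()
feature_names = ['f{}'.format(i) for i in range(6)]
data = [
    tuple(random.random() for _ in feature_names) + (random.randint(0, 1),)
    for _ in range(1000)
]
df = spark.createDataFrame(data, feature_names + ['label'])

# pack the feature columns into the single vector column that pyspark
# estimators expect, then fit a classifier on it
assembler = VectorAssembler(inputCols=feature_names, outputCol='features')
df = assembler.transform(df)
model = RandomForestClassifier(labelCol='label').fit(df)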
@mkaranasou
mkaranasou / pyspark_xj_calculation.py
Last active March 19, 2021 17:15
Calculating xj for the Shapley values calculation
# broadcast the row of interest and ordered feature names
ROW_OF_INTEREST_BROADCAST = spark.sparkContext.broadcast(
    row_of_interest[features_col]
)
ORDERED_FEATURE_NAMES = spark.sparkContext.broadcast(feature_names)

# set up the udf - x-j and x+j need to be calculated for every row
def calculate_x(
        feature_j, z_features, curr_feature_perm
):
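The preview cuts off at the signature. A sketch of a body that would fit it, inferred from the sample output above (the convention here, that features from j onward in the current permutation are replaced from the row of interest for x+j, and only those strictly after j for x-j, is an assumption):

def calculate_x(feature_j, z_features, curr_feature_perm):
    x_interest = ROW_OF_INTEREST_BROADCAST.value
    ordered_features = ORDERED_FEATURE_NAMES.value
    x_minus_j = list(z_features)
    x_plus_j = list(z_features)
    j_pos = curr_feature_perm.index(feature_j)
    for i, f in enumerate(curr_feature_perm[j_pos:]):
        f_index = ordered_features.index(f)
        x_plus_j[f_index] = x_interest[f_index]
        if i > 0:  # strictly after j in this permutation
            x_minus_j[f_index] = x_interest[f_index]
    # order matches the exploded `x` column in the sample output: [x-j, x+j]
    return [Vectors.dense(x_minus_j), Vectors.dense(x_plus_j)]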
@mkaranasou
mkaranasou / pyspark_get_feature_permutations.py
Created March 19, 2021 16:58
Get feature permutations, one for each row, using pyspark
import pyspark
from pyspark.sql import functions as F


def get_features_permutations(
        df: pyspark.sql.DataFrame,
        feature_names: list,
        output_col='features_permutations'
):
    """
@mkaranasou
mkaranasou / shapley_spark_calculation.py
Last active January 5, 2022 10:23
Calculate the Shapley marginal contribution for each feature of a given dataset-model pair
import os
from psutil import virtual_memory
from pyspark import SparkConf
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F, SparkSession, types as T, Window


def get_spark_session():
    """
    With an effort to optimize memory and partitions
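The docstring is cut off. A sketch of how the body might continue (the ~80% RAM headroom and the partition multiple are assumptions, not the gist's exact values):

    """
    mem = virtual_memory()
    conf = SparkConf()
    # size the driver from total RAM, leaving ~20% headroom
    conf.set('spark.driver.memory',
             '{}g'.format(int(mem.total * 0.8 / 1024 ** 3)))
    # default the shuffle partitions to a small multiple of the cores
    conf.set('spark.sql.shuffle.partitions', str(os.cpu_count() * 2))
    return SparkSession.builder \
        .config(conf=conf) \
        .appName('Shapley Values Calculation') \
        .getOrCreate()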
@mkaranasou
mkaranasou / pyspark_column_dtypes.py
Created February 5, 2021 12:09
Get the data type of a spark dataframe column.
df = spark.createDataFrame(...)
# df.dtypes is a list of (column_name, dtype) tuples
dict(df.dtypes).get('features')
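A quick usage sketch with concrete (illustrative) columns:

df = spark.createDataFrame([(1, 'a')], ['id', 'value'])
dict(df.dtypes).get('id')     # 'bigint'
dict(df.dtypes).get('value')  # 'string'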
@mkaranasou
mkaranasou / psql_alter_constraint.sql
Created April 9, 2020 11:23
Alter a Postgres check constraint
ALTER TABLE IF EXISTS table_y2020_w15
    DROP CONSTRAINT table_y2020_w15_created_at_check,
    ADD CONSTRAINT table_y2020_w15_created_at_check
        CHECK (created_at >= '2020-04-06 00:00:00'
           AND created_at <= '2020-04-12 23:59:59.999999');