Maria Karanasou (mkaranasou)
GitHub Gists
mkaranasou / python_yaml_environment_variables.py
Last active May 14, 2024 16:33
Python: load a YAML configuration file and resolve any environment variables
import os
import re
import yaml

def parse_config(path=None, data=None, tag='!ENV'):
    """
    Load a yaml configuration file and resolve any environment variables.
    The environment variables must have !ENV before them and be in this
    format to be parsed: ${VAR_NAME}.
    """
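The preview cuts off before the interesting part: registering a custom PyYAML tag whose constructor expands `${VAR}` from the environment. A condensed, self-contained sketch of that approach (assuming PyYAML is installed; falling back to the literal name when a variable is unset is my choice, not necessarily the gist's):

```python
import os
import re
import yaml  # PyYAML

# Matches ${VAR_NAME} anywhere inside a scalar value.
_pattern = re.compile(r'.*?\$\{(\w+)\}.*?')

def parse_config(path=None, data=None, tag='!ENV'):
    """Load YAML from `path` or `data`, expanding ${VAR} in `tag`-tagged scalars."""
    loader = yaml.SafeLoader
    # Scalars that contain ${...} receive the tag implicitly, so the
    # !ENV prefix may even be omitted in the document itself.
    loader.add_implicit_resolver(tag, _pattern, None)

    def constructor(loader_, node):
        value = loader_.construct_scalar(node)
        for var in _pattern.findall(value):
            # Leave the variable name in place if it is not set.
            value = value.replace('${%s}' % var, os.environ.get(var, var))
        return value

    loader.add_constructor(tag, constructor)
    if path:
        with open(path) as f:
            return yaml.load(f, Loader=loader)
    return yaml.load(data, Loader=loader)
```

With `DB_HOST=example.com` in the environment, `parse_config(data='url: !ENV http://${DB_HOST}/api')` yields `{'url': 'http://example.com/api'}`.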
mkaranasou / pyspark_parallel_read_from_db.py
Last active March 14, 2023 05:38
Parallel read from db with pyspark
import os
q = '(select min(id) as min, max(id) as max from table_name where condition) as bounds'
user = 'postgres'
password = 'secret'
db_driver = 'org.postgresql.Driver'
host = '127.0.0.1'
db_url = f'jdbc:postgresql://{host}:5432/dbname?user={user}&password={password}'
partitions = os.cpu_count() * 2 # a good starting point
conn_properties = {
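The preview above computes the min/max bounds and a partition count; Spark then parallelizes the JDBC read when it is given `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions`, issuing one range query per partition. A sketch of the option-building step (the helper name is mine; actually running the read needs a live Postgres and the JDBC driver jar on the classpath):

```python
import os

def partitioned_jdbc_options(url, table, column, lower, upper, num_partitions=None):
    """Build the options that make spark.read split a JDBC read into
    `numPartitions` parallel range queries over [lower, upper] on `column`
    (which must be numeric, date, or timestamp)."""
    num_partitions = num_partitions or os.cpu_count() * 2  # a good starting point
    return {
        'url': url,
        'dbtable': table,
        'driver': 'org.postgresql.Driver',
        'partitionColumn': column,
        'lowerBound': str(lower),
        'upperBound': str(upper),
        'numPartitions': str(num_partitions),
    }
```

The gist's `q` subquery supplies the bounds, so the read becomes roughly `spark.read.format('jdbc').options(**partitioned_jdbc_options(db_url, 'table_name', 'id', bounds['min'], bounds['max'])).load()`.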
mkaranasou / shapley_spark_calculation.py
Last active January 5, 2022 10:23
Calculate the Shapley marginal contribution for each feature of a given dataset-model pair
import os
from psutil import virtual_memory
from pyspark import SparkConf
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F, SparkSession, types as T, Window
def get_spark_session():
    """
    With an effort to optimize memory and partitions
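The Shapley value of a feature is its average marginal contribution to the prediction across feature orderings. The core estimator can be sketched in plain Python via Monte Carlo permutation sampling (this is the standard technique, not the gist's Spark implementation, and `shapley_values` is my illustrative name):

```python
import random

def shapley_values(model, x, baseline, n_samples=200, seed=42):
    """Monte Carlo Shapley estimate: average, over random feature orderings,
    of the change in model(z) when feature j switches from its baseline
    value to its value in x."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        rng.shuffle(order)
        z = list(baseline)
        prev = model(z)
        for j in order:
            z[j] = x[j]           # flip feature j to the investigated row's value
            cur = model(z)
            phi[j] += cur - prev  # marginal contribution of feature j
            prev = cur
    return [p / n_samples for p in phi]
```

For an additive model such as `lambda z: 2 * z[0] + 3 * z[1]`, the estimate is exact: each feature's value equals its coefficient times its change from the baseline, regardless of ordering.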
mkaranasou / pyspark_scikit_isolation_forest.py
Last active October 27, 2021 08:29
How to use Scikit's Isolation Forest in Pyspark - udf and broadcast variables
import numpy as np
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F, types as T
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
conf = SparkConf()
spark_session = SparkSession.builder \
    .config(conf=conf) \
    .appName('test') \
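The pattern is to fit the scikit-learn model once on the driver, broadcast the fitted estimator to the executors, and score rows through a udf. The scoring function below is the body such a udf would wrap; the broadcast step is shown as a comment because it needs a live SparkContext (training data and `predict_label` are my illustrative stand-ins):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

np.random.seed(42)
X_train = np.random.normal(0, 1, (500, 2))           # "normal" reference data
clf = IsolationForest(random_state=42).fit(X_train)  # fit once, on the driver

# With Spark: b_clf = spark_session.sparkContext.broadcast(clf); the udf then
# closes over b_clf and calls b_clf.value.predict(...) per row, so the model
# is shipped to each executor once instead of once per task.
def predict_label(features):
    # 1 = inlier, -1 = outlier; exactly what the udf would return per row
    return int(clf.predict([features])[0])
```

A point near the training cloud scores as an inlier, while one far outside it scores as an outlier.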
mkaranasou / pyspark_autoincrement_ids_rdd_version.py
Last active September 23, 2021 02:02
Add auto-increment ids to a pyspark data frame using RDDs
>>> from pyspark.sql import SparkSession, functions as F
>>> from pyspark import SparkConf
>>> conf = SparkConf()
>>> spark = SparkSession.builder \
...     .config(conf=conf) \
...     .appName('Dataframe with Indexes') \
...     .getOrCreate()
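`RDD.zipWithIndex` assigns consecutive ids without collecting the data to the driver: Spark first runs a small job to count each partition, then gives every partition a starting offset equal to the combined size of the partitions before it. The scheme can be modeled in plain Python (`zip_with_index` is my illustrative name, not a pyspark API):

```python
def zip_with_index(partitions):
    """Model of RDD.zipWithIndex: pair every row with a consecutive id,
    deriving each partition's starting offset from the preceding sizes."""
    offsets, total = [], 0
    for part in partitions:  # first pass: the per-partition "count job"
        offsets.append(total)
        total += len(part)
    return [
        [(row, offsets[p] + i) for i, row in enumerate(part)]
        for p, part in enumerate(partitions)
    ]
```

On a real dataframe the gist's pattern is roughly `df.rdd.zipWithIndex().map(lambda r: r[0] + (r[1],)).toDF(df.columns + ['index'])`.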
mkaranasou / pyspark_vector_assembler_dense_and_sparse.py
Created March 24, 2020 14:30
VectorAssembler example - dense and sparse output
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark_iforest.ml.iforest import IForest, IForestModel
import tempfile
conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')
spark = SparkSession \
mkaranasou / seq_id_explode_test.py
Created May 13, 2021 07:48
Testing explode as a way to get sequential ids in a spark dataframe
if __name__ == '__main__':
    from pyspark.sql import SparkSession, functions as F
    from pyspark import SparkConf

    conf = SparkConf()
    spark = SparkSession.builder \
        .config(conf=conf) \
        .appName('Dataframe with Indexes') \
        .getOrCreate()
mkaranasou / pyspark_shapley_values_full_example_random_data.py
Last active March 20, 2021 11:17
A full example of Shapley Values calculation with pyspark and their benefits to the model with random data
import random
import numpy as np
import pyspark
from shapley_spark_calculation import \
    calculate_shapley_values, select_row
from pyspark.ml.classification import RandomForestClassifier, LinearSVC, \
    DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
mkaranasou / pyspark_shapley_calculation_cli.py
Last active March 20, 2021 11:17
Cli Shapley Calculation with pyspark
import operator
import os
import time
import warnings
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F, SparkSession, types as T, Window
mkaranasou / z_marginal_contribution.py
Created March 20, 2021 10:59
Output for the calculation of each z's marginal contribution
+----+-------------------------------------+----------+---------------------+
|id  |features                             |prediction|marginal_contribution|
+----+-------------------------------------+----------+---------------------+
|1677|[0.349,0.141,0.162,0.162,0.162,0.349]|0.0       |null                 |
|1677|[0.886,0.141,0.162,0.162,0.162,0.349]|0.0       |0.0                  |
|2250|[0.106,0.423,0.777,0.777,0.777,0.886]|0.0       |null                 |
|2250|[0.886,0.423,0.777,0.777,0.777,0.886]|0.0       |0.0                  |
|2453|[0.801,0.423,0.777,0.777,0.87,0.886] |0.0       |null                 |
+----+-------------------------------------+----------+---------------------+
only showing top 5 rows