Skip to content

Instantly share code, notes, and snippets.

@monocongo
monocongo / mssql_df_upsert.py
Created September 26, 2023 15:04 — forked from gordthompson/mssql_df_upsert.py
Build a T-SQL MERGE statement and upsert a DataFrame
# Copyright 2023 Gordon D. Thompson, gord@gordthompson.com
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
@monocongo
monocongo / start_jupyter_pyspark.sh
Created July 29, 2022 01:06 — forked from BryanCutler/start_jupyter_pyspark.sh
How to start a Jupyter Notebook with PySpark Kernel
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
@monocongo
monocongo / compute_correlation_matrix.py
Created March 10, 2022 14:25 — forked from cameres/compute_correlation_matrix.py
Compute Pandas Correlation Matrix of a Spark Data Frame
from pyspark.mllib.stat import Statistics
import pandas as pd
# result can be used w/ seaborn's heatmap
def compute_correlation_matrix(df, method='pearson'):
# wrapper around
# https://forums.databricks.com/questions/3092/how-to-calculate-correlation-matrix-with-all-colum.html
df_rdd = df.rdd.map(lambda row: row[0:])
corr_mat = Statistics.corr(df_rdd, method=method)
corr_mat_df = pd.DataFrame(corr_mat,