Skip to content

Instantly share code, notes, and snippets.

@stefanthoss
stefanthoss / export-pyspark-schema-to-json.py
Created June 19, 2019 22:16
Export/import a PySpark schema to/from a JSON file
import json
from pyspark.sql.types import *
# Define the schema
schema = StructType(
[StructField("name", StringType(), True), StructField("age", IntegerType(), True)]
)
# Write the schema
with open("schema.json", "w") as f:
@towo
towo / gruvbox.theme
Created April 22, 2018 10:23
timewarrior gruvbox theme (needs terminal palette)
define theme:
description = "gruvbox.theme: A gruvbox-inspired theme"
colors:
exclusion = "color on color8"
today = "color208"
holiday = "color13"
label = "color243"
ids = "color4"
debug = "color14"
palette:
@LeCoupa
LeCoupa / redis_cheatsheet.bash
Last active October 26, 2025 14:15
Redis Cheatsheet - Basic Commands You Must Know --> UPDATED VERSION --> https://github.com/LeCoupa/awesome-cheatsheets
# Redis Cheatsheet
# All the commands you need to know
redis-server /path/redis.conf # start redis with the related configuration file
redis-cli # opens a redis prompt
# Strings.
@samuelsmal
samuelsmal / pyspark_udf_filtering.py
Created October 11, 2016 14:10
PySpark DataFrame filtering using a UDF and Regex
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
def regex_filter(x):
regexs = ['.*ALLYOURBASEBELONGTOUS.*']
if x and x.strip():
for r in regexs:
if re.match(r, x, re.IGNORECASE):
return True
@joshlk
joshlk / faster_toPandas.py
Last active September 19, 2025 16:11
PySpark faster toPandas using mapPartitions
import pandas as pd
def _map_to_pandas(rdds):
""" Needs to be here due to pickling issues """
return [pd.DataFrame(list(rdds))]
def toPandas(df, n_partitions=None):
"""
Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
repartitioned if `n_partitions` is passed.
@evenv
evenv / Spark Dataframe Cheat Sheet.py
Last active August 3, 2025 19:50
Cheat sheet for Spark Dataframes (using Python)
# A simple cheat sheet of Spark Dataframe syntax
# Current for Spark 1.6.1
# import statements
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
#creating dataframes
df = sqlContext.createDataFrame([(1, 4), (2, 5), (3, 6)], ["A", "B"]) # from manual data
@kachayev
kachayev / dijkstra.py
Last active March 4, 2025 23:42
Dijkstra shortest path algorithm based on python heapq heap implementation
from collections import defaultdict
from heapq import *
def dijkstra(edges, f, t):
g = defaultdict(list)
for l,r,c in edges:
g[l].append((c,r))
q, seen, mins = [(0,f,())], set(), {f: 0}
while q: