Skip to content

Instantly share code, notes, and snippets.

View stefanthoss's full-sized avatar
🏠
Working from home

Stefan Thoss stefanthoss

🏠
Working from home
View GitHub Profile
@stefanthoss
stefanthoss / cik_dict.py
Last active December 2, 2019 16:01 — forked from dougvk/cik_dict.py
(stock ticker -> CIK) dictionary using SEC EDGAR using stdout
import re
import requests
DEFAULT_TICKERS = ["BBRY", "VOD", "T", "S"]
URL = "http://www.sec.gov/cgi-bin/browse-edgar?CIK={}&Find=Search&owner=exclude&action=getcompany"
CIK_RE = re.compile(r".*CIK=(\d{10}).*")
cik_dict = {}
for ticker in DEFAULT_TICKERS:
results = CIK_RE.findall(requests.get(URL.format(ticker)).content.decode("ascii"))
@stefanthoss
stefanthoss / mysql-pandas-import.py
Last active December 26, 2023 19:48
Import data from a MySQL database table into a Pandas DataFrame using the pymysql package.
import pandas as pd
import pymysql
from sqlalchemy import create_engine
engine = create_engine("mysql+pymysql://USER:PASSWORD@HOST:PORT/DBNAME")
df = pd.read_sql_query("SELECT * FROM table", engine)
df.head()
@stefanthoss
stefanthoss / git_backup_script.sh
Created March 21, 2016 21:45
Shell script to automatically commit all new/modified/deleted files in a Git repository. Can be used as a backup tool with cron.
#!/bin/bash
cd /path/to/git/repo/
git add -A
git commit -m "Backup on `date`"
git push origin
@stefanthoss
stefanthoss / postgres-csv-import.py
Last active May 5, 2020 05:27
Import data from a local CSV file to a PostgreSQL database table using pandas and psycopg2. 'null' values in the CSV file get replaced by real NULL values.
import pandas as pd
import numpy as np
import psycopg2
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://USER:PASSWORD@HOST:PORT/DBNAME')
df = pd.read_csv('local-file.csv', sep=',').replace(to_replace='null', value=np.NaN)
df.to_sql('dbtable', engine, schema='dbschema', if_exists='replace')
@stefanthoss
stefanthoss / export-postgresql-table.sh
Created April 29, 2016 20:59
Linux command to export a PostgreSQL table from a remote server to a local CSV file.
psql -h hostname -U username -W -d database -t -A -F "," -c "SELECT * FROM table" > file.csv
# Explanation of the used options:
# -h Specifies the host name of the machine on which the server is running.
# -U Connect to the database as a specific user.
# -W Force psql to prompt for a password before connecting to a database.
# -d Specifies the name of the database to connect to.
# -t Turn off printing of column names and result row count footers, etc.
# -A Switches to unaligned output mode.
# -F Use separator as the field separator for unaligned output.
@stefanthoss
stefanthoss / export-pyspark-schema-to-json.py
Created June 19, 2019 22:16
Export/import a PySpark schema to/from a JSON file
import json
from pyspark.sql.types import *
# Define the schema
schema = StructType(
[StructField("name", StringType(), True), StructField("age", IntegerType(), True)]
)
# Write the schema
with open("schema.json", "w") as f:
@stefanthoss
stefanthoss / pbzip2.Dockerfile
Last active March 26, 2024 11:29
Install pbzip2 in an Alpine Linux Docker image
FROM alpine:3.10
RUN apk add --no-cache \
bzip2-dev \
g++ \
make
RUN cd /tmp/ && \
wget -q https://launchpad.net/pbzip2/1.1/1.1.13/+download/pbzip2-1.1.13.tar.gz && \
tar -xzf pbzip2-1.1.13.tar.gz && \
@stefanthoss
stefanthoss / advanced-dataframe-union.py
Last active January 28, 2020 19:34
Returns a new PySpark DataFrame containing the union of two DataFrames. This more advanced version works even when the two DataFrames have different columns and a different order of columns. If a column does not exist in either DataFrame, its fields will be empty.
def advanced_dataframe_union(df1, df2):
df1_fields = set((f.name, f.dataType) for f in df1.schema)
df2_fields = set((f.name, f.dataType) for f in df2.schema)
df2 = df2.select(
df2.columns
+ [
F.lit(None).cast(datatype).alias(name)
for name, datatype in df1_fields.difference(df2_fields)
]
@stefanthoss
stefanthoss / find-postgres-tables-by-column.sql
Created October 9, 2019 20:30
Find PostgreSQL tables that have columns which match a certain regex.
SELECT
t.table_schema,
t.table_name,
c.column_name
FROM
information_schema.tables t
INNER JOIN information_schema.columns c ON c.table_name = t.table_name
AND c.table_schema = t.table_schema
WHERE
c.column_name ~ 'regex'
@stefanthoss
stefanthoss / search-pandas-columns-for-string.py
Last active November 12, 2019 05:43
List all columns of a DataFrame that contain a certain character at least once.
import pandas as pd
# Input:
#            a     b        c    d    e
# 0  Text       Text  NaN      0.0  5
# 1  Text       NaN   1.1.1.1  0.0  55
# 2  Text.Text  Text  Text     0.4  555
data = [
    {"a": "Text", "b": "Text", "d": 0, "e": 5},