Victor Frank vfrank66
@vfrank66
vfrank66 / damerau_levenshtein-postgres
Created July 10, 2024 15:02
Damerau-Levenshtein distance in Postgres (PL/pgSQL)
CREATE OR REPLACE FUNCTION damerau_levenshtein(s1 TEXT, s2 TEXT)
RETURNS INT AS $$
DECLARE
    s1_len INT := LENGTH(s1);
    s2_len INT := LENGTH(s2);
    d INT[][];
    i INT;
    j INT;
    cost INT;
BEGIN
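The preview above cuts off after the DECLARE block, but the declared DP table `d` and per-cell `cost` suggest the restricted variant (optimal string alignment). As a reference point, that variant can be sketched in Python; this is a hedged illustration of the algorithm, not the gist's actual function body:

```python
def damerau_levenshtein_osa(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(s1), len(s2)
    # d[i][j] = distance between prefixes s1[:i] and s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # transposition of adjacent characters
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[m][n]
```

Note the restriction: OSA never edits a substring twice, so `damerau_levenshtein_osa("ca", "abc")` is 3, while the unrestricted distance (next gist) is 2.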
@vfrank66
vfrank66 / damerau_levenshtein-lowrance-wagner
Last active July 10, 2024 15:02
Damerau-Levenshtein in Postgres, with an attempt at the Lowrance-Wagner (LW) algorithm (https://doi.org/10.1145%2F321879.321880) to match DuckDB's behavior
CREATE OR REPLACE FUNCTION damerau_levenshtein(source TEXT, target TEXT)
RETURNS INT AS $$
DECLARE
    source_len INT := LENGTH(source);
    target_len INT := LENGTH(target);
    distance INT[][];
    largest_source_chr_matching JSONB := '{}';
    largest_target_chr_matching INT;
    inf INT;
    i INT;
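The `largest_source_chr_matching` JSONB map in the preview plays the role of Lowrance-Wagner's per-character "last matching row" table. For reference, the unrestricted (Lowrance-Wagner) distance can be sketched in Python; this is an illustration of the published algorithm, not a translation of the gist's PL/pgSQL:

```python
def damerau_levenshtein_lw(source: str, target: str) -> int:
    """Unrestricted Damerau-Levenshtein distance (Lowrance-Wagner)."""
    m, n = len(source), len(target)
    inf = m + n
    last_row = {}  # last row in `source` where each character matched
    # (m+2) x (n+2) matrix with a sentinel border of `inf`
    d = [[inf] * (n + 2) for _ in range(m + 2)]
    for i in range(m + 1):
        d[i + 1][1] = i
    for j in range(n + 1):
        d[1][j + 1] = j
    for i in range(1, m + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, n + 1):
            i1 = last_row.get(target[j - 1], 0)
            j1 = last_col
            if source[i - 1] == target[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,    # substitution / match
                d[i + 1][j] + 1,   # insertion
                d[i][j + 1] + 1,   # deletion
                # transposition, possibly across intervening characters
                d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1),
            )
        last_row[source[i - 1]] = i
    return d[m + 1][n + 1]
```

Unlike the OSA variant, this gives `damerau_levenshtein_lw("ca", "abc") == 2` (transpose to "ac", then insert "b"), which is the behavior DuckDB's `damerau_levenshtein` implements.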
@vfrank66
vfrank66 / migrate_s3_dbs.py
Last active February 8, 2026 21:00
Migrates AWS Glue databases/tables backed by S3 data from one bucket to another; pulls in data_sync.py (another gist) to move the S3 data
"""
This script migrates databases/tables in AWS Glue backed by S3 data. It assumes you are migrating databases to a new
S3 bucket, it also assumes the data is stored in s3://<bucket>/<database name with standardization across dbs>/<tables>/<optional partitions/data.<ext>
bucket name/
database1/
tableA/
tableB/
database2/
table1/
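Because the layout keeps the same key structure in the new bucket, the core of the catalog update is a bucket swap on each table's storage location. A minimal sketch of that helper (the function name is hypothetical, not from the gist):

```python
from urllib.parse import urlparse


def remap_s3_location(location: str, new_bucket: str) -> str:
    """Rewrite an s3://<old-bucket>/<db>/<table>/... location onto a new
    bucket, keeping the key (database/table/partition path) unchanged."""
    parsed = urlparse(location)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 location: {location!r}")
    return f"s3://{new_bucket}{parsed.path}"
```

In the migration, something like this would be applied to each table's `StorageDescriptor.Location` (and each partition's) before calling Glue's `update_table`/`update_partition`, while DataSync copies the objects themselves.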
@vfrank66
vfrank66 / data_sync.py
Created September 27, 2022 13:29
Use AWS DataSync to move data between buckets
"""AWS DataSync an aws service to move/copy large amounts of data."""
import logging
import os
from typing_extensions import Literal
import boto3
import tenacity
from botocore import waiter
from botocore.exceptions import WaiterError
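The gist leans on botocore waiters plus tenacity for retries. The underlying pattern, polling a DataSync task execution's status until it reaches a terminal state, can be sketched with the standard library alone (names and defaults here are illustrative, not the gist's):

```python
import time


def wait_for_status(get_status, success="SUCCESS", failure=("ERROR",),
                    delay=5.0, max_attempts=60, sleep=time.sleep):
    """Poll get_status() until it returns `success`.

    Raises RuntimeError on a terminal failure status and TimeoutError if
    `max_attempts` polls pass without success. `sleep` is injectable so
    tests can skip the real delay.
    """
    for attempt in range(1, max_attempts + 1):
        status = get_status()
        if status == success:
            return status
        if status in failure:
            raise RuntimeError(f"terminal status {status!r} on attempt {attempt}")
        sleep(delay)
    raise TimeoutError(f"status never reached {success!r} after {max_attempts} attempts")
```

With boto3 this would wrap something like `datasync.describe_task_execution(TaskExecutionArn=...)["Status"]`; botocore's `WaiterError` serves the same purpose when a custom waiter model is registered instead.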
@vfrank66
vfrank66 / pyspark_exceptions.py
Created September 27, 2022 13:26
My best attempt at surfacing the original PySpark exception when a stage failure occurs; Spark's stage retry logic otherwise hides the root-cause error
import logging
from py4j.protocol import Py4JJavaError
from pyspark import SparkContext
logger = logging.getLogger(__name__)
class ServiceApiError(Exception):
"""ServiceApiError exception."""