Dominik Weckmüller do-me

@do-me
do-me / untracked_md_files.sh
Last active July 9, 2024 12:50
Shell command to get all untracked markdown files in a git repo in mkdocs format
# Useful when batch editing or creating new markdown files for mkdocs
# and the new entries need to be added to mkdocs.yml
git ls-files --others --exclude-standard '*.md' | grep '\.md$' | while IFS= read -r filename; do
    echo "- $(basename "${filename%.*}" | sed 's/^[[:space:]]*//'): $(basename "$filename")"
done
# Output
# - copernicus_service_provider: copernicus_service_provider.md
# - expanded_uncertainty: expanded_uncertainty.md
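For pipelines that are already in Python, the same listing can be sketched without the shell loop. This is a rough equivalent under assumptions, not the gist's own code: the function names `mkdocs_nav_entry` and `untracked_markdown_files` are hypothetical, and only the `git ls-files` flags are taken from the command above.

```python
import os
import subprocess


def mkdocs_nav_entry(path):
    """Format one file path as a '- title: file.md' line for mkdocs.yml."""
    filename = os.path.basename(path)
    # Drop the last extension and any leading whitespace, like the sed call above
    title = os.path.splitext(filename)[0].lstrip()
    return f"- {title}: {filename}"


def untracked_markdown_files(repo_dir="."):
    """Return untracked, non-ignored .md files reported by git (hypothetical helper)."""
    result = subprocess.run(
        ["git", "ls-files", "--others", "--exclude-standard", "*.md"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line.endswith(".md")]


# Usage inside a git repository:
# for path in untracked_markdown_files():
#     print(mkdocs_nav_entry(path))
```

Keeping the formatting separate from the git call makes the formatting step usable on any list of paths, for example one collected with `glob`.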
do-me / fetch_all_jsons_concat_to_geoparquet.py
Created June 19, 2024 10:29
Saarland parcel fetching from geoportal.saarland.de, saving to geoparquet with geopandas
import requests
import geopandas as gpd
from shapely.geometry import shape
import json
from tqdm import tqdm
import time
# Define the base URL and parameters
base_url = "https://geoportal.saarland.de/spatial-objects/408/collections/cp:CadastralParcel/items"
limit = 500
do-me / bge-m3_batch_benchmarks.py
Created June 7, 2024 17:24
bge-m3 benchmarks
import time
import matplotlib.pyplot as plt
long_text = """Near a great forest there lived a poor woodcutter and his wife, and his two children; the boy's name was Hansel and the girl's Grethel. They had very little to bite or to sup, and once, when there was great dearth in the land, the man could not even gain the daily bread. As he lay in bed one night thinking of this, and turning and tossing, he sighed heavily, and said to his wife, "What will become of us? we cannot even feed our children; there is nothing left for ourselves."
"I will tell you what, husband," answered the wife; "we will take the children early in the morning into the forest, where it is thickest; we will make them a fire, and we will give each of them a piece of bread, then we will go to our work and leave them alone; they will never find the way home again, and we shall be quit of them."
"No, wife," said the man, "I cannot do that; I cannot find in my heart to take my children into the forest and to leave them there alone; the wild an
do-me / overturemaps_places_plot_datashader.py
Created May 29, 2024 15:11
Plot 52 million Overture Maps places with datashader in Python
import geopandas as gpd
import datashader as ds
from colorcet import fire
# download the data before with:
# overturemaps download -f geoparquet --type=place -o places.parquet
gdf = gpd.read_parquet("places.parquet") # takes 3 mins on my M3
# plotting takes 1 min
cvs = ds.Canvas(plot_width=2000, plot_height=1000)
do-me / server.py
Last active May 27, 2024 08:28 — forked from mdonkers/server.py
Simple Python 3 HTTP server for logging all GET and POST requests (CORS enabled)
#!/usr/bin/env python3
"""
License: MIT License
Copyright (c) 2023 Miel Donkers
Very simple HTTP server in python for logging requests
Modified for CORS (Access-Control-Allow-Origin) when e.g. sending requests from the frontend
Usage::
    ./server.py [<port>]
"""
from http.server import BaseHTTPRequestHandler, HTTPServer
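The preview stops at the import, so the handler itself is not shown. As a minimal sketch of what a request-logging handler with CORS headers could look like, assuming a hypothetical class name `CORSLoggingHandler` and a wildcard `Access-Control-Allow-Origin` (neither is confirmed by the gist):

```python
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer


class CORSLoggingHandler(BaseHTTPRequestHandler):
    """Logs every GET/POST request and answers with permissive CORS headers."""

    def _respond(self, body):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        # Assumption: allow any origin, which suits local frontend testing only
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_OPTIONS(self):
        # Answer CORS preflight requests sent by browsers before some POSTs
        self.send_response(204)
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")
        self.end_headers()

    def do_GET(self):
        logging.info("GET %s\nHeaders:\n%s", self.path, self.headers)
        self._respond(f"GET request for {self.path}".encode("utf-8"))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        logging.info("POST %s\nHeaders:\n%s\nBody:\n%s",
                     self.path, self.headers, payload.decode("utf-8", "replace"))
        self._respond(f"POST request for {self.path}".encode("utf-8"))


# Usage: HTTPServer(("", 8080), CORSLoggingHandler).serve_forever()
```

The `do_OPTIONS` method is included because browsers send a preflight request before cross-origin POSTs with a JSON content type; without it such requests would fail even though GET works.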
do-me / cosine_similarity.py
Created May 21, 2024 07:28
Cosine similarity with nan checks
from numpy.linalg import norm
import numpy as np
# Define the cosine similarity function with automatic list-to-array conversion
def cos_sim(a, b):
    # Check if either input is NaN, empty, or contains empty strings
    if a is None or b is None or not a or not b:
        return np.nan
    if isinstance(a, list) and any(x == "" or x is None for x in a):
        return np.nan
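The preview ends before the similarity itself is computed. The idea of guarding every input before dividing by the norms can be sketched in a dependency-free form; this variant uses `math` instead of NumPy, and the name `cos_sim_safe` and the extra length and zero-norm checks are assumptions, not part of the gist.

```python
import math


def cos_sim_safe(a, b):
    """Cosine similarity of two numeric sequences; NaN for unusable input."""
    for vec in (a, b):
        # None, scalars (including a float NaN) and strings are not vectors
        if vec is None or not isinstance(vec, (list, tuple)):
            return math.nan
        # Empty vectors and vectors with missing entries have no similarity
        if len(vec) == 0 or any(x is None or x == "" for x in vec):
            return math.nan
    if len(a) != len(b):
        return math.nan
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so the similarity is undefined
    if norm_a == 0 or norm_b == 0:
        return math.nan
    return dot / (norm_a * norm_b)
```

Returning NaN instead of raising keeps the function usable in a row-wise `apply` over a DataFrame where some rows have missing embeddings.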
do-me / pandas_pickle.py
Created May 20, 2024 08:45
Pandas custom parquet save with pickle for list of lists
import pandas as pd
import pickle
def write_pd_pickle(df, filename, pickle_cols=None):
    """
    Writes a pandas DataFrame to a Parquet file, pickling specified columns.
    The function takes a DataFrame and pickles the specified columns before saving
    the DataFrame to a Parquet file. This is useful for saving columns that contain
    data types that Parquet might not natively support, such as lists or dictionaries.
do-me / semantic_text_splitter_pandarallel.py
Created May 16, 2024 15:23
semantic_text_splitter with pandarallel multiprocessing
from semantic_text_splitter import TextSplitter
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
splitter = TextSplitter((1500, 2000))  # chunk size range in characters, roughly the 512-token context of an embedding model
def wrap_func(text):
    return splitter.chunks(text)
df["chunks"] = df["text"].parallel_apply(wrap_func)
do-me / pandarallel.py
Created May 10, 2024 08:24
Pandas multiprocessing with pandarallel
import pandas as pd
import numpy as np
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
# Create a sample dataframe with 10,000,000 rows and one column
np.random.seed(0) # for reproducibility
df = pd.DataFrame({'numbers': np.random.randint(1, 100, size=10000000)})
do-me / replace_multiple_whitespaces.py
Last active April 23, 2024 15:27
Replace an arbitrary number of white spaces by just one white space in Python for data cleaning (useful for XML/HTML parsing)
import re
# Default replaces white spaces, tabs, line breaks etc.
def replace_multiple_whitespaces(text):
    return re.sub(r'\s+', ' ', text)  # use re.sub(r'[ \t]+', ' ', text) if line breaks should be preserved
# Use this function if you want to preserve exactly one line break and remove the rest like above
def replace_multiple_whitespaces_keep_one_linebreak(text):
    text = re.sub(r'[ \t]*\r?\n[ \t\r\n]*', '\n', text)
    # Replace one or more spaces or tabs with a single space (for remaining white spaces)
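The preview is cut off after the last comment. A complete sketch of both functions follows, where the final substitution is an assumption inferred from that comment and from the `[ \t]+` pattern mentioned in the first function:

```python
import re


def replace_multiple_whitespaces(text):
    # Collapse any run of whitespace (spaces, tabs, line breaks) into one space
    return re.sub(r'\s+', ' ', text)


def replace_multiple_whitespaces_keep_one_linebreak(text):
    # Collapse each run of whitespace that contains a line break into a single '\n'
    text = re.sub(r'[ \t]*\r?\n[ \t\r\n]*', '\n', text)
    # Assumed final step: collapse the remaining runs of spaces or tabs into one space
    return re.sub(r'[ \t]+', ' ', text)


raw = "<p>Hello   world</p> \n\n\n   <p>second\t\tline</p>"
print(replace_multiple_whitespaces(raw))
print(replace_multiple_whitespaces_keep_one_linebreak(raw))
```

The order of the two substitutions matters: handling line breaks first means the spaces and tabs around them are absorbed into the single `\n`, so the second pass only sees whitespace inside a line.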