Jonas Haag jonashaag

@jonashaag
jonashaag / worksteal.py
Last active November 29, 2024 11:09
Python ThreadPoolExecutor Work Stealing
import concurrent.futures.thread as _thread_impl
import threading
import time
import weakref
from concurrent.futures import Future
class WorkStealThreadPoolExecutor(_thread_impl.ThreadPoolExecutor):
    """A ThreadPoolExecutor that supports work stealing.
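The preview ends before the implementation, so the following is not the gist's code. It is a toy sketch of one reading of "work stealing": a thread that is blocked on a result runs queued tasks itself instead of sleeping, which lets nested submissions finish even on a single-worker pool. `TinyStealingPool` and its methods are hypothetical names, and the sketch deliberately avoids the private internals of `concurrent.futures.thread`.

```python
import queue
import threading
from concurrent.futures import Future, wait


class TinyStealingPool:
    """Hypothetical minimal pool; not the gist's WorkStealThreadPoolExecutor."""

    def __init__(self, max_workers):
        self._tasks = queue.Queue()
        for _ in range(max_workers):
            # Daemon threads so the sketch needs no shutdown logic.
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, fn, *args):
        future = Future()
        self._tasks.put((future, fn, args))
        return future

    @staticmethod
    def _run(future, fn, args):
        # Skip tasks whose future was cancelled before they started.
        if not future.set_running_or_notify_cancel():
            return
        try:
            future.set_result(fn(*args))
        except BaseException as exc:
            future.set_exception(exc)

    def _worker(self):
        while True:
            self._run(*self._tasks.get())

    def result(self, future):
        """Wait for `future`, running queued tasks in the calling thread meanwhile."""
        while not future.done():
            try:
                item = self._tasks.get_nowait()
            except queue.Empty:
                # Nothing to steal right now; wait briefly, then check again.
                wait([future], timeout=0.01)
            else:
                self._run(*item)
        return future.result()


pool = TinyStealingPool(max_workers=1)


def inner(x):
    return x * 2


def outer(x):
    # A plain blocking wait here would deadlock a one-worker pool,
    # because the only worker would be waiting for a task it must run itself.
    return pool.result(pool.submit(inner, x)) + 1


answer = pool.result(pool.submit(outer, 20))
```

With one worker, `outer` can only complete because whichever thread waits on `inner` is also allowed to execute it.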
jonashaag / Use macOS OCR engine from Python.md
Last active November 13, 2024 09:42
Use macOS OCR engine from Python

macOS Live Text has a very good quality/speed tradeoff.

Compared to Tesseract, it has much higher quality and is up to 3x as fast.

jonashaag / enum_with_label.py
Last active August 5, 2024 11:19
Python Enum with label / verbose name / description
import enum


class EnumWithDisplayName(enum.Enum):
    def __new__(cls, value, name=None):
        if not hasattr(cls, "_value_to_display_name"):
            cls._value_to_display_name = {}
            cls._display_name_to_value = {}
        if name is not None:
            if value in cls._value_to_display_name:
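The preview stops partway through `__new__`. As a rough illustration of the same idea (an enum member that carries a display name next to its value), here is a simpler variant that stores the label on each member instead of in class-level dictionaries. `LabeledEnum`, `label`, `from_label`, and `Color` are hypothetical names, not taken from the gist.

```python
import enum


class LabeledEnum(enum.Enum):
    """Enum whose members carry a human-readable label next to their value."""

    def __new__(cls, value, label=None):
        obj = object.__new__(cls)
        obj._value_ = value
        # Fall back to the value's string form when no label is given.
        obj.label = label if label is not None else str(value)
        return obj

    @classmethod
    def from_label(cls, label):
        # Linear scan; fine for the small member counts enums usually have.
        for member in cls:
            if member.label == label:
                return member
        raise ValueError(f"no {cls.__name__} member with label {label!r}")


class Color(LabeledEnum):
    RED = ("r", "Bright red")
    GREEN = ("g", "Leaf green")
    BLUE = "b"  # no label given, so the label falls back to "b"
```

Lookup by value still works as usual (`Color("g")`), because only the first tuple element is stored as the member's value.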
jonashaag / tesseract-finetune.md
Last active August 13, 2024 02:38
Tesseract LSTM fine-tuning how-to
  1. Download lots of fonts (e.g., .ttf files)
  2. git clone https://github.com/tesseract-ocr/tesstrain/
  3. git clone https://github.com/tesseract-ocr/langdata_lstm
  4. Install Tesseract
  5. Generate training data:
    cd src
    python -m tesstrain \
      --langdata_dir /path/to/langdata_lstm \
      --linedata_only \
    
jonashaag / prompt.txt
Last active June 3, 2024 11:12
OVH AI Endpoint failure
I'm going to present you with a piece of text. Please classify it according to the classes outlined after the text.
The text:
"""
When should you choose MongoDB over a relational database management system (RDBMS) like MySQL?
By Dimitri Fague / 2024-05-23 / Databases, DBaaS, MongoDB, OVHcloud, Public Cloud
Of all the non-relational database engines (NoSQL) that have emerged in the last decade, MongoDB is without a doubt the most widely used. Source-available, powerful, flexible and scalable, MongoDB covers a wide range of use cases. Many, including startups, choose it to ensure they are not limited in their technological choices, so they can scale and adapt to different use cases. The possibility of switching from MySQL to MongoDB might come up when updating or revamping an existing app. So, let’s see when this switch might be relevant, and why using the MongoDB service managed by OVHcloud could be the ideal option.
The flexibility of the NoSQL data model
jonashaag / xkcdpass.sh
Created February 21, 2024 10:17
XKCD password
curl -L https://raw.githubusercontent.com/redacted/XKCD-password-generator/master/xkcdpass/static/eff-long \
| sort -R \
| head -n 5 \
| tr '\n' -
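The pipeline above shuffles the word list, keeps five lines, and turns every newline into a dash, so its output also ends with a trailing `-`. A rough Python equivalent of the selection step might look like this; `make_passphrase` and the tiny `WORDS` list are hypothetical stand-ins, and the download itself is left out.

```python
import secrets


def make_passphrase(words, count=5, sep="-"):
    """Pick `count` distinct words at random and join them with `sep`."""
    rng = secrets.SystemRandom()  # OS-backed randomness, suitable for passwords
    return sep.join(rng.sample(list(words), count))


# Hypothetical miniature word list standing in for the downloaded file.
WORDS = ["correct", "horse", "battery", "staple", "orbit", "velvet", "maple", "quartz"]
passphrase = make_passphrase(WORDS)
```

Unlike the shell version, joining with `sep.join` leaves no trailing separator.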
jonashaag / sp_count.sql
Last active January 8, 2024 16:35
SQL Server quickly count number of rows in table
-- Count number of rows in a table quickly (without a full table/index scan).
-- Usage:
--   sp_count 'mydb.dbo.mytable'   Get the row count of the given table.
--   sp_count 'dbo.mytable'        Get the row count of the given table from the current database.
--   sp_count                      Get a list of tables and row counts in the current database.
USE [master]
GO
DROP PROCEDURE IF EXISTS [dbo].[sp_count]
jonashaag / snowflake_unload_parquet.py
Created October 8, 2023 18:22
Snowflake Connector Python download table or query as Parquet
from pathlib import Path


def unload_to_parquet(query: str, target_dir: Path, conn, stage_name: str = "unload_stage"):
    # Unload the query result into a temporary stage as Parquet, then download the files.
    conn.execute(f"CREATE TEMP STAGE {stage_name}")
    conn.execute(f"COPY INTO @{stage_name} FROM ({query}) file_format=(type='parquet') header=true")
    target_dir.mkdir(parents=True)
    conn.execute(f"GET @{stage_name} file://{str(target_dir)}")
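To see which statements the function sends without a Snowflake account, it can be dry-run against a stub. `RecordingConn` is a hypothetical stand-in that only records SQL. It assumes `conn` is any object with an `execute(sql)` method, as the calls above imply. The function is repeated here so the sketch is self-contained.

```python
import tempfile
from pathlib import Path


class RecordingConn:
    """Hypothetical stub: records SQL instead of talking to Snowflake."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


def unload_to_parquet(query, target_dir, conn, stage_name="unload_stage"):
    conn.execute(f"CREATE TEMP STAGE {stage_name}")
    conn.execute(f"COPY INTO @{stage_name} FROM ({query}) file_format=(type='parquet') header=true")
    target_dir.mkdir(parents=True)  # raises FileExistsError if the directory already exists
    conn.execute(f"GET @{stage_name} file://{target_dir}")


conn = RecordingConn()
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out"
    unload_to_parquet("SELECT 1 AS x", target, conn)
    created = target.is_dir()
```

Note that `mkdir(parents=True)` without `exist_ok=True` means the target directory must not exist beforehand.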
jonashaag / pd_shrink_dtypes.py
Created September 11, 2023 12:09
Pandas shrink dtypes
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from pandas.core.dtypes.base import ExtensionDtype


def shrink_dtype(series: pd.Series) -> pd.Series:
    smallest_dtype = get_smallest_dtype(series)
    if smallest_dtype == series.dtype:
        return series
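`get_smallest_dtype` is not visible in the preview, so its logic is unknown here. For plain integer and float columns, a comparable effect can be sketched with the public `pd.to_numeric(..., downcast=...)` API. `shrink_numeric` is a hypothetical helper name, and this is an alternative technique, not the gist's implementation.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def shrink_numeric(series: pd.Series) -> pd.Series:
    """Downcast integer/float series to a smaller dtype; leave others untouched."""
    if is_integer_dtype(series.dtype):
        # Picks the smallest signed integer dtype that holds all values.
        return pd.to_numeric(series, downcast="integer")
    if is_float_dtype(series.dtype):
        # Never goes below float32.
        return pd.to_numeric(series, downcast="float")
    return series


ints = shrink_numeric(pd.Series([1, 2, 300]))
floats = shrink_numeric(pd.Series([0.5, 1.5]))
texts = shrink_numeric(pd.Series(["a", "b"]))
```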
import json
import sqlite3
repodata = json.load(open("497deca9.json"))
COLS = 'filename, build, build_number, depends, license, license_family, md5, name, sha256, size, subdir, timestamp, version'.split(', ')
db = sqlite3.connect("497deca9.sqlite")
db.execute("create table repodata ({}, primary key (filename))".format(','.join(COLS)))
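The preview ends after the table is created. A possible next step, sketched under assumptions, is to insert one row per package. This assumes the JSON has the layout `{"packages": {filename: record}}` and that list-valued fields such as `depends` should be stored as JSON text. The tiny `repodata` dict, the shortened `COLS`, and the `to_row` helper are hypothetical, and an in-memory database replaces the file.

```python
import json
import sqlite3

# Shortened column list for the sketch; the original uses more fields.
COLS = ["filename", "build", "build_number", "depends", "name", "version"]

# Hypothetical miniature stand-in for the loaded repodata file.
repodata = {
    "packages": {
        "foo-1.0-0.tar.bz2": {
            "build": "0", "build_number": 0,
            "depends": ["python >=3.8"], "name": "foo", "version": "1.0",
        },
        "bar-2.1-py_1.tar.bz2": {
            "build": "py_1", "build_number": 1,
            "depends": [], "name": "bar", "version": "2.1",
        },
    }
}


def to_row(filename, record):
    """Build a tuple in COLS order; lists and dicts are serialized as JSON text."""
    row = []
    for col in COLS:
        value = filename if col == "filename" else record.get(col)
        if isinstance(value, (list, dict)):
            value = json.dumps(value)
        row.append(value)
    return tuple(row)


db = sqlite3.connect(":memory:")
db.execute("create table repodata ({}, primary key (filename))".format(", ".join(COLS)))
placeholders = ", ".join("?" for _ in COLS)
db.executemany(
    f"insert into repodata values ({placeholders})",
    [to_row(name, rec) for name, rec in repodata["packages"].items()],
)
db.commit()

rows = db.execute("select filename, depends from repodata order by filename").fetchall()
```

Using `?` placeholders keeps the values out of the SQL string; only the column names are formatted in, and those come from a fixed list.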