Skip to content

Instantly share code, notes, and snippets.

View Nipsuli's full-sized avatar

Niko Korvenlaita Nipsuli

View GitHub Profile
@Nipsuli
Nipsuli / second_login.sql
Created August 18, 2022 07:19
Example on how to get timestamp for second login. Used to educate about Non-Equi JOIN in SQL
WITH
base AS (
SELECT
SAFE_CAST (
JSON_EXTRACT_SCALAR (payload, '$.timestamp') AS timestamp
) AS event_time,
JSON_EXTRACT_SCALAR (payload, '$.action') AS action_type,
JSON_EXTRACT_SCALAR (payload, '$.actor_id') AS actor_id,
JSON_EXTRACT_SCALAR (payload, '$.actor_username') AS actor_username,
FROM
@Nipsuli
Nipsuli / weighted_std.sql
Created May 22, 2022 06:47
Numerically stable parallel safe standard deviation in PostgreSQL
CREATE OR REPLACE FUNCTION _wstd_state(state numeric[3], val numeric, weight numeric)
RETURNS numeric[3] AS $$
DECLARE
s_n_1 CONSTANT numeric NOT NULL := state[1];
mu_n_1 CONSTANT numeric NOT NULL := state[2];
w_n_1 CONSTANT numeric NOT NULL := state[3];
s_n numeric;
mu_n numeric;
w_n numeric;
BEGIN
@Nipsuli
Nipsuli / two_pass_shuffle.py
Last active November 10, 2021 19:06
Two pass shuffle implementation for algorithm described: in https://blog.janestreet.com/how-to-shuffle-a-big-dataset/
import contextlib
import tempfile
import random
def two_pass_shuffle(input_files, output_files, temp_file_count, header_lines=0):
"""
two_pass_shuffle
Suffle data larger that can be shuffled in memory.
Implementation based on: