Mark Needham mneedham

@mneedham
mneedham / 00_pip.bash
Created July 26, 2023 09:58
Confluent Kafka: DeprecationWarning: AvroProducer has been deprecated. Use AvroSerializer instead.
pip install confluent-kafka avro urllib3 requests fastavro
mneedham / 01_enums.py
Last active July 10, 2023 08:24
enums in duckdb
# Dataset: https://www.kaggle.com/datasets/wilmerarltstrmberg/recipe-dataset-over-2m
import duckdb
db1 = duckdb.connect('db1.duck.db')
db2 = duckdb.connect('db2.duck.db')
db1.sql("""
CREATE OR REPLACE TABLE recipes AS
FROM read_csv_auto('recipes_data.csv', header=True)
mneedham / queries.md
Created April 14, 2023 06:06
Intro to Window Functions
mneedham / queries.md
Last active April 14, 2023 05:54
SQL Aggregate vs Aggregate Window Functions
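The contrast named in this title can be sketched in a few lines. This is a minimal illustration, not the gist's own queries: it assumes a hypothetical `sales` table with made-up rows, and it uses Python's built-in `sqlite3` module rather than DuckDB so that it runs without extra dependencies.

```python
import sqlite3

# Hypothetical in-memory table; the table name, columns, and values are
# illustrative assumptions, not data from the original gist.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("north", 30), ("south", 5)],
)

# Plain aggregate: GROUP BY collapses each region into a single row.
aggregated = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Aggregate window function: every input row is kept, and each row
# carries the total for its own region alongside its own amount.
windowed = con.execute(
    """
    SELECT region, amount, SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
    """
).fetchall()

print(aggregated)
print(windowed)
```

The same `SUM(amount)` appears in both queries; only the `OVER (PARTITION BY ...)` clause changes whether rows are collapsed or preserved.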
mneedham / matches.py
Last active March 19, 2023 17:53
DuckDB Relational API
import duckdb
import pandas as pd
con = duckdb.connect('atp-matches.db')
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
csv_files = [
    f"https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{year}.csv"
mneedham / import.cql
Last active December 4, 2022 11:47
Neo4j Twitter Graph
CREATE CONSTRAINT ON(u:User)
ASSERT u.id IS unique;
:param keysToKeep => ["name", "username", "bio", "following", "followers"];
CALL apoc.load.json("https://gist.github.com/mneedham/3c6a59fb5e7d87e20a2f5f1ae4fa2920/raw/9d7c57997c09b3a105556adb6c6f1819792a4db4/query.json")
YIELD value
MERGE (u:User {id: value.user.id })
SET u += value.user
FOREACH (following IN value.following |
MERGE (f1:User {id: following})
MERGE (u)-[:FOLLOWS]->(f1))
mneedham / queries.sql
Created November 1, 2022 22:46
On the fly joins on CSV files with DuckDB
CREATE OR REPLACE TABLE players
AS SELECT * FROM read_csv_auto('atp_players.csv', SAMPLE_SIZE=-1);
CREATE OR REPLACE TABLE rankings AS
select *
from 'atp_rankings_*.csv';
SELECT player_id, name_first, name_last
FROM players
LIMIT 5;
mneedham / neo4j.yaml
Created November 25, 2016 16:37
Kubernetes + Neo4j
# Headless service to provide DNS lookup
apiVersion: v1
kind: Service
metadata:
  labels:
    app: neo4j
  name: neo4j
spec:
  clusterIP: None
  ports:
mneedham / blog.py
Last active July 11, 2022 03:04
Meetup API -> JSON -> CSV using Python's Luigi library
import json
import os
import luigi
import requests
from collections import Counter
from luigi.contrib.external_program import ExternalProgramTask
class Meetup(luigi.WrapperTask):
    def run(self):
mneedham / ppr.py
Created July 18, 2018 10:10
Personalized PageRank using networkx
# Dataset from https://blogs.oracle.com/bigdataspatialgraph/intuitive-explanation-of-personalized-page-rank-and-its-application-in-recommendation
import operator
import networkx as nx
G = nx.Graph()
G.add_nodes_from(["John", "Mary", "Jill", "Todd",
                  "iPhone5", "Kindle Fire", "Fitbit Flex Wireless", "Harry Potter", "Hobbit"])