Mark Needham mneedham

mneedham / parquet-cli.sh
Created October 14, 2022 18:24
An intro to Apache Parquet
# The NYC Taxis Dataset - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
pip install parquet-cli
parq data/yellow_tripdata_2022-01.parquet
parq data/yellow_tripdata_2022-01.parquet --schema
parq data/yellow_tripdata_2022-01.parquet --head 10
mneedham / 00_install.sh
Last active April 1, 2025 16:55
Getting Neo4j Tweets
pip install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
pip install "confluent-kafka[avro]"
mneedham / app.py
Created March 31, 2023 05:16
ATP Head to Head
import streamlit as st
import duckdb
from streamlit_searchbox import st_searchbox
atp_duck = duckdb.connect('atp.duck.db', read_only=True)
def search_players(search_term):
    query = '''
    SELECT DISTINCT winner_name AS player
    FROM matches
mneedham / duckdb.sql
Created October 21, 2022 13:58
Queries against DuckDB
SELECT count(*)
FROM 'data/*.parquet';
SELECT *
FROM 'data/*.parquet'
LIMIT 10;
DESCRIBE
SELECT *
FROM 'data/yellow_tripdata_2011-07.parquet';
mneedham / app.py
Last active June 4, 2024 18:09
Mapping Strava runs using Leaflet and Open Street Map
from flask import Flask
from flask import render_template
import csv
import json
app = Flask(__name__)
@app.route('/')
def my_runs():
    runs = []
mneedham / contributors_local.md
Created May 3, 2024 13:16
Latest ClickHouse Contributors
docker run --rm clickhouse/clickhouse-server:24.3 clickhouse-local --query "SELECT * FROM system.contributors ORDER BY name" > contributors_24.3.txt
docker run --rm clickhouse/clickhouse-server:24.4 clickhouse-local --query "SELECT * FROM system.contributors ORDER BY name" > contributors_24.4.txt
./clickhouse --query "
import streamlit as st
import json
from sseclient import SSEClient
print("Listening for updates...")
if "messages" in st.session_state:
    print("Closing old connection")
    st.session_state["messages"].resp.close()
url = "http://127.0.0.1:8000/livetext"
mneedham / ingest.mjs
Last active March 21, 2024 14:32
LangChain Example
import { ClickHouseStore } from "@langchain/community/vectorstores/clickhouse";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { OpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { Document } from '@langchain/core/documents'
const openAIApiKey = "sk-xxx"
mneedham / 0_install.sh
Created October 28, 2023 08:30
Hugging Face's Text Embeddings Inference Library
git clone git@github.com:huggingface/text-embeddings-inference.git
cd text-embeddings-inference
cargo install --path router -F candle -F accelerate
model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
text-embeddings-router --model-id $model --revision $revision --port 8080
mneedham / queries.sql
Created October 28, 2022 12:26
Querying ATP matches using DuckDB
-- Fails because of weird date
CREATE TABLE players AS
select *
from 'atp_players.csv';
-- all varchar
CREATE TABLE players1 AS
select *
from read_csv_auto('atp_players.csv', ALL_VARCHAR=TRUE);