Ian Cook (ianmcook)

@ianmcook
ianmcook / write_parquet_float.cpp
Last active October 13, 2023 18:10
Write Parquet file with float32 column
#include <iostream>
#include <random>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
float GetRandomFloat()
{
  static std::default_random_engine e;
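
The C++ preview above is truncated. For a rough idea of the end result, here is a minimal PyArrow sketch (not the gist's code) that writes a Parquet file with a float32 column; the file name and column name are placeholders.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# random float32 values, analogous to GetRandomFloat() above
values = np.random.default_rng().random(1000, dtype=np.float32)
table = pa.table({"x": pa.array(values, type=pa.float32())})
pq.write_table(table, "float32_column.parquet")
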
@ianmcook
ianmcook / write_wide_parquet.cpp
Created October 11, 2023 21:02
Write a very wide Parquet file
#include <iostream>
#include <random>
#include <vector>
#include <string>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
std::vector<std::string> GenerateUniqueStrings() {
  // generates 26^4 = 456,976 unique 4-letter combinations
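
A minimal PyArrow sketch of the same idea (not the gist's C++ code): build 4-letter column names with itertools and write a wide table. The slice keeps the sketch quick to run; the full 26^4 = 456,976 names would make a very wide file.

import itertools
import string
import pyarrow as pa
import pyarrow.parquet as pq

# 26^4 = 456,976 unique 4-letter combinations, as in GenerateUniqueStrings()
names = ["".join(chars) for chars in itertools.product(string.ascii_lowercase, repeat=4)]
# use a slice of the names as column names; one row of zeros per column
columns = {name: pa.array([0.0]) for name in names[:1000]}
pq.write_table(pa.table(columns), "wide.parquet")
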
@ianmcook
ianmcook / arrow_is_in.cpp
Created September 19, 2023 15:07
Standalone test of the Arrow C++ `is_in` kernel
#include <iostream>
#include <arrow/api.h>
#include <arrow/compute/api.h>
int main(int, char**) {
  // lookup set
  std::shared_ptr<arrow::Array> array;
  arrow::Int32Builder builder;
  if (!builder.Append(5).ok()) return 1;
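
For comparison, the same lookup can be expressed through the Python binding of that kernel; a minimal sketch with placeholder values:

import pyarrow as pa
import pyarrow.compute as pc

values = pa.array([1, 5, 2, 7, 5])
value_set = pa.array([5, 7])  # the lookup set
print(pc.is_in(values, value_set=value_set))  # -> [false, true, false, true, true]
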
@ianmcook
ianmcook / substrait_pyarrow_dataset_expressions.py
Created August 29, 2023 21:39
Use Substrait expressions to filter and project PyArrow datasets
import tempfile
import pathlib
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
import pyarrow.dataset as ds
# create a small dataset for example purposes
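
The preview cuts off before the dataset is created. As a simpler illustration of the filter-and-project step, here is a sketch that reuses the imports above but passes ordinary PyArrow compute expressions; the gist itself builds the equivalent filter and projection from Substrait-serialized expressions, and the names below are placeholders.

# sketch with ordinary compute expressions, not the gist's Substrait path
table = pa.table({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
path = pathlib.Path(tempfile.mkdtemp()) / "example.parquet"
pq.write_table(table, path)

dataset = ds.dataset(path)
result = dataset.to_table(
    columns={"b_doubled": pc.field("b") * 2},  # projection
    filter=pc.field("a") > 1,                  # row filter
)
print(result)
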
@ianmcook
ianmcook / acero_sort.cpp
Created August 17, 2023 21:19
Sort an Arrow Table with Acero
#include <iostream>
#include <arrow/api.h>
#include <arrow/result.h>
#include <arrow/compute/api.h>
#include <arrow/compute/exec/exec_plan.h>
arrow::Status ExecutePlanAndCollectAsTable(
    std::shared_ptr<arrow::compute::ExecPlan> plan,
    std::shared_ptr<arrow::Schema> schema,
    arrow::AsyncGenerator<std::optional<arrow::compute::ExecBatch>> sink_gen) {
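
The C++ preview above shows the Acero ExecPlan boilerplate. The end-to-end behavior (sorting a table) looks like this in Python; a minimal sketch with placeholder data, not the gist's code:

import pyarrow as pa

table = pa.table({"x": [3, 1, 2], "y": ["c", "a", "b"]})
# Table.sort_by covers the common case the Acero order-by/sink plan handles in C++
print(table.sort_by([("x", "ascending")]))
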
@ianmcook
ianmcook / ibis_bigquery_github_nested.py
Created April 14, 2023 17:04
Ibis BigQuery github_nested example query
import google.auth
import ibis
from ibis import _
credentials, billing_project = google.auth.default()
conn = ibis.bigquery.connect(billing_project, 'bigquery-public-data.samples')
t = conn.table('github_nested')
expr = (
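
The expression itself is cut off above. As a purely hypothetical continuation (not the gist's actual query), an aggregation over the nested repository struct might look like the sketch below, reusing conn, t, and _ from the snippet; the field names are assumptions about the github_nested schema.

# hypothetical query for illustration only; assumes github_nested exposes a
# nested 'repository' struct with a 'language' field
expr = (
    t.group_by(language=_.repository.language)
     .aggregate(n=_.count())
     .order_by(ibis.desc("n"))
     .limit(10)
)
print(expr.execute())
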
@ianmcook
ianmcook / ibis_snowflake_tpc-h_1.py
Last active April 12, 2023 18:07
Ibis Snowflake TPC-H Query 1
# before running:
# 1. install Ibis and its Snowflake backend: https://ibis-project.org/backends/Snowflake/
# 2. create and activate a Snowflake trial account
# 3. set environment variables SNOWSQL_USER, SNOWSQL_PWD, SNOWSQL_ACCOUNT
import os
import ibis
from ibis import _
ibis.options.interactive = True
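
The preview stops before the query. For reference, the shape of TPC-H Query 1 in Ibis looks roughly like the sketch below; it is not the gist's code, and it assumes a connection `con` to the Snowflake sample TPC-H schema created with ibis.snowflake.connect using the environment variables listed above.

# hedged sketch of the TPC-H Query 1 shape; column names assume the sample schema
lineitem = con.table("LINEITEM")
q1 = (
    lineitem
    .filter(_.L_SHIPDATE <= ibis.date("1998-09-02"))
    .group_by([_.L_RETURNFLAG, _.L_LINESTATUS])
    .aggregate(
        sum_qty=_.L_QUANTITY.sum(),
        sum_base_price=_.L_EXTENDEDPRICE.sum(),
        avg_disc=_.L_DISCOUNT.mean(),
        count_order=_.count(),
    )
    .order_by([_.L_RETURNFLAG, _.L_LINESTATUS])
)
print(q1.execute())
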
@ianmcook
ianmcook / ibis_trino.py
Last active April 9, 2023 12:02
Simple Ibis Trino demo
# before running:
# 1. install Ibis and its Trino backend: https://ibis-project.org/backends/Trino/
# 2. pull and run the Trino docker container: https://trino.io/docs/current/installation/containers.html
import ibis
from ibis import _
# connect to Trino
conn = ibis.trino.connect(database='memory', schema='default')
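
A hypothetical next step (not necessarily what the gist does): load a small table into the Trino memory catalog and query it back, assuming the backend's create_table accepts a pandas DataFrame. The table name and data are placeholders.

import pandas as pd

# hypothetical follow-up; 'example' is a placeholder table name
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})
t = conn.create_table("example", df)
print(t.filter(_.value > 1).execute())
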
@ianmcook
ianmcook / duckdb_ibis_example.py
Created January 24, 2023 18:01
Ibis + DuckDB example
# pip install 'ibis-framework[duckdb]'
import pandas as pd
import ibis
from ibis import _
# create a pandas DataFrame and write it to a Parquet file
df = pd.DataFrame(data={'repo': ['pandas', 'duckdb', 'ibis'],
                        'stars': [36622, 8074, 2336]})
df.to_parquet('repo_stars.parquet')
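
A hypothetical continuation (the rest of the gist is cut off above): read the Parquet file back through the DuckDB backend and query it with an Ibis expression.

# hypothetical continuation; the gist may do this differently
conn = ibis.duckdb.connect()
repos = conn.read_parquet('repo_stars.parquet')
print(repos.order_by(ibis.desc('stars')).execute())
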
@ianmcook
ianmcook / clean_github_jira_ids.R
Last active October 26, 2022 21:26
Match Apache Arrow Jira user accounts with GitHub user accounts
# run this script second
library(dplyr)
df <- read.csv("dirty.csv")
agg <- df %>%
  group_by(jira, github) %>%
  summarise(n = n(), .groups = "keep") %>%
  ungroup() %>%
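
For readers following along in Python rather than R, the visible part of the aggregation corresponds roughly to this pandas sketch; the file name comes from the script above, and the rest of the R pipeline is truncated.

import pandas as pd

# rough pandas equivalent of the dplyr group_by/summarise shown above
df = pd.read_csv("dirty.csv")
agg = (
    df.groupby(["jira", "github"], as_index=False)
      .size()
      .rename(columns={"size": "n"})
)
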