Ian Cook ianmcook

Why IbisML?

This simple example demonstrates why you might want to use IbisML rather than plain Ibis in an ML preprocessing pipeline.

Scenario

You are training an ML model whose accuracy improves when the floating-point columns in the training data are normalized (by subtracting each column's mean and dividing by its standard deviation). Your data contains multiple floating-point columns.

To demonstrate this, we can use the iris flower dataset.
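The normalization described in the scenario can be sketched in plain Python. This is only an illustration of the arithmetic, not the IbisML or Ibis API: the `standardize` helper and the sample values are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
import statistics


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation (n - 1 denominator).
    std = statistics.stdev(values)
    return [(x - mean) / std for x in values]


# Illustrative values only; these are not read from the iris dataset.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
z_scores = standardize(sepal_length)
```

After this step the column has a mean of (approximately) zero and a standard deviation of one. With several floating-point columns, the same computation has to be repeated per column, each with its own mean and standard deviation.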

@ianmcook
ianmcook / ibis_union_different_column_order.py
Created August 21, 2024 16:02
Union two Ibis tables with columns in different orders
import ibis
import random
con = ibis.connect("duckdb://penguins.ddb")
con.create_table(
    "penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite=True
)
ibis.options.interactive = True
@ianmcook
ianmcook / maintain_row_order.md
Last active August 20, 2024 21:20
Examples demonstrating whether systems maintain row order

This is a set of examples demonstrating whether various Python and R dataframe libraries and OLAP query engines preserve (or do not preserve) the original order of the records in the data.

Example data

The examples all use this dataset describing the 28 times when a person walked on the moon:

year  mission    name            minutes
1969  Apollo 11  Neil Armstrong  151
1969  Apollo 11  Buzz Aldrin     151
@ianmcook
ianmcook / zero_null-masked_bytes.py
Last active August 5, 2024 14:48
Zero null-masked bytes of a fixed-width array in PyArrow
import pyarrow as pa
import numpy as np
import pandas as pd
# Create an array of some fixed-width type containing nulls
a = pa.array(obj=pd.Series([1, 2, 3]), type=pa.int64(), mask=np.array([1, 0, 1], dtype=bool))
# Get the values buffer as a bytearray
b = a.buffers()
v = bytearray(b[1].to_pybytes())
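The preview above stops before the zeroing step. As a rough sketch of the idea, using only the standard library rather than PyArrow, the function below walks an Arrow-style validity bitmap (bit set means valid, least-significant bit first) and zeroes the bytes of every null slot. The name `zero_null_masked_bytes` is hypothetical, the little-endian int64 layout is an assumption, and this is not necessarily how the gist itself proceeds.

```python
import struct


def zero_null_masked_bytes(values: bytearray, validity: bytes, width: int) -> None:
    """Zero, in place, the bytes of each fixed-width slot whose validity bit is 0."""
    n_slots = len(values) // width
    for i in range(n_slots):
        # Arrow validity bitmaps use least-significant-bit numbering.
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if not is_valid:
            values[i * width:(i + 1) * width] = bytes(width)


# Three little-endian int64 values; only index 1 is valid (indices 0 and 2 are null).
values = bytearray(struct.pack("<3q", 1, 2, 3))
validity = bytes([0b00000010])
zero_null_masked_bytes(values, validity, width=8)
result = struct.unpack("<3q", values)  # (0, 2, 0)
```

Zeroing the masked bytes does not change the logical contents of the array, since null slots are ignored by readers, but it makes the underlying buffer deterministic.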
@ianmcook
ianmcook / ArrowHttpClient.cs
Last active March 18, 2024 15:02
C# example to receive Arrow record batches over HTTP and write to file
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
@ianmcook
ianmcook / ArrowHttpClient.java
Last active August 23, 2024 15:49
Java example to receive Arrow record batches over HTTP and write to file
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
@ianmcook
ianmcook / client.c
Last active March 10, 2024 16:43
C GLib example to receive Arrow record batches over HTTP and write to file
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
@ianmcook
ianmcook / ibis_create_duckdb_table.py
Created February 14, 2024 16:21
Different ways to create a DuckDB table from Ibis
import pandas as pd
import ibis
# Different ways to create a DuckDB table from Ibis
# ibis.memtable(...): ephemeral, all in-memory, stored as a view inside duckdb, removed when the session ends
# ibis.memtable(...).cache(): ephemeral, stored as temporary table in the duckdb database, removed when the session ends, expression is cached for the lifetime of the session
# con.create_table(..., temp=True): ephemeral, stored as temporary table in the duckdb database, removed when the session ends, expression is NOT cached for the lifetime of the session
# con.create_table(...): persistent, across sessions (assuming you're not using an in-memory connection)
@ianmcook
ianmcook / ibis_spark_pgsql.py
Last active January 30, 2024 21:01
Use Ibis to insert from Spark table into PostgreSQL table
import pandas as pd
import pyarrow as pa
import ibis
from pyspark.sql import SparkSession
# create example data in a pandas DataFrame
df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'apple', 'orange', 'orange', 'orange'],
                        'variety': ['gala', 'honeycrisp', 'fuji', 'navel', 'valencia', 'cara cara'],
                        'weight': [134.2, 158.6, None, 142.1, 96.7, None]})
@ianmcook
ianmcook / acero_tpch_06_decl_seq.cpp
Created January 22, 2024 23:24
Acero Sequence of Declarations for TPC-H Query 06
#include <iostream>
#include <arrow/api.h>
#include <arrow/type.h>
#include <arrow/result.h>
#include <arrow/io/api.h>
#include <arrow/compute/api.h>
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <parquet/arrow/reader.h>