@drin
drin / create_groups_table.py
Last active March 11, 2023 02:14
Example Arrow code
import itertools
import pyarrow
# ------------------------------
# Define array types for readability
# id1: list<int64>
id1_type = pyarrow.list_(pyarrow.int64())
# id2: struct<type: str, value: str>
drin / HashIntResults.md
Created October 14, 2022 22:35
Trying to test that HashMultiColumn produces expected hash values for int32_t input values

A simplified version of HashIntImp for testing:

// hash_int based on key_hash.cc:HashIntImp (672431b)
template <typename T>
uint64_t hash_int(T val) {
  constexpr uint64_t int_const = 11400714785074694791ULL;
  uint64_t cast_val            = static_cast<uint64_t>(val);

  return static_cast<uint64_t>(BYTESWAP(cast_val * int_const));
}
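
For cross-checking expected values outside of C++, the arithmetic in `hash_int` can be mirrored in Python. This is only a sketch: it assumes `BYTESWAP` reverses the byte order of a 64-bit value, and the helper name `byteswap64` is introduced here for illustration and is not part of the original snippet.

```python
MASK64 = (1 << 64) - 1
INT_CONST = 11400714785074694791  # same constant as in the C++ snippet


def byteswap64(value: int) -> int:
    """Reverse the byte order of an unsigned 64-bit value.

    Assumption: this is what the BYTESWAP macro does in the C++ code.
    """
    return int.from_bytes(value.to_bytes(8, "little"), "big")


def hash_int(val: int) -> int:
    """Python mirror of the simplified C++ hash_int."""
    # Casting a signed integer to uint64_t wraps modulo 2**64; masking reproduces that.
    cast_val = val & MASK64
    # Unsigned 64-bit multiplication in C++ also wraps modulo 2**64.
    product = (cast_val * INT_CONST) & MASK64
    return byteswap64(product)


if __name__ == "__main__":
    # Print hashes for a few int32 inputs, including a negative value.
    for sample in (0, 1, -1, 2**31 - 1):
        print(f"{sample:>11} -> {hash_int(sample):#018x}")
```

The printed values can then be compared by hand against whatever the C++ test reports for the same inputs.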
drin / array_greater_equal_benchmark.cc
Last active August 30, 2022 18:46
Some Arrow Benchmarking
// A version that is directly comparable to
// https://gist.github.com/js8544/8569c0e0bb810f1254904e4584def167#file-benchmark-cc-L12
static void GreaterEqual(benchmark::State& state) {  // NOLINT non-const reference
  constexpr int64_t test_size = 10000;
  constexpr int64_t max_val   = std::numeric_limits<int64_t>::max();

  auto test_vals = benchmark_rng.Int64(test_size, 0, max_val);
  auto test_ints = std::static_pointer_cast<arrow::Int64Array>(test_vals);

  while (state.KeepRunning()) {
    arrow::BooleanBuilder builder;
drin / initial-timing.md
Last active March 11, 2022 00:11
Reproducible example of Arrow compute functions on composed and decomposed table

"Time by slice" is total time, summed from running the function on each slice. "Time by table" is total time, from running the function on a table created by concatenating each slice together.

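| Table ID      | Columns | Rows  | Rows (slice) | Slice count | Time by slice (ms) | Time by total (ms) |
|---------------|---------|-------|--------------|-------------|--------------------|--------------------|
| E-GEOD-100618 | 415     | 20631 | 299          | 69          | 644.065            | 410                |
| E-GEOD-76312  | 2152    | 27120 | 48           | 565         | 25607.927          | 2953               |
| E-GEOD-106540 | 2145    | 24480 | 45           | 544         | 25193.507          | 3088               |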
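
To make the comparison concrete, the ratio between the two timings can be computed directly from the three rows above. The sketch below only redoes arithmetic on the reported numbers; the variable names are illustrative, and it also checks that rows per slice times slice count matches the row total.

```python
# Each entry: (table_id, total_rows, rows_per_slice, slice_count, ms_by_slice, ms_by_table)
# Values are copied from the timing table; nothing here is newly measured.
timings = [
    ("E-GEOD-100618", 20631, 299, 69, 644.065, 410.0),
    ("E-GEOD-76312", 27120, 48, 565, 25607.927, 2953.0),
    ("E-GEOD-106540", 24480, 45, 544, 25193.507, 3088.0),
]

ratios = {}
for table_id, total_rows, rows_per_slice, slice_count, ms_by_slice, ms_by_table in timings:
    # Sanity check: the slices should account for every row in the table.
    assert rows_per_slice * slice_count == total_rows

    # How many times slower the per-slice approach is than the whole-table approach.
    ratios[table_id] = ms_by_slice / ms_by_table

for table_id, ratio in ratios.items():
    print(f"{table_id}: per-slice is {ratio:.2f}x the whole-table time")
```

On these numbers, running per slice costs about 1.6x the whole-table time for the table with 69 slices and more than 8x for the two tables with over 500 slices each.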
drin / example_class.py
Created October 15, 2021 21:31
Random python example
class ExampleClass:
    class_var = 'Class Variable'

    def __init__(self, req_param, def_param=10, **kwargs):
        # calling super class "constructor" is *optional*
        super().__init__()
        self.required_arg = req_param
        self.optional_arg = def_param
drin / check-pyarrow-deps.bash
Last active September 20, 2021 22:11
Arrow from C++ and python
(my-poetry-venv) 14:17 >> python
Python 3.9.6 (default, Jun 30 2021, 10:22:16)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__file__
'<path-to-my-poetry-venv>/lib/python3.9/site-packages/pyarrow/__init__.py'
>>> quit()
(my-poetry-venv) 15:00 >> ldd <path-to-my-poetry-venv>/lib/python3.9/site-packages/pyarrow/lib.cpython-39-x86_64-linux-gnu.so
drin / test.r
Last active July 14, 2021 22:10
R code for using skytether via python
# ------------------------------
# Dependencies
library(reticulate)
library(arrow)
# >> Set python interpreter (rely on pyenv and poetry)
use_python(Sys.which('python'), required=TRUE)
# >> Python dependencies (via reticulate)
skytether <- import('skytether')
drin / DESCRIPTION
Last active June 8, 2021 16:02
Using Arrow in C++ and R
Package: skytethr
Title: Integration to 'Skytether-singlecell'
Version: 0.1.0
LinkingTo: cpp11, boostfs, arrow
SystemRequirements: C++11
drin / Vagrantfile
Last active March 3, 2021 22:00
Almost default content of Vagrantfile for Ubuntu 21.04
# -*- mode: ruby -*-
# vi: set ft=ruby :
# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
# The most common configuration options are documented and commented below.
# For a complete reference, please see the online documentation at
drin / ArrowAptInstall.bash
Last active October 2, 2021 01:48
Install arrow using apt
sudo apt update
sudo apt install -y -V ca-certificates lsb-release wget
# modified per comment; since bintray is retired
wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt update
sudo apt install -y -V libarrow-dev # For C++
sudo apt install -y -V libarrow-dataset-dev # For Arrow Dataset C++