@bivald
bivald / batch-insert-duckdb.md
Created May 22, 2023 11:41
Batch insert DuckDB

How I batch-read into DuckDB

On a machine with plenty of RAM, the following works:

con.sql("CREATE TABLE new_tbl AS SELECT * FROM read_parquet('file.parq')")

It uses 20 GB of RAM or more, takes 130 s, and produces a 3.42 GB DuckDB file.

Instead, try reading the Parquet file in batches and inserting each batch, keeping no more than needed (i.e. one row group at a time) in RAM:

# This file, if written with pyarrow==2.0.0, can't be read by pyarrow==8.0.0
import pyarrow as pa
import pyarrow.parquet as pq
schema = pa.schema([
("col1", pa.int8()),
("col2", pa.string()),
("col3", pa.float64()),
("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False))
@bivald
bivald / arrow-conversion.py
Last active June 7, 2022 15:28
Convert a Parquet file with dictionary/categorical values into an Arrow file, one row group at a time
import pandas as pd
from pyarrow import fs
import hashlib
import pyarrow as pa
import pyarrow.parquet as pq
input_file = 'input.parq'
output_file = 'data.arrow'
parquet_file = pq.ParquetFile(input_file)
@bivald
bivald / clean-raspberry.sh
Created December 1, 2012 15:19
Clean Raspberry Pi Debian
#!/bin/bash
sudo apt-get --yes purge xserver* x11-common x11-utils x11-xkb-utils x11-xserver-utils xarchiver xauth xkb-data console-setup xinit lightdm libx{composite,cb,cursor,damage,dmcp,ext,font,ft,i,inerama,kbfile,klavier,mu,pm,randr,render,res,t,xf86}* lxde* lx{input,menu-data,panel,polkit,randr,session,session-edit,shortcut,task,terminal} obconf openbox gtk* libgtk* alsa* nano python-pygame python-tk python3-tk scratch tsconf
sudo apt-get -y purge aspell hunspell-en-us iptraf libaspell15 libhunspell-1.2-0 lxde lxsession lxtask lxterminal squeak-vm whiptail zenity gdm gnome-themes-standard python-pygame
sudo apt-get --yes purge xdg-tools desktop-file-utils omxplayer python3-numpy python3
sudo apt-get remove xserver-xorg
sudo apt-get purge ^lx
sudo apt-get --yes autoremove
sudo apt-get --yes autoclean
sudo apt-get --yes clean
@bivald
bivald / debian-networking-dom0-xen-wheezy.md
Last active September 17, 2021 13:12
Debian wheezy networking bridge, xen
root@787346ae580a:/arrow/python# cd /arrow/python/ && pypy setup.py build_ext --build-type=release
running build_ext
-- Runnning cmake for pyarrow
cmake -DPYTHON_EXECUTABLE=/usr/local/bin/pypy -DPYARROW_BOOST_USE_SHARED=on -DCMAKE_BUILD_TYPE=release /arrow/python
INFOCompiler command: /usr/bin/c++
INFOCompiler version: Using built-in specs.
COLLECT_GCC=/usr/bin/c++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 6.3.0-18+deb9u1' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disabl
@bivald
bivald / ffmpeg.sh
Last active November 6, 2020 11:27
FFmpeg commands
# TO MP4 (with quicktime/windows media player support)
# Convert to 720 MP4 (roughly 7MB for 30 seconds)
ffmpeg -i input.mp4 -s hd720 -c:v libx264 -crf 23 -c:a aac -strict -2 -pix_fmt yuv420p output.mp4
# Convert to 1080 MP4 (roughly 60MB for 30 seconds) - high resolution, but not RAW
ffmpeg -i input.mp4 -s hd1080 -c:v libx264 -crf 23 -c:a aac -strict -2 -pix_fmt yuv420p output.mp4
# Animated gif to mp4
ffmpeg -i input.gif -strict -2 -pix_fmt yuv420p output.mp4
@bivald
bivald / enplore.md
Last active September 30, 2020 08:49
Fast growing, bootstrapped Tech Company

We’ve spent the last five years developing a kick-ass, world-renowned data analytics solution. Our close collaboration with Google gives us a unique technical edge and a behind-the-scenes view of the Google Cloud Platform. We’ve modeled our systems on insights from the Google team and knowledge gained in the trenches at Spotify. Every line of code you write for us helps transform the way our customers work with data. For that we will always be grateful.

We believe in:

  • Working closely together across teams and experiences
  • Focusing on solving challenging and interesting problems

We combine complex data sources with advanced (and customer-specific) analyses to answer our customers' burning questions. One case is the airline industry, where we answer questions such as: How do we fly safer? What are the unknown contributing factors at takeoff? How do we fly more fuel-efficiently without compromising on safety?

import time
import random
import string
import uuid
import os
import pyarrow as pa
pa.jemalloc_set_decay_ms(100)
import pyarrow.parquet as pq
@bivald
bivald / xrf-brute-force.py
Created November 25, 2012 16:53
XRF Getting out of sleep mode - brute force
#!/usr/bin/python
# This is a crude brute-force tool to remove the cycling on an XRF module (http://shop.ciseco.co.uk/xrf-wireless-rf-radio-uart-rs232-serial-data-module-xbee-shape-arduino-pic-etc/)
# Every 20 ms it sends out a WAKE + REMOVE INTLV + Reboot command
# It should be running when you power up your XRF (or when the XRF sends the battery reading, but on startup is easier)
#
# Start the script, then start the XRF, and it should reset straight away. Afterwards it will boot with the regular STARTED commands, without cycling.
# Everything else stays the same (i.e. device name etc.)
#
# 1. You need an XRF connected to your computer,
# Either a USB XRF receiver/sender (http://shop.ciseco.co.uk/urf-radio-module-and-serial-inteface-via-usb/)