bivald / batch-insert-duckdb.md
Created May 22, 2023 11:41
Batch insert DuckDB

How I batch-read into DuckDB

On a machine with plenty of RAM, the following works:

con.sql("CREATE TABLE new_tbl AS SELECT * FROM read_parquet('file.parq')")

It uses about 20 GB of RAM or more, takes 130 seconds, and the resulting DuckDB file is 3.42 GB.

Trying instead to read the Parquet file in batches and insert them one at a time, keeping nothing more than needed (i.e. each row group) in RAM:

# This file, if written with pyarrow==2.0.0, can't be read by pyarrow==8.0.0
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("col1", pa.int8()),
    ("col2", pa.string()),
    ("col3", pa.float64()),
    ("col4", pa.dictionary(pa.int32(), pa.string(), ordered=False)),
])
bivald / arrow-conversion.py
Last active June 7, 2022 15:28
Convert a parquet file with dictionaries/categorical values into an arrow file row group per row group
import hashlib

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

input_file = 'input.parq'
output_file = 'data.arrow'

parquet_file = pq.ParquetFile(input_file)
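
A minimal sketch of that conversion (assuming pyarrow's IPC RecordBatchFileWriter; not necessarily the gist's exact code):

import pyarrow as pa
import pyarrow.parquet as pq

input_file = 'input.parq'
output_file = 'data.arrow'

parquet_file = pq.ParquetFile(input_file)
# schema_arrow preserves dictionary-encoded (categorical) columns
schema = parquet_file.schema_arrow

with pa.OSFile(output_file, 'wb') as sink:
    with pa.RecordBatchFileWriter(sink, schema) as writer:
        for i in range(parquet_file.num_row_groups):
            # read and write one row group at a time to bound memory use
            writer.write_table(parquet_file.read_row_group(i))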
import os
import random
import string
import time
import uuid

import pyarrow as pa
import pyarrow.parquet as pq

# Hand memory back to the OS sooner: jemalloc releases unused pages after 100 ms
pa.jemalloc_set_decay_ms(100)
root@787346ae580a:/arrow/python# cd /arrow/python/ && pypy setup.py build_ext --build-type=release
running build_ext
-- Runnning cmake for pyarrow
cmake -DPYTHON_EXECUTABLE=/usr/local/bin/pypy -DPYARROW_BOOST_USE_SHARED=on -DCMAKE_BUILD_TYPE=release /arrow/python
INFO: Compiler command: /usr/bin/c++
INFO: Compiler version: Using built-in specs.
COLLECT_GCC=/usr/bin/c++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 6.3.0-18+deb9u1' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disabl
// include necessary modules
var cluster = require('cluster');
var os = require('os');
var rbush = require('rbush');
var assert = require('assert');
var http = require('http');
var url = require('url');
var toobusy = require('toobusy-js');
var byline = require('byline');
var touch = require('touch');
bivald / rsync-retry.sh
Last active August 2, 2017 13:23 — forked from iangreenleaf/rsync-retry.sh
rsync with retries
#!/bin/bash
### ABOUT
### Runs rsync, retrying on errors up to a maximum number of tries.
### Simply edit the rsync line in the script to whatever parameters you need.
# Trap interrupts and exit instead of continuing the loop
trap "echo Exited!; exit;" SIGINT SIGTERM
MAX_RETRIES=30
i=0
# Start from a failing status so the loop runs at least once; edit the
# rsync line to whatever parameters you need
false
while [ $? -ne 0 -a $i -lt $MAX_RETRIES ]; do
  i=$(($i+1))
  rsync -avz --partial source/ dest/
done
bivald / tech-positions-at-enplore.md
Created March 9, 2017 16:39
Tech positions at Enplore

My name is Niklas Bivald and I'm the CTO at Enplore. If you’re applying for any kind of tech job at Enplore, odds are that I will be your manager. Here are a few things you might want to know.

About us:

  • Curiosity might have killed the cat, but it's essential to everything we do.
  • We’re a small company and we work closely together.
  • We use a transparent salary ladder to combat wage gaps.
bivald / enplore.md
Last active September 30, 2020 08:49
Fast growing, bootstrapped Tech Company

We’ve spent the last five years developing a kick-ass, world-renowned data analytics solution. Our close collaboration with Google gives us a unique technical edge and a behind-the-scenes view of the Google Cloud Platform. We’ve modeled our systems on insights from the Google team and on knowledge gained in the trenches at Spotify. Every line of code you write for us helps transform the way our customers work with data. For that we will always be grateful.

We believe in:

  • Working closely together across teams and experiences
  • Focusing on solving challenging and interesting problems

We combine complex data sources with advanced (and customer-specific) analyses to answer the customers’ burning questions. One case is in the airline industry, where we answer questions such as: How do we fly safer? What are the unknown contributing factors at takeoff? How do we fly more fuel-efficiently without compromising safety?