
Otto von Sperling ottobricks

def mergeFiles(spark: SparkSession, grouped: ListBuffer[ListBuffer[String]], targetDirectory: String): Unit = {
  val startedAt = System.currentTimeMillis()
  val forkJoinPool = new ForkJoinPool(grouped.size)
  val parallelBatches = grouped.par
  parallelBatches.tasksupport = new ForkJoinTaskSupport(forkJoinPool)
  parallelBatches foreach (batch => {
    logger.debug(s"Merging ${batch.size} files into one")
    try {
      spark.read.parquet(batch.toList: _*).coalesce(1).write.mode("append").parquet(targetDirectory.stripSuffix("/") + "/")
    } catch {
@skyzyx
skyzyx / homebrew-gnubin.md
Last active June 29, 2024 15:22
Using GNU command line tools in macOS instead of FreeBSD tools

macOS is a Unix, and not built on Linux.

I think most of us realize that macOS isn't a Linux OS, but what that also means is that instead of shipping with the GNU flavor of command line tools, it ships with the FreeBSD flavor. As such, writing shell scripts which can work across both platforms can sometimes be challenging.
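
One way to cope with this in a cross-platform script is to check which flavor of a tool is installed before relying on GNU-only flags. The following is a minimal sketch, assuming the tool prints "GNU" in its `--version` output (GNU tools generally do, while the BSD versions typically reject `--version`). `is_gnu_tool` is a hypothetical helper name chosen for illustration.

```shell
#!/bin/bash

# Hypothetical helper: succeeds only if the named command exists and
# its --version output mentions "GNU".
is_gnu_tool() {
  # Bail out early if the command cannot be found at all.
  command -v "$1" >/dev/null 2>&1 || return 1
  # stdin is redirected so a tool that ignores --version cannot block on input.
  "$1" --version </dev/null 2>/dev/null | grep -q "GNU"
}

if is_gnu_tool sed; then
  echo "sed looks like GNU sed"
else
  echo "sed is not GNU sed (or is missing)"
fi
```

A script could use the result to pick between, say, `sed` and `gsed`, instead of assuming one platform.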

Homebrew

Homebrew can be used to install the GNU versions of tools onto your Mac, but they are all prefixed with "g" by default.

All commands have been installed with the prefix "g". If you need to use these commands with their normal names, you can add a "gnubin" directory to your PATH from your bashrc.
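
As a rough sketch of what that PATH change can look like in a `~/.bashrc`, assuming the `coreutils` formula is installed and its unprefixed commands live under `libexec/gnubin` in the formula's `opt` directory. `prepend_to_path` is a hypothetical helper, and other formulae would need their own `gnubin` entries.

```shell
#!/bin/bash

# Hypothetical helper: prepend a directory to PATH only if it exists
# and is not already listed.
prepend_to_path() {
  local dir="$1"
  [ -d "$dir" ] || return 1
  case ":$PATH:" in
    *":$dir:"*) ;;            # already on PATH, nothing to do
    *) PATH="$dir:$PATH" ;;
  esac
}

# Only attempt this when Homebrew is available on this machine.
if command -v brew >/dev/null 2>&1; then
  # Assumed location of the unprefixed coreutils commands.
  prepend_to_path "$(brew --prefix)/opt/coreutils/libexec/gnubin" || true
fi
export PATH
```

Checking that the directory exists keeps the same file harmless on a Linux machine, where the GNU tools are already the defaults.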

@balamurugana
balamurugana / running-minio-in-minikube.md
Last active August 2, 2023 05:32
Running minio in minikube

Prerequisites:

  • Start minikube with the KVM driver: $ minikube start --vm-driver kvm

Minio FS mode:

  1. Deploy minio in fs mode: save the YAML below to a file (for example my-minio-fs.yaml), then run $ kubectl create -f my-minio-fs.yaml
## Create persistent volume claim for minio to store data.
apiVersion: v1
kind: PersistentVolumeClaim
@michalczukm
michalczukm / pre-push.sh
Last active December 11, 2020 01:10
Run linter on git pre-push
#!/bin/bash
# add it as `pre-push` file in your repository `.git/hooks folder`
# to test it - run `git push --dry-run`, it might be helpful :)
echo "============================= pre-push started ============================= "
remote="$1"
url="$2"
z40=0000000000000000000000000000000000000000
@joshlk
joshlk / faster_toPandas.py
Last active May 15, 2023 13:48
PySpark faster toPandas using mapPartitions
import pandas as pd

def _map_to_pandas(rdds):
    """ Needs to be here due to pickling issues """
    return [pd.DataFrame(list(rdds))]

def toPandas(df, n_partitions=None):
    """
    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
    repartitioned if `n_partitions` is passed.
@EmmanuelOga
EmmanuelOga / commit-msg
Created June 13, 2012 22:01
commit-msg hook to add a prefix to commit messages
#!/usr/bin/env ruby
#
# Git commit-msg hook. If your branch name is in the form "US1234-postfix", or
# "US1234_postfix", it automatically adds the prefix "[US1234]" to commit
# messages.
#
# Example
# =======
#
# git checkout -b US1234-some-cool-feature