Alexis Seigneurin aseigneurin

aseigneurin /
Last active October 18, 2022 08:26
Register an Avro schema against the Confluent Schema Registry
import sys
import json
import requests

schema_registry_url = sys.argv[1]
topic = sys.argv[2]
schema_file = sys.argv[3]

# The value schema of a topic is registered under the "<topic>-value" subject
with open(schema_file) as f:
    payload = json.dumps({"schema": f.read()})
url = schema_registry_url + "/subjects/" + topic + "-value/versions"
r = requests.post(url, data=payload, headers={"Content-Type": "application/vnd.schemaregistry.v1+json"})
r.raise_for_status()
print(r.json()["id"])
<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
    "-//Puppy Crawl//DTD Check Configuration 1.3//EN"
    "http://www.puppycrawl.com/dtds/configuration_1_3.dtd">
<!--
  Checkstyle configuration that checks the Google coding conventions from:
  - Google Java Style
-->
aseigneurin / settings.yaml
Created March 6, 2017 12:50
region: Ile-de-France
departement: Paris
zipCode: 75011
city: Paris
name: Alexis S
phoneNumber: "0600000000"
hidePhoneNumber: false
password: xxxxxxxxx
aseigneurin / alexis.zsh-theme
Last active January 15, 2017 01:30
Oh-My-Zsh configuration
PROMPT=$'%{$fg_bold[red]%}%D{%H:%M:%S}%{$reset_color%} %{$fg[cyan]%}%n%{$fg[grey]%}@%{$fg[green]%}%M%{$fg[grey]%}:%{$fg_bold[yellow]%}%d%{$fg[grey]%}$(git_prompt_info) $ %{$reset_color%}'
ZSH_THEME_GIT_PROMPT_PREFIX=" %{$fg_bold[white]%}git:("
aseigneurin / Spark
Created November 15, 2016 15:25
Spark - Parquet files

Basic file formats - such as CSV, JSON or other text formats - can be useful when exchanging data between applications. When it comes to storing intermediate data between the steps of an application, Parquet provides more advanced capabilities:

  • Support for complex types, as opposed to string-based types (CSV) or a limited type system (JSON only supports strings, basic numbers and booleans).
  • Columnar storage - more efficient when not all the columns are used or when filtering the data.
  • Partitioning - files are partitioned out of the box.
  • Compression - pages can be compressed with Snappy or Gzip (this preserves the partitioning).

The tests here are performed with Spark 2.0.1 on a cluster with 3 workers (c4.4xlarge, 16 vCPU and 30 GB each).
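The capabilities above can be sketched as follows in a spark-shell. This is a minimal, hypothetical example: the paths and column names (`events.json`, `year`, `month`, `userId`, `eventType`) are illustrative and not taken from the tests in this document.

```scala
// Hypothetical sketch (Spark 2.x spark-shell): writing and reading a
// partitioned, compressed Parquet dataset.
val df = spark.read.json("/data/events.json")

df.write
  .mode("overwrite")
  .option("compression", "snappy")   // or "gzip"
  .partitionBy("year", "month")      // one directory per partition value
  .parquet("/data/events.parquet")

// Selecting a few columns benefits from the columnar layout, and a filter
// on a partition column prunes whole directories instead of scanning them.
val events = spark.read.parquet("/data/events.parquet")
  .select("userId", "eventType")
  .where("year = 2016")
```

Note that `partitionBy` produces a directory per distinct value of the partition columns (e.g. `year=2016/month=11/`), which is what makes partition pruning possible when reading back.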

aseigneurin / Spark file formats and
Last active December 17, 2018 10:09
Spark - File formats and storage options

In this document, I'm using a data file containing 40 million records. The file is a text file with one record per line.

The following Scala code is run in a spark-shell:

val filename = "<path to the file>"
val file = sc.textFile(filename)
aseigneurin / Spark high
Created November 1, 2016 16:42
Spark - High availability

Components in play

As a reminder, here are the components in play to run an application:

  • The cluster:
    • Spark Master: coordinates the resources
    • Spark Workers: offer resources to run the applications
  • The application:
#!/bin/bash -e
if [ ! -d data/wikipedia-pagecounts-hours ]; then
  mkdir -p data/wikipedia-pagecounts-hours
fi
cd data/wikipedia-pagecounts-hours
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
import scala.collection.mutable.ArrayBuffer
import com.datastax.driver.core.ResultSetFuture

case class PendingResult(k1: Long, k2: String, futureResults: ResultSetFuture)

val pendingResults = ArrayBuffer.empty[PendingResult]
for (i <- 1 to iterations) {
  val k1 = ...
  val k2 = ...
  val futureResults = session.executeAsync(s"SELECT * FROM ${tableName} WHERE k1=${k1} AND k2='${k2}'")
  pendingResults += PendingResult(k1, k2, futureResults)
}
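One possible way to drain these pending queries once they have all been issued (a sketch only: `getUninterruptibly` blocks until the query completes, and error handling is omitted):

```scala
// Hypothetical: block on each future in submission order and collect the rows.
import scala.collection.JavaConverters._

val rows = pendingResults.flatMap { pending =>
  pending.futureResults.getUninterruptibly().all().asScala
}
```

Issuing the queries asynchronously and only blocking at the end lets the driver pipeline many requests, which is usually much faster than executing them one by one.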