

Alexis Seigneurin (aseigneurin)

aseigneurin /
Last active Nov 18, 2020
Register an Avro schema against the Confluent Schema Registry
import json
import sys
import requests

schema_registry_url = sys.argv[1]
topic = sys.argv[2]
schema_file = sys.argv[3]

# Read the Avro schema and register it under the "<topic>-value" subject
with open(schema_file) as f:
    schema = f.read()

url = schema_registry_url + "/subjects/" + topic + "-value/versions"
headers = {"Content-Type": "application/vnd.schemaregistry.v1+json"}
requests.post(url, data=json.dumps({"schema": schema}), headers=headers)
aseigneurin / google_checks.xml
<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
    "-//Puppy Crawl//DTD Check Configuration 1.3//EN"
    "http://www.puppycrawl.com/dtds/configuration_1_3.dtd">
<!--
  Checkstyle configuration that checks the Google coding conventions from:
  - Google Java Style
-->
aseigneurin / settings.yaml
Created Mar 6, 2017
region: Ile-de-France
departement: Paris
zipCode: 75011
city: Paris
name: Alexis S
phoneNumber: "0600000000"
hidePhoneNumber: false
password: xxxxxxxxx
aseigneurin / alexis.zsh-theme
Last active Jan 15, 2017
Oh-My-Zsh configuration
PROMPT=$'%{$fg_bold[red]%}%D{%K:%M:%S}%{$reset_color%} %{$fg[cyan]%}%n%{$fg[grey]%}@%{$fg[green]%}%M%{$fg[grey]%}:%{$fg_bold[yellow]%}%d%{$fg[grey]%}$(git_prompt_info) $ %{$reset_color%}'
ZSH_THEME_GIT_PROMPT_PREFIX=" %{$fg_bold[white]%}git:("
aseigneurin / Spark
Created Nov 15, 2016
Spark - Parquet files

Spark - Parquet files

Basic file formats - such as CSV, JSON or other text formats - can be useful when exchanging data between applications. When it comes to storing intermediate data between steps of an application, Parquet can provide more advanced capabilities:

  • Support for complex types, as opposed to string-based types (CSV) or a limited type system (JSON supports only strings, basic numbers, and booleans).
  • Columnar storage - more efficient when only some of the columns are used or when filtering the data.
  • Partitioning - files are partitioned out of the box.
  • Compression - pages can be compressed with Snappy or Gzip (this preserves the partitioning).

The tests here are performed with Spark 2.0.1 on a cluster with 3 workers (c4.4xlarge, 16 vCPU and 30 GB each).
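As an illustration, the capabilities above map to a few options on the DataFrame writer. A sketch for a spark-shell session (assuming a DataFrame `df` with `date` and `value` columns; the paths are hypothetical):

```scala
// Write partitioned, Snappy-compressed Parquet
df.write
  .partitionBy("date")
  .option("compression", "snappy")
  .parquet("hdfs:///data/df.parquet")

// Columnar layout: selecting a few columns avoids scanning the rest,
// and filtering on the partition column prunes whole directories
val subset = spark.read.parquet("hdfs:///data/df.parquet")
  .select("date", "value")
  .filter($"date" === "2016-11-15")
```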

aseigneurin / Spark file formats and
Last active Dec 17, 2018
Spark - File formats and storage options

Spark - File formats and storage options

In this document, I'm using a data file containing 40 million records. The file is a text file with one record per line.

The following Scala code is run in a spark-shell:

val filename = "<path to the file>"
val file = sc.textFile(filename)
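The records can then be written back in the storage options under comparison; a hedged sketch continuing the spark-shell session above (output paths are hypothetical, Spark 2.x DataFrame API):

```scala
import spark.implicits._

// A single string column, so the text writer accepts it
val df = file.toDF("line")

df.write.text("hdfs:///data/records-text")                             // plain text
df.write.option("compression", "gzip").text("hdfs:///data/records-gz") // gzipped text
df.write.parquet("hdfs:///data/records-parquet")                       // Parquet
```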
aseigneurin / Spark high

Spark - High availability

Components in play

As a reminder, here are the components in play to run an application:

  • The cluster:
    • Spark Master: coordinates the resources
    • Spark Workers: offer resources to run the applications
  • The application:
    • Spark Driver: runs the main program and schedules the tasks
    • Executors: processes started on the workers to execute the tasks
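With the standalone cluster manager, high availability of the Master is typically achieved with ZooKeeper-based recovery; a minimal configuration sketch (the ZooKeeper ensemble address is hypothetical):

```shell
# spark-env.sh on every Master node (hypothetical ZooKeeper ensemble)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```

A standby Master started with the same settings takes over when ZooKeeper detects that the active one is gone.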
#!/bin/bash -e
if [ ! -d data/wikipedia-pagecounts-hours ]; then
  mkdir -p data/wikipedia-pagecounts-hours
fi
cd data/wikipedia-pagecounts-hours
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
aseigneurin / 38531095.scala
case class PendingResult(k1: Long, k2: String, futureResults: ResultSetFuture)

// Fire the queries asynchronously and keep the futures for later resolution
val pendingResults = ArrayBuffer.empty[PendingResult]
for (i <- 1 to iterations) {
  val k1 = ...
  val k2 = ...
  val futureResults = session.executeAsync(s"SELECT * FROM ${tableName} WHERE k1=${k1} AND k2='${k2}'")
  pendingResults += PendingResult(k1, k2, futureResults)
}
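The loop only collects futures; they still have to be drained. A hedged sketch, assuming the DataStax Java driver, where `ResultSetFuture.getUninterruptibly()` blocks until the query completes:

```scala
import scala.collection.JavaConverters._

// Resolve the pending queries in submission order
for (pending <- pendingResults) {
  val rs = pending.futureResults.getUninterruptibly()
  for (row <- rs.asScala) {
    // hypothetical per-row handler
    println(s"${pending.k1}/${pending.k2}: $row")
  }
}
```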