Alexis Seigneurin aseigneurin

@aseigneurin
aseigneurin / Spark high availability.md
Created November 1, 2016 16:42
Spark - High availability

Components in play

As a reminder, here are the components in play to run an application:

  • The cluster:
    • Spark Master: coordinates the resources
    • Spark Workers: offer resources to run the applications
  • The application:
aseigneurin / copy_remotely
Last active January 5, 2024 19:06
Ansible module to copy a file if the MD5 sum of the target is different
#!/usr/bin/python
DOCUMENTATION = '''
---
module: copy_remotely
short_description: Copies a file from one path to another on the remote server.
description:
- Copies a file but, unlike the M(file) module, the copy is performed on the
remote server.
The copy is only performed if the source and destination files are different
(different MD5 sums) or if the destination file does not exist.
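
The copy rule described above (copy only when the destination is missing or its MD5 sum differs) can be sketched in plain Python. This is a minimal illustration, not the module's actual implementation: the helper names `md5sum` and `copy_if_different` are hypothetical, and it runs against local paths rather than through Ansible.

```python
import hashlib
import os
import shutil


def md5sum(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_if_different(src, dest):
    """Copy src to dest only if dest is missing or its MD5 sum differs.

    Returns True when a copy was performed, False otherwise
    (hypothetical helper, named for this sketch only).
    """
    if os.path.exists(dest) and md5sum(src) == md5sum(dest):
        return False
    shutil.copyfile(src, dest)
    return True
```

Returning a boolean mirrors the way an Ansible module would report whether anything changed, which keeps repeated runs idempotent.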
aseigneurin / Spark parquet.md
Created November 15, 2016 15:25
Spark - Parquet files

Basic file formats - such as CSV, JSON or other text formats - can be useful when exchanging data between applications. When it comes to storing intermediate data between steps of an application, Parquet can provide more advanced capabilities:

  • Support for complex types, as opposed to string-based types (CSV) or a limited type system (JSON scalars are limited to strings, numbers, booleans and null).
  • Columnar storage - more efficient when not all the columns are used or when filtering the data.
  • Partitioning - files are partitioned out of the box.
  • Compression - pages can be compressed with Snappy or Gzip (this preserves the partitioning).
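
To illustrate why columnar storage helps when only some columns are used, here is a small plain-Python sketch. It is only an analogy for the layout, not how Parquet is implemented, and the sample records and the `to_columns` helper are made up for the example.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"id": 1, "country": "FR", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.5},
    {"id": 3, "country": "FR", "amount": 4.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full record is visited to read a single field.
total_from_rows = sum(record["amount"] for record in rows)

# Columnar layout: only the "amount" column is touched.
total_from_columns = sum(columns["amount"])
```

Both totals are identical; the difference is that the columnar version never has to look at `id` or `country`, which is the access pattern a columnar file format is built to exploit.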

The tests here are performed with Spark 2.0.1 on a cluster with 3 workers (c4.4xlarge instances, each with 16 vCPUs and 30 GB of memory).

aseigneurin / parse.js
Created November 18, 2013 09:53
Parse a JSON file and output a SQL script with Node.js.
var fs = require('fs');
var data = fs.readFileSync(process.argv[2], {
encoding: 'ascii'
});
var json = JSON.parse(data);
for (var list in json) {
var devices = json[list];
for (var i = 0; i < devices.length; i++) {
aseigneurin / register_schema.py
Last active October 18, 2022 08:26
Register an Avro schema against the Confluent Schema Registry
#!/usr/bin/python
import os
import sys
import requests
schema_registry_url = sys.argv[1]
topic = sys.argv[2]
schema_file = sys.argv[3]
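
The preview above stops after the arguments are read. As a rough sketch of what registering a schema involves, the snippet below builds (without sending) a request for the Schema Registry REST endpoint `POST /subjects/<subject>/versions`. Several things here are assumptions and not taken from the original script: the `build_register_request` helper is hypothetical, the subject is assumed to be the topic name suffixed with `-value`, and the standard library's `urllib` is used instead of `requests` so the example stays self-contained.

```python
import json
import urllib.request


def build_register_request(schema_registry_url, topic, schema_text):
    """Build a POST request that registers schema_text for a topic's values.

    Hypothetical helper: the "<topic>-value" subject name is an assumption.
    """
    subject = topic + "-value"
    url = "{}/subjects/{}/versions".format(
        schema_registry_url.rstrip("/"), subject
    )
    # The registry expects the schema itself as a JSON-encoded string.
    payload = json.dumps({"schema": schema_text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it to a running registry; building the request separately makes the URL and payload easy to inspect first.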
aseigneurin / Spark file formats and storage.md
Last active December 17, 2018 10:09
Spark - File formats and storage options

In this document, I'm using a data file containing 40 million records. The file is a text file with one record per line.

The following Scala code is run in a spark-shell:

val filename = "<path to the file>"
val file = sc.textFile(filename)
file.count()
<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
"-//Puppy Crawl//DTD Check Configuration 1.3//EN"
"http://www.puppycrawl.com/dtds/configuration_1_3.dtd">
<!--
Checkstyle configuration that checks the Google coding conventions from:
- Google Java Style
aseigneurin / settings.yaml
Created March 6, 2017 12:50
leboncoin-ad-manager
region: Ile-de-France
departement: Paris
zipCode: 75011
city: Paris
name: Alexis S
email: alexis@xxx.com
phoneNumber: "0600000000"
hidePhoneNumber: false
password: xxxxxxxxx
aseigneurin / alexis.zsh-theme
Last active January 15, 2017 01:30
Oh-My-Zsh configuration
PROMPT=$'%{$fg_bold[red]%}%D{%K:%M:%S}%{$reset_color%} %{$fg[cyan]%}%n%{$fg[grey]%}@%{$fg[green]%}%M%{$fg[grey]%}:%{$fg_bold[yellow]%}%d%{$fg[grey]%}$(git_prompt_info) $ %{$reset_color%}'
ZSH_THEME_GIT_PROMPT_PREFIX=" %{$fg_bold[white]%}git:("
ZSH_THEME_GIT_PROMPT_SUFFIX="%{$fg[white]%})%{$reset_color%}"
ZSH_THEME_GIT_PROMPT_DIRTY="%{$fg[red]%}*"
ZSH_THEME_GIT_PROMPT_CLEAN=""
#!/bin/bash -e
if [ ! -d data/wikipedia-pagecounts-hours ]; then
mkdir -p data/wikipedia-pagecounts-hours
fi
cd data/wikipedia-pagecounts-hours
yyyy=2014
MM=06
dd=19