Alexis Seigneurin aseigneurin

@aseigneurin
aseigneurin / register_schema.py
Last active October 18, 2022 08:26
Register an Avro schema against the Confluent Schema Registry
#!/usr/bin/python
# Usage: register_schema.py <schema registry URL> <topic> <path to the Avro schema file>
import os
import sys
import requests

schema_registry_url = sys.argv[1]
topic = sys.argv[2]
schema_file = sys.argv[3]
<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
    "-//Puppy Crawl//DTD Check Configuration 1.3//EN"
    "http://www.puppycrawl.com/dtds/configuration_1_3.dtd">
<!--
Checkstyle configuration that checks the Google coding conventions from:
- Google Java Style
@aseigneurin
aseigneurin / settings.yaml
Created March 6, 2017 12:50
leboncoin-ad-manager
region: Ile-de-France
departement: Paris
zipCode: 75011
city: Paris
name: Alexis S
email: alexis@xxx.com
phoneNumber: "0600000000"
hidePhoneNumber: false
password: xxxxxxxxx
@aseigneurin
aseigneurin / alexis.zsh-theme
Last active January 15, 2017 01:30
Oh-My-Zsh configuration
# Prompt: bold red time, cyan user @ green host, yellow working directory, then git info
PROMPT=$'%{$fg_bold[red]%}%D{%K:%M:%S}%{$reset_color%} %{$fg[cyan]%}%n%{$fg[grey]%}@%{$fg[green]%}%M%{$fg[grey]%}:%{$fg_bold[yellow]%}%d%{$fg[grey]%}$(git_prompt_info) $ %{$reset_color%}'

# Git segment: show the current branch, with a red * when the working tree is dirty
ZSH_THEME_GIT_PROMPT_PREFIX=" %{$fg_bold[white]%}git:("
ZSH_THEME_GIT_PROMPT_SUFFIX="%{$fg[white]%})%{$reset_color%}"
ZSH_THEME_GIT_PROMPT_DIRTY="%{$fg[red]%}*"
ZSH_THEME_GIT_PROMPT_CLEAN=""
@aseigneurin
aseigneurin / Spark parquet.md
Created November 15, 2016 15:25
Spark - Parquet files

Basic file formats - such as CSV, JSON or other text formats - can be useful when exchanging data between applications. When it comes to storing intermediate data between steps of an application, Parquet can provide more advanced capabilities:

  • Support for complex types, as opposed to string-based types (CSV) or a limited type system (JSON only supports strings, basic numbers, and booleans).
  • Columnar storage - more efficient when not all the columns are used or when filtering the data.
  • Partitioning - files are partitioned out of the box.
  • Compression - pages can be compressed with Snappy or Gzip (this preserves the partitioning).

The tests here are performed with Spark 2.0.1 on a cluster with 3 workers (c4.4xlarge: 16 vCPUs and 30 GB of memory each).
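
As a quick illustration of these capabilities (this snippet is not from the original gist; the paths and the partitioning column are made up), a DataFrame can be written as partitioned, Snappy-compressed Parquet and read back with a filter that benefits from the partitioning:

// Write as Parquet, partitioned by year and compressed with Snappy
val events = spark.read.json("/data/events.json")
events.write
  .partitionBy("year")
  .option("compression", "snappy")
  .parquet("/data/events.parquet")

// A filter on the partitioning column only touches the matching directories
spark.read.parquet("/data/events.parquet").filter("year = 2016").count()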

@aseigneurin
aseigneurin / Spark file formats and storage.md
Last active December 17, 2018 10:09
Spark - File formats and storage options

In this document, I'm using a data file containing 40 million records. The file is a text file with one record per line.

The following Scala code is run in a spark-shell:

// Load the file as an RDD of lines and count the records
val filename = "<path to the file>"
val file = sc.textFile(filename)
file.count()
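
As a sketch of one storage option that could then be compared (not from the original gist; it assumes the comparison relies on Spark's storage levels), the RDD can be persisted with an explicit storage level and counted again:

import org.apache.spark.storage.StorageLevel

// Keep a serialized copy of the RDD in memory
file.persist(StorageLevel.MEMORY_ONLY_SER)
file.count()   // the first count materializes the cache
file.count()   // the second count is served from the cached copy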
@aseigneurin
aseigneurin / Spark high availability.md
Created November 1, 2016 16:42
Spark - High availability

Components in play

As a reminder, here are the components in play to run an application:

  • The cluster:
    • Spark Master: coordinates the resources
    • Spark Workers: offer resources to run the applications
  • The application:
#!/bin/bash -e
# Prepare a working directory for the Wikipedia pagecounts files
if [ ! -d data/wikipedia-pagecounts-hours ]; then
  mkdir -p data/wikipedia-pagecounts-hours
fi
cd data/wikipedia-pagecounts-hours

# Date of the pagecounts to process
yyyy=2014
MM=06
dd=19
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.sample</groupId>
  <artifactId>Spark_Kafka_Streaming</artifactId>
  <packaging>jar</packaging>
  <version>0.0.1-SNAPSHOT</version>
  <properties>
// Each pending query keeps its keys together with the future that will deliver its result
case class PendingResult(k1: Long, k2: String, futureResults: ResultSetFuture)

val pendingResults = ArrayBuffer.empty[PendingResult]
for (i <- 1 to iterations) {
  val k1 = ...
  val k2 = ...
  // Fire the query asynchronously and keep the future instead of blocking on the result
  val futureResults = session.executeAsync(s"SELECT * FROM ${tableName} WHERE k1=${k1} AND k2='${k2}'")
  pendingResults += PendingResult(k1, k2, futureResults)
}
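
The collected futures still need to be consumed. A minimal way to drain them (not part of the original snippet) relies on the DataStax driver's getUninterruptibly(), which blocks until the corresponding query has completed:

// Wait for each asynchronous query and process its rows
pendingResults.foreach { pending =>
  val rows = pending.futureResults.getUninterruptibly().all()
  println(s"k1=${pending.k1}, k2=${pending.k2} -> ${rows.size()} rows")
}
pendingResults.clear()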