import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.sql.SaveMode._
// Reuse the shell's SparkContext and set a 1-second micro-batch interval
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(1))
val inputPath = "/tmp/inputDir/" // directory the stream will watch
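// A hedged guess at how this truncated snippet continues (not the gist's
// actual code): textFileStream is the standard way to watch a directory
// such as inputPath for newly arriving text files.
val lines = ssc.textFileStream(inputPath)
lines.print() // show the first records of each micro-batch
ssc.start()
ssc.awaitTermination()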
@Mageswaran1989
Mageswaran1989 / spark_dataset.ipynb
Created January 30, 2020 03:24
A gentle introduction to Spark Datasets
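The notebook preview itself is not available here, so as an illustrative stand-in only (not the notebook's actual content), a minimal typed-Dataset example of the kind such an introduction usually opens with:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder.appName("dataset_intro").getOrCreate()
import spark.implicits._

// A Dataset is a typed view over the same engine that runs DataFrames:
// the lambda below is checked at compile time against the Person type.
val people = Seq(Person("Ann", 32), Person("Bob", 41)).toDS()
people.filter(_.age > 35).show()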
@CesarCapillas
CesarCapillas / add-by-id.sh
Last active April 23, 2024 21:14
SOLR bash recipes for creating, deleting or truncating collections, monitoring and searching.
#!/bin/bash
COLLECTION=${2:-zylk}
SERVER=${3:-localhost}
PORT=${4:-8983}
if [ -z "$1" ]; then
  # Usage
  echo 'Usage: add-by-id.sh <id> [<collection> <solr-server=localhost> <port=8983>]'
else
  # Add (or overwrite) a document with the given id via Solr's XML update handler
  curl -X POST "http://${SERVER}:${PORT}/solr/${COLLECTION}/update?commit=true" -H "Content-Type: text/xml" --data-binary "<add><doc><field name='id'>$1</field><field name='url'>$1</field></doc></add>"
fi
@max-mapper
max-mapper / bibtex.png
Last active March 10, 2024 21:53
How to make a scientific looking PDF from markdown (with bibliography)
@allquest
allquest / leboncoin_rss.user.js
Last active August 2, 2023 11:00
Greasemonkey script for LeBonCoin - a kind of RSS for the website Le bon coin: each time you reload a page, a GET request is sent to lbc with your query, and if a new offer matches, its link is shown at the top of the page.
// ==UserScript==
// @name Leboncoin RSS
// @namespace http://gist.github.com/fb7b790fb6548bdec3ec5259bebd20c0
// @author Tegomass
// @description A kind of RSS for LeBonCoin with your personal search
// @include *
// @require https://cdnjs.cloudflare.com/ajax/libs/jquery/3.1.1/jquery.min.js
// @version 1.1
// @grant GM_addStyle
// @grant GM_setValue
@yoyama
yoyama / Schema2CaseClass.scala
Created January 20, 2017 07:36
Generate case class from spark DataFrame/Dataset schema.
/**
* Generate Case class from DataFrame.schema
*
* val df:DataFrame = ...
*
* val s2cc = new Schema2CaseClass
* import s2cc.implicit._
*
* println(s2cc.schemaToCaseClass(df.schema, "MyClass"))
*/
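The implementation itself is cut off in this preview; as a hedged sketch only (not yoyama's actual code), here is the shape such a converter can take: pattern-match each field's Spark DataType to a Scala type name, wrapping nullable columns in Option.

import org.apache.spark.sql.types._

class Schema2CaseClass {
  // Map a handful of common Spark SQL types to Scala type names;
  // anything unhandled falls back to Spark's own type string.
  private def typeName(dt: DataType): String = dt match {
    case IntegerType      => "Int"
    case LongType         => "Long"
    case DoubleType       => "Double"
    case StringType       => "String"
    case BooleanType      => "Boolean"
    case TimestampType    => "java.sql.Timestamp"
    case ArrayType(et, _) => s"Seq[${typeName(et)}]"
    case other            => other.simpleString
  }

  def schemaToCaseClass(schema: StructType, className: String): String = {
    val fields = schema.fields.map { f =>
      val t = typeName(f.dataType)
      // nullable columns become Option[...] fields
      if (f.nullable) s"  ${f.name}: Option[$t]" else s"  ${f.name}: $t"
    }
    fields.mkString(s"case class $className (\n", ",\n", "\n)")
  }
}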
@longcao
longcao / SparkCopyPostgres.scala
Last active December 26, 2023 14:47
COPY Spark DataFrame rows to PostgreSQL (via JDBC)
import java.io.InputStream
import org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
import org.apache.spark.sql.{ DataFrame, Row }
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection
val jdbcUrl = s"jdbc:postgresql://..." // db credentials elided
val connectionProperties = {
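  // (the gist preview is truncated here)

// A hedged sketch of the core technique, not the gist's actual code: stream
// each partition into Postgres with COPY FROM STDIN, which is much faster
// than row-by-row INSERTs over JDBC. Values are assumed CSV-safe here; real
// code must escape delimiters, quotes and NULLs.
import java.io.ByteArrayInputStream
import java.sql.DriverManager

def copyPartition(rows: Iterator[Row], url: String, table: String): Unit = {
  val conn = DriverManager.getConnection(url)
  try {
    val copy = new CopyManager(conn.asInstanceOf[BaseConnection])
    val csv  = rows.map(_.mkString(",")).mkString("\n")
    copy.copyIn(s"COPY $table FROM STDIN WITH CSV",
                new ByteArrayInputStream(csv.getBytes("UTF-8")))
  } finally conn.close()
}

// Usage: df.rdd.foreachPartition(rows => copyPartition(rows, jdbcUrl, "target_table"))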
@Alanaktion
Alanaktion / pacman.md
Last active April 21, 2020 14:49
Useful pacman commands and packages

Basic usage

pacman -S <package> # Install a package
pacman -Sy # Update package list
pacman -Su # Update installed packages
pacman -Ss <query> # Search packages
pacman -R <package> # Remove a package
pacman -Rs <package> # Remove a package and its unneeded dependencies
@paulp
paulp / oddity.txt
Created January 11, 2016 22:22
Whitespace Oddity
WHITESPACE ODDITY
by Paul Phillips, in eternal admiration of David Bowie, RIP
Bound Ctrl to Major mode
Bound Ctrl to Major mode
Read inputrc and set extdebug on
Bound Ctrl to Major mode (Ten, Nine, Eight, Seven, Six)
Connecting readline, options on (Five, Four, Three)
Check the syntax, may terminfo be with you (Two, One, Exec)
@rampage644
rampage644 / spark_etl_resume.md
Created September 15, 2015 18:02
Spark ETL resume

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process, and I ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about Spark's possible use as a primary ETL tool.

TL;DR

Implementation

The basic ETL implementation is really straightforward. The only real problem (and I mean a real problem) is finding a correct and comprehensive Mapping document (a description of which source fields go where).
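As a hedged illustration only (the paths, tables and column mappings here are hypothetical stand-ins, not taken from the actual Mapping document), this is the kind of dimension-load step the implementation boils down to:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("dim_instance_etl").getOrCreate()

// Hypothetical source path and field mapping, standing in for the real
// Mapping document entries.
val source = spark.read.parquet("/data/cloud_block_storage")
val dimInstance = source
  .select(
    col("instance_id").as("instance_key"),
    col("flavor_name").as("instance_type"),
    col("created_at").cast("timestamp").as("effective_from"))
  .dropDuplicates("instance_key")

dimInstance.write.mode("overwrite").parquet("/warehouse/dim_instance")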