Mallikarjuna Gandhamsetty (@MallikarjunaG)

  • Infosys Ltd
  • Hyderabad, India
@MallikarjunaG
MallikarjunaG / 01GettingStartedWithTheScalableLanguage.ipynb
Created May 10, 2018 05:49
01GettingStartedWithTheScalableLanguage
@MallikarjunaG
MallikarjunaG / ComplexDataFrames
Created April 15, 2018 15:56
DataFrame Operations with Complex Schema
'''
DataFrames From Complex Schema
@author: Mallikarjuna G
'''
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Column

if __name__ == '__main__':
    # Entry point: create the session the DataFrame operations below will use
    spark = SparkSession.builder.appName("ComplexDataFrames").getOrCreate()
@MallikarjunaG
MallikarjunaG / StreamCatsToHBase.py
Created April 28, 2017 12:31
PySpark HBase and Spark Streaming: Save RDDs to HBase - http://cjcroix.blogspot.in/
import sys
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext


def SaveRecord(rdd):
    host = 'sparkmaster.example.com'
    table = 'cats'
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
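The snippet ends before the records are written out. For context, the string converters above expect each RDD element in the `(rowkey, [rowkey, columnFamily, columnName, value])` shape before `saveAsNewAPIHadoopDataset` is called. A minimal sketch of that conversion, with a hypothetical `to_hbase_put` helper and assumed column family `cf1` / qualifier `details` (no Spark or HBase needed to run it):

```python
import json

def to_hbase_put(cat):
    """Map a cat record (dict) to the (rowkey, [rowkey, cf, col, value])
    tuple the StringListToPutConverter expects. 'cf1' and 'details' are
    illustrative names, not from the original gist."""
    rowkey = str(cat["id"])
    return (rowkey, [rowkey, "cf1", "details", json.dumps(cat, sort_keys=True)])

print(to_hbase_put({"id": 7, "name": "Tom"}))
```

In the streaming job this mapping would typically run inside `SaveRecord` via `rdd.map(to_hbase_put)` before the Hadoop dataset write.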

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about possible Spark usage as a primary ETL tool.

TL;DR

Implementation

The basic ETL implementation is really straightforward. The only real problem (and it is a real problem) is finding a correct and comprehensive mapping document (a description of which source fields go where).

#!/usr/bin/env python
import sys, os, re
import json
import datetime, iso8601
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, OffsetRange, TopicAndPartition
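The `OffsetRange` / `TopicAndPartition` imports suggest the job tracks Kafka offsets itself rather than relying on checkpointing. As a sketch of that bookkeeping, here is a hypothetical `advance_offsets` helper (pure Python, no Spark/Kafka required) that merges the `untilOffset` values of a consumed batch into a tracked per-partition offset map:

```python
def advance_offsets(tracked, consumed):
    """Merge newly consumed (topic, partition) -> untilOffset values into
    the tracked offset map, keeping the highest offset seen per partition.
    Illustrative only; real jobs would persist this map between runs."""
    merged = dict(tracked)
    for (topic, partition), until in consumed.items():
        merged[(topic, partition)] = max(merged.get((topic, partition), 0), until)
    return merged

offsets = advance_offsets({("cats", 0): 100}, {("cats", 0): 150, ("cats", 1): 10})
print(offsets)
```

The persisted map would then seed `fromOffsets` when the direct stream is recreated after a restart.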
@MallikarjunaG
MallikarjunaG / 0_reuse_code.js
Created December 7, 2016 03:25
Here are some things you can do with Gists in GistBox.
// Use Gists to store code you would like to remember later on
console.log(window); // log the "window" object to the console