Mallikarjuna Gandhamsetty (MallikarjunaG), Infosys Ltd, Hyderabad, India

Introduction

This document describes a sample implementation of part of the existing Dim_Instance ETL.

To simplify and speed up the process, I took only the Cloud Block Storage source. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about possible Spark usage as a primary ETL tool.

TL;DR

Implementation

The basic ETL implementation is really straightforward. The only real problem (I mean, the only real one) is finding a correct and comprehensive Mapping document (a description of which source fields go where).

#!/usr/bin/env python
# Standard library and date-parsing helpers
import sys, os, re
import json
import datetime, iso8601
# PySpark core, SQL, and (pre-Spark-3.0) Kafka streaming APIs
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, OffsetRange, TopicAndPartition
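Since the hard part is the field mapping rather than the plumbing, the core transformation can be sketched as a plain dictionary that records which source fields go where. The field names below are hypothetical placeholders, not taken from the actual Mapping document:

```python
# Hypothetical source-to-dimension field mapping; the real field names
# would come from the Mapping document described above.
FIELD_MAPPING = {
    "instance_uuid":   "instance_id",
    "volume_size_gb":  "storage_size_gb",
    "region":          "datacenter",
}

def map_record(source_record):
    """Project a raw source record onto the dimension-table schema,
    dropping any source fields the mapping does not mention."""
    return {dim_field: source_record.get(src_field)
            for src_field, dim_field in FIELD_MAPPING.items()}

raw = {"instance_uuid": "abc-123", "volume_size_gb": 100,
       "region": "DFW", "unmapped_field": 1}
print(map_record(raw))
# {'instance_id': 'abc-123', 'storage_size_gb': 100, 'datacenter': 'DFW'}
```

In the Spark job itself this mapping would typically be applied per row, e.g. via a `map` over parsed JSON records or a `select` with column aliases on a DataFrame; keeping the mapping in one data structure makes it easy to update when the Mapping document changes.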