Sergei Turukin rampage644

@rampage644
rampage644 / impala-hdp.md
Last active March 21, 2019 15:07
Impala + HDP

Downloads

HDP sandbox

Installation

# Add the Cloudera CDH5 yum repository
yum-config-manager --add-repo http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
# Install the Impala daemons and the shell
yum install impala-server impala-catalog impala-state-store impala-shell
# Point Impala at the HBase jars shipped with the sandbox
ln -sf /usr/lib/hbase/lib/hbase-client.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-common.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-protocol.jar /usr/lib/impala/lib
@rampage644
rampage644 / dataframe.scala
Last active June 19, 2019 11:54
spark etl sample, attempt #1
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, Row, SQLContext}
import com.databricks.spark.csv.CsvSchemaRDD
import org.apache.spark.sql.functions._
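
The preview above shows only the imports. A minimal sketch of how such a job might be wired together, assuming Spark 1.4+ with the spark-csv package on the classpath; the paths, column names, and date format below are hypothetical, not taken from the original gist:

// Hypothetical wiring of the imports above into a small ETL job.
object EtlSample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("etl-sample"))
    val sqlContext = new SQLContext(sc)

    // Read the raw extract; the header row supplies column names.
    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("raw/instances.csv")

    // Parse a text timestamp into epoch seconds with a UDF.
    val parseTs = udf { (s: String) =>
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(s).getTime / 1000
    }

    // Add the parsed column and write the result back out as CSV.
    raw.withColumn("created_ts", parseTs(col("created_at")))
      .write
      .mode(SaveMode.Overwrite)
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("out/instances")
  }
}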
@rampage644
rampage644 / spark_etl_resume.md
Created September 15, 2015 18:02
Spark ETL resume

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts on possible Spark usage as a primary ETL tool.

TL;DR

Implementation

The basic ETL implementation is really straightforward. The only real problem (I mean, the only real one) is finding a correct and comprehensive mapping document (a description of which source fields go where).
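
Once such a mapping document exists, the Spark side mostly reduces to a select with renames. A hedged sketch, where source is an assumed input DataFrame and the column names are invented for illustration, not the real Dim_Instance mapping:

// Hypothetical mapping: Cloud Block Storage source fields -> Dim_Instance
// fields. Column names are illustrative only.
val dimInstance = source
  .select(
    col("instance_uuid").as("instance_id"),
    col("volume_size_gb").as("size_gb"),
    col("region").as("dc_region"))
  .dropDuplicates(Seq("instance_id"))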