Sergei Turukin rampage644

@rampage644
rampage644 / impala-hdp.md
Last active March 21, 2019 15:07
Impala + HDP

Downloads

HDP sandbox

Installation

# Add the Cloudera CDH5 yum repository
yum-config-manager --add-repo http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
# Install the Impala daemons and the shell
yum install impala-server impala-catalog impala-state-store impala-shell
# Point Impala at the HBase jars shipped with the sandbox
ln -sf /usr/lib/hbase/lib/hbase-client.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-common.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-protocol.jar /usr/lib/impala/lib
@rampage644
rampage644 / dataframe.scala
Last active June 19, 2019 11:54
spark etl sample, attempt #1
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, Row, SQLContext}
import com.databricks.spark.csv.CsvSchemaRDD
import org.apache.spark.sql.functions._
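
The preview above shows only the imports. A minimal sketch of how such a job might be wired together, assuming Spark 1.4+ with the spark-csv package on the classpath; the paths, column names, and date format below are hypothetical, not taken from the original gist:

// Hypothetical wiring of the imports above into a small ETL job.
object EtlSample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("etl-sample"))
    val sqlContext = new SQLContext(sc)

    // Read the raw extract; the header row supplies column names.
    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("raw/instances.csv")

    // Parse a text timestamp into epoch seconds with a UDF.
    val parseTs = udf { (s: String) =>
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(s).getTime / 1000
    }

    // Add the parsed column and write the result back out as CSV.
    raw.withColumn("created_ts", parseTs(col("created_at")))
      .write
      .mode(SaveMode.Overwrite)
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("out/instances")
  }
}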
@rampage644
rampage644 / spark_etl_resume.md
Created September 15, 2015 18:02
Spark ETL resume

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts on possible Spark usage as a primary ETL tool.

TL;DR

Implementation

The basic ETL implementation is really straightforward. The only real problem (I mean, the only real one) is finding a correct and comprehensive mapping document (a description of which source fields go where).
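
Once such a mapping document exists, the Spark side mostly reduces to a select with renames. A hedged sketch, where source is an assumed input DataFrame and the column names are invented for illustration, not the real Dim_Instance mapping:

// Hypothetical mapping: Cloud Block Storage source fields -> Dim_Instance
// fields. Column names are illustrative only.
val dimInstance = source
  .select(
    col("instance_uuid").as("instance_id"),
    col("volume_size_gb").as("size_gb"),
    col("region").as("dc_region"))
  .dropDuplicates(Seq("instance_id"))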