Sergei Turukin rampage644

Block or report user

Report or block rampage644

Hide content and notifications from this user.

Learn more about blocking users

Contact Support about this user’s behavior.

Learn more about reporting abuse

Report abuse
View GitHub Profile
rampage644 / ds-dev.md
Created Aug 2, 2016
DS dev process comments

Dataservices spider development process

Disclaimer: everything described in this document is my personal opinion and does not have to hold true for everyone.

Common

Key information

rampage644 / talk2907.md

My Shub Talks

29/07 - Introduce workflow manager

Brief intro

First, I'd like to say hello to everyone and thank you for coming.

rampage644 / airflow_deploy_design.md

Introduction

This document describes how Airflow jobs (or workflows) get deployed onto the production system.

Directory structure

  • HOME directory: /home/airflow
  • DAG directory: $HOME/airflow-git-dir/dags/
  • Config directory: $HOME/airflow-git-dir/configs/
  • Unittest directory: $HOME/airflow-git-dir/tests/. Preferably discoverable by both nose and py.test
  • Credentials should be accessed via some library
rampage644 / spark_etl_resume.md

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about possible Spark usage as the primary ETL tool.

TL;DR

Implementation

Basic ETL implementation is really straightforward. The only real problem (I mean, a real problem) is finding a correct and comprehensive Mapping document (a description of which source fields go where).

rampage644 / dataframe.scala
Last active Jun 19, 2019
spark etl sample, attempt #1
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, Row, SQLContext}
import com.databricks.spark.csv.CsvSchemaRDD
import org.apache.spark.sql.functions._
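A minimal sketch of how the imports above would typically be wired together for the CSV-based ETL flow described in spark_etl_resume.md. The input/output paths, column names, and date format are hypothetical placeholders rather than values from the original gist, and the snippet assumes Spark 1.4+ with the spark-csv package on the classpath.

// Hypothetical sketch only: reads a CSV extract with spark-csv, derives a
// date column, and writes the result back out. Paths, column names and the
// date format are placeholders, not taken from the original ETL.
object DimInstanceEtlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dim-instance-etl"))
    val sqlContext = new SQLContext(sc)

    // Parse source date strings into java.sql.Date for the dimension table.
    val parseDate = udf { (s: String) =>
      new java.sql.Date(new SimpleDateFormat("yyyy-MM-dd").parse(s).getTime)
    }

    val source = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/raw/cloud_block_storage.csv")  // placeholder input path

    source
      .withColumn("created_date", parseDate(col("created_at")))
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .mode(SaveMode.Overwrite)
      .save("/data/dim_instance")                 // placeholder output path

    sc.stop()
  }
}

The CsvSchemaRDD import in the gist suggests the implicit CSV-saving helpers may have been used in the original; the explicit read/write format calls above are just one way to achieve the same result.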
rampage644 / impala-build.md

Building Impala

  • Version: cdh5-2.0_5.2.0
  • OS: Archlinux 3.17.2-1-ARCH x86_64
  • gcc version 4.9.2

Berkeley DB version >= 5

rampage644 / impala-hdp.md

Downloads

HDP sandbox

Installation

yum-config-manager --add-repo http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
yum install impala-server impala-catalog impala-state-store impala-shell
ln -sf /usr/lib/hbase/lib/hbase-client.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-common.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-protocol.jar /usr/lib/impala/lib
rampage644 / osv.md

OSv + Impala status

  1. I think I got plan-fragment-executor-test to run under OSv
  2. But it fails very quickly
  3. The problem is with tcmallocstatic. First, OSv doesn't support sbrk-based memory management, so one has to tune tcmallocstatic not to use SbrkMemoryAllocator at all (comment out #undef HAVE_SBRK in config.h.in). Second, it still fails with an invalid opcode exception.

Issues

tcmallocstatic

rampage644 / week-result.md

Results

  • Haven't found how to cut off the hardware layer. The virtio lead didn't help.
  • OSv builds very tricky libraries; they are impossible to use as-is on the host.
  • The bottom-up approach seems reasonable for now.

01 Sep

Just collecting information about unikernels/KVM and friends. A little OSv source code digging with no actual result. Discussions.

rampage644 / impalad-summary
## Git repo
Find the modified Impala [here](https://github.com/rampage644/impala-cut). First, have a look at [this](https://github.com/rampage644/impala-cut/blob/executor/README.md) *README* file.
## Task description
The original task was to prune impalad down to some sort of *executor* binary which only executes part of a query. Two approaches were suggested: top-down and bottom-up. I used the bottom-up approach.
My intention was to write a unittest that will actually test the behavior we need. So, look at `be/src/runtime/plan-fragment-executor-test.cc`. It contains all possible tests (that is, actual code snippets) to run part of a query with or without data. Doing so helped me a lot to understand the impalad codebase as it relates to query execution.