Sergei Turukin rampage644

rampage644 /
Created Aug 2, 2016
DS dev process comments

Dataservices spider development process

Disclaimer: Everything described in this document is my personal opinion and doesn't have to hold true for everyone.


Key information


My Shub Talks

29/07 - Introduce workflow manager

Brief intro

First, I'd like to say hello to everyone and thank you all for coming.



This document describes how Airflow jobs (or workflows) are deployed to the production system.

Directory structure

  • HOME directory: /home/airflow
  • DAG directory: $HOME/airflow-git-dir/dags/
  • Config directory: $HOME/airflow-git-dir/configs/
  • Unit test directory: $HOME/airflow-git-dir/tests/. Preferably discoverable by both nose and py.test
  • Credentials should be accessed via some library
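The layout above can be captured in a small helper so that deployment code and tests agree on paths. A minimal sketch (the helper name `repo_path` is hypothetical, not from the actual repo; only the paths themselves come from the list above):

```python
import os

# Paths taken from the directory structure described above.
HOME = "/home/airflow"
REPO = os.path.join(HOME, "airflow-git-dir")


def repo_path(kind):
    """Map an artifact kind ('dags', 'configs', 'tests') to its directory."""
    return os.path.join(REPO, kind) + "/"


print(repo_path("dags"))  # /home/airflow/airflow-git-dir/dags/
```

Keeping DAGs, configs, and tests under one git-tracked root makes a deployment a plain `git pull` on the Airflow host.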


This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about possible Spark usage as the primary ETL tool.



Basic ETL implementation is really straightforward. The only real problem (I mean, a genuinely hard one) is finding a correct and comprehensive mapping document (a description of which source fields go where).
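Once the mapping document is in hand, the transformation itself mostly reduces to projecting source fields onto target columns. A minimal, language-agnostic sketch of that idea (the field names here are invented for illustration, not taken from the real Dim_Instance mapping):

```python
# Hypothetical mapping document: source field -> target dimension column.
MAPPING = {
    "instance_id": "dim_instance_key",
    "region": "dim_region_name",
}


def apply_mapping(source_row, mapping):
    """Project a source record onto target columns per the mapping document.

    Fields absent from the mapping are simply dropped.
    """
    return {target: source_row[source] for source, target in mapping.items()}


row = {"instance_id": "i-123", "region": "ord", "unmapped_field": 42}
print(apply_mapping(row, MAPPING))
# {'dim_instance_key': 'i-123', 'dim_region_name': 'ord'}
```

The Spark version of this is the same projection expressed as a `select` with column aliases; the hard part stays the mapping document, not the code.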

rampage644 / dataframe.scala
Last active Jun 19, 2019
spark etl sample, attempt #1
// Imports for a Spark 1.x CSV-based ETL job (uses the Databricks spark-csv package)
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, Row, SQLContext}
import com.databricks.spark.csv.CsvSchemaRDD
import org.apache.spark.sql.functions._

Building Impala

  • Version: cdh5-2.0_5.2.0
  • OS: Archlinux 3.17.2-1-ARCH x86_64
  • gcc version 4.9.2

Berkeley DB version >= 5



HDP sandbox


yum-config-manager --add-repo
yum install impala-server impala-catalog impala-state-store impala-shell
ln -sf /usr/lib/hbase/lib/hbase-client.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-common.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-protocol.jar /usr/lib/impala/lib

OSv + Impala status

  1. I think I got plan-fragment-executor-test to run under OSv
  2. But it fails very quickly
  3. The problem is with tcmallocstatic. First, OSv doesn't support sbrk-based memory management, so one has to tune tcmallocstatic not to use the SbrkMemoryAllocator at all (by commenting out #undef HAVE_SBRK). Second, it still fails with an invalid opcode exception.





  • Haven't found how to cut off the hardware layer. The virtio lead didn't help.
  • OSv builds very tricky libraries; impossible to use as-is on the host.
  • The bottom-up approach seems reasonable for now

01 Sep

Just collecting information about unikernels/KVM and friends. A little OSv source code digging with no actual results. Discussions.

impalad-summary
## Git repo
Find the modified Impala [here](…). First, have a look at [this](…) *README* file.
## Task description
The original task was to prune impalad down to some sort of *executor* binary which executes only part of a query. Two approaches were suggested: top-down and bottom-up. I used the bottom-up approach.
My intention was to write a unit test that will actually exercise the behavior we need. So, look at `be/src/runtime/`. It contains all possible tests (that is, actual code snippets) to run part of a query with or without data. Doing so helped me a lot in understanding the impalad codebase as it relates to query execution.