Sergei Turukin rampage644

Block or report user

Report or block rampage644

Hide content and notifications from this user.

Learn more about blocking users

Contact Support about this user’s behavior.

Learn more about reporting abuse

Report abuse
View GitHub Profile
rampage644 / ds-dev.md
Created Aug 2, 2016
DS dev process comments

Dataservices spider development process

Disclaimer: everything described in this document is my personal opinion and does not have to hold true for everyone.

Common

Key information

rampage644 / talk2907.md

My Shub Talks

29/07 - Introduce workflow manager

Brief intro

First, I'd like to say hello to everyone and thank you for coming.

rampage644 / airflow_deploy_design.md

Introduction

This document describes how Airflow jobs (or workflows) get deployed onto the production system.

Directory structure

  • HOME directory: /home/airflow
  • DAG directory: $HOME/airflow-git-dir/dags/
  • Config directory: $HOME/airflow-git-dir/configs/
  • Unittest directory: $HOME/airflow-git-dir/tests/. Preferably discoverable by both nose and py.test
  • Credentials should be accessed via some library
rampage644 / spark_etl_resume.md

Introduction

This document describes a sample process of implementing part of the existing Dim_Instance ETL.

I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored the creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about possible Spark usage as the primary ETL tool.

TL;DR

Implementation

Basic ETL implementation is really straightforward. The only real problem (I mean, a real problem) is finding a correct and comprehensive Mapping document (a description of which source fields go where).

rampage644 / dataframe.scala
Last active Jun 19, 2019
spark etl sample, attempt #1
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{SaveMode, Row, SQLContext}
import com.databricks.spark.csv.CsvSchemaRDD
import org.apache.spark.sql.functions._
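A minimal sketch of how the imports above would typically be wired together for the CSV-based ETL flow described in spark_etl_resume.md. The input/output paths, column names, and date format are hypothetical placeholders rather than values from the original gist, and the snippet assumes Spark 1.4+ with the spark-csv package on the classpath.

// Hypothetical sketch only: reads a CSV extract with spark-csv, derives a
// date column, and writes the result back out. Paths, column names and the
// date format are placeholders, not taken from the original ETL.
object DimInstanceEtlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dim-instance-etl"))
    val sqlContext = new SQLContext(sc)

    // Parse source date strings into java.sql.Date for the dimension table.
    val parseDate = udf { (s: String) =>
      new java.sql.Date(new SimpleDateFormat("yyyy-MM-dd").parse(s).getTime)
    }

    val source = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/raw/cloud_block_storage.csv")  // placeholder input path

    source
      .withColumn("created_date", parseDate(col("created_at")))
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .mode(SaveMode.Overwrite)
      .save("/data/dim_instance")                 // placeholder output path

    sc.stop()
  }
}

The CsvSchemaRDD import in the gist suggests the implicit CSV-saving helpers may have been used in the original; the explicit read/write format calls above are just one way to achieve the same result.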
rampage644 / impala-build.md

Building Impala

  • Version: cdh5-2.0_5.2.0
  • OS: Archlinux 3.17.2-1-ARCH x86_64
  • gcc version 4.9.2

Berkeley DB version >= 5

rampage644 / impala-hdp.md

Downloads

HDP sandbox

Installation

yum-config-manager --add-repo http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
yum install impala-server impala-catalog impala-state-store impala-shell
ln -sf /usr/lib/hbase/lib/hbase-client.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-common.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-protocol.jar /usr/lib/impala/lib
rampage644 / osv.md

OSv + Impala status

  1. I think I got plan-fragment-executor-test to run under OSv
  2. But it fails very quickly
  3. The problem is with tcmallocstatic. First, OSv doesn't support sbrk-based memory management, so one has to tune tcmallocstatic not to use SbrkMemoryAllocator at all (comment out #undef HAVE_SBRK in config.h.in). Second, it still fails with an invalid opcode exception.

Issues

tcmallocstatic

rampage644 / week-result.md

Results

  • Haven't found how to cut off the hardware layer. The virtio lead didn't help.
  • OSv builds very tricky libraries; they are impossible to use as-is on the host.
  • The bottom-up approach seems reasonable for now.

01 Sep

Just collecting information about unikernels/KVM and friends. A little OSv source code digging with no actual result. Discussions.

rampage644 / impalad-summary
## Git repo
Find the modified Impala [here](https://github.com/rampage644/impala-cut). First, have a look at [this](https://github.com/rampage644/impala-cut/blob/executor/README.md) *README* file.
## Task description
The original task was to prune impalad down to some sort of *executor* binary which only executes part of a query. Two approaches were suggested: top-down and bottom-up. I used the bottom-up approach.
My intention was to write a unittest that will actually test the behavior we need. So, look at `be/src/runtime/plan-fragment-executor-test.cc`. It contains all possible tests (that is, actual code snippets) to run part of a query with or without data. Doing so helped me a lot to understand the impalad codebase as it relates to query execution.