Dataservices spider development process
Disclaimer: Everything described in this document is my personal opinion that doesn't have to be true for everyone.
This document describes how Airflow jobs (or workflows) get deployed onto the production system.
$HOME/airflow-git-dir/tests/. Preferably, discoverable by both
This document describes a sample process of implementing part of existing
I took only the Cloud Block Storage source to simplify and speed up the process. I also ignored creation of extended tables (specific to this particular ETL process). Below are the code and final thoughts about possible Spark usage as the primary ETL tool.
Basic ETL implementation is really straightforward. The only real problem (and I mean a real problem) is finding a correct and comprehensive Mapping document (a description of which source fields go where).
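To make the idea of a Mapping document concrete, here is a minimal sketch in plain Python, assuming the mapping can be reduced to a simple source-field to target-column table. The field names and the `apply_mapping` helper are hypothetical and only for illustration; a real mapping document usually also carries types and transformation rules, and the Spark code itself is not shown here.

```python
# Hypothetical mapping document: source field -> target column.
# All names below are made up for illustration; they do not come from a real source.
FIELD_MAPPING = {
    "volume_id": "resource_id",
    "size_gb": "capacity_gb",
    "created": "created_at",
}


def apply_mapping(record, mapping):
    """Rename the mapped source fields of one record and drop unmapped ones."""
    missing = [src for src in mapping if src not in record]
    if missing:
        # Fail loudly: a gap here usually means the mapping document is wrong.
        raise KeyError(f"source record lacks mapped fields: {missing}")
    return {target: record[src] for src, target in mapping.items()}


source_row = {
    "volume_id": "vol-1",
    "size_gb": 100,
    "created": "2015-01-01",
    "internal_flag": True,  # not in the mapping, so it is dropped
}
target_row = apply_mapping(source_row, FIELD_MAPPING)
```

Keeping the mapping as data rather than as code means it can be reviewed against the Mapping document line by line, which is where most of the effort goes anyway.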
yum-config-manager --add-repo http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
yum install impala-server impala-catalog impala-state-store impala-shell
ln -sf /usr/lib/hbase/lib/hbase-client.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-common.jar /usr/lib/impala/lib
ln -sf /usr/lib/hbase/lib/hbase-protocol.jar /usr/lib/impala/lib
sbrk-based memory management. One has to tune
tcmallocstatic not to use
SbrkMemoryAllocator at all (comment
config.h.in). Second, it still fails with an invalid opcode exception.
Just collecting information about unikernels, KVM and friends. A little OSv source code digging with no actual result. Discussions.
## Git repo

Find the modified impala [here](https://github.com/rampage644/impala-cut). First, have a look at [this](https://github.com/rampage644/impala-cut/blob/executor/README.md) *README* file.

## Task description

The original task was to prune impalad down to some sort of *executor* binary which only executes part of a query. Two approaches were suggested: top-down and bottom-up. I used the bottom-up approach.

My intention was to write a unit test that will actually test the behavior we need. So, look at `be/src/runtime/plan-fragment-executior-test.cc`. It contains all possible tests (that is, actual code snippets) to run part of a query with or without data. Doing so helped me a lot to understand the parts of the impalad codebase related to query execution.