@rampage644
Created October 6, 2015 20:53
Airflow flows deployment

Introduction

This document describes how Airflow jobs (or workflows) are deployed onto the production system.

Directory structure

  • HOME directory: /home/airflow
  • DAG directory: $HOME/airflow-git-dir/dags/
  • Config directory: $HOME/airflow-git-dir/configs/
  • Unit test directory: $HOME/airflow-git-dir/tests/. Preferably discoverable by both nose and py.test
  • Credentials should be accessed via a library
  • Configs are in YAML format and accessed with a library

Credentials

Workflows always access credentials through a library (rax.credentials).

  1. The ideal solution is to periodically deploy passwords into a text file (or some other store) with Ansible and access them from within each task via the library (preferred solution)
  2. We could also use Ansible config templates to let Ansible fill in workflow configs, but that would clutter the configs; I would prefer configs not to store any connection info except its name (used to retrieve credentials by that name)
  3. Connect to PasswordSafe on each run. The upside is easy credential updates, but credentials won't change often anyway.

Python code example

import rax.credentials

# credentials are looked up by source/connection name only
source_name = "EBI-ETL-DEV-01"
credentials = rax.credentials.get_credentials(source_name)

print "Username is " + credentials['user']
print "Hostname is " + credentials['host']
print "Password is " + credentials['pass']

PasswordSafe has a description field that could be used for additional info/metadata.

import rax.config

config_name = "folder1.folder2.folder3.config_name"  # should be unique; I suggest mirroring the filesystem hierarchy

config = rax.config.get_config(config_name)  # returns an object, as pyyaml does
print "Timeout is " + str(config['timeout'])
# or
print "Timeout is " + str(config.timeout)

Workflow description

Each workflow should consist of one or more Python modules, zero or more config files, and possibly unit tests. All utility functions or common tasks should live in a util or lib directory.

In my opinion, unit tests won't help much with quality testing, so we need to find a way to perform some functional testing.

I also think that each workflow should consist of a checking task (alerting, logging, and aborting on error), the main task itself, and some sort of summary task (calculating what was done, logging, etc.); see the sketch below.
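
A minimal sketch of that check / main / summary layout, assuming an Airflow 1.x-era PythonOperator; the DAG id, schedule, and callables are hypothetical, and the exact import path may vary by Airflow version:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('example_workflow', start_date=datetime(2015, 10, 1),
          schedule_interval='@daily')

def check(**kwargs):
    # verify preconditions (source reachable, credentials valid);
    # raise an exception to abort the run and trigger alerting
    pass

def main(**kwargs):
    # the actual ETL work
    pass

def summary(**kwargs):
    # calculate what was done, log it, send a summary notification
    pass

check_task = PythonOperator(task_id='check', python_callable=check,
                            provide_context=True, dag=dag)
main_task = PythonOperator(task_id='main', python_callable=main,
                           provide_context=True, dag=dag)
summary_task = PythonOperator(task_id='summary', python_callable=summary,
                              provide_context=True, dag=dag)

main_task.set_upstream(check_task)
summary_task.set_upstream(main_task)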

Deployment

We will maintain a single DAG/config repo with code/settings separation. Utilities should go there as well. This repo should automatically sync to $HOME/airflow-repo/ via a simple cron entry.
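
For example, a hypothetical crontab entry on the scheduler nodes (the interval, path, and log location are assumptions):

*/5 * * * * cd $HOME/airflow-repo && git pull --ff-only >> $HOME/repo-sync.log 2>&1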

From a developer's viewpoint it will look like creating a PR against this repo, with unit tests(!).

We could use Jenkins to build it (just to make sure there are no syntax errors and to check PEP8 compliance, pyflakes, etc.) and to run the unit tests (if any). If there are no errors, someone merges the PR and the Airflow scheduler nodes pull the changes.

3rd party package dependencies

Should be done via a PR to the caspian-deploy repo.

Alerting

Via the PagerDuty Python API within check tasks and the summary task (this code could be common and shareable); a sketch follows.
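
A rough sketch of such a shared alerting helper. It calls the PagerDuty Events API directly over requests rather than any particular client library; the function name is made up, and pulling the service key from rax.credentials is an assumption:

import json
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"

def trigger_alert(service_key, description, details=None):
    # service_key would be retrieved via rax.credentials, not stored in configs
    payload = {
        "service_key": service_key,
        "event_type": "trigger",
        "description": description,
        "details": details or {},
    }
    resp = requests.post(PAGERDUTY_EVENTS_URL, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()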

Unresolved issues

Functional testing

I have no idea how to perform that; writing a massive mock for every possible case seems like a weird idea. Maybe we should write a separate tool that checks how each workflow is doing (by understanding what the destination data is and where it lives).

Update: require a test run and verify each task's output data. Could be done with Jenkins CI.
