@manisnesan
Last active December 4, 2020 21:09

Task

Add a new visualization to Apache Superset, using a data workflow tool such as Airflow to refresh the data periodically. We will write a Python script that fetches, transforms, and loads the data into SQLite.

Subtasks

  • Learn Airflow
  • Set up Airflow and create a DAG for periodic data collection
  • Transform the data into records
  • Insert the records into a database
  • Learn Superset and create a Superset dashboard
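The transform and insert subtasks can be sketched with the standard library alone. This is a minimal sketch, not the script these notes refer to: the `raw_rows` sample, the `metrics` table, and its columns are hypothetical, and the fetch step is replaced by hard-coded data so the example runs offline.

```python
import sqlite3

# Hypothetical raw payload standing in for the "fetch" step.
raw_rows = [
    {"day": "2020-12-01", "value": "10"},
    {"day": "2020-12-02", "value": "12"},
]


def transform(rows):
    """Turn raw dicts into (day, value) records with typed values."""
    return [(row["day"], int(row["value"])) for row in rows]


def load(conn, records):
    """Insert records; re-running for the same day overwrites the old row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics (day TEXT PRIMARY KEY, value INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO metrics (day, value) VALUES (?, ?)", records
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # a real pipeline would pass a file path here
load(conn, transform(raw_rows))
load(conn, transform(raw_rows))  # a second scheduled run does not duplicate rows
stored = conn.execute("SELECT day, value FROM metrics ORDER BY day").fetchall()
print(stored)
```

Keying the table on `day` and using `INSERT OR REPLACE` is one way to keep a periodically scheduled load idempotent; other designs (append-only with a load timestamp, for instance) would work too.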

Learning Outcomes

Superset, the Airflow orchestration tool, and setting up a data pipeline

Airflow

  • open-source orchestration software

  • used for scheduling, orchestrating and monitoring workflows

  • Example workflows: periodic data collection and dashboard updates, or retraining ML models as data gets updated

  • Installation

# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install apache-airflow

# initialize the database (Airflow 1.10 syntax; Airflow 2.x: airflow db init)
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

# start the scheduler
airflow scheduler

# visit localhost:8080 in the browser and enable the example dag in the home page
  • ~/airflow/airflow.cfg holds all the configuration
  • To sanity-check a new DAG file, run it with Python: python ~/airflow/dags/tutorial.py (if it executes without raising an exception, the file at least parses and imports cleanly)
  • The structure of an example DAG looks like this:
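from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10 path; 2.x uses airflow.operators.bash

# name, default_args, description, schedule_interval and templated_command are placeholders
dag = DAG(name, default_args=default_args, description=description, schedule_interval=schedule_interval)
t1 = BashOperator(task_id="print_date", bash_command="date", dag=dag)
t2 = BashOperator(task_id="sleep", bash_command="sleep 5", dag=dag)
dag.doc_md = ....
t3 = BashOperator(task_id="templated", bash_command=templated_command, dag=dag)
t1 >> [t2, t3]  # chain the tasks to form a DAG: t1 runs first, t2 and t3 both depend on it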
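To see what `t1 >> [t2, t3]` declares, here is a toy model in plain Python. It is not Airflow code: the `Task` class is hypothetical and only mimics how the `>>` operator records downstream tasks, and the standard-library `graphlib` (Python 3.9+) stands in for the scheduler's ordering.

```python
from graphlib import TopologicalSorter


class Task:
    """Hypothetical stand-in for an operator; it only tracks dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()

    def __rshift__(self, other):
        # `a >> b` or `a >> [b, c]`: every task on the right waits for `a`.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            target.upstream.add(self.task_id)
        return other


t1, t2, t3 = Task("t1"), Task("t2"), Task("t3")
t1 >> [t2, t3]

# Map each task to the tasks it depends on, then ask for a valid run order.
graph = {t.task_id: t.upstream for t in (t1, t2, t3)}
order = list(TopologicalSorter(graph).static_order())
print(order)
```

In this model `t1` always comes first. `t2` and `t3` have no dependency on each other, so their relative order is not fixed.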
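# print the list of active DAGs (Airflow 2.x: airflow dags list)
airflow list_dags

# print the list of tasks in the "tutorial" DAG (Airflow 2.x: airflow tasks list tutorial)
airflow list_tasks tutorial

# print the hierarchy of tasks in the "tutorial" DAG (Airflow 2.x: airflow tasks list tutorial --tree)
airflow list_tasks tutorial --tree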
  • Testing: the date passed on the command line is the execution_date. It is a logical date: the command simulates the scheduler running your task or DAG for that date and time, even though the task physically runs now.
# command layout: command subcommand dag_id task_id date
# (Airflow 2.x: airflow tasks test dag_id task_id date)

# testing print_date
airflow test tutorial print_date 2015-06-01

# testing sleep
airflow test tutorial sleep 2015-06-01

# testing templated
airflow test tutorial templated 2015-06-01
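The difference between the logical date and the physical run time can be illustrated without Airflow. In this sketch, `run_for` is a hypothetical helper: it formats the logical date the way Airflow's `ds` template variable does (`YYYY-MM-DD`) and separately records the wall-clock time at which it actually ran.

```python
from datetime import datetime


def run_for(execution_date):
    """Pretend to run a task for a logical date; returns what the task would see."""
    return {
        "ds": execution_date.strftime("%Y-%m-%d"),  # logical date, as given
        "started_at": datetime.now(),  # physical time, always "now"
    }


result = run_for(datetime(2015, 6, 1))
print(result["ds"])
```

The task sees the logical date it was asked to run for, while `started_at` reflects when the code really executed.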
  • Pausing/Unpausing a DAG

airflow pause tutorial      # Airflow 2.x: airflow dags pause tutorial

airflow unpause tutorial    # Airflow 2.x: airflow dags unpause tutorial

  • To stop the shipped example DAGs from showing up in the UI, set load_examples to False in airflow.cfg
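As a sketch of that change, the snippet below flips the setting on an in-memory sample with the standard-library `configparser`. The sample text is a hypothetical two-line excerpt, not a full airflow.cfg; `load_examples` lives in the `[core]` section. Note that `configparser` drops comments when writing a file back, so editing the real file by hand may be preferable.

```python
import configparser

# Hypothetical excerpt; a real airflow.cfg has many more sections and keys.
sample_cfg = """
[core]
load_examples = True
"""

# interpolation=None keeps any '%' characters in other values untouched.
config = configparser.ConfigParser(interpolation=None)
config.read_string(sample_cfg)

config["core"]["load_examples"] = "False"
print(config.getboolean("core", "load_examples"))
```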

