Adding a new visualization in Apache Superset, using a data workflow tool such as Airflow for periodic refresh. We will create a Python script to fetch, transform & load the data into SQLite.
- Learn Airflow
- Set up Airflow and create a DAG for periodic data collection
- Transform the data into records
- Insert the records into a database
- Learn Superset and create a Superset dashboard
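The transform and insert steps above can be sketched with the standard library alone. This is a minimal sketch, not the final script: the `metrics` table, its columns, and the sample rows are all made-up placeholders, and an in-memory database stands in for the real SQLite file.

```python
import sqlite3


def transform(raw_rows):
    """Turn raw dicts into (day, metric, value) tuples, skipping rows without a value."""
    records = []
    for row in raw_rows:
        if row.get("value") is None:
            continue
        records.append((row["day"], row["metric"], float(row["value"])))
    return records


def load(conn, records):
    """Create the (hypothetical) metrics table if needed and upsert the records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "day TEXT, metric TEXT, value REAL, PRIMARY KEY (day, metric))"
    )
    # INSERT OR REPLACE keeps a periodic re-run from duplicating rows
    conn.executemany("INSERT OR REPLACE INTO metrics VALUES (?, ?, ?)", records)
    conn.commit()


# Placeholder input; in the real pipeline this would come from the fetch step
raw = [
    {"day": "2020-12-04", "metric": "visits", "value": "120"},
    {"day": "2020-12-04", "metric": "signups", "value": None},
    {"day": "2020-12-05", "metric": "visits", "value": "135"},
]

conn = sqlite3.connect(":memory:")  # swap for a file path to persist the data
load(conn, transform(raw))
load(conn, transform(raw))  # second call simulates the next scheduled refresh
rows = conn.execute("SELECT day, metric, value FROM metrics ORDER BY day").fetchall()
print(rows)
```

The primary key plus `INSERT OR REPLACE` is one way to make the load step safe to repeat, which matters once a scheduler re-runs it on every interval.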
Superset, Airflow orchestration tool, Setting up data pipeline
- open-source orchestration software
- used for scheduling, orchestrating and monitoring workflows
- example workflows: periodic data collection and dashboard updates, retraining ML models as data gets updated
- Installation
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow
# install from pypi using pip
pip install apache-airflow
# initialize the database
airflow initdb
# start the web server, default port is 8080
airflow webserver -p 8080
# start the scheduler
airflow scheduler
# visit localhost:8080 in the browser and enable the example dag in the home page
- ~/airflow/airflow.cfg contains all the config info
- Running a new dag
python ~/airflow/dags/tutorial.py
# if this executes without errors, the DAG file is valid
- The structure of an example DAG looks like
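from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # import path in Airflow 1.10
# values below are illustrative; task ids match the shipped "tutorial" DAG
default_args = {"owner": "airflow", "start_date": datetime(2020, 12, 1)}
dag = DAG("tutorial", default_args=default_args, description="A simple tutorial DAG", schedule_interval=timedelta(days=1))
t1 = BashOperator(task_id="print_date", bash_command="date", dag=dag)
t2 = BashOperator(task_id="sleep", bash_command="sleep 5", dag=dag)
dag.doc_md = ....
t3 = BashOperator(task_id="templated", bash_command="echo {{ ds }}", dag=dag)
t1 >> [t2, t3]  # chain the tasks to form a DAG: t2 and t3 run after t1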
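To see what the `>>` chaining on the last line expresses, here is a toy sketch that runs without Airflow installed. The `Task` class is a hypothetical stand-in written for this note, not Airflow's operator class; it only records which tasks come downstream of which.

```python
class Task:
    """Toy stand-in for an operator (NOT Airflow's real class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # Accept either a single task or a list of tasks on the right-hand side
        targets = other if isinstance(other, list) else [other]
        for task in targets:
            self.downstream.append(task.task_id)
        return other


t1, t2, t3 = Task("print_date"), Task("sleep"), Task("templated")
t1 >> [t2, t3]  # both t2 and t3 depend on t1
print(t1.downstream)
```

Reading it this way, `t1 >> [t2, t3]` fans out: `sleep` and `templated` both wait for `print_date`, but do not depend on each other.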
- Navigate to the UI at http://localhost:8080/admin; sample screenshots are at https://airflow.apache.org/docs/apache-airflow/stable/ui.html
# run a single task instance of the tutorial DAG for the given execution date
airflow run tutorial sleep 2020-12-04
- Command Line Metadata Validation
# print the list of active DAGs
airflow list_dags
# prints the list of tasks in the "tutorial" DAG
airflow list_tasks tutorial
# prints the hierarchy of tasks in the tutorial DAG
airflow list_tasks tutorial --tree
- Testing: The date specified in this context is called the execution_date. This is the logical date: it simulates the scheduler running your task or DAG for a specific date and time, even though the task physically runs now.
# command layout: command subcommand dag_id task_id date
# testing print_date
airflow test tutorial print_date 2015-06-01
# testing sleep
airflow test tutorial sleep 2015-06-01
# testing templated
airflow test tutorial templated 2015-06-01
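The difference between the logical date and the wall-clock time can be sketched in plain Python. This is only an illustration: Airflow renders templates such as `{{ ds }}` with Jinja, whereas the hypothetical `render_command` helper below uses `str.format` as a stand-in.

```python
from datetime import datetime


def render_command(template, execution_date):
    """Fill a {ds} placeholder from the logical date (hypothetical helper)."""
    return template.format(ds=execution_date.strftime("%Y-%m-%d"))


execution_date = datetime(2015, 6, 1)  # the date passed on the command line
started_at = datetime.now()            # when the task physically runs

cmd = render_command("echo processing data for {ds}", execution_date)
print(cmd)
print("physically started at:", started_at.isoformat())
```

The command is built from the logical date, so running the same test today or next month produces the same command text; only `started_at` changes.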
- Pausing/Unpausing a DAG
airflow pause tutorial
airflow unpause tutorial
- We can stop the shipped examples from showing up in the UI by setting load_examples to False in the airflow.cfg file
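As a sketch of that change, the snippet below flips the flag on a small in-memory sample rather than the real file. It assumes `load_examples` sits in the `[core]` section; the `dags_folder` value is a placeholder. Note that `configparser` drops comments when it writes a file back, so for the real `airflow.cfg` editing the line by hand is usually the safer route.

```python
import configparser
import io

# Tiny stand-in for ~/airflow/airflow.cfg (the real file has many more options)
sample_cfg = """\
[core]
dags_folder = /home/user/airflow/dags
load_examples = True
"""

# interpolation=None so that any '%' characters in values are left alone
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(sample_cfg)
parser.set("core", "load_examples", "False")

out = io.StringIO()
parser.write(out)
print(out.getvalue())
```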