[DE Zoomcamp] Airflow & WSL2: no-frills or no-thrills

-- Read about DataTalks.Club Data Engineering Zoomcamp --


The second week of the Data Engineering Zoomcamp by DataTalks.Club brought a new tool, one of the most popular data pipeline platforms - Apache Airflow. So we are going to create some workflows!

Intro

First you have to run the Docker Compose Airflow installation in the environment of your choice, which can be (but is not limited to) macOS, Linux, a GCP VM or the very popular WSL. What's more, we also need the Google Cloud SDK installed in our Airflow environment in order to connect to the Cloud Storage bucket and create tables in BigQuery. That means we cannot just use the official docker-compose.yaml referenced in Airflow's docs; we have to build a custom Dockerfile that extends the apache/airflow image with our additional dependencies. Then we can incorporate it into docker-compose.yaml 🙌
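Such an extended image can be sketched roughly like this - note that the base tag, the SDK installer invocation and the /opt install directory are my assumptions for illustration, not the course's actual Dockerfile:

```dockerfile
FROM apache/airflow:2.2.3

USER root
# Install the Google Cloud SDK so DAGs can talk to GCS and BigQuery
RUN curl -sSL https://sdk.cloud.google.com | bash -s -- --disable-prompts --install-dir=/opt
ENV PATH="/opt/google-cloud-sdk/bin:${PATH}"

# Switch back to the image's unprivileged user for the pip installs
USER airflow
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```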

Fortunately, the course instructors have prepared all the files for us, moreover in two versions:

  • Official Version which consists of:
    • airflow-scheduler
    • airflow-webserver
    • airflow-worker
    • airflow-init
    • flower
    • postgres
    • redis
    • celery
  • Custom No-Frills Version which consists of:
    • airflow-scheduler
    • airflow-webserver
    • postgres

Since I have some previous commercial experience with Airflow, I was pretty sure that for this course we would be fine with just the LocalExecutor, so I decided to follow the no-frills path with its limited number of services.

Soon it turned out that although this lightweight setup has a significant number of users who reported "It works❗️" on their macOS, Linux or Windows/WSL machines, for some unknown reason I still saw posts on the Slack channel from time to time asking for help with the X.Y.Z issue/error that happened while running the custom version on Windows/WSL.

No-frills troubleshooting

My curiosity forced me to take an in-depth look at this topic, especially because I was one of those Windows/WSL users and at that time it did not work for me either 🤣

My laptop has 8GB of RAM and 4 CPU cores. Assuming that Windows 10 + VS Code consume around 3GB by themselves, there is still ~5GB free for the Docker engine, which is 1GB above the recommended minimum (via the Airflow docs). I did not notice any CPU-related requirements, TBH.

Dry run

As I said before, there are three services defined in docker-compose.yaml, which should start in the following order: postgres first, then the scheduler and the webserver, which depend on it.
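In Compose terms, that ordering is expressed with depends_on; a minimal sketch (service names assumed, not the course's exact file) could look like:

```yaml
services:
  postgres:
    image: postgres:13

  scheduler:
    build: .
    depends_on:
      - postgres

  webserver:
    build: .
    depends_on:
      - postgres
```

Note that plain depends_on only orders container startup; it does not wait for postgres to actually be ready to accept connections.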

The first round of build & up took some time, but less than 10 minutes. Postgres was fine, the scheduler was trying to insert something into the database, which was not possible, and the webserver kept raising exceptions and restarting while trying to initialize Airflow's internal database.

(Not really) Issue #1: Unsupported syntax in .env file

Actually, I noticed this one while I was dealing with Issue #2:

nervuzz@DELL:~/repos/data-engineering-zoomcamp/WEEK_2/airflow$ docker compose config
services:
  postgres:
    deploy:
      resources:
        limits:
          memory: "314572800"
    environment:
      _AIRFLOW_WWW_USER_CREATE: "True"
      _AIRFLOW_WWW_USER_PASSWORD: :airflow}  # <---
      _AIRFLOW_WWW_USER_USERNAME: :airflow}  # <---
(...)

Obviously there is something wrong with those values. So what do we have in the .env file?

# .env

(...)
# Airflow
_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=${_AIRFLOW_WWW_USER_USERNAME:airflow}
_AIRFLOW_WWW_USER_PASSWORD=${_AIRFLOW_WWW_USER_PASSWORD:airflow}
(...)

Okay, at first sight there is nothing wrong with referencing another variable; other environment variables like AIRFLOW__CORE__SQL_ALCHEMY_CONN use this syntax as well.

But this is not just a reference to another variable, it's variable substitution syntax, which won't work inside a .env file according to the official Docker Compose documentation.

On the other hand, we can use it (as the instructors did) in the docker-compose.yaml file.
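For reference, inside docker-compose.yaml the substitution-with-default form does work - and note that the documented default syntax uses `:-` (or `-`), not a bare `:`, so a correct compose-file usage would look roughly like this sketch:

```yaml
services:
  webserver:
    environment:
      # Falls back to "airflow" when the variable is unset or empty
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
```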

Let's do some modifications and test its behavior again:

# .env

(...)
_TEST_1=buzz
# no variable _TEST_2
_AIRFLOW_TEST_1=${_TEST_1:foo}
_AIRFLOW_TEST_2=${_TEST_2:bar}
_AIRFLOW_TEST_3=${_TEST_2}
_AIRFLOW_TEST_4=${:_TEST_2}
nervuzz@DELL:~/repos/data-engineering-zoomcamp/WEEK_2/airflow$ docker compose config
(...)
    environment:
      (...)
      _AIRFLOW_TEST_1: buzz:foo}
      _AIRFLOW_TEST_2: :bar}
      _AIRFLOW_TEST_3: ""
      _AIRFLOW_TEST_4: $${:_TEST_2}
      _TEST_1: buzz
(...)

Solution: Set the values directly, the same way as we did for postgres:

# .env

(...)
_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
(...)

Side note: Since we are using the no-frills version of docker-compose.yaml with a custom entrypoint.sh, these environment variables are redundant anyway. Our webserver user (admin/admin) is created explicitly in the entrypoint:

# entrypoint.sh

(...)
airflow users create -r Admin -u admin -p admin -e admin@example.com -f admin -l airflow
(...)

Issue #2: ModuleNotFoundError: No module named 'airflow'

Let's look at some real example logs:

dtc-de-postgres-1   | 2022-02-02 21:25:47.569 UTC [1] LOG:  starting PostgreSQL 13.5 (Debian 13.5-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
dtc-de-postgres-1   | 2022-02-02 21:25:47.569 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
dtc-de-postgres-1   | 2022-02-02 21:25:47.570 UTC [1] LOG:  listening on IPv6 address "::", port 5432
dtc-de-postgres-1   | 2022-02-02 21:25:47.578 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
dtc-de-postgres-1   | 2022-02-02 21:25:47.596 UTC [1] LOG:  database system is ready to accept connections
dtc-de-webserver-1  | Traceback (most recent call last):
dtc-de-webserver-1  |   File "/home/airflow/.local/bin/airflow", line 5, in <module>
dtc-de-webserver-1  |     from airflow.__main__ import main
dtc-de-webserver-1  | ModuleNotFoundError: No module named 'airflow'
dtc-de-webserver-1  | Traceback (most recent call last):
dtc-de-webserver-1  |   File "/home/airflow/.local/bin/airflow", line 5, in <module>
dtc-de-webserver-1  |     from airflow.__main__ import main
dtc-de-webserver-1  | ModuleNotFoundError: No module named 'airflow'
dtc-de-webserver-1  | Traceback (most recent call last):
dtc-de-webserver-1  |   File "/home/airflow/.local/bin/airflow", line 5, in <module>
dtc-de-webserver-1  |     from airflow.__main__ import main
dtc-de-webserver-1  | ModuleNotFoundError: No module named 'airflow'
dtc-de-webserver-1 exited with code 1

Observations: Python cannot find the airflow module, but since we are using Airflow's official image, it must be there. Maybe we are not using the correct Python user install directory (where pip keeps the installed packages)?
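The user site directory is derived from the current user's home, so the same image yields different paths for different users - e.g. (illustrative, not the container's exact paths) /home/airflow/.local/... for user airflow vs /root/.local/... for root. A quick way to see where a given interpreter looks for `pip install --user` packages:

```shell
# Print the per-user site-packages directory for the current user;
# switching the container's USER changes which ~/.local is searched.
python3 -m site --user-site
```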

Workaround: Run the image with the default user

# Dockerfile

(...)
RUN chmod +x scripts
USER $AIRFLOW_UID  # <-- Delete or comment out that line

Solution: Full solution TBD

Explanation: root is the default user. More details TBD

Issue #3: ModuleNotFoundError: No module named 'psycopg2'

We have just gotten rid of one missing module and BOOM, there is another one:

dtc-de-postgres-1   | 2022-02-03 15:48:21.137 UTC [1] LOG:  database system is ready to accept connectio
dtc-de-webserver-1  | Traceback (most recent call last):
dtc-de-webserver-1  |   File "/home/airflow/.local/bin/airflow", line 5, in <module>
dtc-de-webserver-1  |     from airflow.__main__ import main
dtc-de-webserver-1  |   File "/root/.local/lib/python3.7/site-packages/airflow/__init__.py", line 46, in <module>
dtc-de-webserver-1  |     settings.initialize()
dtc-de-webserver-1  |   File "/root/.local/lib/python3.7/site-packages/airflow/settings.py", line 495, in initialize
dtc-de-webserver-1  |     configure_orm()
dtc-de-webserver-1  |   File "/root/.local/lib/python3.7/site-packages/airflow/settings.py", line 233, in configure_orm
dtc-de-webserver-1  |     engine = create_engine(SQL_ALCHEMY_CONN, connect_args=connect_args, **engine_args)
dtc-de-webserver-1  |   File "<string>", line 2, in create_engine
dtc-de-webserver-1  |   File "/root/.local/lib/python3.7/site-packages/sqlalchemy/util/deprecations.py", line 309, in warned
dtc-de-webserver-1  |     return fn(*args, **kwargs)
dtc-de-webserver-1  |   File "/root/.local/lib/python3.7/site-packages/sqlalchemy/engine/create.py", line 560, in create_engine
dtc-de-webserver-1  |     dbapi = dialect_cls.dbapi(**dbapi_args)
dtc-de-webserver-1  |   File "/root/.local/lib/python3.7/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 782, in dbapi
dtc-de-webserver-1  |     import psycopg2
dtc-de-webserver-1  | ModuleNotFoundError: No module named 'psycopg2'
dtc-de-webserver-1 exited with code 0

Solution: Add psycopg2 (or psycopg2-binary) to the Python requirements, and add two additional dependencies, libpq-dev and gcc, to the apt-get install command in the Dockerfile:

# requirements.txt

(...)
psycopg2
# Dockerfile

(...)
USER root
RUN apt-get update -qq && apt-get -y install libpq-dev gcc vim -qq
(...)

Issue #4: "pg_isready -U airflow": executable file not found in $PATH: unknown

Quite by accident, I noticed this error in the Docker engine logs:

WARN[2022-02-03T18:20:55.656604700+01:00] Health check for container 4ac563b736fb02c3f6265db442274bde97fda2c9c9209771cdb731536ff10a69 error: OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: "pg_isready -U airflow": executable file not found in $PATH: unknown 

Indeed, such a command is used as the health check test for the postgres service:

# docker-compose.yaml

version: '3'
services:
  postgres:
    image: postgres:13
      # ...
      healthcheck:
        test: ["CMD", "pg_isready -U airflow"]

BTW, pg_isready is a utility shipped with PostgreSQL which checks the connection status of a PostgreSQL server.

The Docker Compose file specification did not help much; it says:

  • when test is an array, use NONE, CMD or CMD-SHELL
  • but if it's a string, use CMD-SHELL or skip this instruction (syntax)

After some googling I found that the CMD-SHELL instruction prepends the test command with /bin/sh -c, in contrast to the CMD instruction, which executes the test command directly, without a shell. With CMD, every element of the array is a separate argument, so a single string like "pg_isready -U airflow" is looked up as one executable name, spaces included - and of course no file with that literal name exists anywhere on $PATH... 🥺
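The difference is easy to reproduce in any shell; here is a host-side analogy (using echo instead of pg_isready) of what Docker does in each case:

```shell
# CMD analogue: the whole string is looked up as ONE executable name,
# spaces included - there is no file literally called "echo hello":
"echo hello" 2>/dev/null || echo "exec failed: no such executable"

# CMD-SHELL analogue: /bin/sh -c word-splits the string into a command
# plus its arguments before executing it:
/bin/sh -c "echo hello"
```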

Solution: Everything should be clear now (at least for me), it's time to fix this error:

# docker-compose.yaml

test: ["CMD-SHELL", "pg_isready -U airflow"]
# or equivalent
test: pg_isready -U airflow

Bonus: Take control over RAM usage

Software-related issues can really be a pain, but there is something else that knows how to push your buttons - an operating system that freezes while you work 🔥🔥🔥

The docker compose build command is able to eat nearly 4.5GB!

[Screenshot: docker compose build memory warning]

But that's not a big issue, because with a reasonable Internet connection it takes 5 to 10 minutes (depending, of course, on the number of services) and you will probably notice the memory consumption peak close to the end of the operation.

Things are different with the docker compose up command. By default, the Docker engine gives services "permission" to consume as much memory as possible, with some predefined value as the upper bound, which in my case is around 6GB.

[Screenshot: per-container memory usage at idle]

This screenshot was taken when there were no running DAGs, no database transactions and no webserver traffic. So Airflow needs at least 1.24GB just to run its core components! I have not yet tested how much these values change with, let's say, two DAGs running every 5 minutes. I guess the webserver will not grow much; however, I'm pretty sure the scheduler will fluctuate a lot.
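Side note for WSL2 users: that ~6GB upper bound on an 8GB machine roughly matches WSL2's default allocation, since the WSL2 VM is allowed to claim most of the host's RAM. The VM itself (and thus Docker inside it) can be capped with a .wslconfig file on the Windows side - a sketch, with values you would tune to your own machine:

```
# %UserProfile%\.wslconfig  (restart WSL afterwards, e.g. `wsl --shutdown`)
[wsl2]
memory=5GB      # cap the VM's RAM
processors=4    # number of virtual CPUs
```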

Memory limits in Docker Compose

Fortunately, there is a built-in feature which enables us to limit RAM and CPU consumption at the per-service level.

# docker-compose.yaml

version: '3'
services:
    postgres:
        (...)
        deploy:
            resources:
                limits: 
                    memory: 300M

    scheduler:
        (...)
        deploy:
            resources:
                limits: 
                    memory: 1g

    webserver:
        (...)
        deploy:
            resources:
                limits: 
                    memory: 1300m

However, the compose up command must be changed for this configuration to work:

docker compose --compatibility up

# Options:
#    --compatibility      Run compose in backward compatibility mode

# See details: https://github.com/docker/compose/pull/5684

Now we can take a breath and tailor these limits to our needs!

[Screenshot: memory usage with per-service limits applied]

Feedback

Feel free to leave a comment!
