Meet Apache Airflow — #01

Adenilson Castro
4 min read · Jun 7, 2021


As a fresh Data Engineer, I’ve been introduced to a huge set of tools and technologies used in the day-to-day work. These tools form the skill set required for data engineering, which covers the whole data flow, from data generation to the final analysis. One of these tools is Apache Airflow, which I’ll discuss a little bit here. The objective is to cover some key topics and share the knowledge I’m acquiring over time, not (yet) to build a fully technical reference. Hence, you’ll find only the most important topics and an overview of Airflow, as I’m also still learning.

A pinwheel, the Apache Airflow symbol.

Apache Airflow, put in simple terms, is used to orchestrate tasks. That is, you can use it to schedule the execution of multiple activities, from a bash command to a Python script. In fact, it is quite similar to the well-known cron, but it offers a series of advantages. First of all, you can monitor the execution of the entire workflow through logs and a rich web-based interface. Additionally, Airflow is scalable and quite simple to set up: you can go from download to a running application in a few minutes using Docker. It is also simple to set up a task, as the entire workflow is defined in Python. Plus, it is an open-source tool with a huge community that expands Airflow’s functionality every day.

The first important concept you should be aware of is the DAG. A DAG, or Directed Acyclic Graph, is where you define how tasks should be executed inside Airflow. It collects the tasks and all of their dependencies, the order in which they should run, and some settings, such as how many times Airflow should retry a task in case of failure. As its name suggests, it is acyclic, which means that the execution follows a predefined order and can never return to a previous task, as it would in a cycle. Difficult to visualize? Let’s try with an example:

A simple DAG example, containing four tasks: A, B, C and D.
Source: Adapted from Airflow Docs

The figure above illustrates a simple DAG. The execution starts at task A and moves on to tasks B and C, which both depend on the successful execution of task A. Finally, task D is executed after the completion of tasks B and C, since it depends on both of their results. Here you can see that the workflow follows an order of execution that respects the tasks’ dependencies and does not allow cyclic execution, such as going from task D back to B or any other combination.
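To make this a bit more concrete, here is a minimal sketch of how the DAG in the figure could be written in Python. The dag_id, schedule, and bash commands are illustrative assumptions of mine; only the dependency structure mirrors the figure.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch of the A -> (B, C) -> D workflow from the figure.
# dag_id, schedule and the echo commands are illustrative assumptions.
with DAG(
    dag_id="example_abcd",
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 2,                         # how many times Airflow retries a failed task
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    task_a = BashOperator(task_id="A", bash_command="echo 'running A'")
    task_b = BashOperator(task_id="B", bash_command="echo 'running B'")
    task_c = BashOperator(task_id="C", bash_command="echo 'running C'")
    task_d = BashOperator(task_id="D", bash_command="echo 'running D'")

    # B and C depend on A; D depends on both B and C
    task_a >> [task_b, task_c]
    [task_b, task_c] >> task_d
```

The `>>` operator is Airflow’s shorthand for declaring dependencies, so the last two lines reproduce exactly the arrows you see in the figure.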

As I mentioned earlier, a DAG defines how tasks should be executed in the workflow. The task itself contains all the logic necessary to perform the action it was built for, along with its dependencies, while the DAG is responsible for organizing the set of tasks and controlling their execution. However, some tasks share logic that you may want to reuse across multiple tasks. For such cases, Airflow has what are called Operators.

An Operator is the second important concept you should be aware of when using Airflow. It is a task template: it already contains the logic necessary to execute a given action and only requires a few parameters. Airflow comes with several pre-installed operators, such as the BashOperator to execute bash commands, the PythonOperator to call Python code, and the EmailOperator to send an email, just to name a few. As it is an open-source tool, the Airflow community also provides a huge set of additional operators that enable connections to databases such as MySQL or PostgreSQL, and even integration with the Slack API. The possibilities are endless!
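As a small sketch of how these templates are parameterized, the snippet below wires a BashOperator and a PythonOperator together. The dag_id, task_ids, and the greet function are hypothetical names I made up for illustration; the operator classes and their parameters are the standard Airflow 2.x ones.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet(name: str) -> None:
    """Plain Python function that the PythonOperator will call."""
    print(f"Hello, {name}!")


# Hypothetical DAG just to show how operators are parameterized.
with DAG(
    dag_id="example_operators",
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,
) as dag:
    say_date = BashOperator(
        task_id="say_date",
        bash_command="date",           # the bash command to run
    )

    say_hello = PythonOperator(
        task_id="say_hello",
        python_callable=greet,         # the Python function to call
        op_kwargs={"name": "Airflow"}, # arguments passed to that function
    )

    say_date >> say_hello
```

Notice that each operator only needs a task_id and the parameters specific to its action; all the surrounding execution logic comes from the template itself.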

These are, from what I’ve learned so far, the most important concepts of Airflow. They may seem a little abstract at first, but once you get your hands on some code, you realize it is actually not complicated. That is why, in the next post, I’ll share a simple project I built to practice and actually visualize a DAG execution: a simple ETL (Extract-Transform-Load) Python pipeline, where my DAG collects the most recent news from the News API and loads them into an SQLite database.

That’s all for now!


Written by Adenilson Castro

Data Science Engineer @ Mercado Livre
