Directed Acyclic Graphs (DAGs) in Apache Airflow®

In Airflow®, a DAG (Directed Acyclic Graph) is a fundamental concept that represents a collection of tasks connected through dependencies. It's used to define and manage the workflow of a data pipeline or a series of data processing tasks.

Definition

Let's take a closer look at each notion that makes up this term:

  • Directed: The tasks in the DAG have dependencies and are connected in a specific direction. The direction indicates the order in which to execute the tasks.

  • Acyclic: This means a DAG doesn't contain any cycles or loops. In other words, you can't create a circular dependency between tasks, as that would lead to an infinite loop.

  • Graph: A DAG is a graph structure, where the tasks are represented as nodes, and the dependencies between tasks are represented as edges.

A DAG consists of multiple tasks, where each task represents a unit of work or a computational step.
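
To make the graph intuition concrete, here's a minimal, Airflow-independent sketch (the task names are made up): a DAG modeled as a plain Python mapping from each task to its upstream dependencies, with a topological sort producing a valid execution order.

from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; the set holds the tasks it depends on (its upstream tasks)
pipeline = {
    'extract': set(),                  # no dependencies, runs first
    'clean': {'extract'},              # runs after extract
    'aggregate': {'extract'},          # runs after extract
    'load': {'clean', 'aggregate'},    # runs after both transforms
}

# static_order() yields one dependency-respecting execution order;
# it raises CycleError if the graph is not acyclic
print(list(TopologicalSorter(pipeline).static_order()))
# ['extract', 'clean', 'aggregate', 'load'] (clean/aggregate may swap)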

In Airflow®, you define a DAG using Python.

The code describes the workflow structure by specifying the tasks and their relationships. Each task is typically an instance of an operator class that describes a specific type of work, as the sketch after this list shows:

  • Running a Python function
  • Executing an SQL query
  • Interacting with an external system
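
As a minimal sketch (using Airflow 2 module paths; the DAG ID, task IDs, and the echo command are invented for illustration), here's how a Python function and a shell command each become a task through an operator class. Operators for SQL queries and external systems typically come from separate provider packages:

from airflow import DAG
from airflow.operators.python import PythonOperator  # runs a Python callable
from airflow.operators.bash import BashOperator      # runs a shell command
from datetime import datetime

with DAG(
    'operator_examples',
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually, never scheduled
) as dag:
    run_python = PythonOperator(
        task_id='run_python',
        python_callable=lambda: print('Hello from Python'),
    )
    run_shell = BashOperator(
        task_id='run_shell',
        bash_command='echo "Hello from Bash"',
    )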

The dependencies between tasks are defined using bitshift operators (>> and <<) or the set_upstream and set_downstream methods. These declarations specify the order in which the tasks should execute.
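
For instance, given two task objects task_a and task_b (hypothetical names), all four lines below declare the same dependency, with task_a running before task_b:

task_a >> task_b               # task_b runs after task_a
task_b << task_a               # the same dependency, written right to left
task_a.set_downstream(task_b)  # method equivalent of >>
task_b.set_upstream(task_a)    # method equivalent of <<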

Airflow® provides a scheduler that continuously monitors DAGs and their dependencies. It automatically determines the next set of tasks to execute based on the state of the previous tasks and their dependencies. The scheduler ensures that tasks are executed in the correct order, taking into account any defined dependencies and the status of the tasks.

Usage example

Here's a basic example of a DAG:

from airflow import DAG
# Legacy (Airflow 1-style) import paths; in Airflow 2 they still work but emit deprecation warnings
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

# Define default arguments and a DAG configuration
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),  # replace with your own start date
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Instantiate the DAG
dag = DAG(
    'example_dag',
    default_args=default_args,
    description='An example DAG',
    schedule_interval=timedelta(days=1),
)

# Define tasks/operators
start_task = DummyOperator(task_id='start', dag=dag)

def print_hello():
    print("Hello, DoubleCloud Airflow®!")

hello_task = PythonOperator(
    task_id='hello',
    python_callable=print_hello,
    dag=dag,
)

end_task = DummyOperator(task_id='end', dag=dag)

# Set task dependencies
start_task >> hello_task >> end_task

This example contains three tasks: start, hello, and end.

  • The start and end tasks use DummyOperator. It's a placeholder and doesn't do any actual work.

  • The hello task uses PythonOperator to print a message.

  • The dependencies are specified using the >> operator. The hello task depends on the start task, and end depends on hello.

    This is the triggering sequence: hello doesn't run until start completes, and end starts only after hello finishes.
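
To try the example without waiting for the scheduler, you can run a single task in isolation with the Airflow CLI (assuming the file above sits in your DAGs folder; the date is just a sample logical date):

airflow dags list                                 # check that example_dag was parsed
airflow tasks test example_dag hello 2024-01-01   # run the hello task once, outside the scheduler

The tasks test command runs the task without checking its dependencies and without recording state in the database, which makes it convenient for debugging a single task.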

See also