Directed Acyclic Graph (DAG) in Airflow

In Apache Airflow, a DAG (Directed Acyclic Graph) is a fundamental concept that represents a collection of tasks connected to each other through dependencies. It's used to define and manage the workflow of a data pipeline or a series of data processing tasks.


Let's take a closer look at each part of this term:

  • Directed: The tasks in the DAG have dependencies and are connected in a specific direction. The direction indicates the order in which to execute the tasks.

  • Acyclic: This means a DAG doesn't contain any cycles or loops. In other words, you can't create a circular dependency between tasks, as that would lead to an infinite loop.

  • Graph: DAG is a graph structure, where the tasks are represented as nodes, and the dependencies between tasks are represented as edges.

A DAG consists of multiple tasks, where each task represents a unit of work or a computational step.
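Because the graph is acyclic, its tasks can always be arranged into a linear execution order (a topological order). Here's a minimal, stdlib-only sketch of that idea using hypothetical task names — this isn't Airflow code, just an illustration of the graph structure:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on
dependencies = {
    "extract": set(),            # no upstream tasks
    "transform": {"extract"},    # runs after extract
    "load": {"transform"},       # runs after transform
}

# Produce an execution order that respects every dependency
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['extract', 'transform', 'load']
```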

In Apache Airflow®, you define a DAG using Python code.

The code describes the workflow structure by specifying the tasks and their relationships. Each task is typically an instance of an operator class, describing a specific type of work:

  • running a Python function,
  • executing a SQL query,
  • interacting with an external system,
  • or others

The dependencies between tasks are defined using the bitshift operators (`>>` and `<<`) or the `set_upstream` and `set_downstream` methods. These dependency declarations specify the order in which tasks should execute based on their relationships.
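Under the hood, these bitshift operators work through Python operator overloading. As a rough, stdlib-only sketch (not Airflow's actual implementation), a task class only needs to define `__rshift__` to support the `a >> b >> c` syntax:

```python
# Toy stand-in for a task, illustrating how `>>` can declare
# dependencies via operator overloading (not Airflow's real classes)
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a
        self.downstream.append(other)
        return other  # returning `other` enables chaining: a >> b >> c

start = Task("start")
hello = Task("hello")
end = Task("end")

start >> hello >> end

print([t.task_id for t in start.downstream])  # ['hello']
print([t.task_id for t in hello.downstream])  # ['end']
```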

Apache Airflow® provides a scheduler that continuously monitors the DAGs and their dependencies. It automatically determines the next set of tasks to execute based on the state of the previous tasks and their dependencies. The scheduler ensures that tasks are executed in the correct order, taking into account any defined dependencies and the status of the tasks.
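The core of that scheduling decision can be sketched in a few lines. This is a deliberately simplified model (not Airflow's actual scheduler): a task becomes runnable once all of its upstream tasks have completed.

```python
def next_runnable(dependencies, completed):
    """Return tasks whose upstream dependencies are all complete
    and that haven't run yet (a simplified scheduling rule)."""
    return {task for task, upstream in dependencies.items()
            if task not in completed and upstream <= set(completed)}

# Hypothetical three-task pipeline
dependencies = {
    "start": set(),
    "hello": {"start"},
    "end": {"hello"},
}

print(next_runnable(dependencies, set()))       # {'start'}
print(next_runnable(dependencies, {"start"}))   # {'hello'}
```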

Usage example

Here's a basic example of a DAG:

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

# Define default arguments applied to the DAG's tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),  # replace with your own start date
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Instantiate the DAG
dag = DAG(
    'example_dag',
    default_args=default_args,
    description='An example DAG',
    schedule_interval=timedelta(days=1),
)

# Define tasks/operators
start_task = DummyOperator(task_id='start', dag=dag)

def print_hello():
    print("Hello, DoubleCloud Apache Airflow®!")

hello_task = PythonOperator(
    task_id='hello',
    python_callable=print_hello,
    dag=dag,
)

end_task = DummyOperator(task_id='end', dag=dag)

# Set task dependencies
start_task >> hello_task >> end_task

This is what happens in this example:

  • We set up three tasks: start, hello and end.

  • The start and end tasks use DummyOperator, which serves as a placeholder and doesn't do any actual work.

  • The hello task applies the PythonOperator to print a message.

  • The dependencies are set up using the >> operator: the hello task depends on the start task, and the end task depends on the hello task.

    This defines the triggering sequence: hello doesn't run until start completes, and end starts as soon as hello is done.

See also