Directed Acyclic Graphs (DAGs) in Apache Airflow®
In Airflow®, a DAG (Directed Acyclic Graph) is a fundamental concept that represents a collection of tasks connected to each other through dependencies. It's used to define and manage the workflow of a data pipeline or a series of data processing tasks.
Definition
Let's take a closer look at each part of the term:
- Directed: The tasks in the DAG have dependencies and are connected in a specific direction. The direction indicates the order in which to execute the tasks.
- Acyclic: A DAG doesn't contain any cycles or loops. In other words, you can't create a circular dependency between tasks, as that would lead to an infinite loop.
- Graph: A DAG is a graph structure, where the tasks are represented as nodes and the dependencies between tasks are represented as edges.
A DAG consists of multiple tasks, where each task represents a unit of work or a computational step.
In Airflow®, you define a DAG using Python. The code describes the workflow structure by specifying the tasks and their relationships. Each task is typically an instance of an operator class, describing a specific type of work (see the sketch after this list):
- Running a Python function
- Executing an SQL query
- Interacting with an external system
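For illustration, here's a minimal sketch of a DAG that combines two of these operator types. `BashOperator` and `PythonOperator` are standard Airflow operators; the DAG name, task IDs, and the shell command are invented for this example:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

# A fixed start date in the past so the scheduler can create runs
dag = DAG(
    'operator_examples',     # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # only runs when triggered manually
)

def transform():
    # Placeholder for a real Python processing step
    print("transforming data")

# Interacts with an external system by running a shell command
extract = BashOperator(task_id='extract', bash_command='echo extracting', dag=dag)

# Runs a Python function
transform_task = PythonOperator(task_id='transform', python_callable=transform, dag=dag)

extract >> transform_task
```

Matching the style of the full example further below, each task is attached to the DAG with the `dag=dag` argument.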
The dependencies between tasks are defined using the bitshift operators (`<<` and `>>`) or the `set_upstream` and `set_downstream` methods. These dependency declarations specify the order in which tasks should execute based on their relationships.
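As a short sketch, the example below declares the same ordering in several equivalent ways; the DAG name and the `task_a`/`task_b`/`task_c` tasks are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('dependency_styles', start_date=datetime(2023, 1, 1), schedule_interval=None)

task_a = DummyOperator(task_id='task_a', dag=dag)
task_b = DummyOperator(task_id='task_b', dag=dag)
task_c = DummyOperator(task_id='task_c', dag=dag)

# task_a runs before task_b; each commented-out line would declare the same edge
task_a >> task_b
# task_b << task_a
# task_a.set_downstream(task_b)
# task_b.set_upstream(task_a)

# Bitshift declarations can be chained; this adds the task_b -> task_c edge
# (the whole pipeline could also be written as task_a >> task_b >> task_c)
task_b >> task_c
```

All four forms produce the same graph edge; the bitshift style is the most common in modern DAG files.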
Airflow® provides a scheduler that continuously monitors DAGs and their dependencies. Based on the state of previously run tasks, it automatically determines the next set of tasks to execute, ensuring that tasks run in the correct order.
Usage example
Here's a basic example of a DAG:
```python
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

# Define default arguments and a DAG configuration
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),  # replace with your own start date
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Instantiate the DAG
dag = DAG(
    'example_dag',
    default_args=default_args,
    description='An example DAG',
    schedule_interval=timedelta(days=1),
)

# Define tasks/operators
start_task = DummyOperator(task_id='start', dag=dag)

def print_hello():
    print("Hello, DoubleCloud Airflow®!")

hello_task = PythonOperator(
    task_id='hello',
    python_callable=print_hello,
    dag=dag,
)

end_task = DummyOperator(task_id='end', dag=dag)

# Set task dependencies
start_task >> hello_task >> end_task
```
This example contains three tasks: `start`, `hello`, and `end`.
- The `start` and `end` tasks use `DummyOperator`.
- The `hello` task uses `PythonOperator`.
- The dependencies are specified using the `>>` operator: the `hello` task depends on the `start` task, and `end` depends on `hello`. This is the triggering sequence: `hello` doesn't start until `start` completes, and `end` starts after `hello` is finished.
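To try this out, place the file in your Airflow DAGs folder so the scheduler can pick it up; if you're running Airflow 2.x, you can also exercise a single run from the command line with `airflow dags test example_dag 2023-01-01`. Note that in recent Airflow 2 releases, `DummyOperator` has been renamed to `EmptyOperator` (`from airflow.operators.empty import EmptyOperator`), and the older import paths used above still work but emit deprecation warnings.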