How to Monitor Airflow with Zabbix

Background

The team I manage is split into three groups: Cloud Engineers, DevOps, and Data Engineers.
One day, a Data Engineer asked for the ability to receive alerts when Airflow DAGs fail.

While Airflow’s Web UI can show the status of DAG runs, it doesn’t provide a direct way to trigger alerts.
I searched for references, but most of what I found were Prometheus-based dashboards focused on visualization, not on the failure alerting we actually needed.

So I decided to build it myself, creating a setup with Discovery → Item Prototypes → Trigger Prototypes in Zabbix.
This post shares how I implemented Airflow DAG monitoring with Zabbix in an on-premises Kubernetes environment.
Note that the approach works in any setup—on-prem, Kubernetes, or cloud—as long as Zabbix can access the Airflow REST API.

(General Zabbix agent installation and configuration are not covered here.)


Why DAG Alerts Alone Aren’t Enough

Catching DAG failures is important, but sometimes the root cause is deeper: Airflow itself may not be healthy.

  • If the Scheduler stops, no DAG will run.
  • If the Metadata Database connection fails, the entire system halts.
  • If the Triggerer dies, event-based DAGs won’t fire.

In short:

  • DAG failure alerts = symptoms
  • Airflow health alerts = root cause

You need both to quickly detect and resolve incidents.


Concept

  1. Periodically fetch DAG IDs from the Airflow REST API.
  2. Query the latest DAG run state for each DAG.
  3. If the state is failed, trigger an alert in Zabbix.
  4. Monitor Airflow health endpoints for Scheduler, Metadata DB, and Triggerer processes.
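
Before building anything in Zabbix, it is worth confirming that the REST API is reachable with the account you plan to use. A minimal check from the Zabbix server, reusing the endpoint and basic-auth credentials that appear in the script below:

curl -s -u "testuser:testpasswd" "http://192.102.200.97:8080/api/v1/dags?limit=5" \
    | jq -r '.dags[].dag_id'

If this prints DAG IDs, the credentials and the network path from Zabbix to Airflow are fine.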

1. Work on the Zabbix Server (Linux)

1-1. DAG ID Collection Script

Create a script on the Zabbix server to pull DAG IDs from the Airflow REST API.

daginfo.sh

#!/bin/bash
API_URL="http://192.102.200.97:8080/api/v1/dags"
USERNAME="testuser"
PASSWORD="testpasswd"
LIMIT=100
OFFSET=0

> /tmp/airflow_dag_ids.txt

while true; do
    curl -s -u "$USERNAME:$PASSWORD" \
        -o /tmp/airflow_response.json \
        -w "%{http_code}" "${API_URL}?limit=${LIMIT}&offset=${OFFSET}" > /tmp/http_status.txt
    HTTP_STATUS=$(cat /tmp/http_status.txt)
    RESPONSE_CONTENT=$(cat /tmp/airflow_response.json)

    if [ "$HTTP_STATUS" -ne 200 ]; then
        echo "Error: HTTP status $HTTP_STATUS" >&2
        echo "Response content: $RESPONSE_CONTENT" >&2
        break
    fi

    # Extract dag_id and append to file
    echo "$RESPONSE_CONTENT" | jq -r '.dags[].dag_id' >> /tmp/airflow_dag_ids.txt

    OFFSET=$((OFFSET + LIMIT))

    if [ "$(echo "$RESPONSE_CONTENT" | jq '.dags | length')" -eq 0 ]; then
        break
    fi
done

echo "All DAG IDs saved to /tmp/airflow_dag_ids.txt"

Schedule it with cron to refresh every 3 hours:

0 */3 * * * /home/example/daginfo.sh
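
Assuming curl and jq are installed on the Zabbix server, run the script once by hand and check the output file before relying on cron:

chmod +x /home/example/daginfo.sh
/home/example/daginfo.sh
head /tmp/airflow_dag_ids.txt    # one DAG ID per line is expected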

2. Work in the Zabbix UI

2-1. Master Item

  • Name: DAG ID Master Item
  • Key: vfs.file.contents[/tmp/airflow_dag_ids.txt]
  • Type: Zabbix agent
  • Data type: Text
  • Interface: 127.0.0.1:10050
  • Update interval: 2h
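
If the zabbix_get utility is available, you can verify that the agent actually serves the file before building discovery on top of it (an optional sanity check):

zabbix_get -s 127.0.0.1 -p 10050 -k "vfs.file.contents[/tmp/airflow_dag_ids.txt]"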

2-2. LLD (Low-Level Discovery) Rule

Use the Master Item to dynamically discover DAG IDs.

  • Name: Airflow DAGID Discovery
  • Key: airflow.discovery.dagsid
  • Master item: DAG ID Master Item

Preprocessing (JavaScript):

try {
    var lines = value.split(/\r?\n/);
    var data = [];
    for (var i = 0; i < lines.length; i++) {
        var dag_id = lines[i].trim();
        if (dag_id) {
            data.push({ "{#DAG_ID}": dag_id });
        }
    }
    return JSON.stringify({ "data": data });
} catch (error) {
    return JSON.stringify({ "data": [] });
}
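
To illustrate what this rule produces (the DAG names are placeholders): if /tmp/airflow_dag_ids.txt contains

example_dag_a
example_dag_b

the preprocessing step returns the LLD JSON that Zabbix expects:

{"data": [{"{#DAG_ID}": "example_dag_a"}, {"{#DAG_ID}": "example_dag_b"}]}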

2-3. Item Prototype (DAG State)

Check the most recent run state for each DAG.

  • Name: DAG {#DAG_ID} Runs State
  • Key: dagruns.status[{#DAG_ID}]
  • Type: HTTP agent
  • URL: http://192.102.200.97:8080/api/v1/dags/{#DAG_ID}/dagRuns?order_by=-execution_date
  • Authentication: Basic Auth (Airflow account)
  • Interval: 5m

Preprocessing (JavaScript):

var parsedData = JSON.parse(value);
if (parsedData.dag_runs && parsedData.dag_runs.length > 0) {
    var latestRun = parsedData.dag_runs[0];
    var state = latestRun.state;
    if (state === "failed") {
        return 1;
    } else {
        return 0;
    }
} else {
    return 0;
}
  • Return values:
    • Failure → 1
    • Success/Running → 0
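
You can reproduce what this item sees by calling the same endpoint by hand (example_dag_a is a placeholder DAG ID; the credentials are the ones used in daginfo.sh):

curl -s -u "testuser:testpasswd" \
    "http://192.102.200.97:8080/api/v1/dags/example_dag_a/dagRuns?order_by=-execution_date&limit=1" \
    | jq -r '.dag_runs[0].state'

A value of failed maps to 1 in the item, anything else to 0.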

2-4. Trigger Prototype

  • Name: DAG {#DAG_ID} is not healthy
  • Expression: last(/Airflow - example Product/dagruns.status[{#DAG_ID}])=1
  • Recovery expression: last(/Airflow - example Product/dagruns.status[{#DAG_ID}])=0
  • Severity: Warning (adjust as needed)

2-5. Airflow Health Checks
(Trigger configuration for these items is omitted; it can follow the same last()=1 / last()=0 pattern used in 2-4.)

(1) Scheduler Health

  • Name: Airflow Scheduler Health
  • Key: airflow.health
  • Type: HTTP agent
  • URL: http://192.102.200.97:8080/health
  • Preprocessing (JavaScript):
var parsedData;
try {
    if (!value) throw "Empty response";
    parsedData = JSON.parse(value);
} catch (e) {
    return 1;
}
if (parsedData.scheduler && parsedData.scheduler.status === "healthy") {
    return 0;
} else {
    return 1;
}

(2) Metadata DB Health

  • Name: Airflow Metadata Health
  • Key: airflow.metadata.health
  • Type: HTTP agent
  • URL: http://192.102.200.97:8080/health
  • Preprocessing (JavaScript):
var parsedData;
try {
    if (!value) throw "Empty response";
    parsedData = JSON.parse(value);
} catch (e) {
    return 1;
}
if (parsedData.metadatabase && parsedData.metadatabase.status === "healthy") {
    return 0;
} else {
    return 1;
}

(3) Triggerer Health

  • Name: Airflow Triggerer Health
  • Key: airflow.trigger.health
  • Type: HTTP agent
  • URL: http://192.102.200.97:8080/health
  • Preprocessing (JavaScript):
var parsedData;
try {
    if (!value) throw "Empty response";
    parsedData = JSON.parse(value);
} catch (e) {
    return 1;
}
if (parsedData.triggerer && parsedData.triggerer.status === "healthy") {
    return 0;
} else {
    return 1;
}
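
All three items parse the same /health endpoint, so a single manual call shows everything they look at. The endpoint typically requires no authentication, and the exact fields depend on the Airflow version (the triggerer block only appears in recent 2.x releases):

curl -s http://192.102.200.97:8080/health | jq .

A healthy instance returns roughly:

{
  "metadatabase": { "status": "healthy" },
  "scheduler": { "status": "healthy", "latest_scheduler_heartbeat": "..." },
  "triggerer": { "status": "healthy", "latest_triggerer_heartbeat": "..." }
}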

Wrap-Up

With this setup, Zabbix can:

  • Automatically discover DAG IDs and alert on failed runs.
  • Continuously monitor the health of Scheduler, Metadata DB, and Triggerer.

This way you cover both sides:

  • DAG failure alerts (symptoms)
  • Airflow health alerts (root causes)

While I built this in an on-premises Kubernetes environment, the method works anywhere.
If Zabbix can reach the Airflow REST API, you can apply the same pattern in VMs, cloud, or managed services.

In Airflow operations, what you really need is not another dashboard but immediate alerts.
This approach delivers exactly that.

ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.