How to Monitor Airflow with Zabbix

Background

The team I manage is split into three groups: Cloud Engineers, DevOps, and Data Engineers.
One day, a Data Engineer asked for the ability to receive alerts when Airflow DAGs fail.

While Airflow’s Web UI can show the status of DAG runs, it doesn’t provide a direct way to trigger alerts.
I searched for references, but most of what I found were Prometheus-based dashboards focused on visualization, not on the failure alerting we actually needed.

So I decided to build it myself, creating a setup with Discovery → Item Prototypes → Trigger Prototypes in Zabbix.
This post shares how I implemented Airflow DAG monitoring with Zabbix in an on-premises Kubernetes environment.
Note that the approach works in any setup—on-prem, Kubernetes, or cloud—as long as Zabbix can access the Airflow REST API.

(General Zabbix agent installation and configuration are not covered here.)


Why DAG Alerts Alone Aren’t Enough

Catching DAG failures is important, but sometimes the root cause is deeper: Airflow itself may not be healthy.

  • If the Scheduler stops, no DAG will run.
  • If the Metadata Database connection fails, the entire system halts.
  • If the Triggerer dies, event-based DAGs won’t fire.

In short:

  • DAG failure alerts = symptoms
  • Airflow health alerts = root cause

You need both to quickly detect and resolve incidents.


Concept

  1. Periodically fetch DAG IDs from the Airflow REST API.
  2. Query the latest DAG run state for each DAG.
  3. If the state is failed, trigger an alert in Zabbix.
  4. Monitor Airflow health endpoints for Scheduler, Metadata DB, and Triggerer processes.
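
Before building anything in Zabbix, it is worth confirming that the REST API is reachable with the account you plan to use. A minimal check from the Zabbix server, reusing the endpoint and basic-auth credentials that appear in the script below:

curl -s -u "testuser:testpasswd" "http://192.102.200.97:8080/api/v1/dags?limit=5" \
    | jq -r '.dags[].dag_id'

If this prints DAG IDs, the credentials and the network path from Zabbix to Airflow are fine.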

1. Work on the Zabbix Server (Linux)

1-1. DAG ID Collection Script

Create a script on the Zabbix server to pull DAG IDs from the Airflow REST API.

daginfo.sh

#!/bin/bash
API_URL="http://192.102.200.97:8080/api/v1/dags"
USERNAME="testuser"
PASSWORD="testpasswd"
LIMIT=100
OFFSET=0

> /tmp/airflow_dag_ids.txt

while true; do
    curl -s -u "$USERNAME:$PASSWORD" \
        -o /tmp/airflow_response.json \
        -w "%{http_code}" "${API_URL}?limit=${LIMIT}&offset=${OFFSET}" > /tmp/http_status.txt
    HTTP_STATUS=$(cat /tmp/http_status.txt)
    RESPONSE_CONTENT=$(cat /tmp/airflow_response.json)

    if [ "$HTTP_STATUS" -ne 200 ]; then
        echo "Error: HTTP status $HTTP_STATUS" >&2
        echo "Response content: $RESPONSE_CONTENT" >&2
        break
    fi

    # Extract dag_id and append to file
    echo "$RESPONSE_CONTENT" | jq -r '.dags[].dag_id' >> /tmp/airflow_dag_ids.txt

    OFFSET=$((OFFSET + LIMIT))

    if [ "$(echo "$RESPONSE_CONTENT" | jq '.dags | length')" -eq 0 ]; then
        break
    fi
done

echo "All DAG IDs saved to /tmp/airflow_dag_ids.txt"

Schedule it with cron to refresh every 3 hours:

0 */3 * * * /home/example/daginfo.sh
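
Assuming curl and jq are installed on the Zabbix server, run the script once by hand and check the output file before relying on cron:

chmod +x /home/example/daginfo.sh
/home/example/daginfo.sh
head /tmp/airflow_dag_ids.txt    # one DAG ID per line is expected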

2. Work in the Zabbix UI

2-1. Master Item

  • Name: DAG ID Master Item
  • Key: vfs.file.contents[/tmp/airflow_dag_ids.txt]
  • Type: Zabbix agent
  • Data type: Text
  • Interface: 127.0.0.1:10050
  • Update interval: 2h
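
If the zabbix_get utility is available, you can verify that the agent actually serves the file before building discovery on top of it (an optional sanity check):

zabbix_get -s 127.0.0.1 -p 10050 -k "vfs.file.contents[/tmp/airflow_dag_ids.txt]"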

2-2. LLD (Low-Level Discovery) Rule

Use the Master Item to dynamically discover DAG IDs.

  • Name: Airflow DAGID Discovery
  • Key: airflow.discovery.dagsid
  • Master item: DAG ID Master Item

Preprocessing (JavaScript):

try {
    var lines = value.split(/\r?\n/);
    var data = [];
    for (var i = 0; i < lines.length; i++) {
        var dag_id = lines[i].trim();
        if (dag_id) {
            data.push({ "{#DAG_ID}": dag_id });
        }
    }
    return JSON.stringify({ "data": data });
} catch (error) {
    return JSON.stringify({ "data": [] });
}
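
To illustrate what this rule produces (the DAG names are placeholders): if /tmp/airflow_dag_ids.txt contains

example_dag_a
example_dag_b

the preprocessing step returns the LLD JSON that Zabbix expects:

{"data": [{"{#DAG_ID}": "example_dag_a"}, {"{#DAG_ID}": "example_dag_b"}]}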

2-3. Item Prototype (DAG State)

Check the most recent run state for each DAG.

  • Name: DAG {#DAG_ID} Runs State
  • Key: dagruns.status[{#DAG_ID}]
  • Type: HTTP agent
  • URL: http://192.102.200.97:8080/api/v1/dags/{#DAG_ID}/dagRuns?order_by=-execution_date
  • Authentication: Basic Auth (Airflow account)
  • Interval: 5m

Preprocessing (JavaScript):

var parsedData = JSON.parse(value);
if (parsedData.dag_runs && parsedData.dag_runs.length > 0) {
    var latestRun = parsedData.dag_runs[0];
    var state = latestRun.state;
    if (state === "failed") {
        return 1;
    } else {
        return 0;
    }
} else {
    return 0;
}
  • Return values:
    • Failure → 1
    • Success/Running → 0
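
You can reproduce what this item sees by calling the same endpoint by hand (example_dag_a is a placeholder DAG ID; the credentials are the ones used in daginfo.sh):

curl -s -u "testuser:testpasswd" \
    "http://192.102.200.97:8080/api/v1/dags/example_dag_a/dagRuns?order_by=-execution_date&limit=1" \
    | jq -r '.dag_runs[0].state'

A value of failed maps to 1 in the item, anything else to 0.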

2-4. Trigger Prototype

  • Name: DAG {#DAG_ID} is not healthy
  • Expression: last(/Airflow - example Product/dagruns.status[{#DAG_ID}])=1
  • Recovery expression: last(/Airflow - example Product/dagruns.status[{#DAG_ID}])=0
  • Severity: Warning (adjust as needed)

2-5. Airflow Health Checks
(Trigger configuration for these items is omitted; it can follow the same last()=1 / last()=0 pattern used in 2-4.)

(1) Scheduler Health

  • Name: Airflow Scheduler Health
  • Key: airflow.health
  • Type: HTTP agent
  • URL: http://192.102.200.97:8080/health
  • Preprocessing (JavaScript):
var parsedData;
try {
    if (!value) throw "Empty response";
    parsedData = JSON.parse(value);
} catch (e) {
    return 1;
}
if (parsedData.scheduler && parsedData.scheduler.status === "healthy") {
    return 0;
} else {
    return 1;
}

(2) Metadata DB Health

  • Name: Airflow Metadata Health
  • Key: airflow.metadata.health
  • Type: HTTP agent
  • URL: http://192.102.200.97:8080/health
  • Preprocessing (JavaScript):
var parsedData;
try {
    if (!value) throw "Empty response";
    parsedData = JSON.parse(value);
} catch (e) {
    return 1;
}
if (parsedData.metadatabase && parsedData.metadatabase.status === "healthy") {
    return 0;
} else {
    return 1;
}

(3) Triggerer Health

  • Name: Airflow Triggerer Health
  • Key: airflow.trigger.health
  • Type: HTTP agent
  • URL: http://192.102.200.97:8080/health
  • Preprocessing (JavaScript):
var parsedData;
try {
    if (!value) throw "Empty response";
    parsedData = JSON.parse(value);
} catch (e) {
    return 1;
}
if (parsedData.triggerer && parsedData.triggerer.status === "healthy") {
    return 0;
} else {
    return 1;
}
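
All three items parse the same /health endpoint, so a single manual call shows everything they look at. The endpoint typically requires no authentication, and the exact fields depend on the Airflow version (the triggerer block only appears in recent 2.x releases):

curl -s http://192.102.200.97:8080/health | jq .

A healthy instance returns roughly:

{
  "metadatabase": { "status": "healthy" },
  "scheduler": { "status": "healthy", "latest_scheduler_heartbeat": "..." },
  "triggerer": { "status": "healthy", "latest_triggerer_heartbeat": "..." }
}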

Wrap-Up

With this setup, Zabbix can:

  • Automatically discover DAG IDs and alert on failed runs.
  • Continuously monitor the health of Scheduler, Metadata DB, and Triggerer.

This way you cover both sides:

  • DAG failure alerts (symptoms)
  • Airflow health alerts (root causes)

While I built this in an on-premises Kubernetes environment, the method works anywhere.
If Zabbix can reach the Airflow REST API, you can apply the same pattern in VMs, cloud, or managed services.

In Airflow operations, what you really need is not another dashboard but immediate alerts.
This approach delivers exactly that.

ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.