Background
The team I manage is divided into three parts: Cloud Engineers, DevOps, and Data Engineers.
One day, a Data Engineer asked for the ability to receive alerts when Airflow DAGs fail.
While Airflow’s Web UI can show the status of DAG runs, it doesn’t provide a direct way to trigger alerts.
I searched for references, but most of what I found were Prometheus-based dashboards. They were focused on visualization, not on alerting for failures, which was the actual requirement.
So I decided to build it myself, creating a setup with Discovery → Item Prototypes → Trigger Prototypes in Zabbix.
This post shares how I implemented Airflow DAG monitoring with Zabbix in an on-premises Kubernetes environment.
Note that the approach works in any setup—on-prem, Kubernetes, or cloud—as long as Zabbix can access the Airflow REST API.
(General Zabbix agent installation and configuration are not covered here.)
Why DAG Alerts Alone Aren’t Enough
Catching DAG failures is important, but sometimes the root cause is deeper: Airflow itself may not be healthy.
- If the Scheduler stops, no DAG will run.
- If the Metadata Database connection fails, the entire system halts.
- If the Triggerer dies, event-based DAGs won’t fire.
In short:
- DAG failure alerts = symptoms
- Airflow health alerts = root cause
You need both to quickly detect and resolve incidents.
Concept
- Periodically fetch DAG IDs from the Airflow REST API.
- Query the latest DAG run state for each DAG.
- If the state is failed, trigger an alert in Zabbix.
- Monitor Airflow health endpoints for the Scheduler, Metadata DB, and Triggerer.
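Before wiring anything into Zabbix, it helps to see the shape of the /dags response the collection works against. The sample below is a hypothetical, trimmed Airflow 2.x response; jq pulls out the dag_id values the same way the collection script does.

```shell
# Hypothetical, trimmed /api/v1/dags response (Airflow 2.x shape)
RESPONSE='{"dags":[{"dag_id":"etl_daily"},{"dag_id":"ml_train"}],"total_entries":2}'

# Extract one dag_id per line, as the collection script does
echo "$RESPONSE" | jq -r '.dags[].dag_id'
# etl_daily
# ml_train
```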
1. Work on the Zabbix Server (Linux)
1-1. DAG ID Collection Script
Create a script on the Zabbix server to pull DAG IDs from the Airflow REST API.
daginfo.sh
#!/bin/bash
# Collect all DAG IDs from the Airflow REST API, paginating with limit/offset.
API_URL="http://192.102.200.97:8080/api/v1/dags"
USERNAME="testuser"
PASSWORD="testpasswd"
LIMIT=100
OFFSET=0

> /tmp/airflow_dag_ids.txt

while true; do
  HTTP_STATUS=$(curl -s -u "$USERNAME:$PASSWORD" \
    -o /tmp/airflow_response.json \
    -w "%{http_code}" "${API_URL}?limit=${LIMIT}&offset=${OFFSET}")
  RESPONSE_CONTENT=$(cat /tmp/airflow_response.json)

  if [ "$HTTP_STATUS" -ne 200 ]; then
    echo "Error: HTTP status $HTTP_STATUS" >&2
    echo "Response content: $RESPONSE_CONTENT" >&2
    break
  fi

  # Stop once a page comes back empty
  if [ "$(echo "$RESPONSE_CONTENT" | jq '.dags | length')" -eq 0 ]; then
    break
  fi

  # Extract dag_id values and append them to the file
  echo "$RESPONSE_CONTENT" | jq -r '.dags[].dag_id' >> /tmp/airflow_dag_ids.txt
  OFFSET=$((OFFSET + LIMIT))
done

echo "All DAG IDs saved to /tmp/airflow_dag_ids.txt"
Schedule it with cron to refresh every 3 hours:
0 */3 * * * /home/example/daginfo.sh
2. Work in the Zabbix UI
2-1. Master Item
- Name: DAG ID Master Item
- Key: vfs.file.contents[/tmp/airflow_dag_ids.txt]
- Type: Zabbix agent
- Data type: Text
- Interface: 127.0.0.1:10050
- Update interval: 2h

2-2. LLD (Low-Level Discovery) Rule
Use the Master Item to dynamically discover DAG IDs.
- Name: Airflow DAGID Discovery
- Key: airflow.discovery.dagsid
- Master item: DAG ID Master Item
Preprocessing (JavaScript):
try {
    var lines = value.split(/\r?\n/);
    var data = [];
    for (var i = 0; i < lines.length; i++) {
        var dag_id = lines[i].trim();
        if (dag_id) {
            data.push({ "{#DAG_ID}": dag_id });
        }
    }
    return JSON.stringify({ "data": data });
} catch (error) {
    return JSON.stringify({ "data": [] });
}
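To see what the rule produces, the same line-to-LLD transformation can be reproduced with jq (illustration only; in Zabbix the JavaScript preprocessing above does this work). The input simulates the contents of /tmp/airflow_dag_ids.txt; the output is the LLD JSON Zabbix consumes.

```shell
# Simulate /tmp/airflow_dag_ids.txt contents and build the LLD JSON
printf 'dag_a\ndag_b\n' \
  | jq -c -R -s 'split("\n") | map(select(length > 0) | {"{#DAG_ID}": .}) | {data: .}'
# {"data":[{"{#DAG_ID}":"dag_a"},{"{#DAG_ID}":"dag_b"}]}
```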


2-3. Item Prototype (DAG State)
Check the most recent run state for each DAG.
- Name: DAG {#DAG_ID} Runs State
- Key: dagruns.status[{#DAG_ID}]
- Type: HTTP agent
- URL: http://192.102.200.97:8080/api/v1/dags/{#DAG_ID}/dagRuns?order_by=-execution_date
- Authentication: Basic Auth (Airflow account)
- Interval: 5m
Preprocessing (JavaScript):
var parsedData = JSON.parse(value);
if (parsedData.dag_runs && parsedData.dag_runs.length > 0) {
    var latestRun = parsedData.dag_runs[0];
    if (latestRun.state === "failed") {
        return 1;
    } else {
        return 0;
    }
} else {
    return 0;
}
- Return values:
  - Failure → 1
  - Success/Running → 0



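The preprocessing step reduces the API response to 0 or 1. The same mapping can be checked from a shell, using a hypothetical, trimmed dagRuns response (newest run first, as order_by=-execution_date returns it):

```shell
# Hypothetical /dagRuns response; first element is the latest run
SAMPLE='{"dag_runs":[{"state":"failed"},{"state":"success"}],"total_entries":2}'

# Map the latest state to the item value: failed -> 1, anything else -> 0
echo "$SAMPLE" | jq 'if .dag_runs[0].state == "failed" then 1 else 0 end'
# 1
```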
2-4. Trigger Prototype
- Name: DAG {#DAG_ID} is not healthy
- Expression: last(/Airflow - example Product/dagruns.status[{#DAG_ID}])=1
- Recovery expression: last(/Airflow - example Product/dagruns.status[{#DAG_ID}])=0
- Severity: Warning (adjust as needed)

2-5. Airflow Health Checks
(Trigger configuration for these health items is omitted; it follows the same pattern as in 2-4.)
(1) Scheduler Health
- Name: Airflow Scheduler Health
- Key: airflow.health
- Type: HTTP agent
- URL: http://192.102.200.97:8080/health
- Preprocessing (JavaScript):
var parsedData;
try {
    if (!value) throw "Empty response";
    parsedData = JSON.parse(value);
} catch (e) {
    return 1;
}
if (parsedData.scheduler && parsedData.scheduler.status === "healthy") {
    return 0;
} else {
    return 1;
}



(2) Metadata DB Health
- Name: Airflow Metadata Health
- Key: airflow.metadata.health
- Type: HTTP agent
- URL: http://192.102.200.97:8080/health
- Preprocessing (JavaScript):
var parsedData;
try {
    if (!value) throw "Empty response";
    parsedData = JSON.parse(value);
} catch (e) {
    return 1;
}
if (parsedData.metadatabase && parsedData.metadatabase.status === "healthy") {
    return 0;
} else {
    return 1;
}



(3) Triggerer Health
- Name: Airflow Triggerer Health
- Key: airflow.trigger.health
- Type: HTTP agent
- URL: http://192.102.200.97:8080/health
- Preprocessing (JavaScript):
var parsedData;
try {
    if (!value) throw "Empty response";
    parsedData = JSON.parse(value);
} catch (e) {
    return 1;
}
if (parsedData.triggerer && parsedData.triggerer.status === "healthy") {
    return 0;
} else {
    return 1;
}
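All three health items parse the same /health payload, just under different keys. The healthy/unhealthy mapping can be sketched with jq against a hypothetical, trimmed response (the real payload also carries heartbeat timestamps):

```shell
# Hypothetical /health response (Airflow 2.x shape, trimmed)
HEALTH='{"metadatabase":{"status":"healthy"},"scheduler":{"status":"healthy"},"triggerer":{"status":"unhealthy"}}'

# Per-component mapping: healthy -> 0, anything else -> 1
for component in scheduler metadatabase triggerer; do
  echo "$HEALTH" | jq --arg c "$component" \
    'if .[$c].status == "healthy" then 0 else 1 end'
done
# 0
# 0
# 1
```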



Wrap-Up
With this setup, Zabbix can:
- Automatically discover DAG IDs and alert on failed runs.
- Continuously monitor the health of Scheduler, Metadata DB, and Triggerer.
This way you cover both sides:
- DAG failure alerts (symptoms)
- Airflow health alerts (root causes)
While I built this in an on-premises Kubernetes environment, the method works anywhere.
If Zabbix can reach the Airflow REST API, you can apply the same pattern in VMs, cloud, or managed services.
In Airflow operations, what you really need is not another dashboard but immediate alerts.
This approach delivers exactly that.
ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.