An Examination of Monitoring Metrics: Part 2 Kafka

In the previous article, we looked at MySQL metrics. This time, we turn to Kafka.
In production environments, Kafka has grown beyond being just a simple message queue to become a critical data streaming platform.
Therefore, closely monitoring the state of Kafka brokers and clusters is essential for preventing incidents and ensuring stable performance.

In this article, we will focus on interpreting and understanding the key Kafka metrics available in the Grafana dashboard (based on Zabbix data).


1. Offline Partitions Count

  • Meaning: The number of partitions in the cluster that have lost their leader and are therefore inaccessible.
  • Normal: 0.
  • Abnormal: Occurs during broker failure, network isolation, or disk I/O errors.
    👉 Operational Point: Even a single offline partition can imply potential data loss. Investigate immediately.

2. Under Replicated Partitions (URP)

  • Meaning: Partitions where follower replicas are not synchronized with the leader.
  • Normal: 0.
  • Abnormal: Appears under broker overload, network delay, or ISR shrink.
    👉 Operational Point: URP is one of the most critical alert indicators in Kafka operations. Even short spikes are a warning sign. If sustained, cluster scaling or incident handling is required.

3. GC Pause

  • Meaning: The time the broker JVM application is stopped while Garbage Collection (GC) is executed.
  • Young GC (Minor GC): Short and frequent, usually harmless.
  • Old GC (Full GC): Rare, but if it lasts hundreds of milliseconds to seconds, the broker essentially stops.
    👉 Operational Point: High average pause time leads to message processing delays. Monitor Heap size and GC tuning together.

⚠️ Note: The GC Pause metric is not included by default in the Zabbix templates (Apache Kafka by JMX / Generic Java JMX). The following item was created manually using existing JMX items:

change(//jmx["java.lang:name=G1 Young Generation,type=GarbageCollector","CollectionTime"]) 
/ 
(change(//jmx["java.lang:name=G1 Young Generation,type=GarbageCollector","CollectionCount"]) + 0.000001)

4. Heap Memory Usage

  • Meaning: The usage status of the Kafka broker JVM Heap.
  • Normal: Stable between 60–80%.
  • Abnormal: Spikes above 90% may cause OutOfMemoryError and cascading Full GCs.
    👉 Operational Point: Heap usage should always be reviewed alongside GC Pause. High usage + increasing GC Pause means tuning is required.

⚠️ Note: JVM Heap usage is also not part of the default template. A new item was created using existing JMX items with this formula:

100 * last(//jmx["java.lang:type=Memory","HeapMemoryUsage.used"]) 
/ 
last(//jmx["java.lang:type=Memory","HeapMemoryUsage.max"])

5. Message Throughput

  • Meaning: Number of messages processed per second (Messages In / Out per second).
  • Usage: Identifying traffic patterns, spotting load peaks, and tracing bottlenecks during incidents.
  • Abnormal: Sudden drops in throughput for specific topics may indicate producer or consumer issues.
    👉 Operational Point: Throughput should not be interpreted in isolation—always cross-check with Consumer Lag and Broker Latency.

6. Kafka Cluster I/O Traffic

  • Meaning: Broker-level network traffic (in MB/s).
  • Usage: Assessing topic/partition load balancing.
  • Abnormal: If traffic is concentrated on specific brokers, partition leader imbalance is likely.
    👉 Operational Point: Larger producer message sizes or batch sizes can also increase network load. Broker balancing should be reviewed.

7. ISR (In-Sync Replicas) Changes

  • Definition: The set of follower replicas in each partition that are fully synchronized with the leader.
  • A follower is included in the ISR once it has caught up with all messages written to the leader.
  • Shrink = ISR count decreases (follower falls behind).
  • Expand = ISR count increases (follower catches up and rejoins).

Normal State

  • Replica count = ISR count
  • All replicas are synchronized with the leader → 100% data durability.

ISR Shrink

  • ISR count < Replica count
  • Follower failed to keep up (delay, network issues, broker overload, etc.).
  • Correlated Metric: URP increases.

ISR Expand

  • ISR count grows back toward Replica count
  • Followers rejoin after catching up with the leader.
  • Indicates recovery.

👉 Operational Point: ISR changes directly impact URP.

  • ISR shrink → URP increases → data durability risk.
  • Always set alerts on ISR events.

Summary

The key Kafka metrics to monitor in the Grafana dashboard (Zabbix-based) are:

  • Offline Partitions Count – Data accessibility
  • Under Replicated Partitions (URP) – Data reliability
  • GC Pause / Heap Memory Usage – JVM stability
  • Message Throughput – Processing performance
  • Cluster I/O Traffic – Network bottlenecks
  • ISR Changes – Data durability

Kafka monitoring is not about looking at single metrics in isolation. It is about reading correlations:
For example, Heap usage increase → GC Pause increase → Message throughput drop → URP increase.
This sequence reveals risks long before they escalate into incidents.

ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. 본문 및 이미지를 무단 복제·배포할 수 없습니다. 공유 시 반드시 원문 링크를 명시해 주세요.
ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.

🛠 마지막 수정일: 2025.09.18