1. Cluster Health Metrics

cluster health
Overall cluster status.
- Green = healthy
- Yellow = all primary shards assigned, but one or more replica shards unassigned
- Red = at least one primary shard unassigned; risk of data loss
unassigned shards
Number of shards not assigned to any node.
- Normal value: 0
- Increases when disk space is low, a node goes down, or shard relocation is delayed
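As a rough illustration of the two checks above, the sketch below turns a health payload into an alert message. It assumes a dictionary shaped like the JSON returned by Elasticsearch's `GET _cluster/health` (fields `status` and `unassigned_shards`); the function name `assess_cluster_health` and the sample payloads are made up for this example.

```python
def assess_cluster_health(health: dict) -> str:
    """Turn a _cluster/health-style payload into a short alert message."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)

    if status == "red":
        return f"CRITICAL: status red, {unassigned} unassigned shard(s)"
    if status == "yellow" or unassigned > 0:
        return f"WARNING: status {status}, {unassigned} unassigned shard(s)"
    if status == "green":
        return "OK: all shards assigned"
    return f"UNKNOWN: unexpected status {status!r}"


# Hypothetical sample payloads (not real cluster output).
print(assess_cluster_health({"status": "green", "unassigned_shards": 0}))
print(assess_cluster_health({"status": "yellow", "unassigned_shards": 3}))
```

Treating any non-zero unassigned count as at least a warning matches the "Normal value: 0" rule above.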
2. Resource Metrics

Total size of all file stores / Total available size to JVM in all file stores
- Total = physical disk capacity across all data paths
- Available = actual usable space as reported to the JVM (excludes filesystem reservations/quotas)
- Used to determine whether new shards can be allocated
⚠️ What to watch when Available drops rapidly
- Typical causes: index growth, log bursts, or an increased replica count
- Watermark thresholds (default values):
  - 85% used (low watermark) → no new shards are allocated to the node
  - 90% used (high watermark) → Elasticsearch tries to relocate existing shards away from the node
  - 95% used (flood stage) → indices with a shard on the node are switched to read-only
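The thresholds above can be sketched as a small classifier. This is a minimal example that assumes the default percentages listed here and takes Total and Available as byte counts; the function name `disk_watermark_stage` and the sample sizes are hypothetical.

```python
from typing import Tuple


def disk_watermark_stage(total_bytes: int, available_bytes: int) -> Tuple[float, str]:
    """Return (used percent, watermark stage) for one node's file stores."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * (total_bytes - available_bytes) / total_bytes

    if used_pct >= 95.0:
        stage = "flood_stage"  # affected indices become read-only
    elif used_pct >= 90.0:
        stage = "high"         # shards are relocated away from the node
    elif used_pct >= 85.0:
        stage = "low"          # no new shards are allocated to the node
    else:
        stage = "ok"
    return used_pct, stage


GIB = 1024 ** 3
# Hypothetical node: 1000 GiB total, 120 GiB still available to the JVM.
pct, stage = disk_watermark_stage(total_bytes=1000 * GIB, available_bytes=120 * GIB)
print(f"{pct:.1f}% used -> {stage}")
```

The percentage is computed from Available rather than from data size, so filesystem reservations count as used space. If the cluster overrides the `cluster.routing.allocation.disk.watermark.*` settings, the hard-coded defaults here would need to change accordingly.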
Summary
- Looking only at Total can be misleading; Available is often much smaller.
- Total size = raw physical capacity.
- Total available to JVM = what Elasticsearch can actually use.
- Not related to JVM Heap; reflects only filesystem availability.
- Always monitor Available for real operational decisions.
jvm_heap_usage_percent
- JVM Heap utilization.
- Sustained 85%+ → Full GC runs become more frequent, raising the risk of latency spikes.
- 95%+ → OutOfMemoryError becomes likely.
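Because heap usage normally rises and falls with each GC cycle, a single high reading says little; the word "sustained" matters. A minimal sketch of that idea, assuming a list of recent `jvm_heap_usage_percent` samples (oldest first); the function name and the five-sample window are arbitrary choices for illustration:

```python
from typing import Sequence


def heap_alert(samples: Sequence[float], sustained: int = 5) -> str:
    """Classify recent jvm_heap_usage_percent samples (oldest first)."""
    if not samples:
        return "no-data"
    if samples[-1] >= 95.0:
        return "critical"  # OutOfMemoryError becomes likely
    recent = samples[-sustained:]
    if len(recent) == sustained and min(recent) >= 85.0:
        return "warning"   # heap never dropped below 85% in the window
    return "ok"


# Hypothetical sample series.
print(heap_alert([70, 86, 88, 87, 90, 89]))  # stays above 85% for 5 samples
print(heap_alert([60, 91, 62, 88, 61]))      # sawtooth: GC keeps reclaiming heap
```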
node uptime
- Node runtime duration.
- Frequent restarts are an early sign of instability.
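One simple way to spot restarts from this metric alone: uptime only grows while a node is running, so any drop between two consecutive samples means the node restarted in between. A small sketch, with a hypothetical function name and sample values:

```python
from typing import Sequence


def count_restarts(uptime_seconds: Sequence[float]) -> int:
    """Count restarts in a series of uptime samples (oldest first)."""
    return sum(
        1
        for prev, cur in zip(uptime_seconds, uptime_seconds[1:])
        if cur < prev  # uptime went backwards, so the node restarted
    )


# Hypothetical samples taken once per minute.
samples = [3600, 3660, 40, 100, 160, 15, 75]
print(count_restarts(samples))
```

A restart that completes between two samples is still detected, as long as the new uptime is lower than the previous reading.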
3. Performance Metrics

query latency
Search query response time.
- Rising latency in milliseconds signals degraded user experience.
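Elasticsearch's node stats expose cumulative search counters (`query_total` and `query_time_in_millis` under `indices.search`), so an average latency has to be derived from the difference between two snapshots. Whether the monitored "query latency" item is computed exactly this way is an assumption; the sketch below only shows the general technique, with made-up snapshot values.

```python
from typing import Optional


def avg_query_latency_ms(prev: dict, cur: dict) -> Optional[float]:
    """Average latency per query between two counter snapshots."""
    queries = cur["query_total"] - prev["query_total"]
    millis = cur["query_time_in_millis"] - prev["query_time_in_millis"]
    if queries <= 0 or millis < 0:
        return None  # no traffic, or counters were reset by a restart
    return millis / queries


# Hypothetical snapshots taken one interval apart.
prev = {"query_total": 10_000, "query_time_in_millis": 250_000}
cur = {"query_total": 10_500, "query_time_in_millis": 290_000}
print(avg_query_latency_ms(prev, cur))  # 40,000 ms over 500 queries
```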
service response_time
REST API response time.
- Persistent increases indicate backend resource bottlenecks.
4. Indexing & Connection Metrics

flush latency
Time required to complete a flush operation.
- High values indicate disk I/O bottlenecks.
Indexing flow:
- Document → written to the in-memory indexing buffer and appended to the translog
- Refresh → buffer written out as a new segment, which makes the documents searchable
- Flush → segments committed to disk (Lucene commit); translog entries covered by the commit are no longer needed
Operational meaning:
- Higher flush latency → slower disk I/O, larger translogs, and longer recovery times after failures
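To make the flow above concrete, here is a toy in-memory model. It is a deliberate simplification and not how Elasticsearch or Lucene is implemented; the class `ToyShard` and its methods are invented for this illustration. It shows why documents are not searchable until a refresh, and why a translog that is not flushed means more operations to replay during recovery.

```python
class ToyShard:
    """Toy model of the indexing flow; not Elasticsearch's implementation."""

    def __init__(self) -> None:
        self.buffer = []    # indexed but not yet searchable
        self.translog = []  # operations recorded since the last flush
        self.segments = []  # searchable segments
        self.committed = 0  # number of segments made durable by a flush

    def index(self, doc: str) -> None:
        # Each document goes to the buffer and is recorded in the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self) -> None:
        # The buffer becomes a new segment, which makes it searchable.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self) -> None:
        # Simplified commit: write out the buffer, mark all segments
        # durable, and drop the translog entries that are now covered.
        self.refresh()
        self.committed = len(self.segments)
        self.translog.clear()

    def search(self) -> list:
        return [doc for seg in self.segments for doc in seg]

    def ops_to_replay(self) -> int:
        # Work needed after a crash: everything since the last flush.
        return len(self.translog)


shard = ToyShard()
shard.index("a")
shard.index("b")
print(shard.search())         # nothing searchable before a refresh
shard.refresh()
print(shard.search())         # both documents are now visible
print(shard.ops_to_replay())  # still 2: not yet flushed
shard.flush()
print(shard.ops_to_replay())  # 0 after the flush
```

In this model, the slower or rarer the flush, the larger `ops_to_replay()` grows, which is the link between flush latency and recovery time.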
http connections opened
Number of open HTTP connections.
- Spikes may suggest client-side load surges or connection pooling issues.
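What counts as a "spike" depends on the cluster's normal traffic, so a relative threshold is usually more useful than a fixed number. The sketch below compares the current value with the mean of recent samples; the function name, the 2x factor, and the sample values are assumptions for illustration, not recommended settings.

```python
from statistics import mean
from typing import Sequence


def is_connection_spike(history: Sequence[int], current: int, factor: float = 2.0) -> bool:
    """True if `current` exceeds `factor` times the mean of recent samples."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    return baseline > 0 and current > factor * baseline


# Hypothetical recent readings of open HTTP connections.
history = [100, 110, 95, 105]
print(is_connection_spike(history, 240))
print(is_connection_spike(history, 150))
```

If the value stays high after a spike, clients that open a new connection per request instead of reusing pooled connections are a likely suspect.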
✅ Operational Takeaways
- Cluster Health + unassigned shards → the first and most critical stability check
- Disk usage (Available) + JVM Heap → best indicators of capacity risks
- Query Latency + Response Time → primary bottleneck detectors
- Flush Latency + HTTP Connections → highlight data processing delays and client load pressure
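ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.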
🛠 Last modified: 2025.09.18