An Examination of Monitoring Metrics: Part 4 Elasticsearch

1. Cluster Health Metrics

cluster health
Overall cluster status.

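  • Green = all primary and replica shards are assigned
  • Yellow = all primary shards are assigned, but one or more replica shards are unassigned
  • Red = one or more primary shards are unassigned; some data is unavailable and there is a risk of data loss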
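
As a minimal sketch, the three colors can be mapped to alert severities from a parsed `GET _cluster/health` response. The function name and the severity labels are illustrative assumptions; `status` is the field the API returns.

```python
def health_severity(health: dict) -> str:
    """Map a parsed _cluster/health response to a (hypothetical) alert severity."""
    status = health.get("status")
    if status == "green":
        return "ok"
    if status == "yellow":
        # Data is fully available, but redundancy is reduced.
        return "warning"
    if status == "red":
        # Part of the data cannot be searched or written right now.
        return "critical"
    # Missing or unexpected value: treat as a monitoring problem, not as healthy.
    return "unknown"
```

Returning "unknown" instead of "ok" for a missing status keeps a broken collector from silently looking like a healthy cluster.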

unassigned shards
Number of shards not assigned to any node.

  • Normal value: 0
  • Increases when disk space is low, a node goes down, or shard relocation is delayed
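
When the count rises above 0, grouping the unassigned shards by reason usually narrows down the cause quickly. The sketch below assumes rows parsed from `GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason`; the sample rows and the function name are made up for illustration.

```python
from collections import Counter

def unassigned_reasons(shards: list[dict]) -> Counter:
    """Count unassigned shards by reason from parsed _cat/shards JSON rows."""
    return Counter(
        row.get("unassigned.reason") or "UNKNOWN"
        for row in shards
        if row.get("state") == "UNASSIGNED"
    )

# Hypothetical sample rows, shaped like the _cat/shards JSON output.
sample = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "state": "STARTED", "unassigned.reason": None},
    {"index": "logs-1", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-2", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-3", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "INDEX_CREATED"},
]

print(unassigned_reasons(sample).most_common())
# [('NODE_LEFT', 2), ('INDEX_CREATED', 1)]
```

For a single stuck shard, `GET _cluster/allocation/explain` gives a more detailed answer than the reason code alone.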

2. Resource Metrics

Total size of all file stores / Total available size to JVM in all file stores

  • Total = physical disk capacity across all data paths
  • Available = actual usable space as reported to the JVM (excludes filesystem reservations/quotas)
  • Used to determine whether new shards can be allocated

⚠️ What to watch when Available drops rapidly

  • Caused by index growth, log bursts, or replica expansion
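  • Disk watermark thresholds (default values, applied per node):
    • 85% used (low watermark) → no new shards allocated to the node
    • 90% used (high watermark) → Elasticsearch tries to relocate existing shards away from the node
    • 95% used (flood stage) → indices with a shard on the node are switched to read-only (deletes still allowed)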
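
The thresholds above can be checked against the Total and Available values with a few lines of arithmetic. This is a rough sketch that assumes the default watermarks and byte counts such as `fs.total.total_in_bytes` and `fs.total.available_in_bytes` from the node stats API; the function names are illustrative, and a cluster with custom watermark settings needs different limits.

```python
def disk_used_percent(total_bytes: int, available_bytes: int) -> float:
    """Percentage of the file store that is no longer available."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - available_bytes) / total_bytes

def watermark_stage(used_percent: float,
                    low: float = 85.0,
                    high: float = 90.0,
                    flood: float = 95.0) -> str:
    """Classify disk usage against the (assumed default) watermarks."""
    if used_percent >= flood:
        return "flood_stage"  # affected indices become read-only
    if used_percent >= high:
        return "high"         # shards start relocating away
    if used_percent >= low:
        return "low"          # no new shards allocated here
    return "ok"

# Example: 1 TB store with 80 GB still available.
used = disk_used_percent(1_000_000_000_000, 80_000_000_000)
print(round(used, 1), watermark_stage(used))
```

Alerting somewhat below 85% leaves time to add capacity or delete old indices before allocation is affected.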

Summary

  • Looking only at Total can be misleading; Available is often much smaller.
  • Total size = raw physical capacity.
  • Total available to JVM = what Elasticsearch can actually use.
  • Not related to JVM Heap; reflects only filesystem availability.
  • Always monitor Available for real operational decisions.

jvm_heap_usage_percent

  • JVM Heap utilization.
  • Sustained 85%+ → Full GC runs more often, raising the risk of latency spikes.
  • 95%+ → requests may be rejected and OutOfMemoryError becomes likely.
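
"Sustained" matters here: a single high reading right before a garbage collection is normal. One way to express that, sketched below, is to warn only when every sample in a recent window is above the threshold. The window of 5 samples and the function name are assumptions; the readings would come from `jvm.mem.heap_used_percent` in the node stats API.

```python
def heap_alert(samples: list[int],
               warn: int = 85,
               critical: int = 95,
               window: int = 5) -> str:
    """Classify chronological heap_used_percent samples (oldest first)."""
    if not samples:
        return "unknown"
    if samples[-1] >= critical:
        # Latest reading is already in the danger zone.
        return "critical"
    recent = samples[-window:]
    if len(recent) == window and min(recent) >= warn:
        # Heap never dropped below the warning level in the whole window.
        return "warning"
    return "ok"

print(heap_alert([70, 88, 86, 90, 87, 89]))  # warning
print(heap_alert([60, 92, 70, 91, 88]))      # ok
```

The second call returns "ok" because the heap fell back to 60% and 70% inside the window, which is the sawtooth pattern of a healthy JVM.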

node uptime

  • Node runtime duration.
  • Frequent restarts are an early sign of instability.
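
Because uptime only grows while a node is running, a restart shows up as a drop between two consecutive samples. A minimal sketch, assuming a chronological list of `jvm.uptime_in_millis` readings from the node stats API (the function name is illustrative):

```python
def count_restarts(uptimes_ms: list[int]) -> int:
    """Count restarts in chronological uptime samples (oldest first)."""
    return sum(
        1
        for prev, cur in zip(uptimes_ms, uptimes_ms[1:])
        if cur < prev  # uptime went backwards, so the JVM restarted
    )

print(count_restarts([1_000, 61_000, 5_000, 65_000, 2_000]))  # 2
```

A node that restarts twice between polls is counted only once, so shorter polling intervals give a more accurate picture.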

3. Performance Metrics

query latency
Search query response time.

  • Rising latency in milliseconds signals degraded user experience.
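
Elasticsearch reports search activity as cumulative counters, so an average latency is usually derived from the difference between two samples. The sketch below assumes two snapshots of the `indices.search` section of the node stats API, which contains `query_total` and `query_time_in_millis`; the function name is illustrative.

```python
from typing import Optional

def avg_query_latency_ms(prev: dict, cur: dict) -> Optional[float]:
    """Average query time per query between two indices.search snapshots."""
    queries = cur["query_total"] - prev["query_total"]
    millis = cur["query_time_in_millis"] - prev["query_time_in_millis"]
    if queries <= 0 or millis < 0:
        # No queries in the interval, or counters were reset by a restart.
        return None
    return millis / queries

prev = {"query_total": 1000, "query_time_in_millis": 20000}
cur = {"query_total": 1500, "query_time_in_millis": 32500}
print(avg_query_latency_ms(prev, cur))  # 25.0
```

These counters are kept per shard, so the result is time spent in the query phase on shards rather than the end-to-end time a client sees. The same delta approach works for the flush counters (`indices.flush.total` and `indices.flush.total_time_in_millis`).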

service response_time
REST API response time.

  • Persistent increases indicate backend resource bottlenecks.
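
Response time can also be probed from the outside by timing a lightweight request. The following is a sketch using only the standard library; the URL `http://localhost:9200/` is an assumption, and a cluster with security enabled would additionally need TLS and credentials.

```python
import time
import urllib.request

def timed(fetch):
    """Run fetch() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fetch()
    return result, time.perf_counter() - start

def fetch_root(url: str = "http://localhost:9200/", timeout: float = 5.0) -> int:
    """Request the root endpoint and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # include body transfer in the measurement
        return resp.status

# Usage against a reachable node (hypothetical address):
#   status, seconds = timed(fetch_root)
#   print(status, round(seconds * 1000, 1), "ms")
```

Tracking the trend over many probes is more useful than any single value, since one slow response can simply be network jitter.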

4. Indexing & Connection Metrics

flush latency
Time required to complete a flush operation.

  • Indicates disk I/O bottlenecks.

Indexing flow:

  • Document → in-memory indexing buffer, with the operation also recorded in the translog
  • Refresh → buffer written out as a new segment, making the documents searchable
  • Flush → segments committed (fsynced) to disk, after which the old translog can be discarded
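
The difference between refresh and flush is easier to see in a toy model. The class below is purely illustrative and is not how Lucene or Elasticsearch is implemented; all names are made up. It only captures which step makes documents searchable and which step lets the translog shrink.

```python
class ToyShard:
    """Toy model of the write path: buffer, translog, refresh, flush."""

    def __init__(self):
        self.buffer = []    # indexed, but not yet searchable
        self.segments = []  # searchable segments
        self.committed = 0  # segments covered by the last commit
        self.translog = []  # operations since the last commit

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)  # kept so the operation can be replayed

    def refresh(self):
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        self.refresh()
        self.committed = len(self.segments)  # everything is now durable
        self.translog.clear()                # nothing left to replay

    def search(self):
        return [doc for segment in self.segments for doc in segment]

shard = ToyShard()
shard.index("doc-1")
print(shard.search())  # []
shard.refresh()
print(shard.search())  # ['doc-1']
shard.flush()
print(len(shard.translog))  # 0
```

In this model, anything still in the translog is what would have to be replayed after a crash, which is why slow or infrequent flushes lengthen recovery.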

Operational meaning:

  • Higher flush latency → points to slow disk I/O; translogs grow larger
    and recovery after a failure takes longer

http connections opened
Number of open HTTP connections.

  • Spikes may suggest client-side load surges or connection pooling issues.
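
The `http` section of the node stats API reports both `current_open` (connections open right now) and `total_opened` (a cumulative counter). Comparing the two helps separate a genuine load surge from poor connection reuse. A minimal sketch, with an illustrative function name and made-up sample values:

```python
def connection_open_rate(prev: dict, cur: dict, interval_s: float) -> float:
    """New HTTP connections per second between two http stats snapshots."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    opened = cur["total_opened"] - prev["total_opened"]
    # A negative delta means the counter was reset by a node restart.
    return max(opened, 0) / interval_s

prev = {"current_open": 40, "total_opened": 10_000}
cur = {"current_open": 42, "total_opened": 10_600}
print(connection_open_rate(prev, cur, 60))  # 10.0
```

In this example about 10 connections are opened per second while the number of open connections barely moves, which would hint that clients are opening a new connection per request instead of reusing pooled ones.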

✅ Operational Takeaways

  • Cluster Health + unassigned shards → the first and most critical stability check
  • Disk usage (Available) + JVM Heap → best indicators of capacity risks
  • Query Latency + Response Time → primary bottleneck detectors
  • Flush Latency + HTTP Connections → highlight data processing delays and client load pressure
ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.

🛠 Last modified: 2025.09.18