Kubernetes Cluster Architecture for Machine Learning (On-Prem)

50-Node ML Kubernetes Cluster Design

This document explains how to design a Kubernetes cluster optimized for machine learning workloads in an on-premises environment that combines CPU and GPU nodes.
The core goals are efficient GPU resource utilization and high-performance I/O through a tiered storage architecture.


1. Overview

The cluster consists of 50 nodes (15 GPU + 35 CPU).
GPU nodes handle model training, while CPU nodes are used for data preprocessing, ingestion, and model serving.

[Figure: Kubernetes cluster architecture]

Architecture Summary:

  • GPU Nodes (1–15): Tier 1 — Local NVMe disks
  • CPU Nodes (1–35): Tier 2 — Ceph storage built from local CPU-node disks
  • NAS (MinIO): Tier 3 — Backup and long-term storage
  • Networking: Hybrid InfiniBand + Ethernet configuration

2. Network Design (Ethernet + InfiniBand)

2.1 Architecture

Ethernet Switch

  • Common across all nodes.
  • Handles Kubernetes internal traffic such as control plane communication, API calls, and Pod networking.

InfiniBand Switch

  • Used exclusively for GPU nodes.
  • Provides high-performance RDMA links for distributed training (e.g., AllReduce, NCCL).

2.2 Implementation Policy

  • Kubernetes manages only the Ethernet network (via CNI such as Cilium or Calico).
  • InfiniBand is configured as an external network and accessed directly within Pods.
  • Only GPU nodes have InfiniBand NICs enabled, creating a dedicated high-speed path for GPU-to-GPU communication.

This design avoids the cost of InfiniBand NICs on CPU nodes while removing the network bottleneck for distributed training across GPU nodes.
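For illustration, the sketch below shows how a training Pod on a GPU node might request both GPUs and an InfiniBand device exposed by an RDMA device plugin (e.g. the k8s-rdma-shared-dev-plugin). The node label, image, HCA name, and the rdma/hca_shared_devices_a resource name are assumptions that depend on how the plugin is configured.

```yaml
# Illustrative only: the RDMA resource name and node label depend on the
# device plugin configuration; the image is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: ddp-worker-0
spec:
  nodeSelector:
    node-type: gpu                       # assumed label marking the 15 GPU nodes
  containers:
    - name: trainer
      image: pytorch-trainer:latest      # placeholder training image
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]              # commonly required for RDMA memory registration
      env:
        - name: NCCL_IB_HCA              # tell NCCL which IB adapter to use
          value: mlx5_0                  # assumed HCA device name
      resources:
        limits:
          nvidia.com/gpu: 4                    # GPUs via the NVIDIA device plugin
          rdma/hca_shared_devices_a: 1         # IB device via the RDMA shared device plugin (name is config-dependent)
```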


3. Tiered Storage Architecture

Machine learning workloads demand multiple types of storage for datasets, checkpoints, and model artifacts.
To handle this efficiently, the storage is divided into three tiers.

Tier   | Configuration                  | Purpose                       | Notes
Tier 1 | Local NVMe disks on GPU nodes  | Scratch / Checkpoint only     | Ultra-fast temporary storage, Local PV
Tier 2 | Ceph storage on CPU nodes      | Dataset / Checkpoint storage  | Uses Ceph RBD or CephFS
Tier 3 | NAS (MinIO backend)            | Long-term retention / Backup  | Object storage, S3-compatible

3.1 Tier 1 — Local NVMe

  • Each GPU node uses a local NVMe SSD as a Local Persistent Volume.
  • Stores temporary checkpoints and intermediate files during training.
  • Managed with a Delete reclaim policy to prioritize speed and simplicity.
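A minimal sketch of how this tier might be declared, assuming the static local-volume approach: a no-provisioner StorageClass with a Delete reclaim policy, plus one PersistentVolume per NVMe disk. The device path, capacity, and node name are placeholders.

```yaml
# Local PV StorageClass for the GPU-node NVMe scratch tier.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme-scratch
provisioner: kubernetes.io/no-provisioner   # local PVs are statically provisioned
reclaimPolicy: Delete                       # Delete requires the local static provisioner (or tooling) to clean up
volumeBindingMode: WaitForFirstConsumer     # bind only once the consuming Pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nvme-gpu-node-01
spec:
  capacity:
    storage: 1Ti                            # assumed NVMe capacity
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme-scratch
  local:
    path: /mnt/nvme0                        # assumed mount point of the local NVMe disk
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["gpu-node-01"]       # assumed node name
```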

3.2 Tier 2 — Ceph Cluster

  • CPU node disks are aggregated into a Ceph cluster.
  • Stores shared datasets and intermediate training results.
  • GPU nodes can mount Ceph volumes for shared access across workloads.
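As a sketch, a shared dataset could be exposed to training Pods through a ReadWriteMany PVC backed by CephFS; the StorageClass name and size below are assumptions (e.g. a class provided by Rook or the Ceph CSI driver).

```yaml
# Assumes a CephFS CSI StorageClass is already installed;
# "cephfs-shared" and the 5Ti size are placeholders for this sketch.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: imagenet-dataset
spec:
  accessModes: ["ReadWriteMany"]    # CephFS allows shared access from GPU and CPU nodes
  storageClassName: cephfs-shared
  resources:
    requests:
      storage: 5Ti
```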

3.3 Tier 3 — MinIO NAS

  • Implements a MinIO-based object storage layer for long-term backups.
  • Serves as a backup target for Ceph data.
  • Synchronization is automated with tools like mc mirror or rclone.
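One possible way to automate this inside the cluster is a nightly CronJob that runs mc mirror from a Ceph-backed volume to a MinIO bucket. The schedule, bucket name, PVC name, and the Secret holding the MinIO alias/credentials are assumptions.

```yaml
# Hedged sketch of a nightly Ceph-to-MinIO sync; names and schedule are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ceph-to-minio-backup
spec:
  schedule: "0 2 * * *"                     # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mc-mirror
              image: minio/mc:latest        # MinIO client image (pin a tag in practice)
              envFrom:
                - secretRef:
                    name: minio-credentials # assumed Secret providing an MC_HOST_backup alias (endpoint + keys)
              command: ["/bin/sh", "-c"]
              args:
                - mc mirror --overwrite /data backup/ml-artifacts   # local path -> MinIO alias/bucket
              volumeMounts:
                - name: ceph-data
                  mountPath: /data
          volumes:
            - name: ceph-data
              persistentVolumeClaim:
                claimName: imagenet-dataset   # Tier 2 CephFS PVC (assumed name)
```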

4. Scheduling Policy

Efficient GPU scheduling is crucial for ML workloads.
The cluster adopts a Volcano + Kueue combination for fairness and performance.

4.1 Volcano (Gang Scheduling)

  • A job starts only when all required GPU slots are available.
  • Prevents partial allocation and avoids deadlocks in multi-GPU workloads.
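A minimal Volcano Job sketch of this behavior, assuming the NVIDIA device plugin and a default Volcano queue; the job name, image, and sizes are placeholders. With minAvailable equal to the total number of workers, no Pod starts until all 16 GPUs can be allocated at once.

```yaml
# Gang scheduling sketch: all 4 workers (4 GPUs each) start together or not at all.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-train
spec:
  schedulerName: volcano
  minAvailable: 4                  # gang size: all 4 workers or nothing
  queue: default                   # Volcano queue (assumed)
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: pytorch-trainer:latest   # placeholder training image
              resources:
                limits:
                  nvidia.com/gpu: 4
```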

4.2 Kueue (Fairness / Borrowing / Preemption)

  • Divides GPU resources into queues by project or department.
  • Idle GPUs can be temporarily borrowed by other teams.
  • Urgent jobs are prioritized through preemption.

In short, Volcano ensures job concurrency, while Kueue enforces fairness and responsiveness across multiple teams.
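As an illustration of the queue model above, the sketch below uses Kueue's v1beta1 API to give one team a guaranteed GPU quota, allow borrowing within a shared cohort, and enable preemption; all names and quota numbers are assumptions.

```yaml
# Hedged Kueue sketch: per-team GPU queue with cohort borrowing and preemption.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100                         # assumed GPU flavor name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-vision
spec:
  cohort: ml-gpus                    # queues in the same cohort can borrow idle quota from each other
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Any           # reclaim borrowed GPUs when the owning queue needs them
    withinClusterQueue: LowerPriority  # urgent jobs preempt lower-priority ones in the same queue
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: a100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 24       # this team's guaranteed share (assumed)
              borrowingLimit: 16     # how much it may borrow from idle queues (assumed)
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: vision-queue
  namespace: team-vision             # assumed namespace
spec:
  clusterQueue: team-vision
```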


5. Monitoring and Observability

A Zabbix-based monitoring system is integrated for unified observability across nodes and services.

Monitored metrics include:

  • GPU and CPU utilization
  • Ceph I/O throughput
  • InfiniBand bandwidth
  • Kubernetes Pod and Node status

Zabbix Agent2 can be extended with Prometheus exporters for detailed GPU and storage metrics.
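For GPU metrics specifically, one common pattern (an assumption here, not part of the original design) is to run NVIDIA's dcgm-exporter as a DaemonSet on the GPU nodes and let Zabbix Agent2 or Prometheus scrape its :9400/metrics endpoint. The sketch assumes the NVIDIA container runtime is the default on GPU nodes and uses a placeholder node label, namespace, and image tag.

```yaml
# Hedged sketch: per-GPU metrics exporter on the 15 GPU nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring                        # assumed namespace
spec:
  selector:
    matchLabels: {app: dcgm-exporter}
  template:
    metadata:
      labels: {app: dcgm-exporter}
    spec:
      nodeSelector:
        node-type: gpu                         # assumed label marking GPU nodes
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # pin a concrete release tag in practice
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all                       # expose every GPU on the node to the exporter
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]               # needed by DCGM for some profiling metrics
          ports:
            - name: metrics
              containerPort: 9400              # Prometheus metrics endpoint (/metrics)
```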


6. Design Summary

Category       | Configuration Summary
Network        | Ethernet (K8s control) + InfiniBand (GPU only)
GPU Scheduling | Volcano + Kueue
Storage        | 3-Tier: NVMe / Ceph / MinIO
Monitoring     | Zabbix Unified Monitoring
Primary Goals  | Maximize GPU efficiency, High-performance I/O, Fair resource sharing

7. Scalability and Operations

  • Ensure InfiniBand switch port capacity before expanding GPU nodes.
  • Deploy Ceph MON and OSD daemons across at least three CPU nodes for HA.
  • Configure MinIO in a 4-node distributed setup for 2-node fault tolerance.
  • Integrate Zabbix alerts with Slack or Mattermost for real-time notifications.
  • Supported workloads include PyTorch DDP, TensorFlow MultiWorkerMirroredStrategy, and Ray.

Conclusion

This cluster is not just a standard Kubernetes deployment —
it’s an ML-optimized on-prem platform designed for:

  • Efficient GPU resource utilization
  • High-performance, tiered storage
  • Fair and responsive scheduling

By combining InfiniBand for GPU nodes, Ceph for shared data, and Zabbix for unified monitoring,
the system achieves both performance and manageability at scale.

ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.

🛠 Last updated: 2025.10.13