Kubernetes Cluster Architecture for Machine Learning (On-Prem)

50-Node ML Kubernetes Cluster Design

This document explains how to design a Kubernetes cluster optimized for machine learning workloads in an on-premises environment that combines CPU and GPU nodes.
The core goals are efficient GPU resource utilization and high-performance I/O through a tiered storage architecture.


1. Overview

The cluster consists of 50 nodes (15 GPU + 35 CPU).
GPU nodes handle model training, while CPU nodes are used for data preprocessing, ingestion, and model serving.

[Figure: Kubernetes cluster architecture]

Architecture Summary:

  • GPU Nodes (1–15): Tier 1 — Local NVMe disks
  • CPU Nodes (1–35): Tier 2 — Ceph storage built from local CPU-node disks
  • NAS (MinIO): Tier 3 — Backup and long-term storage
  • Networking: Hybrid InfiniBand + Ethernet configuration

2. Network Design (Ethernet + InfiniBand)

2.1 Architecture

Ethernet Switch

  • Common across all nodes.
  • Handles Kubernetes internal traffic such as control plane communication, API calls, and Pod networking.

InfiniBand Switch

  • Used exclusively for GPU nodes.
  • Provides high-performance RDMA links for distributed training (e.g., AllReduce, NCCL).

2.2 Implementation Policy

  • Kubernetes manages only the Ethernet network (via CNI such as Cilium or Calico).
  • InfiniBand is configured as an external network and accessed directly within Pods.
  • Only GPU nodes have InfiniBand NICs enabled, creating a dedicated high-speed path for GPU-to-GPU communication.

This design avoids the cost of InfiniBand NICs on CPU nodes while removing the network bottleneck for distributed training across GPU nodes.
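For illustration, the sketch below shows how a training Pod on a GPU node might request both GPUs and an InfiniBand device exposed by an RDMA device plugin (e.g. the k8s-rdma-shared-dev-plugin). The node label, image, HCA name, and the rdma/hca_shared_devices_a resource name are assumptions that depend on how the plugin is configured.

```yaml
# Illustrative only: the RDMA resource name and node label depend on the
# device plugin configuration; the image is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: ddp-worker-0
spec:
  nodeSelector:
    node-type: gpu                       # assumed label marking the 15 GPU nodes
  containers:
    - name: trainer
      image: pytorch-trainer:latest      # placeholder training image
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]              # commonly required for RDMA memory registration
      env:
        - name: NCCL_IB_HCA              # tell NCCL which IB adapter to use
          value: mlx5_0                  # assumed HCA device name
      resources:
        limits:
          nvidia.com/gpu: 4                    # GPUs via the NVIDIA device plugin
          rdma/hca_shared_devices_a: 1         # IB device via the RDMA shared device plugin (name is config-dependent)
```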


3. Tiered Storage Architecture

Machine learning workloads demand multiple types of storage for datasets, checkpoints, and model artifacts.
To handle this efficiently, the storage is divided into three tiers.

Tier   | Configuration                  | Purpose                       | Notes
Tier 1 | Local NVMe disks on GPU nodes  | Scratch / Checkpoint only     | Ultra-fast temporary storage, Local PV
Tier 2 | Ceph storage on CPU nodes      | Dataset / Checkpoint storage  | Uses Ceph RBD or CephFS
Tier 3 | NAS (MinIO backend)            | Long-term retention / Backup  | Object storage, S3-compatible

3.1 Tier 1 — Local NVMe

  • Each GPU node uses a local NVMe SSD as a Local Persistent Volume.
  • Stores temporary checkpoints and intermediate files during training.
  • Managed with a Delete reclaim policy to prioritize speed and simplicity.
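A minimal sketch of how this tier might be declared, assuming the static local-volume approach: a no-provisioner StorageClass with a Delete reclaim policy, plus one PersistentVolume per NVMe disk. The device path, capacity, and node name are placeholders.

```yaml
# Local PV StorageClass for the GPU-node NVMe scratch tier.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme-scratch
provisioner: kubernetes.io/no-provisioner   # local PVs are statically provisioned
reclaimPolicy: Delete                       # Delete requires the local static provisioner (or tooling) to clean up
volumeBindingMode: WaitForFirstConsumer     # bind only once the consuming Pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nvme-gpu-node-01
spec:
  capacity:
    storage: 1Ti                            # assumed NVMe capacity
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme-scratch
  local:
    path: /mnt/nvme0                        # assumed mount point of the local NVMe disk
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["gpu-node-01"]       # assumed node name
```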

3.2 Tier 2 — Ceph Cluster

  • CPU node disks are aggregated into a Ceph cluster.
  • Stores shared datasets and intermediate training results.
  • GPU nodes can mount Ceph volumes for shared access across workloads.
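As a sketch, a shared dataset could be exposed to training Pods through a ReadWriteMany PVC backed by CephFS; the StorageClass name and size below are assumptions (e.g. a class provided by Rook or the Ceph CSI driver).

```yaml
# Assumes a CephFS CSI StorageClass is already installed;
# "cephfs-shared" and the 5Ti size are placeholders for this sketch.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: imagenet-dataset
spec:
  accessModes: ["ReadWriteMany"]    # CephFS allows shared access from GPU and CPU nodes
  storageClassName: cephfs-shared
  resources:
    requests:
      storage: 5Ti
```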

3.3 Tier 3 — MinIO NAS

  • Implements a MinIO-based object storage layer for long-term backups.
  • Serves as a backup target for Ceph data.
  • Synchronization is automated with tools like mc mirror or rclone.
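One possible way to automate this inside the cluster is a nightly CronJob that runs mc mirror from a Ceph-backed volume to a MinIO bucket. The schedule, bucket name, PVC name, and the Secret holding the MinIO alias/credentials are assumptions.

```yaml
# Hedged sketch of a nightly Ceph-to-MinIO sync; names and schedule are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ceph-to-minio-backup
spec:
  schedule: "0 2 * * *"                     # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mc-mirror
              image: minio/mc:latest        # MinIO client image (pin a tag in practice)
              envFrom:
                - secretRef:
                    name: minio-credentials # assumed Secret providing an MC_HOST_backup alias (endpoint + keys)
              command: ["/bin/sh", "-c"]
              args:
                - mc mirror --overwrite /data backup/ml-artifacts   # local path -> MinIO alias/bucket
              volumeMounts:
                - name: ceph-data
                  mountPath: /data
          volumes:
            - name: ceph-data
              persistentVolumeClaim:
                claimName: imagenet-dataset   # Tier 2 CephFS PVC (assumed name)
```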

4. Scheduling Policy

Efficient GPU scheduling is crucial for ML workloads.
The cluster adopts a Volcano + Kueue combination for fairness and performance.

4.1 Volcano (Gang Scheduling)

  • A job starts only when all required GPU slots are available.
  • Prevents partial allocation and avoids deadlocks in multi-GPU workloads.
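A minimal Volcano Job sketch of this behavior, assuming the NVIDIA device plugin and a default Volcano queue; the job name, image, and sizes are placeholders. With minAvailable equal to the total number of workers, no Pod starts until all 16 GPUs can be allocated at once.

```yaml
# Gang scheduling sketch: all 4 workers (4 GPUs each) start together or not at all.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-train
spec:
  schedulerName: volcano
  minAvailable: 4                  # gang size: all 4 workers or nothing
  queue: default                   # Volcano queue (assumed)
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: pytorch-trainer:latest   # placeholder training image
              resources:
                limits:
                  nvidia.com/gpu: 4
```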

4.2 Kueue (Fairness / Borrowing / Preemption)

  • Divides GPU resources into queues by project or department.
  • Idle GPUs can be temporarily borrowed by other teams.
  • Urgent jobs are prioritized through preemption.

In short, Volcano ensures job concurrency, while Kueue enforces fairness and responsiveness across multiple teams.
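As an illustration of the queue model above, the sketch below uses Kueue's v1beta1 API to give one team a guaranteed GPU quota, allow borrowing within a shared cohort, and enable preemption; all names and quota numbers are assumptions.

```yaml
# Hedged Kueue sketch: per-team GPU queue with cohort borrowing and preemption.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100                         # assumed GPU flavor name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-vision
spec:
  cohort: ml-gpus                    # queues in the same cohort can borrow idle quota from each other
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Any           # reclaim borrowed GPUs when the owning queue needs them
    withinClusterQueue: LowerPriority  # urgent jobs preempt lower-priority ones in the same queue
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: a100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 24       # this team's guaranteed share (assumed)
              borrowingLimit: 16     # how much it may borrow from idle queues (assumed)
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: vision-queue
  namespace: team-vision             # assumed namespace
spec:
  clusterQueue: team-vision
```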


5. Monitoring and Observability

A Zabbix-based monitoring system is integrated for unified observability across nodes and services.

Monitored metrics include:

  • GPU and CPU utilization
  • Ceph I/O throughput
  • InfiniBand bandwidth
  • Kubernetes Pod and Node status

Zabbix Agent2 can be extended with Prometheus exporters for detailed GPU and storage metrics.
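For GPU metrics specifically, one common pattern (an assumption here, not part of the original design) is to run NVIDIA's dcgm-exporter as a DaemonSet on the GPU nodes and let Zabbix Agent2 or Prometheus scrape its :9400/metrics endpoint. The sketch assumes the NVIDIA container runtime is the default on GPU nodes and uses a placeholder node label, namespace, and image tag.

```yaml
# Hedged sketch: per-GPU metrics exporter on the 15 GPU nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring                        # assumed namespace
spec:
  selector:
    matchLabels: {app: dcgm-exporter}
  template:
    metadata:
      labels: {app: dcgm-exporter}
    spec:
      nodeSelector:
        node-type: gpu                         # assumed label marking GPU nodes
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # pin a concrete release tag in practice
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all                       # expose every GPU on the node to the exporter
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]               # needed by DCGM for some profiling metrics
          ports:
            - name: metrics
              containerPort: 9400              # Prometheus metrics endpoint (/metrics)
```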


6. Design Summary

Category       | Configuration Summary
Network        | Ethernet (K8s control) + InfiniBand (GPU only)
GPU Scheduling | Volcano + Kueue
Storage        | 3-Tier: NVMe / Ceph / MinIO
Monitoring     | Zabbix Unified Monitoring
Primary Goals  | Maximize GPU efficiency, High-performance I/O, Fair resource sharing

7. Scalability and Operations

  • Ensure InfiniBand switch port capacity before expanding GPU nodes.
  • Deploy Ceph MON and OSD daemons across at least three CPU nodes for HA.
  • Configure MinIO in a 4-node distributed setup for 2-node fault tolerance.
  • Integrate Zabbix alerts with Slack or Mattermost for real-time notifications.
  • Supported workloads include PyTorch DDP, TensorFlow MultiWorkerMirroredStrategy, and Ray.

Conclusion

This cluster is not just a standard Kubernetes deployment —
it’s an ML-optimized on-prem platform designed for:

  • Efficient GPU resource utilization
  • High-performance, tiered storage
  • Fair and responsive scheduling

By combining InfiniBand for GPU nodes, Ceph for shared data, and Zabbix for unified monitoring,
the system achieves both performance and manageability at scale.

ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.

🛠 Last updated: 2025.10.13