Scope: Worker nodes only (excluding control plane and etcd)
Assumption: Running cluster in production, minimize downtime
✅ Checklist
- Pre-check: Verify node name, pods, storage → secure backup / change window
- Drain & Remove: kubectl cordon/drain → kubectl delete node → (optional) clean up CNI leftovers
- Kubespray Cleanup: remove-node.yml (different flags for online/offline nodes)
- Inventory Cleanup: Remove the node entry from the inventory file
- Re-Add Node: Prepare a new VM → add it to the inventory → run facts.yml → scale.yml --limit=<new-node>
- Validation & Restore: Labels and taints restored on the new node
0. Pre-check
# Node and pod distribution
kubectl get nodes -o wide
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
# Check local PVs, emptyDir usage, and existing PDBs
kubectl get pv,pvc -A
kubectl get pdb -A
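The emptyDir check matters because that data is deleted during the drain. A small optional sketch (assuming jq is installed) to list pods on the node that mount emptyDir volumes:
# Pods on the node that use emptyDir volumes (their data is lost with --delete-emptydir-data)
kubectl get pods -A -o json --field-selector spec.nodeName=<node> \
  | jq -r '.items[] | select(any(.spec.volumes[]?; .emptyDir != null)) | "\(.metadata.namespace)/\(.metadata.name)"'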
kubectl drain respects PodDisruptionBudgets (PDBs). If the drain is blocked by a PDB, temporarily scale out/in or relax the PDB before proceeding.
📌 PodDisruptionBudget (PDB)
- Definition: A Kubernetes policy that limits how many pods can be voluntarily evicted at the same time (during drain, upgrade, or rolling updates).
- Why: Ensures workloads such as ReplicaSets, Deployments, or StatefulSets always keep a minimum number of pods available (e.g., DB proxies, API servers, ingress controllers).
- Example:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-api
If there are 3 pods with label app=my-api, drain will only remove 1 pod at a time, ensuring at least 2 remain running.
- Summary:
- Applies to voluntary disruptions (drain, rolling updates).
- Guarantees safe pod redistribution while maintaining service availability.
- Defined by minAvailable or maxUnavailable.
👉 In short: A PDB is an uptime safety net for your pods.
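If a drain stalls on a PDB, two common workarounds are sketched below; the deployment and PDB names (my-api, api-pdb) are just the example from above, not something Kubespray creates:
# Check how many voluntary disruptions the PDB currently allows
kubectl get pdb api-pdb -o wide
# Option A: temporarily add a replica so eviction keeps minAvailable satisfied
kubectl scale deployment my-api --replicas=4
# Option B: relax the PDB for the duration of the drain (restore it afterwards)
kubectl patch pdb api-pdb --type merge -p '{"spec":{"minAvailable":1}}'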
1. Drain Worker Node & Remove K8s Object
# Prevent new scheduling
kubectl cordon <node>
# Evict safely (ignore DaemonSets, delete emptyDir, use --force if necessary)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# Delete node object
kubectl delete node <node>
Optional: Clean up CNI resources (e.g., Calico)
Sometimes IP or node objects remain even after deletion:
calicoctl get nodes -o wide
calicoctl delete node <node>
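calicoctl needs access to the Calico datastore; one common setup (an assumption, adjust to how Calico is deployed in your cluster) is to point it at your kubeconfig:
export DATASTORE_TYPE=kubernetes
export KUBECONFIG=~/.kube/config
calicoctl get nodes -o wide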
⚠️ Note: Worker nodes typically do not run :6443 (kube-apiserver). Avoid killing processes with kill -9. If needed, use systemctl stop kubelet temporarily, but rely on Kubespray for clean removal.
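If you do need to quiesce the node by hand before the Kubespray run, a minimal sketch assuming a systemd-managed kubelet and containerd:
sudo systemctl stop kubelet
sudo systemctl stop containerd   # or your container runtime (e.g., cri-o)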
2. Cleanup with Kubespray (remove-node.yml)
- Online node (reachable via SSH):
cd kubespray/
# Refresh facts cache (recommended)
ansible-playbook -i inventory/<cluster>/hosts.yaml -b playbooks/facts.yml
# Clean up the node
ansible-playbook -i inventory/<cluster>/hosts.yaml -b \
remove-node.yml -e node=<node>
- Offline node (not reachable via SSH):
ansible-playbook -i inventory/<cluster>/hosts.yaml -b \
remove-node.yml -e node=<node> \
-e reset_nodes=false -e allow_ungraceful_removal=true
Use reset_nodes=false + allow_ungraceful_removal=true so the playbook does not fail when it cannot reach the node.
3. Inventory Cleanup
After successful removal, delete the node entry from your Ansible inventory (hosts.yaml or inventory.ini).
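For example, in hosts.yaml the departed node (illustrative name worker-old) is removed both from all.hosts and from the kube_node group:
all:
  hosts:
    worker-old:          # ← delete this host entry
      ansible_host: 10.10.10.20
  children:
    kube_node:
      hosts:
        worker-old: {}   # ← and remove it here as well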
4. Prepare Replacement / New Worker Node
- Provision a new VM (or bare-metal node).
- Requirements:
  - Passwordless SSH from the control node
  - sudo privileges, NTP sync, firewall/network ready (see the quick check below)
- If reusing the IP, clear SSH known_hosts conflicts:
ssh-keygen -R <nodeIP>
ssh-keygen -R <nodeName>
- Disable swap:
sudo swapoff -a
sudo sed -ri 's/^([^#].*\sswap\s)/#\1/' /etc/fstab
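A quick way to confirm passwordless SSH, sudo, and NTP sync before running Kubespray (a sketch using the example user/IP from the inventory below, on a systemd-based distro):
ssh ubuntu@10.10.10.25 'sudo -n true && echo "sudo OK"; timedatectl show -p NTPSynchronized'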
5. Add New Node to Inventory
YAML (preferred):
all:
  hosts:
    worker-new:
      ansible_host: 10.10.10.25
      ip: 10.10.10.25
      access_ip: 10.10.10.25
      ansible_user: ubuntu
  children:
    kube_control_plane:
      hosts:
        cp-1: {}
        cp-2: {}
    kube_node:
      hosts:
        worker-1: {}
        worker-2: {}
        worker-new: {}
    etcd:
      hosts:
        cp-1: {}
        cp-2: {}
    k8s_cluster:
      children:
        kube_control_plane: {}
        kube_node: {}
INI:
[all]
worker-new ansible_host=10.10.10.25 ip=10.10.10.25 access_ip=10.10.10.25 ansible_user=ubuntu
[kube_control_plane]
cp-1
cp-2
[kube_node]
worker-1
worker-2
worker-new
[etcd]
cp-1
cp-2
[k8s_cluster:children]
kube_control_plane
kube_node
6. Refresh Facts & Scale In (Add Node)
# Refresh facts for all nodes
ansible-playbook -i inventory/<cluster>/hosts.yaml -b playbooks/facts.yml
# Install/join only the new node
ansible-playbook -i inventory/<cluster>/hosts.yaml -b \
scale.yml --limit=worker-new
7. Post-Validation & Restore Labels/Taints
# Verify join
kubectl get nodes -o wide
kubectl describe node worker-new
# Restore labels/taints if needed
kubectl label node worker-new node-role.kubernetes.io/worker='' env=prod
kubectl taint nodes worker-new node.kubernetes.io/purpose=ingress:NoSchedule
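Workloads meant for this node must tolerate the taint above; a minimal pod-spec excerpt (illustrative) matching node.kubernetes.io/purpose=ingress:NoSchedule:
tolerations:
- key: node.kubernetes.io/purpose
  operator: Equal
  value: ingress
  effect: NoSchedule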
8. Troubleshooting Tips
- Drain blocked by PDB: scale out temporarily or relax the PDB; use --disable-eviction only as a last resort (see the sketch below).
- Offline node: use remove-node.yml -e reset_nodes=false -e allow_ungraceful_removal=true.
- Calico leftovers: clean up with calicoctl delete node <id>.
- Facts cache issues: rerun playbooks/facts.yml before retrying.
- Add failure rollback: reset only the target node with reset.yml --limit=<node>.
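As an absolute last resort, --disable-eviction makes drain delete pods directly instead of going through the eviction API, which bypasses PDB checks entirely:
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --disable-eviction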
9. Why Not Force Kill Processes?
- Worker nodes don’t run kube-apiserver (:6443).
- Killing processes leaves dirty state; remove-node.yml handles kubelet, runtime, and CNI cleanup in the proper order.
- Use the ungraceful-removal flags only if the node is unreachable.
10. Command Summary (Copy-Paste Ready)
# A) Drain & Delete
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node>
# B) Optional: Calico cleanup
calicoctl get nodes -o wide
calicoctl delete node <node>
# C) Kubespray cleanup (online)
ansible-playbook -i inventory/<cluster>/hosts.yaml -b playbooks/facts.yml
ansible-playbook -i inventory/<cluster>/hosts.yaml -b remove-node.yml -e node=<node>
# D) Kubespray cleanup (offline)
ansible-playbook -i inventory/<cluster>/hosts.yaml -b remove-node.yml \
-e node=<node> -e reset_nodes=false -e allow_ungraceful_removal=true
# E) Add new node
ansible-playbook -i inventory/<cluster>/hosts.yaml -b playbooks/facts.yml
ansible-playbook -i inventory/<cluster>/hosts.yaml -b scale.yml --limit=<new-node>
# F) Verify & restore
kubectl get nodes -o wide
kubectl label node <new-node> env=prod
kubectl taint nodes <new-node> node.kubernetes.io/purpose=ingress:NoSchedule
ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.
🛠 Last modified: 2025.09.18