How Ingress Controller Traffic Works on On-Prem K8s with MetalLB — Why externalTrafficPolicy: Cluster Can Blackhole Traffic

1) Default Behavior (Cluster mode) — Kubernetes

  • Service.type: LoadBalancer + externalTrafficPolicy: Cluster → traffic to the VIP can land on any node in the cluster.
  • The receiving node’s kube-proxy forwards the traffic to a node that actually hosts a backend Pod (this hop typically involves SNAT).
  • The original client IP is lost (a quick way to observe this is shown right after this list), but availability is maintained regardless of where Pods run.
  • You cannot predict in advance which node an incoming connection will land on.
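For example, with the Cluster policy the "client" address that reaches the ingress controller is usually a node/SNAT address rather than the real client IP. Assuming the default ingress-nginx install, you can see this in the controller's access log:

kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=20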

Check your current setting

  • Default install (names may vary for custom installs):
kubectl get svc ingress-nginx-controller -n ingress-nginx -o yaml | grep externalTrafficPolicy
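For reference, the relevant part of such a Service looks roughly like the sketch below. This is a minimal illustration, not a complete manifest; the name/namespace assume the default ingress-nginx install, and the selector is abbreviated.

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster    # default: any node may receive VIP traffic and forward it
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https

Once MetalLB assigns the VIP, it appears under status.loadBalancer.ingress of this Service.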

2) VIP Binding in MetalLB L2

  • In L2 mode, MetalLB elects a single owner node for each VIP; only that node answers ARP requests for the VIP, so upstream devices learn the VIP as that node’s MAC.
    (Upstream ARP table on the switch/gateway: VIP → Node-A MAC)
  • All external traffic therefore enters the cluster through that one node (Node-A).
  • Whether the Pod runs on Node-A or not, kube-proxy re-routes the traffic inside the cluster (a quick way to check the current VIP owner is shown below).
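If you want to confirm which node currently announces the VIP, MetalLB records it as an event on the Service (exact wording varies by version, and events expire after a while). From a Linux gateway you can also check which MAC the VIP currently resolves to; 192.0.2.10 below is a placeholder VIP.

kubectl -n ingress-nginx describe svc ingress-nginx-controller | grep -i announcing
ip neigh show | grep 192.0.2.10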

3) Failure Scenario

  1. VIP initially bound to Node-A (VIP → Node-A MAC).
  2. The Ingress Controller Pod is rescheduled to Node-B.
    • With externalTrafficPolicy: Cluster, Node-A → Node-B redirection should still work.
  3. MetalLB changes the VIP owner to Node-B and sends gratuitous ARP (GARP) to update neighbors.
  4. Some network devices (switch/router) ignore GARP or keep the old MAC cached.
  5. External traffic still goes to Node-A.
  6. But once VIP ownership has moved, Node-A no longer accepts traffic for the VIP → traffic is blackholed (the diagnostic sketch below shows how to confirm this).
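A quick way to confirm this situation (a sketch; 192.0.2.10 stands in for your VIP, and interface/host names will differ):

# On Node-A, the node the stale ARP entry still points to:
# packets for the VIP keep arriving, but connections never complete
sudo tcpdump -ni any host 192.0.2.10 and tcp port 443

# On a Linux gateway, inspect and flush the stale entry so it re-learns the new owner
# (managed switches/routers have their own equivalent commands)
ip neigh show | grep 192.0.2.10
sudo ip neigh flush to 192.0.2.10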

4) Why a Blackhole Even in Cluster Mode?

  • In theory, Cluster mode lets any node accept and forward traffic.
  • In MetalLB L2, only the current owner’s MAC answers ARP for the VIP. When ownership flips, the former owner stops responding.
  • If upstream ARP still points to the old MAC (Node-A), packets arrive at Node-A, which now drops them → packet loss.
  • Root cause is almost always ARP table refresh failure in L2 mode.
    • Even if you switch the ingress traffic policy to Local and ensure the ingress Pod runs on the “intended” node, this does not resolve the underlying ARP-staleness problem (the check below shows which node actually answers ARP for the VIP).
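One way to observe this directly is to ARP for the VIP from another host on the same L2 segment (a sketch; replace eth0 and 192.0.2.10 with your interface and VIP):

# The MAC in the reply belongs to the node MetalLB currently answers from.
# If the gateway's ARP table still holds a different, older MAC, you are in the
# stale-ARP situation described above.
arping -I eth0 -c 3 192.0.2.10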

5) Practical Remediation

  • Make network devices honor GARP and be ready to clear/flush ARP on switches/routers when VIP ownership changes.
  • Consider BGP mode: multiple nodes advertise the VIP, removing the ARP single-owner dependency.
    • This adds requirements on the network side and operational complexity; it is fine for a greenfield setup, but migrating a running cluster can introduce many variables and overhead.
    • (A minimal sketch of the shape follows below; I’ll cover the full BGP details in a separate post.)
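Just to show the rough shape of a BGP setup (a minimal sketch, not a working config; the ASNs, peer address, and pool name are placeholders, assuming MetalLB ≥ v0.13 CRDs):

# Every speaker node peers with the upstream router, so the VIP is reachable
# via multiple next hops instead of a single L2 owner.
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: upstream-router
  namespace: metallb-system
spec:
  myASN: 64512           # placeholder ASN for the cluster
  peerASN: 64513         # placeholder ASN of the upstream router
  peerAddress: 10.0.0.1  # placeholder router address
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: vip-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - ingress-pool       # the IPAddressPool that holds the ingress VIP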

✅ Summary

  • externalTrafficPolicy: Cluster is the default and should be robust.
  • With MetalLB L2, a VIP owner change can blackhole traffic if upstream ARP tables don’t update.
  • The resulting blackhole is caused by L2/ARP behavior, not by the application stack.

ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.

🛠 Last updated: 2025.09.18