How Ingress Controller Traffic Works on On-Prem K8s with MetalLB — Why externalTrafficPolicy: Cluster Can Blackhole Traffic

1) Default Behavior (Cluster mode) — Kubernetes

  • Service.type: LoadBalancer + externalTrafficPolicy: Cluster → traffic to the VIP can land on any node in the cluster.
  • The receiving node’s kube-proxy forwards the traffic to a node that actually hosts a backend Pod (this hop typically involves SNAT).
  • The original client IP is lost (a quick way to observe this is shown right after this list), but availability is maintained regardless of where Pods run.
  • You cannot predict in advance which node an incoming connection will land on.
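For example, with the Cluster policy the "client" address that reaches the ingress controller is usually a node/SNAT address rather than the real client IP. Assuming the default ingress-nginx install, you can see this in the controller's access log:

kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=20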

Check your current setting

  • Default install (names may vary for custom installs):
kubectl get svc ingress-nginx-controller -n ingress-nginx -o yaml | grep externalTrafficPolicy
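For reference, the relevant part of such a Service looks roughly like the sketch below. This is a minimal illustration, not a complete manifest; the name/namespace assume the default ingress-nginx install, and the selector is abbreviated.

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster    # default: any node may receive VIP traffic and forward it
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https

Once MetalLB assigns the VIP, it appears under status.loadBalancer.ingress of this Service.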

2) VIP Binding in MetalLB L2

  • In L2 mode, MetalLB elects a single owner node for each VIP; only that node answers ARP requests for the VIP, so upstream devices learn the VIP as that node’s MAC.
    (Upstream ARP table on the switch/gateway: VIP → Node-A MAC)
  • All external traffic therefore enters the cluster through that one node (Node-A).
  • Whether the Pod runs on Node-A or not, kube-proxy re-routes the traffic inside the cluster (a quick way to check the current VIP owner is shown below).
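If you want to confirm which node currently announces the VIP, MetalLB records it as an event on the Service (exact wording varies by version, and events expire after a while). From a Linux gateway you can also check which MAC the VIP currently resolves to; 192.0.2.10 below is a placeholder VIP.

kubectl -n ingress-nginx describe svc ingress-nginx-controller | grep -i announcing
ip neigh show | grep 192.0.2.10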

3) Failure Scenario

  1. VIP initially bound to Node-A (VIP → Node-A MAC).
  2. The Ingress Controller Pod is rescheduled to Node-B.
    • With externalTrafficPolicy: Cluster, Node-A → Node-B redirection should still work.
  3. MetalLB changes the VIP owner to Node-B and sends gratuitous ARP (GARP) to update neighbors.
  4. Some network devices (switch/router) ignore GARP or keep the old MAC cached.
  5. External traffic still goes to Node-A.
  6. But once VIP ownership has moved, Node-A no longer accepts traffic for the VIP → traffic is blackholed (the diagnostic sketch below shows how to confirm this).
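A quick way to confirm this situation (a sketch; 192.0.2.10 stands in for your VIP, and interface/host names will differ):

# On Node-A, the node the stale ARP entry still points to:
# packets for the VIP keep arriving, but connections never complete
sudo tcpdump -ni any host 192.0.2.10 and tcp port 443

# On a Linux gateway, inspect and flush the stale entry so it re-learns the new owner
# (managed switches/routers have their own equivalent commands)
ip neigh show | grep 192.0.2.10
sudo ip neigh flush to 192.0.2.10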

4) Why a Blackhole Even in Cluster Mode?

  • In theory, Cluster mode lets any node accept and forward traffic.
  • In MetalLB L2, only the current owner’s MAC answers ARP for the VIP. When ownership flips, the former owner stops responding.
  • If upstream ARP still points to the old MAC (Node-A), packets arrive at Node-A, which now drops them → packet loss.
  • Root cause is almost always ARP table refresh failure in L2 mode.
    • Even if you switch the ingress traffic policy to Local and ensure the ingress Pod runs on the “intended” node, this does not resolve the underlying ARP-staleness problem (the check below shows which node actually answers ARP for the VIP).
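One way to observe this directly is to ARP for the VIP from another host on the same L2 segment (a sketch; replace eth0 and 192.0.2.10 with your interface and VIP):

# The MAC in the reply belongs to the node MetalLB currently answers from.
# If the gateway's ARP table still holds a different, older MAC, you are in the
# stale-ARP situation described above.
arping -I eth0 -c 3 192.0.2.10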

5) Practical Remediation

  • Make network devices honor GARP and be ready to clear/flush ARP on switches/routers when VIP ownership changes.
  • Consider BGP mode: multiple nodes advertise the VIP, removing the ARP single-owner dependency.
    • This adds requirements on the network side and operational complexity; it is fine for a greenfield setup, but migrating a running cluster can introduce many variables and overhead.
    • (A minimal sketch of the shape follows below; I’ll cover the full BGP details in a separate post.)
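Just to show the rough shape of a BGP setup (a minimal sketch, not a working config; the ASNs, peer address, and pool name are placeholders, assuming MetalLB ≥ v0.13 CRDs):

# Every speaker node peers with the upstream router, so the VIP is reachable
# via multiple next hops instead of a single L2 owner.
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: upstream-router
  namespace: metallb-system
spec:
  myASN: 64512           # placeholder ASN for the cluster
  peerASN: 64513         # placeholder ASN of the upstream router
  peerAddress: 10.0.0.1  # placeholder router address
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: vip-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - ingress-pool       # the IPAddressPool that holds the ingress VIP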

✅ Summary

  • externalTrafficPolicy: Cluster is the default and should be robust.
  • With MetalLB L2, a VIP owner change can blackhole traffic if upstream ARP tables don’t update.
  • The resulting blackhole is caused by L2/ARP behavior, not by the application stack.

ⓒ 2025 엉뚱한 녀석의 블로그 [quirky guy's Blog]. All rights reserved. Unauthorized copying or redistribution of the text and images is prohibited. When sharing, please include the original source link.

🛠 Last updated: 2025.09.18