Taming multi-cloud Kubernetes networking with topology-aware routing
Building a multi-cloud Kubernetes cluster is a fascinating challenge. The ideal outcome is a single, unified control plane spanning multiple cloud providers - AWS, Azure, and GCP - but the real hurdle is networking. How do you ensure pods running in different clouds communicate securely and efficiently?
I recently explored this when preparing a technical demonstration, and the journey turned out to be far more educational than expected. I began with a simple, lightweight architecture, but quickly ran into deeper networking issues, especially around Kubernetes Services. This is how I built the environment, what broke along the way, and how the final solution came together.
Initial Setup: Terraform, Ansible, K3s, and WireGuard
My approach was to keep everything intentionally minimal.
Kubernetes Distro:
I selected K3s for its lightweight footprint and straightforward setup. For this kind of distributed experiment, avoiding the complexity of a full upstream distribution made sense.
Infrastructure:
Using Terraform, I created three modules - one each for Azure, AWS, and GCP.
Each module provisioned two VMs with public IPs. These IPs were essential for establishing WireGuard tunnels between the clouds.
Configuration:
Terraform automatically generated an Ansible inventory. I then used a set of Ansible playbooks to:
Install and configure WireGuard on every node, forming a flat, secure overlay network across the three clouds.
Install K3s and join all nodes into a single cluster.
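The WireGuard step boils down to a handful of standard Ansible tasks. A minimal sketch (module names are the standard `ansible.builtin` ones; the template name `wg0.conf.j2` and file paths are assumptions for illustration):

```yaml
# Illustrative tasks from the WireGuard playbook.
- name: Install WireGuard
  ansible.builtin.apt:
    name: wireguard
    state: present

- name: Render this node's WireGuard config (keys and peer list templated per host)
  ansible.builtin.template:
    src: wg0.conf.j2          # hypothetical template: this node's private key plus all peers' public IPs
    dest: /etc/wireguard/wg0.conf
    mode: "0600"

- name: Bring up the tunnel now and on every boot
  ansible.builtin.systemd:
    name: wg-quick@wg0
    state: started
    enabled: true
```

Because every node peers with every other node over its public IP, the result is a full-mesh, flat overlay that K3s can treat as a single network.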
Once the cluster was live, I used k8s-netperf to benchmark the reliability and performance of inter-cloud communication.
The Problem: When Services Betray You
The early k8s-netperf results were mixed.
The Good: Direct node-to-node and pod-to-pod communication across the WireGuard tunnel worked surprisingly well. Even cross-cloud traffic delivered acceptable throughput.
The Bad: The moment I targeted a Kubernetes Service (such as netperf-server), performance deteriorated sharply. Traffic became unstable and often extremely slow. Even when a local pod was available, Kubernetes sometimes sent requests across clouds unnecessarily.
This revealed a clear insight: the WireGuard overlay was functioning correctly, but Kubernetes' service discovery and load balancing - managed through kube-proxy - was not respecting the underlying topology.
Troubleshooting the Routing Problem
My first instinct was to tune kube-proxy.
Attempt 1: Topology-Aware Hints
I tried enabling Kubernetes' built-in Topology-Aware Hints by patching the Service with the annotation:
service.kubernetes.io/topology-mode: auto
and, where supported, the newer spec field:
trafficDistribution: PreferClose
In theory, this should have encouraged kube-proxy to choose endpoints in the same zone (or cloud). In practice, nothing changed. Topology hints also depend on nodes carrying topology.kubernetes.io/zone labels and on the control plane actually writing hints into the EndpointSlices, so either kube-proxy ignored the hints or the feature simply wasn't compatible with this architecture.
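For reference, the patched Service amounted to something like the sketch below. The annotation lives in metadata, while trafficDistribution is a spec field (available from Kubernetes v1.31); the selector label is an assumption, and 12865 is netperf's standard control port:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: netperf-server
  annotations:
    service.kubernetes.io/topology-mode: auto   # Topology-Aware Hints/Routing
spec:
  trafficDistribution: PreferClose              # newer alternative, Kubernetes v1.31+
  selector:
    app: netperf-server                         # assumed label on the server pods
  ports:
    - port: 12865                               # netperf control port
      protocol: TCP
```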
Attempt 2: Switching to IPVS
K3s defaults to iptables mode, so I tested IPVS in the hope of improving load-balancing decisions.
This failed immediately. IPVS conflicted with the WireGuard configuration and broke the tunnel entirely.
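For completeness, switching K3s's bundled kube-proxy to IPVS is a small configuration change rather than a reinstall. A sketch of the relevant entry (the node also needs the kernel IPVS modules and ipvsadm available):

```yaml
# /etc/rancher/k3s/config.yaml on every node
kube-proxy-arg:
  - "proxy-mode=ipvs"
```

In this environment the IPVS virtual-server rules interfered with routes over the wg0 interface, which is what took the tunnel down.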
Attempt 3: Replace kube-proxy (and Flannel)
At this point, it was clear kube-proxy was the bottleneck. Removing it, however, also meant removing Flannel - K3s's default CNI - which depends on kube-proxy.
So I needed a new CNI and a kube-proxy replacement. Two candidates stood out: Calico and Cilium.
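Whichever CNI won out, K3s first had to be told to ship without Flannel, its network policy controller, and kube-proxy. In K3s this is a server-side configuration change (a sketch; these are the documented K3s server options):

```yaml
# /etc/rancher/k3s/config.yaml on server nodes
flannel-backend: none
disable-network-policy: true
disable-kube-proxy: true
```

With these set, the cluster comes up with no CNI at all, and nodes stay NotReady until a replacement CNI is installed.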
Calico:
I first tested Calico in L3 mode. Although technically robust, it introduced difficult routing challenges. Without VXLAN encapsulation, Calico could not automatically route pod-to-pod traffic over WireGuard. This would have required configuring BGP or manually adding every pod CIDR into each node's WireGuard configuration. Clearly not scalable.
Cilium:
Cilium immediately looked more promising. It supports VXLAN overlays, integrates cleanly with WireGuard-based networks, and includes a mature kube-proxy replacement feature.
The Solution: Cilium to the Rescue
Step 1: Initial Cilium Installation
I removed Flannel and deployed Cilium with its kube-proxy replacement enabled. Instantly, the erratic service behavior disappeared. Traffic became stable and predictable - a major improvement.
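The install itself was done via Helm. A hedged sketch of the values I ended up with (the API server address is a placeholder, and exact value names can vary slightly between Cilium versions; VXLAN encapsulation is what keeps pod-to-pod traffic routable over the flat WireGuard network without per-pod-CIDR routes):

```yaml
# helm install cilium cilium/cilium -n kube-system -f values.yaml
kubeProxyReplacement: true   # Cilium's eBPF service handling replaces kube-proxy
routingMode: tunnel          # encapsulate pod traffic so node routes suffice
tunnelProtocol: vxlan
k8sServiceHost: 10.0.0.1     # placeholder: API server address reachable over WireGuard
k8sServicePort: 6443
```

Since kube-proxy is gone, Cilium must be pointed directly at the API server (k8sServiceHost/k8sServicePort) rather than at the kubernetes Service ClusterIP.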
However, the benchmarks still showed traffic routing to pods in other clouds even when local pods were available. The routing was fair, but not topology-aware.
Step 2: The Final Fix
After diving deeper into Cilium's documentation, I discovered the feature I was missing: Cilium's native service topology awareness.
When enabled, this produced exactly the behavior I wanted:
- Prefer a pod running on the same node
- If none exist, prefer a pod in the same zone (same cloud provider)
- Only fall back to cross-cloud traffic if no local endpoint exists
This fully aligned Kubernetes service routing with the actual network topology of the multi-cloud environment.
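In Helm terms this is essentially one extra value, plus zone labels on the nodes so Cilium knows which endpoints count as "close". A sketch (the label key is the standard well-known one; the zone value per cloud is a choice you make, and exact behavior depends on the Cilium version):

```yaml
# Additional Helm values for Cilium
loadBalancer:
  serviceTopology: true

# Each node also needs a zone label matching its cloud, e.g.:
#   kubectl label node <node-name> topology.kubernetes.io/zone=aws
```

With one zone per cloud provider, "same zone" and "same cloud" become the same thing, which is exactly the preference order described above.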
Final Takeaway
Creating a multi-cloud overlay with WireGuard is surprisingly straightforward. The real complexity lies in making Kubernetes Service routing behave intelligently across that topology.
In this experiment - and in real enterprise environments - kube-proxy can struggle to make optimal decisions. But Cilium's combination of kube-proxy replacement and service topology awareness provided an elegant and effective solution.
For organisations exploring multi-cloud architectures, especially those focusing on cloud-native reliability and performance, these lessons are invaluable. At Darumatic, we frequently see teams underestimate the networking layer when adopting multi-cloud Kubernetes. As this experiment shows, getting topology-aware routing right is essential for predictable performance, cost control, and a seamless user experience.