Service mesh is one of those technologies that looks strictly better on a slide deck and strictly worse on a Tuesday night incident call.
The slide deck shows mTLS, traffic splitting, retries, golden signals. The incident call shows a sidecar that crashed on startup, a control plane that can’t reach an endpoint slice, and six engineers arguing about whether the 503 came from the app, the proxy, or the policy.
Both things are true. A mesh solves real problems, and a mesh imposes a real tax. The interesting question is not “should we run a mesh” in the abstract. It is “does the value we get from the mesh exceed the tax it costs us to operate.”
For most teams with fewer than about 30 services, the honest answer is no.
1. What a service mesh actually gives you
Strip away the marketing and a mesh gives you four things.
- mTLS between workloads. Every pod-to-pod call becomes encrypted and identity-bound without the application knowing about it.
- L7 traffic policy. Retries, timeouts, circuit breakers, outlier ejection, traffic splitting, header-based routing, applied at the platform layer and changeable without a redeploy.
- Uniform observability. Golden-signal metrics, protocol-aware logs, and distributed tracing headers stamped on every hop, for every language, without touching app code.
- Identity-based authorization. “Service A may call service B, but only on `POST /refund` and only from namespace `payments`” as a declarative policy, not a library configuration.
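That last item can be made concrete. Here is a sketch of such a policy in Istio's `AuthorizationPolicy` API; the names and labels are illustrative, not taken from any real cluster:

```yaml
# Allow calls to service B's /refund endpoint only via POST, and only
# from workloads whose identity lives in the "payments" namespace.
# Once any ALLOW policy selects a workload, Istio denies requests
# that match no rule, so this is also an implicit deny-all-else.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: refunds-from-payments-only
  namespace: billing          # namespace of the callee, service B
spec:
  selector:
    matchLabels:
      app: service-b          # illustrative label for service B's pods
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["payments"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/refund"]
```

The identity in the `from` clause is the mTLS certificate identity, not a network address, which is what makes this stronger than an IP allowlist.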
That is a legitimately powerful feature set. The catch is that all four are only valuable to the degree you are actually using them.
A team running Go and Node, with one retry library, one logging library, and a gateway doing TLS termination at the edge, already has a workable version of three of those four. The mesh duplicates what they have and then asks them to operate the duplicate.
2. The tax you pay for the mesh to exist
The mesh is not free. The bill comes in four line items.
Per-pod resource overhead. A sidecar on every pod means CPU and memory overhead on every pod. The numbers vary by mesh and tuning, but a typical sidecar runs 50–150m CPU and 80–200 MiB of RAM at idle. Multiply by pod count. For 400 pods that is 20–60 CPU cores and 30–80 GiB of memory spent on proxies, not on your product.
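If you do run sidecars, that per-pod cost is at least tunable per workload. As one illustration, Istio reads resource annotations on the pod template when injecting the proxy; the values below are illustrative, not recommendations:

```yaml
# Shrink the injected sidecar's requests/limits for a lightweight workload.
# Istio's injector honors these annotations when sizing the proxy container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-v1
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/proxyCPU: "50m"
        sidecar.istio.io/proxyMemory: "96Mi"
        sidecar.istio.io/proxyCPULimit: "200m"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
```

Tuning narrows the bill. It does not eliminate it, and it adds one more per-service knob to keep track of.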
Latency added on every hop. Each request passes through two proxies, the caller’s sidecar and the callee’s, and twice that if you count ingress and egress. On p50 this is usually 1–3 ms per hop and invisible. On p99 under memory pressure or a busy node it is not.
Control plane you now have to run. An Istio control plane, or a Linkerd one, or a Cilium mesh control plane, is another distributed system in your cluster. It needs upgrading, backing up, capacity planning, and its own monitoring. Every mesh upgrade is a coordinated exercise across the control plane, the data plane CRDs, and the sidecars attached to running workloads.
Debugging surface area. When a call fails, the question “is this the app, the proxy, the policy, the mTLS config, the destination rule, or the workload selector” is a real and frequent question. It is not a theoretical one.
None of these are disqualifying on their own. Taken together, they are why a mesh installed “just in case” tends to become the most-complained-about piece of platform infrastructure within six months.
3. The threshold conditions that flip the ROI
Here is where I have actually seen a mesh earn its keep.
- You have a polyglot fleet maintained by more than one team. Go, Node, Python, Java, maybe Rust, each with their own HTTP client, their own retry quirks, their own logging conventions. A mesh gives you one implementation of retries, timeouts, and mTLS across all of them, and that uniformity is worth real money.
- Compliance or security mandates mTLS everywhere. Some regulatory regimes and internal zero-trust programs require identity-based encryption between every workload. You can get there with application libraries, but the cost of proving it to an auditor on every release is high. A mesh makes the proof structural.
- Granular traffic shifting at scale. Canarying 1% of traffic to v2 of a service, per header, per route, per user cohort, across 100+ services, is something a mesh does well and libraries do badly.
- Zero-trust network policy at L7. Network policies at L3/L4 get you “pod A can talk to pod B.” L7 authorization gets you “pod A can call `GET /healthz` but not `POST /admin` on pod B.” If your security posture requires the second, a mesh is the cleanest path.
- You run 100+ services across many teams. Once the mesh tax is amortized across a large fleet and the coordination cost of keeping libraries in sync is higher than the cost of running the mesh, the math tips.
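For the mTLS-everywhere case, the structural proof an auditor sees is usually a single mesh-wide policy rather than per-service library config. In Istio, for example, the sketch is one resource (assuming `istio-system` is your root namespace):

```yaml
# Mesh-wide strict mTLS: a PeerAuthentication named "default" in the
# Istio root namespace applies to every workload in the mesh.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # reject any plaintext workload-to-workload traffic
```

One declarative object, checked into git, is a very different audit conversation than proving every service pinned the right TLS library version on every release.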
If you cannot check at least two of these boxes honestly, a mesh is probably a net negative for you today.
4. What a gateway-only alternative looks like
If the problem is “I need mTLS at the edge, observability, and the ability to do canary deploys for a handful of services,” a gateway plus a few libraries is almost always cheaper.
The Kubernetes Gateway API has matured to the point where traffic splitting, header routing, and TLS are first-class, and the implementation (Envoy Gateway, Kong, Traefik, NGINX Gateway Fabric) is swappable. An HTTPRoute that canaries 10% of traffic looks like this.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
  namespace: storefront
spec:
  parentRefs:
    - name: public-gateway
      namespace: gateway-system
  hostnames:
    - "api.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /checkout
      backendRefs:
        - name: checkout-v1
          port: 8080
          weight: 90
        - name: checkout-v2
          port: 8080
          weight: 10
      timeouts:
        request: 2s
        backendRequest: 1500ms
```

That is it. No sidecars, no control plane beyond the gateway itself, no per-pod overhead. It does not give you mTLS between checkout and inventory, and it does not give you L7 policy on internal hops. If you do not need those things, you do not need the mesh.
An API gateway is a sibling concern and I have written separately about choosing one. The point here is simply that a gateway plus disciplined HTTP clients handles the 80% case for teams under ~30 services.
5. What a mesh VirtualService actually looks like
For contrast, here is the same canary expressed through an Istio VirtualService with a DestinationRule, which is what you would write if the mesh is already running.
```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout
  namespace: storefront
spec:
  hosts:
    - checkout.storefront.svc.cluster.local
  http:
    - match:
        - uri:
            prefix: /checkout
      route:
        - destination:
            host: checkout.storefront.svc.cluster.local
            subset: v1
          weight: 90
        - destination:
            host: checkout.storefront.svc.cluster.local
            subset: v2
          weight: 10
      timeout: 2s
      retries:
        attempts: 2
        perTryTimeout: 800ms
        retryOn: gateway-error,connect-failure,refused-stream
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: checkout
  namespace: storefront
spec:
  host: checkout.storefront.svc.cluster.local
  subsets:
    - name: v1
      labels: { version: v1 }
    - name: v2
      labels: { version: v2 }
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
```

The mesh version gives you something the gateway version does not: this policy applies to checkout no matter who calls it, including other services inside the cluster. That is the entire argument for the mesh. If nobody inside the cluster actually calls checkout that way, you are paying for a feature you will not use.
6. The sidecarless wave and what it actually changes
The more interesting development of the last two years is the move to sidecarless architectures. Istio Ambient and Cilium’s mesh both move the data plane off the pod.
In Ambient, L4 mTLS and identity happen in a per-node ztunnel. L7 policy, if you need it, happens in a per-namespace or per-service waypoint proxy that only processes traffic that actually uses L7 features. Cilium takes a similar approach using eBPF in the kernel for L4 concerns and Envoy where L7 is required.
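To make the opt-in model concrete, here is roughly what enrollment looks like in Istio Ambient. The label and class names below match recent Istio releases, but treat them as a sketch and check your version’s docs:

```yaml
# Enroll a namespace in ambient mode: ztunnel now handles L4 mTLS
# for every pod in it, with no sidecar injection.
apiVersion: v1
kind: Namespace
metadata:
  name: storefront
  labels:
    istio.io/dataplane-mode: ambient
---
# Deploy a waypoint proxy only because some services here need L7
# policy; services opt in by pointing at it with an
# istio.io/use-waypoint label.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint
  namespace: storefront
  labels:
    istio.io/waypoint-for: service
spec:
  gatewayClassName: istio-waypoint
  listeners:
    - name: mesh
      port: 15008
      protocol: HBONE
```

The shape of the config is the point: L4 everywhere is one namespace label, and L7 is an explicit, per-namespace or per-service decision with its own proxy attached.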
What sidecarless changes.
- Per-pod overhead collapses. You do not pay 150 MiB of RAM for every pod. You pay for the per-node component plus a waypoint for services that opt into L7.
- Upgrades get simpler. No coordinated sidecar restart across every pod in the cluster. The node-level component and the waypoints upgrade independently.
- L4 becomes cheap enough to turn on broadly. mTLS-everywhere stops being a budget conversation.
What sidecarless does not change.
- You still operate a control plane. CRDs, identity, certificate rotation, policy authoring, none of that goes away.
- Debugging still crosses multiple components. ztunnel, waypoint, application — three places where a request can die instead of two.
- L7 policy still needs a proxy somewhere. If you use retries, traffic splitting, header routing, a waypoint shows up in the path and so does its overhead.
Ambient and Cilium’s mesh shift the break-even point meaningfully. A team that could not justify sidecars on 1000 pods might be able to justify Ambient for L4-only mTLS on those same pods, and opt specific services into L7 waypoints as needed. That is a real improvement. It is not a blanket “mesh is now free.”
7. Mesh-without-a-use-case is the worst shape
The outcome I see most often when a team installs a mesh speculatively is what I think of as mesh-without-a-use-case.
- The mesh is installed cluster-wide because “we might need mTLS later.”
- No real traffic policy is using it. Retries are still in application libraries. Timeouts are still in application config.
- Observability is duplicated. Mesh emits metrics, the app also emits metrics, nobody is sure which dashboard is authoritative.
- Every outage adds one more question: “is this the mesh.”
- Every upgrade is a day of coordination that blocks unrelated work.
This is not the mesh’s fault. It is the predictable outcome of installing a platform component that is only valuable when you lean on it, and then not leaning on it. A mesh that is not the source of truth for retries, timeouts, traffic shifting, and identity is pure overhead with an impressive architecture diagram.
If you are going to run a mesh, commit to it. Remove the duplicate libraries. Make the mesh the only path for retries and timeouts. Make mTLS the default, not the exception. Put policy into the mesh instead of into the app.
If you are not going to commit to it, do not install it.
8. A decision rubric I actually use
When a team asks me whether they should adopt a mesh, I run through roughly this checklist.
- How many services do you run, and how many teams own them? Under 30 services and fewer than 5 teams, start without a mesh.
- How polyglot is the fleet? If it is one language with one shared HTTP client, libraries will carry you much further than they will in a polyglot shop.
- Do you have a compliance or zero-trust mandate that requires mTLS between every workload? If yes, a mesh is usually the right answer regardless of service count.
- Do you need L7 authorization, not just L4 network policy? If yes, you need a mesh or a per-service proxy, which is the same tax under a different name.
- Do you have platform capacity? A mesh is a platform team’s responsibility. Without a platform team, the mesh becomes a distraction for product engineers.
- What does your canary strategy require? If weighted routing at the edge plus feature flags in the app is enough, you do not need a mesh for it. If you need per-cohort routing across 100 services, you do.
If the honest answers point to mesh, Ambient or Cilium is usually where I would start today, not classic sidecar Istio. Ship L4 mTLS first, add L7 waypoints where they earn their keep, and stay out of the sidecar-everywhere era unless you have a specific reason to live in it.
9. What to do instead, for most teams
For a team in the “not yet” zone, the stack that gets you most of the benefit without the mesh tax looks like this.
- Gateway API at the edge for public ingress, TLS termination, and request-level policy for outside traffic.
- Cluster-level network policy at L3/L4, not L7, to scope which pods can talk to which pods.
- A single shared HTTP client per language with retries, timeouts, and deadlines baked in. One library per language, reviewed centrally, versioned like everything else.
- OpenTelemetry instrumentation in the app, one set of metrics and traces, no mesh-emitted duplicates.
- Cert-manager plus SPIFFE or workload identity where mTLS is required on specific sensitive paths, not everywhere.
- Feature flags in the app for cohort-based rollouts that do not map cleanly to traffic weights.
That stack carries most teams through the first few years of growth, stays cheap to operate, and leaves the door open to adopting a mesh later when the service count and team count actually demand it.
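The L3/L4 scoping in that stack is plain Kubernetes NetworkPolicy, no mesh required. A sketch, with illustrative names:

```yaml
# Allow only pods labeled app=checkout in this namespace to reach the
# inventory pods on TCP 8080; all other ingress to inventory is dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inventory-ingress
  namespace: storefront
spec:
  podSelector:
    matchLabels:
      app: inventory
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout
      ports:
        - protocol: TCP
          port: 8080
```

This gets you “pod A can talk to pod B” with whatever CNI you already run. It does not get you method- or path-level rules, which is exactly the line where the mesh conversation starts.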
10. The takeaway
A service mesh is a good answer to a specific set of problems. It is a bad answer to problems you do not actually have yet.
The failure mode I see most is teams adopting a mesh because the architecture looks serious, not because the problems they face require one. Six months later the mesh is running, nobody is using its features deeply, and every incident has an extra suspect.
If you have a polyglot fleet, a compliance mTLS mandate, granular traffic-shifting needs across many services, or a zero-trust program that requires L7 identity, the mesh pays for itself. Pick Ambient or Cilium, commit to it, and make it the source of truth for network policy.
If you do not have those problems, the boring stack wins. A gateway, network policies, one shared HTTP client per language, OpenTelemetry, and feature flags will carry you further than most architecture diagrams suggest.
The best answer to “should we adopt a service mesh” is rarely yes or no in the abstract. It is a count of services, a count of teams, a mandate or the lack of one, and an honest look at what you would stop doing in the app the day the mesh turns on. If nothing would change in the app, the mesh is not earning its keep.