Episode 51

Published on:

24th Jun 2026

What Do Service Meshes Actually Solve? (William Morgan, Buoyant/Linkerd)

Network calls fail in ways function calls never do - and once a monolith becomes microservices, reliability problems show up fast: retries amplify load, latency spikes cascade, and “what talks to what?” becomes hard to answer.

William Morgan, co-creator of Linkerd and the person who coined “service mesh,” breaks down what service meshes actually solve for platform teams running Kubernetes at scale. The conversation focuses on practical outcomes: improving reliability between services, getting uniform observability without rewriting every app, and handling gaps Kubernetes doesn’t cover well - like gRPC/HTTP2 load balancing and cross-environment communication.

Key topics

  • Why reliability is the first “microservices tax” (timeouts, retries, backoff, cascading failure)
  • What Kubernetes does not solve at the networking layer - and where a service mesh fits
  • gRPC/HTTP2 load balancing problems and why L4 balancing can fall short
  • Service-to-service visibility: understanding traffic flows and performance without per-app instrumentation
  • Cost and resilience tradeoffs with multi-AZ Kubernetes on AWS (and how zonal-aware balancing can help)
  • Whether developers should ever need to interact with service mesh configuration
  • Where zero trust and policy controls belong: platform guardrails vs application ownership

Guest: William Morgan, CEO at Buoyant, Co-Creator of Linkerd

William Morgan brings a unique take on platform engineering, security, and traffic management in cloud native environments. William’s the mind behind Linkerd, the CNCF graduated service mesh born to make security, observability, and reliability "just work" for modern apps without all that heavy overhead. With roots as an infrastructure engineer at Twitter, where he was hands-on in the shift to microservices, and experience at Microsoft, Powerset, Adap.tv, and MITRE, William understands operational complexity better than most. His perspective on reducing unpredictable cloud spend with features like Linkerd’s High Availability Zonal Load Balancing is timely for any team wrestling with multi-AZ cloud bills.

William has hands-on knowledge of MCP, the protocol now critical for securing enterprise AI traffic. He also has strong views on sustainable open source business models, having contributed to open source for over 20 years.

William Morgan, BlueSky

Buoyant, Website

Buoyant, LinkedIn

Buoyant, YouTube

Linkerd, GitHub

Transcript
Cory:

Welcome back to the Platform Engineering Podcast. I'm your host, Cory O'Daniel.

Today my guest is William Morgan, CEO of Buoyant and Co-Creator of Linkerd, the CNCF graduated service mesh. William cut his teeth on infrastructure at Twitter back when the company was moving off its Ruby on Rails monolith. And he's been working in the distributed systems, security, and platform space ever since. William, welcome to the show.

William:

Thanks, Cory. It's great to be here.

Cory:

So you've been around for quite a while. From the Ruby on Rails days all the way up until the modern era.

William:

Yep.

Cory:

Geez. Okay, so you're... So it was Twitter... You were back at Twitter back in the day.

What took you from Twitter to the world of Linkerd and Kubernetes and service meshes? What was that journey like?

William:

Yeah, so for me, Twitter was a pretty transformative period in my career because I actually started at Twitter in what they called, at that point, the Relevance team. And we were doing stuff called natural language processing and machine learning and things that like, you know, nowadays are a hot topic but at that point were kind of a struggle to productize.

I mean, lots of fun challenges, but actually getting those things to the point where you could do anything that was really product oriented was really hard. And I was like, "Well, this stuff is never going to work. You know, like this whole area sucks. I'm actually going to go move to the..." You can see how, how good I am at predicting the future.

But at that point I was like, you know what Twitter is really having trouble with and that is making a huge impact is infrastructure. So I kind of moved over to the infrastructure side of the house. And that's where, as you kind of alluded to, at that point, we were on this Ruby on Rails monolith.

We knew it wasn't going to scale. It wasn't going to scale for like a bunch of reasons. Some of them good, some of them bad, frankly, some of them were just the way Twitter did it and the specifics of where we were hosting it were just forcing this choice to be made. Which I initially was against because I was really pro Ruby at that point.

But eventually we moved on to what we called then a service-oriented architecture - we would now call microservices. And that transition was quite a dramatic one. And there are many analogies.

This was before Kubernetes, before... maybe Docker had just started to rear its head. We didn't have any of that. We weren't using any of that, but the system that we built at the end was what we would now call, you know, a cloud native system. It was containerized in the sense that it was on the JVM and we used cgroups for isolation. It was distributed and it was orchestrated.

It was running on this thing called Mesos, which was where I cut my teeth in terms of orchestration platforms. And so basically the idea for Buoyant was, "Hey, we've just seen this, like, tremendous transformation. We had to build all this technology to make it happen. The rest of the world is probably going to need to do something similar to that. So let's take what was one of the most critical components of that, which is... it actually was a library for kind of managing the communication between microservices and let's turn that into a product that anyone can use."

And we kind of did that right at the time when Docker and Kubernetes were really taking off. And our genius insight was like, no one wants a library. Well, it was a Scala library. Like, no one wants a Scala... It was even worse. It was a functional programming Scala library, you know, for doing RPC calls. So, like, you really have to be at this really niche intersection.

Cory:

Niche, yeah.

William:

Right. But like, let's turn that into a proxy and then you can stick a proxy next to your application. It doesn't matter what your application is written in... it's probably not fucking Scala... but it works, right?

And so we targeted Kubernetes and that was kind of the genesis of Linkerd and the genesis of the whole service mesh space.

Cory:

So for folks that are, you know, maybe they're starting to tear apart their first monolith, or maybe they've entered into the realm of service oriented architectures and some SOAPs and whatnot.

What would you say is the number one problem that a service mesh is solving as you're starting to break out your application into different services? No matter what their size, whether micro or macro. Despite your service size, what is the main benefit of having this service mesh sitting right next to your application, kind of managing the communication layer?

William:

I would probably say reliability is the biggest one because as soon as you make that move, you are in a world where you have effectively replaced what used to be function calls in a monolith... A calls B calls C... with network calls. And function calls, what do we know about them? Well, they never fail unless something really bad is going on. They're basically as fast as the computer can make them.

And is there a third thing we know about them? You do them all the time. And now you've replaced it with network calls which fail all the time and which are really, really slow.

And in fact, it's not just a network call between one component... like A is talking to B... it's a network call between... I've got a hundred instances of A and I'm talking to a thousand instances of B. And each of those instances of B could be in some crazy state.

And so suddenly you've gone from a very clear kind of thing you can rely on - A calls B calls C - to like... you move straight from that into the world of distributed systems problems.

There's all these kind of famous examples of why this is suddenly hard. You're like, "Okay, well, calls sometimes fail, so I'll start retrying them." Okay, well, if I retry too fast that makes things worse, so I'll add exponential backoff. Oh, but actually I have a series of seven services that all call each other, and when the innermost one dies, then everything starts retrying. And now I've magnified my traffic by like three x at each hop.

You quickly get into this world where what in your head is this very simple operation - A calls B calls C, you're still thinking about it as function calls - in practice is this hugely complicated distributed systems problem. And that's a lot of what Linkerd can solve for you.
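To put rough numbers on the retry amplification William describes, here is a minimal Python sketch. It is an illustration only, not how Linkerd handles retries: the function names are hypothetical, and it assumes the worst case where every hop makes the same number of attempts while the innermost service is down.

```python
from __future__ import annotations


def amplified_load(base_requests: int, attempts_per_call: int, hops: int) -> int:
    """Worst-case number of requests reaching the innermost service.

    Assumes each hop makes `attempts_per_call` attempts (the original call
    plus its retries) for every request it receives, so the load multiplies
    at every hop while the innermost service keeps failing.
    """
    return base_requests * attempts_per_call ** hops


def backoff_delays(base_seconds: float, factor: float, retries: int) -> list[float]:
    """Wait time before each retry under plain exponential backoff (no jitter)."""
    return [base_seconds * factor ** attempt for attempt in range(retries)]


if __name__ == "__main__":
    # Seven services in a chain means six hops between them.
    print(amplified_load(base_requests=1, attempts_per_call=3, hops=6))
    print(backoff_delays(base_seconds=1.0, factor=2.0, retries=4))
```

Under those assumptions, one user request at the edge becomes 729 requests against the dead service. Backoff spreads those requests out in time, but it does not reduce how many are made.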

Cory:

Yeah, I feel like when you first start breaking up a fairly large app, like, the first criteria people are usually looking at are, you know, like, I'm trying to untangle something. Right?

And it's just like your engineering effort at that point in time is like, "Okay, I've got the product and now I have to start thinking about just literally pulling the code apart."

And it's just like most teams, they start to get that thing ripped apart and then thinking about the networking layer and like, the intercommunication is just like, that's a day two job you didn't realize you had yet, right?

And you can get into this place where you're just constantly chasing, like, just system failures in different ways, right? You start to get to the point where it's like, maybe those services that you're calling are moving around, maybe they're not at the same IP address all the time. And that just gets to a point where it's like you start spending a ton of effort, like undifferentiated heavy lifting, just trying to get this thing to feel as cohesive as a monolith again. But there's just a ton of engineering effort kind of literally in between these two systems.

William:

Yeah, yeah. And you know, in the modern world like, you know, a lot of the basics have been solved by Kubernetes.

But what Kubernetes doesn't do, and I think this is a positive thing, is like it really doesn't do anything at the networking layer besides, "I will do Layer four load balancing." And so a lot of what Linkerd gets used for today is basically it's kind of like the missing piece to running Kubernetes at scale.

So for example, if you're running gRPC, Kubernetes default load balancing actually doesn't really work for you because gRPC runs on HTTP/2. HTTP/2, you make a single connection and you start multiplexing requests over it. And like, you know, layer four load balancing doesn't really do it. You know, you get pinned to one pod and it's like, okay, all is well.

So Linkerd will like sit in there, it'll intercept those calls, it'll establish a connection to each pod and it'll start doing request level load balancing.
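The difference between connection-level and request-level balancing can be shown with a toy simulation. This is a sketch under simplifying assumptions: the pod names are made up, and round-robin stands in for whatever algorithm a real balancer uses. It is not a description of Linkerd's actual load balancer.

```python
from __future__ import annotations

from collections import Counter
from itertools import cycle


def connection_level(pods: list[str], clients: int, requests_per_client: int) -> Counter:
    """Layer 4 style: a pod is picked once, when each connection opens.

    Every request multiplexed over that long-lived HTTP/2 connection
    then lands on the same pod.
    """
    hits: Counter = Counter()
    next_pod = cycle(pods)
    for _ in range(clients):
        pod = next(next_pod)  # chosen per connection, not per request
        hits[pod] += requests_per_client
    return hits


def request_level(pods: list[str], clients: int, requests_per_client: int) -> Counter:
    """Layer 7 style: a pod is picked for every individual request."""
    hits: Counter = Counter()
    next_pod = cycle(pods)
    for _ in range(clients * requests_per_client):
        hits[next(next_pod)] += 1
    return hits


if __name__ == "__main__":
    pods = ["pod-a", "pod-b", "pod-c"]
    print(connection_level(pods, clients=1, requests_per_client=300))
    print(request_level(pods, clients=1, requests_per_client=300))
```

With a single gRPC client, the connection-level version sends all 300 requests to one pod while the other two sit idle. The request-level version spreads them 100 each.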

So you know, if you even... something as simple as like, "Well, I'm going to run gRPC and I'm going to run on Kubernetes," suddenly you find yourself needing something like Linkerd. Or any kind of cross cluster communication. Kubernetes doesn't really provide any affordances for that.

So like if I'm service A and I'm talking to service B, but I actually want B to like live on this other cluster where I want to migrate between clusters, where I want to balance traffic dynamically. You know, Kubernetes kind of leaves you on your own. Well, that's exactly where Linkerd comes in.

Cory:

Yeah, that's one of the places I see people stumble into, I guess realizing that there's this like inter service like work to do fairly frequently is in that Kubernetes adoption. I feel like there's a very common adoption pattern which is like, "Hey, we're going to start using Kubernetes and we have a cluster."

The world's a happy place when you have a cluster.

William:

Yeah, right.

Cory:

When you have a cluster, but it very frequently moves from there to, "Well, yeah, okay, we have our cluster, but we also have this absolute farm of VPCs over here where we have all these old VMs running from the stuff that has not moved to this cluster yet. And I've got networking problems over here and then I've got our data center and I've got networking problems over there."

You can very "easily" (I'm throwing air quotes around "easily") adopt Kubernetes, but the reality is you probably haven't moved your entire business to it, right? And that networking problem is just all over the place when you start getting in there.

William:

Yeah, yeah, that's right. And you know, sometimes the problems are surprising. Like we have a tremendous number of customers who the number one reason why they use Linkerd is because they're running on AWS.

And even in the single cluster case, the best practice on AWS is to run your cluster so it spans multiple availability zones (AZs) within the same region. Because you're like, okay, if one AZ goes down, at least the cluster, you know, still survives. I've got nodes running on another.

So that sounds great, except AWS will charge you lots of money for traffic between availability zones. And so if you have low cluster traffic, like you're getting started, it's pennies and you don't care. And then you know, as soon as you start sending lots of traffic, suddenly you're paying like millions of dollars. And you know, Linkerd, because of where it sits, it can make these really advanced load balancing decisions.

It can take into account the fact that the destination is in a different zone and it can actually, like, not just cut off that traffic, so that by default, when the system's in a happy state, it's not sending any cross zone traffic, but if the system is under stress, it can actually start opening the gate and allowing you to send cross zone traffic. So then you kind of get the best of both worlds.

In the happy case you don't have any of that charge. But then when you actually need the cross zone thing for reliability reasons, Linkerd will help you. So that's a feature we call HAZL - High Availability Zonal Load Balancing.
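The "keep traffic in-zone when healthy, open the gate under stress" idea can be sketched roughly as follows. This is not HAZL's actual algorithm, which the conversation does not describe. The `Endpoint` type, the load measure, and the threshold are all hypothetical and exist only to make the decision concrete.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str
    zone: str


def candidate_endpoints(
    endpoints: list[Endpoint],
    local_zone: str,
    local_load: float,
    stress_threshold: float,
) -> list[Endpoint]:
    """Pick which endpoints a client in `local_zone` may send traffic to.

    `local_load` is an assumed 0.0-1.0 measure of how busy the same-zone
    endpoints are. Below the threshold, only same-zone endpoints are
    returned, so no cross-zone traffic is generated. At or above it, or
    when the zone has no endpoints at all, every endpoint is eligible.
    """
    same_zone = [e for e in endpoints if e.zone == local_zone]
    if same_zone and local_load < stress_threshold:
        return same_zone
    return list(endpoints)


if __name__ == "__main__":
    fleet = [Endpoint("10.0.0.1", "us-east-1a"), Endpoint("10.0.1.1", "us-east-1b")]
    print(candidate_endpoints(fleet, "us-east-1a", local_load=0.2, stress_threshold=0.8))
    print(candidate_endpoints(fleet, "us-east-1a", local_load=0.9, stress_threshold=0.8))
```

A real implementation would presumably open the gate gradually rather than all at once, but the trade-off is the same: pay for cross-zone traffic only when reliability needs it.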

And this was not in the plan, in the beginning, it was something that we stumbled into because we had customers who were like, "Holy crap, we're spending so much money on this one thing. And like, what do we do? Like, we don't want to give up the multi AZ deployment, but we also don't want to spend all this money on it."

So you know, sometimes you like learn these things from your customers, which it's been fun for me.

Cory:

That is fun.

I think about our cloud bill and until maybe two or three months ago, I think networking and inter AZ traffic was I think up there in like our number one or number two highest-billed services. Like, it is a place that sneaks up on you and it's a painful one to address after the fact. Right?

Like resizing a database, pretty straightforward.

William:

Right.

Cory:

Figuring out all of your like AZ traffic after the fact and like trying to straighten that one out like real quick to make the CFO happy is... yeah, is not a real quick task.

William:

Yeah, yeah. And I mean it's the thing that you know, if you don't do something like HAZL with Linkerd, the other options are much more significant.

Like you could deploy, you can move to separate clusters per AZ. Well, it's like a big re-architecture of how you're using clusters.

It's doable, you know, but it's like, so now you've got a different... you've got the reliability problem at a different layer of the stack. So yeah, it's a tough spot to be in.

Cory:

Have you seen the massive service mesh Google Doc, or the Google Sheet? Do you know what I'm talking about?

William:

Oh yeah. This is from a couple of years ago, right? Yeah, yeah, yeah.

I was monitoring that for a while and making sure that Linkerd was represented correctly in there. Features at that point... like, everyone's adding features left and right, and then at some point...

Cory:

I'll put a link in the show notes. I was just curious if you'd seen this because I remember it was like a... oh geez, at a previous job, it was a frequently referenced document when we were deciding on our service mesh. But I think my grander question here is why are there so many service meshes?

I feel like it feels like the left-pad of things that people can build in the Kubernetes space, and I feel like looking at that sheet is overwhelming with the amount of different service meshes that are on there. What is the appeal for folks to keep building service meshes?

William:

I think there's a couple things. You know, obviously Linkerd was the first, like we're the ones who created the term. So everyone thought like what a great idea, let's go copy them.

So you know, via imitation. Well, no, I don't. You know, I'd like to think that's true. I'm not sure how true it is. I think part of it is like, they seem fun to build, right?

Like you get to build a control plane and a data plane and like, you know, you can add a proxy and, you know, every infrastructure nerd on the planet is like, "Oh cool, like, let's add a proxy here. Like, let's have it talk to the control plane." And so there was this explosion early on when Linkerd first came out of people doing their own service mesh.

Like every company would build their own because the infra team was like, "Oh yeah, this looks really cool." So I think part of it is that. And then part of it, I don't know. I think it just, that area got...

That field got really hyped up for reasons that I don't think have really held water in the long run. But for a while everyone was like, "Oh my gosh, this is the..."

There were at least three or four KubeCons in a row where it was like, "This is the year of the service mesh." That's all everyone was talking about.

And you know, the cloud native community, they love new things and they love getting excited about technology and like, you know, sometimes making decisions because of the love of the technology more than like, because they're trying to solve an actual problem. So I think that all got kind of bundled up into, into this. Yeah, there might be some other factors in there too.

But you know, from my perspective, I was just like, "Wow, you know, yeah, sure, like take a swing at it. It's really hard."

Cory:

It is. I feel like it is definitely one of the harder... I think it's one of the harder problems to solve. And it's one of those funny things. It's so funny. It's like there were so many tools in the space. It's a hard decision. Well, I'm saying it's a hard decision.

It can be a confusing decision to make.

Like when you first start getting into the space trying to figure out like which one is the right tool, like which one should I even like POC that are out there? Because there were so many of them.

William:

Well, you know, a lot of them have kind of fallen by the wayside. So I'm, you know...

We hit our 10-year anniversary this year, so we've been like powering, you know, production systems for ten years. And like, you know, my goal is to make it to a hundred years.

Cory:

Heck yeah, I have... I should have worn it today... I'm very bad about synchronizing my outfits with podcast episodes, but I have a very old Linkerd shirt somewhere in my closet still from... I cannot remember what conference I got it at, but it's one of my few that have survived.

There's an old NGINX one, a Linkerd and then I think a Travis CI one that like they're barely clothing anymore. They're just like, they're like a whisper of logo. Yeah.

William:

Yeah. Well, find it. You know, I'm sure we have some more sitting around in a box somewhere, so if you can't find it, we'll send one over.

Cory:

So as far as platform engineering goes and service meshes, right.

So it's like this is a pretty critical component to, I'd say any team that's sufficiently started to adopt platform engineering and self service, right? Like the amount of services that we spin up in the cloud is pretty frequent, right?

Like, self service isn't just about getting some infrastructure that I need, right. It's very frequently how do I get my app up?

Like I got a new app up and that app is almost always immediately interacting with some number of services. Like in our constellation of apps that we have.

In your opinion, what does the handoff look like or what does the service mesh interface look like to the developer who's a consumer of the data platform... or not data platform, of your team's internal self service platform? Am I seeing the service mesh?

Is it something that I'm aware of or like in your mind, is it something that is just completely kind of absorbed by the platform which I'm deploying on?

William:

Yeah, I have a pretty strong opinion on this one, which is I don't think the developers should see the service mesh at all. I think that's... that is like a part of the platform that doesn't really need to be exposed to them.

And it doesn't even speak a language that developers, I think, understand. Right. Like the way you configure Linkerd...

One of our design principles is to be as Kubernetes-centric as possible, so look like Kubernetes, smell like Kubernetes, feel like Kubernetes. So you know, you like, how do you configure it?

Like you write a bunch of CRDs and the CRDs these days are often like Gateway API CRDs, which means it's a mapping of like six CRDs that all refer to each other in like some complicated scheme. That's a platform owner concern. And like, you don't want the developers to spend their precious little brain cells on that. Like, you want them off building business logic.

So I think they should see the output of the service mesh.

So for example, one of the things that Linkerd does, because we're sitting in between, you know, every call that's happening between services in your system, we emit metrics.

And so, you know, we can give you this totally uniform view of what are the, you know, HTTP success rates and what is the latency distribution and how much traffic is going and actually who is talking to whom in here. Like, is A calling B yes or no? That was a big problem for us at Twitter. Like, we literally had no idea.

Well, we can provide all that information to you and so the developer should see that, right?
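To make that "uniform view" concrete, here is a small sketch of turning per-request records into per-edge success rates, latency percentiles, and a who-talks-to-whom map. The record format and names are hypothetical and this is not Linkerd's metrics pipeline. It also assumes that any 5xx status counts as a failure.

```python
from __future__ import annotations

import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EdgeStats:
    """Running totals for one (source, destination) pair."""

    total: int = 0
    successes: int = 0
    latencies_ms: list[float] = field(default_factory=list)

    def success_rate(self) -> float:
        return self.successes / self.total

    def latency_percentile(self, pct: float) -> float:
        # Nearest-rank percentile over the recorded latencies.
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]


def summarize(records) -> dict:
    """Group (source, destination, http_status, latency_ms) records by edge."""
    edges = defaultdict(EdgeStats)
    for source, destination, status, latency_ms in records:
        stats = edges[(source, destination)]
        stats.total += 1
        if status < 500:  # assumption: only 5xx responses count as failures
            stats.successes += 1
        stats.latencies_ms.append(latency_ms)
    return dict(edges)


if __name__ == "__main__":
    observed = [("A", "B", 200, 10.0), ("A", "B", 503, 40.0), ("B", "C", 200, 5.0)]
    for (src, dst), stats in summarize(observed).items():
        print(src, "->", dst, stats.success_rate(), stats.latency_percentile(99))
```

The question "is A calling B, yes or no?" then becomes a lookup: an edge is either a key in the summary or it is not.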

I think a big part of the self service model is, yes, like I, you know, push a button and deploy my app, but also, you know, you as a dev are kind of on the hook for, you know, what is the app doing? Like, you know, is it behaving? Is it running out of memory? Is it talking to things I think it should be talking to?

What is that spike in traffic coming from? Like, those are all things the service mesh can help you answer.

And also when you get into policy and things like that, like, actually I don't want A to be able to talk to B because A is in the non-PCI... You know, it's a non-PCI service and this one is PCI.

Like you can enforce all that with Linkerd, but again, that configuration I wouldn't expose to the devs. I think the devs need to operate at a level that's kind of independent of like the mechanics of how you're, you know, configuring stuff.

Not everyone agrees with that, but that's my opinion. Save their brain cells for, for something more important.
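The PCI example boils down to a rule that a platform team could express once for every workload. As a rough sketch only: the `pci` label and the function below are hypothetical, and this is not how Linkerd policy resources are written.

```python
from __future__ import annotations


def is_allowed(source_labels: dict[str, str], destination_labels: dict[str, str]) -> bool:
    """Decide whether a call from one workload to another is permitted.

    Rule (an assumption for illustration): workloads labeled pci=true
    accept traffic only from other workloads labeled pci=true.
    Everything else is allowed.
    """
    if destination_labels.get("pci") == "true":
        return source_labels.get("pci") == "true"
    return True


if __name__ == "__main__":
    service_a = {"app": "a"}                   # non-PCI
    service_b = {"app": "b", "pci": "true"}    # PCI
    print(is_allowed(service_a, service_b))
    print(is_allowed(service_b, service_a))
```

Because the check depends only on labels, it can live with the platform team and apply to every service without the developers of A or B writing any of it.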

Cory:

Yeah, it seems like, I mean, it's obviously tied so much into what I'm doing as a developer. Right. I want those retries. I want to not think about mutual TLS, right.

I want to not think about having a God token for talking from service A to service B all the time. Right. It's just like it feels so tied to what we're doing, but at the same time it does feel very commodity. Right?

And it's like, for me to have to learn that to use it... and I've seen it surface that way in a handful of platforms that I've been on, where it's just like, "Oh, I have to reckon with this thing" instead of it just kind of like soaking into the background for me. I definitely agree. I think it's something I like to see hidden a little bit more.

And I think also it's one of those things that, you know, as you're saying with like being able to just like profile the traffic and the metrics, like that is a part that I do love the ability to see because that's one of those things that I feel like is so hard, especially when you're breaking down an app, is to figure out what is even talking to what. You'll see something like, maybe you have a client in your code and it's like, "Oh, this is a client for some service." And it's like, are we calling that anymore?

I know the code is there, I can see the code is there, but I don't think anyone's used that in a year or whatever. And then all of a sudden you'll look at traffic in between. You're like, oh shit, that thing gets called all the time. Why?

And it's one of these things it's hard to see unless you go through and literally instrument every single I/O call that you're making.

But to be able to have that just natively appear there and be able to see that these services are communicating, or even seeing things like, geez, just seeing that cron job or that background job that got scheduled 50,000 times for some reason. And it's just like, yeah, it works, it's doing the work.

But then all of a sudden it's like, huh, that's 50,000 HTTP calls, like while that thing is running.

It's like, it's one of those places where it can be difficult to see that level of metrics and information many times at the application layer, especially if it's not your application that's receiving the calls. I think that's one of the places that I really enjoy it.

William:

Yeah.

And that's kind of our value prop to the platform owner, you know, which is kind of the target audience for Linkerd anyways, is we're going to do this across your entire cluster, across your entire platform. We're going to do it in a way that's uniform.

It doesn't matter what the devs are using for their language or framework or whatever, if they're speaking HTTP or gRPC, then we'll give you this uniform layer, not just of visibility, but also of control. Now, to be fair to the platform owners who have exposed the service mesh config to the devs or whatever, there is an intersection there.

And this is something that I don't think the industry really has a great answer for. And I think it's an interesting problem. There is an overlap in control there.

So for example, who should be responsible for the retry or let's say the rate limit of a particular method on a service? Does a platform team own that? Does a developer own that? Does a developer own it but the platform team gets to put some guardrails in place? And if so, like whatever the answer is there, how does a developer get that config into the hands of the platform team? That's not a solved problem.

I've seen people address that in a variety of ways. You know, you can like have annotations on, like... what's the... the OpenAPI spec, it used to be Swagger.

So like you could say, "Hey, Devs, you need to like use a spec, and the spec, you need to annotate it. And from there we'll like build the Linkerd configs." That's probably the most sophisticated version that I've seen.

But that is an unsolved problem for a lot of people and I have some ideas, but I don't have any great code for solving that yet.
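One way to picture the "developer owns it, platform sets guardrails" option is a merge step that clamps what the developer asked for. The sketch below is hypothetical throughout. OpenAPI does allow custom `x-` prefixed extension fields, but the `x-retries` and `x-rate-limit` names, the `Guardrails` type, and the output shape are invented here and are not a Linkerd or OpenAPI convention.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    """Upper bounds set by the platform team."""

    max_retries: int
    max_rate_per_second: int


def effective_policy(operation: dict, guardrails: Guardrails) -> dict:
    """Combine a developer's annotated operation with platform limits.

    `operation` stands in for one operation object from an annotated spec.
    Missing annotations fall back to no retries and the platform's maximum
    rate. Requested values are clamped so they never exceed the guardrails.
    """
    requested_retries = operation.get("x-retries", 0)
    requested_rate = operation.get("x-rate-limit", guardrails.max_rate_per_second)
    return {
        "retries": min(requested_retries, guardrails.max_retries),
        "rate_per_second": min(requested_rate, guardrails.max_rate_per_second),
    }


if __name__ == "__main__":
    limits = Guardrails(max_retries=3, max_rate_per_second=500)
    print(effective_policy({"x-retries": 10, "x-rate-limit": 200}, limits))
```

The developer states intent next to the API definition, and the platform team keeps the final say on the limits without either side editing mesh configuration by hand.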

Host read ad:

Ops teams, you're probably used to doing all the heavy lifting when it comes to infrastructure as code wrangling root modules, CI/CD scripts and Terraform, just to keep things moving along. What if your developers could just diagram what they want and you still got all the control and visibility you need?

That's exactly what Massdriver does. Ops teams upload your trusted infrastructure as code modules to our registry. Your developers, they don't have to touch Terraform, build root modules, or even copy a single line of CI/CD scripts. They just diagram their cloud infrastructure. Massdriver pulls the modules and deploys exactly what's on their canvas. The result?

It's still managed as code, but with complete audit trails, rollbacks, preview environments and cost controls. You'll see exactly who's using what, where and what resources they're producing, all without the chaos. Stop doing twice the work.

Start making Infrastructure as Code simpler with Massdriver. Learn more at Massdriver.cloud.

Cory:

So Linkerd, it's sitting between all my services on Kubernetes across AWS, my data center. That's great for my microservices. Fantastic.

But in this world of people starting to run more like inference heavy workloads on Kubernetes, how does it fit into the picture there? Does Linkerd apply to those type of workloads or is it very much kind of tied around our transactional workloads and microservices?

William:

I mean the surface level answer is like, yeah, it applies. They're all speaking HTTP anyway. So yes, to a first approximation, any inference workload is... it's just a workload that makes these extremely long, multi-second or more HTTP calls.

I think there's a... you know, you can get a little deeper than that and it depends what that workload is.

So for example, there's kind of two areas we've explored. One is, are these actually like agentic workloads that are making MCP calls? If so, yes. MCP happens over HTTP.

But there's a lot of details that you want to dig into there.

If you really want to do observability on agentic workloads making MCP calls, or I should say, if you really want to do observability on MCP traffic, you kind of have to speak that protocol. So we gave a little prototype demo of that at KubeCon last November, I think in Atlanta.

The other thing we've been diving into a lot is the actual inference serving itself. And this is something that I think is becoming increasingly interesting, especially as people's Anthropic and OpenAI bills start skyrocketing.

And the fact that a lot of the open weight models are extremely good, extremely good, you know, even compared to the frontier models. You can run those things locally and you can use vLLM or SGLang or TRT-LLM. And a lot of that is happening on Kubernetes.

And now you have these workloads that actually are a little weird from the Kubernetes perspective because they're very large, they consume a whole GPU or if you have multiple GPUs on the machine, they like consume all of them. So they have to kind of be, you know, pinned to that machine. They're very, very slow.

They take a long time to start up, you know, and it, like, takes 17 minutes to load the models into the memory and compile your CUDA kernel and whatever the hell else they're doing. But you know, once they're running, they are these extremely high concurrency, extremely powerful workloads and then what do you do?

You know, like, how do you manage that? Right. And that's a problem. Not just for Linkerd.

I think that's a bit of a challenge for Kubernetes itself because they're so outsized and they're so extreme in the spectrum of workloads. And Kubernetes was designed much like Linkerd. It was designed for these really fast microservices where every millisecond is important.

We spent all this time tuning Linkerd... we wrote the data plane in Rust and we're measuring things at the microsecond level. And now we're putting it in front of this workload that's like, "Well, it's going to take me 17 and a half seconds to respond anyways." Okay.

But yeah, it's super, super interesting, super interesting area.

And I have a belief, and I'm kind of trying to implement the corresponding code here, that local inference or people running inference on their Kubernetes clusters, whether it's in the cloud or on a VM, I think that's going to be a huge part of what the future looks like for AI.

Cory:

Yeah. Especially with all the, you know, the kind of token spend articles coming out here and there.

I'm actually, I'm very curious what the move will look like after this. Like, recently Uber came out and said they're spending too much on tokens. Microsoft came out and said they're spending too much on tokens.

And so like, I can see it. Like, I can definitely see where people are blowing a ton of money on tokens.

But at the same time, like, I don't know, I don't love like the Pandora's box genie out of the bottle trope, but, like, it does feel hard to go back from, from a software development point of view, right? Like, I mean, like, I'm not one of these people that's like, "Oh, fucking stick some AI in my washing machine. I gotta know what this, you know, these pants taste like or whatever." Like, I don't know why my washing machine needs AI, but apparently they're sticking it in there now.

William:

You want to know what the pants taste like?

Cory:

I don't know. What is the AI telling me you're washing my pants, bro?

William:

Because there's a cheaper way of solving that problem.

Cory:

Yeah, I know, but I mean, the people that put AI in these things, they're not thinking that far ahead. But as far as software development, I've leaned into AI based development and it does work, it does work well.

I mean outputs are something that have to be reasoned with and you have to build software around it. But that's always been our job in DevOps is to make unstable and untrustworthy things stable and trustworthy. Right.

And it's like going back from that now, right? Like, oh, the tokens are too expensive. I guess I'll go back to writing it myself.

Doesn't seem plausible for the people that have successfully used LLMs to write code. And so I'm very curious what corporations will look like as far as like running inference in house here in the next year or two.

Because, I feel like for the engineers that have seen the boon of LLM based development, it's going to be very hard to be like, just go back to, you know, your VS Code and good luck. Tab complete.

William:

Dude, I have a much stronger version of your statement, which is, I think it is astonishing when you get the hang of Claude.

If you've used Claude, especially in the past couple months and you have tried to do something serious with it, it is astonishing at how powerful it is now. It's not perfect. You have to learn the tools. You have to know like what it's good for and what's not good for.

You have to know when to insert yourself and that's a process that you like any tool you have to get through. But when you have that thing humming, it's astonishing. I cannot imagine going back.

So I think you're exactly right that we're going to be in this really interesting situation where you know, the devs at least are going to be, you know, and everyone who like takes the output are going to be, you know, they've seen nirvana, right? They've seen just how powerful things are. Can you go back to handwriting code? Boy, I don't know.

You know, that seems like a big regression but at the same time, yeah, if costs are skyrocketing then what are you going to do?

And especially, you know, I think the other hammer that's waiting to drop is these companies are going to go public, and when they go public, they're going to have this, like, S-1 filing and, you know, they're going to be under the gun. So all the subsidies that are happening right now with Anthropic and OpenAI, where it's like, oh yeah, pay us 200 bucks a month and we'll give you all this.

Like that's going to skyrocket for other reasons and now you're going to be in a really tough situation.

So I think that is all kind of pointing to a couple of things, but I think one of the things it's pointing to is a lot more reliance on local inference, because the big advantage there is you're renting a machine from AWS. Let's say you're in the cloud, right? On AWS, you're renting a machine from AWS, a machine with GPUs. You're not paying per token, right?

You're like, you're using it.

This is an economic model that like you already have a whole company built around like, okay, it's going to cost us this much in AWS spend and you know, maybe we can negotiate with them for a discount and oh, if AWS is too expensive then we'll use another provider over there.

Like, companies already know how to manage that, and so you can transform all of your token costs into just good old fashioned cloud compute costs, you know, and get right back into this very well understood calculus. So it's going to be interesting.
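The per-token versus rented-machine tradeoff William describes comes down to a break-even calculation. Here is a minimal Python sketch of that arithmetic. Every price in it is a made-up placeholder, not a real vendor rate, and it deliberately ignores things that matter in practice, such as how many tokens a single GPU node can actually serve and the cost of operating it.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """All figures are hypothetical placeholders, not real vendor prices."""
    api_price_per_million_tokens: float  # USD per 1M tokens on a hosted API
    gpu_node_price_per_hour: float       # USD per hour for a rented GPU machine
    hours_per_month: float = 730.0       # average number of hours in a month


def monthly_api_cost(model: CostModel, tokens_per_month: float) -> float:
    """Per-token billing: cost grows linearly with usage."""
    return tokens_per_month / 1_000_000 * model.api_price_per_million_tokens


def monthly_gpu_cost(model: CostModel, nodes: int = 1) -> float:
    """Rented machines: cost is flat no matter how many tokens are served."""
    return nodes * model.gpu_node_price_per_hour * model.hours_per_month


def break_even_tokens(model: CostModel, nodes: int = 1) -> float:
    """Monthly token volume at which both billing models cost the same."""
    return monthly_gpu_cost(model, nodes) / model.api_price_per_million_tokens * 1_000_000


# Hypothetical example: $10 per 1M tokens versus a $30/hour GPU node.
model = CostModel(api_price_per_million_tokens=10.0, gpu_node_price_per_hour=30.0)
print(f"GPU node per month: ${monthly_gpu_cost(model):,.0f}")
print(f"Break-even volume:  {break_even_tokens(model):,.0f} tokens/month")
```

Below the break-even volume the hosted API is cheaper; above it, the flat machine cost wins, provided the node can keep up with the load.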

Cory:

Yeah, I think it's going to be a pretty exciting time for I think the open weight models. I mean I was playing with Gemma recently. Gemma's pretty good. Gemma's pretty good.

William:

Great.

Cory:

But I think it's going to be a really interesting time for Kubernetes.

I mean, I think we're going to have a Google Doc with about a thousand different ways of running inference on Kubernetes. But like, I just, I can't see it. I can't see it either.

Like for the people that have been successful with it to think about going back, it is a thing to behold.

Like, you know, I think we've done almost 100% of our development over the past year has been, you know, AI assisted and it's, you know, our problem domain. Like we think about different problems now. It's, it's exciting. Like it's, it's, it feels, it feels honestly much more rewarding.

But the weird part, that I'd be curious what your take on this is, is I've felt personally that like while it seems more rewarding, it's like, "Oh, I got that feature out that this customer's been waiting on for a month and I got it out in a day." And it's just like that would have been months of work for me to sit down and like knock out. The one thing that does feel a little weird is, while it's rewarding to see the output, the output feels less yours. Do you feel that? Like, it's just like you don't feel as connected to the code.

It's just like, oh, it does the thing that was kind of magical and now I'm off to the next magical thing.

William:

Well, I think I feel that way in a healthy way, which is, you realize, and this has always been the case, like the code is not the thing that has a lot of value. Sorry, it hasn't always been the case. It. This has increasingly been the case as the decades have progressed.

In the 70s and the 80s, the code had immense value. Open source came around. A bunch of other things have changed. And now you realize the actual lines of code are really not the important part.

The important part is how do you get this into the hands of people? How do you run it in a way that doesn't take 100 humans their whole lives to operate it? All of the other stuff.

And I think that, you know, that transition has been happening anyways and this has just accelerated it. Right. What's the, you know, gosh, what was the thing that Cloudflare did where they like re implemented? I think it was a whole library because they didn't like the terms, they didn't like the license and they're like, "Okay, well we're just going to reimplement it."

Cory:

I think it was like Next.js and they rebuilt it from like the test suite or something like that. And then people were like, "Tests are your moat now."

William:

No, no, really, like if you're making an open source project and you want to prevent, you know... And the license is, like, a big part of Buoyant's business strategy.

You know, in the end the license is kind of what gives you some of the affordances that you're building a business around. Those tests have actually now become the valuable IP. I mean, gosh, really, really fascinating. But honestly, I think it's.

I think it's healthy to have that attitude towards code. Code is disposable. Yeah, right. The code is just a means to an end, so the value has to lie elsewhere.

Cory:

What do you think about open source in this new era? Like, that's one of the things that I feel like that the supply chain attacks have been absolutely bonkers, right?

And I followed Mitchell Hashimoto on what I call Twitter. Still, as long as the domain name resolves, that's what I'm calling it. I'm sorry, ditch the domain name and I'll call it something else.

But someone... Mitchell talked about this over the past few days about how HashiCorp's approach to libraries was fork and maintain, not always taking an update.

And it seems like something that we've. It's funny, to your point, the code was never the value, right? Open source kind of also shows this and doesn't show it at the same time, right?

It's like, hey, it's my OAuth library. I don't give a shit, I didn't write it. I'm just going to grab one, right? It's like I don't care.

I mean, like, I want the person to auth, but like, do I care enough to write it? No. Right? I'm just going to take something, run it, right?

But now, you know, especially in AI assisted development, like for me to take a spec, something like the SCIM spec or something like that, and it's like, hey, generate this library and we just own it and if it works, it works, and that's great. And I'm not going to think about it anymore.

Like, I'm curious, like what happens to open source in the AI era when you get to the point where you can't trust the left pad level of libraries anymore, but they're scattered throughout your code base. Your dependency that you do care about is depending on some dependency you don't care about.

I'm really curious, what do you think happens with supply chain long term and just the proliferation of dependencies in our code base when, you know, we're seeing just the ability to sneak attacks in just all over the stack.

William:

Okay, so your worry is that it's now easier to insert attacks into code because you can have AI like generate that. You can generate those attacks and like sneak them in.

Cory:

Yeah. Well, I mean with the Shai-Hulud stuff like, like there's like, hey, we vibe code like half of our attacks and it's like, and we're seeing it like all over the stack, whether it's, you know, getting GitHub Actions and kind of like backdooring through GitHub Actions, or whether it's like these weird commits that they're able to like sneak into, you know, the color library. And it's like, I feel like...

William:

Isn't the opposite true too? Like, you know, you've got Mythos and, like, whatever. Like, now you've got these automated tools that can find these types of attacks, you know, like, both sides have. Both attack and defend have increased their value.

Do you think it's asymmetric? Actually, I don't really have a strong opinion here. It feels like it must be symmetric, but, you know.

Cory:

I mean, there's a... I feel like it's asymmetric just from the number of assholes on the planet, right?

Like, to be able to sit around like, we've reached the point, right? Like, I think that. I think we're going to have a dark moment of AI in the very near future, right?

Like, I mean, to be able to sit down and con one of these things past its guardrails and have it do an attack, like, that's one thing, right? And to sit and think about, like, does this dependency really matter enough for me to trust a third party?

I think that's more of the question is, like, when do we decide that. That we have zero trust, but then we're just fucking grabbing dependencies off shelves all day long. Zero trust. Fuck it. This thing's got 55 stars.

William:

Yeah. And in fact, it's probably exacerbated by the fact that you've got Dependabot running all the time.

It's like, oh, there's a new, you know, update over here. Like, pull it in, pull it in.

Cory:

I'm hesitant now. I'm just like, oh, shit. Like, is that. Is this a good thing or a bad thing? I don't know. Like, now I got to go. Look at this.

In the past, I was like, yeah, fucking green upgrade. Yeah, that's good to go.

But, like, I think, like, I don't know, I feel like it may be asymmetrical just from, like, the amount of shitheads out there. I mean, there's a bunch of shitheads, right? Like, and I think we're going to see it fall apart. And like, the. And this is.

This is software adjacent, but, like, I don't think we're long from people using OpenClaw to do horrible things to other people. Like, to just, like, single somebody out, you know, and just be like, you know what? This guy's an asshole.

I'm going to use my OpenClaw to just attack his network as long as possible until I get in, right? Like, I don't know.

So it's like, it feels a bit asymmetrical because there's just so many libraries and there's so many, like, I mean, I have libraries that are used in other libraries. I'm just like, "Oh my God, like, I can't believe people are using my shit for this."

Like, and it just, it feels like it's getting to the point where like, you have to take in, you have to consider your dependencies now, which was just.

It felt like for a very long time, and this may be so naive to say, but it felt like for a long time you could just trust open source for the most part. And now it's starting to feel like a bit like I have to second guess, like a dependency I reach for.

William:

It feels like you've never really been able to trust them, but you've gotten away with it for a long time and now, you know, AI has made it really clear.

Cory:

Maybe that's the asymmetry.

William:

Right? Like, you should have been doing this the whole time, like this guy. You know, I think there's a couple areas.

I mean, gosh, this is such a broad topic. You know, we could, we could talk about this for like three hours.

I think there's a lot of areas where what AI has done, if you step back, is it's basically amplified the dynamics. So those dynamics were always there, but now, like, oh, you can't ignore it anymore.

So, for example, one that comes up a lot with Linkerd is, is zero trust. Right. Like, one of the things that Linkerd does is it gives you the ability to basically put a little firewall in every pod.

And that pod makes its own decisions about like, hey, this client wants to talk to us and it's calling method A, you know, and its identity is foo. Right. And I have to decide whether I want that or not.
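The per-pod decision William describes (this client, with this identity, calling this method: allow or deny?) can be sketched in a few lines of Python. This is only an illustration of the shape of a default-deny check. It is not Linkerd's API or configuration format; in Linkerd the rule is expressed as policy and enforced by the proxy, not written in application code. The identities and method names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    client_identity: str  # who is calling, e.g. an identity established by mutual TLS
    method: str           # which operation the client is trying to call


# Hypothetical allow-list for one pod: the (identity, method) pairs it accepts.
ALLOWED = {
    ("foo", "A"),
    ("reporting-agent", "ReadRows"),
}


def authorize(request: Request) -> bool:
    """Default deny: only explicitly listed (identity, method) pairs pass."""
    return (request.client_identity, request.method) in ALLOWED


print(authorize(Request("foo", "A")))                         # listed pair
print(authorize(Request("reporting-agent", "DropDatabase")))  # known client, unlisted method
print(authorize(Request("unknown-client", "A")))              # unlisted client
```

The important property is the default: anything not explicitly allowed is refused, so a known client calling a method it was never granted is rejected just like an unknown client.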

People have been talking about zero trust for a long time, but very few companies have actually gone to the trouble of like implementing that.

Because most of the time the code that's running in your environment, well, it's written by developers and it's been code reviewed and like, you know, it's, you know, kind of like you don't have to worry about that much trust on it. Right, Right. But now, now you're running agentic workloads and they're powered by an LLM and like, what's the LLM going to do?

Well, at some point it's going to find the call that deletes the database and it's going to call it. Like, there's no question, it's not it's not a maybe, it's a question of when. Right. At some point it is going to call it. So now you have to do this.

You have to take this zero trust approach. You cannot allow the LLM to do whatever it wants. Okay?

That dynamic was already there, but now we've amplified it, you know, and it seems like the same thing, you know, with supply chain attacks and supply chain defenses and stuff. Like, yeah, those dynamics were there. You never could really trust. I mean, look at the XZ attack. Oh my gosh. The fact that that thing was caught feels like a miracle.

Which suggests to me that there's probably 99 other XZ attacks just as nefarious that were never caught.

Because, you know, it just happened that this guy at Microsoft was profiling something and he just happened to wonder why it took, you know, 200 milliseconds more in this situation. What a happenstance that was. So there's got to be 99 of those things, you know, buried around. Little sleeper.

Sleeper agents, you know, agent sleeper agents. So AI is just making it all obvious for us. Yeah, but I don't know, it's a landscape. Like, it's easy to say, hey, you know, it's just amplifying the existing problem.

But the consequence is there's a big shift in the landscape. And the way you have to think about all of these dynamics now is different. You have to be pretty first principled in your thinking. Good luck.

Cory:

That's the big difference between this and a loom, I suppose, is when it generates free labor. That's when it starts to. That's when it starts to get a little out of control. That's the asymmetry.

William:

Maybe the way out of this is token costs. Like, maybe once these costs go way, way up, then, you know, we can go back to the good old days where it was hard to do all that.

Cory:

Yeah, yeah, you have to do all your attacks from your Raspberry Pi cluster in your garage.

William:

This is the paper you can write. The Ethical Case for High Token Costs.

Cory:

The Ethical Case for High Token Costs. I like that. I like that.

William:

By Cory O'Daniel. Not by me.

Cory:

Oh my gosh. Yeah, no, I think that one's going to be. It is funny because it has been something that, you know, I know some teams, they're very particular about like their dependencies and whatnot.

And it's an interesting place, because at the same time there's so many libraries that we use today where it's just like, I only need, like, 5% of that library, right?

And to just generate that code yourself versus bringing in a whole thing, that can start to feel a little intimidating, to have a bunch of your own code that's not really your core thing. But also, sometimes, just to be like, ah, I can trust that 5% of the code that I brought over. Like, looks good, does what it says on the box.

William:

It also depends on what your code is doing. So, for example, you know, you mentioned nginx earlier. Nginx had a service mesh product.

At some point we actually hired a bunch of those folks at Buoyant. And you know, so I've learned a lot about nginx.

One of the things that I learned is that the core maintainers of nginx were very sensitive to any dependencies. They wanted to reduce the dependency surface as much as possible because nginx would sit at the edge, right?

Every company that used it would put it at the edge, and so it would have to handle untrusted traffic. So you know that any problem in any of those dependencies would turn into a vulnerability for nginx.

Now if you're writing, you know, so that's like one extreme.

If you're writing, like, oh, this is something that runs in my infrastructure and it's behind, you know, all these other layers of protection, then maybe your risk profile is smaller and you can take on some of those dependencies. You know, it's not zero. Like, those "don't trust anyone" situations have existed in the past. Now, again, I'm just saying it's been amplified.

Cory:

Yeah, yeah, for sure. Last question for you.

So I would love to know because I feel like, you know, just based on the conversation we have, like you're somebody who's leaning into AI. You're doing a little bit of the AI development.

I know many people today that have started leaning into it, like, don't feel like they're doing any less work, right. Like, it can feel exhausting. And I'd be curious, like at the end of your day, like, how different do you feel from a job?

And I know you're CEO now, but writing as an IC writing software all day versus assessing the output of LLMs all day is a very different mental task. And I'd be curious, how are your teams adjusting to that and how does Buoyant look at the value extraction from LLMs?

So it's like, if I sit down and I'm the only person on my team using AI assisted development, I'm not a boon to my team, I'm a hindrance. If I'm the only one doing it, I'm producing, like, 7,000 PRs a day and just creating work for everybody else.

Now if the whole company is using LLMs, and 100% of the day, we're all just doing LLM output. I've done 60 years of work in a week. The company gets that full benefit.

How are you guys thinking about the value extraction and how engineers use LLMs without, like, necessarily giving 100% of the value to the individual engineer or 100% of the value to the company? How are you thinking about that? And, like, how is it changing the way that you work?

Is it changing like how you think about like work hours and days or. I'm just really curious, like, what, where you guys are at with that?

William:

Yeah, we just vibe code everything, you know, like if you're running Linkerd, it's basically like, who knows what's going on with that thing. Like it's probably a bunch of JavaScript that the LLM generated and you know, we just ship it, you know, fuck it.

Cory:

Yeah. You're not, you don't feel any. Okay. I don't know. I feel exhausted right now.

I'm like, I'm thinking, I'm already thinking about PRs that I have to review and I'm already, I'm exhausted just from seeing the stack of them appear.

William:

Yeah, there's your problem. You don't review the PRs.

So for us, you know, it starts with, and I kind of alluded to this with my little story about nginx, different parts of our code base are dramatically different in terms of the risk profile, in terms of the necessity to get things right and the spectrum spans. I would say on one extreme side is the core of what Linkerd does. That thing has to work, has to be performant, has to be predictable.

When people insert Linkerd into their platform, we are a pacemaker, we're not an Apple Watch that you put it on like, okay, you don't like it, then you take it off. We go into your body and like you need surgery to put us in and you need surgery to take us out and we have a big dependency.

So that part of it, our usage to date has been on things like analysis, find bugs like things like that.

Maybe limited generation, but everything gets extremely peer reviewed and like a lot of hands on effort like, no line of code gets into the core of Linkerd without a bunch of extremely smart systems engineers who are familiar with all the intricacies there, you know, are okay with it and probably have like tweaked it a bit on that. So that's one extreme. On the other extreme, which is where I operate, you know, is on like the product process.

And there it's been really transformative because now instead of writing a Google Doc and like we have these like, you know, long conversations and like, you know, flesh out the whatever. And that takes place over several weeks. I just ship a prototype. I'm like, hey, here's a prototype, it works.

Here's the five design choices that I had to make in practice. Now let's talk about those things. And like, if we have a question about this one, well, let me refactor it so that it does this other thing.

And like we answer product questions through prototypes and that, you know, that code is basically throwaway and then there's a middle area and that's where there's a gray area. Okay. We run a SaaS service, it's called Buoyant Cloud. You know, it has to be up, right? It has to like serve our customer.

It provides this like kind of management layer and visibility layer for your Linkerd installation which is running in your cluster. But if that thing has a failing component, the worst case scenario there is like, oh, the app is down for some number of minutes. Well, that sucks.

We have SLAs we have to meet. But that's a way better thing than like, oh, a customer's application got taken down because we made a mistake.

So like our tolerance for LLM generated code is higher there. And so if you can map out, I mean, this would be my advice in general.

If you can map out kind of the risk, risk levels for the different parts of code and recognize that they're different, then I think that from there you can naturally draw the type of usage and the way that you interact. That makes sense. I know there are some very AI forward companies where what they do is.

And we're not quite at this point, and I'll leave it as an exercise to the reader about whether this is a good idea or not. But Claude has a plan mode and they do peer review, code review, but they do plan review.

They don't look at the actual code, they review the plan, they go back and forth on the plan, and once everyone's okay, then Claude generates the code. I'm sure there's other protections in place, but that's how they've evolved.

Like I said, we're not quite at that point, and I would say that probably would make more sense for our SaaS product, where the risk surface area is lower than for the core of Linkerd.
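William's advice to map out risk levels for different parts of the code base, and let the AI usage follow from that, could be written down as a small lookup. The sketch below is one hypothetical way to encode it in Python. The directory patterns and policy wording are invented for illustration and paraphrase the three tiers from the conversation; they are not Buoyant's actual repository layout or process.

```python
from enum import Enum
from fnmatch import fnmatch


class Risk(Enum):
    CORE = "core"            # the core proxy: must work, be performant, be predictable
    SERVICE = "service"      # hosted SaaS: downtime hurts but is recoverable
    PROTOTYPE = "prototype"  # throwaway code used to answer product questions


# Hypothetical path patterns, checked in order; the first match wins.
TIERS = [
    ("proxy/*", Risk.CORE),
    ("cloud/*", Risk.SERVICE),
    ("prototypes/*", Risk.PROTOTYPE),
]

# Paraphrased from the conversation, not an official policy.
POLICY = {
    Risk.CORE: "LLM for analysis and bug finding; limited generation; heavy human review",
    Risk.SERVICE: "higher tolerance for LLM-generated code; still reviewed",
    Risk.PROTOTYPE: "generate freely; code is disposable",
}


def tier_for(path: str) -> Risk:
    """Return the risk tier for a file path, defaulting to the strictest tier."""
    for pattern, tier in TIERS:
        if fnmatch(path, pattern):
            return tier
    return Risk.CORE  # unclassified code gets the most cautious treatment


for path in ["proxy/src/lib.rs", "cloud/api/handlers.go", "scripts/release.sh"]:
    tier = tier_for(path)
    print(f"{path}: {tier.value} -> {POLICY[tier]}")
```

Defaulting unknown paths to the strictest tier is a design choice in this sketch: new code has to be deliberately classified as lower risk before the looser rules apply.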

But those are the sorts of things that I think also evolve and the benefits and kind of the practicalities evolve as the models themselves get better. So yeah, that's our approach today. I think we're kind of, we do some unique stuff.

The fact that Linkerd is such a critical piece of infrastructure and the requirements on it are so tight, it means that we have to treat it in a very specific way. And I think the majority of the world that are building SaaS apps probably are operating like, you know, in a different way.

But yeah, I'd say the most transformational part of it to date has been in the product process and in the move to prototyping versus writing PRDs product.

Cory:

And prototyping has been amazing.

cognitive load of getting the:

Like just like the stuff that like nobody's ever had time for that like hinders everybody. But like you also like, eh, it's like oh, this we know that I've got, we got a flaky test that like 10% of runs failed.

But it's like, ah, it was a hairy thing to figure out. Nobody wanted to spend time on it. Couldn't get a, you know, a PM to give us those T shirt sizes to go in there and do that work.

It's just like, go fix this thing so it never bugs an engineer again. And like that's one of those places where it's like you can, it doesn't matter like what code comes out.

You could throw that PR up three or four times until it's right. But as soon as it is, it's like, dude, if 10% of our builds were failing, I wouldn't. If it was 10%, I'd shoot myself, honestly.

But if it was 10% of your builds were failing. No, I know that team's out there. Somebody's listening right now is like 10% I'd kill for 10% of my builds failing.

But like that's a place that I find that it's very Useful to use where it's not necessarily like mentally taxing to like deal with the output.

William:

Yeah, yeah. And it's not going to the core of your product.

The other usage I found similar to that is, like, generating Terraform and, like, Makefiles for Terraform stuff. Like, you know, because I spin up a lot of AWS clusters these days and, like, I don't want to use any of my brain cells on Terraform. Like, zero.

Cory:

Not as a, as a person who. I wouldn't say I sell Terraform. I definitely cannot legally say that. Get yourself an Open, get yourself an OpenTofu, folks.

But I haven't written a single line of it. I mean like, it's funny, we've gotten to the point for our proof of concepts with customers.

We take our conversation and we run the conversation through one of our little AI agents that we built internally. And so like their proof of concept is running invisible like day one. And it literally comes from a conversation.

And it's just like, that's one of those places. It's like, in the IaC space, the painful part of IaC has always been using it, not writing it. Writing it is so easy. It's like, that problem is solved.

Like, we need to spend the effort in getting, you know, self service and platform engineering into fruition and working. And fiddling around with a Terraform resource at, like, the variable level, like, that is a waste of ops time at this point in time. I'm convinced, at least. But yeah, I may get some tomatoes thrown at me for that one, but it's fine. I like tomatoes.

William:

Yeah, I agree. All right, well, I, I have a request for your listeners if that's all right with you.

Which is, like, if anyone out there, you know, assuming you have any listeners, assuming this isn't just being thrown into the void. If anyone out there is running inference workloads or wants to run inference workloads on Kubernetes, I would love to talk to you.

You know, I've talked to some people who are doing really interesting things, but I would love to talk more. Just shoot me an email william@buoyant.io. Make sure you spell buoyant right. That's your first test.

I won't spell it for you, but if your email bounces then like you've opted out.

Cory:

You just gotta buy. But you just gotta buy like all three spellings, two spellings. How many different ways can you misspell it?

William:

Yeah, well we did, we did buy a bunch of the misspellings.

Cory:

Just redirect them all.

William:

Awesome.

Cory:

Where else can people find you online?

William:

You know, I'm on Bluesky. You can find me William Morgan... not something me or something... but I really don't... I haven't been doing a lot of Bluesky.

LinkedIn, yeah, you know, follow me on LinkedIn.

Cory:

There you go.

William:

Oh, gosh. What have I come to? That's my request.

Cory:

I like it. It's my favorite. It's my favorite social network. LinkedIn's the best. I love it. It's one of my favorite places to go and... Yeah, and see ads.

William:

It's weirdly. Weirdly a big part of my life.

Cory:

Yeah, me too. Awesome. Well, thanks so much for coming on the show. I really appreciate it.


About the Podcast

Platform Engineering Podcast
The Platform Engineering Podcast is a show about the real work of building and running internal platforms — hosted by Cory O’Daniel, longtime infrastructure and software engineer, and CEO/cofounder of Massdriver.

Each episode features candid conversations with the engineers, leads, and builders shaping platform engineering today. Topics range from org structure and team ownership to infrastructure design, developer experience, and the tradeoffs behind every “it depends.”

Cory brings two decades of experience building platforms — and now spends his time thinking about how teams scale infrastructure without creating bottlenecks or burning out ops. This podcast isn’t about trends. It’s about how platform engineering actually works inside real companies.

Whether you're deep into Terraform/OpenTofu modules, building golden paths, or just trying to keep your platform from becoming a dumpster fire — you’ll probably find something useful here.