Episode 44

Published on:

4th Mar 2026

Why Extend Went All-In on Serverless Platform Engineering

Billions of requests a month on AWS Lambda can cost less than a single engineer’s laptop budget, but only if the architecture and developer workflow are designed for it.

Justin Masse, Senior Platform DevOps Engineer at Extend, shares how Extend committed early to a serverless-first approach and built a platform that prioritizes developer speed and low operational toil. The conversation breaks down what it takes to run active-active, multi-region systems in a serverless world, how the team keeps services small and fast, and why asynchronous, event-driven design changes both reliability and cost.

You’ll also hear how Extend treats developer experience as a core platform responsibility: templated microservices, fast deployment pipelines, ephemeral environments for pull requests, and infrastructure that developers can own without becoming cloud specialists. A big theme is using AWS CDK and internal abstractions to keep infrastructure close to the application code, so teams can move quickly while keeping platform standards consistent.

Finally, the discussion gets practical about tradeoffs that show up after the “serverless is easy” pitch: local development challenges, the real cost center (observability), and where AI is helping today, including an internal agent that diagnoses failed deployments and suggests fixes.

What you’ll learn

  1. Why Extend avoids servers and VPC complexity, and what they use instead
  2. Patterns for active-active, multi-region thinking in a serverless architecture
  3. How DevEx practices like templates and ephemeral environments reduce friction
  4. A pragmatic approach to IaC with CDK and reusable internal constructs
  5. Where serverless costs stay low, and why observability often dominates the bill
  6. How AI is being applied to platform workflows without skipping engineering judgment

Guest: Justin Masse, Senior Platform DevOps Engineer at Extend

Justin Masse is a self-proclaimed lead chaos engineer, recognized within niche engineering communities for his expertise in Chaos Engineering and Infrastructure & DevOps.

The father of three young kids, a husband, a recent MBA graduate, recent cancer survivor, and competitive powerlifter, he still finds time to actively contribute to the platform engineering community.

Justin Masse, website

Justin Masse, GitHub

Extend, website

Links to interesting things from this episode:

  1. Episode with Adrian Cockcroft
  2. “From $erverless to Elixir” by Cory O’Daniel
Transcript
Cory:

Welcome back to the Platform Engineering Podcast. I'm your host, Cory O'Daniel. Today I'm joined by Justin Masse, Senior Platform DevOps Engineer at Extend.

If you're active on LinkedIn in the platform engineering world, you've probably seen Justin in the comments. He's usually in the middle of thoughtful debates about cost, AI hype, architecture trade-offs, and what holds up in production.

He works on some pretty cool active-active multi-region stuff, runs a heavy serverless stack, so we're definitely going to get into that today. But he also enjoys like the big messy problems that platform teams tend to avoid.

Justin, I've seen you all over the place, like I love having conversations with you online. So I'm super stoked to have you here today. Why don't you just give a little intro about yourself, what you do at Extend, and let's go from there.

Justin:

Yeah, yeah, I'm glad to be here. Go ahead and take the banter off LinkedIn into the podcast.

Cory:

Hell yeah.

Justin:

Yeah. So like you mentioned, Senior DevOps Platform Engineer, kind of a mouthful. Means something a little different to every company, title's a little different.

Cory:

They certainly do.

Justin:

Yeah. Yeah. So I've been at Extend just about five years now, so kind of gotten to grow our platform because we're only about seven years old as a company.

So yeah, a lot. Everything we do serverless, we don't have servers, we avoid VPCs. Anything that's going to add just like any extra technical burden and slow down our teams is the kind of stuff we try to avoid as much as possible.

So yeah, everything we're running is serverless, from our databases, DynamoDB, through all of our compute layer, which is mostly... almost all Lambda backed. So we have a good 70 or 80 different services that we're running at any given time, all just completely serverless backed.

So it allows us to just kind of nix a lot of the ops part and do a lot more of the dev. So that's where we kind of lean our platform teams more heavily into developing versus maintaining. We're not worried about middle-of-the-night Kubernetes clusters, anything like that. We can just run hands off and build, build, build, which kind of gives us this huge competitive advantage and lets us build at scale and speed.

Always be kind of out ahead.

Cory:

I had Adrian Cockcroft on, I think like maybe a year or maybe two years ago now, and he just talked about like the origins of serverless at AWS and like how Netflix was involved and like hearing... and I've worked in Serverless as well over the years, but not to the extent that you all, like 100% all in.

Reflecting on that conversation with Adrian, it was just talking about like the speed and agility that Netflix engineers had at the time, when like this was all just kind of coming into fruition, like the idea of serverless. But there's so many great fits for it in the modern business, but I still feel like it's one of those things. It's like despite the fact that you see it everywhere in the cloud, I feel like so many companies did not get to that ideal place that Adrian kind of pitches. And it seems like Extend has really gotten there.

And when you talk about stuff like the active-active stuff you're doing and the multi-region stuff you're doing, that's one of those things that, as an operator, it's like that is so hard to do in a server full world. Right. And it's much, I feel like it's much easier to do in the serverless world.

And so this is one of those things. It's like, you know, it's that pipe dream that so many companies want to chase. They want that great disaster recovery story - "I'm in multi-regions, I can just cut over." But it's just so hard to get there and it's so hard to just adopt serverless in like a sane, non-chaotic way. And so I would love to just learn about how your team just got to the point where you're like, serverless is the way for us.

We're going to eschew the servers, eschew the networking. What decision took you there?

And then how did you start going about designing the architectures and processes to really get your engineers to embrace it and own it?

Justin:

Yeah, luckily it was pretty much decided that way.

Even just before I started at Extend, it was just, "This is what we're doing." The people who founded the company, the technical leaders and the first couple of engineering hires - who are very brilliant engineers - all decided, this is the way we're going to go. This is the way we can start our company, this scrappy startup, and kind of explode out of the gate and get things built, get things deployed, get live quickly. We're not going to have to hire DBAs, we're not going to have to hire network administrators and set up all this infrastructure. Let's just take our code... here's our actual business code and let's just send it out as fast as we can and start getting customers. And then from there it's just kind of... we realized really it can scale.

The past couple years we're like, "Is this something that's gonna really scale? Are we gonna have to kind of back off and like do some containerization?" And we realized we really don't need to. We're able to kind of work around any limits we approach.

So that really, just from the get go, that was just the decision. We're going to go this route and we're going to avoid any of these other costs or complexities and anything like that. Needing extra engineers to have their hands in here. So we could really just develop and build. And really, any engineers we've hired, the buy-in has been pretty good.

I mean that was actually one of the things that drew me to Extend was this idea of serverless everywhere because I'd worked at a previous company where we're running our workloads on your typical load balancers, EC2s, stuff like this. And we're just like, "Why are we doing this? This is such... we're paying for all this compute that we're not using."

Cory:

Yeah.

Justin:

We talked about this idea of this like, "Why aren't we just doing lambda backed stuff here?" And ran into Extend when I was looking for a new job and I was just like, "That's what I want to do."

Cory:

Just aligned.

Justin:

That's what they were doing. So it just lined up perfect. Yeah.

I mean the buy-in has been pretty extreme. We've not really had much pushback from engineers. There are occasional workloads that come up where like, "Hey maybe we can explore a different option." We do have a couple workloads that we run in... if you're familiar with App Runner in AWS... so it's like a managed containerization where it's still completely hands-off, VPCs, everything. So we have a couple other workloads where we are still running where it's like hands-off. We don't need net admins, anything involved still.

It's been a dream. We have very little maintenance, very little Ops work. Really like the toil is extremely low.

Cory:

Yeah. Where I see a lot of companies like I guess struggle with serverless is I think... I think three main areas.

I'd love to like dig in on these, if you're cool with it.

Justin:

Yeah.

Cory:

So the first one I see is the local development story and there's like, you know, there's like the lambda simulators, the DynamoDB simulators. But like, once you get outside of like the first major class of serverless, that's where it starts to get hard to do local development.

The second place I see it is... it's funny, like serverless, I feel like when you look at code bases, like IaC code bases, you have a lot more IaC in a serverless world than in a serverful world. And I feel like that's one of the things that kind of can be rough is the like IaC can typically feel like outside of the developer's domain. And I feel like that's one of the places we see a lot of people kind of like start to choke up in adoption is like, there's a lot of this IaC to manage the thing or SAM or something like that. And then that always kind of falls back on the Ops team, right? And now you got this blocker of like, well, we're serverless. It's supposed to be so easy, but the Ops team does the IaC.

And then I think the third one where I see people trip up on serverless is they're like, "It's super duper cheap." And then they make some bad decisions and they're like, "Holy moly, this thing is not super duper cheap." And there's CapEx and OpEx, you know, logic that goes into what is cheap there.

But like, starting with just like the developer story, like, how does Extend think about developers, like, building a new service, like, whether it's an app runner or a Lambda or a step function. What does my development story look like?

How can you build a good from zero lines of code to production on serverless and make it fluid for the developer?

Justin:

Yeah, so a lot of that work too has been done by... We kind of divide our platform team into a couple teams. We have a DevX team, which is just one or two engineers, but their primary focus is on how can we improve the efficiency for our developers to do that, to go from idea to service to production. A lot of that actually starts with templating in GitHub, where we have these microservice templates where we just call it like our backend template for a microservice. You take that, you're able to just start a new service based on that template. In GitHub, you have all the code there. You have your base Infrastructure as Code, which I'll get into in a minute, but you have your base infrastructure, you have everything that you need. So essentially all you need to do is plug in your business code into Lambdas, add a little bit of infrastructure around your API endpoints and then stand up your pipeline and it's off. You're going through the pipeline to production. Realistically, we could spin up a new... if you had all the business code ready... you can spin up a new service and be in production in a single day for most of our services. Which is... it's amazing.

So a lot of it started there. It's just like, "Let's automate, let's get a template set up where we can just take it and run with it." And then all the deployment stuff is handled through a lot of like GitHub actions, type things, things that are repeatable, reusable. So developers really don't even have to think about that. It's just plug your service code in as much as possible. So that's, that's where most of it, most of our developers get their efficiency.
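A minimal sketch of the template idea Justin describes, with every name and payload shape here hypothetical: the repo ships a thin, uniform handler wrapper (parsing, error shaping), and the only hole a developer fills is the business logic.

```typescript
// Hypothetical shape of a backend-template entry point. The wrapper
// below would ship with the template; the developer only supplies
// the pure business function passed into it.

type ApiEvent = { body: string };
type ApiResult = { statusCode: number; body: string };

// Template-provided: uniform parsing, serialization, and error shaping
// so every service responds the same way.
function wrapHandler<T>(logic: (input: T) => unknown) {
  return (event: ApiEvent): ApiResult => {
    try {
      const result = logic(JSON.parse(event.body) as T);
      return { statusCode: 200, body: JSON.stringify(result) };
    } catch (err) {
      return { statusCode: 400, body: JSON.stringify({ error: String(err) }) };
    }
  };
}

// Developer-provided: the business code plugged into the template.
interface CreateContract { productId: string }
const handler = wrapHandler((input: CreateContract) => ({
  contractId: `contract-${input.productId}`,
}));

const res = handler({ body: JSON.stringify({ productId: "p-42" }) });
```

The point is the division of labor: the template owns the boilerplate and pipeline wiring, so a new service is mostly business code from day one.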

Actually, like you mentioned, one of the kind of limitations is that local testing. That's really been still our... I would say our biggest pain point is that like, "How do I quickly shift everything left? How can I get everything tested on my local without having to spend a bunch of effort?" A lot of it, again, we're just doing unit testing, just typical development stuff. We do have a very fast way to deploy to...

We have this idea of ephemeral environments.

Cory:

Hell yeah.

Justin:

You open a PR, you get put into a quick queue, it sets up an environment for you and blasts your service in there with all the other services like main... essentially their mainline branches in there and runs your different test suites against your deployed service. Where you're touching integration points. Like if you're buying a contract and you need to wait for some event to fire off when a contract gets purchased and digest it, things like that. You're able to actually test all those workflows in this ephemeral environment so you don't touch it, you don't do anything to it. Just open a PR, goes there, runs all your tests against it. It's good. You can merge your PR and go to our development and up the chain to production.
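The PR-driven flow Justin describes can be sketched as a deploy plan: the PR's service runs at its branch, everything else at mainline. This is an illustrative model only, not Extend's tooling; all names are hypothetical.

```typescript
// Sketch of the ephemeral-environment idea: a PR number maps to an
// isolated environment where the PR's branch of one service is deployed
// alongside the mainline versions of every other service.

interface DeployPlan {
  envName: string;
  services: Record<string, string>; // service name -> git ref to deploy
}

function planEphemeralEnv(
  prNumber: number,
  prService: string,
  prBranch: string,
  allServices: string[]
): DeployPlan {
  const services: Record<string, string> = {};
  for (const svc of allServices) {
    // Everything runs mainline except the service under review.
    services[svc] = svc === prService ? prBranch : "main";
  }
  return { envName: `pr-${prNumber}`, services };
}

const plan = planEphemeralEnv(118, "contracts", "fix/refund-events",
  ["contracts", "payments", "notifications"]);
```

Integration suites then run against `plan.envName`, so cross-service workflows (like the contract-purchase event Justin mentions) get exercised before merge.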

We do have one-off environments that developers use for their services as well, like sandbox environments that we're able to kind of spin up and use. And that's where a lot of... outside of your unit tests, your tests are passing, you're good... you're like, "Okay, let me see if it actually runs on Lambda." Which really, we've never really run into anything where it doesn't.

Cory:

Yeah.

Justin:

But again, it's just like, let me make sure it actually deploys. It runs. There's no weird gotchas because I'm missing a layer or something weird. So they're able to just kick their service deploy off. We're able to deploy most services in like a minute and a half, two minutes. So it's pretty quick.

Cory:

You know what I love is, as a person that runs a company that like sells a Dev tool, like DevX is such an important role in developer tooling. I love that you said, like, you have DevX people on your team that is focusing on your actual developer's experience, which is, I think, rad.

Like the developer experience... I feel like in so many, especially like a budding platform team, like when you're trying to like, you know, kind of work through like the primordial ooze of like, "What is Ops and SRE? And what is our platform going to be like?" The DevX is something everybody on the engineering team is talking about. It's in Slack somewhere, right? It's like, "Oh, this sucks, that sucks, this sucks, that sucks." And like, people are just trying to keep track of what sucks and it's usually going to the bottom of the stuff we fix list. Right.

It's like, there's stuff that the customers say suck and then there's stuff that the Devs say suck and like that sits to the bottom. And it's up to all of us to kind of make it better or, you know, the Ops person that's starting to move towards platform engineering to say like, "Hey, how can we get a really good MVP that makes the developer experience better?" But it's always still kind of a thing you do as a part of your job. And it's like when it's a shared responsibility, it's no one's responsibility. Like, you know, like, who's accountable for it and like, I love the fact that you guys are doing that.

I venture to say that that's probably where a lot of this efficiency comes from in your team, is that you have people dedicated to making our Devs more efficient, which is the name of the game at the end of the day, right?

Justin:

Yeah. Besides the scaling for customers, that's more of like our entire goal as a platform team is how do we make everyone else faster because they're the ones that are like developing the business logic and code that's going to actually make us money. And we're just like, how can we get it to customers faster? How can we make it scale? And that's really it.

We're just always looking for those little improvements. And most of what my direct team does with DevOps is focused around that area, as well as scaling. But a lot of it is just like developer efficiency, which kind of led into that next point around IaC.

So for us that's a really, really interesting thing we do. I'm sure a lot of companies kind of do it. Everything we do is built on AWS CDK. So we've looked at Terraform, we looked at all these other things, nothing just... we wanted something our developers can use with no knowledge of it. So everything we do is written in TypeScript. We wanted one language. We don't want developers to have to come in and learn some kind of new language, new templating language, anything like that. They don't need to know CloudFormation, they don't need anything. They just need to know how to read TypeScript.

Cory:

Yeah.

Justin:

Which is awesome. So that's why we chose CDK. Almost everything we run, at least developer facing, is in CDK.

So what we did is we took CDK and we more or less built an abstraction layer around it that is kind of an in-house... just NPM package essentially that we publish versions, just typical semver versions that we push out to our registries that we let developers then install. And it's just full of abstractions to really simplify anything services need to do.

There's always one off things where it's like, "Hey, my service needs this." And if they can't figure it out in CDK themselves, either we'll build an abstraction if it sounds like it's going to be reusable or we'll help them work it into their service. But for the most part most services are doing the same thing out of the gate.

So yeah, we have this giant just abstraction layer essentially around CDK where we are building. So the way CDK works is they publish a bunch of L1s which are just essentially base cloudformation wrappers, very basic cloudformation wrappers. And they publish some L2s, which are slightly more opinionated wrappers around their Cloudformation L1s.

So we take that a step further and we take these L2s... so you might have something like API Gateway and then API gateway methods... and we abstract that into what are known as L3s. And we abstract that so teams can stand up API gateways with a bunch of methods and attach their Lambdas to it really simply.

Yeah, we publish those out there and it kind of really reduces the burden on developers. It's really not much burden because when you're standing up a service, most of our services are pretty similar or there's existing patterns out there.

Cory:

I'm going to use a phrase that is probably going to immediately make you vomit, but like... it'd probably make everybody vomit... but it feels like there's almost like a cultural synergy to like the way you all are approaching your IaC and Ops, right?

Like, I'm a person, I love Terraform and OpenTofu, obviously. I work in a space where I typically touch a lot of clouds and so it's hard to lean into an AWS CDK because here, there, whatever, right?

And one of the things that's always interesting to me is when deciding on an IaC tool... Terraform is such an easy go to, especially if you're multi-cloud... but there's so many companies, it's like, "You're not locked in, but you're not fucking going anywhere." You know what I mean?

It's like with you guys, you all decided you want to go with serverless, you want active-active, you want multi-region, you're probably not going to just be like, "Hey, we should move to Azure today." And so it doesn't matter.

I guess an OPS person is like, "Well, we don't want to be locked in, so we can't choose the cloud's tool because then we won't have that ability to leave if we need to." But like, you know, as a person that loves Terraform and OpenTofu, like, CDK sounds like such a good fit.

AWS CDK sounds like such a good fit for a team who's just 100% in on AWS.

I feel like this is one of those places where you could have made a decision as an Ops team that could have just crippled the Development team across the board and said, "Hey, you know what? Because in the future we might do a Google thing or we might do an Azure thing. We're going to use Terraform." And like now all of a sudden those engineers are just ground to a halt, right? As far as like their agency and being able to control their infrastructure.

One of the things that I think the AWS CDK did such a good job of that Terraform has just fumbled for a decade is these constructs. Like this is the thing that is so hard about the cloud.

Like if you look at almost any control plane, especially on the serverful side, there's the operational plane and then there's like the developer plane and they're mixed in the API call, right? Like you look at postgres and it's like, "Hey, tell us about all of your availability and all the stuff that your Ops team cares about and what the developer cares about in one API call." Right? And like the thing that's hard about that is like I need that abstraction as a developer. Like to me I'm like, "I want Postgres and there's this operational plane I don't care about."

And the AWS CDK constructs have done such a good job of like, "These are base layers that you can kind of like layer on top of each other to get to the abstraction that you need." And Terraform is like, "We solve this by like wrapping a public module around like every single cloud service." And it's just like... it's just kind of like this pass through layer that never meant anything.

I feel like that is one of those places where there's just so much more potential in self-service for developers because the ecosystem did such a good job of making abstractions and building blocks a core part of the tooling, versus this thing that they're thinking about a decade later.

Justin:

Yeah, and that's always kind of been the point. It's something where it allows our developers to not have to understand Terraform. They don't have to understand Terraform modules, anything like that. They just, they understand TypeScript because that's what they're writing their backends in for the most part.

It's like, they can read the infrastructure code, they can debug it. They can, if they're running like a synth... which is what transforms CDK TypeScript into CloudFormation templates... they run the synth and it breaks, they can debug their TypeScript code. They don't have to rely on us.

Most of the time, at previous companies, it's like, "Hey, I tried to compile my IaC code, it broke. What's wrong?" And they don't even try to debug it because that doesn't belong to them.

Cory:

It doesn't feel owned by them. It's almost alien to them. So it's like there's an immediate hindrance to solving the problem yourself. I have to now step into these operator shoes for a moment.

Justin:

Yeah. That's what we tell teams, "You own it, it's in your service code. It literally lives in your services code, in your GitHub repo. It's in there, you own it."

And there are times, of course, there's a bug or something in one of our abstractions - that comes to us, we fix it, whatever makes sense.

Cory:

Yeah.

Justin:

But for the most part, it's in your code. You just implemented something wrong or some kind of misconfiguration. We figure it out, fix it in the service, it's good, but it really reduces the burden on us... on our team... on the platform team, to be able to kind of iterate on other things. So that's really helped.

And again, yeah, we could have gone with Terraform. We could be like, "Hey, maybe in five years we're going to move to GCP." But are we? Are we going to go from this giant multi-region, multi-account set up in AWS to that? It's like, there's going to be... this isn't going to be the hurdle for us. Like, there's going to be other things that are the hurdles, that are the real hurdles.

We know what needs to be there, it's just that that translation... which I'm going to say, AI can do a lot of the translation these days, in that regard. So if I needed to take the deployed CloudFormation templates, I could probably backport those to Terraform modules at a decent pace compared to 10 years ago, now. So that's where it's like, we could do this if we needed to, but are we going to?

Cory:

Yeah, I don't know. Like, lock in is always so funny to me like that. Like, the way people think about it, talk about it, it's like, "Oh, what if we have to leave AWS?" It's like, "I don't know, what if you have to talk to your CTO and your CIO about a massive shift in the business that is probably like a fraction of a percent of a chance of happening?" Right? It's just like, it's very rare that you have like this need to move from a cloud.

Like, sure, there are moments where... people in the EU right now are like, "The fuck you talking about? Like, we're moving everything to Hetzner right now."... Yeah, there are moments where you do need to move clouds, but, like... you know, given that the world's falling apart, like, it's usually... like most of the time people move from... I did consulting in this space for quite a while. Most of the time people are moving between clouds... a lot of the times it's because the cloud has given them some sort of financial incentive to do so. Right? It's like Google comes along and they're like we'll give you $14 million of cloud credits to move off of Azure to here or whatever. It's a financial decision. It's not, you know, AWS has come to you with a gun and they say, "Hey, your bill's 20 times what it is." And you're like, "Well, okay, I guess we'll go to GCP real quick." Like it is such a tough business decision to do, especially at any company with any sort of compute.

It's one of those things I feel like we over optimize for a lot when we're designing our IaC tooling and it's just like dude, if that day comes that's going to be a huge pain for everybody at the org. Let that be a problem then. Like don't let it hamper your DevX now.

Justin:

Yeah. It's like that balance of over engineering to go over generalize something without any concrete evidence that this is ever going to happen. And just because of how ingrained we are with AWS and everything, it's probably not something that's going to... it would take some pretty substantial increases in AWS cost for us to even consider that. And if anything AWS has gotten cheaper.

Was it last year or the year before they just cut down the DynamoDB cost by 30%? Like our bill just dropped a couple thousand dollars... just for fun.

Cory:

Yeah.

Justin:

I'm like, "Oh, this is great." So we have not even seen like increasing costs over time, which is actually pretty absurd.

Cory:

It was funny. There was a company I worked for a number of years ago where we built one service in Serverless. I wrote an article about it - I'll put it in the show notes for anybody who's curious. It's very old but like coming back to your cost point, like they cut costs all the time and we built this service and it was just this one-off thing.

We were just building real quick for like one customer and then it was just one of those things where it's like three months later like our sales team's like, "Hey, we're going to every... all customers are going to be using it now." We're like, "Yo, this was just like a one-off thing we built real quick for one person." They're like, "Whole entire globe's using it, including like some of the biggest websites on the internet."

And then all of a sudden we moved from this thing where like we didn't think about the cost of it whatsoever to just the API gateway for this one endpoint was $32,000 a month, right? Like it just, it appeared overnight, like it just boom and hit us.

And it's funny because like when I'm writing this article... my whole team had, we had a ton of Kubernetes experience. Ton of Kubernetes experience. We had a ton of Kubernetes clusters already running. And so we rewrote this service from serverless to just a, you know, small app running in Kubernetes and literally the cost went down to like dollars of prepaid compute.

And it was really funny when I wrote this article, people were like, "Ah, see, like you see how expensive serverless is." And I think people kind of miss the point. It is expensive and because we have a team that understands Kubernetes, it's cheap for us to move it. But if you didn't have that team, all of a sudden, a service that cost you $300,000 a year, that's making money and you're not paying for an ops person, it's no longer expensive, right?

And I feel like a lot of times when people look at serverless, it can scale really quickly, it can get expensive really quickly, but a lot of times we don't take into account the full picture, especially in an org like yours where you've done a lot of the work, where you don't have to have DBA admins, network admins, that is cost savings that the business doesn't necessarily see day in, day out. And then your CFO sees the serverless bill and is like, "Oh my god, AWS costs too much."

Host read ad:

Ops teams, you're probably used to doing all the heavy lifting when it comes to infrastructure as code: wrangling root modules, CI/CD scripts, and Terraform, just to keep things moving along. What if your developers could just diagram what they want and you still got all the control and visibility you need?

That's exactly what Massdriver does. Ops teams upload your trusted infrastructure as code modules to our registry. Your developers, they don't have to touch Terraform, build root modules, or even copy a single line of CI/CD scripts. They just diagram their cloud infrastructure. Massdriver pulls the modules and deploys exactly what's on their canvas. The result?

It's still managed as code, but with complete audit trails, rollbacks, preview environments and cost controls. You'll see exactly who's using what, where and what resources they're producing, all without the chaos. Stop doing twice the work.

Start making Infrastructure as Code simpler with Massdriver. Learn more at Massdriver.cloud.

Cory:

Like, how do you guys deal with A, just like your costs in general, like, how are you monitoring them? And B, as far as the org goes and the people outside of engineering, how do they see the cost? Like CFOs. And how do you kind of, you know, account for that cost with the fact that you don't have to have all these, you know, traditional operations roles?

Justin:

So it's interesting that we actually... there's not been a single time in the five years I've been here now that anyone has come to us and said our costs are too high.

Cory:

Oh, that feels good.

Justin:

Yeah. We're not spending anywhere near what we would expect to be spending for the amount of traffic we get. Surprisingly.

Like I was looking back at previous months, our entire Lambda bill for production... this is like billions of requests per month... it's like $6,000 or something like that.

Cory:

Yeah, baby.

Justin:

Ridiculously small. Yeah, it's not even... Lambda is not even in... that's all of our compute... it's not even in our top three spends. Like in AWS, it's not even close, it's like number six or something. I think like KMS might actually be getting up there.

Cory:

Yeah.

Justin:

But it's observability by miles. It's not even close. Observability spend is our number one spend.

We have people, like myself, who are very much... like, I track these costs, I watch. It's like if I see a spike or something, I'm immediately investigating. Even though it's like, "Oh, I could save like $200 a month, who cares?" But you find these places to optimize. It just helps everyone over the next several months.

But yeah, I mean for us the spend has been tremendously low because the Lambdas we're running... I mean everything we're running is very efficient, fast. We don't have many synchronous workloads; everything's an event-based architecture.

So our Lambdas are running in milliseconds. Like a lot of our heavy... there are a couple endpoints where we get the most traffic by a mile, and those are running in sub-100 milliseconds.

We're using really small memory footprints on our Lambdas. So everything is just... because we split everything up so much that everything's just running so fast and we don't need much memory, we don't need much time for compute, so we're just able to run everything. It just feels so cheap and easy because everything's just so split.

Cory:

I know from like service to service within an org you can have different architectures and whatnot. But would you say that you're tending to lean... like when you're doing your stuff in Lambdas... are you doing like, gosh, I forget the word, like the nanoservice architecture, where it's like each Lambda is literally like a function that's processing an event? Or are you doing the bigger Lambdas where it's like, "Hey, an entire Express app is inside this Lambda and it's going to get routed to some controller inside that Express app."

Like what approach do you guys tend to take?

Justin:

So pretty much it's one Lambda, essentially one Lambda per endpoint or one Lambda per event. So we have API Gateway and then each method. So your GET, your POST for... let's just call it like a contracts or something. A GET contracts, or a POST contracts, or PUT contracts, whatever it may be, each one of those is going to have its own Lambda that does the logic for handling that and gets attached to API Gateway.
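[Editor's note: a minimal sketch of the one-Lambda-per-method pattern Justin describes, written in Python for illustration - Extend's actual stack uses CDK and Node, and the "contracts" resource, field names, and in-memory store here are hypothetical, just the example from the conversation.]

```python
import json

# Each API Gateway method gets its own small handler function, rather
# than routing a whole Express-style app through one Lambda.
_CONTRACTS = {"c-1": {"id": "c-1", "status": "active"}}  # stand-in data store

def get_contract_handler(event, context=None):
    """Handler wired to GET /contracts/{id} on API Gateway (proxy event shape)."""
    contract = _CONTRACTS.get(event["pathParameters"]["id"])
    if contract is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(contract)}

def post_contract_handler(event, context=None):
    """A separate, independently deployed handler wired to POST /contracts."""
    contract = json.loads(event["body"])
    _CONTRACTS[contract["id"]] = contract
    return {"statusCode": 201, "body": json.dumps(contract)}
```

Because each handler owns exactly one method, each function stays tiny, cold-starts fast, and can be sized independently.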

Cory:

Kind of like right in the middle. It's like the events that are scoped to like this particular resource type, right?

Justin:

Yeah.

Cory:

So it's easy to reason about as the engineer, like you've got it all in one Lambda. That's one of the things I see people... like the nano stuff, like you can get really fast Lambdas, but then all of a sudden like your create, your read, your update, your delete, it's like all spread across like five Lambdas and it's so hard to reason about.

That seems like a really good balance there.

Justin:

Yeah, I would say we're more in the middle. I don't want developers to necessarily have to write a thousand Lambdas for a service either. Yeah, I'd say right in the middle. Like we have queues and then each queue has its own like processor Lambda or things like that.

So if you're writing to a queue, you may have one queue and it's like, "Hey, I'm looking for this event." Or if you're processing like EventBridge events, something like that, if you're looking for a certain event like, "Hey, contract sold" or something like that, you have one Lambda that's going to process those "contract sold" events.

And whatever it's doing behind the scenes, whether it's updating DynamoDB or whatever it is, you're just listening for those events. So that really keeps... a lot of our architecture is just done asynchronously. We don't have a lot of synchronous, "We're waiting for this, waiting for this, waiting for this."
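[Editor's note: a sketch of the one-processor-per-event-type pattern described above, in Python for brevity. In practice this would be an EventBridge rule targeting a Lambda; the "contract.sold" detail-type and event shape are made up for illustration.]

```python
def handle_contract_sold(event, table):
    """Processor Lambda for 'contract.sold' events; updates a record store."""
    detail = event["detail"]
    table[detail["contractId"]] = {"status": "sold", "soldAt": detail["soldAt"]}
    return {"processed": True}

def dispatch(event, table):
    """Route an EventBridge-style event to the one processor listening for it."""
    handlers = {"contract.sold": handle_contract_sold}
    handler = handlers.get(event["detail-type"])
    if handler is None:
        return {"processed": False}  # no listener registered for this event type
    return handler(event, table)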

It's also just kind of the nature of our business. But I think that also kind of helps as we're able to... we don't have insanely complex workflows where we're traveling through 10 services to do something. Which a lot of that just kind of comes down to how developer teams are actually architecting their services.

But yeah, as far as spend, I mean I'm looking at right now - CloudWatch Events is above our Lambda spend.

Cory:

CloudWatch is the money maker.

Justin:

Yeah, we don't even use CloudWatch. We have CloudWatch logs turned off and it's still above Lambda by like 200%.

Cory:

Yeah, I remember that. Oh my gosh, I remember, that Lambda I was talking about earlier, we had a thing where we could turn CloudWatch logs on and off in it. When we turned it on, it was like $600 a day in logs. We had an environment variable to control whether or not we logged, so we were just like, "We'll only turn on logging when we really need to look at stuff." It was so, so expensive.
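[Editor's note: the cost-control trick Cory describes boils down to a few lines. A hedged sketch, assuming an environment variable named LOG_ENABLED - the actual variable name wasn't given in the conversation.]

```python
import os

def log(message):
    """Emit a log line only when LOG_ENABLED=true; returns whether it logged."""
    if os.environ.get("LOG_ENABLED", "false").lower() == "true":
        print(message)  # in Lambda, stdout is what lands in CloudWatch Logs
        return True
    return False  # logging suppressed to keep the CloudWatch bill down
```

Flipping one environment variable on the function toggles the log spend without a redeploy.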

It is so funny how expensive observability has gotten - like Datadog, the $60 billion company.

But like, yeah, I talk to people about this like pretty frequently. Like we don't, on our product, like, we don't do any logging internally. Like, we only do OTel and people are like, "Oh, isn't that expensive?" It's like, "Well, it is if you also do logging." Right? Like, observability has just gotten, it seems like it has gotten very, very expensive.

I feel like so many companies... it's like we have to store all of our logs and all of our metrics and all of our traces. And it's just three things stored three ways in three different places with three different bills. And I feel like that is such a place where people are still struggling. We just had logging, we just had the three pillars for such a long time, and we've built so much shit on top of that shit, and somebody's clutching some dashboard like it's so important - it is so hard to move away from it.

It's so hard to get to a better observability world because we have these like sunk costs that we're still stuck to. But it is just so funny that you're like the most expensive part is the observability. Oh man.

Justin:

Yeah, that's by far the most expensive thing that we... as far as like our platform. It's observability just because I mean you're paying third party providers, you're paying for log storage, metric storage, traces, everything.

Cory:

I feel like this is a problem that like it really does exist for so many orgs and it's something that it's going to get way worse. Like, with AI - if this whole thing doesn't implode - like the observability story, I feel, becomes a lot worse because now computers are writing code that we have to understand in Prod and we're probably going to need a lot more observability to do it.

Like it is going to be a pretty interesting problem for, I mean, many orgs to kind of figure out. Like, how do you deal with rising observability costs in a world where the only place that we really have to understand our code is in a PR and then in production? It's not like you're getting intimate with how it works by writing it and seeing your code base - computers are just kind of barfing it out for us now.

Justin:

Yeah, yeah. And we have agentic workloads running in Production for things and we want to see the traces, like what's coming from the agent. We want to see all these traces, all this stuff coming from the agent and there's a lot of data going in and coming out. I mean just pure like context windows, everything. It's a lot. So yeah, a little too early to tell but that is going to be a worry about agents in production, is how much data we need to be able to see.

Cory:

Even in the OTel world. I feel like, you know there's definitely like a blog here or there about like best practices around like what type of information you should be tracing and what type of attributes and events you should be attaching. But like, it's also something that we very much... It's a wheel that gets reinvented over and over and over and over and over again in every org and then probably in every app. And like, people adhere to it, they don't adhere to it.

But in the world of what you're saying, like agentic workloads, there's so much context in what's happening behind the scenes that we just have no idea about. It's like, how do you even decide what attributes are important? And you know, 80,000 tokens that were just processed.

There's a bunch of decisions that happen in there that are going to have an impact on our production system. And what was it? I don't know.

Justin:

Yeah, so we'll see. Observability with just agentic workloads will be an interesting problem that is really... we're just kind of on the cusp of that beginning as we kind of expand and expand more into it and lean into it.

It's something that we'll see what that actually looks like in the future. But yeah, even without that, it's been a primary spend for a while.

Cory:

How are you guys starting to use AI internally? Are you using it... like, so you're building AI facing features, or are you mostly using it for like your actual development? Like a little bit of both?

Justin:

Yeah. So pretty much every developer... we lean pretty heavy into it. Just to kind of be on this, like, cutting edge, "Let's see what we can get out of it." But also not blindly. We get both ends. There's some of us that'll take turns and play devil's advocate, like, "What if...?" And then we have other people who are like, "It's new, it looks awesome, let's use it!" And that's fine. We want both - we want people pushing the boundaries and we want people also saying, "Wait, slow down, let's explore, let's see what's going on." So we get a little bit of both.

So every developer has access to AI. I think everyone uses some level of local AI tooling, whether that is like Cursor, any of these different things. We've got a lot of Claude, a lot of Cursor, things like that. So everyone's using something. Different developers to different extents.

We have a lot of people that are really, really involved in the space as well, who kind of help drive new initiatives, stuff like that.

We use a lot of AI on code reviews. There's lots of third party tooling that we plug in to code reviews. Just a general overview of the PR, like, "Hey, you missed this", something like that, that our local agents just aren't going to catch.

Cory:

Yeah.

Justin:

Our team ended up building our own agent that we run in Production for deployments, which is really cool. It will help developers if they deploy to one of their environments and it fails.

It always sends a Slack message to the developer team like, "Hey, your deployment failed. Here's some details." But also we run our agent on that and we go and look like, "Hey, let's figure out what the CloudFormation error was." It goes back, looks at the PR, if there was a PR, whatever code was merged, you can see the CDK changes, the CDK diff... All this information it has access to, and it will tell the developer, "Hey, your deployment failed because you tried to add two GSIs at once, but you can't do that in DynamoDB. Here's the code." And it'll give you a recommendation for how to fix that: "So hey, maybe deploy one GSI at a time." Something like that, it'll give you recommendations.

We built this tooling that has access to all this information that can help developers debug failures on deployments, which is really, really helpful. Otherwise developers usually come to us, "Hey, this failed to deploy."

This is kind of the first step at automating ourselves out of that loop. It just immediately gives feedback to the developers telling them what failed, root cause analysis, and then potential fixes for that. So that was a really cool tool. Still something we just continue to expand on, especially with new things coming out like AWS MCP, all these different tools that have been launched recently.

Cory:

Yeah, it seems like you guys have... I'd be curious what your take is on this... it seems like you guys have a pretty good culture of like playing devil's advocate, like experimenting with new tools, questioning when they should be used.

I have this struggle like day in, day out where it's just like I consider myself like anti-AI, but I also use it every day. It's just like, in my heart I'm anti-AI, but in my wallet and brain I'm like, "This thing's great." I just don't understand necessarily where we go from here.

We were talking internally about what our own development workflows look like. We're to the point where we had a big enterprise feature that was like... we're looking at it, it's going to take us a month or two to build... we've been leaning into AI pretty heavily, we've done a ton of code with AI over the past six months... and we've fucking one-shotted this feature that we thought was going to take us a month or two. Like straight one-shot it. And we're like, "Okay, I'm reviewing a PR now." Right? Everybody's reviewing a PR. Like we were going to do that PR review work anyway, it was just going to be two months from now.

So obviously in my wallet and brain I'm like, "Well, I want the work product faster and the review time's still the same." But as we get to this AI code generation that does get better and better and we get to a point where maybe we don't have to code anymore - maybe that happens - What does that mean for our teams? What does that mean for our development culture and our processes?

Are you all starting to think through how this changes your work or is that just still so in the horizon you're like, I'm not worried about that yet?

Justin:

Yeah, I think there's been a lot of internal discussions about it and lots of debate. Just kind of fun back and forth. We have some different AI channels in some of our communications and we talk about it and people kind of have different takes. But yeah, I mean everyone I think recognizes it really does kind of speed you up.

Like a good developer with AI, it's just... you're faster. I don't have to type. I know lots of good developers who are just literally slow typers and they don't have to do that anymore. They can just tab through it and it kind of gets that. Even just that alone has helped some developers.

I don't necessarily think it helps everybody to that level. I think you still need to have a system. You need to be system aware. You need to understand what you're trying to build.

The big problem I have with AI is it doesn't say no. I spend as much time as I can getting rid of code. I want to get rid of stuff. I want my PRs to be more red. I want to say no to things because we don't need to build everything. It's just more to run, to maintain.

Cory:

It's so funny that you said that because when we built this feature... so it was like our SCIM integration... literally the integration just "boop", it came into existence and I was like, "Oh, well, we need to do a full end to end testing of this." We don't use SCIM and Active Directory and all this stuff. And so all of a sudden I'm like, "Oh, I have to go set up an Active Directory, users and whatnot, to test this thing actually works."

It was funny. The feature - AI just nailed it. And I'm a DBA, that was what I did originally. The only critique I had looking at the code, for a code base that I'm intimately familiar with, is I would have named the table different. That was my feedback. I'm like, "This thing nailed it." I was completely surprised.

It took six hours. I was dead set... I was like, "I'm going to prompt it to make the stuff in Azure as well, to do the AD SSO SCIM testing." It took six hours to prompt to just get it to generate the Terraform. And it eventually got to the point where it's like, "You actually can't do this in Terraform. You have to use the CLI or do it in the UI." I'm like, "Why didn't you say that six hours ago?" It was just like, "Just tell me I'm an idiot. Please tell me when I'm an idiot to save us time here."

Justin:

Yeah, that's been the major issue with it. It's just like, "Just say no or tell me don't." It tries to just keep going in circles until you get it. And sometimes that's just because I don't understand, like, I don't know the limitation of what I'm trying to do.

Cory:

It might be my conspiracy theory brain, but everybody's like, "Well, the reason why is because it's sycophantic and it wants you to like it so you use it more." I think the reason they do it is just because it's more tokens processed and it's more money. It's just like, "Dude, whatever they ask for, just do it, and if it's wrong, they'll ask you again, and it'll cost them more money."

Justin:

Like, yeah, that's where I really lean into more devil's advocate of AI. The thing that worries me the most is these companies like OpenAI are subsidizing us. Everything is subsidized. They're literally burning piles of cash and at some point they have to 20x, 30x these costs that we're paying to just break even. And they don't want to just break even. They're eventually going to have to make money.

I mean Amazon put it off for decades, but these companies that burn billions of dollars don't have decades. They have to win. Someone's going to win. One or two companies are going to win and then they're going to control the costs from there.

So I think we're in a time where companies are giving us AI for free. But that's not going to stay.

Cory:

I said this the other day on LinkedIn, like it is the time to learn it, right?

It's really interesting because like with AI, with the commoditization of intelligence, where it's just like I can just buy it, I can buy thought now, right? Like there's no moats anymore. Like that's the crazy thing.

You've got Extend. I've got my business, Massdriver. It's like, okay, well today I can tell my code base to add a feature. It's a pretty good code base, it's clean. Like we follow so many practices. There's a lot of really good context. We've set AI up to succeed.

There's a lot of code bases that aren't that way, but look where we are today. A year ago I wouldn't have thought I could have one shotted this with anything. A year from now you probably will be able to one shot some pretty good functionality into a shitty code base, right?

And so when we get to a point where SDRs and salespeople are just one shotting software into existence, like the moats become really interesting. Like you're saying, it's like one or two companies are going to own everything. It's like OpenAI can now look at your business and go, "Wow, that's interesting. This company makes a lot of money. Let's just will that functionality into existence in our product."

Now the only moats are like your product quality and your support. Because it's sure as shit not your employees' work output, especially in like the software world, right? So it's going to be real interesting.

That's one of the things that like wears at me. It's like, right now I'm getting free labor - that's the way I see it. I'm getting a ton of free labor and I'm going to use it.

I don't know where this goes. And if you tell me it costs five to ten grand a month, I might pay for it. I don't know that like Macy's is going to go buy $10,000, $5,000 a month licenses for it.

Justin:

Right.

Cory:

You know what I'm saying? They could sell it to me as a small business where I'm like, "I don't want to go hire 20 people." But I just don't know how our world looks.

Justin:

It'll be interesting because, yeah, I don't know what happens if people start 20xing cost there. There's probably a lot of places where like, "Hey, maybe we'll back it off here. Like, let's be more strategic with our use instead of just plugging it in everywhere."

That's the main concern - I don't want to go too far all in to where it's like we have things that they're so heavily relying on AI and then the costs explode.

The idea is that costs are actually supposed to get cheaper over time, but I don't know, I don't think they're going to. I think there's going to be... the need for profit is going to outweigh the cost over time. So we'll see with local models, smaller context models... I think a lot of that's going to be like, "Hey, maybe we don't need to use Claude 8" or whatever it is. Whatever model you're using, you don't need to use the latest. Like you can use something five years old and it's still good enough for this workload. So there's going to be a lot of optimization that has to happen in the next couple years.

Cory:

I don't see Claude or Anthropic launching the $4.99 Claude Code Plan anytime soon. Yeah, it's not getting cheaper, folks. It's definitely going to get more expensive.

Justin:

Yeah, yeah. Like right now AI is paying for itself. We're getting enough use where we don't like... even if it's like $10,000 a month, we're for sure getting our money back there.

Cory:

Yeah, I mean, our output... It feels like we've got a team of 15, 20 engineers and there's three of us.

It's funny because I feel like in many orgs you see like the pull request bottleneck. All engineers sit down cranking out code, and then you're waiting on a PR to get approved. And it's literally just a matter of somebody reading the code, making sure it's good.

We're about to enter a world where that PR bottleneck is going to get exacerbated really quickly, especially if teams aren't on the same pages about their workflows. Right?

Let's say you've got a team of 20 engineers and two or three of them start leaning into AI super hard. Let's say you have a code base that is producing great output. Now all of a sudden, two or three engineers are pushing out 20, 30 hardcore PRs a week. The rest of your engineering team's going to drown, just under reviewing the PRs, right?

And now we get AI that can review the PRs. It's like, "Okay, so it's writing it and it's reviewing it. Like, how much should I review?" And I still see Copilot catch plenty of great things. I also have my static linting tools that catch a bunch of stuff before I ever commit. But how many code review AIs should I have? And do I let them all run and then I kind of just look over their thoughts or should I look at the code too?

There's so much to figure out in our workflow and I feel like if the engineers on our teams aren't going the same direction, you're going to see just people get burned out from other people's success - which I feel like it's just going to be the worst of worsts.

Justin:

Yeah, yeah.

And like we've had really good success with it finding like... especially like when it's reviewing our PRs. Our tooling reviews the PR, it's catching a lot of really great things. Like, "Oh, yeah, duh, like you can't do that. That makes sense." And you fix it. Or little nitpicky things even. It's like, "Yeah, okay."

But it still is not at the level where it can tell you, "Hey, you did this in this service, and that actually breaks this other service." So that's where there's still this gap of system knowledge. It's still reviewing it in isolation, where it doesn't understand. You really still need those human eyes of someone who has the system knowledge.

A lot of the things that we've seen AI producing, it is producing everything fine, it just doesn't have that system-aware context yet, just because it's not plugged into everything. Eventually it will get there. It could be there in a year, it could be there in six months.

Cory:

And it's crazy too because like there's so much context that's missing. Right?

Again, like good output of these things is good context in. You've got a good test suite, you've got good domain driven design. Like you've set up a path to success. And a lot of teams might not have that maturity. But beyond that, the rest of the context is the rest of my repos. It is what is actually deployed in Prod. How is all this constellation coming together? And that's a context that's not there.

This was something we were talking about the other day at work. I've spent a lot of time thinking about repo sizes throughout my career. Like what should be in a repo? What shouldn't? What should be in a microservice? What shouldn't? Like, what is the split-join criteria of our services and repos? And something that I'm now starting to think about is: does LLM generation quality increase when I'm sitting in a monorepo? When I can see like my entire world?

Like I don't love the idea of a monorepo, because I like teams to have... I like kind of the pipelines and life cycle of that repo to dictate like what goes in it, not how a computer reasons about it. But, when I make a workspace in VS code and I put all of my repos in that same workspace... like when it understands how it works, the rest of the world, it gets better. It is interesting. I think repo composition is going to be a factor in this as well.

And then the production side of it is, how do you get it to understand what's happening in actual Prod? Like in your actual environment? How will this change affect something when I have this much traffic versus, you know, a small amount of traffic?

And it just doesn't know that today, but we're not far off from it being able to access it and make those kind of decisions as well.

Justin:

Yeah, that's like one of the things with active-active, multi-region is like this idea of like things have to be idempotent. So an event comes in somewhere, it gets replicated to the other region, you don't want to process it twice. You need like idempotency on your table stream triggers, things like that in DynamoDB. And it doesn't necessarily know all of that... what's happening there.

It doesn't necessarily know all of your architecture, how everything's set up, because that's abstracted out of this code base, that's abstracted into our CDK wrapping package. So it doesn't really know that, based on this code, because it would have to have access to that. It needs to know to link like, "Oh, hey, you're using this abstraction, I need to go look at it. Oh, this is what helps with idempotency." Things like that.
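[Editor's note: a sketch of the idempotency guard Justin describes for active-active replication - before processing a stream record, atomically claim a "processed" marker so the replicated copy of the same event in the other region is skipped. Python for illustration; a plain dict stands in here for DynamoDB's conditional put (attribute_not_exists), and the record shape is hypothetical.]

```python
def claim_event(processed, event_id):
    """Atomically claim an event id; False means it was already handled."""
    if event_id in processed:    # condition check: marker already exists
        return False
    processed[event_id] = True   # conditional put succeeds, we own this event
    return True

def process_stream_record(record, processed, side_effects):
    """Table-stream trigger handler: process each event exactly once."""
    if not claim_event(processed, record["eventID"]):
        return "skipped"         # duplicate delivery from the other region
    side_effects.append(record["payload"])
    return "processed"
```

In real DynamoDB the claim would be a `PutItem` with a `ConditionExpression`, so two regions racing on the same event can't both win.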

So it's really good, it's helping us... it's still got a little bit more to go for me to be worried about my job.

Cory:

Yeah, I mean, honestly, it's one of those things... there was a lot of like fervor over like, are people going to lose jobs? I don't think so. I mean, I was in this exact same spot a year ago, I was like, "Oh, developers are going to be out of work." And now it's like I think their jobs are going to change significantly... it's going to be a lot of code reading, right?

The thing that's hard there is like, that is a much more... closer to staff or principal level role where that's going to be the most successful. Because you're thinking about the grander architecture.

I think one of the things I really worry about there is, how do we do apprenticeship in this world?

Justin:

Yeah.

Cory:

Where you don't come in and write software and learn about like the failure modes of our systems, how are you ever going to be able to review code from that perspective?

And I feel like that's going to be one of those shifts where it's like, I don't think that we're going to have less software engineers. I think we're just going to be doing very different things, but we're not going to get trained on how to assess that.

Like five, ten years out from now. If we hit a world where people aren't writing software in 2027, then in 2035 people aren't going to know how to write software anymore. You know what I mean? Like how do you assess software if you don't know how to write software?

It's going to be... it's like me looking at like the grammar output. I'm like, "Hey, ChatGPT, fix the grammar of this." I'm like, "I have no fucking idea if it's correct. It's better than all the ellipses I use all the time."

It's just like, "Right, go for it, put it in Prod. Let's see what happens."

Justin:

Yeah, that's the only thing I'm worried about with the job market - the junior levels. Like, I don't need someone fresh, but I would still like to have that pipeline, because there's going to be this mega fight. I mean there is already this fight, even in a pretty crappy job market right now, for these senior engineers, for these people who are there.

But I also kind of look at everything as a whole. You hear about the extremes of these companies like Google, everyone who's just so heavily leaning into AI. And then I also have a lot of developer friends who are senior developers at other places, like a Toyota or somewhere else, where they just don't move like that and they're not really using AI that much necessarily. I don't know about Toyota specifically, but some of my friends that work at those types of companies are really kind of like, "Eh, I don't really need AI, I don't want to use it, we don't use it here." So I think the hype is kind of hyper-inflated by these companies... like even us and these other companies who are really heavily using it. Whereas, as a whole, there's a lot of companies that aren't yet, just because they're usually a couple years behind on these new adaptations. So in 2030, when all of these companies are using it, we might start to really see that shift.

But I am worried about junior level roles, just because it's time I have to spend in training when I could just hire the senior, and then there's not that pipeline.

Cory:

I'm going to throw out the worst take.

I honestly like... and I've felt this way for a while and it's a conversation... it's something that I've brought up before and I've definitely pissed people off... but like, our title... we seek the engineer title in what we do, and there's some states and countries where you cannot use it, because we're not engineers, folks, we're software developers.

Like we're not licensed, and that's been good. I've definitely in moments of my career thought maybe we should be licensed. I understand why it's been important that we're not.

But now in a world where we do... I don't want to say we have a fiduciary duty to our businesses or whatnot, but in a role where we're going to be observing and trying to understand what's happening a lot more. Especially when you don't have an apprenticeship path. To me, that might be one of the places where a couple of things can happen. A, we can defend our livelihoods as engineers by requiring that there's some licensure. But also licensed industries have paths through apprenticeship to become licensed and understand things.

Now the problem that I think we're obviously going to have with that... I know people are like, "Fuck you, dude."... there's just too many languages, too many architectures. There's so much stuff to reason about. But I would also just argue, structural engineers, and people that build houses, and architects, and all these other licensed industries have just as much complicated things to think about as we do. And maybe our stuff's not as complicated as building a skyscraper. Who knows? Maybe, maybe not. I don't know, I don't understand what they do, but I know it's just one of those things.

I feel like that is going to be a huge problem - we're going to have a massive knowledge gap. And maybe we get to a place where we just will software into existence and software says that software is fine and it merges it. But that seems like a huge risk to a lot of businesses that are risk averse. Right?

Justin:

Yeah. You have a lot of financial companies that just really can't necessarily do that. Just regulation... I think there's going to be a batch of regulations coming at some point too.

Cory:

And it comes back to the old DevOps conundrum, right? Dev's goal is to ship change in functionality to make money. Our job is to make a system stable. And when computers are writing the software and reviewing the software, how do we keep the system stable? That's going to be fun. I think we're going to have jobs still.

Justin, I know we're over on time. I'm sorry if you've got to bounce to another interview, or not interview, but a meeting or anything work related. I do appreciate you coming on the show today. It was fun to get like a real live LinkedIn sesh going.

Justin:

Yeah, yeah, yeah. It was a good time.

Cory:

Awesome. Where can people find you on social media?

Justin:

More or less, for anything work related, it's always LinkedIn. That's pretty much the only place I do anything work related, that and at work. All my other content is powerlifting related.

Cory:

Yeah, there you go. I don't do that, obviously.

Cool. Is Extend hiring? You guys got any Ops/platform roles open right now?

Justin:

Not for DevOps, or any... I don't know if any of our platform teams do. We almost always have some sort of developer role open. We have lots of different other roles too. We do a lot of ML, and we have some ML roles open. I'm sure there's other teams hiring. Our platform team has been pretty stable.

Luckily we've been very stable over the last couple years and with AI, we haven't had to expand. We've been able to maintain and increase capacity without hiring. So we've been pretty good there.

Cory:

Heck yeah. That's the dream. Awesome. Well, thanks so much for coming on the show and I really appreciate you sticking on for an extra couple of minutes.

Justin:

Yeah, thanks for having me.

Cory:

Yeah. Catch you all next time.


About the Podcast

Platform Engineering Podcast
The Platform Engineering Podcast is a show about the real work of building and running internal platforms — hosted by Cory O’Daniel, longtime infrastructure and software engineer, and CEO/cofounder of Massdriver.

Each episode features candid conversations with the engineers, leads, and builders shaping platform engineering today. Topics range from org structure and team ownership to infrastructure design, developer experience, and the tradeoffs behind every “it depends.”

Cory brings two decades of experience building platforms — and now spends his time thinking about how teams scale infrastructure without creating bottlenecks or burning out ops. This podcast isn’t about trends. It’s about how platform engineering actually works inside real companies.

Whether you're deep into Terraform/OpenTofu modules, building golden paths, or just trying to keep your platform from becoming a dumpster fire — you’ll probably find something useful here.