Guest Host: Kelsey Hightower — Why IaC Alone Isn’t Enough

Ever wonder why strong Terraform modules still lead to long review queues and fragile pipelines? From hand-built scripts and early data center migrations to cloud sprawl and Kubernetes, configuration management has changed a lot - but the core struggle remains: too many decisions, not enough guardrails. Guest host Kelsey Hightower sits down with Cory O’Daniel to unpack where Infrastructure as Code succeeds and where teams get stuck.

What you’ll learn:

How to avoid “choice overload” in cloud configs by moving decisions upstream
Practical ways to pair IaC with UX, policies, and SLAs to reduce toil
When click-ops is a symptom, not the problem - and how to replace it safely
Patterns for scaling platform practices beyond a handful of experts
A simple mental model for mapping workflows across serverless, containers, and VMs

Guest Host: Kelsey Hightower

Kelsey has worn every hat possible throughout his career in tech and enjoys leadership roles focused on making things happen and shipping software. Prior to his retirement, he was a Distinguished Engineer at Google, where he worked on Google Cloud Platform. He is a strong open source advocate with a focus on building great software as well as great communities around them. He is also an accomplished author and keynote speaker with a knack for demystifying complex topics, doing live demos and enabling others to succeed. When he is not writing code, you can catch him giving technical workshops covering everything from programming to system administration.

Guest: Cory O'Daniel, CEO and Co-Founder of Massdriver and Co-Founder of OpenTofu

Cory has been a software architect and engineer for 20 years, leading up to the founding of Massdriver. He's also a husband and the father of two kids.

Cory O'Daniel, X

Cory O'Daniel, Medium

Transcript

Kelsey: 00:00:13

All right, welcome back to the Platform Engineering Podcast. As you can hear, there is a new, improved host. We constantly are looking to upgrade. And so I'm going to be your host today.

I'm Kelsey Hightower, and we're going to do something special today. We're going to be interviewing Cory O'Daniel, the normal host for this podcast, and we're going to deep dive into configuration management. Maybe a look over the time period. A lot of people may have started with configuration management when I did, maybe 20 years ago at the dawn of DevOps.

Lots have changed since then, and so we're going to do a deep dive over that with Cory, and we're going to go a little longer than normal, so stick around and hopefully we're going to learn a lot today and learn a little bit more about Cory outside of being the host for this podcast.

With that, welcome, Cory.

Cory: 00:01:00

Hey, thanks for having me on the show.

Kelsey: 00:01:02

How does it feel to be on the other side?

Cory: 00:01:04

I have this awesome condition where I don't experience nervousness, and I think I might be experiencing it for the first time. My hand's a little sweaty, which is weird.

Kelsey: 00:01:10

I would imagine it feels like... I remember the first time my daughter started driving and I got in the backseat of my own car and I'm like, "I've never been back here before."

This is weird. This is kind of nice.

Cory: 00:01:25

Yeah.

Kelsey: 00:01:25

And so I could imagine it feels like that a little bit.

Cory: 00:01:27

Oh, yeah. It's also, I'm not in my normal spot. I'm in my house today instead of my office because it's like 95 and the AC is loud. So I figured I'd come inside. So I'm out of my element. I'm surrounded by plants, which is weird.

Kelsey: 00:01:39

I've been enjoying the Platform Engineering Podcast. If you haven't listened to any of the episodes, you've had some legendary guests. I think a lot of people who have made a big impact in the space.

And so even though it's platform engineering, a lot of people may know this discipline as system administration. A lot of people may know this discipline as DevOps. I think the new term that kind of brings it all together is platform engineering.

And I remember meeting you all and it's like, "Look, configuration management isn't it. Putting a bunch of things in a YAML file, HCL file can't be the end game. There's something missing here." And so maybe give people a little bit about your background. What makes you qualified to say this can't be it?

Cory: 00:02:22

Yeah, so I've been in the space for a while. I was actually one of the first data centers to like fully migrate off to AWS when they first launched EC2. We had a big data center in El Segundo.

So I've been in the space for a while. And my background is operations, database administration. I tinkered with software since maybe like 14 or so, learning to program.

And when I had this opportunity to do this migration, one of the first things I kind of was, I guess, weirded out about like migrating to the cloud was that like they're not real. And I was just like, "This is software, like there should be a way to automate it." And so I actually built a configuration management tool very early on, like before GitHub. I remember importing it into GitHub when I got my GitHub account, like that's how long ago it was.

I was like, there has to be a way to like automate this. Like we already are trying to do all this automation stuff at our job around Bash and deploying to our servers. I was like, I feel like this is a really great way to like further extend automation.

So been in the space for a while, worked for a lot of startups and then I started doing consulting. And so when I was in startups, I was seeing this struggle as DevOps was starting to become cloud management. You got to remember, like, cloud wasn't there. Like the cloud was different when we first coined the word DevOps, we ran our software on the cloud, right? Like we laid down these things and we put our software on it. And now we lay down these things, we put our software on it and then our software is also interacting with it, right?

And so kind of the landscape changed a bit on us and I was seeing these startups struggle. Like, "How do we manage this stuff?" Like, "I don't quite know all of the details and ins and outs of IAM and VPC security and all this stuff that... I just need to like access a queue or I need to access a database." And it's like developers... all this stuff was kind of put on their plate.

I was like, "That's a startup problem, it's not a big deal." And then I started doing consulting, and I started doing some large scale consulting with Google Cloud, and I started seeing the same problems at massive organizations. It's just like there was so much and it was so difficult and these companies were just getting hung up and stalled on like scaling IaC adoption, like scaling knowledge, scaling access to cloud resources.

And that was kind of when I got the first idea for Massdriver. I was like, "The problem in our space isn't that Terraform's hard or Helm's hard or Kubernetes is hard, or even that the syntaxes are hard. It's that you can't expect everyone in your organization to know everything about everything."

I gotta understand costs, our reserved instances, I gotta understand our compliance and our security. There's a lot of knowledge to provisioning something in the cloud besides just picking some attributes. And I feel like that's where people really kind of got hung up on Infrastructure as Code adoption - it's too hard to scale it across the Org.

So, yeah, that's just a little bit about my background.

Kelsey: 00:05:12

I think there's an important part that I want to go back to - your time as a system administrator. A lot of people that may have started their career in the last 10 years, they don't have that part.

They don't have the, "What OS are we going to run? Are we going to PXE boot? Static IPs vs DHCP? Where should I put the MAC address?" And I think there was a time early on where, once you kind of got to production, it stayed like that for a while. Right? If you bought a database, that's the database for at least the next five to 10 years.

And when I thought about automation during that time period, it's more like, "All right, this is it. Now let's automate it because it's probably not going to change." So the investment of writing a KornShell script, a Bash script, had a huge return because you would use those scripts for a very long time.

And I think the game is a little bit different now. So let's go back to that time period - What was your environment looking like? Were you kind of... was this like the VMware stage? Were we already at VMs? Were you still booting bare metal? And then maybe how did you manage those systems if it's like pre config management time.

Cory: 00:06:17

Yeah. So I mean, when I first got into it, it was AIX, it was systems. So that was like my very corporate background - banks, healthcare.

And I had a very weird encounter with somebody in a bar in 2005... this is like a whole side story... but I wore a suit to work, that was how you did things in Florida. It was nice, you know, a little tie. It was very professional at the bank. And I saw somebody wearing board shorts and a T-shirt in a bar in like 2005 with their laptop. And I was like, "Why do you have a laptop in a bar?" He's like, "I work for a startup in Los Angeles." And I was like, "That sounds way cooler than what I do." And so I just moved. And so like, you know, I feel like that move was where my life changed.

I wrote software, I didn't build products. I wrote software to kind of automate the things I was doing in the data center - interact with things, scan things. I wasn't writing a product. And so the shift in my job was also where I first started working on products and like user facing functionality. And you know, then it was pretty much just SCP-ing files onto a server someplace. I mean it wasn't even as sophisticated as VMs. It was like we have these machines and I don't know what the Ops team's doing over there. We just run this script and it just copies a bunch of stuff over. This is also like pre Git, like SVN days.

As the developer on that team, it was an extreme black box. I had no idea what happened on the other side. And what was scary for a lot of Ops teams in those days... if any of you are still around and listening... it was scarier receiving a release from developers. The people were just SCP-ing files around. You had no idea what was going to end up on this machine. And so you didn't have this level of reproducibility that we have today. It was really... from company to company, it really, really seemed like people just did things their way. Like we didn't have great ideas of standards or even like approaches at the time.

Kelsey: 00:08:06

You know what I would probably say there is... during the best times, that's a really good interface. Think about it. If you copy your stuff to this server, it will be deployed. So do what you all have to do, make your decisions, cut your tags, bundle it into a release, but when you SCP these bits to these servers it's go time.

And if you think about a contract, that's a pretty decent contract. I'm not saying it's the safest contract, but if someone told you, "How do I get something to production?" Well, "If you put files here it's live."

And I think for those that don't know what AIX is, you have these highly curated platforms with the software, the OS and the hardware kind of designed together. And so if you want to get something done, for a lot of people, IBM was your platform. It was your platform engineer, they were the platform engineers, they gave you mainframe, P series, and then AIX ties everything together. And for a lot of people, if you learn those tools, you could actually do a lot the IBM way.

And I think, based on your transition from suits to T-shirt and shorts, the world changed a little bit. We went from this rigor to, "You know what, let's relax a little bit, let people bring in new technologies, try new libraries." And I think that creates a world of mayhem that I think you're starting to describe, which is, "I have no idea what you're SCP-ing now." We went from "I'm going to be sure it's IBM compatible" to "Hey, we want some Linux servers now and we want to copy Java and Python to those destinations."

Cory: 00:09:47

Yeah.

Kelsey: 00:09:47

And so in your opinion, you know, we get to this new era, it's kind of wild, wild west. People are starting to define the roles a little tight. There are some people crossing the bridge. DevOps is born. What are you doing, wearing T-shirt and shorts, at this moment?

Cory: 00:10:03

Yeah, so as DevOps is kind of being born... So it's interesting, like when did that happen? I'm not sure if that was like around being in the DevOps days or if that was like the Flickr video. Do you remember the Flickr video? The like 8-minute, "We're deploying 15 times a day" and everybody's like, "Wow, how are they deploying that many times a day?"

Kelsey: 00:10:20

Yeah, I would say the time period is 2010 for me. 2010-ish. People starting to give a name to this practice. So I would say around that timeline is when I think people are starting to bring that lexicon into the workplace.

Cory: 00:10:35

Yeah. So 2006-ish. I'm working at a very bizarre startup. It's a startup that could only exist in a non phone world. It's 2005, 2006. It was a company called Ripple TV. And the whole idea was like people would wait in lines and they didn't have anything to do because you didn't have a phone to stare at. And so you'd be at Starbucks or you know, Peet's Coffee or Robeks or maybe you're at the gym, and it was just pretty much like all the noise that you have on an iPhone today, but on a huge screen TV. And so we had a data center, but we also had 3,000 remote servers that essentially powered all these screens.

Each one was like a full on Dell server attached to one of these screens. And so deploying software for us was interesting. So, like, a part of our deployment... there was putting stuff in the data center, but then there was also putting content and some of these local stream aggregators, they could pull stuff from just movie times and whatnot. And that was interesting because, like, a part of our deployment pipeline was BitTorrent. Because we'd have, like, at that time, megabytes of video that we had to distribute around 3,000 servers around the US. Megabytes - could you all imagine, like, moving megabytes of data? But at the time... like, megabytes of data in 2001 - that's a lot of data, to like push around 3,000 servers, like video streams and whatnot.

It was an interesting deployment pipeline because it was stuff you had to put in the data center, but then there was stuff you had to distribute across this, like, huge network. You could almost consider it kind of like CDN-ish, like, it kind of had those vibes.

And then shortly after that, I went on to co-found a company that, if I would have founded it five years ago, I'd be a billionaire right now. It was this company called Vokle, V O K L E. It was a very early, like... say influencer platform. So it was video chatting. I mean, it's pretty much Zoom, but in 2007, 2008, if you could imagine that. And it was pretty much for, like, people to broadcast and, you know, have shows on things like YouTube, but have, like, interactive audiences and whatnot. And now we're actually doing real video streaming. This required also a lot of servers, but this was still very early, very early AWS days. Like, we'd be able to stand up a couple of EC2 instances.

And that's when I started leaning more into this automation and trying to figure out how to do auto scaling. Because we would have, you know, two people on the platform, and then all of a sudden, the Mythbusters used it one day, and there was 5,000 people trying to log in and watch video streams. And we're like, "We can't afford any of this."

I mean, it was the wild, wild west until about like 2010, it felt like.

Kelsey: 00:13:10

I think you brilliantly described where the pressure came from to go from shell scripting to thinking about tools like Puppet, Chef and Ansible. Like, why did we bring in these additional tools with all this complexity? Why were we pushing so hard for abstractions and automation?

Once you get into more than a handful of servers, once you get into multiple networks or distributed systems in general, everything you had that worked well in the mainframe, a couple of physical server world, it just doesn't work anymore. There's too many servers to keep in your head, there's too many servers to keep on a spreadsheet.

And now it's like, okay, we gotta get a whole new paradigm. And I think this is when we all kind of forced ourselves that being a system administrator just wasn't enough. You have to understand things like BitTorrent so you can use them instead of just rsync. You start to have to be aware of what was available in the software realm to complement your realm.

All right, so you get it. DevOps, you respond to the pressure. You start building tools, you start building companies, you're starting to take advantage of this entire landscape. And then I'm pretty sure you were observing, just like everyone else, there's this bridge that gets us to today, which is config management goes through the roof - SaltStack, Puppet, Chef, Ansible, Python versus Ruby, YAML versus everything else. And then there's something that happens that I think that pauses all of that. And it was maybe cloud to some degree, but when containerization came out, it felt like everyone was like, "Whoa, whoa, whoa, we've got to rethink the role of the server. Maybe its role should shrink, maybe the role of a package manager should shrink. And the thing we're configuring, it needs to move."

And so what are you observing during this transition period going from "We've got to automate a bunch of servers" to people want the servers to be smaller. Some people want to go serverless and just get rid of the OS as the predominant thing we're talking about and move to the app layer.

Host read ad: 00:15:12

Are you tired of the slow, complex and expensive process of spinning up new environments for your developers? What if you can have ephemeral environments for every pull request without duplicating your entire stack? Introducing Signadot.

Signadot's sandboxes are a new approach to ephemeral environments. Instead of duplicating everything, Signadot intelligently routes requests to services under test while sharing the rest of the stack.

This means you can finally scale ephemeral environments for your entire engineering team, no matter the size. With native support for data isolation, you can spin up ephemeral databases when needed.

And for your asynchronous systems, Signadot has you covered with support for Kafka, SQS and more. Stop mocking and start testing.

With Signadot, you can run your end-to-end Playwright tests against sandboxes for every pull request. Ready to revolutionize your developer experience? Visit signadot.com to learn more. That's S I G N A D O T.com

Cory: 00:16:07

I feel like it was a hard era because serverless and containerization was like both happening at the same time, right? And you're hearing AWS say like, "You build it, you run it" and it's like, "What do you mean?"

Like, if somebody makes a Lambda, do they have to understand configuring CIDR ranges? Like, how far do we go from, "You build it, you run it"? Is it "I'm running my app" or is it "I'm running like the whole Shebanga Bang", right? And what I was seeing at the time was teams that are like, "Hey, we have to go 100% serverless. Like, serverless is the future." And people like, "Containerization is the future."

And, you know, I feel like it's one of these moments that we have time and time again in software. It's like something so fundamentally different comes out... and we had two fundamentally different things coming out, like right around the same era... and everybody's looking at it, saying, "Oh, that's the thing that we have to do now." It's become the shiny bauble. And both of them, serverless and containerization, I'd say, offered... I mean, I know there's probably some people out there, like, "Life was simpler before all of this", but like they offered simpler solutions to a bunch of organizations that all of a sudden had to deal with a global community.

You click back to like the day before iPhone. None of that. Containers and serverless did not matter. No one really... maybe a couple of people at like Netflix did because they're trying to get rid of these CD things... but like, for most businesses, you didn't have much traffic. You could have like a single IIS server, like serving something. And then all of a sudden the entire Internet was accessible 24/7 by everybody, right? And like, so scale just popped out of nowhere.

So a lot of these deployments, a lot of the way that we manage stuff, had to fundamentally change because we needed more compute, like we needed more scale. We need to start figuring out how to do auto-scaling. Containerization would offer that eventually and serverless offered that.

And the thing that was really hard I think for a lot of teams was they ended up chasing the shiny bauble where it's like, "That is the way." That is the way - one of these two directions is the way. And the truth of the matter is - what's your workload?

Kelsey: 00:18:08

You know what's crazy, Cory? Some people did that every two years. One year it's all Heroku, next year it's all Cloud Foundry, after that it's Mesos. And then when you step back from that, you have five of these decisions that have been made. Nothing's finished. 10% of your apps went here, 10% went there. The new stuff is like, "We're not going to any of those places." And so now you have a different issue. Now you have five or six meta platforms that all need a bit of glue around them to make them work with your existing stack that you already have. And so now we entered this new era.

I remember, to me, honestly, it was around that time and maybe some VMware time, where there was so many choices that for most people, ClickOps was the only option. We didn't have a configuration management module for everything. So where there was a gap, what did people do? Especially, and I include SaaS as part of the platform, people would just ClickOps - click, click, click, click, click, click. "It's the way we want it. Leave it alone. Oh yeah, Ops, you're responsible for that too. And yeah, we have no configuration management for it." And so now we end up in this place where people blame the ClickOps for creating an extension of the mess.

And then we get into this world now where it's ClickOps versus config management. Do you even allow ClickOps? Do you provide your own self-service experience that mimics ClickOps with guardrails?

And then platforms like Massdriver start showing up that say, "Look, we can't be stuck in the terminal. There are so many things that are involved with managing platforms that maybe just Terraform isn't enough." What gets you to the point where you say, "Yo, we need to build a company around this?"

Cory: 00:19:58

Yeah, I think the thing that's a little interesting is I'm going to say this and some people are going to be like, "That's not true." And it's like, "Think about it a little harder, it's true." Like today almost every startup, even startups, are multi cloud. And what do I mean by that?

I don't mean that they're using multiple hyperscalers. You're not half on, you're not half active in GCP and AWS. But like, "I use Vercel [Okay, great, use Vercel] and I use Neon database." Well, great, you're going across the Internet to two different clouds. Like you are multi cloud, whether you realize it or not. You bought a database from a database as a service vendor. You bought Jamstack from another vendor. Like it's very easy to get into a multi cloud scenario very quickly.

So this has implications. You can only Infrastructure as Code things that have APIs for being interactive with via Infrastructure as Code, right? I can't manage Vercel with Helm, right? Okay, well maybe I can manage Vercel with Terraform or OpenTofu. Oh, you can. But then somebody buys another product, maybe it is a CDN, maybe it's something else, and maybe they don't have an API or maybe they do have an API, but they don't have a Terraform provider. All of a sudden you're clicking again, right? It's so absolutely hard to just... like, "What is the boundary of cloud?"

"What is the boundary of what we need to automate and reproduce?" is hard to reason about in a business, especially like a sufficiently complicated business. And it's very hard when you get to the fringes of, "How do we automate that? How do we reproduce that?"

And it's like, "Well, the vendor doesn't have an API, but we still use it because it's an SFTP service" or something goofy and weird that you're just stuck on because of 20 years of debt, right? And so it's very hard to even get to a hundred percent IaC adoption. Now that's fine. Like having a little ClickOps out there, I think, is okay.

But when we come back to companies like Massdriver, and kind of like our founding principles is like, it's easier. It's always easier to ClickOps something, whether there's Terraform or whether it's in AWS, it's easier to just click it. You see it, it's faster. Tell you what, it's faster - I can get into AWS and make a Kubernetes cluster faster than you can put together the Terraform for it. Like, I know it's not a race, but it's easier to do. Why is that? Well, it's because there's an interface guiding you. Right?

And so like, as we kind of think about like abstractions and like, what does the idea of DevOps mean? Is it a team? Is it two teams working together in collaboration? I think it's a little bit of the latter. Like, how do we make these systems easier for people?

It's so hard to do IaC well, it's almost impossible to do it to 100%. And there's recent surveys that say we're stuck at like 30% adoption, 13% successful adoption.

That's a lot of people ClickOpsing stuff or managing infrastructure some way.

So as we're starting to see these teams that are extremely competent - huge, huge products, massive, massive scale - struggling to adopt something as simple as Terraform, as simple as OpenTofu, like, how is that possible? Are they nimrods? It's like, "No. A, they don't have enough time. B, they're Ops folks, probably designing Terraform and pipelines, they're not necessarily like product developers."

So for a team that is already underwater, trying to figure out how to scale access to the cloud, because now it's not all about like security and the VPC and the security groups, it might just be about like tuning some RAM and a Lambda. Like, how do I extend that? Like that functionality right there that the developer definitely knows? The developer has an idea of how much RAM they need in their Lambda. How do I extend that functionality to them without them having to figure out like, "Hey, this Lambda needs to be in a VPC. And by the way, like, this is how the VPC needs to be able to route traffic." And it's just like... it's too much for them.

And so when you hand them Terraform, you say, "Hey, it's Terraform. It's super easy. Just punch in a bunch of stuff on the right side. I already wrote the module for you." It's like, "Well, you did the easy part. You just wrote some code." Anybody can just type some code. The hard part is the right side of the equal sign. The hard part is knowing the right value.

I can stand up anything in Terraform. Is it right for Production? I have no idea. I could probably bang out a Kubernetes cluster off the cuff in Terraform, but is it right for whoever's listening's production environment? No, I don't know your production environment.

Who knows Prod? Operations.
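To make the "right side of the equal sign" point concrete, here is a minimal Python sketch of the split Cory describes: the developer answers only what they actually know (how much memory the function needs), and the networking values come from Operations. Everything here is hypothetical and for illustration only: `FunctionRequest`, `render_function_config`, the `OPS_OWNED_NETWORKING` placeholder IDs, and the 128 to 10240 MB bounds (chosen to resemble AWS Lambda's memory range, but treat them as an assumption). It does not call any real cloud API and is not how Massdriver or Terraform implement this.

```python
from dataclasses import dataclass

# Hypothetical values that an Operations team would own.
# The IDs are placeholders, not real resources.
OPS_OWNED_NETWORKING = {
    "vpc_id": "vpc-EXAMPLE",
    "subnet_ids": ["subnet-EXAMPLE-a", "subnet-EXAMPLE-b"],
    "security_group_ids": ["sg-EXAMPLE"],
}


@dataclass(frozen=True)
class FunctionRequest:
    """The only inputs the developer is asked to provide."""
    name: str
    memory_mb: int


def render_function_config(request: FunctionRequest) -> dict:
    """Merge the developer's answers with values Operations already decided."""
    # Assumed bounds for illustration; adjust to whatever your platform allows.
    if not 128 <= request.memory_mb <= 10240:
        raise ValueError("memory_mb must be between 128 and 10240")
    return {
        "function_name": request.name,
        "memory_size": request.memory_mb,
        # The developer never sees or types these values.
        "vpc_config": dict(OPS_OWNED_NETWORKING),
    }
```

For example, `render_function_config(FunctionRequest(name="thumbnailer", memory_mb=512))` returns a full configuration even though the developer supplied only two values.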

And so we get in this place where we've seen so many teams like develop this great footprint of Terraform modules. "Hey, these are modules that you can use. They're in this... maybe it's in a central repo. We've made it easy. It's one place you can find them all. Developers, all you have to do now... this is all you have to do, the cloud is yours... the only thing you have to do is find that repo, find the module you want, go back to your repo, add a main.tf file, reference that module, find the workflows to execute that module, find the security scanners to run, you know, Snyk or whatever. That's all you have to do. And then as soon as you know all the values for Prod, you're familiar with our RI constraints, you're familiar with everything else, then you just have to punch all those values in and then you can hit Open PR and then ask me to review it and I'll tell you what you did wrong. It's super easy. You can do it in days, right?"

Kelsey: 00:25:32

I hope the listeners are sensing the sarcasm.

Cory: 00:25:34

Oh yeah, there's some sarcasm.

It feels so simple, I think, a lot of times for Ops people. You're like, "Oh, just punch in the shit here, punch it in, some Terraform." But it's like there's so much context missing on the other side of that, right? And like that's why it feels tedious.

That's why you get a lot of need for like the ServiceNows and approval workflows because you're like, "I can trust them to type some stuff in Terraform, I just can't trust the values they put in it. So I need to approve that, I need to see it go through. You need to ping me."

And so like when we hand that level of self-service to developers, I feel like we're just creating a bottleneck. We're saying, "Hey, open as many tickets via pull requests as possible and I will still review the pull requests and apply them."

And so when we're thinking of Massdriver, like how do we make that simple for the developer? Well, they shouldn't have to be thinking about pipelines... and I've got a whole tear on pipelines, I do not think CI/CD is the right solution for executing Infrastructure as Code. That's a whole other thing right there, there's a blog post coming out about that one. But beyond that, like going back to ClickOps is easier.

Why is ClickOps easier? Why is the AWS console easier than all of this other stuff? It's because you have an interface to guide you. So that's kind of like the founding principle of Massdriver. We don't want to make a no code platform, we think Infrastructure as Code is really important. Going back to that point a minute ago of like trying to get to 100% IaC adoption, 100% of your cloud services reproducible.

The Terraform provider library is the only thing that gives us that functionality. And that's why you see the Pulumis and all the other tools like being built off of it initially. It's because that thing gives you just access to tons of APIs that you can automate through the same interface. That right there is the treasure trove of IaC - those providers. Now, whether it's Terraform or OpenTofu or Pulumi on top of it, like it doesn't matter, somebody has made those abstractions so that we can interact with these APIs.

Okay, great. How do we get those into developers hands? Well, we need to make an interface to guide them, and then we can start to codify into that interface our constraints. "There's only three types of instances you can use here because we have reserved instances and these are the ones that we're allowed to use." Right? "I don't want you to turn off an encryption ever. I don't want it to be a boolean. It's on." Or, "Is it Production? If it's Production, it's on." Like it's not exposed to you.
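The constraints in that last paragraph can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Massdriver's implementation: `DatabaseRequest`, `resolve`, and the three instance type names in `ALLOWED_INSTANCE_TYPES` are hypothetical stand-ins for "the ones we have reservations for". The sketch implements the "it's on" variant of the encryption rule; the "if it's Production, it's on" variant would branch on `environment` instead.

```python
from dataclasses import dataclass

# Hypothetical allow-list standing in for the reserved instance types.
ALLOWED_INSTANCE_TYPES = ("m5.large", "m5.xlarge", "r5.large")


@dataclass(frozen=True)
class DatabaseRequest:
    """What the developer is allowed to choose."""
    environment: str    # e.g. "production" or "staging"
    instance_type: str
    # Deliberately no encryption field: it is not the developer's decision.


def resolve(request: DatabaseRequest) -> dict:
    """Validate the request and fill in the values that are not negotiable."""
    if request.instance_type not in ALLOWED_INSTANCE_TYPES:
        raise ValueError(
            f"instance_type must be one of {ALLOWED_INSTANCE_TYPES}, "
            f"got {request.instance_type!r}"
        )
    return {
        "environment": request.environment,
        "instance_type": request.instance_type,
        # Not a boolean the caller can flip; always on.
        "storage_encrypted": True,
    }
```

The design choice worth noticing is that the guardrail is expressed by what the interface leaves out: a caller cannot disable encryption because there is no input for it.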

And so our idea was not to kind of, you know, throw the baby out with the bathwater and say, "Hey, you know what? This infrastructure as code thing is wrong. It's a mess." We've done a ton of investment in there and I don't think it's sunk cost fallacy, like there's value in there.

Now the question is, how do we scale it? We know that the providers work well, we know that Terraform works well, we know that the idea works well. It's scaling that operational expertise that is the hard part, that gets people tripped up.

So the idea with Massdriver was just to essentially put an interface on top of your IaC, whatever it is, and then allow that interface to be customized by an Operations team without them having to learn React and TypeScript and essentially re-skill to be able to extend their skill set. Like we want these people to be able to extend these skill sets quickly. We want to be able to scale faster. And having a whole team say, "Hey, stop and learn how to write TypeScript and React and build web servers and controllers and database interactions" is hard. And so that was kind of the founding philosophy, we just want to make it where you can work in the tools that you know. And essentially a site kind of reacts to that and builds a platform out for your developers that's really streamlined around what you want the cloud to look like. Kind of like lowering that surface area of the cloud.

Kelsey: 00:29:12

You know, I think there's a thing there that... it's in the thread, if people are following it. All of these platforms, all of these tools, aka the Cloud, including all the SaaS products, they have so many knobs. And each of those knobs represents a decision, whether it's a security property, what region, how much you want to spend - there's a ton of decisions. You can imagine there being 10 to 100 decisions for every product that's available. And so if your team is using, let's say, 10 products, there's anywhere from 100 to 1,000 decisions that need to be made every time you start from scratch.

And I think what you see as someone who worked at a cloud provider, we know we have all these APIs, and we always ask ourselves before we release a product, what's the default experience look like? And so that's what we do in the Web UI. We pick a default region, we pick a default size, we pick a default everything and say, "Hey, what's the fewest decisions we can get away with for you to get some value from this product day one?" And we can do that.

If you take that same product, and you go to the command line, and you run gcloud --help, it's like, "Here's the entire world, here's every flag, here's what they mean. You got to make your decision." And I think it's those decisions that when Massdriver shows up, it says, "Listen, Terraform gives you a way to make decisions, pretty much any decision that the cloud provider allows you to make, using a dialect like HCL, but you still have to make those decisions. Once you make those decisions, the next question is, do you make the developers also make those decisions?"

And what do I mean by that? If you make a really great Terraform module and maybe you don't expose everything that the cloud provider does - so you go from 1,000 to 100 things like, "Hey, our module's super flexible. People need different things for staging than they do for production. And we're going to give them that. So we're going to have a pretty big API."

And so then you take your pretty flexible module, you pat yourself on the back, you may even version it, and then you tell the team, "Here's the golden module." And now what you're telling them is, "You only have 100 decisions to make. Please make the right ones or the CI/CD pipeline is going to catch you and tell you which ones you've done wrong, because we're going to late bind the validation or, if I have enough time, I'll catch it in code review."

I think what you're saying with Massdriver is - let's move those decisions. First of all, let's make some decisions. And the decision would be you can only deploy 3 out of the 100 instance types Amazon has, because as a team we've made those decisions. And instead of giving you a custom command line tool, which I'd imagine you could do, you could say, "Well, here is a UI that represents the only decisions you can make and whatever you pick should work because we will never let you make a decision that isn't valid."
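One way to picture "we will never let you make a decision that isn't valid" is to build the form from the same data the validator uses, so the UI can only offer choices that pass. The sketch below is a minimal illustration under that assumption. `CHOICES`, `form_fields`, `is_valid`, and `every_selection` are hypothetical names, and the option values are made up.

```python
from itertools import product

# Hypothetical decisions the platform team has already made.
CHOICES = {
    "instance_type": ["m5.large", "m5.xlarge", "r5.large"],
    "environment": ["staging", "production"],
}


def form_fields(choices: dict) -> list:
    """Describe each field as a dropdown, so free-form input is impossible."""
    return [
        {"name": name, "widget": "select", "options": list(options)}
        for name, options in choices.items()
    ]


def is_valid(selection: dict) -> bool:
    """The check a pipeline would otherwise run late, after the PR is open."""
    if set(selection) != set(CHOICES):
        return False
    return all(selection[name] in CHOICES[name] for name in CHOICES)


def every_selection(choices: dict):
    """Yield every combination the form is able to produce."""
    names = list(choices)
    for combo in product(*(choices[name] for name in names)):
        yield dict(zip(names, combo))
```

Because the dropdown options and the validator read from the same `CHOICES` table, `all(is_valid(s) for s in every_selection(CHOICES))` holds by construction. Validation has moved from the end of the pipeline to the moment of input.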

Cory: 00:32:03

Exactly. Yes, I mean that is exactly it. And people are probably like, "But, but, but, but, but..." I know what the buts are. I've heard this one a million times. It's like, "But what about that developer that does need the different value?" Guess what? That's DevOps right there. That's collaboration.

You're going to talk to that developer about why. And you're going to understand the problem. And you might see that that field that you didn't expose is actually important in some workloads, and maybe you add it and set a good default.

But this is one of the things that's bothersome... it's funny hearing you describe this from like the UI perspective of GCloud versus the gcloud... or, sorry, Google Cloud versus the gcloud command, because I feel like that's what many organizations do with like the public Terraform modules - they are the Google Cloud UI of Terraform. It's like, "Oh, we made all these nice safe decisions for you. You can just deploy this thing."

But again, that's where you start to get in these places where a developer's like, "Oh, I can change literally any of these? Which ones should I change?" Like whether they're just trying to reason about all the different things or they change one field and it's like, "Well, if I change this one, how does that affect this other one that has kind of a similar... Do I need to change that too?" They're going to spend a lot of time googling stuff that may not matter.

Even in that scenario, like it's still just a lot for people to reason about. So I just think like providing that guidance and then finding the places where 20% of the use cases don't fit this golden module. That's not a failure, that is an extreme success. What do I mean by that?

Well, if 80% of the workloads in your organization have never had to ask you a question and their PRs have just gone through the first time, you have a lot more time to deal with that weird case where somebody's like, "Oh, in this instance we need this." Great, it's a use case you didn't know about and the business is learning about it. You might even find that there is a second golden path. And that's also fine.

Hey, you know, let's say it's databases, right? Maybe day one is like, "Hey, we run Postgres and we've made a bunch of decisions for you. You don't have to think about the zones, you don't have to think about the auto-scaling, et cetera. We've kind of abstracted that away from you." And then all of a sudden somebody comes along and they go, "How do I run MySQL in it?"

Here's the rub. If I give you access to the Google UI or if I give you just straight access to the AWS RDS module, guess what? That developer's putting MySQL in the cloud and you might have a bunch of constraints that you want to put around that. What version of it? Are there any policies that we want to apply? Are there certain extensions we want to make sure that they're going through? Like you've just allowed people to sidestep everything you just did.

But if you're like, "Hey, you know what? We run Postgres. Here's the database module. We called it database because we only have one." And then the developer comes along like, "Oh, the reason we need MySQL is we bought something off the shelf and it only supports MySQL." Well, that's great. Now we can start to think of like, "What does MySQL look like as an offering here?" Maybe this is a one-off and that's fine. Or maybe we know that some developers have wanted to use MySQL for other workloads, right?

Like that to me is what DevOps is, these teams working together and figuring out what their constraints are to get things places. And I think one of the ways that you can do that, whether it's a tool like Massdriver or not, is focusing on nailing the 80% use case in those modules.

I've seen people do this in all sorts of really interesting ways, right? Like going back to the AZ thing I mentioned a minute ago, it's hard to get right. Developer, you're like, "Oh man, we have a lot of outages. Maybe I should put Postgres Aurora in all four zones of US East 1. It'll never go down." Yeah, it won't. I mean, your CFO is going to be pissed in like three weeks, right?

How do we make that decision? It's like, "Well, what is the workload? What are our requirements?" We have SLAs, right? It's like, "Oh, this service has a 99.99% SLA or 99.9% or 95% SLA." It's like make the SLA one of the inputs to the Terraform module. What is the SLA? It's 95%, 99% or 99.9%. And then you make decisions for the developers based on that. You go, "Oh, this application requires extremely high uptime. I am going to do auto-scaling to AWS's max and I'm going to put it across all the zones." Or you see, "Oh, this is a staging workload. I'm just going to make a single Postgres instance because it's staging and it doesn't really matter."
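The idea of making the SLA an input can be sketched as a lookup from SLA tier to infrastructure decisions. This is an illustration only: the three tiers come from the conversation, but `DatabaseTopology`, `SLA_TOPOLOGY`, `topology_for`, and the specific counts are hypothetical values a platform team would choose for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseTopology:
    instance_count: int
    availability_zones: int
    autoscaling: bool


# Hypothetical mapping from SLA tier to the decisions made on the
# developer's behalf. The numbers are placeholders, not recommendations.
SLA_TOPOLOGY = {
    "95%": DatabaseTopology(instance_count=1, availability_zones=1, autoscaling=False),
    "99%": DatabaseTopology(instance_count=2, availability_zones=2, autoscaling=False),
    "99.9%": DatabaseTopology(instance_count=3, availability_zones=3, autoscaling=True),
}


def topology_for(sla: str) -> DatabaseTopology:
    """Return the topology the platform team has chosen for an SLA tier."""
    try:
        return SLA_TOPOLOGY[sla]
    except KeyError:
        raise ValueError(
            f"sla must be one of {sorted(SLA_TOPOLOGY)}, got {sla!r}"
        ) from None
```

The developer answers one question they can actually reason about ("what uptime does this service need?"), and zones, instance counts, and auto-scaling follow from it instead of being separate inputs.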

And so I think there's lots of ways to think through how you can make using Infrastructure as Code easier for developers. And I think the key part to it is taking your experience as an Operations/DevOps platform professional and imbuing that module with your expertise. Like that's the only way to scale it.

Kelsey: 00:36:30

I think the thing for people that come from the world of shell scripting is, in order to write a shell script you almost have to know the end-to-end thing that you need to happen, and there's very little flexibility. Like, I haven't seen a lot of great shell scripts that are configurable and have config files and really robust flags. It just doesn't happen, right? It's like, "When this script kicks off, it's all or nothing, bro. You might get an exit code that's valid."

Cory: 00:36:54

Pipe fail.

Kelsey: 00:36:56

Yeah, exactly. And then when you get into this world, you kind of want to avoid overly constraining things. You want to be a little bit more flexible. You kind of want to give in to the illusion of choice.

But I think there's another component to this which is, if you're going to be building out these tools, you have to understand you're never finished. And I think the idea of a golden path has thrown a lot of people off. They're like, "Oh, we're going to figure out the blessed path going forward. We're going to account for all the things right now, today, and that's what you're going to use."

And it's like, yeah, I think checkpoints are a little bit better idea. Like, "Based on everything we know and the experience we have, this is how you can use these things today. And if one of those doesn't work for you, we know this should change and we're not going to make a big deal about it. So if you really want to introduce a new flag or a new region, just help everybody understand, because that's the learning part. We need a new region because... Going forward we want to make that option available because..."

And once we all get consensus one time, and we go back and update the Terraform to add that, we update Massdriver with the constraints to show that, and all of our monitoring tools to expect that, then we've now learned permanently. And we don't have to rehash this debate again.

So you have to do that bit of upfront. I think a lot of people have been afraid of that product development loop because they feel, "Oh, this feels a little bit like waterfall. It isn't agile enough or it's too much flexibility. We can't manage this much flexibility." But if you get into this way of working, things like Massdriver should just be checkpoints along the way.

So if someone joins the company tomorrow, they should benefit from the last 15 years of you guys learning how to work together. And then just using the best of the breed that's available. And if they have new suggestions, they can enter the loop as well. Versus rediscovering what's possible every single time someone wants to do something.

Cory: 00:38:54

Thanks for listening.

Next time, Kelsey and I are going to talk about how golden paths can fall apart, what good Infrastructure as Code can actually buy you, and how GitOps can sometimes shoot you in the foot. Hit subscribe and make sure to tune in to Part Two. Thank you.

Episode 37

8th Oct 2025
