Beyond GitOps: Rethinking Cloud Self-Service with Dave Williams

Is GitOps holding your team back? In this thought-provoking conversation with Massdriver co-founder Dave Williams, we challenge conventional wisdom around cloud infrastructure management and explore why traditional approaches to compliance and self-service may be creating more problems than they solve.

Discover how leading organizations are moving beyond ceremonial approval processes to create truly automated, self-service platforms that enhance developer productivity while maintaining security and control. Learn why treating infrastructure as code differently from application code could be the key to unlocking engineering velocity.

Key topics covered:

Why compliance doesn't require manual GitOps workflows
Creating meaningful abstractions that codify operations expertise
The shift from reactive to proactive infrastructure governance
How platform teams can become strategic enablers rather than bottlenecks

Whether you're a platform engineer, engineering leader, or developer frustrated with current infrastructure processes, this episode offers practical insights for evolving your approach to cloud operations.

Guest: Dave Williams, DevOps Propagandist

Massdriver


Transcript

Intro: 00:04

You're listening to the Platform Engineering Podcast, your expert guide to the fascinating world of platform engineering.

Each episode brings you in depth interviews with industry experts and professionals who break down the intricacies of platform architecture, cloud operations and DevOps practices.

From tool reviews to valuable lessons from real world projects, to insights about the best approaches and strategies, you can count on this show to provide you with expert knowledge that will truly elevate your own journey in the world of platform engineering.

Cory: 00:40

Hey everybody, welcome back to the Platform Engineering Podcast. I'm your host Cory O’Daniel and for the next couple of episodes I'm going to be talking to different makers in the space of Infrastructure as Code, particularly around the OpenTofu space. So we're going to be talking to some of our co-founders in OpenTofu over the next couple of weeks.

This week, I have my actual co-founder, Dave Williams, joining the show from Massdriver. We're just going to talk a bit about the future of Infrastructure as Code, how we see it, and the value that we think we're adding to the ecosystem.

That's kind of what the next couple of episodes are going to be about - different folks trying to make operations a bit easier for organizations to develop at scale.

Dave, welcome to the show or welcome back to the show. You haven't been on in a couple of months, but we used to do all the first episodes together. Thanks for coming back today.

Dave: 01:25

Well thanks for having me, I'm looking forward to it.

Cory: 01:27

You've intro'd yourself on the show before, but do you want to do a little bit of intro before we kind of dive in?

Dave: 01:32

Yeah, I'll do a quick one. So, I started my career in a weird way. I actually got into software engineering as a consultant for visual artists, so a lot of embedded systems programming and stuff like that. In collaboration with artists, so building interactive sculptures and stuff like that.

After I got my master's degree in that, I wanted to actually make money… instead of just bumming around making art. So, I got into kind of engineering for like the web specifically. And the thing I noticed when I got there, going from like this solo engineer, focused in on a single runtime, to like, there's a team of people jamming code into stuff, was like (a) it's a team sport, and the tooling that allows that is actually more important than the code we write. And then like, runtime sort of matters… in that like, it really matters that you know what you're deploying to and how and what your constraints are… but at the same time, compute memory was suddenly so inexpensive that like optimization didn't matter in the same way that it did before.

So that pushed me into this weird kind of like Dev tools/DevOps space, even though like, I'm a product engineer at heart - like I like making things for people to consume. And I think that kind of like, led me to the space that we're in now, over the last, you know, 15–20 years.

Cory: 02:51

You're in the Dev space, you started moving into DevOps/IaC, a very traditional… I feel like it's one of the two traditional paths into DevOps land. But before Massdriver, we were working on some other things.

What was the moment for you where the idea of what would become Massdriver (we weren't even called Massdriver at the time, we were called… Connelly Corporation, I think was the name of the company)... What was that moment where the idea of Massdriver like hit and made sense to you? Even before we had anything put together that looks like what we have today.

Dave: 03:24

You and I were talking about building something just in Elixir just to write some software together. And I remember, we were doing like a photo sharing app or something (the details of it don't matter), but when we got to kind of the point where it's like, “Well, how do we deploy this?” You and I both looked at each other and said, “Not it.”

I think for people with, between us, like 40 years of experience, it's really weird that we were in dev mode - we care about product features and writing the same Infrastructure as Code we've written in every job we've ever been in was wildly unappealing.

It was like, is there a way to just get this so we can build whatever we need to build, whether that's the photo sharing app or anything else and just have that happen without thinking about it.

And I think that led to… Massdriver is actually the thing that allows us to do whatever we want to do.

Cory: 04:10

I remember this day very vividly, because you were coming over to my house and the goal was we were going to do some whiteboarding on the wall. And I had planned to paint my wall with this whiteboard paint that I'd bought. And it had sat in the garage for such a long time, when I went to paint, it was a plastic brick - it solidified.

And so you came over, you remember, we had that… it was like a 24 by 18 inch whiteboard that we tried to diagram everything on. Do you remember that? It's still up today. It's what my wife uses for like tracking her workouts today.

But we're like, diagramming with like the tiniest markers, like trying to diagram our system, that we're working on.

Okay, so, you know, we get the idea for Massdriver, we start working on it. We've obviously gone through a lot of iterations, but we've gotten to a place now where things are resonating, we feel really good about it.

Like, what would you say is the key problem that we're solving today for teams with Massdriver versus like what we were originally trying to do with the product? And how do you think this aligns with like the broader challenges facing platform and infrastructure engineers today?

Dave: 05:15

In the beginning, we were kind of being like, “All right, how would we make our job as operations engineers like a lot easier if we were Ops?” And I think we built a tool for Ops people that can be consumed by Ops people.

And I think like the major change in the way we think about it, and really the aha moment is like, it's actually not about ops people doing their job over and over. Like writing Terraform isn't hard for them, they're really familiar with the tools. But enabling self-service, so you can actually like distribute some of the operational burden on to like your entire organization.

I mean, I say burden, but software engineers/product engineers want the ability to manage their infrastructure, they want the access to make the changes they need to make. And our job is to keep them safe to do so.

That's really the aha moment where this becomes a product. Now, we know exactly who's consuming it, we know exactly who's kind of like curating it. And that's where like the real value is - in that enabling of self-service.

Cory: 06:27

Yeah. And so self-service is something I want to tie into because we actually talk about this on the podcast constantly - I think almost every episode. You know, even going way back to the IaC episodes I did at the beginning where I was kind of talking to all the tool builders from the early two thousands through to now that built most of the Infrastructure as Code tools we use.

Self-service is a word that we use a lot. And I feel like it's either done one of two things. It's either become just a buzzword or it's become almost one of those words that has no meaning anymore because it's very different from organization to organization, practitioner to practitioner - like what they mean by self-service.

I would love to know, like in your words, like what is self-service? Like what should an engineer expect of a system that's giving them self-service? And what should an operations engineer expect of a system that's delivering self-service?

Dave: 06:58

Yeah, so let me touch on the developer first. I mean, self-service in this case really is the cloud console.

We see all these companies that complain about drift, right? They complain about drift because someone is going, “I don't know how to do this complicated thing with, you know, Terraform and GitOps. I need to make a change right now. I'm going to the thing that's easy to use and readily available.” And it's the cloud console.

Now, there are so many downsides to the cloud console, right? Like you lack reproducibility. Who knows what button you pushed and will you change it in all your environments? And so you need to get all that safety and security from like the IaC tooling while making it so readily available to people that they can just make the right choice.

I think on the operations side, it's really about not having to worry about waking up at one in the morning and debugging a problem that like you probably don't have a ton of visibility in. Like if I deploy an RDS instance and no one ever puts data in it or queries it, it will never break. It'll just run forever. Maybe AWS reclaims it eventually, right? But at one in the morning, if someone pushed a gnarly N+1 that like is triggered by a job that runs at midnight, I, the Ops person, don't have visibility into the code change that did that.

And so, like, I need to put the power into the people's hands who have that context to like get things done. And that makes my life… [Dave knocks something over and there is a crashing sound]

Cory: 08:21

[Laughing] That makes your life… crashing sound. I love it. Sorry, what was the last part?

Dave: 08:27

So much easier, right? I'm getting passionate and like kicking my feet around.

Cory: 08:31

I'm kicking a toddler's pram right now that is under my desk for Christmas.

It's funny, like, you know, when I think about self-service and where we are, I feel like it definitely varies from team to team. And I think it should vary like what it means from team to team.

I think it really comes down to like dialing into like where an organization kind of sits on that DevOps maturity model. Right? Like, you know, self-service for one team might be Heroku. And like we saw teams thrive on that and thrive on PaaSes today even. And for other teams, where you start to need more access, like maybe you have people that are like every single engineer is going to write Terraform.

Dialing that in for your organization is really an organizational choice. But I think one of the hard things that we have as Ops people is gauging that right level that we should be delivering at.

And that's where I see a lot of organizations that come in and talk to us - where they've kind of run into some failure. Like they have an idea of how they want the world to look, but their engineers aren't at that same level.

Dave: 09:30

Yeah.

Cory: 09:31

Right. And so when you're looking at an organization of a thousand, two thousand engineers, like the idea of re-skilling or changing the skills of those engineers generally isn't something you can necessarily do. It's like turning a container ship, right?

Dave: 09:45

Yep.

Cory: 09:45

How would you say that it affects self-service for some of these orgs where they have just a massive team of engineers, smaller operations personnel? Like how do we make it easier for these orgs to actually get self-service? Like what is the value that we're creating for these operations and engineering teams?

Dave: 10:01

Yeah, so I think it's twofold, right? Like the first kind of obvious one is like, I think we don't manage infrastructure as code the same way that it has been for the last decade, right? We actually like package up smaller pieces and make them composable, right?

And so it doesn't require brand new interaction. Like you don't have to make new Kubernetes manifests and deploy them, right? Like you can just take what the ops team has packaged, right? You don't need to make a new main.tf to string modules together. You get what the operations team has packaged. And so it really lowers the amount of code review that an operations team has to do. And it shifts them to like adding value, right?

And I think from the developer side, that consistency in like user experience between Helm and Terraform is just really nice. It’s just one less thing you have to think of while you're trying to ship product features and get closer to the customer, right?

I think the other side of this is… and it's funny you kind of like bring up PaaS. I think PaaS is risky beyond a certain point because like there will just be a moment you need to grow and a million dollars to peer a VPC is just money you're not going to want to spend, and you're going to be forced to, and that kind of stinks.

And so the question is like… I look at something like Heroku and the main draw to me, at this point, it comes with like scale and optimization of your resources, right? So like I have plans and those plans are big enough chunks that it's really obvious when I need to go up and down.

In the cloud, we don't have that. There's so many micro adjustments you can make and like this manifests itself by like the hundreds of consultancies optimizing your cloud spend, right?

And it's like, how do you create that thing where it's like, here are 10 resources in buckets that are obvious with alarms that'll tell you if you're scaling out of them. Like you can now provide that through software, right? Cause the alarms are now integrated, right? The thing is instead of every AWS instance in a PR for Terraform, it's like, here are the four we support, move up as you need to.
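The "four we support, move up as you need to" idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Massdriver's implementation: the tier names, instance types, connection limits, and the 80% alarm threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    instance_type: str       # hypothetical example value
    max_connections: int     # capacity this tier is sized for
    alarm_percent: int = 80  # assumed: alarm fires at 80% of capacity


# Assumption: the platform team supports exactly four sizes, smallest first.
TIERS = [
    Tier("small", "db.t3.medium", 100),
    Tier("medium", "db.m5.large", 500),
    Tier("large", "db.m5.2xlarge", 2000),
    Tier("xlarge", "db.m5.8xlarge", 8000),
]


def find_tier(name: str) -> Tier:
    """Look up a supported tier; anything else is rejected outright."""
    for tier in TIERS:
        if tier.name == name:
            return tier
    raise ValueError(f"unsupported tier: {name}")


def alarm_threshold(tier: Tier) -> int:
    """Connection count at which the tier's bundled alarm fires."""
    return tier.max_connections * tier.alarm_percent // 100


def next_tier(name: str) -> Optional[Tier]:
    """Return the next size up, or None when already on the largest tier."""
    index = TIERS.index(find_tier(name))
    return TIERS[index + 1] if index + 1 < len(TIERS) else None


def recommend(name: str, observed_connections: int) -> str:
    """Tell the developer whether to stay put or move up a tier."""
    tier = find_tier(name)
    if observed_connections < alarm_threshold(tier):
        return f"stay on {tier.name}"
    bigger = next_tier(name)
    if bigger is None:
        return f"{tier.name} is the largest tier; talk to the platform team"
    return f"move up to {bigger.name}"
```

The point of the sketch is that the alarm travels with the tier, so a developer chooses among a handful of named sizes instead of proposing an arbitrary instance type in a pull request.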

Cory: 12:02

Yeah, and I think the thing that's really interesting that I've appreciated about our approach is it really does feel like for the longest time… I'm an operations engineer as well for anybody who's tuned into this for the first time, what an episode to tune in to.

But for many years it's like… especially in one of those orgs where like the Ops team tends to write the Terraform and then like also write the Terraform for engineers… it really does seem like you do everything kind of twice, right?

It's like you write your shared modules and then you consume them. And it's like, you're always learning just enough of the operations team’s tooling and process and words as an engineer in many organizations just to get cloud resources, right?

I think the thing that's always been frustrating for me was that developers through and through are fine with abstractions, right? Like we see people run on, you know, Heroku. You see people run on Vercel. Like there's plenty of successful organizations that are running on these businesses.

And I mean, honestly, if I sat down today and Massdriver didn't exist and I had an idea for a side project and Massdriver didn't pop into my head again, I'd be running it on Heroku in a heartbeat.

Now, people might be laughing at that, but like, shit, if I got a side project, I got stuff to do, right? I only have so many hours a week to work on this thing. And the operational side of it is something that's not going to make that side project a success or not, right?

I think that many organizations, their teams are in a very similar place. If I've got 45 software teams and they're all working on very specific initiatives that like create revenue and move a business forward.

They're all thinking, I've got so much time to deliver this. I've got external pressures from project managers, stakeholders, et cetera. I've got life going on outside of this. I’ve got 40 to 50 hours this week to ship value. And if I have to stop and learn a little bit of the cloud and a little bit of my ops team tooling and how we do it every step of the way, that is a big context shift. That is a hard context shift.

What I'd love to know is from your point of view, like how does something like Massdriver (like our approach) allow people to get that self-service without that huge context shift of just like, “Okay, I'm looking at a different tool set now. I'm looking at a completely different world. I'm looking at 7,000 knobs for fine-tuning RDS or an EC2 instance.” Like how do we give people that self-service without just context switching their life super hard?

Dave: 14:41

Yeah, so one of the big things you've been talking about lately, and it's so true, is that the funny thing about the AWS API is like, it is one API call to do two things. One is kind of scale and do the things that developers really care about, right? And then there's weird like security availability settings that are like purely an operational concern, right?

You have to contend with both of those if you use Terraform, or some subset of that, right? Like you just have to understand all of these settings as a developer. And you don't care. Realistically you care about like how many requests a second can this thing take, right?

I think the really powerful thing about Massdriver is you can make all those security and availability concerns either non-existent… like we just run it one way, right? So if you're in prototype mode it's like we only use one availability zone because we don't care if it goes down, we're just trying to convince people to use this thing, right? And then like maybe you grow a little bit and it's like everything runs across an entire region and like maybe that's a button push, maybe that's something no one even has to think about. Now you're left with like four or five fields that are actually crucial.

Again I think the one in the morning test is where I really evaluate a lot of the software in the space. Where it's like one in the morning, can a developer come in do what he needs to do with minimal disruption of the rest of the team? That's really how you shrink that interface of the cloud and give them a PaaS-like experience. It's just like hide all the details they don't care about to write software.
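As a rough illustration of shrinking the interface, the Python sketch below separates the few fields a developer fills in from the operational settings a platform team fixes per environment. The preset names, field names, and values are hypothetical and chosen only for this example; they are not drawn from Massdriver or from any cloud provider's API.

```python
from dataclasses import dataclass

# Operational settings decided once by the platform team (hypothetical values).
# Developers never see or edit these directly.
PRESETS = {
    "prototype": {
        "availability_zones": 1,   # fine if it goes down
        "encrypted": True,
        "backup_retention_days": 1,
    },
    "production": {
        "availability_zones": 3,   # spread across the region
        "encrypted": True,
        "backup_retention_days": 30,
    },
}


@dataclass(frozen=True)
class DeveloperInputs:
    """The handful of fields a developer actually cares about."""
    name: str
    preset: str               # "prototype" or "production"
    storage_gb: int
    requests_per_second: int


def render_config(inputs: DeveloperInputs) -> dict:
    """Merge developer inputs with the preset's fixed operational settings."""
    if inputs.preset not in PRESETS:
        raise ValueError(f"unknown preset: {inputs.preset}")
    config = {
        "name": inputs.name,
        "storage_gb": inputs.storage_gb,
        "requests_per_second": inputs.requests_per_second,
    }
    # Preset values are applied last so developer input can never override them.
    config.update(PRESETS[inputs.preset])
    return config
```

Moving from one availability zone to a whole region then becomes a change to the `preset` field rather than a new set of settings for the developer to learn.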

Cory: 16:06

I know I've been this person on the team… and I feel like you've been this person on the team at your most previous role… another very common pattern is like you have the team lead, like the person that knows the most amount of infrastructure like on your development team, right?

And like, that's the person that, yeah, they're the lead, like your role is to lead, right? I should be skilling up this team, but that person, that person in that role is also the person that tends to get their like, brain tasered the most, right? They're just getting thrashed by people asking questions all day long, right?

So it's like when you start to centralize your DevOps into teams, now you have this one resource that everybody's just kind of pinging constantly, right? And at 1 a.m., that role sucks, like whenever there's downtime, like you're getting pulled into it constantly, right?

Dave: 16:53

Yep.

Cory: 16:53

And the thing that sucks about that… people hearing this, and maybe you've never lived this life and lucky you… is that role, like if you found yourself in that role, it's an extremely hard one to get that team out of because when you found yourself in that role, you're typically already underwater.

And like finding that room and finding that like, that like Google like error budget is so hard in most orgs. Like it's a hard one to convince the people above you that you need to do. “We’ve got to slow development so that we can make this team better.” It's a hard argument, right? It's like, can I prove that I'm going to make this team better? Like, can this team really add value if we stop shipping software and like focus on our own delivery.

We both know that the answer is a resounding yes. But many teams don't have that marketing and language within them to go out and talk about it, to figure out like, how do we do an MVP to show that we can pause and make ourselves more efficient?

And I feel like that's one of the things that we're trying to help with, right? It’s like, if we can come in and help two different people on your team become more efficient by neither of them learning any new tooling. Like you start to see that people start to rise above water a little bit, right? And like now the org sees the value of it. “We're shipping software faster. What's happened in the past month?” It's like, “Well we're letting people get what they want, but we're not making them learn new tools to do it.” Like that's beautiful, right?

Like you see people clicking around in like the AWS console and like they're getting stuff done. They have no idea what they're doing, but they're reading and they're moving. They’ve got enough docs right there. That's not there in Terraform. It's not there in OpenTofu. It's not there in Bicep or whatever, right? They're clicking around, they're having a good time. Until they need to reproduce the environment, right? Or until somebody needs to understand what's going on at one o'clock.

I feel like that's one of the places that I'm really excited about where we've ended up. Because we actually see that… we see those customers every day that just they don't have cloud experience, but their operations engineer has given them enough where it feels like they know everything they need to know to run this thing at scale. And that's powerful.

And it gives them the time to come back in and learn about the stuff. How does this work under the hood? I actually have the time now and the interest to come and figure that out. And I feel like that's so hard to get to as a team if you don't have a management hierarchy that understands that and lets you get to the point where you have enough time to tread and get above water instead of just kind of get a nostril above the surface of the ocean every three or four weeks.

Dave: 19:23

I mean, it's funny when you think about like even joining a team as operations. Like I feel like by the time someone's like, “We need to hire operations people or we need an extra.”, you're already underwater. You're going to get there and you're going to settle tech debt, and while you're settling tech debt you're not investing in like the future thing. So people are making more tech debt while you're busy settling the tech debt.

And that is how like DevOps and SRE and all those people… the cloud infra people, wind up burning out so fast at organizations. Like you'll be there for a year, you'll settle some tech debt, but there's a whole new pile for the next DevOps hire to have to deal with.

Cory: 19:58

I'd say like maybe eight to 10 weeks ago… When you look at the startup journey, there's troughs, there's highs, there's lows. There's low, low, lows. And there's like, yeah, there's a couple of highs… until you get to that series E, then there's a lot of highs. But I'd say about 10 weeks ago, we were at one of our lowest of lows, I think, ever as an organization. And then we learned some words that worked very well for us. And I think they've kind of changed our life over the last like eight to 10 weeks.

So I kind of want to ask a couple of questions about this. I know we talk about it a lot on LinkedIn, but I feel like, you know, it's a fun conversation to have. And that is compliance.

We are not a compliance tool, but talking about compliance is the thing that has unlocked customers for us and in a way that nothing else ever has.

Compliance is a big deal. It's a big part of DevOps, Ops, Platform Engineering, day-in-day-out… software engineering. It can feel like for a lot of teams, they spend a lot more time thinking about audits, thinking about compliance than actually securing the systems. There's so much more we do around it than actually it. There's so much theater to it.

How does our approach and our language, that we've started using around compliance, address the problems that many orgs face today? And how is Massdriver helping teams focus on meaningful compliance initiatives instead of just checking all the process boxes?

Dave: 22:34

It's funny, I feel like compliance used to be like a first class citizen in our kind of like outreach material. And like, if you were talking to somebody, we'd be like, automate your compliance. But I think the big thing that shifted was… self-service is important to you, you feel a lot of pains around kind of the state-of-the-art operations process and we want to alleviate those. And I think the next question that comes up is like, “Well, how is clicking around in this UI compliant?” I think the aha moment for a lot of people is like (A) SOC 2 is older than GitOps, right? You don't need this like heavy approval process in TicketOps necessarily, right? What you need is to have predictable inputs to your automation so that you can say, “Hey, this is what I'm going to allow to go out into the cloud and that's it. There is no way to circumvent this.” Right? It might be three instance types, one region, right? Everything must have alarms. You don't get a button for that.

And I think once you do that, your automation becomes… because there's no net new code… your automation becomes more of a function in your software, right? We don't code review or create tickets for every possible input to a function. We code review the fact that like we understand the range and the output of this function and we can predictably tell it won't have a weird side effect like dumping memory to a third party or something like that, right?

I think like that's the big thing about compliance - it doesn't have to be, you know, GitOps code reviews, ServiceNow tickets, all this like JIRA stuff. It's a nightmare in general to, I think, search through that stuff in an audit.

Instead we have this database that can take snapshots of stuff and we can query it and we can start to like get rich history. And be able to like diff points in time to be like what's the difference between the last audit and now and how do we tell a story about how we got here. And I think like that's really attractive when you tell that story, that you can be faster and make compliance less of a headache.
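The two ideas in this answer, a closed set of allowed inputs and a queryable history that can be diffed between audits, can be sketched together in Python. This is an illustrative sketch only: the allowed instance types, the single region, the `alarms_enabled` field, and the dictionary-based snapshot format are assumptions made for the example, not a description of how Massdriver stores or validates anything.

```python
# Hypothetical allow-list: three instance types, one region, alarms mandatory.
ALLOWED_INSTANCE_TYPES = {"t3.medium", "m5.large", "m5.xlarge"}
ALLOWED_REGIONS = {"us-west-2"}


def validate(inputs: dict) -> list:
    """Return a list of violations; an empty list means the request may proceed."""
    errors = []
    if inputs.get("instance_type") not in ALLOWED_INSTANCE_TYPES:
        errors.append(f"instance_type not allowed: {inputs.get('instance_type')}")
    if inputs.get("region") not in ALLOWED_REGIONS:
        errors.append(f"region not allowed: {inputs.get('region')}")
    if inputs.get("alarms_enabled") is not True:
        errors.append("alarms cannot be disabled")
    return errors


def diff_snapshots(before: dict, after: dict) -> dict:
    """Map each changed key to an (old, new) pair between two points in time."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes
```

Because every deployment passes through `validate`, the thing that gets reviewed is the allow-list itself rather than each individual request, and `diff_snapshots` over two stored configurations gives a concrete answer to "what changed since the last audit".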

Cory: 24:28

By making it something people think about ahead of time, right?

I love that analogy of we don't code review every single input. Maybe you do … hold on, somebody in the Elixir camp is like, “Oh no, I use stream test and we test literally every single input.” But like most people don't. The reality is most Infrastructure as Code is a bag of configuration, where we're essentially writing code to be like, we're going to make this HTTP call from a CI pipeline.

And it's just like, okay, my app makes HTTP calls all day and I look at the function, but I don't look at the millions of possible input combinations that are coming from my users. My assumption is I've written a good little black box that does a piece of magic. We've made sure the insides of that black box are happy. And then we can use it. We're attaching it to a web controller. We're attaching it to a gRPC function. People are calling it now.

And when we get to Ops, we're like, “Whoa, whoa, whoa. You want to call that function? Let me take a look at that first.” Right? Which when you think about the scope of software and how we write and the differences in our two systems from the development side to the operations side, when you think about it like a developer, it seems extremely silly. But this is where we are.

This other really great analogy, I think, especially with the reactive processes most people have around compliance… I felt this very recently. So I have TSA PreCheck. I love it. You know me. I'm dramatic. I love the nicest… I like the nicest everything. I'm like, you know, man, if I go to an airport and they don't have a PreCheck line, like there's a good chance I might take an Uber to get to the other airport. And we were flying. I can't remember where I was flying from. But I got in the TSA PreCheck line and I'm like, “Yeah, self-service. I'm cruising.”

And it was one of those airports where they don't actually have TSA PreCheck, they just have the line and then they put you into the regular line with that card, but they don't quite give you like all the rights. So all of a sudden I'm taking my fucking laptop out, I've got my shoes off and I'm like, “I was just in the PreCheck line, but now like I'm getting scrubbed down.”

And that's what most self-service feels like to me. It's like, “Yeah, no, you guys can do whatever you want.” Like we did it. We've delivered self-service. And then you're like, “Sweet, I can just…” and it's like, no. Plan slaps you, Checkov slaps you, OPA slaps you. You deploy something that's wrong, you're pinging somebody in Slack and they're like, “Let me tell you why you did it wrong. You should have had me look at the PR because I know databases more than the person that approved the PR.” Right?

Our processes in many self-service systems are inefficient. We want to give folks self-service. Self-service is I come in, I pay, I get my food, I go. Like it's not… me just getting tripped on the way out the door, somebody's stopping to check on my bag when I'm leaving McDonald's. Self-service today really is only half the picture and we kind of just trip people up along the way.

That being said, it does feel like many self-service systems are almost intimidating to operations. I know that if I don't have something where somebody's pre-approved, they're going to come back to me to ask me if they can do this. I've given them the ability, but now there's a PR. But the scarier part of that is when that PR gets merged, whether I'm involved or not, I'm back to that 1 a.m. problem that you mentioned earlier. I don't know the system that you've built now, right? And I think that is, rightly so, terrifying to many ops people who are currently underwater.

Besides adopting a system like Massdriver, how do operations engineers get to that place where they can give people self-service and truly feel that they're going to get compliant systems that they know adhere to all their company's organizational non-negotiables on their own? Like, what options are there out there for somebody to go and put something together like this without like leaning towards necessarily a vendor?

Dave: 28:42

Yeah, I mean that's an interesting question, right? I think we've seen orgs do some version of this. I think if you asked me six, seven years ago if like your first version of your platform should be like a CLI that makes the Terraform and whatever else for a new environment and like concatenate some stuff together and hand it to a developer… I would have been like that's probably a good first step. But I think even that is really challenging, right?

I think, at some point, you need to get away from using Git as a database to like understand your system, right, and monitor your system, and you have to move to something that looks more like an API with a database.

I just think that's… ops is a horizontal team in your organization, which is kind of weird because we have a lot of vertical teams in tech orgs. And so you need to do the thing that the greatest e-commerce stores in the world did to actually… or they weren't e-commerce at the time, they were kind of just commerce and then they figured out, “Oh, I can do millions of transactions on the internet.” And I think we have to just start adopting that posture. And even if it's something small like a Backstage plugin… which I think will still get you somewhere, it may not get you as far as you need to go… I think just getting in the mode of doing that. Like, really think about your module abstractions, right? And eliminate that net new code problem you have with IaC tools. Because that's where a lot of this miserable compliance stuff happens.

Cory: 30:15

Can we tap on that for a second? I want to talk about abstractions because I feel like this is something we've been talking about lately and I feel like it's resonating with a ton of people.

That net new code thing is… you know, I know that we kind of talked about it a little bit a few minutes ago, but I would love you to just kind of like… I'm gonna say it, I'm gonna use a real shit-heady phrase here… I'd love for you to double-click on that just for a moment. Can you double-click on that because I feel like that's one of those things that like is hard to… it's hard to grasp until like you've actually adopted a system where it doesn't exist anymore.

‍You mentioned the Snowflake main.tf, can you talk through… just for the people that aren't familiar, or maybe the people that are familiar that need to have this torturously dragged out… what does the process of consuming infrastructure as code look like for the average engineer?

Dave: 31:04

Yeah, so let's say you're at the current state of the art. You've made modules that are reusable, right? You've limited the amount of new Terraform someone has to write. It is impossible with something like Terraform, Open Tofu, to not have a main.tf file that actually pulls in that module and connects it to something else, right?

You're constantly kind of in this infinity cycle. You can't actually just deploy that module. There needs to be some controlling thing. And I think… that requires a code review. It's new code. You can't just say deploy this brand new code.

I think, if you use your module though, you're going to that like modules as automation. And if you can just control the inputs to that module and deploy it without writing that new code, like your compliance burden shrinks massively. It's just like you use the module like a function as opposed to like a library in a programming language.
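A minimal sketch of the "module as a function" idea Dave describes, assuming a hypothetical, already-reviewed `postgres` module whose only moving parts are its inputs. The schema, input names, and limits below are illustrative assumptions, not Massdriver's implementation: the point is that a deployment becomes validated values rendered to a tfvars JSON file, with no new main.tf to review.

```python
import json

# Hypothetical input schema for a pre-reviewed "postgres" module.
# Input names and limits are illustrative assumptions.
POSTGRES_INPUTS = {
    "storage_gb": {"type": int, "min": 20, "max": 1000},
    "multi_az": {"type": bool},
    "environment": {"type": str, "choices": ["dev", "staging", "prod"]},
}


def render_tfvars(schema: dict, values: dict) -> str:
    """Validate values against the schema and return tfvars JSON text."""
    unknown = set(values) - set(schema)
    if unknown:
        raise ValueError(f"unknown inputs: {sorted(unknown)}")
    missing = set(schema) - set(values)
    if missing:
        raise ValueError(f"missing inputs: {sorted(missing)}")
    for name, rule in schema.items():
        value = values[name]
        # bool is a subclass of int in Python, so compare exact types.
        if type(value) is not rule["type"]:
            raise ValueError(f"{name}: expected {rule['type'].__name__}")
        if "choices" in rule and value not in rule["choices"]:
            raise ValueError(f"{name}: must be one of {rule['choices']}")
        if "min" in rule and value < rule["min"]:
            raise ValueError(f"{name}: must be at least {rule['min']}")
        if "max" in rule and value > rule["max"]:
            raise ValueError(f"{name}: must be at most {rule['max']}")
    return json.dumps(values, indent=2, sort_keys=True)
```

The returned text could be written to a `terraform.tfvars.json` file next to the module, which Terraform and OpenTofu load automatically, so the only thing that varies between deployments is data that has already passed validation.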

Cory: 32:01

Yeah, and I think that's one of the things that's a bit odd, right?

‍We do write it as code today and so you're like, “Oh, this has to go through a code review.” And I think one of the things to make clear here, because I feel like when we write about this online, I think people miss this - we think code reviews are great.

Dave: 32:18

Yeah.

Cory: 32:18

I'll look at your Elixir code all day long. You'll look at my Elixir code all day long. We have critiques for each other. We do that. But when it comes to infrastructure, again, it's making an HTTP call. It's calling a function. And it's like that we never… maybe in a test suite, but that's about the extent of it.

‍But that's the wild thing about IaC - we do have these test toolings. We can use things like Terratest or even Open Tofu or Terraform test to test our modules where we wrote the shared ones. Why do we have somebody looking at the code down the line?

‍And I think (1) is because it's an artifact of the way that we do it through something like Git. And I think (2) is I feel like people are still just a bit nervous about that loss of control. If you can have approvals in place, there's nothing in SOC 2 that says a bag of meat needs to click a Yes button. It says you need to have approvals. You need to make sure things are compliant. You need to have an approval process in place. It doesn't say it can't be automated.

‍I think that's one of the key things is like not kind of chaining yourself to this idea of like compliance theater where you're like somebody has to click the button. Like ta-da I'm here. I'm going to click the button now and things can roll forward. Like our job is automation. But then we put this like one part where we're like, we've done so much automation, but let's stop and ask a bag of meat what they think about it. Right?

‍Meanwhile, we're starting to see AI everything. AI this, AI that. But we're like, we're still going to stop and approve something, right?

‍How can we be more proactive about these approvals? How can we limit the inputs to these systems so that they're approved from the get-go, right? You can just go run the thing instead of writing some code, getting it approved to run the thing.

‍It's really wild like how much of our job is about automation. But when you think about like what we've automated, it's like… It's like we make these calls from like one place and it's in the weirdest runtime, like your CI/CD, right? And like we're not truly automating it because there's always like a little stop process in it. So it's like automated-ish. Ish.

Dave: 34:27

Yeah, I mean, it's funny, that going from reactive to proactive, it's like… the spirit of a lot of these compliance frameworks is like prevent unauthorized changes to your system, right? And I think unauthorized changes is all about like limiting the scope.

So with like code, right? New code, the potential inputs into new code are infinite. Anyone can type anything they want, including something malicious. And it's really important that we really focus in on that and probably get more than one set of eyes on it, right?

But when it comes to configuration, it just feels like a giant hurdle. Like it should just be JSON getting passed to your automation, right? And like people should just be able to self-serve and that should be audited, right? But the second that you have rich validation and a limited number of inputs, it's like, cool, you're pre-approved, you're reactive… I mean, you're proactive now instead of reactive.
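One way to picture "pre-approved" configuration is a small function that approves a change automatically when it stays inside bounds the ops team set ahead of time, and records an audit entry either way. This is only a sketch under assumptions: the setting names, ranges, and the `Decision` shape are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical pre-approved bounds per setting; values are illustrative.
PREAPPROVED_RANGES = {
    "dynamodb_read_capacity": (1, 100),
    "replica_count": (1, 5),
}


@dataclass
class Decision:
    approved: bool
    reason: str
    audit: dict


def review_change(user: str, setting: str, old: int, new: int) -> Decision:
    """Approve a config change automatically if it stays within bounds."""
    # Every request is recorded, approved or not, so the trail is complete.
    audit = {
        "user": user,
        "setting": setting,
        "old": old,
        "new": new,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    bounds = PREAPPROVED_RANGES.get(setting)
    if bounds is None:
        return Decision(False, "setting is not self-service", audit)
    low, high = bounds
    if low <= new <= high:
        return Decision(True, f"within pre-approved range {low}-{high}", audit)
    return Decision(False, f"outside pre-approved range {low}-{high}", audit)
```

Anything that falls outside the bounds is the signal to start the product feedback conversation rather than a reason to review every change by hand.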

I even think about code review and how much I actually trust code review. Like how many times have you gotten a giant Terraform PR and you're reading it… halfway through you're like, “Cool that's an instance size, it's real. This will pass.” versus like, “I'm going to go call up the product lead on this product to make sure that that instance type is actually appropriate. I'm going to call the CFO to be like, are we anywhere near our cloud budget, because this might push us to the edge.”

I feel like that's not happening in code review. But when you're proactive in building these automation systems, you can actually bake that stuff in ahead of time. And if somebody needs to go outside of that loop, outside of the 80% use case, you've built a product feedback loop.

A developer has to be like, “I think there's a case for a new instance type.” And it's like, “Cool. There's a couple of stakeholders who are going to have an opinion on that. Let's determine whether or not that's reasonable. And let's add it.” Or “Let's make a new thing that maybe there is one use case for this new thing. And if you need this, you can use that. But we'll leave the old one the way it was.”

Cory: 36:31

Yeah, I think the thing that's wild with that is like… you know, I'm a Kubernetes fanboy. I develop locally in Kubernetes. That's my life. I'm fine with it. I'm happy. I'm happy where I'm at, guys. Don't judge me… but like, you know, we're in Kubernetes. We're making like, “Hey, what are your resource requests? Like, how many CPUs? Like, how many RAMs? Like, meh.” And I'm like, “Oh, like, I run my thing locally. I got an idea. I maybe need like 256 megs of RAM for this.” (We write in Elixir, guys, so we don't really require a bunch. Enjoy your Javas).

‍But like, you know, I throw in some numbers. I'm like, “Hey, you know, one to two CPUs.” Like I got an idea of what my resource requests are, but it is funny like hearing instance types.

‍Like this is one that always drives me wild is, you can put anything you want in that field. It's a string. I could type and type whatever I want. And like, man, I've been using RDS since it came out. I forget to put the DB dot on my instance types almost every single fucking time.

‍Like it's like I know databases. My master's is in database systems. I know how to size a database, right? Every single time I make an RDS instance, I'm Googling the fucking obtuse names of instance classes in AWS, trying to remember how much RAM they have, how much memory do they have, how many CPUs they have. That in and of itself, pre-approvals aside, it is such a fumbly way to understand the system.

Seeing a PR come through, and you're like, “OK, oh, yeah, yeah, yeah. OK, he's got an r6g.xlarge. That probably makes sense. I mean, I don't know anything about how he's using this database, but I mean, I know that's a big one, so that's probably great. And I missed the db. part on the front of it, so like, oops, that's going to fall on its face as soon as the plan goes through.”

‍But like, just that notion of like, “Hey, here's some instance types that we do use.” Maybe we don't use T class at all in AWS because they're burstable, they run out of compute, they get weird. Maybe we just say we don't want people putting them in whatsoever. Being able to say here's a preset list you can select from. Or maybe these are reserved instances that we have, and we want to use this specific type. Or maybe you're on one cloud (I won't mention any names, but you could Google it) and maybe there's an instance type that's just always gone and you look at it and you go, that's perfect. Maybe you know the workload inside and out and you're like, that is the perfect instance size. But man, that thing is always exhausted and there's never one there. Right?
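The preset list Cory describes can be as simple as an allowlist check that runs before any plan does. The sketch below assumes a hypothetical list of approved RDS instance classes and a team rule against burstable classes; it also catches the forgotten `db.` prefix with a useful message instead of a failed plan.

```python
# Hypothetical allowlist an ops team might publish; not a recommendation.
ALLOWED_DB_CLASSES = ("db.r6g.large", "db.r6g.xlarge", "db.r6g.2xlarge")


def check_db_instance_class(value: str) -> str:
    """Return the value if it is allowed, else raise a helpful error."""
    if value in ALLOWED_DB_CLASSES:
        return value
    # Common slip: typing the EC2-style name without the "db." prefix.
    if f"db.{value}" in ALLOWED_DB_CLASSES:
        raise ValueError(
            f"'{value}' is missing the 'db.' prefix; did you mean 'db.{value}'?"
        )
    # Assumed team policy: burstable (T class) databases are not offered.
    if value.startswith("db.t"):
        raise ValueError(f"'{value}' is a burstable class, which is not allowed")
    allowed = ", ".join(ALLOWED_DB_CLASSES)
    raise ValueError(f"'{value}' is not in the allowed list: {allowed}")
```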

‍There's just so much tediousness around it. But if I'm in Kubernetes, I'm just typing in how much RAM I want, how many CPUs I want. Like I'm not talking about instance sizes, right? And it works fine for so many workloads.

‍That idea of bringing that kind… not necessarily saying that's the right thing for every team, but we've seen this with teams today where it's like, “Hey, how many CPUs and how much RAM do you need?” That's what they have in Massdriver. And somebody's like, “I need two CPUs and four gigs of RAM” and it picks the right instance size for them on EC2, or it picks the right database size. There are abstractions that make sense to your org that you can start to build around your infrastructure as code to make it accessible to people.
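A rough sketch of that "CPUs and RAM in, instance size out" abstraction, assuming a small hand-maintained catalog. The catalog contents and the "smallest fit" rule are illustrative assumptions, the instance specs should be checked against current AWS documentation, and this is not how Massdriver necessarily does it.

```python
# Illustrative catalog: (instance type, vCPUs, memory in GiB).
# Verify specs against current AWS documentation before relying on them.
CATALOG = [
    ("c5.large", 2, 4),
    ("m5.large", 2, 8),
    ("c5.xlarge", 4, 8),
    ("m5.xlarge", 4, 16),
    ("m5.2xlarge", 8, 32),
]


def pick_instance(cpus: int, ram_gib: int) -> str:
    """Return the smallest catalog entry that satisfies both requests."""
    candidates = [
        entry for entry in CATALOG if entry[1] >= cpus and entry[2] >= ram_gib
    ]
    if not candidates:
        raise ValueError(f"nothing in the catalog fits {cpus} vCPU / {ram_gib} GiB")
    # "Smallest" here means fewest vCPUs, then least memory.
    name, _, _ = min(candidates, key=lambda entry: (entry[1], entry[2]))
    return name
```

A request like `pick_instance(2, 4)` never mentions an instance family; the catalog, which ops owns, is the only place those names live.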

‍And I think the wild thing about this is not just that accessibility. That's great. I love that. I love that we give people that. It gives people the time to look into it more when they want to.

‍But I think the two things that happen, there's two extremely powerful knowledge transfer things… which is the point of our job in DevOps. We're supposed to tear down the silos. Mix the corn in the hay or whatever. It's never about mixing the corn in the hay. Silos serve a perfectly sound purpose. They keep the rodents out. They keep the rain off the food. We like them... But there's knowledge transfer in designing a good abstraction. And two pieces of knowledge transfer happen that I think that we generally miss in IaC.

‍One is codifying my expertise as an ops person, right? Like to say, hey, this is how we decide on what the instance size should be for a database. I want to know how much data you have today, what the expected growth is over the year, read to write ratios, right? Like there's questions that I will ask you as a DBA to help you size your system. I'm asking you those questions. And then I tell you an instance size and you go and type it in. And then somebody else asks me the same thing, “Hey, what instance types should I use?” I'm going to ask them the same questions.

Why are those questions not the variables to my Terraform? Right? Because now if I write a formula inside my Terraform using… locals, or using some sort of, like, one of the new dynamic providers where you can call functions in Go or whatnot, I can actually code my expertise into this thing. And now everybody understands how I do this. If I die, get hit by a bus, whatever, how we got to these values is understood.
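To make "the questions are the variables" concrete, here is a sketch in Python rather than Terraform locals. The inputs mirror the questions Cory lists (data today, expected growth, read to write ratio); the formula, thresholds, and tier list are invented for illustration and are not real sizing guidance.

```python
import math

# Hypothetical tiers: (RDS instance class, memory in GiB).
DB_TIERS = [
    ("db.r6g.large", 16),
    ("db.r6g.xlarge", 32),
    ("db.r6g.2xlarge", 64),
]


def size_database(
    current_data_gb: float,
    yearly_growth_rate: float,
    read_write_ratio: float,
    planning_years: float = 1.0,
) -> dict:
    """Turn a DBA's sizing questions into an instance class and storage size."""
    projected_gb = current_data_gb * (1 + yearly_growth_rate) ** planning_years
    # Assumption: read-heavy workloads want more of the data held in memory.
    hot_fraction = 0.10 if read_write_ratio >= 3 else 0.05
    needed_memory_gib = projected_gb * hot_fraction
    for instance_class, memory_gib in DB_TIERS:
        if memory_gib >= needed_memory_gib:
            break
    else:
        raise ValueError("projected workload exceeds the largest approved tier")
    return {
        "instance_class": instance_class,
        # Assumption: keep 25% storage headroom over the projection.
        "allocated_storage_gb": math.ceil(projected_gb * 1.25),
        "projected_data_gb": round(projected_gb, 1),
    }
```

The design choice worth noting is that the reasoning lives in the function body, so changing an answer (say, growth turns out faster than expected) changes the instance class for a stated reason.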

‍And more importantly, the outside of this is exposed to developers in a way that they understand. They know, “Yeah, we're expecting about 3 million users our first month. And like, this is about the average size of the data that we're going to take.”

‍I can make a prediction at how big my database is going to be and so I can say, “OK, how often do we want to come back in and think about resizing databases?” We aim to do this maybe once or twice a year, maybe every two years.

‍These are constraints that we have in our business. And I say, “OK, you know what? I'm going to have the questions in such a way that it's going to pick a good instance size that's going to work for that team's growth for a year.” And if they're explosive and they grow way faster, great. I've got a lot more time to work with them at the six-month mark than a year from now if we need to resize, re-optimize, or move them to another database that's maybe a bit bigger.

I think the second knowledge transfer thing that happens there is the outside of your Terraform modules becomes documentation for that system. When you look at a Terraform module and you're like, “Oh it's using a db.r6g.xlarge.” You go it's using a db.r6g.xlarge and you have no fucking idea why. Maybe you go through Git, and then you find the Git commit, and it's like, “changes.” And you're like, that's not useful.

‍But when the module's inputs are, “I have 300 gigs of data today, we have a read to write ratio of x. The average data that we're writing this thing is y.” you have documentation about that system. And sure, it's going to change over time. And when that changes, it's going to change the instance type. But that's fine.

‍We're going to change that instance type if you picked the wrong one too. We're going to change that instance type if you pick the right one when something changes about the way that the database works.

‍And so the abstraction thing to me is always just… has driven me nuts - like the pressure against it. I'm very excited because the amount of people that I've seen on Reddit and LinkedIn talking about abstractions recently and starting to build abstractions using their IaC modules is extremely exciting to me.

‍This is one of those areas that I think that people need to think about it a lot more. It's not just a way of making it easier to get. It's a way of making it easier to understand for both sides, that we don't have today.

That knowledge transfer is all in Slack. Or maybe it's happening in a ServiceNow ticket, but it's so hard to search for. And it's a degree of separation from “your code” (I threw some air quotes on that).

Dave: 43:39

What's happening at the water cooler if you're RTO, right? Which is the worst place for that knowledge transfer to happen.

Cory: 43:45

But it happens. People are like, “We like people in the office. They talk and they figure things out.” It's like, yeah, and then they usually don't go back to their desk and be like, “Yeah, we talked about why it's that.” Somebody just was going back to their desk trying to… they're probably texting themself the fucking instance size as they're walking away from you so they don't forget and don't have to ask you again on Slack and feel like an idiot, right?

Dave: 44:03

Yeah, I mean, all of this is, I think, summed up in like - Ops needs to become a product team to serve an organization at scale. And fulfill the needs of a business, which is like create a platform that can deliver infrastructure to development teams without having to be deeply involved or having lots of ceremony around it. And I think the first step to that is an API with abstractions that is self-documenting.

Imagine making a logistics company and being like, “Yeah, instead of putting in the to address and the from address, you actually have to pick each leg of the journey and what, you know, mode it's going to be.” It's like all by boat over here and then we're getting on a train.

Cory: 44:44

I know.

Dave: 44:44

That would be the worst. I would go to FedEx every time, I'd pay double to just put in from and to and have someone come pick it up, right? And I feel like that's where Ops is now.

Cory: 44:55

Yeah, it's funny because it's like, that's way more complicated and the inputs I have are like a from and a to and how many days? I'm like three. And they're like, great, we're gonna put this motherfucker on an airplane. And I'm like, I don't give a shit so long as it's there before Christmas. Like, I don't care, you can put it on a fucking pack mule. Like, does it matter to me? You abstract away. I've given you a constraint and you said that you're going to make it happen and that's all that matters.

‍But we don't do that. We do pick, “Hey, how do you want to get from St. Louis to Indianapolis?” It's like, I don't give a shit. Like as long as it goes.

Dave: 45:28

I don't know, that's why I'm calling you. I would do it myself if I knew, right?

I think like that's where so much of like the DevOps loop broke down, right? It's just like… there's a lot of knowledge developers don't have and don't have time to get. And it's just like they're being encumbered with these details they don't care about and it's our job to be like, “Here's a ‘from’ here's a ‘to’. Have a nice day! You'll get tracking updates.”

Cory: 45:52

That one boils my piss right there, too… Sorry, this has triggered something in me from last week. You'll know what I'm talking about. I'll keep it. I'm going to do a verbal subtweet. But this idea of, “Hey, I don't think developers should have to know this to do it.” And people are like, “You're robbing developers of knowledge.” I'm like, “No, I didn't say they couldn't be curious and learn things. I'm saying they shouldn't have to.” We just, I mean, I don't know.

‍We collectively just suffered through this for about 20 years, and that is the burnout of the job. We have a very unique job. I think it's like us, woodworkers, and artists all have a similarity, and that is we get paid for doing something we like to do.

‍You know, it's not about taking the ability for them to learn away. It's honestly… it's being a bit respectful of people's time. I don't know how many articles there were on developer burnout in the twenty tens. You still get them today. Honestly, I feel like remote life has made everybody a bit more chill.

Dave: 46:50

Way more chill.

Cory: 46:51

But before 2020, it was like every week. It was just like an article from somebody about how burned out they… I feel like anybody I talked to was like, “Oh, fuck, I'm so close to burned out.” ‍And it's like, yeah, man, like everybody's working on startups, working 50 hours a week, maybe 60 hours a week. And then you're like, “Hey, you know what? We're running this stuff on the cloud. We're going to do self-service. And it's going to be your job to also figure that shit out.” Like that doesn't…

‍We don't look at our time at work and go, “Oh, okay, well, if I'm doing a bit more of this ops stuff, I guess I'm gonna spend 36 hours a week writing software and four hours on this ops stuff.” You go, “I'm not gonna make it home in time for dinner with my family tonight.”

‍And that's the thing that really just pisses me off so bad. When I see leadership, be like, no, developers need to know how to do all this stuff. It's like, dude, no they don't, man. My accountant and my lawyer both understand law, but guess what? My accountant's not fucking around with torts, he's trying to make sure I'm not cooking my books.

‍They have overlapping concerns but they have very specific services they’re providing to the business. They might have to understand each other’s world a little bit but to say like, “Oh, no, no, no, no, no, no. You have to understand this thing whole hog to just move forward in life.” Like that is disrespectful of people’s time in my opinion. Now I’m not saying, again, that they shouldn’t learn it. I think if they have the time and the curiosity, they should.

‍If I have two engineers on my team and one person is like, “I want to understand how this module works under the hood and how you decided the right database.” I'm like, “That's great. You should learn that. I'm excited for you to learn that.” And what I'm thinking in my head is this person might be a good fit for the operations team or the platform team in the future. And then if there's this other teammate on the same team, and they're like, “I don't give a fuck about this whatsoever.” I think, “That's great. You're gonna pay that guy's salary when you're making the company more money.”

‍Those are two absolutely valid takes on two people on the same fucking team. And I can appreciate both of those people's takes because I realized that they're different people. They're not all apples, right? And you're going to understand people's constraints, their drives, the things that excite them about the business more. And I think it's fine to let them follow that path if they want to learn. If they don't, it's like, great, just ship software, ship value. That's what you're here to do, right?

Dave: 49:05

Yeah, I think there's a Venn diagram of like how the internet works knowledge from developers and ops. And it's like probably, you know, the OSI model, right? And how networking works, right? It's like, does, how do CPUs work? Probably you need that as a sysadmin, you need that as a developer to really understand your constraints there, right?

But it's like more computer science stuff and less like how does AWS specifically handle routing requests, right? Like that's just so nitty-gritty and it… you don't usually find bugs there. And if you do, it's all hands on deck because it's a Heisenbug and it's going to be impossible to track down, right?

I just think, people who are like, “Everyone needs to be cloud experts.” I'm like, this is a job today on AWS, tomorrow it might be on Azure, the day after it might be GCP. Like why? Why should they become cloud experts when really you just want them to ship Node.js code, right?

Cory: 50:02

I think the crazier thing to me about that is the operations world has, I wrote about this in “Elephant in the Cloud”, not much has changed about the way that we write software over the past fucking 50 years. Like I said, I've got some variables, some loops, some stuff changes here and there from language to language, different syntax, maybe we added type systems, whatever.

‍But EVERYTHING has changed about operations seven times in the last 26 years. It's like by the time you learn this stuff, the new thing's here.

‍Even just fucking rewinding 14 years, people would start serverless… little sparks of serverless here or there. And that was the beginning of, “Oh, there's a new way to run our app on the cloud.” And that was a moment where it's like we ran our apps on instances. And then it was like, we run our apps on this serverless thing. But where we're at today is, these cloud services are chunks of our app now.

‍They're not this foundation that we're necessarily… I mean, we're sitting on them, but then we're like, okay, we're touching queues, we're touching SNS, we're touching SES, right? We're touching all these things that are like a little bit of software, right? And like that stuff still changes at the same pace.

And there's so much, as you said, there are so many like cloudisms to it. I understand how a queue works. I understand how Postgres works. I can query stuff from it. Doesn't mean I need to understand the availability zones - like that's probably somebody on the SRE side of the world, right? Or maybe the Ops side of the world, depending on like what your org calls them.

‍But being like, “Shit, no, you need to understand how interzone replication works.” It's like, “No, they don't. If they're not interested in it, that person certainly fucking doesn't need to understand how it works so long as they're building product.” Like that's what I'm paying them here to do. But yeah, man.

‍I think one of the other big things besides us changing the language around how we talk about compliance… like cutting the red tape, moving people towards pre-approved systems, not reactive approval processes… the other big thing that has really landed for us… and I'll tell you what, honestly, surprised, I've always been frustrated by the thing I'm about to say, but very surprised, because I've been trying to figure out how to get it in the product for a very long time, and I know that's annoyed the piss out of you…but that is GitOps.

Very recently we've just kind of said, “You know what, fuck it. Like just fuck it completely.” Like it's not valuable to us. It's not valuable to our customers. We've teased so many potential customers saying like, “Hey, here's how you could do it with Massdriver. Here's how you can work around and get that GitOps flow.” But us… about eight weeks ago at KubeCon, we were just like, “No, fuck it. We don't support it. We're never gonna support it.”

Dave: 52:47

Yeah.

Cory: 52:48

And people are like, “Great!” And I was very surprised that people would be like, I'm fine with that. We don't like it.

‍So GitOps, it's been a hot topic for years. And there's two models… people are like, “Hey, GitOps, it’s been a few years.” ‍Yo, GitOps… you know who invented GitOps? Heroku GitPush, that's who started it way the fuck back then. And they didn't have a pretentious name for it. But GitOps… man, I'm gonna earn some enemies on this episode.

Dave: 53:15

How is it that Heroku didn't have a pretentious name, given they had pretentious names for everything?

Cory: 53:22

Internally, it was probably called like “twinkling waterfall 69-69” or something like that. [laughing]

‍But it's been this hot topic. There's like two GitOps. ‍There's like the GitOps where it's like, hey, I push my application and it ships with an Argo or a Flux or whatever. But then there's like GitOps for infrastructure as code, right? So it's like, I can write my Terraform. It's in the code alongside my app. I've got my Snowflake main.tf module that's referenced into 2,000 other modules. I'm going to GitPush this. It's going to run some pipelines. Look at us, we do automation. It's going to deploy some things.

‍But, like what does GitOps get most teams? Like what is the appeal? I'm asking you non-ironically, like what do you think the appeal actually is of it? And then what are the profound shortcomings that come after it that most teams just kind of miss?

Dave: 54:10

So I think the appeal is generally like… you can check off an automation box, right? Like you can go to your boss and be like I automated today, right? And I think that feels pretty good, especially if you came from a world where nothing was automated.

There is a giant leap that I think… like I know I take for granted because I got the skills early on to kind of like build these systems that were like automatable. But like, I think back to my first like jobs with internet software where it's like, “We're taking down the site on Saturday, and we're FTPing code into web 2 and like, we're gonna test it and deploy it.” That was a nightmare.

And I think this GitOps thing was suddenly like, this is so much better. We can actually just like hit this button and things will happen predictably. But I think the thing that hurts is like, when it's time to know what's going on with the system, Git is just a terrible read tool, right?

Especially when you have microservices or the surface area in the cloud is large. Like what instance types am I using in my orgs? Are they standardized, right? What VPCs have 0.0.0.0 open? These are questions that are tough when your Git is kind of spread through apps or even in a Terralith. It can be really challenging. And so I think like that's a real downside to this GitOps problem.
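For contrast with reading Git, here is a sketch of the same questions asked of a database of deployed configuration. It assumes a hypothetical `deployments` table with one row per service; the schema and sample rows are made up, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical inventory rows: (service, environment, instance_type, ingress_cidr).
SAMPLE_ROWS = [
    ("checkout", "prod", "m5.large", "10.0.0.0/16"),
    ("search", "prod", "m5.large", "0.0.0.0/0"),
    ("reports", "staging", "c5.xlarge", "10.1.0.0/16"),
]


def build_inventory(rows):
    """Load deployment configs into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE deployments ("
        "service TEXT, environment TEXT, instance_type TEXT, ingress_cidr TEXT)"
    )
    conn.executemany("INSERT INTO deployments VALUES (?, ?, ?, ?)", rows)
    return conn


def instance_type_counts(conn):
    """Answer 'what instance types am I using?' with one query."""
    query = (
        "SELECT instance_type, COUNT(*) FROM deployments "
        "GROUP BY instance_type ORDER BY instance_type"
    )
    return conn.execute(query).fetchall()


def open_to_world(conn):
    """List services whose ingress is open to 0.0.0.0/0."""
    query = (
        "SELECT service FROM deployments "
        "WHERE ingress_cidr = '0.0.0.0/0' ORDER BY service"
    )
    return [row[0] for row in conn.execute(query)]
```

Both questions are a single query here, regardless of how many repositories the configuration would otherwise be spread across.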

Then it's also like there's a question of what even needs to be GitOps. Like I think you and I would defend GitOps when it comes to net new code. But if you need to up your DynamoDB read capacity from five to six… what does GitOps do for you there besides like add a barrier when you likely need to scale because of something?

It becomes a fire drill. And I think for like configuration management, it's like your write layer is now hindered and your read layer is hindered. And like, I just don't see any benefit to GitOps once you have good interfaces into self-service.

Cory: 56:11

Yeah, I mean, we use GitOps today. It's how we deploy our app. Massdriver runs on Massdriver, but the app build… GitPush, merges to main and ships the thing. We manage all of our infrastructure also on Massdriver, and we draw it.

The read-write layer is really the funny one, because if you've nailed GitOps… Ooh. We've nailed it. We’ve got a good shared repository where all of our modules are. Developers add one main.tf or they add 20 main.tfs, whatever they want to do, we let them pick. And they love it. They reference those modules. They deploy their pipeline after they copy and paste about 600 walls of YAML. They're having a great time. But they still have AWS creds. Why can't I just take the AWS creds away? They're creating drift. They're clicking on shit.

Agh, if I could just take these AWS creds away, but I can't because that's where they read. That's where they go in to see that the thing… It's not working. Why isn't it working? First place they go is the AWS console to double-check that things worked, right? The amount of times I've seen people go there before they go to like whatever their stupid monitoring dashboard is… because like you don't have a way to read.

‍You can read your code but that is your best approximation of like how you think that this thing is going to go through some other systems. And tearing that down… it's like, well, we have auditability. I can go in there and I can see who changed what and when. It's like, OK, can you?

Like here's an auditability one: How has that instance type changed over time? “Well, I just look for every single diff that includes that file and then loop through them all, parse the HCL, and capture the value of it. It's that easy, Cory.” It's like, okay. What if the directory moved three months ago? Oh, whoops. It's just like, you can read. Yes, I get that you can read, but like the amount of effort… Again, like coming back to this idea of automation, your database is a glorified text file, glorified tree of files. That's not a database. It's a version control system. Right? And like that read layer is not there.
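The "how has that instance type changed over time?" question becomes a single lookup if changes are recorded as rows keyed by a stable resource identifier instead of a file path. This is a sketch under assumptions: the `config_changes` table, its columns, and the identifiers are hypothetical.

```python
import sqlite3


def open_change_log():
    """Create an in-memory change log; a real one would be persistent."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE config_changes ("
        "changed_at TEXT, actor TEXT, resource_id TEXT, "
        "field TEXT, old_value TEXT, new_value TEXT)"
    )
    return conn


def record_change(conn, changed_at, actor, resource_id, field, old, new):
    """Append one change; changed_at is an ISO 8601 UTC timestamp string."""
    conn.execute(
        "INSERT INTO config_changes VALUES (?, ?, ?, ?, ?, ?)",
        (changed_at, actor, resource_id, field, old, new),
    )


def field_history(conn, resource_id, field):
    """Return every change to one field of one resource, oldest first."""
    query = (
        "SELECT changed_at, actor, old_value, new_value FROM config_changes "
        "WHERE resource_id = ? AND field = ? ORDER BY changed_at"
    )
    return conn.execute(query, (resource_id, field)).fetchall()
```

Because the key is the resource rather than a directory, moving code around in a repository does not break the history.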

‍And that's why Devs still have cloud, that's why they still end up creating drift. I mean, they also do it because it sucks to manage it that way, if we want to be honest.

‍But we just start to then end up in the next sprawl. Which is, OK, well, we don't understand the system. So what are we going to buy? Or what are we going to find in the CNCF landscape that can solve this problem? And now we're in tool sprawl. We're like, OK, well, we don't have a database of our configs. We don't understand the configuration of our system. It's auditable if somebody wants to sit down and grab an accountant and work their way through it, but it's not actionable.

Dave: 59:07

Yeah.

Cory: 59:09

Everything that we do in software…most of the time it's reading. Whether it's pulling data from the database or us looking at it. We write code once and for the most part we're reading it. We get that first read when the PR opens. We get that second read when somebody is changing it. We read this stuff a lot more than we write it. And the read side of GitOps is a fucking terrible story.

But this is where we've landed. And like, it's madness to me, right? So like...

Dave: 59:36

It's madness now, right? It's madness now. It wasn't at a time, right? Like I remember merging a PR and watching Heroku go and that was awesome. But here's the thing. I had zero choices about load balancers. I had barely any choices about instance types. I had barely any choices about databases. There's this whole layer of stuff I didn't have to think about, which was really cool. And that was kind of the trade-off.

And I think like with the full mass of the cloud and how we use it, that same Heroku experience just isn't possible. I have more stuff I need to read that changes more and more every day, right? And it's just like, cool, maybe for configuration and state change of the cloud, we do the same thing that the clouds do and we just store it in a database so we can have a better read interaction there. And then if we're going to make a security change, like we're going to change the encryption on all our S3 buckets, like that's probably a decent place for GitOps and we should roll that out.

I think where having an API becomes really powerful though is imagine making that change of like, we're going to change the encryption type of all of our S3 buckets for encryption at rest. Imagine having a database of all of the live configs of S3 that you can now test that change against to be like, “Will this rollout clean?” and then initiate that rollout.

Think about doing that in Git. What you do is you make that change to your module, you go into every place that consumes it, you change a ref, you run a plan… ooh, this one's not good, right? And you're going to do that over and over and over and over again, and you're just not going to gain any speed. Even the ability to just test the change against the configuration of every S3 bucket, to be like, three of them are going to be bad, and I'm going to have to focus in on these three, would save you a ton of time.
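As a rough illustration of testing a change against a database of live configs before rolling it out, the sketch below checks a proposed move to KMS encryption against a list of bucket records. The `BucketConfig` shape, the bucket names and the rule itself (a bucket rolls out clean only if a KMS key ARN is already recorded for it) are assumptions made up for this example, not how any particular tool works.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BucketConfig:
    """Hypothetical record of one bucket's live encryption settings."""
    name: str
    sse_algorithm: str          # e.g. "AES256" or "aws:kms"
    kms_key_arn: Optional[str]  # None when no key is recorded


def check_kms_rollout(buckets: List[BucketConfig]) -> Tuple[List[str], List[str]]:
    """Split bucket names into (clean, needs_attention) for a move to aws:kms.

    Assumed rule for illustration: the change applies cleanly only when
    the bucket already has a KMS key ARN to encrypt with.
    """
    clean: List[str] = []
    needs_attention: List[str] = []
    for bucket in buckets:
        if bucket.kms_key_arn:
            clean.append(bucket.name)
        else:
            needs_attention.append(bucket.name)
    return clean, needs_attention


# Illustrative sample data standing in for the database of live configs.
live_configs = [
    BucketConfig("logs", "AES256", None),
    BucketConfig("uploads", "aws:kms", "arn:aws:kms:us-east-1:111122223333:key/example"),
    BucketConfig("backups", "AES256", "arn:aws:kms:us-east-1:111122223333:key/example"),
    BucketConfig("exports", "AES256", None),
]

clean, needs_attention = check_kms_rollout(live_configs)
print("rolls out clean:", clean)
print("needs attention:", needs_attention)
```

Here the check runs once over every record and names the buckets to focus on, instead of bumping a ref and running a plan in each consuming repository.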

Cory: 01:01:21

Yeah, yeah. And that is wild, right?

‍And you can do this. I'd say it's going to require some effort. But you can get these configs into a database on your own, whether you build a little UI around a Terraform module or whatnot. But getting these configurations into a proper parameter store… something that you can actually query, aggregate on, manipulate at scale, I feel like that's the next phase of automation, right?

‍We have so many teams that are still underwater. There are still a lot of orgs that like are still adopting the cloud. Whether or not you think everybody should move to the cloud, like there's still plenty of people that are trying to today, and they're getting there slowly. The folks that are in the cloud, many of them are still adopting IaC.

IaC doesn't have a massive market penetration. Like it feels like it does because this is our space and we're in it all the time.

Dave: 01:02:12

Yeah.

Cory: 01:02:12

It's IaC everywhere we look, but for the average org, it's like, a lot of them are still doing ClickOps. And they're having a good time. I mean, they're not, but they're having a good time - It's fine. It works, right?

‍But getting these things into databases does really start to unlock just an amount of power as an operations team and a cadence that is unmatched. When you can do things like… Hey, I do actually want to know maybe every single R6G class instance that we have, whether it's in EC2 or Redshift or RDS, whatever, like I need to find them all for some reason, right? ‍Or maybe there is something like encryption at rest. Like how many of our databases don't have encryption turned on?

That's a hard one. I mean, I'll tell you what, it's a particularly hard one to find, especially if you are letting developers make their own databases. That's kind of the fear a lot of Ops people have for self-service. It's like, you don't know what people are doing. It's like, “Oh, we let them write their own Terraform. Well, they wrote their own Terraform, and they forgot to copy our Checkov script that says everything has to be encrypted at rest.” Oops.

‍But now if I can easily query for that and see it in a matter of seconds and then say, “Hey, what module did this come from? What team is it? And is it a parameter that we can actually change without causing a recreate?”

Apply that change in bulk, right?
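The kind of fleet-wide query described here can be sketched as follows, assuming the live configs already sit in a queryable store. The `databases` table, its columns (including the `recreate_on_change` flag) and the sample rows are hypothetical. It answers three of the questions raised: which resources use an R6G class, which databases are not encrypted at rest (with their team and module), and which of those could be changed in bulk without a recreate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE databases (
        resource_id        TEXT PRIMARY KEY,
        team               TEXT NOT NULL,
        module             TEXT NOT NULL,    -- module the resource came from
        instance_class     TEXT NOT NULL,
        encrypted_at_rest  INTEGER NOT NULL, -- 0 or 1
        recreate_on_change INTEGER NOT NULL  -- 1 if changing encryption forces a recreate
    )
    """
)

# Illustrative sample data, not taken from any real system.
conn.executemany(
    "INSERT INTO databases VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("orders-db", "checkout", "postgres", "db.r6g.large", 1, 1),
        ("search-cache", "search", "redis", "cache.r6g.large", 0, 1),
        ("events-db", "analytics", "postgres", "db.m5.large", 0, 1),
        ("reports-db", "analytics", "mysql", "db.t3.medium", 0, 0),
    ],
)

# Every R6G class instance, regardless of which service it belongs to.
r6g = [
    row[0]
    for row in conn.execute(
        "SELECT resource_id FROM databases "
        "WHERE instance_class LIKE '%.r6g.%' ORDER BY resource_id"
    )
]

# Everything without encryption at rest, plus who owns it and where it came from.
unencrypted = conn.execute(
    "SELECT resource_id, team, module, recreate_on_change FROM databases "
    "WHERE encrypted_at_rest = 0 ORDER BY resource_id"
).fetchall()

# Candidates for a bulk change: unencrypted and no recreate required.
bulk_safe = [resource_id for resource_id, _, _, recreate in unencrypted if not recreate]

print("r6g:", r6g)
print("unencrypted:", unencrypted)
print("safe to change in bulk:", bulk_safe)
```

The resources that would force a recreate still show up in the result with their team and module, so they can be handed to the owning team rather than changed blindly.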

Dave: 01:03:37

It makes a lot of sense.

I mean, that's the world I want to live in. I know that's the world you want to live in, right? It's just the way we were doing it before is tough as the cloud grows, right? And I think like that's that switch from consulting team to service team, right? I think a lot about the article you shared the other day about Amazon and Jeff Bezos and his kind of like rule around how teams will communicate with each other. And they were like, you will have an API. There won't be direct database access.

I don't think Amazon becomes Amazon unless that happens, right? It's like I have a bunch of atomic units that can communicate with each other through some like (A) auditable, but (B) like self-documenting way, right? With APIs.

You know what's funny? And now that I say this out loud… the Metas, the Ubers, the whatever, they have this already. They have this service team that can provide infrastructure self-service. And I think like we see a lot of defense of the GitOps way. But it's like the biggest, most successful companies do this at scale without GitOps. Without their engineers learning the cloud or their infrastructure if they're in a data center or something like that, right? It's just like this is where the industry is. The question is when are we going to catch up to it?

Cory: 01:04:54

Yeah. And it's funny because like I feel like we've heard, “Well, that's… I mean, they're Amazon.” It's like, yeah, but they weren't. I mean, it was the company name, but they weren't how we know them today. Right?

‍Like that was a couple of things. It's leadership. It's like true thought leadership. Right? And it's not just somebody saying, Hey, we're going to do this. And I don't have a strategy for getting this there. I just… I decried it. I stood on a mountain and screamed it and I hope people do it. But like, it's not just thinking these things. It's actually like figuring out how to get them done.

‍There are some skills, I think, that a lot of engineering, particularly operations side engineering management, doesn't necessarily excel at. A lot of organizations, I think, are struggling with this - that internal marketing skill.

You've got to be able to walk the walk and talk the talk to the rest of your org about how you're going to do this. Right? You've got to set a strategy for these teams to be able to get it. Can I get them budget? Can I get the business to pump the brakes so that I can show that these guys can create some value? And I feel like that's a part… that's a hard one. That's much harder than learning yourself some Terraform.

You know, I saw this early this morning on Reddit. Somebody was like, “I just got into a director of operations role. Like, what should I need to know?” And I'm just like, “Bro, you need to know how to fucking budget. You need to know how to prove your team's worth. You need to know how to market your fucking team to everybody else.” Like everybody's like, “You’ve got to learn CI/CD.” And I'm like, “Fuck, no, you don't. No, you don't. You're just gonna become the top fucking meat gate.” Like you need to figure out how to operate an operations team within a business where the rest of the business can understand and respect it so that you can get the same access to resources that they do, right? You don't want to be a support team.

Dave: 01:06:37

Right.

Cory: 01:06:37

Like you want to be something that's enabling that business, not something that's just kind of supporting it. I feel like that's one of the places… we're missing a whole hell of a lot of that in a lot of orgs. And it's hard to get to.

‍That being said, this is my last question. You've been in the space a long time. We've built this startup together. Like, what are some key lessons or like surprising things that you've kind of learned?

‍Maybe it's while at Massdriver, like throughout your career, that you think a person stepping into a new Director of Operations, Director of DevOps, Senior VP of Platform Engineering… like one of these roles where you have this team of people that are probably underwater. They're trying to serve as many developers as possible, but they're fucking stressed. They're getting paged at 2 a.m.

Like, what advice do you have for that person who's trying to figure out like, how do I actually make this team the things that you read about in The Phoenix Project, right? Like, how do I actually do this?

Dave: 01:07:36

Yeah, so I guess the first thing I would say is like Fred Flintstone hit yourself in the head with a frying pan and forget that you were a programmer. You're not thinking at that level anymore. That's really not what like operations within a business is about. It's not about like, should we be using Helm or Kustomize, right? Like that's not your concern anymore. That's your engineers' concern.

I think the question you need to ask yourself is, “Who are the stakeholders in this business? Who is making demands of things that we need to produce in the cloud? And who is putting constraints on those things?”

Is your CFO happy with the cloud bill? Is it growing at a rate that they're a little bit alarmed with? A lot of people don't think to go hit up the CFO and ask that question, right? Like, are we hitting a point in time where the cloud bill no longer makes sense given our revenue? Or are those two things… trajectory… the right direction together, right?

I mean like, I think directly asking the CISO, “What are our compliance constraints? And where can we actually eliminate some of the ceremony while still enabling that compliance?” That's where your speed is going to come from - just not being over encumbered by stuff that is just kind of poorly understood and it's always the way you've done it.

I think with like SOC 2 people were like, “Well, you definitely have to have that Git code review thing in a service.” Again, before any of these technologies, there was SOC 2. Which means there's some way to do it that is not the way we do it today. There has to be a quicker way.

Those are my two big ones. It’s just like… map those stakeholders. Your Devs, your product teams are 20% of this cloud equation, right? They have needs. But you have other stakeholders that have constraints. You have to map those and understand those and then figure out a system to like satisfy everybody. That's where you start to become like a manager and a thought leader instead of a coder that is now in management.

Cory: 01:09:39

Yeah, I like that.

‍As a director, it's nice to touch code every once in a while just so that you can understand the state of the system. And I used to do this - when I was director of engineering, I let everybody pick a ticket that they didn't want to do, and they could assign it to me. And I just did the shit work, but I’d spend like a day a week writing code, right?

That was something I always did so I could understand like what the worst was. And it got to a point where I'm like, “The worst isn't that bad anymore. Like we're not doing this anymore.” Because like everything started to hum, right? That was how I got my feet wet in an org - give me the shittiest stuff. I want to understand how bad it is.

‍I think that's really the key is like, you have to shed that, “I was the Ops person here for 20 years and now I'm running it.” Like, okay, no, it's a different role now. Or if you're stepping out to another company, “I did ops here for 15 years. Like I understand how to do it. I'm going to go do it at this other company now as a manager. And I've got people underneath me.”

‍It's a new job. Like you do just have to kind of recoil from it a bit. Like you understand it. And that's important because that's going to help you do like the tactical things. Do I have the right teammates? Do I have like too much overlap? Do I have enough people?

‍Like that switch from being a tactical operations engineer… that sounds a little too aggressive now that I say it out loud.

Dave: 01:11:01

Yeah, yeah, yeah.

Cory: 01:11:02

Yeah, not that… we ain't shooting people.

‍Right, like just the strategic thinking of like, how do I do this at scale across this organization? It becomes a game of strategy, right? You are playing with a lot of different variables. You're playing with CFO money and CISO concerns, right? Like it is a big shift.

Dave: 01:11:19

Yeah.

Cory: 01:11:20

It is a big shift. And I feel like that's where we've fumbled as a community, time and time again, is management that doesn't go out and do the right work by their operations teams so that they can actually become the next Amazon. ‍It's possible. People do it all the time. But I think what's crucial is that person that's sitting at the top's ability to think strategically about it.

I love that.

Dave: 01:11:45

Yeah, and I think that's my advice for everybody just in general.

One of the things I did through my entire career, I had a really, really good CTO in my first job who was… he was skeptical of new technology. He was very like, “We don't need that, we work for a business.” I just remember looking at him and having to think through… he would say things that I totally disagree with. And I would just kinda sit back, because I respected him, and I would go like, “What is he thinking right now? Looking from the top down.”

And I’d try and figure it out. Sometimes I was probably wrong - I never really asked them. But I've done that my whole career where it's like, “Okay, if my job was to enable some business thing, what would I be saying at the top?” And sometimes I agree with people, and sometimes I disagree. But I think like getting into that practice of being like, “PHP doesn't matter to me anymore. Revenue for the next quarter does. Getting to the next round does. The stock going up does. How do I change the way that I talk about what we're doing?”

I'm probably not saying like, “Burn down our old platform.” I'm trying to be like, “We need to invest in this. This service is slow. Let's move it out of the old platform.” instead of like flat out rebuild. And I think like, I don't know, that led to a lot of maturity that put me in a good place for management.

Cory: 01:12:55

Yeah. Last question, I promise this is the last one. The other two were also the last two questions, but this is the final one.

‍Where's the space headed - next three to five years? In three to five years are we in the same place? Do I need to put a bigger timeline on it? Like, where does platform and DevOps roles go from here? There's a lot of debt.

Dave: 01:13:17

There's a lot of debt.

I think it depends how much you believe in AI. Which I don't, and I definitely don't believe in it for managing the cloud. I'm still kind of stuck on like why does my AI write Ruby instead of writing Assembly. It doesn't need to communicate with me. It needs to communicate with the computer. And I just don't think… I don't think we've gotten to there until we're there. I don't think that's on my radar in the next 10 years.

I think the biggest thing is going to be APIs. I think a couple of interesting things are happening. People are understanding the ceremony around Git is a real pain, and it's causing pain. And I think we've seen that more in the last month or two than ever. If I go like, “Is this how you do it? Does it hurt?” They're like, “Yep. Yep, it does.”

I think we're seeing a lot of people look at Backstage and realizing that it's a great read tool for managers and creating JIRA boards, but it's not a cloud management tool. But I think people are like into the idea that like a UI that's just a series of forms that can do this advanced stuff is really important.

So I think that's where we're headed - the idea that like we need a platform. I worry about the term platform engineering. I hope that kind of goes away because I think the well's already poisoned with it, the same way that like it kind of was with DevOps early on. DevOps’ big thing was like CI/CD pipelines and that's it. The way we manage the cloud didn't really get solved there.

Cory: 01:14:49

No.

Dave: 01:14:49

So whatever it is… databases and APIs for automating the self-service of cloud management, I think, is where we have to be.

Exactly what that looks like, I think there's a ton of people with interesting ideas. And so it'll be interesting to see how those ideas kind of continue to grow and form and people adopt them.

Cory: 01:15:09

Awesome. Well, I appreciate you coming on the show.

Dave: 01:15:12

Thanks. I appreciate you having me. I feel like we never get to talk.

Cory: 01:15:16

I'll see you later today. [laughs]

Awesome, everyone. Thanks for tuning into this episode. Like I said, the next couple of episodes are going to be talking to different folks in our space, a lot of the folks that are involved in OpenTofu. And it'll just be about what the DevTool makers are building, like what their vision is for the company and where they see Platform and DevOps going in the next couple of years.

Awesome, everyone. Thanks so much, and we'll see you next time.

Episode 23

22nd Jan 2025
