Engineering Culture Change with Stack Overflow’s Peter O’Connor

Episode Description

Join us as we learn about Stack Overflow's monumental shift from on-premises infrastructure to the cloud with Peter O'Connor, Director of Platform Engineering. Peter shares invaluable insights on navigating the complexities of migrating a beloved developer platform, balancing technical challenges with team dynamics, and fostering a culture of innovation. From the intricacies of lift-and-shift strategies to the nuances of adopting microservices, this episode offers a masterclass in modern platform engineering. Discover how Stack Overflow is revolutionizing its architecture while maintaining its core mission of serving developers worldwide.

Episode Transcript

Welcome back to the Platform Engineering Podcast. I'm Cory O'Daniel. Today I have a special guest, Peter O'Connor, the Director of Platform Engineering at Stack Overflow.

Now, if you've ever Googled a programming question, you've undoubtedly ended up on Stack Overflow. It's the go-to resource for developers worldwide. If it's not working, it's probably the first place you're going, right? 

What you might not know is, behind the scenes, Stack Overflow is undergoing a monumental shift from running their own hardware to the cloud. And it's a journey that's as fascinating as it is complex, and Peter's right at the heart of it. We're going to dive into his career, the challenges of moving a beloved developer platform to the cloud, and what the future holds for Stack Overflow.

Peter, welcome to the show.

Cory, thanks for having me. Really appreciate it. Happy to be here.

The Challenges of Having an Apostrophe in Your Name

Yeah, so before we dive in, I want to talk about something that's near and dear to my heart. This is off the path of Platform Engineering. The burden of an apostrophe in the last name–I'm sure you know the pain.

Can you give our listeners a peek into the apostrophically challenged? How often do you find yourself at the bottom of a form, and you're like, “I'm about to hit submit”, and you do, and you find yourself scrolling all the way back up to the top to fix your last name.

It happens at least weekly. You know, in 2024 we have not solved special characters in names and I don't understand what happened that we got here. And going to the bottom of the page, you're right, that's like the worst thing. The other worst thing is when you go to do credit card entering, and you're like okay, my credit card says there's an apostrophe in it. I go to type it into the payment portal and the payment portal's like you can't have apostrophes. I'm like I don't know what to tell you, you're not going to take it otherwise.

One of the funniest things happened the other day. I ordered something. Not through Amazon, I don't know where it came from. But when I got it, my name was O, Ampersand, like Pound symbol… it was pretty much the HTML code for an apostrophe. I was like, “Now that!” I love that somebody HTML escaped it before they shipped it. That was a fantastic one for me.

I will tell you, sometimes I have to hold back to think like O, apostrophe, like pipe, drop table. I have to resist it every time.

You can see where it's happening though, right? Like you will hit submit on forms every once in a while and you'll get a 500 error. And I'm like, I know something is amiss here. There's something I could fiddle around with, but I’ve got to get back to work, so I'm not going to. But maybe I'll drop them an email. I'll hit their bug bounty page up. 

Right.

Awesome. We should actually… well, maybe at the end of the call, we'll Google that problem. See if we can find the answer on Stack Overflow.

Nice, nice.
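For readers who want the apostrophe problem in concrete terms, here is a minimal Python sketch of the two failure modes joked about above: a name that gets HTML-escaped and never unescaped, and a name that breaks a hand-built SQL string. The table and column names are hypothetical, and this is only an illustration using the standard library, not anything described in the episode.

```python
import html
import sqlite3

name = "O'Connor"

# Escape only when rendering into HTML. Python's html.escape turns ' into &#x27;
escaped = html.escape(name)

# Anything that is not HTML (a shipping label, say) should see the original text,
# so the escaped form must be unescaped or, better, never stored in the first place.
restored = html.unescape(escaped)

# A parameterized query sends the value separately from the SQL text,
# so the apostrophe needs no manual quoting and cannot alter the statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (last_name TEXT)")
conn.execute("INSERT INTO customers (last_name) VALUES (?)", (name,))
stored = conn.execute("SELECT last_name FROM customers").fetchone()[0]

print(escaped, restored, stored)
```

The design point is that the raw name is stored once and escaping happens at each output boundary, rather than rejecting the character at the form.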

From Chemistry Teacher to Platform Engineering

So you’ve got an interesting background. Before getting into development, you were a chemistry teacher, a computer science teacher, now you're leading engineering teams. How did you go through that transition from education to software, and how has that experience in education influenced your leadership and approach to engineering?

Dramatically. The similarities and information I've gotten from teaching just really taught me so much about how to be both a good leader and like a good platform engineer. 

I went out of college thinking like chemistry is the way it is. Don't go into computer science. You don't want to do the thing you love, you might just hate doing it as a job. So I was like, sweet, do chemistry. And then I was like, teaching… you know I've always been pretty good at teaching… so I was like, sweet, let's give this a shot. So I taught high school for a while and college and it was amazing. 

Inviting people into a world and a domain that you love is really exciting and brings a joy to my world that I really… I kind of thirst for. And that's why, when you talk about how does this make a good manager or platform engineer, you know, it's either customers or the people that I'm trying to empower as my team, to say like, listen, I want to help you out. I'm here to set you up for success and get you to be very, very successful here. And what can I do to bring you into that pit of success? 

Teaching, it's a lot about that. So the skills that I get about making sure we're building the right thing for the customer comes from teaching. Where you're like okay if someone doesn't want to learn a subject, like why? What do you not like? What is not interesting you here? And every person's a little different, so you have to really like figure it out and understand. You have like your main path, like the class itself. And you're like okay, well generally these are the reasons that people don't like chemistry. It's just kind of boring, dry– I can solve that. 

And you have the edge cases, right? Well, there are some students who just don't want to learn even after that. And I really blew something up for them, and they still don't want to learn chemistry. Like, what? You're driving for the next “What's wrong?”, “Why?”

That translates into the platform engineering–whether it be a DevEx team, a tooling team, whatever you're calling your platform team–what your team's trying to do is it's serving your product engineers. 

Some of them have some pretty easy ones: “Be really nice if I had a great GitHub action flow” or “I wish I could just get an easy way into Kubernetes”. Cool, I mean that makes sense. But then there are other people who have lots of other issues that are like not in the mainstream, and so you have to then search them out, find them where they are, and then like bring that back. And then say okay what are we going to do to like bring them into it. 

It's a lot of the same skill sets about teaching that go into both the managing and, what this podcast is about, the platform engineering. It's a lot of that.

Yeah, and it sounds like in both roles you occasionally just blow some stuff up.

Oh yeah, absolutely.

A little production, some gases over here, just blow it up. It's very funny you said that because, as you said that, I actually had a flashback. I remembered the guy's name, Mister Vaughts, if he's still out there. He was my high school chemistry teacher, and he definitely blew stuff up in class all the time. Which, I don't know, is that just like a high school chemistry teacher thing? Is that the move?

Yes, it is. If you really want to get people on the thing. My first day is a demo where… I got taught by this seasoned veteran, Sam Lucas, if you're still around, but doubt he'd be listening to this… is you take a Pringles can (I think they're still in cans, I haven't bought one in a while) and you put a hole on one side and on the other side, and then you just (you obviously eat the chips) and then you just fill it up with hydrogen, put it on a little stand, and you light the top and the hydrogen makes this little candle.

It's the best demo ever. The candle, like it starts going lower and lower and lower and lower because the flame is getting lower and oxygen's coming in. This isn't in a chemistry lesson here… but what's really fun about it is it gets inside, the fire's inside the can for like a good like three to four seconds. And then it finally reaches critical mass of oxygen to hydrogen and just explodes and shoots up like a rocket. So you'd like just convince the students, you're like, “Oh man, I messed this up. God, I really screwed it up.” And you just like walk away from it.

Oh man, I've gotten so many students. I swear students have jumped out of their seats across the room.

That's funny. He did a… What is it called? Like a mulligan? The big water bottles. You know, those big ones you put on top of the thing. 

Yeah, yeah.

He did one of those, and it shot across the room and broke some stuff. And I think he actually did mess that one up. I think it went a little further than he meant to. 

Stack Overflow's Journey to the Cloud

Well, so, okay. What's interesting here though, you've been running on prem since Stack Overflow started, right?

Yeah.

Moving to the cloud, so you actually have a migration going on. This transition culturally within the org to platform engineering at the same time. And then you also kind of have some upskilling you have to do at the same time, because you have to get your team educated on how to manage stuff in the cloud. So I would love to get into that a bit, I'd love to hear how you're juggling it. 

But what was the goal in Stack Overflow moving to the cloud? Because I feel like there's so many blog posts about the monolith of Stack Overflow, how they run on prem. I feel like on Hacker News, it's always like, “You don't need microservices. Stack Overflow has been a monolith for 20 years. Just do it that way.” 

What was the final thing that pushed Stack Overflow towards starting to adopt some of these other processes and architectures?

We could dig into this question a lot because there's been stops, starts and continues along the way. In Hacker News, it's always fun. It's seeing how people model our architecture in Hacker News or on LinkedIn. I'm like, you know, I don’t know. I saw something earlier today, it was like you're cacheless, you don't use Redis. I'm like, we leverage Redis actually pretty heavily here. But anyway, there are a ton of blog posts and it's something we're really proud of and I'm proud of. I wouldn't be here today if the people who started this company didn't start with the monolith.

I think that was an amazing decision. I think what they've done to it to optimize it is fantastic. And it's a really, really slim and cool and highly optimized model. That's huge! The fact that they were able to do this on nine web servers is insane to me. I think it is just a testament to what you can do if you buckle down and do things and just try to heavily opinionate and optimize. And so that's awesome.

The thing is though, like, when we're thinking about going to the cloud, well why would we want to go to the cloud? Well, we already did that with our SaaS product, our Stack Overflow Teams product. It got lifted into Azure. There's some blog posts on that as well.

That was not easy. That was mainly for a compliance reason, like it was a good way to add value to the business. But for the public platform, like it's a harder sell. Well, the decision, you know, mostly came down to the idea that, well, we have a data center in New York and that data center is closing. So we could do a few options. We could find a new data center. We could probably just lease equipment from somebody (and say, hey just you take care of the equipment part, we don't want to be in that business anymore) and run it. 

When we took a look at those options and then compared it to like a cloud migration, it definitely made more sense just to go to cloud. It made sense from like a money point of view for us. Like if we could just go to the cloud, if we look at the estimates, it's going to help us kind of lower those costs, and we're maintaining two data centers right now, because we run like an active-passive kind of mindset here. If Stack Overflow goes down, there's usually a read-only mode that you go into, and you're in a different data center then.
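The active-passive arrangement with a read-only fallback can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the `DataCenter` type, the health flags, and the status codes are hypothetical and say nothing about how Stack Overflow actually implements failover.

```python
from dataclasses import dataclass

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass
class DataCenter:
    name: str
    healthy: bool
    read_only: bool


def choose_target(primary: DataCenter, standby: DataCenter) -> DataCenter:
    """Prefer the active data center; fall back to the passive one if it is down."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy data center available")


def handle(method: str, primary: DataCenter, standby: DataCenter) -> tuple[int, str]:
    """Route a request, refusing writes when the serving data center is read-only."""
    target = choose_target(primary, standby)
    if method in WRITE_METHODS and target.read_only:
        return (503, f"{target.name}: site is in read-only mode")
    return (200, f"{target.name}: ok")


primary = DataCenter("primary", healthy=False, read_only=False)
standby = DataCenter("standby", healthy=True, read_only=True)
print(handle("GET", primary, standby))
print(handle("POST", primary, standby))
```

With the primary marked unhealthy, reads are served from the standby and writes are refused, which is the user-visible behavior described above.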

Oh, interesting.

Yeah, it's actually really cool. But then we said if we could just do that in the cloud, right now with GCP, that would be the best thing for us to do. And we get some upskilling. I wouldn't say this is like a business per se…well, I would say maybe it is a business decision... but upskilling your engineers and making sure that they like feel like they're doing something technologically relevant is also good.

So like if we're saying, hey, we're going to help you be cloud skill ready, that's a win too, right? Because we really believe in our people. And if we're not really doing the best thing to help them in their future… because they're not always going to be with us, at some point they will likely move on. Let's make sure that they're ready for the next gig too. 

So, it's not like a single answer. I mean if it was, it's definitely the money. It's economical for us to go this way, but it's the people too.

Yeah. And are you feeling like in that migration…is it a lot of, we've got stuff on hardware and we're moving it almost like lift and shifting it to the cloud, or are you actually doing some like re-platforming of the tooling you have there and services?

Yeah, if I would characterize it today, it's a 90-95% lift-and-shift strategy. The big thing we want to try to do though is we want to try our best to sensibly default to a service pattern for new things so we're not adding more into the monolith. Because the more we put into that, the harder it becomes to lift. We think about cloud running on this, the costs of running that can just balloon if we don't do it in a cloud native way with services.

So for us, it was like, we need to get out of the data center. We don't really have time to sit there and extract every service that we want to. So the best thing is we'll lift and shift it out and get it there. 

One of the key things is there is a supporting service architecture already. The way we serve tags is through actually a service. Is it a microservice? I don't know. I don't know what you call a microservice. Is it lines of code? Is it how large the domain is? I don't know. But there are many other services that support it. It's not a ton, but they exist. So it's not like there's only the monolith.

Yeah, I'm not sure where the line is between a microservice and a service, but I just tend to make all my services right-sized. I just nailed them the first time. 

Nice, nice. There’s a book there somewhere… if you make them right-sized the first time.

There's actually a fantastic blog post called… it's microservices… split join criteria for microservices. It's got a really funny, it's like a nihilist approach… I'll have to find it. It's a great article on like how to size services when you're splitting up and like when to decide to split and join things. I'll look up the name and put it in the show notes at the end.

Yeah, I feel like getting that size right is always a challenge, right? And you know, it's bigger than a single MySQL table.

Yeah, yeah, much bigger.

Yeah, much bigger than that. Very cool.

So you’ve got all these things kind of going on at the same time. Was a part of this shift towards the cloud also to start adopting platform engineering tooling and practices at the same time, or did that predate the cloud decision?

The Birth of Platform Engineering at Stack Overflow

That was probably around the same time when we wanted to do that. At that point, Stack Overflow was trying to say, OK, let's actually do this in a cloud way. Let's just create a platform engineering team to do it, to help us along the way. And that will make us cloud ready.

I was part of the original hiring process. I was part of like, let's create this platform group. I was hired on as an IC on what we call core engineering. And we were trying to do some of that stuff. We wanted to create a service baseline. What does a service baseline look like? We also were trying to create our own identity platform, to replace the provider of identity that we have for Stack Overflow right now. 

We tried those things and it was sort of like, this will work, right? Make a platform engineering team, call it platform engineering, and platform engineering happens. And that definitely, in my opinion, would be one of our false starts.

It didn't work very well. We ran into roadblocks a lot. And there was just uncomfortableness about what to do and how to go about it. The people culture at Stack Overflow was great, but the culture for like, let's do services just didn't exist yet. So it was definitely something that was a struggle.

With the cultural hurdles there, did you already have buy-in? You had buy-in from the org, it was just like figuring out how to do it was the problem? Or was it like getting the amorphous blob of the org, like getting them on the same page, was that like the problem? Or was it just like the actual, okay, now they're in, how do we do this?

Yeah, I think it was more… the amorphous blob is a good one... The leadership had definitely said we should do this, let's create services, this is how it's going to work. And they were into it. And so it was a top-down style, you know, kind of let's do this. The ICs and the other teams were not really like invested in that, right?

They've been working in the monolith, they know the monolith. They know how it works. I can put my feature in the monolith. Product wants me to make this new thing that's going to be good for Stack Overflow Teams or for the public platform, and it's going to help our community out. And that's great. And they didn't see the value. So it's like having a lemonade stand in the middle of an orchard of lemons. No one needs it because we already have all these lemons around here. 

So to get the buy-in, the things we were trying, like I was talking about, just didn't work. And that really kind of occurred for a while. And my team had some frustrations around it. We wanted to help people be cloud native and lift things into the cloud, and it just was a big effort that didn't get very far.

So I would say that's probably one of the biggest reasons it did struggle. I don't think they were wrong for trying that idea. That works probably in some places. It just didn't work here.

If it works in places, I feel like it's a subset of them. I feel like that's a common story of like, okay, we're doing this. My company sponsored re:Invent last year and the number of people that we met that were platform engineers (like that was their title) but they were like, we're just kind of Ops rebranded. Right. 

There was a wild amount of skepticism about just like the idea and the role at re:Invent, because so many people were kind of in this like place where they're like, “We got the raise. We got the new title. It looks great on LinkedIn, but like I'm doing the same work.” And that's a bummer for most of them, right? Because they thought that their job was going to change and they might be building something. They're producing something, but the team's not using it. Like that culturally is like, it just feels like a stone in the gut. 

Exactly. When I was like deep in on that team it was just like we're trying. 

I remember we even sent a survey out. Like we said, okay we’ve got to use some of your soft skills. Time to soft skill this up and see if we can just get some buy-in.  And I remember that was fresh off of when we had done this identity service, and we said nobody wants this thing so let's just kind of put it back for a while. And we did the survey, and you know, we had some people right that were interested. And we interviewed them and they were like, this is cool I like this. But you know, they weren't the majority. The majority wasn't you know looking at doing that, and it was just another gut punch.

As a new team too, one of the things we had, is there were still some services… I'm betting almost all platform teams (at least everywhere I've been), you start a platform of some sort and there's always some amount of like these services that just like nobody wants, and you just kind of get them, right? And you're like, okay.

[Cory laughs]

You're like, sure I'll own that. I don't know what it does, but I'll do it. So then we just switched gears and I think for a few months we sat around, well we didn’t just sit around, we investigated their system. We said let's just focus on this for a while, and we'll come back. And that's what we did.

Yeah. How did you get that first IC that was outside of your group that had that “aha” moment or like, “Okay I am the champion outside of this platform engineering team of this platform.”? What was it that finally got that first person or first team invested in what you were doing?

Embedding and Enablement: Keys to Success

Yeah, that was definitely a big pivotal moment. I would probably take a step back first though and say that like around about a year and a couple of months ago… last year was pretty rough for a lot of companies, and so we had also gone through a layoff and it was hard.

But during that time, both before and after, there was a partnership made between Reliability Engineering at our company and my team. I had become manager at that point, and we kind of got together and said we need to find a way to make this work. What we tried before didn't work. 

So (I told you before about some of the services that we had), what we did is we said, okay, first we just need to know can we even do this. Like you said, Stack Overflow has been on metal for so long, like, how do we take a service and put it into the cloud, right? Most of us know how, but like here, here, how do we do that? 

So we took two services. There's one called Traducir, which is a translation system, a community driven translation system. And IDP, our identity provider… I wouldn't say old school, I don't want to offend anybody… the legacy one that works great, and we still use today. How do we get them into the cloud? So my team, like I said, we partnered with them and said, let's just try this. We did it, we got them up there, and we learned a lot.

And then it comes to what you're asking. We had that learning. So how do we take this learning and do something with it? Because we’d now seen the pattern here at Stack Overflow of how this works. We needed to know our patterns, not someone else's patterns, we need to know what we need to do.

And that's when a team that wanted to deploy a new service… they were people who were kind of influential a little bit, a team that had high output. People had a lot of regard for what they would do. They kind of were innovators, right? They were what I would say when we talk product, they were early, early adopters. They were seeking the newest solution. AI had just come out and their thing was like, okay, let's try something new, we have space, let's try something new. 

That's when I said, all right, you want to try something new? I have a team member that we'd be happy to lend over to you and embed, and we'll have Reliability Engineering support us for any infrastructure needs to understand what does that take to do what you're doing. And that was a pivotal moment. Embedding into that team and getting that team enabled because they wanted to be, just changed everything.

We learned a lot, we got a service out for them, we brought learning back to us and everybody was happy. We got the product done. We got engineers on that team interested. And we got learnings. It was just like win, win, win everywhere. And that's when the ball started really rolling.

So enablement was like the magic.

Yeah, for sure. 

That is awesome. 

I've seen so many teams where there was just like this massive debt hurdle that was like just tormenting people. Like the survey goes out and it's like, our CI/CD builds are slow. And it's like that's the first thing that the platform team does, optimize some builds. And it's like, okay, we helped you with like your Docker images. But like that self-service, that enablement is still miles away, where like they get the real taste of like what this team can do. I feel like that is where a lot of people with this Operations, DevOps, SRE background can really turn into a 10X (I hate saying that) engineer… like the recruiters say this, it's a silly phrase, they say it all the time, but like this is the opportunity to really do that within a business. If you can make your engineers be able to do things themselves, like that is one of the biggest superpowers you can have as an Operations or SRE and like getting them to that point.

So they get there. They're able to manage some of this stuff themselves. They're able to move this service forward, get this AI product like into development.

With that group that had gotten some enablement, they felt the ability to kind of have a little bit of control over getting the service into the cloud, getting this AI product deployed. Like from there, were they your word of mouth? Did they go out and kind of like spread the gospel of platform engineering to the rest of the team? Or was that still like a marketing effort from your team's side to get people to start coming over and working with you? 

Yes. The answer is yes to all of that. 

So yeah, starting platforms is hard, and when you get a little bit of like evangelism and people are happy with it, you take everything you can get and you try to do more. So that team was very, very proud of what they did. And they did a lot of talking. If I remember correctly, they even did a demo or two explaining how it worked. And they even talked to our product managers to say, look how fast this can be done and look what we can do. I mean, not fast originally, because that's part of the issue at the beginning is you're learning, and it takes a little longer. But, you know, look how iterations can go faster. And so they started talking a lot themselves.

And that gave us like, okay, we have the spark, don't lose it this time. So that then is when my brain started going with the team, okay folks, how are we going to do this? How are we going to not lose this momentum?

Marketing Platform Engineering Internally

And that's when I'd say the biggest thing is going back to the standard kind of product style. Marketing has to go into your big thing. Shine a light on the team. Shine a light on everything that they're doing well. Have my team, have a light shine on there so its people are proud and excited and like, look we did this. That Marketing–never stop. 

I mean people probably get tired of me marketing when I'm trying to do this. So, you know, we'll do things like office hours every other week. We have a steering committee that meets to discuss what we should be doing and what the problems are right now. 

Side note–any good team that you're going to do marketing with needs a good name and a good logo. One of my team members, out of hand one day, said, “Hey this is like free refills at a soda machine, right? You want a service, just go up and just refill, and you get a new one.” And at first we were like, hahaha. Is that the name? And we're like, sure, that's the name. So we have this big cup logo with eyes and stuff. It's great, everybody loves it. We have a Slack emoji. We react to things with the free refills.

[Laughing] That's how you roll platform engineering out, it’s just a Slack emoji. Once you have that, it's on its own from there. Like you don't have to do anything anymore. 

I’m taking vacations.

Yeah, we’ve got to go. I'm going to drop the emoji six times, I’ve got to bolt. Enjoy it guys. 

So that first team that you embedded with, you start getting the word of mouth, you start doing the marketing… which I would love to talk a bit more about the marketing, because I think that's something that's very interesting… but was that the end of embedding in teams or are you still like loaning out your engineers to embed with people to find those problems? Or are people kind of bringing them to you now?

I wish there had been more embedding in the time between then and now. We'd have people act as liaisons; I'd say that would be kind of the max. But, like I was talking about earlier, we were really slim on people, right? So embedding comes with a high cost too, right? Remember, I told you about these other services we have to handle too. And it's a balance to handle those things and that, but we did a pretty good job of that.

So then it became like let's build what the base product is that we're trying to do. Because now we can't just keep embedding right off the bat, we need to find a way to self-serve. And that's when it led to, let's document what we did. Let's get a process system set out. And then once we have documentation, highlight the points that were really the pain and automate those. 

And that's kind of what took the next six months or so to create a CLI. Test it out. We'd have a few other teams do some exploring because there was some word of mouth going out. And we were loud about what we were doing to some degree saying, “Hey, come do this. It's really fun.” And that time was a big innovation time for us. 
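The "document it, then automate the painful steps in a CLI" progression might look something like the Python sketch below. The command name `refill`, the flags, and the template files are all invented here for illustration (the name only nods to the team's "free refills" branding); the episode does not describe the real tool's interface.

```python
import argparse
from pathlib import Path

# Hypothetical templates; a real service baseline would carry CI, deploy,
# and ownership files agreed on by the platform team.
TEMPLATES = {
    "README.md": "# {name}\n\nOwned by: {team}\n",
    "service.yaml": "name: {name}\nteam: {team}\n",
}


def render_skeleton(name: str, team: str) -> dict[str, str]:
    """Return a mapping of relative file path to contents for a new service."""
    return {
        f"{name}/{filename}": body.format(name=name, team=team)
        for filename, body in TEMPLATES.items()
    }


def write_skeleton(files: dict[str, str], root: str) -> None:
    """Write the rendered files under root, creating directories as needed."""
    for relative_path, body in files.items():
        path = Path(root) / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(body)
        print(f"created {path}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="refill")
    commands = parser.add_subparsers(dest="command", required=True)
    new = commands.add_parser("new", help="create a new service skeleton")
    new.add_argument("name")
    new.add_argument("--team", required=True)
    new.add_argument("--root", default=".")
    return parser


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    write_skeleton(render_skeleton(args.name, args.team), args.root)
```

Calling `main(["new", "tags", "--team", "core"])` would create a `tags/` directory with the two template files. Keeping rendering separate from writing makes the documented steps easy to test before any files are touched.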

Embedding is starting back up again, actually. We found some space to do this. The services I was talking about, we're finding ways to pragmatically chunk them down and have them take less of our time. We also are partnering with another team. 

We have a team called Stacks. They do the Stack Overflow design system. You can actually go to StackOverflow.design and you can download it, it’s open source. They're responsible for the chrome look and feel, but they also do a lot of UI enablement too. They actually are the ones who enabled us to do Svelte and Islands in our monolithic code base. And so they've always been interested.

One of the problems you're going to run into is definitely UI, right? How do you get these UIs in there? So we're partnering with them. Actually as of like… next week or technically it might even have started… by the time this comes out we will definitely have started. There's a team that wants to do something very unique with the UI, like embed a hot path UI. And so we are embedding over there to try that out and see what that looks like, and bring that information back as well.

So we see the value in it, for sure.  It's definitely been the most valuable, but you have to place it with a smaller team. You have to place it appropriately so you're not spending that cash all the time, that money in that way all the time.

Yeah, I really liked that idea of embedding. I feel like that's interesting. I’ve definitely seen SRE teams do that well. Especially when you're trying to get in with a service that's been a bit fidgety or goes down often.

Being able to get in and work with a team and see like where their struggle is, it's pretty interesting. Because, you know, the survey side of the world, it's something that I think many Operations folk or SRE will struggle with. Because you are building a product. And they are, time and time again, the furthest team from the product.

When you're talking about your main product, like if you’re talking about Stack Overflow, you're talking about Macy's.com, whatever it is, the ops people are the furthest from product. They're probably not in product meetings. They might not even have a product manager. Now all of a sudden they are a product, right? 

And just getting those soft skills to survey right is hard. Like without introducing your own biases. Like, does the CI system suck? Yes. The answer is always yes. Even if it's fantastic, it still sucks, right? And so it can be very hard to survey. So that is an interesting strategy of actually like just putting people on the team and like being able to see… like, let me see the pains that you're feeling and what doesn't feel like it should be your service’s responsibility.

Dealing with Global State and Data Access in Services

That's what some of the participants in this current embedding is too. We are actually running a survey again. It's a lower effort (lower effort than an embedding) tool, but it's one tool, right? One tool in the toolbox. Is it the best tool? Well, it's a tool and it will give you information. 

But when you're in a code base and you're sitting there saying, “Okay, I would like to get information about the current stack exchange site I'm on.” and you find out that the way you do that is through this global immutable state. Then you're just like, “Oh no, how am I going to replicate that in a service?” We don't have that. We don't want that. Our sensible default is not to have a global immutable state in a service. And again, it's not critiquing why it's there. It's more just saying, that's going to be a pain. And what do we do there? 
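A minimal sketch of the alternative to that global state, assuming a service resolves the current site once per request and passes it along explicitly. The `SiteContext` type, its fields, and `build_question_url` are hypothetical names for illustration, not Stack Overflow's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteContext:
    """Hypothetical per-request description of the current site."""
    site_id: int
    name: str
    base_url: str


def build_question_url(ctx: SiteContext, question_id: int) -> str:
    # The site is an explicit argument, so this function behaves the same
    # inside a monolith or in a standalone service.
    return f"{ctx.base_url}/questions/{question_id}"


# The caller (for example a request handler) resolves the site once and
# hands it down, instead of every function reading a process-wide global.
ctx = SiteContext(site_id=1, name="Example Site", base_url="https://example.com")
url = build_question_url(ctx, 42)
print(url)
```

The context stays immutable, as in the monolith, but its scope shrinks from the whole process to a single request, which is the part a service can replicate.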

Or data access. Boy, when you need database access, it's on some object. I just operate it as some function that allows me to grab it. This is so easy. In a service, that's going to be a little harder. And how do I even get the data? Because you don't want to create a monolith by database. Like, I create a service but we connect to the same database. Okay, we're struggling again, so let's not do that. 

Yeah, you find out so much being the closest to the problem. I've said that more recently about things than anything, like go where the problems are, go where the people are to find out how to solve them. It’s like a mantra that works in so many different ways. And here it definitely does.

Yeah. I mean, I like a good survey. Don't get me wrong, but I feel like if you don't know how to survey and you don't know how to process the results… you might have a lot of help on the Stack Overflow side of the world given that you guys have the Stack Overflow survey… like, if you're a small team and you're trying to figure this out, sometimes the information you get back from the survey isn't really what people want. It's like the bias in your questions. They told you what sucks about the things you asked about, but then you just missed like a whole part of their world that you don't know. 

I really liked that approach.

Taking on the Challenge of Marketing

Let's talk about the marketing a bit. Because I feel like this is another thing that Operations folks are pretty far from. I’ve got a few friends in the space that aren't necessarily like the most boisterous or talkative people. They're not going out there gloating about what they're doing and why you should be using it. They're just there to do the hard work, and they're dependable. So it's funny. It's like this other role that you have to learn how to do to have a good platform team, to get your team excited. But it's also one of those skill sets that's just so far from the average Operations person. So how did your team go about the marketing internally, originally? What were some of the challenges and like upskilling of yourselves there to do it effectively?

Yeah, that's actually like a fantastic question. I will admit that I am one of the luckiest people on the planet when it comes to teams right now. And I'm not trying to like, “Wow, let me tell you about all these people and is anyone looking for anybody?” Please no, I love them. Please don't take them. They are though.

You are not allowed to poach them.

Please do not. They are amazing people. 

I have four software engineers right now that have worked in some version of platform engineering somewhere or they've had to do things themselves somewhere else. And they already have some good soft skills and willingness to talk to engineers. And that right there is already getting over some of the stumbling blocks. Because you have to go out and talk to these other engineers and understand what their problems are. 

So there'd be a lot of times where, in my one-on-ones, I would talk with them and say, I need someone to facilitate this meeting, can you facilitate this meeting. And if you have issues like what can we do, what is worrying you, like why being in front of people worries you. And doing a lot of coaching. Like, let's try it, see what happens, and then if you don't like it we'll come back and figure it out again. 

We'll get those soft skills to a point of… you don't have to be amazing, you’ve just got to be able to do it okay. And doing okay is all we need. I would say they do it really well, a lot of them. And that's really good. And that is going to be someone's toughest part. I think you need individuals on platform engineering that are more than just the best coder, best reliability engineer, SRE. You need those, it's like any composition of a team, you need the right T shapes and then you want to make that comb shape for the team.

But you definitely have to have people that if they don't have the soft skills, they're willing to develop them. And it's almost like developing TPMs out of them, right? Like I need you to become a TPM and that's really hard because that, for some people, can be a very scary word. I don't want to be a TPM. I don't need you to be a TPM, but I need to have qualities and put a hat on that says I can take this here and do it.

To like maybe put a final thought in on it. Finding someone in your org that probably has some of those skills and is probably good at them is a great idea. I always try to empower my people and one of my people, who actually came from the Stacks team, he is great at taking outcomes, boiling them down to like what are the opportunities, and then finding the problems. He can do that through frequent contacts, frequent talking. And then he mentors people when he does all this. Taking that person and saying okay… again shining that light. Showing this person's doing amazing. And that will pull other people to say, “That person is doing good stuff. What can I do to be a part of that.” because people like to be part of good change.

Yeah, that's interesting. You know, one of the things with the marketing of it is there are plenty of topics on the internet about like DevOps versus Platform Engineering. And I'm sure some people that know me and are listening to this are like, “Hey, wait, you did that, dude.” Yes, I did. But there's plenty of videos of people asking me about that so we won't get into it. But I feel like, when we look at where we are as a community of people that work in the cloud and do DevOps and Operations and whatnot, many orgs are still where we were as a collective in 2008, before like DevOps was coined.

We haven't gotten there. It went from this idea of this culture, these two teams working together to like it's a role now in many organizations. I don't think there's anything wrong with DevOps. Obviously I think platform engineering is an extension of it. But I think what we did wrong as far as getting our organizations to adopt it and take it seriously was that we didn't market it. 

Not like companies selling it. The companies selling it did a fine job marketing it. They marketed the shit out of it. But like internally, like us selling it to our businesses, us selling it to our business counterparts, us selling it to our teammates is something that I think that we did not do a good job at. And I think that's kind of why DevOps isn't what it should have been for many organizations. 

And that's one of the things that's always kind of excited me about this renewed interest in platform engineering, renewed interest because it's not new.

Yeah.

There was this like product and marketing like tone on it from the get go. And I'm very excited by that. I know that for some people the marketing part’s not exciting. But it's catching on quick and people are getting excited about it and people are actually doing the work this time around. And I think that's going to be the big difference. 

I would say, if you're on a team and you're struggling to get your organization to like buy in and be excited about it, you're probably missing a little bit of that marketing. You're probably telling them what you will get with platform engineering. You're not telling them why it's important and what problems it's going to solve. And illustrating that with numbers and case studies from within your company. That is great stuff that can really get the idea in everybody's mind and thinking of it day in day out. 

Sorry, random segue. It's just when we get into the marketing, I'm just like, “Yes.” This is such an important part of the work.

I would take that a step further. One thing that I could have told you more about is also thinking about… I had my CTO's buy-in… some places have a lot of metrics that tie things together and you can like easily calculate ROI if I decrease my build time or make my pipelines better. And you can do that. At Stack Overflow, we didn't have that connection as easily. So, luckily, I could talk to a CTO who was very educated on the idea and gave me space, gave the team space. Like, okay, you have some space to do this, give it a shot and see how it works. 

Then, you know, we did all the things we've talked about, but part of that marketing is, interestingly enough, talking to things like the product managers and being like, “Listen, do you want fast iterations? Oh, you do. Boy, do I have something for you.” And they're just like, “Ahh”. In fact, I had one product manager tell me just recently that they are just so happy with how fast they can iterate on something that this team's deployed. It's just like, you can get iterations out like that compared to the monolith. And in fact, the team, I think, doesn't really want to go back to the monolith anymore because they love their four to five minute from code build to deployment compared to the 40-minute build of the monolith.
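To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The build times are the ones quoted above; the eight-hour working day, and the idea that each iteration waits on one full build, are assumptions made only for illustration.

```python
MONOLITH_BUILD_MIN = 40   # quoted monolith build time
SERVICE_BUILD_MIN = 5     # upper end of the quoted four to five minutes
WORKDAY_MIN = 8 * 60      # assumption: an eight-hour working day

# How many times faster one code-to-deployment cycle is.
speedup = MONOLITH_BUILD_MIN / SERVICE_BUILD_MIN

# Upper bound on build cycles per day if nothing else took any time.
monolith_cycles = WORKDAY_MIN // MONOLITH_BUILD_MIN
service_cycles = WORKDAY_MIN // SERVICE_BUILD_MIN

print(f"speedup: {speedup:.0f}x")
print(f"build cycles per day: {monolith_cycles} vs {service_cycles}")
```

Under those assumptions the service gets roughly eight times as many feedback loops per day, which is the kind of figure a product manager can relate to without knowing anything about the pipeline.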

Mm-hmm.

Then you're hitting your marketing around everyone, because everyone's part of the same journey. So if I can't explain it to a product manager, or my team can't, then we need to work on that because they need to see the value too. Listen, in the end we all are here to help the business and the product… like you said the literal product we're trying to do has to benefit from what we're doing. Otherwise, we're playing with ones and zeros for fun. Which is great and all, but we want to get further for the business so we can go farther.

We want to maximize shareholder value. 

Oh, no. Make profit, okay.

Sorry - So for people that weren't watching on YouTube, I had full jazz hands when he was talking about the PMs from other teams. But it's a really great point, right? Like if you're working on a platform team and you can't explain the benefits and the problems that it solves to somebody that's non-technical, that is a problem right there.

That is pretty keen. Like even just exercising that conversation on a few PMs in your org to get them to get it. Because then you will probably start to get this one, “I was talking to Cory the other day and he said their build time’s like eight times faster than ours.” How much is that slowing you all down in like getting these features out, right? Like you might, you might create yourself some like free marketing.

Exactly, and free marketing is the best marketing. No pay, you don't have to do anything for it, great.

It is the best marketing. 

Technologies and Trends

So as you work through your platform journey, what are some of the technologies and trends that you are seeing in the cloud? Because there's a lot of stuff you may not have had access to when you were running on prem. So you're seeing all these services in the cloud, you're seeing AI and LLMs coming down the pipeline. What technologies and trends are you seeing that are exciting to your team to incorporate into your platform?

That's a really good question. It's a really hard question too, because I love it all. 

I'm seeing things that are really interesting in the ownership side of things. I was just looking at and playing around with OpenCost the other day. The idea that I can deploy to the cloud and Kubernetes clusters and get an idea of costs from a unified framework and it works everywhere, my mind was blown. Because cost is an ownership thing that I think teams should try to own. I don't think they should care about the dollar number, but they should care if somebody comes in and says they need to reduce by 10%. You know, that's something they can work on. So that was an interesting space that I saw. Is it the most influential? I'm not sure.

One of the things that I would say that we're really interested in, I'm going to say it personally, is more looking at workflow provisioning of infrastructure and how that's going to try to become standardized. I was reviewing a really cool new CNCF option… I forget, I think they're, like, just early in… called Score.

Yeah.

Yeah, and I'm like this is awesome. I can provision infrastructure, and it's understandable, and it's updatable, and I can see where everything is at any one time. That's pretty amazing. And I think that's a good way to give contracts. Even in platform engineering, you probably have something like a base team that does some of that stuff. Like, “Hey, I'm the one, we own provisioning of your infrastructure.” Cool, so I want a nice contract to make that easy, Score does that. Like I think that's a good way. 

I've seen workflow in the past where you just create your own… whatever, right? It could be a JSON file, a YAML… who writes in YAML anymore? No, YAML's fine, I'm sorry.

Might be my favorite language.

Oh, I hoped you were going to say XML, but okay. 

But yeah, you'd have some contract with them, and you would pass it along, and they would provision the stuff for you automagically, which is great. But Score, I think, is a really cool way to do it. The standardization of that space, I think, is a next level iteration for where we're going in the cloud. And I'm excited to see where it goes.

Yeah, I think that one's pretty interesting because I feel like a few years ago in some orgs it was like this team manages infrastructure and you all do your apps. There are a lot of orgs still like this today. Like if I need a database or a queue, I have to go open a ticket, change management, GitHub or Jira or whatever, tap somebody, sneak a Slack message out there and be like, “Hey. Can I get an SQS queue?” Or GCP Pub/Sub queue or something. 

But the funny thing is like when we think about our apps… like where the cloud started… our apps ran on the cloud. It was like, I got a classic VPC from AWS and I got an EC2 instance and I put my app on it and it runs. But now like our apps are the cloud. Like my app needs cloud services for it to function, right? And that shift of like, I'm using queues, I'm using Step Functions, I'm using object stores… like these things are very specific to my app, not how we run applications. I need that control. 

That grinds teams to a halt, if all infrastructure provisioning has to go through this infrastructure team. Like go tap one of those people if you need a queue. And like being able to extend the ability for a developer to get certain cloud services when they need it (with your compliance and all your practices in place), like that right there is, I feel, one of the holy grails in this shift towards self-service and platform engineering. Letting people just grab that stuff because it's...

I feel like we've just gotten in this weird spot where it's like, but that's infrastructure. That's not theirs. And it's like, no, it is. I can brew install that stuff locally. Like if I can brew install it, I should be able to make it myself there. The catch is like, are my guardrails in place? Now that's the part that freaks all those Ops folk out. They didn't copy the OPA policies, right?

How do we get all that stuff in place and let them run because that's what they need. Our apps are getting to be less code as we're starting to buy into some of these commodity services. So let the developers get it, otherwise, if they don't, they're going to code around you or they're going to ClickOps it… beware Ops folk.
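A minimal sketch of what "guardrails in place" could look like for a self-service request. In practice this kind of rule is often written as OPA policy in Rego; plain Python is used here only to show the shape of the check. The resource types, regions, tags, and the `ResourceRequest` and `check_guardrails` names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical guardrails a platform team might publish; these values are
# illustrative, not a real policy.
ALLOWED_TYPES = {"queue", "object_store", "postgres"}
ALLOWED_REGIONS = {"us-east-1", "eu-west-1"}
REQUIRED_TAGS = {"team", "cost_center"}


@dataclass(frozen=True)
class ResourceRequest:
    name: str
    type: str
    region: str
    tags: dict[str, str]
    encrypted: bool = True


def check_guardrails(req: ResourceRequest) -> list[str]:
    """Return a list of violations; an empty list means the request can proceed."""
    violations = []
    if req.type not in ALLOWED_TYPES:
        violations.append(f"type '{req.type}' is not self-service")
    if req.region not in ALLOWED_REGIONS:
        violations.append(f"region '{req.region}' is not allowed")
    missing = REQUIRED_TAGS - set(req.tags)
    if missing:
        violations.append(f"missing tags: {sorted(missing)}")
    if not req.encrypted:
        violations.append("encryption must be enabled")
    return violations


ok = ResourceRequest("orders-events", "queue", "us-east-1",
                     {"team": "checkout", "cost_center": "1234"})
bad = ResourceRequest("scratch", "vm", "ap-south-1",
                      {"team": "checkout"}, encrypted=False)

print(check_guardrails(ok))
print(check_guardrails(bad))
```

The developer gets an immediate yes or a concrete list of what to fix, rather than a ticket and a wait, and the platform team keeps its compliance rules in one reviewable place.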

Yeah. They'll write some Terraform and you don't want them writing Terraform, do you? 

No, no, you don't because they're just going to bug you about it. Is this right? How many zones do I need? I'm going to all of them. 

How do I do a loop again?

The cost stuff is interesting too though, because I feel like that is a really important thing. Again not like necessarily every dollar but a lot of developers are out of the loop and even some Ops folk are out of the loop as to how a change to a piece of infrastructure affects the cost. And for many orgs, like three months down the line, the CFO is like, “Hey, how do I cut this cloud bill?” And it's like, well, you just introduced a lot of work for me to go and do that. But like, if you can start to see using… you said OpenCost, right? I'll put a little show note in there so we can throw a link in the show notes. 

I think that's important, because while you might not care about the dollar amount per se, if you see that there's a big fluctuation in the cost of your services, like that tells you something. We're scaling or somebody's misconfigured something, right? But like, if you know that your company's under a budget, you’d know there's something that probably needs your attention now versus when it's an emergency in three months. And that's a level of, I guess, enablement and visibility that many developers just don't have.
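A rough sketch of that kind of visibility, assuming you already have per-service cost totals for two periods from whatever cost tool you use. The numbers, the service names, and the 25% threshold are made up for illustration, and `flag_cost_changes` is a hypothetical helper, not part of OpenCost.

```python
def flag_cost_changes(previous, current, threshold=0.25):
    """Return services whose cost moved by more than `threshold` (a fraction)."""
    flagged = {}
    for service, new_cost in current.items():
        old_cost = previous.get(service)
        if not old_cost:
            # New service or zero baseline: there is no ratio to compute.
            continue
        change = (new_cost - old_cost) / old_cost
        if abs(change) > threshold:
            flagged[service] = round(change, 2)
    return flagged


# Hypothetical weekly cost totals in dollars.
last_week = {"search": 1200.0, "api": 800.0, "worker": 300.0}
this_week = {"search": 1250.0, "api": 1400.0, "worker": 150.0}

print(flag_cost_changes(last_week, this_week))
```

The point is the relative change rather than the dollar amount: a jump or a drop is a signal that something scaled or was misconfigured, and it can be raised the week it happens instead of three months later.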

I 100% agree. One of the things we talk about a lot in enabling platform is ownership, and ownership of lock, stock, and barrel of everything. That to me goes all the way to the costs. 

I know in my organization, and every organization I think I've been part of (I just think it should change), is that I'm sitting here working on my code, and then someone taps me on the shoulder and says you know that Kubernetes instance is costing us $100,000 a year? You're like, “Whoops. How was I supposed to know? I didn't know.” And I think if you can expose that, like a metric right up front, that has such a power to it. Because now it's not just about how do I deploy the coolest engineering thing that delivers my product value, but also like how can I do this by minimizing that number?

Come on, we all like gamifying, right? I got my instance down to bucks as opposed to, you know, hundreds of dollars.

And it doesn't matter what org you're in, right? If I'm in a startup and my bill can only be X thousands of dollars and maybe even smaller, costs matter a lot there. And then if you're a big FinTech company, costs matter there too because everything's a budget. And it matters, it needs to be streamlined and understood. And even for us, we have the same issues, right? We're always looking at, do we really need this integration? Do we really need to have these on as VMs? Can we do Kubernetes here? 

Honestly, we haven't talked much about it because it's AI. People are going to tell me AI is coming to platform too. But as we deploy more AI resources, anybody does, those costs are just going to go out of this world. And if your system's like, I'll make 20 LLM calls just to do this one request, to give this person a good recipe. I mean you could do that, but you could use Elasticsearch and then just have a good re-ranker on it and it’ll cost you like 1/100th the cost. 

AI and Platform Engineering: Future Considerations

Yeah. AI and platform is going to be interesting. I actually had a guest on a few weeks back and they were talking about how they're incorporating AI into one of the internal platforms at Google. I mean, it's not just the cost there, right? You're running it in the metal. 

What starts to become interesting is who owns that data if the platform engineering team is offering AI services? Like, so it starts to get real interesting. Like, who does own that data? Like, is it still owned by these 20 teams? 

If we can start to build a commodity interface for AI in the platform, is the data the platform engineering team’s now? That's where it starts to get real interesting too, because if we're thinking of ourselves as like a separate entity, a separate product, it's like, oh I own all their data now. But it's a pretty interesting topic.

Yeah, I'll be interested in it. I'm not against it. Just I want to see it first.

I'm bearish. I was bearish on the call. I'm bearish on all things AI. Except for the robots coming to get me. 

Oh yeah.

Because I'm mean to them. I'm not bearish about that. I'm pretty confident in that one.

Closing Thoughts and Career Advice

Awesome. Well, I know we're coming up to the top of the hour here. I want to be respectful of your time. One last question for you. Is there anything you've learned along the way, from your time as a teacher to your time as an IC or your time as an engineering manager or director, that you want to share or that you think could help people in their career? In their path?

Last question, loaded one. 

Yeah.

The best thing to realize for any person, no matter what they're doing in this field that we're in, is realize that we're all humans. We're all really trying to do amazing stuff, and we all want to work together. And learning how to enable each other to do awesome stuff. That we're all not the same. 

That can be such a powerful enablement tool to get you buy-in, get you education. To get you to have people who come along the way with you. To even convince people that the way you would like to design a system has a better pro and con matrix than the one they're suggesting.

Those soft skills, as we talked about earlier, and recognizing how to use them in the right way will be a game changer for everybody.

Even if you have no plan to be a manager, TPM, PM, CTO, it just helps you interface with the people who also want to help you do an amazing job, also do an amazing job. I would always say, think of the other person on the other side of your call or your desk and think about how you can interact with them better. And try your best with it. And I think you'll see good results if you always put your best foot forward.

Yeah, I like that. It's so easy for us to sit down at a computer and exercise our skills in development. Or to find an open source project and contribute to it and exercise our skills in development. But as software engineers, where we're spending 30 hours a week in our terminal, we don't get a lot of opportunity to exercise those soft skills. I really appreciate that. 

Awesome. Well, it's been super fun having you today. I really appreciate your time. I appreciate the extra time too… everybody, we tried to record this earlier this week and my laptop literally melted in the middle of the call so we had to re-tape. But I appreciate more time. 

Thanks for sharing the journey and the look into Stack Overflow. Where can people find you online?

You can find me on LinkedIn, that's probably the best place to get me. I kind of swear off most social media. So that'd be the best place. And I can also give you my Stack Overflow profile as well. 

Ooh, I'm going to go see how many points he got. 

Uhhh, it's...

Never mind. He has the most amount of points, you don't even need to go look at it. Just trust us. 

Yeah, yeah, yeah. It's the best.

Awesome. Well, thanks so much for the time. And if you enjoyed this episode, please take a moment to rate, follow, or subscribe to the Platform Engineering podcast on your favorite platform.

And if you haven't checked it out already, we just finished a five-part series on the foundations of the cloud. We talked to Mitchell Hashimoto about the founding of Terraform, Brian Grant about creating Kubernetes, and Adrian Cockcroft and his influence on what is AWS.

Thanks so much for tuning in, and we'll see you next time.

Important Links

Featured Guest

Peter O'Connor

Director of Platform Engineering at Stack Overflow