Episode 43

Published on:

18th Feb 2026

Observability in the AI Era with New Relic's Nic Benders

What happens when nobody wrote the code running in your production environment? As AI-generated software becomes standard practice, platform engineers face a new challenge: operating systems without experts to consult.

Nic Benders, Chief Technical Strategist at New Relic, has spent 15 years watching observability evolve from basic server monitoring to understanding complex distributed systems. Now he's tackling the next frontier: how to maintain and operate software when there's no human author to ask why something was built a certain way.

The conversation covers the shift from instrumentation being the hard problem to understanding being the bottleneck. Nic explains why inventory matters more than you think, how to approach AI-generated code as a black box that needs testing and telemetry, and why "garbage in, safety out" should be your new mantra.

You'll learn practical strategies for instrumenting modern systems with OpenTelemetry, why your observability hierarchy needs to start with knowing what's actually running, and how to build platforms that make safe deployment easier than risky shortcuts. Nic also shares his perspective on technical drift versus technical debt and what changes when your best troubleshooting tool - institutional knowledge - no longer exists.

Whether you're drowning in observability data or just starting to instrument your systems, this conversation offers concrete approaches for building understanding into your platform engineering practice.

Guest: Nic Benders, Chief Technical Strategist at New Relic

Nic Benders is New Relic's Chief Technical Strategist. Part of the Engineering team since the early days of the company, Nic has been involved with everything from Agents to ZooKeeper and all the pieces and products in between. As New Relic's Chief Technical Strategist, he now looks after the long-term technical strategy behind the product and the experience of all the engineering teams who build it. Before New Relic, he worked in the mobile space, managing back-end messaging and commerce systems powering some of the largest carriers in the world.

New Relic, website

New Relic, blog

Links to interesting things from this episode:

  1. OpenClaw (aka Moltbot, aka Clawdbot)
  2. Moltbook
Transcript
Cory:

Welcome back to the Platform Engineering Podcast. I'm your old host, Cory O'Daniel. Unfortunately Kelsey is off until winter next year... maybe, if I can get him to come back.

Thanks so much. Sorry for not having a show for the past couple of weeks. The holiday season was very, very busy. But today I've got an exciting one. I have Nic Benders, Chief Technical Strategist at New Relic.

He's been part of the engineering organization since the early days. He's worked across a wide range of systems from agents to streaming pipelines to internal platforms that support New Relic at scale.

These days his focus is on long term technical strategy, how engineering teams actually build, deploy and operate software. Nic, thanks for coming on the show.

Nic:

Oh, thanks for having me. It's great to be here.

Cory:

So I just checked something.

New Relic is a company that I've been familiar with for a while, and I was just going through some very old Ruby 1.8 code that I have not touched in a while. I just grepped for gem 'newrelic_rpm' and lo and behold I found like five different projects where I was using New Relic, from two thousand fifteen.

Nic:

I'll just say, I hope there's not a Ruby 1.8 quiz on this show. I wasn't prepared.

Cory:

New Relic has been around for, yeah, well over a decade. You guys have been around for quite a long time. And it's so funny just going back and seeing this repo, I remember how many times New Relic saved my ass.

It was one of the first tools that I saw that really broke down querying, and so it's like I could see my MySQL queries... just like where they were slowing down in my request... and I was just like, this is so different from the world that we had before New Relic, in like the two thousand fifteens. So it was just very cool to kind of see that. I hadn't thought about some of those projects in a very long time, but very excited to have you here.

So you've been at New Relic for how long?

Nic:

So I joined the company in two thousand ten and I was actually a customer before that. And so I have a repo like that, and I actually have a screenshot just like stuffed away here on my desktop still today, and it shows... it was like the SQL breakdown, because I had a Rails app. This is like two thousand nine and it's just like, man, I could not get this thing to do what I wanted, it was having all these performance issues. You know, Ruby, it's all like, single process and you're just like... scaling nightmare.

Somebody had told me about New Relic. I went and I installed it, I fired it up. Within 10 minutes, it was like, "You're an idiot. You're making a query that has no index on it. And that's what's wrong with you as a person and also your software." And I was like, "Oh, my goodness." I go, I build the index, the whole project works fine, everything's great.

I must have been chasing that for like 30 days, just trying to figure out what was wrong, but I was just looking in the wrong place because I couldn't see what I was doing. And that moment of just like, "Oh, I installed this gem and, like, the webpage told me exactly where I was dumb and what to do about it." That idea that, like, it wasn't enough to just look at, "Oh, here's how much memory I'm using. Here's my CPU." It's like, is the port open? Does it respond? Like all this kind of like basic monitoring stuff that we've been doing since forever.

Cory:

Yeah.

Nic:

But rather just to look in and say, "Well, what's it actually doing? What's the behavior there?" And it fixed my problem immediately.

So then when that startup I was at became a shutdown, and I think the only thing I've got left from it is this cool trash can I have over here.

Cory:

Classic startup leave behind.

Nic:

Right. You know, and so with that, I was like, "I'd better go find a new gig." And I went and I wanted to work somewhere where I was the customer, where I used it, where my friends used it, and I could tell my friends, "Oh, yeah, I work at blah, blah, blah." And eventually I found New Relic and I wouldn't let them go until they hired me.

And, you know, that was 15 years ago.

Cory:

It was funny. I'm going to say something that's probably going to get me some, some... I'm going to lose some of my nerd cred. But I did not get a job because of New Relic, once. They asked me an algorithmic question and I hate algorithm questions, and I was like, "Bro, it's like, that doesn't even matter. New Relic will find it for me." And they did not. I remember this. The guy did not like my answer. And I was like, "I'm not smart enough to do algorithm stuff in my head." Like this is my answer, I'm just leaning into it. I'm like, "Dude, you got to look into instrumentation. Like, who cares? Who cares?"

Nic:

Shit, that's not how we write software now anyway. Come on.

Cory:

Yeah, just ship it and New Relic will tell you why it's slow. You fix that later, man. That's not a now problem.

Nic:

See, we should have sent you an extra shirt for that at least.

Cory:

Yeah, yeah.

Oh my gosh, that's so funny. So I actually tried to git blame this file to see how long ago it was actually added and the repo's so old it was Subversion. I was just like, "I don't even remember how to use sub. I'm done. I'm done. I can see when the file was last edited. That's gotta be it."

Very cool. So you're at New Relic now, you've been there for quite a long time. The world of instrumentation has changed greatly in the past 16 years for sure.

Like when I first used New Relic, I don't know that we had like the word telemetry really. Like, I think we were using the word instrumentation still. Right? And New Relic seemed to just kind of like wire into most of my Ruby app. Like, I didn't have to think about it.

Today with like distributed systems, et cetera, et cetera, like, instrumentation is... it seems like it's a much harder thing, especially with all the different runtimes that we have. Like, how has New Relic changed over the years? Especially because you all kind of grew up with all of us in the cloud, right?

When I first started using New Relic, it was on an instance. I had a Slicehost VM someplace. Right? And now we're in this world of pods, containers, instrumentation. How has New Relic as a tool changed?

Nic:

Yeah, I mean, the complexity has changed a lot for the users. Right. And so like you said, you know, I mean, when I first started my career, we ran software on servers and...

Cory:

Not server.

Nic:

I know, right? Scandalous. And it was sitting next to your desk. It was this big, like, mini fridge. And you could tell you were like doing something wrong, like you're out of memory, because it was like making all these grindy sounds. You're like, "Oh man, those disks are going. We must be swapping..." And that was like, you had this like kind of intuitive physical feel for the system.

We moved them out to data centers and things like that, but we kept track of like, you know, that stuff I was talking about before. It's like, "Oh, well, did we use all the CPU? Did the machine die? Like, is it out of memory?" Just this like... really basics, kind of this like system monitoring. And it didn't tell you anything about the applications you were running on it.

Like we always said, you know, to an Ops person, an application is just some crap the software team gives you that like fills up your perfectly good servers. You're like, "I bought this beautiful machine. It had all this memory, then you went and ate it." And it's just like, it converts it to heat. And that was kind of the world we lived in for a really long time.

And when New Relic was founded, that was the way a lot of companies were. Where monitoring to them was, "Oh, you know, I have, you know, WhatsUp Gold. I can tell that the machine pings or, you know, I can, you know, run Nagios and see I'm not out of memory yet." But none of that told you the why, it just told you the final result. It was very infrastructural.

So our big idea was to go above that and say, "Well, let's talk about why. What's your application doing wrong?" And Ruby on Rails actually made this really easy and special because Rails is very opinionated and it wasn't just, "Oh, here's some code that's running." Like, no, no, no, you know, it's a web application, it's got models, controllers and views. It has actions inside the controllers. Here's how the models are organized. It's a very small ecosystem at first. And that super structured approach meant that we could do something magic.

You could go in and the New Relic gem just went and instrumented. It said, "Well, I know where your code's going to be because you're using Active Record. So here's your database stuff. Oh, you're using... here's the HTTP stuff." We know what to instrument and we know what it means, because those strong conventions gave it meaning. But of course the Rails part of the industry is like this big.

And so as we expanded over the years, we went out to Java, we went to Python, we went to PHP, we went to .NET, Node, like this whole ecosystem, we had to shift that view because those opinions weren't consistent anymore. As soon as we went to Java, there's no single Java web framework, there's no single Java way of doing things. There's a million ways to do things in Java.

So on top of those strong opinions in the early product, we had to start building this like flexibility. It'd be like, "Well, we're going to do our best to figure out how your application works. But we're probably going to miss some stuff, and you're going to have to help us along." And with each new layer, we saw that ecosystem expand further and further.

And today, with like, you know, open source, like we would have been... if we did the same kind of work we did, you know, when the company was founded 18 years ago, we'd be buried. I mean, there's no way to keep up with that instrumentation. And so partway through that journey, we actually shifted gears to OpenTelemetry for this reason. We said, "You know what? The person who knows the most about this framework is the framework author. Let the framework author just like, put OTel instrumentation in it. Let the customers who have stuff, let them put OTel into it, and then we'll just read what they have to say about that."

And that line, as you see, has shifted as to what we do that's valuable. And this isn't just New Relic. This is like, what do all the companies in the industry do? 18 years ago, you had to convince people that observability was different than monitoring, that you had to look inside applications, you had to build an Active Record instrumentation, an Action Cable instrumentation, like, you know, whatever. Like, you had to build those different instrumentation systems in there. Now all that's done. Everyone's instrumented everything. Like, every piece of software you pick up has instrumentation.

That war was won, and it's not interesting anymore. So then we entered this world where you're like, "Okay, well, where are you going to put all that data as each system starts putting the data in?" And we had to figure out... you know, suddenly there's a lot of data, there's petabytes a month coming in of data that we have to find a home for. And so that was that like... I would say, you know, we hit that earlier than most companies just because of the scale we operate at. But everyone's hit this. Everyone said, "Oh, I've got a lot of data now I need a place to put it. I have to be able to query it, have to answer those questions."

But today, even that's not really valuable anymore because now I have so much data in one place, and I can ask it any question I want, and I don't know what to ask it. And I think that that's this really weird moment that we're in. You know, you talk about the change from slices to today in the cloud, or like, you know, I'm cracking jokes about servers and their disks swapping. But like now, I don't know where my server is. I don't even know what instance my stuff is running on. It's like software just runs out there in the universe somewhere. And in some ways that's fabulous because it lets me do so much without having to like, you know, get my knuckles all bloodied. But in other ways it's super weird because I don't have access to it. I can't figure out what's going on with any of my traditional tools.

And I see that back and forth of complexity. It's like every time we abstract stuff away, the complexity doesn't go entirely away and you lose some of your capabilities.

Cory:

Yeah, yeah.

And it's funny because I feel like, you know, as far as like instrumenting apps, like you were saying, in the Ruby world it was magical. And I actually grepped that code base and I think I had like two hits for New Relic. It was just like installing the gem.

Nic:

No, there shouldn't really be anything. It should just be the gem.

Cory:

I think it was configuring it, right? And like nowadays, like, it is interesting.

We use OTel pretty heavily in our product and actually I would love to talk a bit about how teams can start adopting it because I think we have a pretty novel way of adopting OTel. Like, we do it, we do OTel driven development internally. Very, very buzzwordy buzzword.

But like, that's kind of how we figure out, like, where to instrument, like, where to kind of put facts. Because as somebody who's writing tests, if I don't understand it during tests, I'm sure as shit not going to understand it in production, right? And so like seeing something fail in Prod it's like, "Oh, I didn't put any attributes or anything on that."

So that's like kind of like how we drive through doing it, but I still see like so many teams today where they aren't doing instrumentation. Like, even when the frameworks have a tie in for OTel, it's like they might have that first little bit, right? It's like, okay, the actions are dispatching and they show up in OTel or whatever, but the rest of it feels so hard because we don't have that magic.

How do you see teams when they're ready for instrumentation? And I'd love to know when you think teams are ready for instrumentation. But how do you see them start to kind of bite off that first big chunk of being able to get in, get this instrumentation and get in all of their practices around attributes, et cetera? Like how they want to tag stuff. How do you see teams go about that successfully? I've seen so many start in fits and dizzy spells. They get something, like, "Eh whatever, it's too hard to instrument." I would love to know what you're seeing there.

Nic:

My mind's in two directions on this. And so one is, I mean, you talk about instrumentation, it's funny you talked about doing it. It's like, this is like a test driven development thing.

This is that same set of skills of 'It's not writing the software that's necessarily hard. It's like thinking about the software.' And part of the analogy from classical unit and functional testing into observability is really tight.

Observability is the equivalent of testing once you're live. If you've designed your software well and you've built your instrumentation, you kind of have an observability strategy. Then when you're live in production, you'll know what is happening and be like, "Oh yeah, it's doing what it's supposed to do," or "It's not doing what it's supposed to do." And when something unexpected happens, then you'll be able to react to it and be like, "Oh, yeah, that's what that means."

And so it's tricky because it takes a lot of judgment to figure out what to put in there.

And so what we see generally... obviously our engineers, we've got a lot of instrumentation experts, a lot of observability experts at New Relic and, you know, and they've got great access to a great tool, so they have a lot of fun with it... But we see the teams that are successful are teams who, before you make a change, you think about, "How am I going to know whether this change was successful in production?" You put in the instrumentation, you get the baseline, then you make your change, and then you confirm that, like, "Yep, it's doing what it's supposed to do. We've reduced the memory footprint on this one path. We've added throughput here without changing latency there." Like those types of things.

And it could be a technical change or it could be a business change. Like, "Oh, well, I really felt like this navigation path was crap and so I put a bigger button on it." All right, "Well, did anybody click on it?" These are all that same type of like, I'm making a hypothesis, I'm doing the work, whether it's a test or instrumentation, I'm baselining, I'm confirming my hypothesis. And you kind of go through that whole method for developing it.
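
For illustration, here is a minimal sketch of that "instrument, baseline, confirm" loop using the OpenTelemetry metrics API. Python is used only as an example language, and the meter name, metric name, and process_checkout call are hypothetical stand-ins rather than anything from the episode:

```python
import time

from opentelemetry import metrics

# Hypothetical example: baseline the code path you're about to change.
meter = metrics.get_meter("checkout")  # illustrative component name
checkout_duration_ms = meter.create_histogram(
    "checkout.duration",
    unit="ms",
    description="End-to-end checkout latency, recorded before and after the change",
)

def checkout(cart):
    start = time.monotonic()
    result = process_checkout(cart)  # stand-in for the code being changed
    checkout_duration_ms.record(
        (time.monotonic() - start) * 1000.0,
        attributes={"checkout.items": len(cart)},
    )
    return result
```

Record for a while to establish the baseline, ship the change, then compare the same histogram before and after.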

So now what I'm going to say is... I don't know, so we should check the watch to see how many minutes we got in without saying the letters A and I next to each other... When we talk about, you know, what are the strengths of AI, we often say AI is good at things that basically any experienced engineer could do. Like, you don't think about it too much. It's kind of automatic for you. Doesn't require like really grinding your gears. But you know, if someone's brand new, they don't know how to do it.

I actually think instrumentation might be one of these. And so we've done a bunch of experiments with this, trying to get instrumentation into code generation. Because when you're telling your coding agent, "Oh, hey, I need you to add a new feature to my app that's going to allow cats to post their, like, you know, food favorites on here." It's not enough to just generate the code. If you want to make production ready code, any reasonably experienced engineer is going to have some tests, but they're also going to have some instrumentation. They're going to say, "Oh, okay, well, I'm adding in my, like, whatever, my cat food feature. So it's like, you know, we need to just add in a little custom attribute that says which cat it was and what their choice was and that kind of thing."

I think that there's a huge opportunity for teaching our coding agents to build production software the same way we learned to do it. Which is make a plan, make a test, make instrumentation, test it in production. And today nobody's really pushing on this. Like, OpenTelemetry is absolutely a lever that we can be pulling more on those coding agents.
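
As a concrete sketch of the instrumentation Nic describes a coding agent adding alongside a feature, here is roughly what one span with a couple of custom attributes looks like with the OpenTelemetry tracing API. Python is used for illustration; the attribute names and the save_favorite call are hypothetical:

```python
from opentelemetry import trace

tracer = trace.get_tracer("cat-food-feature")  # illustrative instrumentation scope

def post_favorite(cat_id, food):
    # Ship the telemetry with the feature: one span, a couple of custom attributes.
    with tracer.start_as_current_span("cat_food.post_favorite") as span:
        span.set_attribute("cat.id", str(cat_id))   # stable identifier, not PII
        span.set_attribute("cat.food_choice", food)
        save_favorite(cat_id, food)                 # stand-in for the actual feature code
```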

Cory:

Yeah, and it's a surprisingly tight API too. Right? It's not like you have to go learn like 6,000 functions in OTel. Like there's a handful of constructs and it's like you can really start to get to a place where it fits in very well. My development workflow, I've been test driven for a very long time, and that's how I drive AI.

And by the way, I think we made it like 18 minutes.

Nic:

A new record.

Cory:

It's funny, I actually had... I had a question with AI and I was like, "I'm going to skip that one and push it down a little bit." But you opened up the floodgates.

But my flow when I develop is, you know, I write my tests and I use my tests as my prompts. So I'm very TDD still. It's like, I write my tests, I'm giving it structs, I'm asserting very specific things, and it gives a really good bounded context for AI to generate in. And so, you know, one of the things I've started to do is I don't breakpoint or debug anymore. I literally have Jaeger running on one screen and I'm coding tests on the other, and then Claude's just writing code. And so it's gotten to the point where it's like, I look over and I'm trying to understand what happened by looking at my observability. Like, "Why did this test fail that I'm writing?" I'm not looking at the test output. I'm looking at like, "How can I understand prod?"

And it's interesting because I've gotten to a point with Claude now, like, with just kind of my agents, where it's like, it sees tests fail and it understands, like, what I don't understand about, like, what's happening. And so it'll start to kind of add its own attributes. Now I've had to do a lot of work to give it guidance on things we do and don't put in. Like, "Okay, let's put a couple of email addresses in there." It's like, "Ah, let's not do that."

Nic:

It's a very popular choice.

Cory:

Very popular choice. Let's not put the email address and then we can put the UUID or something like that.
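
A rough sketch of the "OTel driven development" idea - asserting on emitted telemetry in tests, including the rule that attributes carry identifiers rather than PII. This assumes the OpenTelemetry Python SDK's in-memory exporter and reuses the hypothetical post_favorite example from earlier:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Capture spans in memory so tests can assert on them.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

def test_post_favorite_emits_safe_attributes():
    exporter.clear()
    post_favorite("9b2c6a1e-8f4d-4a51-9c8e-3f2d1a0b7c6d", "salmon")  # hypothetical function under test
    span = exporter.get_finished_spans()[-1]
    assert span.name == "cat_food.post_favorite"
    assert "cat.id" in span.attributes          # stable UUID: yes
    assert "cat.email" not in span.attributes   # email address: no
```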

But it's interesting because it does seem like it is one of those things. It's like AI can very well figure out where to put a lot of this stuff. Now, being vibe coding, it can also litter.

Nic:

Yeah, no, of course.

Which is no different, I think, than a lot of inexperienced teams where we see... any company you talk to who's got a significant observability rollout has got more data than they want. They're like, "Oh, we have oceans and oceans of telemetry from systems that nobody ever looks at. I'm paying to transmit them, I'm paying to store them," and you're paying to generate them when it comes down to CPU. But they're afraid to get rid of it because they're like, "Oh, well, you know, we told all the teams to instrument and they did. And now I got to go back around and, like, tidy up, but what if we miss something?"

And you see the same thing for, like, alerts. Everybody's got a thousand alerts that are useless because every time they fire, something else also fires. Or if it goes off and you're on call, people look at it and you're like, "I'm going to give it five minutes to see if it self-resolves." Spoiler: this alert is a disaster. It's useless. Like the only thing it's doing is increasing your response time by five minutes.

And I think that that tendency in humans, we've taught this tendency into our machine friends as well - to overdo everything. But the good news is it is kind of a pattern matchy problem. And I think that all three of those things - Are we instrumenting the correct stuff? Are we instrumenting a bunch of useless things? Are we alerting useless things? - These are all problems that have just been slowly growing throughout the industry. And we've got a chance now, I feel like to just fight back.

Host read ad:

Ops teams, you're probably used to doing all the heavy lifting when it comes to infrastructure as code wrangling root modules, CI/CD scripts and Terraform, just to keep things moving along. What if your developers could just diagram what they want and you still got all the control and visibility you need?

That's exactly what Massdriver does. Ops teams upload your trusted infrastructure as code modules to our registry. Your developers, they don't have to touch Terraform, build root modules, or even copy a single line of CI/CD scripts. They just diagram their cloud infrastructure. Massdriver pulls the modules and deploys exactly what's on their canvas. The result?

It's still managed as code, but with complete audit trails, rollbacks, preview environments and cost controls. You'll see exactly who's using what, where and what resources they're producing, all without the chaos. Stop doing twice the work.

Start making Infrastructure as Code simpler with Massdriver. Learn more at Massdriver.cloud.

Cory:

No, but it does seem like it's something very, very pattern matchy, right? And it's funny that you said it litters like some teams do. It's like it litters like a junior engineer, right? And so I think that's one of the things... it's something I personally do when I'm reviewing my PRs: did it put too much information in? Because it very much will.

And I think that's one of the things that's hard because it does, like, teams feel like they have to do this instrumentation thing, or we have to do this monitoring thing, put all this stuff in it, and then it's like we've put too much. Right?

And like, you see this already, people complain about how much it costs to store observability on top of our logs or kind of storing everything twice. And it does feel like one of those places where it's like, there's a lot of busy work and potentially that cultural conflict between Dev and Ops, like resurfacing when you're like, "Hey, y'all are instrumenting too much."

Nic:

Yeah, no, totally. Because in the Dev teams it's like, "What? You told us to instrument and now you're complaining."

I think that the future... you know, I'll put on my, like, you know, great Carnac prognostication helmet... we've got to move to a more dynamic system where you bring in that ability of those AI systems to just kind of work tirelessly and say, "Oh, well, you know, here's a bunch of data that nobody cares about. Let's store it cheaply." Or maybe, in fact, "Nobody cares about it. Let's not store it at all." But when I see, "That's an interesting question, I wish I had more information about it" - let's go and trigger instrumentation. Let's do some of this observability work on the fly.

You can never do everything on the fly because... long ago I heard some wag saying observability is like, by definition, the data you wish you had before the problem started. And I think that there's a lot of truth to that, that you do want to have the baseline data. Because I've been in a hundred incidents where someone's like, "Oh, this looks really bad," and you're like, "Yeah, but was it like that before the incident too?" "Oh, yeah, yeah. It's actually always been that way. I have no idea why that works, but apparently it does."

You can't lose that. You've got to have those baselines, but we need to be able to surface these different levels of data.
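
One hedged sketch of the "data nobody cares about, let's not store it at all" idea: a custom OpenTelemetry sampler that drops known-noisy spans at the source and delegates everything else to an ordinary parent-based, ratio sampler. The span names and sampling ratio here are assumptions for illustration, not a recommendation from the episode:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    Decision,
    ParentBased,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

class DropNoisySpans(Sampler):
    """Drop spans nobody ever looks at; hand everything else to a delegate sampler."""

    def __init__(self, noisy_names, delegate):
        self._noisy = set(noisy_names)
        self._delegate = delegate

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        if name in self._noisy:
            return SamplingResult(Decision.DROP)
        return self._delegate.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state)

    def get_description(self):
        return "DropNoisySpans"

# Illustrative wiring: never record health checks, keep roughly 10% of the rest.
sampler = DropNoisySpans({"GET /healthz"}, ParentBased(TraceIdRatioBased(0.1)))
provider = TracerProvider(sampler=sampler)
```

The baseline data Nic warns about losing still has to come from somewhere, so this kind of filtering is about known noise, not about skipping instrumentation.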

Cory:

Yeah. So going back to my first New Relic experience, I mean, I very much remember that.

Like, I can't remember what the graph was called, but it broke down, like, how...

Nic:

It's called the breakdown graph.

Cory:

Is it? That's a great name for it. I mean, you could have named it something like, very sci-fi ish and people'd be like, "I have no idea what that's called." The breakdown graph is a good one.

But, like, that was one of those places where it was like... and this was a monolith back then, right? It's two thousand tens. We're not even thinking about Lambdas and serverless and just the array of potential services this information could be passed from... it's a monolith, right? And so logging in and seeing that, I was like, "Oh, I know that some of my requests are slow. This just told me why." I wasn't actively investigating what was slow. It's just like, I logged in, it's like, "Oh, this thing takes 3 seconds on average to run this query. That's not great. I can just tune that, right?"

And it's funny, because think about what we do. We love to have the word engineer on our title. We're developers. We develop, right? But, like, part of the difference with engineers is, like, that sense of quality, craftsmanship - like, the bridge has to stay up, it must be maintained. And when we think about software maintenance, what software maintenance do we do? Like, we might refactor, like, "Oh, something's broken. There's an error budget, right?" And like, I think that's the thing, and maybe that's what you're hitting at here: it seems like there's so much insight in what I store in OTel about places where I - or hell, maybe even an agent - could proactively maintain the software. Like, not add new features. But like, "Hey, this is slow." We know it's slow. It's not a huge problem. But like, if it saves us like 5% compute a month, money just came back, right?

And that's one of the things that's hard... I feel like it's like we maintain software when we need to do some refactoring to make something easy, we maintain software when some shit's broke, and besides that, we're just shipping features, right?

To be able to have a wealth of this data where it's like, "Hey, there are things that can be done to make the system more optimal" and you don't have to wait for somebody else to discover it.

Nic:

Yeah. I had the great honor to work for many years with Ward Cunningham. And Ward is just a fantastic person. But sometimes he would get a little bit salty when people mention technical debt - an analogy he's at times been blamed for coining. He said, "Well, that's not really what I was trying to get at." In recent years he preferred to talk more about technical drift, or dust.

It's like, there's all this stuff that accumulates, and it's the difference between the world that your system is in today versus the world it was designed for.

Cory:

Yeah.

Nic:

Some of that is because the outside world moves. You're like, "Oh, we handle more load than we used to. This other service changed its behavior. The customers are looking for different things." That's an external shift.

Some of it is because we misunderstood the world. And so the software that we put out there in the first place, we thought it addressed the problem, but we didn't understand it. And I think that there's a big role to play for observability tools in that.

Observability in many ways is kind of a dumb outcome. Like, I don't want observability. Like, what does that even mean? I want understanding. My goal is actually to understand the world that my software is in, to understand the software, to understand my infrastructure. And observability is a link in that chain. If I can't observe it, it's very hard to understand. But observing it is not enough. I really need to come all the way through to understanding.

And so when I think about, you know, that example you gave, you're like, "Well, you know, what's the maintenance? Do we need to tighten some bolts on this bridge? Like, do we need to put a new coat of paint on it? Or are the trucks that are on it 10 times the weight of what they were when we built it and we need to seriously rethink the way it's anchored?"

Those are very real things that happen in software, too, but we often don't pay attention to them until something really dumb happens.

Cory:

And I'd say infrastructurally, like, it changes faster in software than on real bridges. You have to build a whole new town to increase the throughput on that bridge.

Nic:

Oh, yeah, no, real fast. It comes at you real fast these days.

Cory:

Yeah. Each wave of the major changes that we've had in the cloud since two thousand ten, two thousand nine, right? We had the cloud itself, microservices, containers... Honestly, I feel like it changes all of our worlds, but I feel like it very much changes New Relic's and people that work in monitoring and performance. It fundamentally is changing how we're running our software. But now we have these AI systems as well, right?

And it's just like, I feel like in a world where we're trying to get a better understanding of production, a random generator machine in the middle of it...

Nic:

Yeah, it's not going to help, is it?

Cory:

Right. Every wave of this new infrastructure, there's obviously been new trade offs for that. For cloud, for microservices...

Nic:

Yeah, no, and they're totally trade offs. I think a lot of people miss this, is that nobody approached any one of these changes and said, "This is universal good." I for sure had engineers like on my back for every one of those phases. Like, you know, we went to the cloud, like, "Oh, cloud is just somebody else's computers, but the latency's worse and sometimes they're bad." It's like, yeah, all those things are true, but it allows us to run our business in a way that we just absolutely couldn't back when we had to order and rack and stack and do all of that. Like, you know, our business agility was worth the trade off of unpredictability.

You know, microservices, containers - every single one of these has made the software engineer's life harder in some ways, as well as easier in others. Look at, like, Kubernetes. Kubernetes is fantastic, lets you manage huge amounts of complexity. It also makes simple tasks extremely complex. It has brought the baseline complexity of life up tremendously. Would I turn my back on it? Probably not. I still think we get more from it than we gave up.

But there's pain with each of these journeys, and AI is absolutely going to be the same thing. People are doing all kinds of amazing stunts. I saw a fabulous demo from a friend of mine this morning - he's just exploring a problem space with it and he's not ready to ship it to production - but it's answered questions that would have taken months to answer before, and he did it pretty much over a weekend.

But, boy, does it add a lot of weirdness and a lot of difficulty to the software engineer's life. Like you said, if I have to understand production and now I can't even ask the person who wrote the code because nobody wrote the code, the code just sprang into being.

Cory:

What do you think is the most expensive trade off right now with this new AI? Whether it's you're wrapping LLMs to do a feature or whether you're using LLMs to produce code, what do you think's the heaviest trade off today?

Nic:

I think in the near term, the biggest trade off is that we've made it a lot easier to create software, but we haven't made it any easier to maintain software. I don't think that's fundamental.

I think that, just what you were talking about earlier, I think we can teach AI and apply AI to look at the bridge and tell us what needs to be tightened, to tell us like when we're crossing over and to do some of these maintenance and awareness checks that, you know, we all know we should be doing. I should go through my whole system periodically, but I often don't.

And I feel like there's an opportunity there to improve software maintenance.

Until we do, we're in this weird moment where suddenly not only are our software engineers shipping a bunch of code, but everyone throughout the business is now a software author and they're just like, "Oh, well, I can create a tool to, like, automate my email flows."

I was talking to our competitive intelligence person and they're like, "Oh yeah, well, I built this fabulous workflow and it processes these, you know, press releases from competitors and looks at this and looks at that and then gives me like a report that I send to my team."

Like, this stuff is amazing, these capabilities, and it must feel like when computers first started to come into the workplace and we got like the spreadsheet and you got this stuff, you're like, "Oh, instead of adding these numbers up by hand, check this out. I wrote a formula to do this." And it's that shift of every person in every role becoming a bit of a creator.

Cory:

Yeah.

Nic:

And I think we all know how that story ends, which is suddenly I have 2 million pieces of one-off software that are running somewhere in my environment that nobody designed for long term use and yet a million business processes depend on them. And so I think that we're heading into this very weird moment when that ability to create has outstripped the ability to maintain.

I don't think it's permanent though. I think that the ability to maintain is going to be the next big push for all these AI systems.

Cory:

Yeah. And seeing like the... Gosh, I can't even keep track of the thing's name anymore... Clawdbot, OpenClaw, whatever it's called...

Nic:

OpenClaw [laughs].

Cory:

Sorry, we'll put it in the show notes if you want a little mental torture.

But it's funny, like you say, we already have this problem of "our code's running somewhere" and now we have people that don't necessarily know how to write code, writing code and also running it somewhere. And that somewhere is most definitely... possibly our AWS's and whatnot, but as we're getting more SDRs, sales reps, et cetera, producing software and that's running on their laptop or whatever, right? Like you don't know, maybe they set up an OpenClaw to help with SDR and now they're just blasting through credits, blowing the cloud spend away and just there's zero visibility.

Nic:

Oh yeah. Essentially once their OpenClaw gets on Moltbook and finds somebody else's, like, amazing SDR script, it's like, "Oh, I'm just going to copy this."

Yeah, no, for sure the first place you're going to see this stuff hitting is people are going to start to really have like a heart attack when they open up those LLM provider bills. They're going to be like, "Holy cannoli, we've spent a lot of tokens. Who's spending this?"

So it's funny that you mention it, I think that the FinOps team will be on the front line of trying to make sense of all of this stuff - which I'm sure they love. They loved that in the cloud days too. They're like, "Well how did we get in charge of cloud architecture?" You're like, "Well it's where the money leaves the building."

Cory:

True. Gotta follow the money. And I don't know, it would be interesting, like I'm curious how much telemetry comes into play there, right? It's just like having so many of these just essentially functions of code just executing in random places, like being able to just get organizational knowledge about all of that stuff.

Does it matter? Some of those definitely don't, but being able to understand where the money in the business is going or where your API tokens might be going and just being able to have records.

Nic:

I think in some ways it'll push us... we kind of go down Maslow's hierarchy a little bit for observability. The top of it is automatic root cause analysis. Everyone's like, "Oh yeah, that's what I want." Or even automatic remediation. The base of the observability hierarchy is like inventory. A lot of people forget about it. A lot of people jump straight in and say, "Oh, well, I want telemetry, I want to know how my stuff is running."

Actually a lot of people, what they really need to know most is what stuff are they running, even before they know how it's running. And this is true for human authored software. I remember going into a very large company and we gave them this beautiful presentation of all the great data that we were generating and we're like, "Well, what questions do you have?" The very first person was like, "How many servers do I have?" I was like, "Oh, that's a really fascinating question. That's so below my radar that I didn't even think that that was a thing that I would care about."

But like, I think that's the world that we're going to get into here, where the first question you're going to have if you're like a CTO or CIO is like, "How many agents are we running?" The second question, "How many of those are using software that is definitely not approved for use here?" And then the third one will finally be like, "How much does it cost and does it work?"

But I think we've got to... we're going to have to reset on some of this and build our way up.

Cory:

Yeah, I mean the thing that's really wild with it too is like, I mean it's software. It's software that's running places, right? These are software assets.

And like the thing that just like is mind boggling to me as like an Ops engineer is like the amount of time we take around provenance and SBOMs and all this stuff, and then it's just like, "That's just running someplace." You know, it's just like, "Oh, I don't know where that one is."

But like it's wild because there's still, I'd say like, you know, there's a fair number of companies that are still on their DevOps journey. Right? And like you still see people are just like, "New dev? Yeah, he definitely has the root password for postgres, like everybody does. That's how you log into it and you run some queries."

Nic:

Right. Sometimes you gotta adjust the queries.

Cory:

Yeah, yeah. And then all of a sudden I got a little MCP server that's interacting with prod from my local machine. And it's just like, that's just... it's just going to be such a harder thing to track now. I think it's going to be a very interesting problem. I think it's going to be a very lucrative problem for a lot of folks.

Nic:

Yeah, we're going to read a lot of RCAs like that for a while, for sure. That's going to be a starring role in Incidents coming to your town.

The funny thing is you can't win this though by going around and slapping people's wrists and telling them, "No MCP, like don't do this, don't do that." It's back to the old techniques - you've just got to give people an easier system that works, and works safely.

We cannot get people to stop vibe coding now that it's started. We can't get people to stop, you know... like, some executive comes in and is like, "Oh, I saw this amazing demo. Somebody fired up this whole production feature and deployed it in four hours." And all the engineers are like, "Oh, that makes me very uncomfortable."

But our job is actually kind of to find a way to make that real and say, "Okay, well, what needs to exist for that to be a reasonable statement?" Like, what kind of safety do we have to build into the platform? What kind of visibility? What kind of intelligence? Like, what do we have to put in the system so that you can have this kind of like, 'Garbage in, Safety out'?

Cory:

Love that.

Nic:

Like, "Yes, I really could have two people vibe up a production feature in a single day."

Like, the stuff we do today, if you had asked me 20 years ago, the stuff we do today would have been, like, off the chain. Like ridiculous. No one would believe that. We're like, "Oh, yeah, well, we just put the software out. It runs in these containers, and, like, they get auto scheduled. What happens when the server dies? Oh, it just moves to another one. Like, you know, it's fine."

Cory:

It teleports.

Nic:

It's just like, "You know, we deploy a thousand times a day."

You know, all of those things would have easily been considered unreasonable 20 or 30 years ago. So, you know, this vibe system sounds totally unreasonable and unhinged today. So, you know, we just got to figure out... you know, all the work always comes back to us, it's like, you've got to make the system work with it.

Cory:

Yeah. I mean, it's so compounding, too, because it's like... I love that phrase 'Garbage in, Safety out'... Like, we do a ton of vibe coding. It's funny, we talk to customers and they're like, "How big's your team?" And we're like, "Oh, it's four people." And they're like, "How do you build this product with four people?" It's like, "Well, we're four Ops engineers or three Ops engineers - we only have three software developers - but, like, we've put so much effort into our workflow from the beginning, because we're still a startup. I think we're four or five years old, but, like, we've done the diligence to be able to trust the output that's coming out of these things."

We have tests in place, we have security scanners in place. We have a good, you know, pull request culture. And there's so many companies that don't have that.

Nic:

Right.

Cory:

And it's funny because it's a bit of a, like, catch 22. It's like the people that need to produce more software and need to maintain more software are typically in these places where they don't have the baseline to do so. To put 'Garbage in, Safety out' without getting just garbage out, right?

Nic:

Yeah, it is tricky, right?

Cory:

It is.

You know, I know we're coming up on time here, I'd love to know like, as AI becomes, I guess, a more normal part of production systems... More normal? More normal? Will it ever be normal? I don't know. Maybe that's not the right phrasing.

Nic:

It totally will be. It totally will be.

Cory:

I'm torn on if I hope it is or not.

Nic:

I don't do hope. I say, "I grew up in Ops, we don't do hope."

Cory:

Yeah, like, what do you think teams need to relearn about operating software?

Nic:

It's a good question.

I think that, you know, when I look at the change that AI is driving, I think the first ones are on the software creation side. It's that, you know, you have to go back to the basics, which is that writing code is not building software. Writing code is never the hard part - it was often the time-consuming part - but the hard part was thinking about the product, thinking about the architecture, understanding the long-term impacts of choices. When we talked about getting those tests in there, getting that instrumentation, that stuff has always been the kind of task that requires judgment.

As you move into what does it take to operate software? The biggest shift - and AI is actually accelerating this, but didn't create this shift - the biggest shift is just there's no expert anymore.

And we ran into this actually a few years ago when, you know, with everybody nationally working from home, all the employee churn that happened in the pandemic, you would run into these situations where there's nobody left on the team who wrote a piece of software. And you said, "Okay, well we need to understand the software." And I can't get on Slack or pick up the phone or walk into somebody's room and say, "Hey, is it supposed to do this?" Because there's no human to ask. And in the AI world, it's not just that that person isn't on your team anymore. That person doesn't exist.

That person is a stateless LLM - I can't ask it anything. It'll just, like, make up something on the spot.

Cory:

Yeah.

Nic:

So you've got to approach it as a little bit of a black box and think more like a QA engineer, in some ways - "Okay, here's a piece of software. I don't know what it does. There's no one to ask, so how do I figure out if it's doing what it's supposed to do?" Let's look at the inputs, let's look at the outputs, let's look at the trends over time. And if you can mentally model that... and in the future, I think a lot of AI modeling of those inputs and outputs gets us that surface, and then the change points. Here was a deploy, here was a deploy, here was a configuration change, here's an external event. That tells you everything you need to know.

And what you give up is some of those intuitive hunches that come from working with a piece of software for a long time. And those are super powerful. Our very best troubleshooters always have a rich catalog that they can depend on to make those intuitive hunches. But they were never really reliable. There were always those nasty incidents - the ones you got into where nobody had a hunch.

I think that that's just the world that we have to be more effective in. It's just like going from that hunch-driven, human intelligence model into a lots-of-signals model - use some machine intelligence to make sense of the signals, but treat it as a need for generating understanding and a need for generating insights about a system. Not, here's a bunch of observability data that then gets pushed into our human intelligence. I think that that's the big shift.

Cory:

Yeah.

I think that understanding of the deployment releases and how the software is changing - having that better picture - is going to be so key. Especially in this world where we're just producing these PRs so much faster than you can reason about them. Faster than... I mean, honestly, it's like... it's funny, we were joking yesterday, it's like, we're generating software and our biggest slowdown point right now is stopping to review the code. Right? It's like we have these ideas of what we want, we write tests, code comes out. Now we're reading code.

And I'm like, "Man, I spent like five times longer reading the code than it took for me to will it into existence."

You know, we have to be careful, especially if you're working on more sensitive systems, about what we're merging, obviously. But we've never been great as an industry at looking at code, compiling it, and understanding what it was going to do in Prod, right? And like we're not going to magically get better at that.

I'd say we're going to get worse because the people that can do that are people that have been in the space for a long time. It's much harder to generate code, not really know what it does, lob it into production and hope it works. Right?

Nic:

I think more people are going to be doing what you described earlier, which is watching real time telemetry out of your system as you are interacting in like a vibe manner. Like vibe coding changes to a production system or a system under synthetic test loads - which by the way, AI is also really good at generating - and then seeing how it comes back to you in the signals of the system.

I think that loop is the solution to this problem because, yeah, otherwise if I really have to understand the PR, I might as well have just written it in the first place.

Cory:

Yeah. Awesome.

Well, Nic, thanks for coming on the show today and thanks to our special guests, the entire swarm of bees today. I appreciate you all dropping by and not stinging me.

Nic:

That's right. Shout out to the bee boys.

Cory:

See right here. I don't know, I have no idea. It's winter, guys. You take the summer... take the winter off. Take it off, go hang out with your beelets or whatever they're called.

Nic:

We'll see, your desk is going to be covered with like flowers and honey and you're complaining. You're like, "I don't know why the bees are always in here."

Cory:

Yeah, it's so funny. I swear something hums in here and they think it's their hive. Like, they're always like... There's flowers over there [points outside], like, there's lots of yummy stuff for them to eat. But they're always... they want to hang out in the lounge with me, I guess. I don't know.

Nic:

Yeah, they like the shade.

Cory:

Well, I really appreciate you coming on the show. It was very fun to talk about how New Relic's changed over the years. Where can people find you online?

Nic:

You'll find me on LinkedIn, on the New Relic blog, and kind of lurking about in your Slacks and all the various random places in the internet.

Cory:

Awesome. Well, thanks so much. And everybody, thanks so much for tuning in. We'll check you out next time.

We are moving back to regularly scheduled releases, so we'll have episodes out twice a month going forward for the rest of the year until Christmas time, when everybody's off. So thanks so much for tuning in and we'll see you next time.


About the Podcast

Platform Engineering Podcast
The Platform Engineering Podcast is a show about the real work of building and running internal platforms — hosted by Cory O’Daniel, longtime infrastructure and software engineer, and CEO/cofounder of Massdriver.

Each episode features candid conversations with the engineers, leads, and builders shaping platform engineering today. Topics range from org structure and team ownership to infrastructure design, developer experience, and the tradeoffs behind every “it depends.”

Cory brings two decades of experience building platforms — and now spends his time thinking about how teams scale infrastructure without creating bottlenecks or burning out ops. This podcast isn’t about trends. It’s about how platform engineering actually works inside real companies.

Whether you're deep into Terraform/OpenTofu modules, building golden paths, or just trying to keep your platform from becoming a dumpster fire — you’ll probably find something useful here.