Episode 31

Published on:

2nd Jul 2025

What CVEs Did for Security, CREs Are Doing for Reliability

Did you know that software engineers often "learn things the hard way" because they lack a standardized system to share knowledge about reliability issues? While security professionals have CVEs to catalog vulnerabilities, reliability engineers have been left to reinvent the wheel with each new bug or outage.

Tony Meehan, co-founder and CTO of Prequel, introduces us to Common Reliability Enumerations (CREs) - an open-source approach that's doing for reliability what CVEs did for security. After spending a decade at the NSA hunting vulnerabilities, Tony recognized that the same community-driven approach could revolutionize how we handle reliability issues.

This conversation covers:

  • How CREs help developers detect and mitigate reliability issues before they cause outages
  • The open-source tools Preq and CRE that allow teams to leverage community knowledge
  • Practical ways to implement these tools in your development workflow (locally, in CI/CD, and production)
  • How this approach can reduce cloud costs by identifying issues rather than over-provisioning
  • Tips for debugging mysterious production issues when no CRE exists yet

Guest: Tony Meehan, CTO at Prequel

Tony is an engineering leader obsessed with bugs. He dedicated a decade to vulnerability and exploit development at the National Security Agency (NSA) before leading Engineering at Endgame and Elastic. In 2023, Tony co-founded Prequel to change the way application failure is detected and resolved. 

Tony Meehan, X

prequel.dev

github.com/prequel-dev

Prequel, X


Links to interesting things from this episode:

Transcript
Intro:

You're listening to the Platform Engineering Podcast, your expert guide to the fascinating world of platform engineering.

Each episode brings you in-depth interviews with industry experts and professionals who break down the intricacies of platform architecture, cloud operations, and DevOps practices.

From tool reviews to valuable lessons from real-world projects, to insights about the best approaches and strategies, you can count on this show to provide you with expert knowledge that will truly elevate your own journey in the world of platform engineering.

Cory:

Welcome back to the Platform Engineering Podcast. I'm your host, Cory O'Daniel. Today I'm joined by Tony Meehan, co-founder and CTO of Prequel. Tony's background's wild.

He spent a decade at NSA working on vulnerabilities, so probably knows a tad bit more about security than I do.

Led engineering at Endgame and Elastic, and now he's on a mission to make software more reliable by shifting how we think about detection and failure. He's also one of those folks who genuinely loves bugs. Don't we all?

I gotta hear your weirdest production bug story at some point in time, but we're gonna get into all that today. Tony, welcome to the show. Can you tell us a little bit about you and how you got into the space?

Tony:

Yeah, Cory, I'm excited to be here in the dude shed with you out in LA. I mean, where to begin? I just... yeah, a little obsessed with bugs. Definitely kind of picked up the itch at the NSA looking for bugs for 10 years.

I don't know, I think finding bugs is, it's like the best way to learn how software works and have a better understanding of systems and it kind of scratches an OCD itch. You know, a little obsessed, but just my entire career, that's sort of been the theme.

And so, you know, after the NSA, I ended up joining a startup called Endgame, where we were building an endpoint security product. And so I got to do more of that and ended up joining Elastic a couple of years later, which was super fun. I loved working there.

But yeah, I think I've always sort of had this nagging obsession with finding bugs and really wanted to find ways to... you know, my background's cybersecurity, but, you know, building software products, you would always have, you know, an outage, an incident. And it was exactly the same experience as looking for a vulnerability as it was to find out the root cause to some outage or some big bug. So definitely scratch that itch as well.

And we were just excited about taking some of the lessons that we learned in the security community in how to build people, build a community, have people work together through tools and take that to the space of reliability problems. And that's kind of how we ended up here.

So, anyway, it's awesome to be with you. I love listening to your podcast and yeah, great to talk to you today.

Cory:

Oh, I appreciate that. Thank you, I appreciate that. That's cool. So it's funny thinking about bugs and failures.

I feel like as a software developer, we're writing software, we're opening a PR, my team's looking at this PR, going to merge it. There's so much headiness around that PR, right?

There's the person who's not familiar with the feature that you're working on, reviewing the code, trying to make sure that there's no failures, bugs, weird quirks that are getting introduced. But then there's how that code can interact with the rest of the system once it's merged.

The idea of trying to find these failures across the entire code base, I feel like is an overwhelming idea for an individual software developer, even a team working on this.

When you start thinking about failures not just as incidents to clean up after, but as problems that you are trying to detect before they happen, how are you thinking through these problems? And where does this integrate with my tooling as a software developer? Where do I put this in? Is it in my build locally? Am I putting this in CI/CD? Is it analyzing just my PR or the whole code base, kind of reflecting on what's getting merged together?

Would love to learn a little bit more about that.

Tony:

Yeah, good question. A couple things maybe to talk about there. The first is, yeah, it's, you know, putting up a PR, there's abstraction upon abstraction upon abstraction.

You're new to a code base, it's a big team, there's just... there's a lot of complexity and a lot of hidden complexity that's abstracted away from people. And so I mean, failures and bugs and unexpected interactions are inevitable. And one of the things that... it is overwhelming.

It's an overwhelming experience. And as you build up even more experience as a software developer, these things are still going to happen.

Actually, I just saw someone tweet this today. They were talking about they learned something the hard way. And I think that's a bummer.

I think learning something the hard way is a bummer because someone else somewhere has run into the same problem before. Almost always.

You know, of course you can have bespoke bugs in an application you're writing, but at the same time there's common developer anti-patterns that you can introduce into whatever you're writing.

So I think the starting point for us, like a good anchoring point is like, you know, contrast it with, you know, in the security community, anytime there's a new problem, you have this massive community of threat researchers that are posting their blogs where at the end of it they'll talk about, here's how you go and find this problem. But you know, in, in reliability and software bugs, it's like you're on your own, like, good luck, you have to go do the investigation.

And then after like hours or days of looking into it, you know, sometimes you'll stumble across like, "Oh my gosh, someone else has run into this exact same problem, and I'm the one that learned it the hard way. And here's how I could have detected it, here's how I could have mitigated it."

And I think the thing that we get really excited about is like, "How do we just start there?" Like, let's start with a detection.

How do you leverage community knowledge so that other people that have learned it the hard way can give you the benefit of that experience so you don't have to. That's like kind of the starting point.

And then like kind of maybe, and we can get into this, but like, kind of how our approach works is we work with this community of problem detection engineers. We have our own reliability research team where we write rules that kind of codify this knowledge of failure - like misconfigurations, known issues in open source software, or developer patterns, like I said - and take that detection, couple it with a mitigation in the form of a rule, and then actually run that on the data where the data sits.

So instead of sending data, you know, continuously somewhere else - that can get expensive or unpredictable - we actually bring the rules to the problem and then constantly update those rules from things that the research teams are finding, that the community is finding.

And so that way if some new problem comes up in Apache Kafka, or you're using something else, you know, like an ORM for SQL, and there's some new issue the community is facing, like you get that intelligence immediately instead of learning about it the hard way hours or days later.

Cory:

Yeah, and I feel like for any engineering team, that's great, right?

I mean, I've seen ORMs where they were SQL injection proof, but then there's like this one corner case where it's like, you sure as heck can SQL inject that part, right? And it's hard. Like you, as a developer, you're trusting that this library is tested. Right?

And it may or may not be a CVE, but SQL injection is going to happen sometime, right? And to find that is hard. So as a developer, just working on your product for a customer, that is a tough problem space to think about.

But I feel like for platform teams where we're integrating with an unknown number of plugins, I feel like that is even more of a... I've got to tie into Terraform and whatever Terraform modules these people are remotely referencing. I've got to tie into, you know, this configuration to hit a web server to push some metrics. Like there's a lot more integrations that we face as platform engineers.

I feel like this fits great in the general engineering population, but for platform engineers, I feel like this is pretty critical in making sure that the systems that we're building aren't introducing things like server-side request forgeries and whatnot.

Tony:

Yeah, yeah. I mean, look, we started off this journey really focused on developer bugs and applications, but as we worked more with customers - we're still doing that - but what we have learned is you're exactly right.

All of these interactions between different systems, Argo, Terraform, Vault, and then all of the sort of infrastructure components just to, you know, connect applications, the message queues and just all of these different things, just these interactions - there's just always something going wrong.

There's always like this chaos that's happening. Sometimes you're able to recover from it and sometimes it's just sort of this lingering thing that eventually is going to become some customer issue.

And I think, as we saw that - and then we would go do research on these problems and see that there were many people that had already run into them, talked about how they fixed it or mitigated it and it's some open source GitHub issue or whatever it was - we got excited about the idea of how do we actually codify that knowledge in the form of intelligence.

We call them Common Reliability Enumerations. So it's a rule that would allow you to basically automatically know about this thing instead of discovering it later. So that's kind of where the idea came from, too.

When I was at Endgame, we ran into a problem there that really stuck with us.

Cory:

Oh no. Yeah, that's rough.

Tony:

We would have these message backups and they just wouldn't get processed. We were using NATS, which is great. We actually still use NATS now, but we were super early adopters.

We ended up spending several days debugging the problem and discovered there was a deadlock in the client library that we were using. And about four or five weeks before that there was a GitHub issue where a bunch of people in the community had discovered this problem and like, "Here's how you can fix it." And so, five or six days later, we ended up finding the exact same issue. And it was like this bittersweet moment of like, "Okay, we've tested this, we know this is a problem, this fixes it." But also, "Oh no, what do we do if this happens again? How would we know about something like this next time?"

So I think that kernel, that experience, the most interesting production outage, I think blossomed into what we're doing today. It just, in a good way, haunted us.

Cory:

Yeah, I feel like I had that very similar thing happen to me recently and I was just banging my face against a wall for hours. It was a minor update, I think it was like the Erlang OTP version that we use and it changed something in how OpenSSL worked. And all of a sudden... it's like a minor update, our entire test suite passed... as soon as it hit production, no email notifications went out - just ceased.

And so we immediately rolled back. But then it was like, we need to do this upgrade for some other stuff, but there's something about this that just breaks the way authentication works with SMTP.

And like it's just all of our... all of our stuff, like everything's just worked for... the code base, it's like this part that hasn't been touched in three years...just ceased to work after like a minor... a minor version upgrade too. And it's just like what has happened?

We couldn't figure it out. And like, we couldn't get that like magic incantation of Google to like surface it. And it was literally just like searching for hours - cutting this, cutting that, like trying to figure out exactly what it was.

And then we got a search term that like hit somebody, on this like very specific library, that was like, "There's this weird scenario that I'm in that it's not working." And it was just like 38 conversations down on GitHub was the answer and it was just like, "Okay, the answer is bump to the next minor version up."

Tony:

Yeah, yeah, yeah.

And that's actually, that's kind of the funny thing is that a lot of the time the answer is like, "Yeah, you gotta upgrade to this newer version that just came out, you know, a month ago that fixes your specific problem you're having right now." I love those explorations and investigations to figure out like, "All right, we don't know why this is broken, but we gotta figure it out."

And then that moment where you finally do figure it out, it's like, "Oh, my gosh, this is awesome. We figured it out. This is great." It's like a nice endorphin release. It's a fun experience.

Cory:

It is, it is.

Okay, so is there something... is there a rule that Prequel has today that if you... that you're like, "Okay, I know this one's a problem."? Is there one that will give people listening right now anxiety that they're like, "Oh, my God, I didn't think about that."

Tony:

Oh, what an interesting question.

If you've been around building software applications... even for a short amount of time, but long enough... like, you probably already have this anxiety, like you already know where the bodies are buried. Like, "Oh, man, this is. This is not going to be good."

I think the thing that's really interesting is when you can do sort of this... We've built this distributed matching engine that allows you to do things like sequences of events - A followed by B followed by C - and to do correlations on those things, like, "Hey, on the same IP address or hostname," and then with negative conditions too. So, like, false positives are a thing that you have to kind of pay attention to, because if you're telling someone there's a problem and it's not a problem - you do that long enough, they're going to ignore it. And then there's a problem and it gets ignored.

So I say all of that because some of the rules that we have will look for cases where you run containers inside of certain cloud environments with a cgroup configuration where, when child processes crash - like an OOM crash - the main container doesn't crash. So you never know this is happening. It's like a silent OOM.

You know, when you see, like, nginx start having worker processes silently OOM because it's trying to process too many ingress objects at the same time, that then produces these 500s for your customers. Like kind of stringing all these things together, I think... that's one of the things that we've seen a couple times where people thought things were going fine and then we would sort of piece together like, "Hey, we see this problem happening in nginx coupled with some stuff that we're seeing in Kubernetes events as well as this application.", like putting those three things together. I think people didn't even know there was a problem with nginx because, again, the container was just running.

So yeah, I think it probably depends on what technology you're using. Like there's also problems with RabbitMQ that don't even produce metrics to trigger alarms. There's some example of this for every technology.

There's probably too many to go through. But yeah, that's a good question.

Cory:

Yeah, it's a good segue too because everybody that's now panicking about silent out of memory issues with nginx and RabbitMQ right now needs to go check out at least two open source libraries, right? At least two.

Tony:

Yep, that's right. That's right.

Cory:

So Prequel's open source. You have already open sourced them.

So I want to get into what they are and then like what brought you all to open source them, being that, you know, they were previously closed source. So it's CRE and Preq?

Tony:

Yep, that's right. GitHub.com/prequel-dev and then CRE and Preq.

And so the CRE repository is where the community is working together to publish these CREs - rules that describe problems, mitigations, and how to detect them. So it's kind of like marrying that knowledge in a way that makes it shareable and automatically updatable.

And then Preq is how you actually use those to go and detect the problems in your environment. And so that tool, Preq, runs on Mac, Windows, Linux. Runs in Kubernetes. We have lots of exciting things planned for it.

You basically take those rules and run them on your data. And the way you can plug it in is it can run standalone, or you can run it as a kubectl plugin, or you can run it inside of your Kubernetes cluster as a CronJob.
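As a rough sketch of that in-cluster option - running Preq on a schedule as a Kubernetes CronJob - the manifest below shows the general shape. The image reference, arguments, and ConfigMap name are placeholders rather than the documented Preq invocation; check the prequel-dev repos for the real install steps.

```yaml
# Hypothetical sketch: schedule Preq inside a cluster with a CronJob.
# Image, args, and the rules ConfigMap are illustrative placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: preq-scan
spec:
  schedule: "0 * * * *"              # run hourly
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: preq    # assumes RBAC to read the logs/events you scan
          restartPolicy: Never
          containers:
            - name: preq
              image: ghcr.io/prequel-dev/preq:latest   # placeholder image reference
              args: ["--rules", "/rules"]              # placeholder flags
              volumeMounts:
                - name: rules
                  mountPath: /rules
          volumes:
            - name: rules
              configMap:
                name: cre-rules                        # community + custom CREs
```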

Cory:

Oh cool.

Tony:

There's lots of different ways to consume it.

And yeah, I'm happy to get into the motivations for doing the open source, but those are the two tools that we just launched a couple of weeks ago.

Cory:

Yeah, yeah, let's talk about the tools a bit and then we can get into motivations. I love talking about... especially like given all the big license changes recently... I love seeing companies still open sourcing products and like kind of what drives them to do it.

Tony:

Yeah, it matters.

Cory:

It does, it does.

So Preq, you can run it locally too. So I can bring it into like a pre commit and like start to see this stuff before I even open a PR.

I feel like that's one of the things that's like disheartening, as a developer doing TDD I sit there, I write these tests, I write this code, I get it working perfect, I push it up to git and then all of a sudden Dependabot's like, "You're a fool, you did that wrong." And then I have to go rethink how I did something, right?

And so you can bring this right into a pre-commit, have it running locally, and have more confidence in your build before even bringing your team in.
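For that local, pre-PR workflow, one way to wire it in is a local hook in pre-commit. This is a minimal sketch assuming a `preq` binary is already installed on the developer's machine; the entry command is a placeholder, since the exact CLI invocation isn't covered in the episode.

```yaml
# .pre-commit-config.yaml - minimal sketch of running preq before each commit.
# The entry command is an assumption; substitute the real preq invocation.
repos:
  - repo: local
    hooks:
      - id: preq-reliability-check
        name: preq reliability check
        entry: preq                  # placeholder; point this at the installed binary
        language: system
        pass_filenames: false        # scan configured data sources, not staged files
        stages: [pre-commit]
```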

Tony:

Yeah, and you don't even have to contribute back either. I mean, we want people to contribute back.

But there are a lot of people that are writing rules today that are very unique and specific to what they're doing and yet they still benefit from updates from the community whenever there are new rules published, you know, every couple of days.

Cory:

So let's talk about... so CRE, like that's the root of it. That's where the rules are then. And Preq's the tool that you run.

So let's maybe talk about like CRE a bit. So like, what are these rules? Like are they language specific? Are they like protocol specific? Like what level of knowledge do you have to have to like start working on and developing these type of rules?

Tony:

The most important part about the Common Reliability Enumeration schema is that it's a schema, it's really just a set of fields. If you know YAML, you know CRE. So you know how to write a CRE.

Cory:

Oh yeah, we all know YAML.

Tony:

Yeah, exactly. You know, it's sort of. There's that famous XKCD article about like the query language to solve all query languages. This is the last one.

You know, it's not a new language, it's just YAML. And very simply, the actual language is just describing a problem - its severity, its impact, how easy or hard it is to mitigate.

What is the cause of this problem? What is the impact other people in the community have seen? Like when I saw this problem, this is what would happen.

And also, like, if there's a mitigation. So like you said - the comment that was buried 30 comments deep in a GitHub issue. Like, here's how to fix it.

Like, how do you surface that up to the top? Yeah. And then coupling that information - like, what is this problem, how do you fix it - with the actual way to find it.

So that way, when you find it, it's immediate - like the Google search with that term that you, you know, finally found has already been done for you. Like, it's right there. You just go do that thing, and then there's references to all of those results.

So I mean, at a fundamental level, that's the idea. And the way the language actually works... well, it's YAML.

But the way the description of the problem works is, just like I said, it's a sequence of events - you can also do a set, so order doesn't matter. But you're describing these conditions that must be true or not true within a window of time, with correlations that can help you find that problem.

And for Preq, the open source tool, the data sources that you can run the rules on are things like standard in, log data, configuration data, and then the enterprise commercial version has a much richer set of data sources that you can run it on, like process events, Kubernetes events, time series data, lots of other data that you might be interested in looking at.
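To make that shape a little more concrete, here's a loose sketch of a CRE-style rule based only on the fields Tony describes - severity, impact, cause, mitigation, references, and a detection made of correlated conditions inside a time window. The field names and structure are illustrative assumptions, not the published schema; see github.com/prequel-dev/cre for the real format.

```yaml
# Illustrative sketch only - field names are assumptions, not the actual CRE schema.
cre:
  id: CRE-EXAMPLE-0001
  title: Silent worker OOM behind a healthy-looking container
  severity: high
  impact: Requests start returning 500s while the main container appears healthy
  cause: Child worker processes are OOM-killed without crashing the parent container
  mitigation: Raise the memory limit or reduce per-worker load; see references
  references:
    - https://github.com/example/example/issues/123    # placeholder reference
  detection:
    window: 5m                    # all conditions must occur within this window
    correlate: [host]             # only match events coming from the same host
    sequence:                     # A followed by B, with a negative condition
      - event: child_process_oom_kill
      - event: http_5xx_rate_increase
      - negate: true
        event: container_restart  # the container itself never restarted
```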

Cory:

Very cool.

So like, as far as like SRE and like kubernetes, you're going to just attach it to all the events that are happening, have that like kind of running on those events as they're coming through.

Tony:

Yeah, exactly. Yeah. Yep.

Like if you're, you know, if you want to know, hey, do I have a deployment with too many replicas scheduled on the same node or in the same cloud region and I want to know about it because there's a risk of an outage taking down my service. Like you can do things like that.

Cory:

Oh, that's cool. So it's not. So it's not just like CVEs and like, oh, I found the SSRF. It's like, yo, this, this right here is going to absolutely ruin your day.

Tony:

Yes, yes.

Cory:

When, when us-east-1 goes down again.

Tony:

Yeah, exactly. It really is about trying to prevent people from finding out the hard way.

We want to take advantage of that one person that first time that found out the hard way. Let's make that the last person that had to find out the hard way.

And then take this knowledge and spread it out and use it in an automated fashion so that whenever that happens to someone else, it's detected and it's mitigatable, like immediately.

Cory:

Yeah. Oh my gosh, I wish I knew about this weeks ago.

Tony:

Yeah, well, I mean, look, it's a new idea, it's a new approach.

You know, we were doing this in security for a long time, but again, in reliability, when there's a problem, you just go to your dashboards - you probably have, you know, tens or hundreds - and you're just sort of looking around for a while, and then you narrow in, and then you do exactly what you just said. You're googling, you're asking people. You know, it's a long, drawn-out process to, kind of like Neo from the Matrix, learn what's happening here. What do I need to learn about right now, like what's happening right now?

So yeah, I think we're excited about, you know, instead of starting with an investigation, how do you start with the detection.

Host-read ad:

Ops teams, you're probably used to doing all the heavy lifting when it comes to infrastructure as code - wrangling root modules, CI/CD scripts, and Terraform just to keep things moving along. What if your developers could just diagram what they want, and you still got all the control and visibility you need?

That's exactly what Massdriver does. Ops teams upload your trusted infrastructure-as-code modules to our registry.

Your developers? They don't have to touch Terraform, build root modules, or even copy a single line of CI/CD scripts. They just diagram their cloud infrastructure. Massdriver pulls the modules and deploys exactly what's on their canvas.

The result? It's still managed as code, but with complete audit trails, rollbacks, preview environments, and cost controls. You'll see exactly who's using what, where, and what resources they're producing - all without the chaos. Stop doing twice the work.

Start making infrastructure as code simpler with Massdriver. Learn more at massdriver.cloud.

Cory:

Let's say something gets detected. Let's say that I have this running - maybe I can run it in a GitHub Action. Yeah.

Tony:

Yep.

Cory:

So that will just... like, in my build I'll just see that - boom, that workflow fails and here's the issue. And then is there like a link to what the resolution is, or does the tool actually suggest, like, suggest the change?

Tony:

So the Preq tool itself... maybe a couple things here about where you can run it. People are running it in CI jobs, they're running it in Jenkins builds.

But a lot of people are actually getting a lot of advantage, or a lot of benefit, from running it in production. So they'll run it in production, QA, and Jenkins. So they try to find issues early, but sometimes things still slip through.

And then you can also run it as like, you know, a build job, a CI job. And then when a problem is detected, you are presented with the CRE schema and the rule and the mitigation, the references.

But you also have an opportunity to automate it with a runbook.

So you can do, you know, things like create a JIRA ticket, send a Slack notification, or you can even execute like an arbitrary binary or shell script, given the input of what was found and take some specific action. And then there's even rules that you can specify in those automated runbooks, those automated actions.

So for this CRE, if this happens, I want you to do these three things in order.
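As a sketch of that kind of runbook wiring - ticket, notification, then a script, executed in order when a specific CRE fires - the snippet below shows one possible shape. The keys and templating here are invented for illustration; the actual Preq runbook format may differ.

```yaml
# Hypothetical runbook sketch - structure and keys are illustrative assumptions.
runbooks:
  - match: CRE-EXAMPLE-0001            # which detection triggers this runbook
    actions:                           # run in order when the rule fires
      - type: jira
        project: OPS
        summary: "Reliability issue detected: {{ .cre.title }}"
      - type: slack
        channel: "#platform-alerts"
        message: "preq detected {{ .cre.id }} on {{ .host }}"
      - type: exec
        command: ./scripts/mitigate.sh # arbitrary script, given the finding as input
```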

Cory:

Nice.

Tony:

It's sort of all about like, hey, we do want a human to be able to like, make a judgment on this call. But you could also automate it, you know, if you feel very comfortable with that automation.

Cory:

That's a cool integration because I know... So, I mean, I think as I'm hearing this, I'm like, this is so cool.

But I'm like, I'm immediately afraid to like, throw it in the code base because you're like, man, if this just like stops all my builds. Because we've, you know, we've got these things that we just don't know about. But.

But to be able to tie it into a runbook where it's like, hey, I want to know and I want a ticket opened.

Tony:

Exactly.

Cory:

I just don't want to halt the build. Right. I would love to see warnings, but that is, that is pretty rad. Right? And now I feel like that's.

That's actually a bit of a boon because now I feel like this is one of those things that's so hard. This is actually. I love this.

This is one of those things that sucks so hard is like communicating debt to like, your project managers, product owners, et cetera.

Tony:

Right?

Cory:

And so it's like, oh, hey, person, that's probably not looking at actions. There's a wall of stuff that's wrong here. And they're like, I don't understand any of this, but they do understand tickets, right?

And if your runbooks start opening up a bunch of security vulnerability tickets, like, that could be the thing that helps you push through, like this true SRE idea of like, error budget. Like, we have problems and it's hard for us to communicate it to the rest of the org.

But now it's just like, look, this is stuff that we need to focus on, and it's tickets, and somebody. Somebody's gotta schedule it or close it right now. Right.

It's so much easier than showing somebody a wall of a failed build and being like, I told you, we've got some security things we need to deal with. Like, that is tight.

Tony:

We actually do have several customers that have specifically said giving our product managers and leadership visibility into, like, the daily chaos that we have to wrangle and fix is actually providing almost as much value as actually preempting - like, early on fixing those issues before customers are impacted. Yeah, because, yeah, it is a lot of sort of unseen work that platform engineering teams are having to deal with on a daily basis.

And I think giving light and visibility to that is very helpful.

Cory:

Yeah, that is tight. I remember there was a company I worked for maybe eight or nine years ago, and we had.

We had this debt problem where it was like, the product had existed for, like, 10 years. It was revenue positive from, like, day one. So this company just went and went and went and went and went.

And there was just so much debt, like, just left by the wayside. And it was just... everything was always in the pursuit of revenue.

And this company was very good at making money, but, like, the code base was just torturous to work with. And we had a really hard time communicating the debt, like, to the team. And we actually built some tooling internally to surface it.

And so, like, we had these, like, comments that you could put in that was like, hey, this is impacted by another piece of debt.

And so you could put in, like, the ticket number, and it would actually build a dashboard, like, relating back all the tickets that were, like, slow and, like, off from their estimates that were tagged with debt. And it would be like, you'd go look at a ticket that was like, hey, this is the debt, and you'd see, like, 48 PRs reference it.

And so it was very visible. And what happened was the product managers started seeing this, how this debt was impacting the features that they were trying to get out.

And now all of a sudden, that empowered us to start prioritizing debt. It was. It was easy to communicate. I feel like this is. This is great for security teams and SREs that are like, we have problems.

We know there's problems. And, yeah, that is a really cool integration.

Tony:

And I think it's also... I think the thing that gets us excited is how do you take that one- or two-person team, small team somewhere.

Cory:

They're all small teams.

Tony:

Yeah, yeah, exactly. And just like, how can we all benefit from one another's collective knowledge? Like, how do we do that? How do we enable that? Yeah.

And that, to us, is very exciting because the exact same approach worked in security 20 years ago because it was the same story then. It was a small team of security people.

And once people found a way to share, kind of instantly when there was a new problem, the game kind of changed. And I think there's an opportunity in reliability to do the exact same thing.

Cory:

That is pretty neat. To go back into CRE really quick - the rules, they're all open source as a community. Everybody's... everybody's putting those rules back in there.

What quality gates are there to make sure that people aren't like, kind of poisoning the well?

Tony:

That's a very good question. And this is perhaps where, you know, the NSA background is helpful, because...

Cory:

He'll send people after you. No, no, no, no.

Tony:

That's actually not what I meant. No, it is what I meant. That's what I meant.

Cory:

It is what I meant.

Tony:

Whenever you're looking into a problem, it's really important to have a reproduction. You need a way to reproduce the problem so that you can validate, you know, what the problem is. You can see it.

Cory:

Yeah.

Tony:

And you can validate that you can detect it and even fix it. And so one of the kind of core rules that we have for any submission for CRE is that you have to be able to demonstrate the reproduction. So we're not discouraging people from using, you know, AI to help articulate some of the words for your title and that sort of thing.

But at the end of the day, without a reproduction where you can prove the problem's happening and prove that the rule works, we can't accept the submission. And I think that's probably one of the most important quality gates for accepting rules in a community - demonstrating that you have the reproduction. And it's not just a video. It's like you've got to actually have a shared repository somewhere where there's an actual reproduction scenario that anyone can run.

It's a scientific method. You gotta be able to allow the community to replicate the test you did.

Cory:

That's pretty cool. I mean, it's rigorous, but I mean, that's how you stop people from just populating it with junk, turning it into adware for their security firm.

Right. That is very cool. And I feel like so bringing up LLMs there, like being able to use an AI to punch up the titles and whatnot.

How do you see the world of LLMs fitting into this?

I feel like, tell me to bleep any of this if I have to, but I feel like a partnership between you all and either GitLab or GitHub... where it's like they have issues, they have just walls of comments. I feel like there's just... there's so many of these CREs out there that people have found that are just, like, lost 38 comments deep in GitHub.

Tony:

That's exactly right, yes. I mean, first of all, I think having a reproduction, it actually does require a lot of work.

And the good news is that there are many projects out there, like Istio and others that have troubleshooting guides. Through their experience of people finding problems and doing the reproductions because they ran into the problem, they've like taken all that knowledge and put them in these guides. And you can actually write rules from those things fairly quickly, which is kind of cool, but it's still a human doing it.

So I think there's a couple of things that we get really excited about with AI. The first is actually using it in the pipeline of reproductions. So I don't know if you've ever used OpenAI's Codex, but it's actually pretty cool.

Like you could watch it check out a Docker container, download your GitHub repository, you give it a task to say, "Hey, I want you to go increase my test coverage to 50%." And it'll go and do that and it'll actually test it and then put up a PR and you can actually watch it do its work.

And I think one of the things that we've been really excited about is leveraging models in a very similar fashion, but for the reproduction. And so I think that's one way that we get... we're excited about the future of scaling a process like this with AI. And that's sort of an important thing that we thought about with things like LLMs.

I think another piece that's really important is the schema is nice. It marries this mitigation, the impact, the references with how to detect it. But sometimes nothing beats a really good story, especially whenever it's concise. And I think LLMs actually do a really good job of summarizing content, especially maybe complex content that's like, "Hey, first this thing happened over here, then this thing happened over there."

So another thing that we've been doing with LLMs is when a CRE is detecting a problem, we'll actually take an LLM and say, "Okay, give us like the couple of sentences that describe the problem and walk us through it step by step, just a couple sentences at a time, and use the actual context of the rule."

And the cool benefit of this is that you're not taking gigs of RabbitMQ data and putting it into an AI model and telling it like, "Hey, tell me what happened." You actually have this intermediate step that's reduced your token count. It's more focused. And so the actual content you're sending to the LLM is much less and it's cheaper and it scales better. You know, your CFO might be happier. So I think that's like the second thing that we get excited about.

And then the third piece is just in rule creation itself. Just like imagine when you're doing a development in Cursor, like the same exact experience applies to writing a CRE.

Cory:

I don't know that it makes a CFO happier. I think you can only make a CFO less mad. I don't know that you can make them. I've never met one that you could make happier.

You can make them less frustrated.

Tony:

Yeah, yeah, yeah. Okay. Fair, fair, fair, fair.

Cory:

If you have a good CFO, congratulations. Sorry.

Tony:

That's fine.

Cory:

I'd be curious, like, you said something there, like, about the Istio team. It seems like especially these teams that are managing extremely popular open source libraries, they actually have a wealth of this information, maybe codified back here [signals to his head with his hands] or in their git repos. Is there like a means of... almost like a framework of how all these open source projects can get this stuff back in?

Tony:

Actually, that's an excellent question.

One of the things that we started working on in the last couple of weeks is partnering with open source projects. Because again, you're building up this wealth of knowledge of known issues. And it's not just the open source maintainers. You know, a lot of these open source projects have commercial companies behind them with customer success teams that have scripts that they run of all of their known issues. And so they're developing all of this stuff themselves.

Just imagine a world where you can take all of that knowledge and share it and put it in like a repeatable way that's detectable.

That gets really exciting to us because it just kind of speeds up all of those teams and makes that knowledge something that can be automated by a machine and then leveraged by AI.

Cory:

Yeah.

And it's like, you know, if you are a for profit company that has an open source tool like that, to share your private knowledge is good for you because it's going to increase your open source adoption which is going to increase your pipeline for your enterprise product. Right?

Tony:

Yeah. And you asked earlier, sort of maybe what's the motivation behind launching an open source project? I mean, I think there's a couple things there.

I was at Elastic for four years, and getting to be part of that open source community was really exciting to me.

Elastic's open source community is amazing. I think when you're building a community and leveraging knowledge, it's really important to put the mission first. The mission is what matters. We want a world to exist where learning it the hard way doesn't ever happen to anyone else, just happens once.

And so I think in order to make that true, there shouldn't be a paywall between you and that objective. So the open source aspect of it, I think, is just really important from a mission perspective - like, how you actually achieve this goal. So that's why we went Apache 2, that's why we launched those two projects with that license. And I think it's going to pay off.

In the long run, we want the world to be a better place, and I think open source is an important... I mean look, every commercial product that's ever been created uses open source. That's just a statement of fact. So yeah, I think that gets us excited. The mission-first focus with open source, that's the way to do it.

Cory:

Yeah, I think this is a product and I think this is a space that kind of rises the tide, like it lifts all boats. Because I think the reality is there's so many teams that are using these tools that aren't security experts.

Not every developer can be a security expert - we would be in a tough spot if they all had to be.

And so the reality is everything that we do is impacted by the security constraints and experience of other companies.

Tony:

Yep, yep, exactly.

Cory:

And their outages. Right?

So it's like it is hard and it's like, you know, just knowing in the time that I develop outside of CEOing, like I have had exactly one of these. And it's like we would have been... I think we would have launched this feature we were working on like three days faster if this CRE wouldn't have cropped up on us. Right?

Tony:

Yeah, exactly.

Just to make sure it's crystal clear... I do this because we have a background in security... CREs and Preq are actually not for... it's not security. It's specifically only reliability.

There are lots of cool tools out there like Snyk and others - like you've actually had some conversations with folks at Snyk before. They're doing a great job handling/detecting vulnerabilities. And we've actually kind of abandoned that world because it is so big.

And we get really excited about trying to take those lessons, those same principles, but to an entirely different space, at least to us, which is reliability problems. Just normal, plain old, you know, interesting software bugs.

The sort of overlooked, I feel like, but very important because whenever you have an outage, it's typically because of a bug and not because of a vulnerability.

Cory:

Even more important because, I mean if you look at the surveys like year over year from Stack Overflow, State of CD, like the amount of people with cloud operations experience is going down relative to the number of software engineers that we have, because we're just producing them out of boot camps - which is great. We need more software developers - sorry, guy from Claude that disagrees with me, but we need more of them.

But we also need more operations experience, right? And like that SRE-ness is like... a lot of people's SRE, their reliability is directly, or I guess inversely, tied to their cloud costs. How do they solve problems? They just over provision.

You want to start getting your cloud cost under control? It's not buying a cloud cost tool, it's investing in SRE.

Tony:

Yeah, right.

Cory:

Being able to have more reliable systems with less compute is how you save money. Not by just buying a tool that's like, "Hey, this Aurora is expensive." It's like, "Yes, I know this Aurora is expensive. I have 85 gigs of RAM in it because I want to make sure it doesn't go down." It's like get somebody that knows how to run the thing.

Tony:

Actually, it's funny you say that. One of the biggest values that we've seen customers get from taking this approach has been in reducing their cloud costs. Because you're right.

In the past when there have been issues and problems, you kind of have a couple of levers you can pull. One is, okay, let's go take some people off some high priority feature and go investigate this problem and fix it.

Another one is add more replicas, scale it up and hope it happens less. And definitely that's something that I think you can... People are pulling that lever all the time because it's fast, but in the long run it does end up costing you a lot more money.

Cory:

Yeah, it was funny I was just talking to somebody the other day, like one of the things I love seeing in Terraform, one of the things that I try to do is I try to express the configurations in the developer's language.

So like, rather than say, "Hey, developer (who probably has no experience with AWS instances), which instance type do you want? Do you want an r6g extra, extra large?" It's like, "I want one that's not going to wake me up at 2am." That's what I want as a developer, right?

I like to present my Terraform in like very much the developer's language. So it's like, "Hey, how much growth rate are you expecting on this database?" And then calculate like the instance type like behind the scenes for them.

And so I feel like, you know, a lot of times you'll go into organizations where they haven't had somebody with SRE or Operations experience and you look at an Aurora instance and it's got 15 replicas and you're like, "Why does it have 15 replicas?" And people are like, "Don't know." Like, "Why are they R6 extra larges?" and people are like, "We... that's just, that's a...". It's like this whole thing is expensive and we don't know why.

And it's just like being able to understand why and understand like the reliability of it is, I think, something that many organizations are missing. And I feel like a lot of them see that symptom of high cost and they try to treat cost rather than trying to treat a more professional approach to reliability.

Tony:

Yep, totally. What's exciting about technologies like Terraform and Docker is it's almost like this manifest approach of describing what you want. For Terraform it was infrastructure, for Docker sort of like software orchestration, but the same doesn't really exist for how to detect problems. You know, what you're going to have to monitor for.

And I think there's a real opportunity with CREs to do the same thing that Terraform and Docker did for their respective spaces. It's like, instead of coming up with a detection only after the problem, how about we actually say let's subscribe to the types of detections and monitors that we want to have in place first... so originally, when we're actually constructing and building this project, this software.

Cory:

Well, I know we're getting close to time. I have a few more questions that are a bit more rapid fire I'd love to ask you. How do you feel about that?

Tony:

Let's go.

Cory:

So first one, what is the weirdest or most memorable bug that you've actually chased down? If you can talk about it.

Tony:

I keep coming back to that problem we had at Endgame - the NATS client deadlock. Somebody in the community had already written about it, like on a Medium article, before we ever figured it out.

But I think the reason why that one was so rewarding is because sort of the start of the journey is like, there's no obvious answer to why this is happening. We have no idea what, what's going on here.

And we had to write some extra tools to do some additional introspection into the messages to kind of really hone in on this deadlock. And then once we had the theory, like, we had to test the hypothesis. And so actually testing that out and like seeing the reproduction and then seeing the reproduction go away with the fix was just like... I love a story that begins with, I have no idea how this ends. Like, no clue. Like I was just... no, no idea, but we're going to have to figure it out.

And then when you finally get that answer, especially after it's hard and like grueling and you're learning something new, you're learning new skill sets to actually solve the problem. To me, that's like a really rewarding experience.

And it's honestly why I get so excited about building a community around problem detection because there are many people out there that have gone through the exact same work. And wouldn't it be great if you could benefit from that? I think that's the thing that gets me really excited about it.

Cory:

Yeah, I think the thing that's so cool with like this whole idea is like, there's so many software projects that I've seen where they have like a whole section of their test suite around making sure regressions don't get reintroduced. And it's just like that's a way of treating a problem, not addressing a problem by treating a symptom of it.

Okay, so assume there is no CRE for something you've just discovered. With your years of expertise in hunting bugs, like, what are some tips and tricks you've learned for figuring out what's causing an issue when there isn't a CRE for it yet?

Tony:

What a good question. Maybe I would think about... So Brendan Gregg is someone that you should know. Go check out his blog. He talks actually about a little bit... he calls it the USE Method. There's sort of a process that he describes for how to go and find problems through data that you're collecting. And I think that's a good read. So go check that out.

Cory:

Okay, we'll put that in the show notes.

Tony:

Yeah, it's good. There's sort of like a general approach and then there's like specific skills to build that fit into this approach.

So when a problem happens, you basically are a detective. You're like a homicide detective. A murder has happened and you have to start putting together a timeline. Like, what has happened? When did it first start happening? You're going to have witnesses you have to go interview. Those witnesses could be logs, it could be humans, it could be TCP, like dumps from Wireshark. You know, whatever it is, you've got to go and collect a bunch of evidence.

And that evidence might be lying to you. It might be like a red herring. It might take you down a rabbit hole.

Cory:

A river full of red herrings.

Tony:

It doesn't matter. And I think the other thing is like, if you feel like you start making these assumptions, "Oh, I bet I know it's this. Or it's probably that."

You're wrong. You're probably wrong. And so again, a detective, a good detective is going to just follow the evidence. He or she's going to look at the timeline and pull all this stuff together. And as you're putting the timeline together, different theories can start emerging.

Like, "Okay, well, if I connect these dots in this order, it could be this", or "If I connect these dots, it could be that." So then you start thinking about what is the most likely hypothesis here that would explain how these things are happening?

And you also see holes like, "Oh, we don't know what's going on here. If the hypothesis is this is what's happening, well, we are missing a piece of evidence that would actually make this more likely. So let's go interview that witness." Like, we forgot to interview that, you know, whatever that is.

So I think that, like, that general approach is really important when it comes to hunting down a bug. It's, you're a detective, you're putting together a timeline, you need to go interview witnesses, and you can't make any assumptions.

And once you've collected enough data, you look at it, you see what theories emerge from it. If you have a theory, but the data doesn't support it, you either need to go find more data or eliminate the theory, and then you start testing it.

I don't know. That's a really good question. This is what I love. That's what I love, that is like it, right there.

Cory:

That's funny. I'm like, I wish I could go back in time and just, like, vet this against, like, every time it's happened. I am definitely a jump to my gut, like, "Oh, I got this." And I don't know how many times... this problem that I ran into the other day, I thought I knew exactly what it was.

I drove about eight hours of effort into, like, "I know what this is", and I was absolutely wrong. And it was just, like, I got to that point where I'm like, "I better go look for some data."

Tony:

Yeah

Cory:

Like, I better go... And it was just like, dude, I spent so much time like, "Oh, I know exactly, exactly why this is happening."

Tony:

It's so funny you say that.

Cory:

Nowhere close.

Tony:

I do it too. Honestly, I do it too. It's, "I've seen this story before. I bet it's this."

It's like, almost always when I see a problem come up, that's like my first instinct. And I have to fight it a little bit, though, like, "Okay, but I could be wrong. I could be wrong."

All that means is that you've just sort of had the experience build up over years, where you can accelerate that timeline and witness investigation process, and come up with the theory fairly quickly.

But as long as you're still validating it and like going to look - I think that's, you know, still the right approach.

Cory:

It's been really great having you on the show. This has been super fun.

So we'll put the links to the open source projects in the show notes, but where can people find you on social? On LinkedIn, X, Bluesky.

Tony:

Prequel.dev is the website. All of our socials are on there. You know we're on the Blueskys and the LinkedIns, but I would just send people to prequel.dev.

Check out our blog. Our reliability research team is always putting out new content. Like you asked that question earlier, "What's the most interesting rule?" Every couple of weeks we're putting up a new story that ends with a rule at the bottom of it. So I would just go check out our blog.

Definitely go check us out on GitHub, throw us a star. But better yet, try us out. We're looking to grow the community and excited about building this future together with everyone in it.

Cory:

And again, just a reminder for everybody, it sounds like it's easy to bring this in. You don't have to get it into your Kubernetes cluster, you can bring it down locally on your MacBook, Linux, whatever. Try it out, move it to CI, put it in production when you're ready.

Tony:

Yes, exactly. You can use this soup to nuts without ever talking to us.

Cory:

Hey, is there a thing an OPS person loves more?

Tony:

Exactly.

Cory:

Awesome. Well, it was great having you on the show and thanks so much for the time.

Outro:

Thank you for listening to this episode of the Platform Engineering Podcast. Have a topic you would love to learn more about? Let us know at cory@massdriver.cloud. That's C-O-R-Y at massdriver dot cloud. Catch you on the next one.


About the Podcast

Platform Engineering Podcast
The Platform Engineering Podcast is a show about the real work of building and running internal platforms — hosted by Cory O’Daniel, longtime infrastructure and software engineer, and CEO/cofounder of Massdriver.

Each episode features candid conversations with the engineers, leads, and builders shaping platform engineering today. Topics range from org structure and team ownership to infrastructure design, developer experience, and the tradeoffs behind every “it depends.”

Cory brings two decades of experience building platforms — and now spends his time thinking about how teams scale infrastructure without creating bottlenecks or burning out ops. This podcast isn’t about trends. It’s about how platform engineering actually works inside real companies.

Whether you're deep into Terraform/OpenTofu modules, building golden paths, or just trying to keep your platform from becoming a dumpster fire — you’ll probably find something useful here.