Infrastructure As Code: Business Continuity And Disaster Recovery With Cory O'Daniel

Chris Hill sits down with Cory O'Daniel to talk about how Infrastructure as Code can help with disaster recovery and business continuity. From the technology and personnel challenges to scenarios such as losing one of your regions and the importance of backup plans, learn how IaC can be used to help ensure data and operations are not affected.

Guest: Cory O'Daniel, CEO at Massdriver

Transcript

Intro: 00:00

You're listening to the Platform Engineering Podcast, your expert guide to the fascinating world of platform engineering. Each episode brings you in-depth interviews with industry experts and professionals who break down the intricacies of platform architecture, cloud operations, and DevOps practices. From tool reviews to valuable lessons from real-world projects to insights about the best approaches and strategies, you can count on this show to provide you with expert knowledge that will truly elevate your own journey in the world of platform engineering.

Chris: 00:45

Hey, this is Chris Hill, COO and co-founder of Massdriver. Today I am talking with Cory O’Daniel, CEO and co-founder of Massdriver. And today we're talking about how IaC, things like Terraform, can help with disaster recovery and business continuity.

So let's talk about it. How can IaC help us with these scenarios?

Cory: 01:05

Yeah, honestly, I think infrastructure as code is probably one of the most key tools to business continuity and disaster recovery scenarios. The recent State of the City report had a pretty scary number in it. Only 27% of organizations are using infrastructure as code.

Wow. Yeah. And so when you think about during disasters that we can have from either just regional outages or more catastrophic disasters with data loss, like that's, it's pretty scary to think about, like, how are we going to reproduce these systems?

And in a world where a lot of people tend to be using scripts or doing click ops, it's very hard to get a system up in a different region when you're under pressure.

Chris: 01:40

Absolutely. So, I mean, with these scripts and things like that, it seems you need some sort of consistency and reproducibility. And is that really where IaC comes in?

Cory: 01:50

I think it is. And I think that doing good IaC that works well for disaster recovery is also, that is in and of itself a challenge, right? Like, so things like codifying regions into your code versus accepting them as inputs, right? It's something you really have to focus on when you're doing your development, but also what services, right?

So when you look at AWS, Azure, GCP, from region to region, you are going to have services that are available or not available, or configuration that is or isn't available yet. So part of it isn't just codifying the systems that you have in infrastructure as code, but also being aware of where you can migrate these in the event of an outage so you aren't trying to replicate to a region where that service might not be available.
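As an illustrative aside (not part of the conversation), the two points Cory makes here, taking the region as an input rather than hard-coding it and checking that a target region actually offers what the stack needs, can be sketched in a few lines of Python. The region names, the `SERVICES_BY_REGION` table, and the `Stack` type are hypothetical placeholders; real availability data would come from the cloud provider.

```python
from dataclasses import dataclass

# Hypothetical availability table for illustration only.
# Real data would come from the cloud provider's own service listings.
SERVICES_BY_REGION = {
    "us-east-1": {"postgres", "kafka", "kubernetes"},
    "us-west-2": {"postgres", "kubernetes"},
}


@dataclass(frozen=True)
class Stack:
    """A stack described independently of any region."""
    name: str
    required_services: frozenset


def missing_services(stack, region, catalog=SERVICES_BY_REGION):
    """Return the services the stack needs that the region does not offer."""
    available = catalog.get(region, set())
    return sorted(stack.required_services - available)


def first_viable_region(stack, candidates, catalog=SERVICES_BY_REGION):
    """Return the first candidate region that offers every required service."""
    for region in candidates:
        if not missing_services(stack, region, catalog):
            return region
    return None
```

Because the region is an argument rather than a constant, the same stack definition can be checked against any failover candidate before an outage forces the question.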

Chris: 02:32

So you're talking a bit about the technology challenges. What about like the personnel challenges when you run into a situation like this, or even just, they call it the bus factor of staff changes, things like that. How can IaC help us with that sort of stuff?

Cory: 02:46

I think IaC is a great tool for collaboration, being able to show another engineer the change you plan to make in your infrastructure and have that approved through a pull request. But one of the other things that's fantastic about it is as a documentation tool to actually understand what you have in the cloud, how it's configured, where it's deployed, and whether it's a bus factor of one and you lose the one operations engineer your team might have, or whether your company is going through a RIF or layoffs, right? Like business continuity is there as well, right?

And so making sure that we have these tools in place so that teams that are coming on or teams that are taking over responsibilities for people that have left understand what is provisioned, how it's secured, and the different configurations that are applied to different regions.

Chris: 03:31

Absolutely. So let's dig in on like scenarios where you've got multiple regions. So let's just say you lose one of your regions and that's maybe your main region where you had data.

How do you manage data in these sort of scenarios? You know, if you need to come up in another region, how do you deal with the data problem?

Cory: 03:48

Data is the hardest. And with data also being the hardest, it's also probably the most costly, right? So IaC is fantastic for maybe describing some clusters and some databases in US West, or let's say US East, because we know it's going to go down.

But let's say you've described all of your infrastructure applications are there, it goes down, you don't know how long it's going to go down. And so you want to replicate to let's say US West or central region. That's great.

But have you moved your data? Right? And I think that's one of the other key things with IaC, is using IaC to actually manage backup plans.

So being able to say like, “Yes, we take backups that are in our east zone, and then we're going to replicate those to wherever we plan to restore in case of an outage.” Again, thinking through this ahead of time, you don't want to be caught off guard just trying to go from east to west during an outage to find out a service isn't there, but also making sure that that data is replicated. And from database to database, that is a challenge, right?

So it's very easy with something like MySQL or Postgres, where you can just restore snapshots over there and be snapshotting them. But when you're talking about Kafka, event-driven pipelines and whatnot, you have to think about what is in those pipelines, what events have been processed.
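As an illustrative aside (not part of the conversation), the idea of verifying ahead of time that backups taken in one region have actually been copied to the recovery region can be sketched as a simple comparison. The `Snapshot` type and the two input lists are hypothetical; in practice they would be built from whatever inventory the backup tooling exposes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one database snapshot."""
    database: str
    taken_at: datetime


def missing_in_recovery(primary, recovery):
    """Return snapshots present in the primary region but not yet copied
    to the recovery region, oldest first."""
    copied = set(recovery)
    pending = (s for s in primary if s not in copied)
    return sorted(pending, key=lambda s: (s.taken_at, s.database))
```

A check like this covers only snapshot-style stores such as MySQL or Postgres. It says nothing about which events an event-driven pipeline has already processed, which is the harder problem raised above.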

Chris: 04:55

That's interesting. So what are your thoughts on the differences between perhaps let's say disaster recovery and just running as multi-region and that being your disaster recovery?

Cory: 05:08

I think that that is an excellent option if you have the budget for it, right? I mean, you're looking at a couple of cost multipliers there. One is just having dual-homed infrastructure.

You're going to have twice the infrastructure. Yeah, you might have the load spread across both of those regions, but you're going to have a baseline cost for having an infrastructure in place even if the load's lower. So you are going to experience higher costs.

If you have the expertise to do it, that's also great, but may be hard to find. I think the harder thing there is actually bifurcating those services, right? So let's say that you have some sort of DNS resolution that is geo-based and routes people to the closest region, that's fantastic.

But now if I get to a service that's storing in Postgres and that's in my US West and I have a US East, that gets very difficult to synchronize across those Postgres instances, right? I don't have multi-master, multi-primary Postgres. So you have to start looking at things like how do I configure the WAL to replicate to this other region and make sure they're replicating in both directions, which can start to get a bit heady.

So that is a choice, but it does require a lot of effort to get there. But that is a choice that might be necessary in some of the more high uptime environments.

Chris: 06:18

Would you recommend that, assuming cost was an issue, would you actually recommend that over disaster recovery or do you think purely having a disaster recovery plan is better?

Cory: 06:28

I think it really depends on the stage of the business and what their uptime requirements are, right? Like if you are a hospital, you might need to make sure your systems are online all the time and actually do some dual-homed or multi-region infrastructure where your load is spread across both of those. Or maybe you're doing a hot failover.

If you are less critical, maybe you're an e-commerce site, but you don't want to miss out on Black Friday, maybe you can handle an hour outage or maybe you can't and you want to be able to bring up infrastructure to at least serve checkout quickly, right? So I think it really depends on the stage of the business, the importance of that business to the world, which decision you make there. But I think a lot of organizations can go a long way just using IaC to kind of codify their entire environment and practice standing it up a few times just to make sure that everything works so that when it comes the day where it really has to work, it does.

But it really comes down to the business.

Chris: 07:19

So how do you see platform engineering and particularly things like Massdriver helping teams accomplish this and really build towards disaster recovery scenarios and being able to make sure that they maintain uptime even when these sort of things happen?

Cory: 07:34

Well, one of the things I'd say, if you asked me this question a year ago, I'd have a very different answer, but seeing how IaC is still a concept that people want to try, but many organizations aren't adopting or putting as much time into, I think one of the biggest features is the fact that a lot of the IaC is already done for you, right? Grabbing something out of our marketplace and connecting it together, getting it deployed, you've passed the first hump that 73% of companies haven't passed. You've got infrastructure as code.

But more importantly, we have this concept built into the platform for environment parity. And our environment parity is very agnostic about where it's replicating infrastructure. It could be staging and production.

It could be production US and production EU for fully isolated environments, like if you have concerns about data sovereignty, or it could be just replicating your disaster recovery environment. Production has gone down. I want to replicate that over to US West, US East.

Now, you still have the same problems that you have with doing IaC on your own. You do need to practice these disaster recovery scenarios to make sure that the environment you're rolling out to has the services. You do still have to design a backup strategy to make sure that you're getting your data over there.

And whatever time constraint you have on your data, do I need to kind of have an hour of missing data, five minutes of missing data or whatnot?
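As an illustrative aside (not part of the conversation), the "hour of missing data versus five minutes of missing data" question is usually framed as a recovery point objective. The sketch below, with hypothetical function names, computes how much data would be lost if the primary region failed at a given moment and compares it with the tolerated window.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated_backup, outage_start):
    """Time between the newest backup available in the recovery region
    and the moment the primary region went down."""
    if last_replicated_backup > outage_start:
        raise ValueError("backup timestamp is later than the outage start")
    return outage_start - last_replicated_backup


def meets_rpo(last_replicated_backup, outage_start, rpo):
    """True if the data lost in the outage stays within the tolerated window."""
    return data_loss_window(last_replicated_backup, outage_start) <= rpo
```

For example, a backup replicated at 12:00 and an outage at 12:40 gives a 40-minute window, which would satisfy a one-hour objective but not a five-minute one.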

Chris: 08:48

It seems like also the whole idea of backups and high availability can kind of be built into platforms like this. So not only do you enable the ability to replicate the infrastructure, but you can build in a lot of this expertise for the backups and the recoveries, right?

Cory: 09:03

And backups and recovery is something that I think is very key to a good platform engineering posture. And it's one of those things that I feel like people don't talk about a lot. If you're in larger organizations or you've come from a data center, you think about backup and recovery.

But a lot of times in the cloud, you just think about snapshots. RDS is snapshotting it, DynamoDB snapshots itself. But actually having good backup and recovery strategies is absolutely critical for business continuity.

Chris: 09:28

Well, speaking of disaster recovery, I'm going to try to recover from this disaster of an interview by saying thanks and goodbye.

Cory: 09:34

Wow. Wow.

Outro: 09:39

Thank you for listening to this episode of the Platform Engineering Podcast. Have a topic you would love to learn more about? Let us know at cory at massdriver.cloud. That's C-O-R-Y at M-A-S-S-D-R-I-V-E-R dot cloud. Catch you on the next one.

Episode 5

27th Mar 2024
