What Is Platform Engineering? A Cloud Operation Engineer’s Perspective
For teams ingrained in DevOps practices, Platform Engineering ushers in a broader horizon, a fresh perspective on managing infrastructure. But what does Platform Engineering offer, especially for those adept in cloud operations? In this episode, Cory O’Daniel talks to Chris Hill about platform engineering from the perspective of a cloud operations engineer. From the importance of security and compliance and the challenges faced by developers to the impact of Massdriver to deliver infrastructure management to engineers, Chris shares insights from his decade-long experience. Tune in for an exploration of platform engineering's evolution!
Hi, I'm Cory O'Daniel, CEO and co-founder of Massdriver.
I'm Chris Hill, COO and co-founder of Massdriver.
We're going to talk about platform engineering specifically from the point of view of an operations engineer. Before we get started, for people who aren't familiar with platform engineering, the canonical definition is designing, building, and maintaining the underlying infrastructure and tooling for creating software applications. The overarching goals are to increase developer productivity, standardize development processes, and ensure the scalability and reliability of the system as a whole. Chris, what would you say is missing from the canonical definition of platform engineering?
I would say security and compliance. To properly run all of these applications, if you're going to scale anything, you have to factor in the security element. As these corporations grow, compliance is going to become more of a concern, particularly if you're working in healthcare and have to deal with HIPAA, or if you're working in the EU. California now has compliance requirements around protecting user data as well. If your systems don't account for that, you're going to constantly be chasing after the compliance requirements that you forgot to implement originally.
I feel like as a developer, it's a lot of focusing on my features, getting something ready for production, and then getting caught not having thought of something security- or compliance-related up front.
As you're developing a product, a lot of times it's going to be prototyping to begin with, so you're going to create something, throw it out there, and see if it works, and then afterwards, you're like, "We'll figure that stuff out later." Unless you have a formalized process around figuring that out, what's going to happen is the next time you guys are doing an audit, the CISO is going to be chasing you down. They're not going to be chasing down the developers. They're going to be chasing down the operations engineer and saying, "Why is this configured this way? Why isn't it configured this other way?" Now, they're going to be interrupting your job. They're going to be coming in and saying, "This needs to be changed."
You might not even be that aware of the system. You might not be aware of how it was configured or why it was configured that way, and now this is dropped on top of your work pile: you have to go figure out how this thing is supposed to be configured. You may then also have to figure out how you're going to change the system to support the compliance requirements that just happened to show up.
Yeah, it can be difficult as an engineer to know. The funny thing with compliance is that it is pretty standard. If we can standardize that in our platforms, that's great, and the same goes for security. We have common security tools that we can run against our applications. If those are part of our CI/CD pipeline or part of our internal development platform, that takes a lot of that thought out of the developer's hands and lets them focus on features. How have your experiences as a cloud operations engineer influenced your approach to platform engineering?
I've been working as an operations engineer for close to a decade at multiple companies, and one of the main things I realized is that all of these companies are doing pretty similar things with the same technologies, operating in the same public clouds, but they're approaching the problem and solving it as if it's a bespoke solution, as if what they're doing is a particularly unique way of doing it. You see the same patterns and start asking, do we need to approach this problem this way? It's very man-hour intensive. It's very talent intensive to bring people in and treat every single problem as if it needs a unique solution that we have to discover and design on our own.
In a world where we say the perfect ratio of operations engineers to developers is 1 to 10, how does that approach, where every single one of these systems ends up being a unique or bespoke solution, affect the scale of your operations teams?
If you're an organization at that ratio, I would say you're lucky. Most of the time, the operations team, if it isn't understaffed, certainly feels understaffed because you're stuck maintaining these solutions. I get it from the company's side. When you're an operations engineer, you're removed from the product. I would say you're 2 or 3 layers removed from the thing that's making the company money.
It's sometimes even hard, from an engineering perspective, to be able to justify the cost of having this engineering staff that isn't directly contributing to the product. You're running an understaffed team but having to build and maintain all of these bespoke solutions. You have to have some way to be able to scale this work, especially as the number of developers is growing but the number of operations engineers is not. At least not at the same rate.
It seems like every organization eventually has a platform. It's just whether or not it was planned and thought through. A lot of it is these bespoke systems that are stitched together. What skills or knowledge are important in a cloud operations role when working as a platform engineering team building an internal platform?
I think one of the things is going to be as much experience as you can have with the tools that you're going to be working with. If you're going to be building a platform that sits between the operations engineers and the developers, which should be the goal, you have to know, for the things we're building, what should be configurable, what should be exposed, and what the developers should be able to work with. A lot of that comes from pure experience knowing how to work with the tools.
You're going to have to have some form of IaC. You're going to need Terraform, Pulumi, CloudFormation. You're going to need something like that. You're going to need to understand runtime. You're going to need to understand secrets. Obviously, you're going to need experience in these infrastructure-as-code tools. Anytime you're going to be managing infrastructure, you should be doing it with some form of declarative language.
The other piece that's unique to the platform side of it is thinking about how you can view what you're doing and what you're building as a product. Instead of thinking of it as a set of tasks and a set of work orders coming down, how can you approach your job as if you are building a product and the development team or the software team is your customer? What would you provide to them? What would you be building to serve their needs?
Thinking of it as a product, especially as an operations engineer, is a bit of a double-edged sword because your organization's looking at it and thinking, "We have a key product that we're building for our customers. Do we have the time or resources to build a second product for our engineers?" This is one of those things that, if you invest in it, will yield great returns. For these engineers who work on the operations side, who tend to do a lot of tasks and feel like they're doing ticket ops, this is the real value that is surfaceable and understandable in a business. Would you agree?
I absolutely agree. The big issue that these companies are going to run into is exactly what you said. Can you build and maintain two different products at the same time? You see a lot of these large companies that are able to advocate for a lot of the things that they're doing, but they have the large workforce to be able to do it. You need to find ways to build these products with constrained resources. Sometimes you have to use existing products. You have to find ways to leverage existing things to multiply the effort you're putting in to meet the needs of the developers.
Yeah, and like any product in agile software, you can do it in steps. You can increment. You can iterate. It doesn't have to be a platform that you go and build in a corner and then bring to your developers and say, "Here it is." You should have these developers involved, trying to figure out, "What can we do to optimize your development workflows and start to standardize and centralize the management of this infrastructure?" How has your perspective on infrastructure and scalability evolved as you've leaned more into platform engineering?
Whenever I'm working on something new, I start looking at it through the lens of, "What does the developer need to change about this?" Previously, working purely on the operations team, you'd interface with the developers to gather requirements and understand what they're doing, but then you'd go back and build it and maintain it on your own. None of this would get exposed to developers. Maybe it depends on your organizational structure and how you guys manage the teams, but a lot of times for me, I worked on it by myself and I maintained it by myself.
Now I look at it and say, "What of this could be configurable?" One of the best parts of it is that you can build stuff now, and I feel more of a sense of completeness when I'm building something new. When somebody asks for a new capability in AWS, I go and I research and I implement it. When it's done, it feels a lot more done because now I'm able to release this and it's able to be used not just through me. The developers can come and deploy it, they can view it, they can monitor it, they can understand it, and I'm done with that work now. I can move on to the next thing.
A lot of times when I've been handed the thing that I need to deploy on, it's like, this is what I've been given. In this idea of platform as a product, it feels easier for me to almost acknowledge that I can get feedback. We need to change the way this thing works to better serve my team or maybe create a set of options here for how I'm running an ML workload versus running some transactional workload. That can feel very different than ticket ops or somebody saying, "Put your thing in a container and run it."
Do you remember when we were first getting this idea off the ground? You were a very hard sell. What motivated you to shift from cloud operations to platform engineering? What advice would you give to other operations engineers who are either considering this as an alternative or being tasked with trying to figure out how to scale their operations team with platform engineering?
When I first heard of the idea of Massdriver, my concern was that I viewed my skill set as too unique and too specific to be abstracted into a platform. At the time, I didn't think a platform could do it. This isn't the idea that we can get rid of operations engineers, certainly not. What I realized is that the approach we were taking with Massdriver was connecting these individual pieces of infrastructure together. A lot of times that is already how things were being represented. It was about being able to codify the relationships between things and identify the areas where there was a lot of work that was tedious and repetitive, important but tedious and repetitive.
I think you see this in every industry, especially with technology: originally, problems get solved with human power. You just throw as much human capital at the problem as you can, and from there, you start asking, "What of this is simple? What of this is heuristic and repeatable and deterministic?" Those are the things you can then start automating and simplifying, and eventually you can start shifting it and solving these problems with technology instead of solving them with people.
The operations field in this industry has probably been ripe for this for a little while. I think this is why you're seeing so much interest in this. There's almost this acknowledgment across the industry that there's a lot here that has been made overcomplicated, and we should start using technology to make these tedious things a lot easier, a lot simpler, a lot faster, a lot more scalable. That is the biggest appeal. Platform engineering is what's going to solve that.
As a company that dogfoods our own product, how has using Massdriver to deliver infrastructure management to engineers changed the way you work?
It goes back to what I was talking about with feeling completeness when I'm done creating what we call our bundles: a piece of infrastructure that can now be redeployed or reused. It's funny, I'll go away and I'll look at the infrastructure that we have running in Massdriver and I'll be like, "I didn't know we were running that. I didn't know that's how this thing worked." There's almost some degree of separation I have now between the developers and their needs. I'm focusing more on giving them the tools that they need to be able to do their job instead of gathering the ingredients, mixing them, baking, and presenting the result.
There's self-service going on. I'll go in and I'll see pieces of our system that I didn't know were using pieces of technology that I built. It's cool to see that with these things I'm making, I don't have to be directly involved in the process. The hours I spent originally creating this have now been multiplied and are being used by the developers to solve problems. I don't have to be directly involved in that anymore. I've enabled them to do their job fast.
That delivery of, "Here is something where I've codified my expertise not only into the bundle but into how it can connect to the rest of the system," feels a lot more complete than just handing somebody a bag of IaC where they have to fork it to make changes and the input and output interfaces get fuzzy over time. You don't know what this thing is being used for, versus being able to say, "This has a very specific use case. This is a bucket for ETL. This is a bucket for logging." You're able to hide a lot of the unimportant details from that engineer and expose to them exactly what they need to configure something for ETL versus long-term storage.
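As a rough illustration of the "bucket for ETL" versus "bucket for logging" idea, here is a minimal Python sketch of use-case presets. Everything in it is an assumption made for illustration: the `BucketConfig` type, the `bucket_for` helper, the preset names, and the specific settings and retention values are hypothetical and do not describe how Massdriver bundles actually work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BucketConfig:
    """Hypothetical low-level settings an operations engineer would own."""
    versioning: bool
    expire_after_days: Optional[int]  # None means objects are kept indefinitely
    encrypted: bool = True            # assumed compliance default, not exposed


# Hypothetical presets: the values are illustrative, not recommendations.
PRESETS = {
    "etl": BucketConfig(versioning=False, expire_after_days=30),
    "logging": BucketConfig(versioning=False, expire_after_days=365),
    "long_term_storage": BucketConfig(versioning=True, expire_after_days=None),
}


def bucket_for(use_case: str) -> BucketConfig:
    """Return the preset for a named use case; reject anything unknown."""
    try:
        return PRESETS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None


if __name__ == "__main__":
    # The developer picks a use case; the details stay with the platform.
    print(bucket_for("etl"))
```

The developer-facing surface here is a single string, while the settings behind it remain something the operations engineer can change in one place.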
You might look at it and say Terraform has done that for a while with modules. Yes, but the relationships between that module and the things that use it weren't codified in the same way previously. With IAM, developers can simply think, "Is my application a consumer or a producer of content into this S3 bucket? Do I need to read? Do I need to write? Do I need to do both?" They can focus on use cases, they can focus on what they need, and they can trust that whatever they're deploying is not going to violate compliance requirements or security requirements. Everybody can operate much more freely with confidence, and I think that's what we want.
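The consumer-or-producer question above can be sketched as a small mapping from a developer-facing role to an IAM policy document. This is only a sketch under stated assumptions: the role names, the `bucket_policy` helper, and the example bucket name are hypothetical, and the choice to grant `s3:ListBucket` to readers is an illustrative decision rather than anything stated in the conversation. The action names and the policy document layout are standard AWS IAM.

```python
import json

# Hypothetical mapping from a developer-facing role to S3 object actions.
OBJECT_ACTIONS = {
    "consumer": ["s3:GetObject"],
    "producer": ["s3:PutObject"],
    "both": ["s3:GetObject", "s3:PutObject"],
}


def bucket_policy(bucket_name: str, role: str) -> dict:
    """Build an IAM policy document for one bucket from a role name."""
    if role not in OBJECT_ACTIONS:
        raise ValueError(f"unknown role: {role!r}")
    statements = [
        {
            "Effect": "Allow",
            "Action": OBJECT_ACTIONS[role],
            # Object-level actions apply to the objects inside the bucket.
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        }
    ]
    if role in ("consumer", "both"):
        # Listing is a bucket-level action, so it targets the bucket ARN.
        statements.append(
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # "example-etl-bucket" is a placeholder name.
    print(json.dumps(bucket_policy("example-etl-bucket", "consumer"), indent=2))
```

The developer only answers "read, write, or both"; the least-privilege details stay in code that the platform team reviews once.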
Awesome, Chris. Thanks for taking the time to sit down and talk about platform engineering with us.
Absolutely. Enjoyed it.