From Netflix to the Cloud: Adrian Cockcroft on DevOps, Microservices, and Sustainability

Episode Description

In this episode Cory sits down with Adrian Cockcroft, a pioneering technologist who played a crucial role in Netflix's transition to cloud computing and microservices architecture. Adrian shares insights from his impressive career, including his work at Netflix, AWS, and beyond. He discusses the evolution of DevOps practices, the rise of microservices, and the challenges of platform engineering in today's complex cloud environments. Adrian also delves into the pressing issue of sustainability in tech, offering valuable perspectives on the environmental impact of AI and machine learning workloads. Whether you're a seasoned DevOps professional or just starting your journey in cloud computing, this episode offers a wealth of knowledge from one of the industry's most influential figures.

Episode Transcript

Thanks for tuning into this episode of the Platform Engineering Podcast. I'm your host, Cory O'Daniel. And today I have Adrian Cockcroft, a renowned technologist and thought leader in the fields of cloud computing, serverless architecture, and platform engineering. Adrian has had an impressive career, particularly noted for his influential work at Netflix, where he helped pioneer microservices and cloud-native technologies. Adrian, it's a pleasure to have you here today. To start, could you tell us just a little bit about your early career and what led you to your work in the cloud?

I was a software developer for a while. I didn't do computer science as a degree. I have a physics degree - applied physics and electronics. So I got into software through real time coding, real time control and signal processing systems.

From Sun to eBay to Netflix

And then I got into Sun Microsystems because we were using their machines, and I ended up working for Sun for a long time, in the performance area. Performance and capacity planning was my specialization, really; I ended up on the performance team at Sun. I wrote a book on performance tuning that lots of people bought. That got me to distinguished engineer at Sun because everyone had heard of the book. One of the ways to sort of hack your career is to write down everything you know in a book. And if everyone buys the book, then that's your career, right? You can monetize that for a while.

And then as Sun shrank, I moved to eBay for a bit. I learned a bunch more about how to run large-scale, web-facing services and properties, and about some really interesting ideas that eBay had developed around what effectively became NoSQL, along with a lot of pattern-based automation of the way they did their rollouts. The technology wouldn't look very familiar today, but the concepts were pretty foundational. I met a lot of interesting people there as well.

And then went to Netflix in 2007, just as they were trying to figure out how to scale up from a DVD business, which was pretty small scale from an IT point of view, to an online service that needed to be reliable and scalable.

One of the ways to sort of hack your career is to write down everything you know in a book. And if everyone buys the book, then that's your career.

And, if you know the geography, eBay is about five miles down the road from Netflix. A whole bunch of people that I knew who had figured this stuff out at eBay literally went five miles down the road and started working at Netflix. There was a group of us that came from eBay. And that was where, as we started scaling, our data center team that knew how to run a small-scale data center basically couldn't figure out how to run the stuff we needed at scale reliably. So at some point, after like a several-day outage, we went, we need to do this differently, and maybe we can use that Amazon thing, the AWS thing. So that was about 2008-2009, after I'd been at Netflix for a couple of years. That was the idea.

And the ideas around cloud computing were something I'd been developing back when I was at Sun as well. Sun couldn't figure out how to do it, but the concepts were there. So that was sort of how I got into cloud.

The Move to Cloud

For people that are maybe on the younger side, I remember Netflix DVDs. That was very exciting to me to be able to have, you know, Blockbuster show up in my mailbox. But like that is a pretty fundamentally different business. Like you have this successful model and then this change to the streaming that we have today.

That outage, was that the moment that you realized you needed to move to something like AWS or was it something you were already like thinking through and that was just kind of like the final straw?

I think it was a combination of those things. It was partly the way IT was set up at the time: IT ran a platform and told developers, “Don't worry about anything. You just write your code and deliver it, and we'll get it running. We'll make it reliable.” And they were going to make it reliable by buying high-end hardware.

We had IBM Power series machines rather than cheap Linux boxes because they were more reliable. We had a SAN. We had expensive disks. So they said, we're just going to spend more. We didn't need large-scale hardware, so they spent a lot of money on it; it was a small proportion of revenue. But they said, we're just going to make it reliable so that developers don't have to worry about it and can go faster. And effectively, that model doesn't work. And that was why we kept having big outages.

But what took us out for a few days was a silent data corruption that took down all our databases repeatedly.

Oh, wow!

We restored them and they corrupted again. So the outcome from that was that we decided that developers needed to understand and take control of the reliability of what we were building. So this was part of a DevOps movement, but from the developer side.

There was this whole discussion about whether it was DevOps or not, or whether we could call it NoOps, because we basically told the Ops guys to keep running the data center, moved the budget for AWS to the development side, and stole a couple of people out of operations, but we ran a developer-culture cloud platform. We treated AWS as our platform and we just stopped dealing with the data center people entirely, other than for things like DNS. They still controlled some of the core networking pieces like that.

And that was sort of a philosophical transition, that we wanted to build a system where we assumed that the platform we were running on was unreliable rather than reliable.

And that was the big switch?

That was the big switch. And that meant we had to build software architectures which assumed that the components could go away at any time, which led to all the chaos engineering stuff, multiple zones and regions.

And eventually, around 2011, there were some big outages at AWS that Netflix just kept running through while everyone else was down. My good friend Jeremy Edberg was at Reddit at the time saying, “Why are you still up? I know you're using the same region as we are and we're down. And I'm watching Netflix waiting for AWS to come up.” And grumbling on… he's a good friend, I met him at eBay. But that was one of those moments where we showed that this architecture did work.

I mean, in the early days of the architecture, everyone said what you're doing is crazy. You can't run a business on cloud. It's not how the world is going to work and you'll be back in data centers pretty soon. That was the 2009 kind of mentality.

So people have forgotten what the attitude was then: all of the money and all of the reliability was in the data center.

Yeah. And was that philosophical shift the biggest challenge in your migration to AWS, or were there other challenges? Like, AWS was very… not immature at that time… but the service offerings were much smaller than what we have today.

It was very immature.

Okay. We can say that. Okay.

We were co-developing a whole lot of AWS services. We didn't have VPCs. We didn't have network security models. It was EC2 classic, if people remember that. We were running on that.

And SQS, right? I think that's what they started with.

We ran on S3, SQS… a little bit of SQS. SimpleDB when we started. We were trying to get enough database IO throughput and that didn't work that well, so we got off of SimpleDB and moved to Cassandra in 2010. So that was the architectural thing, and it was fun. We had a bunch of people that had been around for a long time, seen lots of architectures come and go over the years, and we deliberately built it as a series of platforms.

We had a team that was less than 10 people, I think, most of the time, that was building a platform library. Pretty much everything we did was in Java, so there was a library that had all the stuff you needed in it. We had tooling that knew how to deliver code to the cloud. We had a GUI that you could use for making AWS do things in the ways that we'd figured out. And we treated AWS as a platform.

And then the other decision we made was, were we going to buy in a platform? And there were a bunch of things you could buy at that time, RightScale, I think, was one of them. There were a bunch of tools like that. And what we decided was, if we bought something in, that vendor would then be incentivized to make that platform fatter.

Yeah.

And that's what you see. You end up with this fat platform that reduces the cloud underneath to its basic elements and tries to extract as much value as it can from that layer of the stack. What we decided instead was that we wanted to own our own platform, because we wanted to be able to evolve it up the stack and retire things out of it. So we built our own security architecture, which we basically shut down once AWS came out with its security architecture.

It was layered in a way that let us get AWS to build more and more of the platform, so we could ride that up the stack further as it went. That was the philosophy, or part of the philosophy. The biggest challenge was that we were running on a single Oracle database for most of the code, and people were used to doing transactions and things like that.

Yeah.

And then we moved to a distributed NoSQL architecture with every table on Oracle turned into a separate Cassandra database. So, you can't do transactions, you can't even do joins, that's all application logic now. So people had to figure out how to stop being Oracle database programmers and how to be distributed systems programmers. So that was another big piece of it.

The Power of Microservices and Putting Developers On Call

And then the other thing we did was we put everybody on call, in one big call tree hierarchy. So if you pushed code, you were on call for the code you just pushed, basically. And there wasn't an SRE team or Ops team that would pick up the call. There was a team that would figure out that somebody needed to be called, and they would call you. So then the team self-organized into small groups that would support it. Anything customer-facing that could take the website down was organized into this pager-duty call tree that went all the way up the management chain.

Yeah, I feel like a lot of companies nowadays still struggle to mature through DevOps and push that DevOps culture. How hard was that portion of this paradigm shift? Putting developers on call, like putting the cloud in their hands, was that more difficult at that point in time? Or, with all of the excitement around the idea of DevOps and this like new power tool for developers, was it easier then or just as hard as it is now?

I think it's still hard and I think that it's hard to get people to be on call. And the question is whether you're going to get called anyway, because you wrote the code and whoever is staring at the broken thing is going to call you anyway. So you may as well be the first person there and not depend on somebody else. That was the philosophy.

We weren't calling it DevOps. We were following some things that… Werner Vogels wrote a paper (I forget exactly what it was called, but it was an ACM paper from around 2006, something like that; you can find it and put it in the show notes), and the essence of it was: run what you wrote. That was the philosophy. That's the way AWS runs. The service teams are on call for their thing, because they know the state of the thing.

And what we were building was a thing that had… because of the way Netflix does a lot of testing, and because we wanted deployment to be very rapid… if you pick a service name, there could be four or five versions of that service running in production. Different versions, each of them supporting a different A-B test or a different stage of development. Usually there's a legacy version that supports some ancient thing that hasn't migrated yet, but we just need enough of it to support the traffic from that old thing. Then there's the mainstream one, which is taking most of the traffic. And then there are two or three developers working on it that’ve got their own custom versions in production related to different tests. So if something goes wrong, you have to understand that there are all these different versions, and why they're there, and what's going on. The context is very high.

If you tried to hand that off to somebody in Operations, like we used to in the data center, there was a weekly or bi-weekly meeting with a huge number of people in the room arguing about what was going to be delivered and trying to pass on, this is how you're going to try and run it. It was just a big mess. So we wanted to get away from that.

So we had a bunch of pain we were running away from. One was the bi-weekly Ops TOI (transfer of information) process; we wanted developers to be able to push code anytime they wanted. And the developer responsibility went from, I'm building a JAR file which QA will integrate into a build and ship as a release every two weeks (which was the model we were on in the data center), to, that same JAR file is now going to have a service wrapped around it. So I own an API, some client libraries, and I have dependencies on other things.

And some of those things that you were building were platform layers, and some of them were more business logic layers, but we didn't really distinguish. The work of an engineer wasn't that different between the platform team developers and the business logic developers, if you like. They were using the same tooling to deliver things.

And so like this shift, when you were in the data center, was Netflix already leaning into microservices or was the distributed nature of the cloud what kind of pushed you in that direction?

That was the microservice thing. So instead of building a JAR file that gets integrated into a monolith, and then you're blocked because somebody shipped a bug… one of the people writing one of those JAR files shipped a bug, so the entire monolith is blocked.

Yeah.

And your code has not hit production, even if it was perfect, because of somebody else's bug. That was another reason for making it independent. You know, if you ship perfect code all the time, you can ship as often as you want. Somebody else's bug doesn't stop you from shipping your code. So that was the big piece.

The idea was they were single-function. This piece of code did one thing and that was independently deployable and scalable. So you could have 10 instances of one piece of your code and a hundred of another and a thousand of another. Whereas if it's all bundled into a monolith, you've got a thousand or 500 of all of them. And you've got a bunch of code paths that are fairly lightly used, which are being deployed and are taking up memory and mixing in different types of requests.

If you mix slow requests and fast requests on the same service, you run into a bunch of issues with thread starvation and things like that. So you want to make the traffic to a service one thing that's pretty consistent. So that was the design.

That's the microservices model. And we didn't call it microservices until I started going to conferences and running into a whole bunch of people that said, we're calling that microservices now. So we thought, all right, we'll call that microservices too.

NoOps or DevOps?

And then, like with DevOps, we were doing something and eventually said, we're kind of not doing DevOps, because you're supposed to be running Puppet and Chef and saving your configuration as code. We didn't have configuration; it was all code, all API-driven.

Oh.

We didn't have Puppet or config descriptions of our deployments. It was still all code, because we were all developers. We didn't need config files.

Yeah, did that change over time or was that kind of a pattern that Netflix kept?

At some point, we had some arguments online. Somebody said there's this thing called NoOps, and that sounded like what we were doing. And then everybody in Ops got pissed off at us calling it NoOps. Our point was just that we didn't have the Ops culture and tooling that they were talking about. So anyway, we had a few arguments and then people said, let's stop, we're just annoying people. So we stopped calling it NoOps.

I remember John Willis, who is well known in this space, came over one time and visited. We just sat down and he said, “What you're doing is not what everyone else is doing, but it's really cool. So can we just call this DevOps?” The deal was, if we can call what we're doing DevOps, then fine. But we were approaching it from the developer side… the way I put it, we taught our developers to operate rather than teaching our operators to develop.

Yeah.

And then the way DevOps ended up being nowadays, it's more like what used to be the Ops team got a DevOps label or SRE label and it's the old team, with better tooling, sitting between the developers and something else, right?

Whereas we deliberately didn't create that team. We built a platform that the developers could use to operate their code and put them on call so that they built reliable code.

Netflix tends to operate with processes that have feedback in them. Systems thinking, loops, feedback loops, rather than rules. It had principles and feedback, at that time anyway. I mean, I left 10 years ago and don’t know exactly how they run nowadays; I think it’s more conventional. But at that time, it was very systems thinking: no rules, no processes, we have principles, and you create a well-trodden path if you want people to follow it, but you're not constrained to follow it. You could go find your own way if you wanted to.

Yeah, it's interesting. I feel like there's still a number of developers that are hesitant or resistant to like the idea of DevOps and many can be frustrated by how intimidating the cloud is. And I feel like at this time, like now versus then, the surface area of the cloud is very different. Do you think that developers today that are resistant to doing the cloud or doing infrastructure as code… Do you think that the shift there that you're talking about, like developers thinking how to operate versus doing it the Ops way, like using infrastructure as code… Do you think that that could be one of the issues that we have today? Or is it just that the cloud has gotten too complicated for most developers to kind of take on on their own?

Yeah. What we had then was… I think AWS had 15 to 20 services, instead of the 150 to 200, depending on what you count, that it has now. And in terms of the market for cloud at that time, if you were an at-scale digital-native company like Netflix, Pinterest, Uber, whatever, you could build that on the cloud.

While I was at Netflix, we had all these companies coming to talk to us. And that was one of the reasons I left Netflix, because I was spending all my time talking to other companies about how they wanted to try and figure out how to do this stuff. And I ended up at a venture capital firm in 2014, talking to all of the enterprises. Spent a lot of time at Capital One when they were figuring out their cloud strategy. And companies like that.

So then what you saw was enterprises moving to cloud and saying, well, we want all these enterprise things. And then AWS just went, well, if you're going to pay us, we'll build the thing. Right?

So then you ended up with all of the enterprise capabilities of AWS as another layer. And that's where it gets much more complex because traditional enterprises have effectively exported their complexity into the cloud. And they've got more automation, but they still have a lot of that complexity. And that was one phase.

But what we did at Netflix was we really simplified it. We built our own platform. We used all of the AWS facilities to build that. And we built a much simpler architecture that was very scalable and available, but solved the particular problem we had at the time.

Yeah.

I think the credit card front end that Capital One built was built along similar principles. They copied some of our architecture and it was highly available and multi-region and things like that. But that's a credit card front end. The core banking backend stuff is still… there's reasons why you can't build that on kind of what Netflix had built. The structure of the business is a little different.

If you're building a marketplace that has to transact between any two entities, then you end up having to centralize things for consistency. Whereas Netflix didn't need consistency because it's just you and the movies. And the other members of Netflix don't really interact with you that much.

Priorities for Early Stage Companies

Thinking about early stage companies, like you're trying to get your product out the door. Sometimes looking at the cloud, it can seem like work can be pushed off to later, right? You're like, we can do that infrastructure as code thing later. We can do those clean pipelines later. We can do security and compliance later. A lot of times we'll kind of push that debt down the road. Like when is systems thinking and platform engineering and DevOps principles… like when is the right time, in your opinion, to start bringing that into your engineering org? Is it day one when you're starting to write code and build your business? Or do you think that there is an amount of debt where you can just go I'll just click some stuff in AWS and get my thing going and let somebody else deal with it down the road?

I mean, you should be building throwaway prototypes to start with. Because if you don't discover what your customers want, then you're going to go out of business before you get to scale. And at low scale, I like the serverless first kind of thing. You stand stuff up as a bunch of Lambda functions, keep iterating until you figure out what's needed.  

The Value Flywheel Effect by David Anderson is kind of my favorite book for how to do this, and it takes a serverless-first approach. Actually, most of the examples are from Liberty Mutual, but I think they have a few more startup-like examples in there too. But the idea there is you're prototyping, and until you get product-market fit, you should be prototyping.

And then you have the problem of having customers that want your service to be up. You can solve that problem later. And yeah, you should certainly not be building too much platform infrastructure. But you should also be building by gluing together APIs. You should be buying everything you can as a service that isn't core to you, to the thing that you're building that's different.
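To make that serverless-first prototyping concrete, here's a minimal sketch of a throwaway prototype endpoint as a single Lambda handler behind an API Gateway proxy integration. The handler name, payload shape, and the services you'd glue in are placeholder assumptions for illustration, not anything Adrian describes specifically.

```typescript
// Minimal prototype sketch: one Lambda behind an API Gateway proxy integration.
// Names and payload fields are illustrative placeholders, not a real service.
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

export const handler = async (
  event: APIGatewayProxyEvent,
): Promise<APIGatewayProxyResult> => {
  const input = event.body ? JSON.parse(event.body) : {};

  // In a real prototype you'd glue together managed services here:
  // call a hosted search or payments API, write to a managed database,
  // and return just enough for the product experiment to be judged.
  const result = { received: input, status: 'prototype-ok' };

  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(result),
  };
};
```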

There are three books on Netflix that I recommend people read. One is called That Will Never Work by Marc Randolph, who was the original CEO of Netflix when they started. It's the story of the formation of Netflix. His wife said that will never work. And that was Netflix, the DVD business anyway.

I love that.

Netflix and the Role of A-B Testing

He is a product guy, and they A-B tested everything from day one. Absolutely from day one. That means you build up a formal understanding of your customers through A-B testing, not a guessing-what-they-liked kind of thing. No, you have a test, and that test has results, right? Getting that baked into the company means everything at Netflix goes through A-B tests.

Even nowadays, I see somebody saying, Netflix is doing a new thing. They're figuring out how to do adverts, or they're figuring out how to do games or whatever. And I know that they were testing shutting down account sharing. And I heard they were trying it out in Colombia, so they're still using the old methodologies. They pick a country, they try it out, they see what works, they try it in a few more countries. And then they go, okay, that works. And then they gradually roll it out globally. And everyone says this is terrible, this is stupid. And then six months later, it looks like it did actually work.

Yeah.

Well, they tested exactly what they were doing. They don't blunder into things and roll stuff out globally without testing the hell out of it. Every option, every combination. That goes back to literally day one of the company. And that is one of the enduring strengths of Netflix, their testing culture. You can have an opinion, but the opinion is about what to test, not about what should be done. And that is a very different way of running a company.
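For readers who want a feel for what "everything goes through an A-B test" means mechanically, here is a tiny, generic sketch of deterministic test assignment using hash bucketing, so a member always lands in the same cell. The names and percentages are assumptions for illustration; this is not Netflix's actual experimentation or rollout system.

```typescript
import { createHash } from 'crypto';

// Generic hash-bucketing sketch: assign a member to a test cell
// deterministically, so the same member always sees the same experience.
type Cell = 'control' | 'variant';

function assignCell(memberId: string, testName: string, variantShare = 0.5): Cell {
  const digest = createHash('sha256').update(`${testName}:${memberId}`).digest();
  // Map the first 4 bytes of the hash to a number in [0, 1).
  const bucket = digest.readUInt32BE(0) / 0x100000000;
  return bucket < variantShare ? 'variant' : 'control';
}

// Example: expose a hypothetical test to 10% of members in one market first,
// then widen the share as results come in.
console.log(assignCell('member-12345', 'sharing-limit-test', 0.1));
```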

Exactly. That was one of the things I struggled with when I founded my company. We're sitting around, thinking about, what is our go to market strategy? And we formalized it, we wrote it down, we're like, this is it. And we just did it for months with like no yield. And then we're like, why aren't we going about the business like we go about our software and A-B test things? And then we kind of went to this approach where we had like these small go to market experiments and we had like a desired outcome.

Once we started taking that experiment based approach, like that's where we really started to see things change. But like, it just felt so alien. Maybe that was me as a first time founder, but it just felt so alien at first to do that. But then in retrospect, it was like, that's the only thing that makes sense besides just spinning your wheels for months at a time.

Yeah, I think so. And that applies to the technology space as well. So the question that we keep asking is, what's the fastest, cheapest, smallest way to learn something? What is the thing we don't know the answer to? And what is the best way we can get an answer to that?

So at one point, we needed to go multi-region. How should we replicate data across regions? Should we build a replication pipeline with SQL, blah, blah? Or does Cassandra just do that? And there was a big argument one day in a meeting over it. And I came out of the meeting talking to the guy that ran the Cassandra team, saying, let's just run a test and see what happens.

So we wandered by the desk of one of the engineers and said, can we come up with a test? Let's just run a multi-region Cassandra cluster and beat the hell out of it and see what happens. And she said, okay, well, we've got a big backup file we can restore from. We've just received an allocation of machines from AWS; we'll allocate those to the test account and run a test. And she created the cluster and did some restores.

The next day she ran some tests on it, because this is the cloud. We allocated something like a 48-node Cassandra cluster over two regions, or whatever it was, some large cluster, just because we had the machines, and then ran some tests on it and beat the crap out of it. Without changing our config, it did 10 times what we needed it to do.

Oh, wow.

So we came up with some tests, and basically the next day we had the answer. I went into the meeting the following week and said, well, we did this test and it just works. So everyone went, “Okay then.”

Great, great, we'll take it.

It's a great example of bringing data to a question or problem. We shut down the whole discussion by running a test. And also of the power of cloud for being able to run a really high-end test. We had 480 gigabits of bisection bandwidth between two regions. We did a synchronization over it. And we actually called Amazon before we started and said, we don't know how fast this is going to go, but if something bad is happening to the network, tell us and we'll stop.

It's a great example of bringing data to a question or problem. We shut down the whole discussion by running a test.

Yeah, and this was on the big old classic. This was still EC2 classic. Everybody's in one network.

Yeah, EC2 classic. We had 10 gigabits on each of these nodes and we had 48 of them on either side - 48 in Oregon and 48 in Virginia. So we were pushing, I forget how many terabytes of data, across the country… unfortunately that part of Cassandra was single-threaded and it took longer than we thought, but it didn't max out the network, so that was a bit disappointing.

If you did that today, Cassandra would drive the network much harder. But then the network's more resilient.

I like that story partly because it's about how to evolve an architecture by figuring out, what's the big question? What's the simplest way we can get to an answer? And then the fact that that just shuts down the discussion and lets you move on to the next problem. And also the fact that cloud lets you run this kind of experiment.

Somebody said, well, how does it compare to data centers? I don't have time to create a data center machine to find out.

Common Pitfalls When Adopting Serverless

You know, with this serverless first approach, like just kind of getting out there building prototypes early, getting it in serverless… serverless is one of those things that's interesting. I've seen it succeed. I've also seen a couple of failure cases, right? Like getting too far into it and not thinking through a strategy. And now you're kind of sitting on maybe a mess of a framework you're using to manage it or whatnot.

What are some of the common pitfalls that you've seen companies, maybe companies that were in your portfolio at Battery, struggle with when adopting serverless? And how can they avoid some of the common mistakes, like over-architecting their serverless environments or running into that surprise bill you occasionally stumble across in serverless?

Yeah, the biggest bill is usually the logging bill; it's usually bigger than the compute bill.

I think I would actually refine it to say Step Functions first. If you're trying to build some business logic, I would try and prototype it in Step Functions. I think it's one of the most powerful tools out there, and most people don't use it because it's AWS-specific. But if you're modeling business state changes, Step Functions is a super powerful tool for doing that.

It gets expensive if you try and drive it too fast. And there was that whole fuss last year with the blog post about Amazon moving to a monolith or something, which was basically somebody stopping using Step Functions and moving to Lambda functions. But anyway, moving on from that, I have a whole ranty blog post about it. The point is that you prototype in Step Functions, and it's an incredibly quick and reliable way to model what you're trying to get done.

And then as you try to run traffic through it, bits of it will start getting expensive or slow and you start replacing those with maybe Lambda functions. And if that Lambda function gets to the point where it's running flat out all the time, or you need to hold some persistent state like a database or whatever, then fire up a container to run that. And that's my sort of approach. You'll build something quick that way, which is maximally leveraging the platform things that you can get from your cloud provider.
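As a rough sketch of that Step-Functions-first progression, here's what a prototype flow might look like in AWS CDK (TypeScript): the business flow lives in the state machine, one step is a Lambda task, and that task is the piece you'd later promote to a container if it ends up running flat out. The stack, state names, and the validate/record/review flow are hypothetical, not a real Netflix or AWS example.

```typescript
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

export class OrderPrototypeStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Hypothetical Lambda that validates an incoming order payload.
    // If this step ever runs flat out, it's the candidate to move to a container.
    const validateFn = new lambda.Function(this, 'ValidateOrder', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/validate-order'),
    });

    const validate = new tasks.LambdaInvoke(this, 'Validate', {
      lambdaFunction: validateFn,
      outputPath: '$.Payload',
    });

    // The rest of the business flow stays as cheap Step Functions states.
    const recordResult = new sfn.Pass(this, 'RecordResult');
    const flagForReview = new sfn.Pass(this, 'FlagForReview');

    const definition = validate.next(
      new sfn.Choice(this, 'IsValid?')
        .when(sfn.Condition.booleanEquals('$.valid', true), recordResult)
        .otherwise(flagForReview),
    );

    new sfn.StateMachine(this, 'OrderFlow', {
      definitionBody: sfn.DefinitionBody.fromChainable(definition),
      timeout: Duration.minutes(5),
    });
  }
}
```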

Kubernetes and Open Source

So that moves on to kind of like, well, where are we today and why is it complicated? Which moves on to kind of a whole new topic, which is sort of Kubernetes and why has that taken over and all of that kind of stuff. So as I left… I joined Battery Ventures in 2014 and that was when Docker started becoming a thing.

I started hanging around. I did a talk at DockerCon. I was talking about microservices, architecture, and looking at the evolution of all these different container management systems at the time. A whole lot of old talks on that subject.

And at the end of 2016, I joined AWS. And it was a couple of things. One, I was in marketing, I was a VP in marketing reporting to the CMO. And I was sort of doing some of the things that Werner does, going outside and talking to customers and keynoting summits and stuff. So that was part of the job. The other part of the job was to build an open source kind of community engagement marketing team. We had a team in engineering that was worried about licenses and using open source, but we didn't have anybody that was engaging with the community.

I had built the open source program at Netflix and figured out a whole bunch of things and been to a bunch of conferences. So I hired an open source team from outside, people that were well trusted in the open source community and built the team around them.

Interesting.

We were inside Amazon so we could influence Amazon, but to the outside, these were people they knew and trusted. So we kind of built that trust bridge. And we basically fixed Amazon's reputation for open source over a few years. We sponsored OSCON and All Things Open, turned up at FOSDEM.

I hired Arun Gupta as one of the first few people, and he said, we really need to join CNCF and get into this Kubernetes stuff. So we went, all right. Together we wrote a paper - he wrote the first draft and I modified it. And I took it up the management chain, all the way to Andy Jassy, to say we need to join CNCF.

We kind of went, all right, well, it looks like the alternative is worse. It's going to be difficult, but we need to try this. We need to do this. So a team was internally forming to build EKS, but hadn't done anything visibly externally.

Whenever there's an industry consortium formed against you, you should always join it.

We needed air cover for our activities before the product was launched. So that was one reason for joining CNCF that summer. EKS was launched in the fall, but there was about a six-month period where we were clearly doing CNCF stuff, even though we didn't have a product. And it was partly so that we could start building it. Bob Wise and the team started building out all the stuff and just engaging with the community.

Kubernetes is a difficult thing because what it does is build a different platform on top of AWS from the EC2-native platform. And you have all kinds of things. Like, the firewall knew how to do EC2 firewalling, but it didn't really know how to do Kubernetes firewalling for a long time. I think the latest version finally supports EKS, right?

So all of this integrated tooling that used the cloud, but used it the Amazon way and all worked together was very effective. But now there was this, well, we want something that's portable and these things are starting to commoditize out and become common capabilities. And there's going to be this common thing. If AWS had stayed outside CNCF, it would have given everyone else a tool to marginalize AWS with. And whenever there's an industry consortium formed against you, you should always join it. That's a strategy thing.

I've been around, I've seen this play out. Sun ran into this a bunch of times. You will end up joining it anyway, so you should join at the beginning. Otherwise they'll just say bad things about you and do things that get in your way. If you're in the room, then it's very hard for people to optimize against you. And in the end, AWS made a lot of money out of running Kubernetes. So that was fine.

But architecturally, it's a big problem because now there's two platforms. There's two ways to do IAM. There's two ways to do security groups. It's a complicated mess because now you've got this layering going on. And effectively what people have had to learn is, well, we're just going to learn this Kubernetes abstraction and we're going to treat it as its own layer. And the complexity is visible to you because now you're trying to manage a cloud. And Kubernetes is designed to do that in a data center, which means you have to manage the actual underlying things and there's sort of leaky abstractions because of that. When you're running on a cloud provider, there's a bunch of things it shouldn't really need to do, but it does because it needs to be portable.

Yeah.

So the complexity sort of came out of this… mostly enterprises coming in and saying, we want something portable, we want something that we can run in the data center and in the cloud and across clouds. And that thing happened and it's caused a big complicated mess, which is what we're all living in now.

Platform Engineering

Platform engineering, I think, comes from trying to deal with that complex platform. As a recent hot topic, it's effectively a combination of trying to deal with that complexity and the Team Topologies book, which sort of gave it a name.

Yeah.

Platform engineering teams are a certain type of team. So the name got currency, and then you need tools for them, and that's become a market. So that's my theory for why we ended up where we are today, anyway.

And I'd actually love to dig in on that a bit, because one of the things that we saw at AWS last year is, and this is something I was worried about, because I think platform engineering is important. I think you can do it and DevOps well. I think there's different platforms and what your abstraction is really depends on where you're running. Like you can build a great platform on Heroku. You can build a great platform on Git. But one of the things that was concerning to me is we've seen in our industry this rebranding of roles, right? Like a lot of DevOps people might be what you would have called Ops a few years ago, but like they got this role, right? And like it happened a bit with SRE outside of FANG, right? Where it's like, I'm an SRE. Like, what are you doing? It's like, I'm the Ops guy.

Used to be called a sysadmin back in the 1980s.

Yeah. And so like we were worried that like, you know, people are going to start just rebranding as a platform engineer. And what we found at re:Invent last year and KubeCon Paris this year is this is actually happening.

A lot of people that are in this DevOps role are now a platform engineer. And it's funny, you'll talk to them and they're like, “I'm a platform engineer.” “Like, what do you do?” “They just rebranded my team because it's popular now.”

With this happening, what are your thoughts on the current state of platform engineering? Have we started to rebrand it and market it too much? Like, is that going to be a net negative for the idea? Or do you think that there is a boon that we can get out of this in engineering by how much it's being popularized right now?

I think it's a natural thing. It's a signaling/branding effort that basically says, I understand the tools and practices in this space, the modern tools and practices. I mean, it's happened in other areas. We used to call them statisticians and then they became data scientists and now AI engineers or something, right?

Valid.

They're the same people. You've learned the same things. Certainly statistics and data science are the same thing. But the practices and tooling are different, right? The underlying theory, when you get to pages of Greek symbols, it's all the same stuff. Underneath, they're all doing regressions, fitting models, clustering, and things like that.

So there's a layer at which it's the same underlying… Like if you were a new graduate and you have a certain aptitude for something, like you learn the tooling of that generation and there's a job title that goes with it. And then you know how to do that thing. And then if you reinvent yourself year after year, maybe you change your job title and learn some new tricks, but your aptitude is the same thing.

Like some people are happy staring at pages of Greek symbols. Other people are happy staring at a bunch of machines trying to figure out why they aren't working today. Or writing code or whatever it is that you like doing. So that's, I think, the sort of underlying, like where is talent and how can you apply it?

And then the way you optimize your career is to figure out… if all the hiring adverts say, I want to hire platform engineers, then you're a platform engineer, right? And that forms a trend, and that's just natural.

It's sort of a marketing branding thing. Having worked in marketing a lot, yeah, marketing works. There's a phrase that says perception is reality. And that's what marketing does. It sort of takes perception and reality forms around that perception. And in some sense, yeah, it's marketing, but that's not a bad thing. It's a way to brand something. And hopefully it's better, right? You've got some more tooling. You've got a better understanding of some good practices. And then every now and again, people reinvent or rediscover something from 20, 30 or 50 years ago or whatever.

You know, Conway's law pops up from 1967. And I get to hang out with Mel Conway every now and again, which is very cool, but he's in his 80s now. He was a young guy in his 20s when he wrote that paper.

Repatriation

Speaking of discovering things that are old, I don't know how pervasive it is, but there have definitely been some articles about it recently: this idea of repatriation, people leaving the cloud and going back to data centers. What are your thoughts on maybe smaller or medium-sized organizations, that may not warrant the power of what you can get in the data center, going back towards the data center? Is that a good place for many organizations? Or do you think that for earlier-stage companies, being on a PaaS or in the cloud is a better place for their workloads?

Why are you doing it? I mean, if you're just doing it to save money, that can be a reason, or if you can't get what you need somewhere else. I mean, what competency are you trying to build up? That's really it. And then how much ROI are you going to get from that investment? Everything should be driven off of that.

If you are a company that needs to learn in the long term how to run stuff yourself, sure. For most people, if you're just trying to build something and become a SaaS provider for something else, then if your selling proposition is that you're going to do it cheaper than everyone else and most of the cost is cloud provider costs, you need to find a way of doing it cheaper. And maybe you can take out somebody else's margin and figure out a way to optimize it.

It's usually cost, I think, that people look at. And the problem there is when the true cost is bigger than the little bit of maths you did suggested. The total cost of ownership is typically much higher, and then there's the distraction of having to deal with it.

I need an AI machine to do some training. Am I just going to hit a button and get one from the cloud for a few hours, or am I going to go through the hassle of trying to get this machine running and sitting in the corner of the room getting too hot and, you know, every now and again I have to go buy a new board for it or whatever. And how much time am I spending doing that versus getting one from a cloud provider for a few hours.

How much ROI are you going to get from that investment? Everything should be driven off of that.

So that's one way of looking at it. The other thing is that running stuff in the data center is hard. It always has been hard, but it's become easier with the layering of the platform and the tools. So with Kubernetes and the various tooling around it, and the services you can get from hosted data centers, it's just less work and less hassle to do it yourself than it used to be. So it's more viable to do.

And then the other thing is just the big shortage of GPUs. If you can get a GPU, maybe you want to just run it yourself, if you can physically buy one. If you can't buy one, or you can't buy enough of them, you'll end up using a cloud provider because they actually have them. You can rent them by the hour and share them, that kind of thing. But those are the most expensive, because the providers are trying to make you not sit on them for the hours you're not using them; they want other people to use them.

And then there's the new tier of cloud providers like Lambda Labs, which is actually where Jeremy Edberg, who I mentioned earlier, is currently working. They just sell GPUs, basically, with a very simple layer over the top. It's not a full-on AWS or GCP or Azure; it's more like a hosting model with a bit on top of it. So I think the other reason this is happening is just this big new workload around GPUs that suddenly turned up, that's very expensive to run, and that people are still figuring out. I mean, there's a massive shortage and they're expensive right now. That's probably a temporary situation; they'll probably get cheaper and easier to get as production catches up with demand.

AI and Machine Learning

You've also talked a lot about the need for multiple platform teams, internal and external platforms. With AI and all this machine learning that we're doing, how do you see AI being used by platform engineers or DevOps engineers in pursuit of platform engineering or just in their jobs in general?

Yeah, there are a few good things going on there. So I did this blog post called “Platform Engineering Teams Done Right…” where I talked about different types of platform teams. There's a mobile platform team. There's an API platform. There's maybe a CI/CD platform. There are all these different aspects, and each of these is itself a platform that potentially requires a different set of skills and tooling. And then the cloud provider is another layer of platform.

I wouldn't see it as one platform team that tries to do all of those things. I'd see it as a series of platform teams, more like those Russian doll things, where there are layers of platform, and the top layer is the one that expresses your business problem in nice, easy-to-consume chunks. That book I mentioned, The Value Flywheel Effect by David Anderson: they use CDK patterns as building blocks, where you could assemble a product extremely quickly from predefined patterns that all had the best practices baked in.

Oh, very cool.

It's like a Lego brick version of building patterns. You can build the full on custom model car or you can get the Lego version that's got lots of sharp corners and looks a bit weird, but you built it in an hour instead of several days.

Yeah.

That's kind of the analogy I like to use: particularly if you're starting out or exploring an area, you want to do the Lego version until you figure out the shape of the thing you want to build the shiny, smoothed-off version of. I've done a few talks where I use that analogy more explicitly. So that kind of works, from my point of view, as the way to think about it.
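To illustrate the Lego-brick idea in code, here's a hypothetical sketch of a CDK pattern construct: a reusable wrapper that stamps out a Lambda function with a team's opinionated defaults (tracing, log retention, timeout) baked in, so product teams assemble from the pattern instead of re-deciding those details every time. The names and the particular defaults are assumptions for illustration, not anyone's actual pattern library.

```typescript
import { Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

// Hypothetical "pattern" building block: a Lambda with preferred defaults
// baked in, so consumers only supply the business-specific inputs.
export interface StandardFunctionProps {
  codePath: string;       // path to the handler asset
  handler?: string;       // defaults to index.handler
  timeout?: Duration;
}

export class StandardFunction extends Construct {
  public readonly fn: lambda.Function;

  constructor(scope: Construct, id: string, props: StandardFunctionProps) {
    super(scope, id);

    this.fn = new lambda.Function(this, 'Fn', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: props.handler ?? 'index.handler',
      code: lambda.Code.fromAsset(props.codePath),
      timeout: props.timeout ?? Duration.seconds(10),
      tracing: lambda.Tracing.ACTIVE,             // opinionated default
      logRetention: logs.RetentionDays.ONE_MONTH, // opinionated default
    });
  }
}
```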

Sustainability

With some of your work you were doing at Amazon around sustainability, like how has the heat and energy consumption of GPUs and our need for these, how has that affected sustainability at Amazon and just, you know, your work and your thoughts on sustainability in general?

That's a hot topic right now. Literally a hot topic, I guess.

I'll get you a 400 foot fan.

Yeah. Well, I left Amazon two years ago. For the last couple of years there I was in the sustainability organization, trying to get AWS's act together so that they had some common messaging. If you ever go to re:Invent, there's a sustainability track. We had to do a lot of arguing to get that track to exist. It still exists. So, you know, a few things like that, like setting up the internal mechanisms for getting everybody together.

There were bits of sustainability happening in scattered pockets all across AWS. And what I did there for a year or so was gather all that together and get some mechanisms and some common language and the standard presentation deck that everyone used.

I couldn't actually fix some of the things like transparency and the tooling, but we did produce the Well-Architected guide for sustainability as a pillar, with a lot of advice in there on how to build more sustainable applications. That was kind of what got done there.

What's happened since then… well, AI was happening then… but the real GPU explosion that happened last year, the reports on what happened have not come out yet for Google and AWS. They come out in July, detailing what happened in 2023. So we don't have data for that.

But Microsoft releases a little earlier; their roll-up of their sustainability data is quicker for various reasons. Mostly, Amazon has a huge global delivery business, which takes a lot of time to figure out, whereas Microsoft is more of a software organization, so it's faster for them.

They released their annual report and there was a bunch of talk about this a month or so ago because they said, we were working on reducing our carbon footprint, but it started going up because of all the AI rollout that we did. We had an unexpectedly large deployment of very hot GPUs. And we've had to do that very quickly. Whereas building out more sustainable, more solar farms and wind farms and whatever is on a longer time scale. So they've ended up with a shortfall. So their numbers are going backwards.

And this is indicative of a problem across the industry. There is so much potential demand for these GPUs, which are 700 watts to a kilowatt or more each. To the point that the latest Nvidia ones are 1200 watts water-cooled and 700 watts air-cooled, because if you go much beyond 700 watts they just melt; you have to go to water cooling if you're up in the kilowatt range. They can see this is the deployment pattern we're looking at, this is the energy to supply it, there's a shortfall, and we're going to have to generate more energy to cover it.

The sort of pessimistic view is we're going to have to go build a bunch of gas-fired power stations to do this. And delay turning off other things like coal plants and whatever. And the more optimistic one is, well, we keep underestimating the rate at which solar and wind and batteries are deploying. They are more than doubling every year, and we just need to crank up and do more. So you can see both sides of that argument.

My emphasis is, let's see if we can use GPUs more effectively, tune them up.

I think that the energy may not get cheaper quickly enough because of this, because the demand's going to be extra high. But that's kind of the problem that people are currently sort of working on. So my emphasis is, let's see if we can use GPUs more effectively, tune them up. I've been sort of trying to look at some observability tools and performance tuning around using GPUs more effectively.

Yeah, and it's not a nominal impact on sustainability. It's a pretty substantial impact versus traditional CPUs, right?

Yeah, I mean, there have been a number of doom-and-gloom reports over the decade saying data centers are currently 3% or 4% of global energy or carbon or whatever and, in a decade, it'll be 10%. And it's never actually been that; we've always managed to keep it at 3% or 4% by decarbonizing the energy that's used for it and being much more efficient about how IT is used.

Like the move from data center to cloud is a huge efficiency gain because the data centers for cloud providers are much higher utilization than the ones used by corporates. So with all these transitions, it will probably be somewhere in between. I think the percentage will go up, probably not to the extent of the worst claims, but it's likely to be difficult to work through that transition. Depending on which country you're in as well.

How do you advocate for sustainable practices in cloud? And this is obviously going to be potentially an important change...

Have you got another two hours for the podcast?

I've actually been on a few podcasts just talking about that. The Green Software Foundation is probably the best place to go look; it's another Linux Foundation thing. I tried to get AWS to join and ran out of time before I left, and AWS didn't end up joining, but that was one of the things I was working on. Google and Microsoft are there. I'm running a cloud-provider-focused project at the GSF where we're talking to the cloud providers, trying to get better information out of them, and also documenting all the tools in the space and where the data comes from. There's a big, publicly accessible Miro board that's linked off the GitHub account.

Oh, cool.

If you go to the Green Software Foundation GitHub and look for Real-Time Cloud, there's a screenshot of an old version of the Miro and a link to the current version that people can go see. So if you're asking, what is the tooling and where does it get its numbers from? I've heard of this thing with a weird name, what does this vendor do? The Kepler project from CNCF is a really key project; what does that actually do? All those kinds of things are laid out end-to-end in a kind of flow chart on the Miro board.

Awesome.

I’m happy if people find things that are missing to just keep adding to it. That's sort of the collective brain trust of everybody that's working on this, all the different bits and how they all fit together.

Yeah, we'll definitely include that in the show notes. That'd be cool to see.

Overcoming Resistance to Improving Practices

So I know we're getting to the top of the hour. One last question for you. For individuals, like individual contributors, that are looking to push for better DevOps or platform engineering practices in their org, but they're struggling to get there. Maybe the management doesn't believe in it or the engineering team maybe just says it's the Ops team's problem. What is an approach that an IC can use that really wants to better their practices in an organization that might be a bit resistant to it?

There is a well-architected guide if you're mostly around development and operations practices. There's the AWS one. There's a Microsoft one, which has actually been adopted by the GSF. So they're kind of the same thing. So the GSF has a guide, which is, I think, the same as whatever Microsoft publishes. They're roughly the same. A bunch of good advice there for how to think about it.

Internally, generally, the pressures that are causing people to care about sustainability come from different areas. And I say it's top-down, bottom-up, and side-to-side.

So top-down, there's regulations that say you need to disclose your carbon footprint. Depending on where you live in the world, those regulations are either happening now or somewhere in the future, but they're coming. They aren't going away.

In Europe they're current. The UK government has regulations: to sell to the UK government you have to disclose your carbon footprint. Currently they're one of the strictest; you have to do it, otherwise you don't get the contract. Whereas most of the others are, well, you've got to measure this year to report next year. Or in the US, there are some rules which start to bite in several years. But it's coming. So there's that sort of top-down management, board-level reporting kind of thing. That tends to cause companies to want to do it.

Then there's also the marketing. Marketing wants to be able to say my product is greener than yours, in some sense. Or I want to have a brand which has some positive environmental attributes to it. So that's kind of marketing top-down. Products, product definitions, you know, there's things there that matter. So that's the top-down piece.

The bottom-up piece is just employees that care, right? If you have young kids and they come home from school saying, “I've heard that the world's going to end in a climate crisis. What are you doing about that?”, do not underestimate the power of that as an influence; it is incredibly strong. It's either young people going, “I've got to live on this planet for the next 50 years or whatever,” or people just trying to explain to their kids what's going on. That causes a lot of, “Well, I should do something about it.”

The more you know, the less you realize you know about how all this stuff works.

And the side-to-side is suppliers, purchasing, and customers. So, for the things you buy, you should be asking what's the carbon in that. For all of your supply chain, you should say, well, when do I get the carbon footprint of the things I'm buying from you as part of the purchase?

The EU is mandating that now; that's coming in. So if you want to sell something to Europe, you're in that supply chain and you have to disclose your carbon, regardless of where you are in the world. You could be in California, but if you still sell to Europe then they need to know the carbon. So the US doesn't get out of it; it's viral in the same way GDPR was viral. It kind of spreads through the legal requirements.

Then, on the consumer side, people are going to be asking you. And the problem then is you have to take your total carbon footprint and then attribute and allocate it. You've got to say, okay, this feature of this product uses this carbon. And these are the customers of that product and they used it in this proportion. And I'm going to somehow allocate out some amount of my carbon that is attributable to the customers. And I have to figure out an algorithm for doing that. And that's actually a fairly complex problem.

If you're trying to do that within Kubernetes, because you've got lots of things running, that's what Kepler (Kubernetes-based Efficient Power Level Exporter) does. Kepler uses eBPF to look at what's going on, and it will tell you that this workload, which is part of this namespace and is this set of containers and pods, is consuming this much carbon out of the total carbon that those nodes are consuming in that cluster. It maps the energy usage and the carbon onto that. You have to do something like that, but at the SaaS-provider or customer level as well.
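As a back-of-the-envelope sketch of that attribution step: take measured energy per workload (the kind of per-container number a tool like Kepler reports), multiply by the grid carbon intensity for that region and period, then allocate the result to customers in proportion to their usage. The figures, names, and the simple proportional split below are assumptions for illustration, not Kepler's API or any provider's actual methodology.

```typescript
// Toy attribution model: energy per workload -> carbon -> per-customer share.
// All inputs are made-up examples; a real pipeline would pull energy from a
// metrics system and carbon intensity from a grid data source.

interface WorkloadEnergy {
  namespace: string;
  kilowattHours: number; // measured energy for the reporting period
}

// carbon (gCO2e) = energy (kWh) x grid carbon intensity (gCO2e/kWh)
function carbonForWorkload(w: WorkloadEnergy, gridIntensity: number): number {
  return w.kilowattHours * gridIntensity;
}

// Allocate a workload's carbon to customers in proportion to their usage share.
function allocateToCustomers(
  totalCarbon: number,
  usageByCustomer: Record<string, number>, // e.g. requests or GB served
): Record<string, number> {
  const totalUsage = Object.values(usageByCustomer).reduce((a, b) => a + b, 0);
  const allocation: Record<string, number> = {};
  for (const [customer, usage] of Object.entries(usageByCustomer)) {
    allocation[customer] = totalUsage > 0 ? (totalCarbon * usage) / totalUsage : 0;
  }
  return allocation;
}

// Example: 120 kWh for one namespace at 400 gCO2e/kWh, split across three customers.
const carbon = carbonForWorkload({ namespace: 'checkout', kilowattHours: 120 }, 400);
console.log(allocateToCustomers(carbon, { acme: 500, globex: 300, initech: 200 }));
```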

There's a whole lot of things there. If your company doesn't seem to be doing anything about it, you just start asking questions, doing it as hack day projects, things like that. But most companies have at least got some kind of report going on. So go look at your company's corporate report, try to understand where those numbers come from, what you can do to influence them. And look at some of the CNCF… well, CNCF actually has an environmental group as well, I meant GSF, the Green Software Foundation. There's a podcast, there's a whole lot of information there, so lots and lots of things to do there. So just get up to speed on it. That's really what people can do individually.

There are a lot of concepts and models; it's confusing. The mathematics of it is just multiplying two numbers together: how much energy, and how much carbon was in that energy? Picking the two numbers is the hard part; it's ludicrously complex to answer those questions. The more you know, the less you realize you know about how all this stuff works. It's one of those areas. So just becoming an expert in it is useful.

Closing and Contact Information

Awesome. Well, I really appreciate you coming on the show today. It's been fascinating to talk to you. Where can people find you online?

I'm generally on Mastodon; on mastodon.social I'm AdrianCO, so you don't have to try and spell Cockcroft correctly. I've given up on Twitter. I've got a Medium account with a bunch of blog posts, and I'm on GitHub as AdrianCO as well. There's a /slides GitHub repo that has lots of presentation slides from talks I've given. And there's other stuff you can find. I'm relatively easy to Google or whatever.

Other than that, I go visit conferences when I feel like it, when they're in interesting places. That’s my “I'm semi-retired and I do what I feel like” thing.

Nice. And how long until we can subscribe to your drumming channel on YouTube?

That drum kit has been there for about a week at this point and I cannot play it properly. Being retired means I take up guitar lessons and drumming lessons. And there is stuff on SoundCloud, if you really want to torture your ears, going back to cassette tapes from 40 years ago when I was in a heavy metal band, and things like that.

Oh, do you mind if I put that in the show notes?

Sure, you’re welcome to.

Awesome.

It is the quality you'd expect of somebody who had a cassette tape recording when they happened to make some noises and didn't want to lose the cassette. So I just uploaded it.

I love it. Well, thanks again. I really appreciate the time.

You're welcome. Cheers.

Featured Guest

Adrian Cockcroft

Tech Advisor