Security and Scalability with Justin Berman from Thirty Madison

In this episode of the Platform Engineering Podcast, Cory O'Daniel sits down with Justin Berman, Vice President of Platform Engineering and Chief Information Security Officer at Thirty Madison. Justin shares his journey from software engineering to security leadership, discusses the challenges of building secure and scalable platforms, and offers insights into the future of platform engineering and security integration.

Guest: Justin Berman, VP of Platform Engineering and Chief Information Security Officer at Thirty Madison

Justin Berman is the VP of Platform Engineering and Chief Information Security Officer at Thirty Madison. Previously, he was Head of Security at Dropbox, responsible for Dropbox’s information/cyber security, content safety, and platform abuse prevention capabilities, which provide unmatched protections to their users, staff and products. In this role, Justin and his team ensured that Dropbox enables storing, sharing and collaborating on various kinds of content in a trustworthy and secure manner, for all their customers worldwide, from large enterprise Dropbox business accounts to individual consumers. Prior to Dropbox, Justin was the CISO at Zenefits, where he was responsible for scaling the security and IT capabilities and developing the privacy and risk/compliance capabilities.

Thirty Madison

Transcript

Intro: 00:04

You're listening to the Platform Engineering Podcast, your expert guide to the fascinating world of platform engineering.

Each episode brings you in-depth interviews with industry experts and professionals who break down the intricacies of platform architecture, cloud operations and DevOps practices.

From tool reviews to valuable lessons from real-world projects, to insights about the best approaches and strategies, you can count on this show to provide you with expert knowledge that will truly elevate your own journey in the world of platform engineering.

Cory: 00:40

Hey everybody, welcome back to the Platform Engineering podcast. I'm Cory O'Daniel and today I have with me Justin Berman. Justin's the Vice President of Platform Engineering and Chief Information Security Officer at Thirty Madison, a company on the forefront of healthcare innovation, making specialized care more accessible and affordable for everyone. Justin has a distinguished career that includes leadership roles at Dropbox and Zenefits. Justin brings a wealth of experience in both platform engineering and security. And for anybody who knows me and the tin foil hat that I wear, I'm very excited about this conversation today.

At Thirty Madison, he's not just leading the platform engineering efforts, but also ensuring that security is deeply integrated in everything that they do. He's also an active advisor and angel investor. So if you need that money, holler at him. And he's working with VCs on some of the latest developments in the field. In today's conversation, we're going to get into his journey, his experience in platform engineering and talk about some of the challenges of building a secure and scalable platform.

Justin, welcome to the show.

Justin: 01:32

Cory, thanks so much for having me.

Cory: 01:34

Yeah, I really appreciate you coming on today. So I'd love to just kind of start with your journey. You've had an interesting couple of companies that you have worked for, and now you're leading platform engineering and security at Thirty Madison. I'd love to hear about your background and maybe a little bit about how security has influenced your approach to platform engineering.

Justin: 01:52

Sure. I feel really lucky. I got to start my career in software engineering a long time ago with pivoting to security really early in my career. Nowadays, I think it's actually much harder to get into the field in the sense that people are looking for degrees and whatnot. I've been doing this for like a little over two decades and there were no degrees in security when I graduated college with computer engineering.

I think the story of my career is the story of my brain. The thing that I value more than anything is learning and growth. I think because of that, I'm very type A about way too much. And so the whole way through my career was just, what can I do next? What can I do next? How can I broaden my perspective? How can I learn new things, lead new things?

I did consulting for a while early on in my career, which is a great way to get exposure to many ideas from many people. And then went in-house, when I realized that a lot of security consulting finds problems, but the companies that you work with don't fix those problems.

Cory: 02:57

Oh no.

Justin: 02:59

I don't think it's a don't fix like they don't fix it because they're bad. I think it's actually in part a don't fix because I think security consulting is broken at some level about the way that it spends its time versus what companies actually need to get.

I think the consultancy I worked for and with was great. Going in house was just like an ownership story. Which is the other watchword for me, I just really love owning impact. I don't mean that I have to be personally responsible for every change. But man the satisfaction of driving through complex and challenging change that lands in a place where a team full of people accomplished things that no one ever thought that they could individually is far and away one of the most satisfying things I get to do in my career.

Ultimately, like even from my first security leadership job when I joined Flatiron a long time ago. I was like employee 60 there. And I remember thinking to myself like, why don't we have an IT team? And I told my CTO like, why don't we have an IT team? And he was like, well, why don't you just go build one? And I was like, okay. It's just like a learn on the fly thing all the way through. And I got to work a lot with a platform engineering team or at the time we called them like CloudOps and DevOps at Flatiron. And then when I went to Zenefits, same, it's like I was leading security and IT and starting to take over chunks of the Cloud operations teams there as leaders shifted around.

At Dropbox, I was focused on security, but also what we called anti-abuse. Which was a bunch of software engineering teams and data science teams that focused on how to keep things like child pornography off Dropbox.

Then coming here to Thirty Madison has been a kind of refocusing on breadth over pure depth in security. You know, at Dropbox, the security team needs a leader at scale for just those efforts. Thirty Madison, I joined as employee 120 and they don't need someone of my seniority. At 120 employees and like the size we are at, we didn't need someone of my seniority for just security. And so it was part of joining them to say I want to focus on the breadth and take on problems that I haven't spent as much of my career focused on.

Long story short, type A-ness and a relentless focus on learning new things and growth, and then on impact. That is ultimately the reason that I went from a pure security leader (lots of people in the security field will be there their whole lives) into someone on the now platform engineering side as well. And now ultimately on a path towards more of like CTO or who knows... I'll probably do the CTO thing until I feel like I'm good at it and then be like, well, now I've got to find something else new to do. That to me is like the very fortunate story that I've been able to chart in my career and picking up lots of cool VC opportunities and other stuff like that throughout.

Cory: 06:09

That sounds like a pretty awesome path. Mine was very similar. I just kind of stumbled backwards into the Ops side of the world, coming from engineering. I feel like there was a point in time where that just kind of happened a lot. It's like, well, no one else around here is doing it and you mentioned it, so now it's yours. And yeah, it's a fun little career path.

So before we kind of hop into security and platform engineering, can you just maybe give us a little bit of background on Thirty Madison and how platform engineering is affecting your team?

Justin: 06:38

Thirty Madison's a healthcare company, specifically a digital healthcare company. All of our services are offered online. The focus of the organization is really cool. First of all, it's always nice to work for a mission-oriented company where the mission is authentically one where people feel the impact day to day.

We hear from our patients about the fact that we're meaningfully changing their life. Whether it's because they live in a healthcare desert and thus they get access to healthcare that they wouldn't have been able to get easily otherwise and they would have had to travel far for. Whether it's because of their busy lives and this way they don't have to wait in a doctor's office for an hour. Whether it's because our cost of access is lower because we use technology to empower our doctors. Whether it's because frankly we are letting them get specialist quality care (that's really another point on the healthcare desert side), but that focus on improving access to quality and also price of care for the patient. I wish the US didn't need so much help with this, but we do. And so I'm really overjoyed to work for an organization like this.

As far as how platform engineering fits into that, it's helpful to define the scope of the platform engineering organization here at Thirty Madison. So I'm setting the security team aside for a second because that's like my CISO hat, which is separate but different. We have an infrastructure function that focuses on probably most of what most people who listen to your podcast expect. We have a developer experience team within that. We have an SRE team within that that focuses on both centralized as well as deployed practices. We have a data engineering team that focuses on the data infrastructural layer.

Cory: 08:27

Oh, very cool.

Justin: 08:28

We have, what we call, core services, which is really an application infrastructure type team, like building shared services and maintaining them for our engineers. And then we have our corporate IT functions, which is everything from like office AV, networking systems, et cetera. But they also own some of the production critical systems that work within our pharmacies because we have large physical mail order pharmacies that help us make sure we can support our patients.

So the mission of platform engineering within Thirty Madison ultimately falls into two big groups. There's protect and then there's empower.

Cory: 09:10

Are these team names for these groups?

Justin: 09:11

No, sorry. These are the two overarching responsibility areas that the whole of platform engineering...

Like if I take the infrastructure, the SRE team is a team that does both, but like their ultimate mandate is to make sure that our systems are in fact reliable enough that patients get the healthcare they need, that our enterprise partners are happy, that we don't churn people because they're frustrated with the services that they're getting, that our doctors always have access to get to the things that they need to support the patient.

Versus like DX is an empower team. Their job is, and like all the way down to the DORA metrics they care about, their job is to accelerate the delivery of high quality engineering features by building things like the dev environments (we happen to be focused on remote dev environments or cloud dev environments). They build the tools so that engineers get to do their jobs and focus.

I think a different way of saying empower and protect is that like empower is about offering people the ability to not care about a bunch of things so that they can do their job more effectively by focusing their time and attention in the right places. And protect is about allowing a bunch of people to not have to be afraid that either they have created a problem or that a bad guy from outside creates a problem (whether that's about reliability or security or privacy or compliance or whatever else). But ultimately, I think everything platform engineering does as a whole falls into one of those two buckets. Then each individual team has their own mini version of that mandate that they specifically work on these processes or these systems or these aspects for the company.

Cory: 11:02

Yeah, I really like cataloging the teams into those two groups. I feel like empower and developer self-service and DevEx are terms that we hear a lot. But while we think security is important, actually having security being a part of the platform is pretty exciting.

As far as the security aspects that your teams are focusing on, is it just securing your internal platform or actually extending that security through like service to your engineers, where they literally just don't have to think through things like TLS certificates and rotation and pen testing and all that stuff?

Justin: 11:38

First of all, I like the question because you're letting me talk about what I think the modern view of a security team should be anyway. I build security engineering functions because I do not think that like... I understand why in certain contexts, more traditional or legacy approaches to developing a security team are very valuable and necessary. However, I have the luxury of working for predominantly engineering-centered companies or at least companies in which technology plays a key role, and as a result I choose to build teams of engineers. So the short answer to your question is security treats solving problems systemically for other engineers as a huge part of their value add.

The first layer of that is, security expects no one else in the organization to understand better than them how to know if something is potentially a security problem or not. This is a spicy take maybe, I do not believe it is fair or reasonable to try and say to your developers that all of them have to be security experts, that they're all responsible for the security in the same way.

Cory: 12:50

You can just simply shift it left, right? [Cory laughs]

Justin: 12:54

I think it's fair to say things like everybody has some part to play, but I don't expect them to train on how to see whether a given piece of code they wrote is vulnerable. Our job is to tell them if it's vulnerable within security. But then that's not enough for me because if we just keep making the same mistake over and over again, then to me, that's a sign that something about the way either the architecture is set up or the platform is developed (and I don't just mean platform engineering, but like the whole of the technology) or about the expectations or about the feature requests from products... something is broken or at least it could be better. And security's job is to find what's the systemic change that could be made. And then, wherever possible to make it for people and to land it with trust.

I mean, even at Dropbox, I think one of the most impactful things the security team did was they built a framework for front end that everybody used and security took responsibility for. It essentially rendered it such that no front end engineer of Dropbox's products had to worry about the question of are certain classes of vulnerabilities present in my code? Like, is there going to be cross-site scripting or cross-site request forgery or these other problems within the code. Instead of security looking at that as "we find it, you fix it", they're like, what if we built a thing so that no one could do this anymore?

That's the same attitude we take at Thirty Madison, albeit at a smaller scale. The same exact idea, we pick off problems and we solve those problems in ways that are systemic so that engineers don't have to.

TLS certificates are a great example. My infra teams manage the cloud in such a way they shouldn't have to care about that. But you know access control is a good example, we built a library and have checks for engineers to say like, hey, if you don't include a specific thing that says who is allowed to access this page, then it will default to no one can access this page. So you can never make a mistake that's dangerous for our patients. And you have to be declarative about who's supposed to touch a thing. Which is way easier for most of our engineers to swallow than like, wait, I forgot to write this check and now security's hammering me about the fact that this check isn't here in the first place.
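As a rough illustration of the default-deny idea Justin describes, here is a minimal Python sketch. It is not Thirty Madison's library: the names `allow`, `render`, `AccessDenied`, and the role strings are all hypothetical, chosen only to show how a page with no declared access rule can default to "no one can access this page."

```python
from functools import wraps


class AccessDenied(Exception):
    """Raised when the caller's role is not declared for a page."""


def allow(*roles):
    """Declare which roles may access a page handler (hypothetical API)."""
    def decorator(handler):
        handler.allowed_roles = frozenset(roles)
        return handler
    return decorator


def render(handler, user_role):
    """Run a page handler only if user_role was explicitly declared.

    A handler without an @allow declaration has no allowed_roles
    attribute, so it falls back to the empty set: nobody gets in.
    """
    allowed = getattr(handler, "allowed_roles", frozenset())
    if user_role not in allowed:
        raise AccessDenied(f"{user_role!r} may not access {handler.__name__}")
    return handler()


@allow("clinician")
def patient_chart():
    return "chart"


def billing_export():
    # No @allow declaration: denied to every role by default.
    return "export"
```

With this shape, forgetting the declaration fails closed: `render(billing_export, "clinician")` raises `AccessDenied` rather than exposing the page, while `render(patient_chart, "clinician")` returns the page content.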

Cory: 15:15

I kind of love that. Security teams, security experts... that's a niche. Like that's a smaller group of people compared to the number of software developers that we have on the planet, right? I think one of the boons that I've always felt you got from a good platform initiative is scaling the expertise that is hard to get in many orgs. So operational expertise, that security expertise. And if you can productize that, where you have a small team that's able to serve security, serve scale, serve reliability to other engineers, like that's when you're doing good platform engineering.

I've been at companies that had security teams and I've actually worked with outside security consultants and some customers before where it's like, they got two or three people that are in security, but they don't have enough of that bandwidth to start to automate themselves. And so it's a lot of like checks and balances, which just kind of slow the team. Like, hey, I caught that there was a problem at the end of your software development life cycle, and now we've got to hold this feature back. Which is a bummer for everyone.

Justin: 16:23

It is. Like, we found this thing in prod and now you've got to go hot fix and now you've got to go spin up a ticket, and the EM argues with the product manager because they're like, "Wait, I want to land this feature. You can't..." From my perspective, like the earlier in process, the better. One of the things that we measure about this is, we have tools that are supposed to help developers find it as early as in their IDE. In the case that we find something we want to look at the ratio of things we find in prod to things pre-prod and specifically at each different phase of the development lifecycle. With the ideal to shift as much of it to the actual point in time where the developer is writing code as possible.

There's also a separate thing we track around which vulnerability classes do we think are just solved. So no developer should ever have to think about it. But in terms of finding problems, part of how I want to incentivize the security team to build and think is to be like, wait, I need to catch that earlier and earlier and earlier in the process. Because if I do that, then that helps us live into this maturity view of like what it means to do security well with our product and infrastructure engineering counterparts. But yeah, I'm super with you.
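The measurement Justin mentions, the ratio of findings in prod to findings pre-prod and the breakdown by lifecycle phase, could be computed along these lines. This is a sketch under stated assumptions: the phase names in `PHASES` and both function names are hypothetical, and the input is simply a list recording the phase in which each finding was first detected.

```python
from collections import Counter

# Hypothetical phase names, ordered from earliest to latest detection point.
PHASES = ["ide", "code_review", "ci", "staging", "prod"]


def share_by_phase(findings):
    """Return the fraction of findings first detected in each phase."""
    counts = Counter(findings)
    unknown = set(counts) - set(PHASES)
    if unknown:
        raise ValueError(f"unknown phases: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {phase: 0.0 for phase in PHASES}
    return {phase: counts[phase] / total for phase in PHASES}


def prod_to_preprod_ratio(findings):
    """Return prod findings divided by pre-prod findings.

    Returns None when there are no pre-prod findings, since the
    ratio is undefined in that case.
    """
    counts = Counter(findings)
    pre_prod = sum(counts[phase] for phase in PHASES if phase != "prod")
    if pre_prod == 0:
        return None
    return counts["prod"] / pre_prod
```

Tracked over time, a falling `prod_to_preprod_ratio` and a growing `"ide"` share would indicate detection is moving toward the point where the developer is writing code.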

I think the reason people still hold this view of security as opposition is because of those moments where it feels like they're tossing a fucking thing over a wall, instead of coming to you in partnership and saying like, Hey, we need to fix this thing together.

Cory: 18:07

Yeah.

Justin: 18:08

To me, one of the fundamental reasons behind that is when you hire security people. One of the things we decided to do very explicitly is you don't demand traditional software or infrastructural engineering competencies, but like every one of my security engineers is tested as either a product engineer or an infrastructure engineer of some level as part of the interview or trained, if we're hiring someone quite junior and we're not going to expect that level of competence. But, if you want them to be engineers, make them be engineers. If you don't want them to be engineers, if you want a bunch of analysts and they're just going to throw shit over the wall, then recognize you're going to have a bad relationship and there's going to be a resentment from engineers to the non-engineer.

Cory: 18:55

Yeah, I think that's one of the things that's pretty key.

I feel like you saw this across the board in Ops early on too. Like in the early days of DevOps is like a lot of Ops folk were like, well, I'm not a software developer. And it's like now more of them are software developers and we're able to have nice things like the Terraforms and the Kubernetes, right? Because these people started doing this kind of work.

I've always kind of seen software development as a tool that allowed many nerds to wedge themselves into Silicon Valley. There are some smart business people that don't know how to write this stuff, but like the more people that can code, the more that we can produce. So I think it's important for people outside of engineering to learn about software. But it is interesting to still see in the big glob of like what we'd call engineers, there's still a fair number of people that don't have (not necessarily any traditional background like getting a degree in software development) production experience.

For somebody who's into continuous learning and growing teams... for that security engineer that's listening today who is like, "I am more of the analyst, I don't necessarily have this background in software engineering. I haven't put anything into prod." How do they get there?

Do they just go and take a bootcamp? Do they try to pick up a side project? Like, how do they get something in prod where they feel like they can actually kind of scale their expertise if they don't have that background?

Justin: 20:28

I think it also depends on where they're starting from, right?

If you took a security track in school and it didn't have any software engineering or other like traditional infra or SRE type engineering components to it, you might be starting from a place where you really don't know even where to start. When I look at more junior security engineers, or like coming straight out of college for example, don't hold yourself to the bar of, "Man, I've got to have already landed a thing in prod" or "I have to have already written an open source project that like a million people use." Teach yourself basic scripting. Like automate the basic part of what you do in your day to day or, if you're not working yet, like some things that you had to do in school. It's one thing to do a task, do a homework, whatever, it's something else to go like, "What is the meta thing going on here and could I write a little script about it?"

Frankly, I think Bash is harder to write than Python is myself, but to each their own in terms of language and other things like that. Certainly if you're on the younger side or if you don't have that experience and you're starting from like close to scratch, if a bootcamp works for you best, then definitely do it. I don't have a strong opinion about them. I think there's some great ones. I think there's some poor ones. I don't think everybody has the time to do that, especially if they're doing it alongside already working a full-time security analyst job of some kind.

But for sure, I think almost anyone can script something. Like you have a computer at home. You feel like nowadays it's almost impossible to get away from having home automation of some kind. So probably you can script something or even just like take something about your day to day and go, "Wait, what if I made the computer do this for me instead of me doing it?" I would start there for the truly inexperienced.

For people that have written scripts or written code, but not landed things in prod, what I observe from product engineers is when a security person authentically gives a shit to try and understand their world better, they're super open to be inclusive and helping you come up. So for those who are working in an organization where maybe you are more of a security analyst and you want to develop the engineering talent, find friends. The next time you hand a Dev a bug, ask them if you can pair with them on fixing the bug.

Cory: 23:06

I love that.

Justin: 23:07

At first that's going to feel super intimidating because you're basically just going to pair program, but watch them do the thing because you don't maybe have the background to understand why they're making the choices they are, et cetera. But there's already an apprenticeship and mentorship culture within product engineering or engineering in general. Take advantage of that. Most people want to help each other if you actually care. If you just are trying to use their time to benefit you and it doesn't feel mutual, then I think you might run into static. But I think if you really care, they'll just sit down and help.

You'll also build amazing relationships by getting that trust from your engineering counterparts anyway. And by the way, I think everything we just said about security applies equally to SRE.

Cory: 23:53

Yeah.

Justin: 23:54

I think SRE is at a place where it's more likely that you're going to find an SRE right now that has a strong product engineering capability because of how those two things have grown up from different starting points. But, I think the best SREs I know have landed services in prod that they wrote, they're not just there to kind of like provide experience on how to run services at scale. Both are really useful skill sets to have. The truly best are these fusion players that have done both, and they have a level of empathy that doesn't come without the experience of actually having to get the thing into prod.

Cory: 24:40

There is like a rebranding that's happening with DevOps for many teams. So looking at your org, you said you're 120 folks on the engineering team or the whole org?

Justin: 24:50

Roughly 120 people in the CTO org. So like product engineering, platform engineering, data, that grouping of different engineering types within the company.

Cory: 25:05

What was the platform journey at Thirty Madison? What was the initiative that got your team focused on like, this is something that we need to start working on, start focusing on getting some money behind, getting everybody excited about? What was that inciting moment that turned your team towards platform engineering?

Justin: 25:22

Well, first, when I joined Thirty Madison back in May of 2021, I was the 120th employee of the company. When I joined, there was no security team yet. There was no IT team yet. Both of those jobs were being done as second jobs by other people. And there were two people who were working in what was at the time called infrastructure.

I had been advising them through First Round, because we are a First Round company, and so I knew the VP there. He wanted another like technical VP in the organization who could share the load of these things. I think he brought me in to build it. And so there was a bit of like, oh there's a mandate anyway. It's not a, like convince the organization that we need to spend money on this at first.

As with any mandate, everything has a limit to the extent to which a company is going to just keep investing without demanding answers. But I think in the first nine months of my time at Thirty Madison, like we built a security team up from nothing to four people, we built an IT team up from nothing to four people for the organization. And I think, by the end of that, fired the third party resource that we had that was pissing everybody off all the time. And scaled the infrastructure team to the point where it was 10 under an engineering manager. Who by the way is stellar and none of you can have him - a guy named Lawrence Wakefield who is just an excellent engineering director now. But at the time, scaled that up to 10 under him.

It was really easy in that first phase to just always prove value, because there were so many jobs to do. And we had, you know, a lot of capital because of the raises that we had done so that there wasn't the... you know, every company goes through these pendulums of like high scrutiny about spending to low scrutiny and back and forth. I don't think I've ever seen any company norm to a middle ground that they just stick. It's back and forth based on how much or little it feels like money is easy to use or not, or get or not. So that was a period of low scrutiny, but also it was just like easy to show that there was like so much work to do that wasn't being done. Or so much acceleration we could get by taking problems off the hands of the product engineering team. Or, in the case of IT, just taking problems off the hands of all these individual employees that were just doing stuff - like HR was doing onboarding work and offboarding work beyond just the removing of the HR systems when I got there.

I think it's the second phase of scale up... when we merged with Nurx there were zero security people at the Nurx organization, so that team stayed the same. That was a place where my CEO was super empathetic. He's like, "Go do an assessment of where they are before I go spend a bunch of more money on additional security resourcing. I want to make sure that we're not overspending in this area." Great. Did the assessments, made the case, got the additional funding we needed to bring ourselves to where we need to be and protect our patients and our company and everything else. But the whole phase was like that.

Now on the IT side, people came over. On the infrastructure side, they had their own version of the same teams that we had. That was when we made the first split in teams. Where teams were big enough that we were like, okay, it can't just be infrastructure anymore. It's going to be like a DX team and an SRE team, and they're each going to have their own leadership and their own structure and focus.

As you scale, the reason to split teams, to me, is because you want to give the gift of focus to the leaders in question in those areas. When you're small, it's like everybody's jumping in to do everything within the remit that they think they have. As you get bigger and bigger, you need specialization. And as you scale, eventually that specialization means you need a manager that is focused on a thing and makes that their first concern. So that you don't also have to treat that like your first concern in every conversation. As you balance out, you have them arguing to you about what the right things to do are. Same thing with staff engineers and other like similarly leadership positions.

I don't know if this is really answering your question ultimately because your question was like how did it scale up? I don't think I ever had a problem getting focus. I think it was just as we tightened scrutiny on spend, the justification had to be clearer.

But I feel very lucky, I have a great relationship with my CEO, both CTOs that we've had at Thirty Madison - a great relationship with Matthew who came from leading all of infrastructure for Uber. So he almost has a bias probably in favor of the teams that I lead. And Gil Swarski, who's our CTO now came from Flatiron where he and I got to work together. And he, likewise, has had the experience of working in FAANG. In several different FAANGs actually and thus has the like, "I know what it's like when I have the kind of empowerment and enablement that's great from teams like yours. So I want to make sure we continue to maintain investment here." So I guess like I probably have a rosy story because it really... it's great.

I need to provide justification, but I don't hit the like stonewalling or static of like why would we invest in this? I have leaders across from me who are interested in, "Well, cool, explain why more right now, because the choice is to spend it with you or to spend it on other things that help the company move in other ways." They want to make the best decision for the organization, which I do too.

And the fact that they're willing to talk about it and to have a logic and like data-driven conversation about it... for example, I want to justify an additional head. It's like, okay, if I have this additional head, then I'm going to drive this kind of velocity improvement as measured by either feature cycle time or frequency of regressions or something else within engineering as a whole. And if I get this resource, then I can sign up for these impacts. And if I don't get this resource, then I can't. But obviously we'll do as much as we can, but I can't sign up for this much.

Being able to have that conversation with other leaders who are more in control of the purse strings, and have them actually care about the answers and reason about it effectively makes all the difference in the world to me.

Cory: 32:26

Yeah. What about on your customer engineer side, or I guess your customer developers, developers that are using your platform, like for them, what was adoption like? Was it like, we've been struggling with problems and we're super excited to get on this? Or did you have that kind of problem of getting people to change the way they are doing things to start using the tooling that you guys were building?

Justin: 32:48

I think we do have struggles at times. Here's what I see. First, early on it was easy because we basically took the platform that people were kind of norming around for themselves anyway in product engineering and just said, great, we'll run this for you. And we happened to have done a lot of investment into staging environments in such a way to make things like ultra testable and easy.

So at first it's easy because we're just saying like, "Great, you all do this yourselves, we'll build it to be more productized for you and run it for you." I think what has been the challenge is as we introduce things that are designed to lift problems, make more data available in staging. The two things that are the biggest hurdles: if we're asking people to make a change to their workflow, and communication is actually the biggest hurdle.

You could call it change management as a whole, but to me it's really like awareness, communication, and training of people. I have so many cases where I've been sitting down with some engineer in the organization just like surveying or talking with them casually about kind of like what they think about the things - because we want to measure that they like it or at least that they feel empowered by it - but I've had so many cases where they're like, "I didn't know that that feature was released." And I'm like, "Oh, we did not share the right information in the right place or enough times." Because sometimes you'll like share it in your Slack chat, "Just released all these new features for staging. Like go read these release notes or whatever." And like people are so busy. They don't have time to read release notes.

Cory: 34:28

Yeah.

Justin: 34:29

You have got to say it way more times than you think you do and in way more places, et cetera, et cetera. But that's come up a bunch of times. And I think our company culture is very much that like, there's not an issue where those teams are like having organ rejection or they're like, "No, screw you guys, we don't want to use this." They're usually like, I didn't know that that was available yet. Thank you for telling me now.

The other issue that I find is when you're asking them to make changes that are more substantive. It's not just like, use this new version of this thing, but rather, we're going to change your development workflow significantly. Like shifting someone from local dev to like a remote dev situation, for example. I think the expectations we put on engineers about how much we deliver are very high. And so if you're asking them to do a lot of work as a DX team, to come to the new thing, that's usually very hard.

It is always very hard to get people to go like, "I'm going to spend a bunch of time revving my workflows or my dev environment or rebuilding from scratch the way I'm going to do work", because they have comfort and they're getting a lot of pressure from their product manager to deliver. And so they're like, "But I know I can get this done this way. I don't want to take the time or the risk of trying this new thing in this moment." And what I find is it's just like a lot of handholding.

One of the benefits and the trust that I think we've been able to build here is that we treat infra team very seriously - like senior engineers and EMs (we don't have product management attached to my teams) treat themselves like mini product managers. They actually go to stakeholder conversations. We do lots of feedback collection. We maintain feature lists of things that need to be evolved in all the tools - that are not just what does the DX team think is great, but is also like constantly collecting from engineering across the organization. Because ultimately it's stuff for them. I'm not building this just to build it for the heck of it, because it's fun to build cool tech. It is.

Cory: 36:48

It is.

Justin: 36:49

It is, but I'm doing this because I want them to feel like they have a more productive environment or a more effective workflow. Or just like, frankly, a better time writing code.

One of the things that sticks out to me that is like super buzzwordy right now, but like my DX team, in particular, has had some huge successes with - that goes beyond just making them more effective and just gets them excited - is like the introduction of Copilot or other similar tools for helping them write code. Getting rid of some of the cruft and grunt work of writing code.

We can talk about measurably how many extra features per sprint get delivered post the introduction of that tool and spreading it across the teams and training people on it. But what it comes down to is like engineers feel like we are actively investing in trying to help them get the chance to use and maintain fluency with technology itself and modern technologies. And I think that makes them more excited even when what they have to do is to build one more feature in their application.

Instead of that feeling of this is just one more of the same. It's like, well, at least I get to use a thing. And it's not a compromise on architectural principles for us. It's not like we're going to let you use another language just so you get to use something new. We're saying, Hey, we're going to empower you with a tool that allows you to go faster in the tech stack we have. So we're not adding a bunch of debt to the tech stacks we have by adding more tech stacks.

I think that's like a lesser talked about, but important thing that DX can do for teams. Not so much use AI, but like specifically create an environment in which those engineers are not just effective, but engaged.

Cory: 38:48

Yeah.

Justin: 38:49

Create environments for development, keep engineers around.

If you feel like you get more done, if you feel happier doing work, if you feel like this is fun, that is a differentiator on keeping your best talent around and not having to fight dollar for dollar with Facebook for them, you know?

Cory: 39:11

Yeah, for people listening to the podcast and not watching the YouTube, I just had a revelation moment there. Looking at my own resume, but also looking at resumes as we've been hiring, it's very common I think in engineering - my dad was like blown away by this. He saw my resume once. He's like, you've had like 700 jobs. Like what the hell.

He's like, you guys change jobs like every two to three years. Like you don't go work someplace for a decade? It's like, I ain't got all decade to worry about working at some job. But like engineers, we tend to have shorter tenures, right?

And like that developer satisfaction, that developer experience, that is one of those things, I feel, where it is hard to gauge where that happiness is coming from in the job, in the role. Is it your team? Is it the benefits? Is it the product? Is it the company's mission? Or is it that I love writing software and it's easy to do here? But whatever it is that's getting that developer to that happy place, like that is just a secret weapon as a business - if you can keep developers around, keep them stoked.

I think developer happiness as a KPI for platform engineering is probably - it's so hard to quantify and tell the business what it's worth - but like if you just see people being more happy in their jobs because of your platform engineering efforts, to me, that is just the best sign that you're doing a good job as a team. But like what about for your org, what do you guys use to kind of measure the effectiveness of the team?

How do you tell the rest of the org platform engineering is working, these are the numbers to prove it?

Justin: 40:56

Each team within Platform is just going to have different specific key performance metrics. Security's are mostly about the specific risks and then they're tied to things like making sure we find problems before prod. Of course, things like are we responding to security incidents at a fast enough rate - or I should say events because they're not incidents until they become a real problem for you and I want to catch them before they're real problems. So there's like those kinds of things in Security.

I mean, Security has a separate problem with measurement. Like a lot of people in security struggle to measure what really matters. And thus they give up on measurement because they're like, "Well, I can't actually measure whether like the risk is high directly, so screw it." Instead of asking what are the correlated things.

In DX, I think this is pretty straightforward for us. Or actually let's say within infrastructure, at least in terms of like DX and SRE. Let's talk about shared services team and talk about data engineering separately. But like DX and SRE use DORA, and the core of DORA metrics to us is like lead time, meaning like time between a PR open and close or PR open and merge to main. We use a releases per day metric, like how fast can people actually get code out there. Which, by the way, the goal is not to be forever up and to the right. There is a limit to how fast you want to release code before it starts negatively affecting another DORA metric, which is like how many bugs and incidents, et cetera, you're having.

We look at it as a combination of how much time does an engineer spend on features versus bugs, incidents, et cetera. We also measure like number of regressions per change, on average how many regressions per change, stuff like that. And then lastly, SRE is responsible for helping drive mean time to, not just respond, but contain and recover from incidents.
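To make the metrics above concrete, here is a minimal Python sketch of how lead time, releases per day, and regressions per change might be computed. It is illustrative only: the `PullRequest` record, its fields, the function names, and the sample data are all assumptions made for this example, not Thirty Madison's actual tooling or data model.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    """Hypothetical record of one merged change."""
    opened_at: datetime
    merged_at: datetime
    regressions: int = 0  # regressions later traced back to this change


def median_lead_time_hours(prs):
    """Median hours between a PR being opened and merged to main."""
    return median(
        (pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs
    )


def releases_per_day(release_count, days):
    """Average number of releases shipped per day over a window."""
    return release_count / days


def regressions_per_change(prs):
    """Average number of regressions attributed to each change."""
    return sum(pr.regressions for pr in prs) / len(prs)


# Made-up sample data, for illustration only.
prs = [
    PullRequest(datetime(2024, 10, 1, 9), datetime(2024, 10, 1, 11), regressions=0),
    PullRequest(datetime(2024, 10, 2, 9), datetime(2024, 10, 2, 13), regressions=1),
    PullRequest(datetime(2024, 10, 3, 9), datetime(2024, 10, 3, 15), regressions=2),
]

print(median_lead_time_hours(prs))
print(releases_per_day(release_count=6, days=3))
print(regressions_per_change(prs))
```

Reporting the three numbers side by side reflects the point made above: release frequency is only healthy as long as regressions per change are not climbing with it.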

So when I think about each of those, not one of those translates directly to a CFO, right? But that's actually part of the challenge, but also kind of the fun. In platform engineering, you're not the product engineering team. It's not like, "I landed these 10 features and each one of those features drove this much revenue and blah, blah, blah." Yeah, that's the measure of a product engineering team - did they land the features with product and business that make the difference? Do those features actually have the business impact? For my team, my job is to explain like, "You know how you have all these features that all these engineers are landing? Well, these metrics translate to, you can do more of them." And I know that the business has a backlog of things they want to do that we can't get to fast enough right now or as fast as they'd like to. So it's actually meaningful to go faster at this point. We want to be faster and more efficient.

And similarly on the regression side, we measure impact per incident per SEV. So I can say if we're dropping the number of SEVs we have, or speeding up the response time to those SEVs and containment of those SEVs, then SEV 2s become SEV 3s. And the average SEV 2 costs this, the average SEV 3 costs this. And so for finance, this is dollar impact in the form of we make it possible for the engineering teams to respond this quickly.
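The arithmetic behind that argument can be sketched in a few lines of Python. Every figure here is invented for illustration (the average costs per severity and the incident counts are assumptions, not numbers from the episode), and `incident_cost` is a hypothetical helper name.

```python
def incident_cost(counts, avg_cost):
    """Total cost: incidents at each severity times that severity's average cost."""
    return sum(n * avg_cost[sev] for sev, n in counts.items())


# Hypothetical average dollar cost per incident, by severity.
AVG_COST = {"SEV2": 50_000, "SEV3": 5_000}

# Hypothetical quarter-over-quarter counts: faster containment means
# three incidents that would have been SEV 2s are held to SEV 3s.
before = {"SEV2": 4, "SEV3": 6}
after = {"SEV2": 1, "SEV3": 9}

savings = incident_cost(before, AVG_COST) - incident_cost(after, AVG_COST)
print(savings)
```

The total number of incidents is the same in both periods; the savings come entirely from the shift in severity, which is the kind of figure a finance team can reason about.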

I think shared services is interesting because it's a weird hybridization between what I would want of a DX team and what I want of a product engineering team. Because they're effectively building and running services. I look at them similarly. I try to think about their metrics in terms of two kinds of things.

For all the services they've already built, landed in prod, et cetera - those should have specific, like, is this working well? Is this not? Are the customers happy with the thing? Whether you use a CSAT type view of that or whether you use NPS for that with your developer community, whatever. The point is they should be happy with it. Product should be happy with it if you're landing some feature that has to support product's business decisions in the product.

The other thing I need from them is like - it's not enough that people are just happy or that the technical metrics are solid on the thing - usage. Is it actually being used? Is it not being consumed? Are people working around it?

Then data engineering to me, at least at Thirty Madison (because that term could mean lots of different things to different places but it's part of our infrastructure function), is about the fact that the right data is in the right place at the right time. Which translates into like measurements of correctness of information, measurements of speed. Like how fast do we update the data warehouse? How fast do we ship whether log or event data from place to place? Is that in line with what is necessary? And you can set thresholds on what is needed by the business teams or by the analytics teams or whatever else to support that.

And then on IT, lots of health metrics about different systems and stuff that we can measure. Those are great but IT, I think... Like if the people respond, if they get their tickets served and they're happy with that, that is the measure of whether a help desk is doing their job well or not. When I think about the systems engineering part of IT, then I think we're in a really interesting world where we're - and if you have advice about this dude, I would love it Cory - I'm trying to figure out how better to measure the systems engineering work of "we save teams time by automating their work for them", because to me that is what a truly great IT team does.

Like if you're talking to HR or finance or maybe marketing or others in the organization, chances are that engineering is not building product for them. In our company, engineering is not building product for them, they're building product for our doctors and our patients. And so they need that support. Sometimes you're going to get some really forward thinking leaders in those teams that are really technology aware and coming to IT and telling them what we need to do differently and where they could use automation support. But often it's not.

Truly great IT teams are going to those teams, helping them see around their own blind spots with regards to technology, and then automating things that are currently being done in a not automated... maybe automation is too strong, they are using technology to reduce toil. And I haven't found a great metric measure of that thing yet, but I have lots of leaders sitting across from IT teams saying like, "My God, thank you for fixing this thing for us." or, "My God, thank you for telling us that we can connect these systems and I never have to manually transpose data from one to the other again", or whatever else that IT goes and solves with those people.

But if you have thoughts on how to measure that in a consistent way, that is one thing I am continuing to try to figure out the right way to measure it.

Cory: 48:30

Yeah, that one feels like a hard one. That's another one where it's like, man, it's just like the joy of that team. There are so many IT teams, despite the hard work that IT folks do - and I think depending on what country you're in, IT might mean different things to you - but in US IT where it's like the help desk, computer wiring, buying services, installing Active Directory folk. Like that's a hard job.

It's a job that generally doesn't get a ton of positive sentiment - this is my background, this was my world before I got into software. Very similar to platform engineers I almost feel like, you don't get a ton of positive sentiment about what you do. People don't know you exist until shit's broken.

Justin: 49:11

That's right.

Cory: 49:12

And then they're like, "Ahh, why is this broken?" You're like, "It wasn't broken for 364 days and 19 minutes. It's a second, chill out. I've done an ace job all year long."

It gets interesting because I think I've seen IT people also do very good jobs and it results in very unhappy people because you'll see, potentially, layoffs. I actually saw this. So my background is in healthcare - originally a HIPAA security analyst, thus the excitement. But I was on the analyst side of the world, so not writing the software way back then. But we had this system that like reported billing to one of our insurance folks. And the way that the entire team was doing this was somebody... this was two thousand, maybe nineteen ninety-eight, might've been pre Y2K, a long time ago... but people were printing just hundreds of pages of spreadsheets and then taking them down on a cart to a data entry team that would just type in stuff into a different spreadsheet.

Justin: 50:15

Oh my god.

Cory: 50:16

Sorry, like a TTY thing, like some goofy old terminal system thing that you still kind of see at your insurance agent's office. And so we actually spent some time, figured out that there were ways to interact with the system programmatically. They didn't have an API, but we kind of created one for them. And then that was automated. Literally, the people that were printing the spreadsheets just dropped something in a folder, and it did all the data entry for them. And that team was very happy. But the entire data entry team got let go.

Justin: 50:46

Right.

Cory: 50:47

So that's like, it's one of those ones that's very hard to measure. Like the business saw the financial impact of it, but nobody intended to do that. We were like, isn't there other data to enter and it's like, nope, this team's gone. It's like, whoops, that was an accident.

Justin: 51:02

I think it's also a sucky reality of the way that certain companies are culturally, right? You could have the same exact effect in another organization and they may be like, can we retrain these data entry people into something else that's useful for the organization? Because they've been great for us in these roles, but like, we don't need them to do this job. Maybe there's some other job they could do here.

I don't need to pry into the specifics of what happened in your case, but I like to hope that even in cases where we reduce the need for a particular role or particular job to be done within an organization from IT, that we use that as an opportunity to enable people to work on either new things or to do better work on some set of things.

That's why empower, by the way, instead of automate or replace, is a mission word for Thirty Madison platform engineering, because to me it is about empowering the individuals to focus their time and energy on what matters and what is the best use. But I hear you, for sure, about the fact that sometimes you do your best work helping to solve something for the company and that results in a layoff or at least like severing for specific teams.

I think for me, it doesn't change the reality that the thing that I want IT to be focusing on is, how do they help solve problems? But it definitely could feel chilling to the individual IT member if they felt like their work resulted in the layoff, you know?

Cory: 52:42

Yeah, yeah, for sure. Well, I don't like to end a show on a bummer. Sorry for taking that to a dark space, everyone.

So I would love to know, just with your background - you're working in platform engineering, working in security, you also have this background as an angel investor - I would love to know, is there anything in the space right now that you've seen that maybe you've invested in (or not) that you're like, this team is awesome. They're doing cool stuff and people have got to check it out.

Justin: 53:09

Oh man, I feel like I'm supposed to... I should just list off my investments, is that what you're asking me to do?

Cory: 53:14

[Laughs] Well, except for that one company, you know what I'm talking about? I'm just kidding.

Justin: 53:18

Two things. One, I work with a lot of VCs and so I just get a lot of exposure to a lot of different spaces, but my angel investing is primarily centered around security things. The last angel I did in security exited after nine months and was a great choice to do the angel because it was great ROI on the investment, but that is not the common case. Here's what I'd say though, I think that team, Gem, is stellar. Their acquisition by Wiz is a great decision by Wiz and is going to help Wiz build an incredible detection and response portfolio to go alongside all the prevention work they do for cloud stuff.

In terms of spaces that I think are really interesting right now, I'm paying a lot of attention to... in fact, maybe I'll use this as an opportunity. If someone wants to pitch me on their startup, this is what I am really interested in right now. There are a ton of places where there is a ton of grunt work in security and SRE that I don't think needs to be this much grunt work. I've seen security teams try to adopt lots of automation strategies in incident response, for example. I do not see the same level of effort in SRE, particularly like, there's tools we have at our disposal, like Datadog and Splunk (if you want to punish yourself).

Cory: 54:48

[Laughs] This episode brought to you by Splunk.

I'm sorry. I mean, I'll take your money. But they're not, maybe they'll come in on the next episode now.

Justin: 55:04

We have these tools, but what if I wanted to - instead of like just triggering an alert and then a page, and then having someone have to jump on and investigate - hand someone a package of data that was like, "Hey, based on the last 50 incidents that you've had, you wanted this information available to you quickly. Here it is. Like here's all the logical pivots that an engineer would do as part of the investigation."

Or maybe I'm just dumb and can't find the damn things that would do that, but I have not seen that kind of like orchestration automation approach within reliability incidents or SEV type of incidents. Security teams have already adopted that kind of stuff and I'm particularly excited about seeing SRE do the same.

And right now it's all getting wrapped up into the kind of like AI world, because everything is. I think if you start a startup right now and you don't have an AI story, no one's going to fund you right now. Which is a little bit tongue in cheek for me because sometimes people say AI and what they really mean is a lot of If... Then... Else statements.

I think that automation and AI are becoming synonymous at the moment. I don't see the SRE side or reliability incident management side getting that same focus of automating chunks of the job that really don't need a human. Like, I would love to, every time there's a real SEV, have some tool that pulls together all the data into a useful analysis package so that when the on-call engineer responds, they're not like, "Oh, I've got to remember how to run that search and this search, and then I got to go pivot to this other thing and stuff." Instead, they're just handed like, here's all this information that you probably need to be able to diagnose this thing.
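A rough Python sketch of that idea follows: when an alert fires, run a set of pre-registered investigation "pivots" and hand the on-call engineer the combined results. Nothing here is a real product or vendor API. `build_context_package`, the pivot functions, and the alert fields are hypothetical names invented for this example, and the pivots return canned data where a real tool would query logs, deploy history, and so on.

```python
from typing import Any, Callable, Dict

# A pivot is any function that takes the alert and returns some finding.
Pivot = Callable[[Dict[str, Any]], Any]


def build_context_package(alert: Dict[str, Any], pivots: Dict[str, Pivot]) -> Dict[str, Any]:
    """Run every registered pivot against the alert and bundle the results."""
    package: Dict[str, Any] = {"alert": alert, "findings": {}, "errors": {}}
    for name, pivot in pivots.items():
        try:
            package["findings"][name] = pivot(alert)
        except Exception as exc:
            # One failed lookup should not block the rest of the package.
            package["errors"][name] = str(exc)
    return package


# Hypothetical pivots with canned results, standing in for real queries.
def recent_deploys(alert: Dict[str, Any]) -> list:
    return [f"{alert['service']}: deploy at 01:52"]


def error_log_search(alert: Dict[str, Any]) -> list:
    raise TimeoutError("log backend unavailable")


alert = {"service": "checkout", "severity": "SEV2"}
package = build_context_package(
    alert,
    {"recent_deploys": recent_deploys, "error_logs": error_log_search},
)
print(package["findings"])
print(package["errors"])
```

Recording failures alongside findings is a deliberate choice in this sketch: the engineer being paged sees which lookups did not run rather than getting a silently incomplete package.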

If I really want to make a difference in the way that SEVs are handled right now, I think that is the next era of change. If you have people out there who are listening to this and work at a company and you think you do that thing, like find me on LinkedIn because I want to hear about it.

Cory: 57:14

There you go, that's the segue. Where can people find you online?

Justin: 57:19

LinkedIn. I think I still have an X profile, but I don't know if I've logged into it in too long. I'm not anti-X, but I just find myself unable to keep up with Twitter. I'm much more just like...LinkedIn is easy for me. Yeah, that would be my thing.

Cory: 57:41

Awesome. Yeah, I think that would actually be a pretty interesting story too.

I feel like SRE, like you are paged in the middle of the night. It's already not great, something's gone wrong, but it seems like stuff always breaks like 2:04 am for whatever reason. And then it's just like, okay, what's changed? That's the old first question. Like, what's changed is complicated! That's a complicated question to answer between like my deployments are GitHub, built an image, pushed it to ECR, Argo deployed it, somebody made some changes in Terraform over here, like somebody made some changes to a shared infrastructure... Like there's a lot of context you have to gather pretty quickly under the pressure of something being broken, generally at the most inconvenient time.

Justin: 58:28

And when you might not be thinking as clearly, because you just woke out of a sound sleep at two in the morning to get up to your phone like going off.

Cory: 58:37

Yeah, you were either soundly asleep or you were just rocking catatonically in the corner from the night before. [Cory and Justin laugh]

Oh man, well Justin, I really appreciate the time today. Thanks for coming on the show. It was a great conversation.

Listeners, if you haven't had a chance to check us out on Spotify or your favorite podcast network, please like, subscribe, and recommend to your friends. Also check out the most recent episodes. We had a great series on the Foundations of the Cloud, talking to the people that built the tools that most of us use today, like Terraform, Kubernetes, and Chef. And then the most recent episode was with Adrian Cockcroft talking about how Netflix has grown their cloud presence and influenced the roadmap of AWS.

Thanks so much again for the time, Justin. It was great to meet you, and have a great day.

Episode 17

16th Oct 2024
