Breaking Down Healthcare Delivery Barriers with Joel Vasallo

Episode Description

Feeling overwhelmed by the number of apps you need to manage while building developer trust, managing costs, and trying to create an extensible platform that teams actually want to use?

Joel Vasallo shares practical insights from scaling TAG's platform engineering initiatives across multiple healthcare organizations. Learn how his team transformed deployment times from weeks to minutes while maintaining security and compliance. Joel breaks down the journey from initial Kubernetes adoption to managing 70+ applications. 

Listeners will gain actionable strategies for:

- Starting small with platform initiatives and building organic buy-in

- Balancing standardization with team autonomy

- Managing cloud costs across multiple organizations

- Building trust through visibility and auditability

Whether you're in healthcare or any regulated industry, this conversation provides a practical roadmap for evolving your platform engineering practice.

Episode Transcript

Hey everybody, welcome back to the platform engineering podcast. I'm your host, Cory O'Daniel, and today I have with me Joel Vasallo. So funny thing is, he's the Senior Director of Platform Engineering at TAG, but we've had to reschedule this call so many times that not only has his job changed since we originally met, but we've actually met in real life. We were at KubeCon and we ran into each other recently. 

But before we do that, let's just let Joel introduce himself and then we'll talk about our serendipitous moment.

Background and Meeting at KubeCon

Thanks for having me, Cory. So yeah, my name is Joel Vasallo, I'm Senior Director of Platform Engineering at TAG - it stands for the Aspen Group. My day-to-day responsibilities are overseeing the SRE team, having a cloud engineering team, as well as a delivery engineering team. So basically, all things that go into modern cloud-native landscape go through the team here. 

A lot of my background really was just building positive DevOps cultures and DevOps practices. I had tenure at Project 44. Redbox before that, building their streaming platform. Then before that even working at inflight Wi-Fi companies doing tin can Wi-Fi stuff. So that's kind of cool. 

Same story, different companies - it's all about empowering your engineers to do great things and truly building a good experience from the start. Tools are kind of secondary in that sense. We didn't have Argo ten years ago, but we still figured out a way to do DevOps, right? It’s really about a lot of investment in a lot of stuff.

What I've seen, between people that I've met on the show and just throughout my career, it's very easy to grab tools - there's so many of them. The hardest part of DevOps through and through across all the organizations I've met is actually building that culture. 

That was one of the things that was really exciting when I started looking into all the different healthcare companies that I wanted to talk to throughout this healthcare series that I've been working on. That's how I found you originally, but we actually met. 

A very weird moment. So at OpenTofu Day at KubeCon, I was stressed out. We were one of the sponsors. I believe I was hiding behind a garbage can or underneath the table. I can't remember where, but I was like…

You were doing something, you were trying to get a display or something to work, I think.

I was trying to get away from the booth for a second, because we didn't have time to set up and we got swarmed. And so I was just like, I need 20 minutes to myself. I duck and then I… like we hadn't met in person. The only thing I'd ever seen was your face with that giant sword, which I also want to hear about… all of a sudden I stand up and I see your face and I'm like, “I know him from somewhere.” I look at your tag, I see your name but then I see TAG instead of The Aspen Group. And I'm like, “Maybe I don't know him.” And I was like, “Joel?” And lo and behold you're at OpenTofu Day.

What are you doing here?

Yeah, what are you doing here? And then you were also one of the first people on Trash Ops, which was also very fun.

Yeah, I wasn't ready for that by the way. It was awesome. The fact that you somehow got Kelsey to be on that too was crazy. So that's awesome.

Oh my gosh, that was pretty funny. It's so funny, I feel like when I was talking to people throughout KubeCon doing Trash Ops, I'd meet somebody and I'm like, “This person's going to have like some really funny stuff.” And then they were like Angel Ops, like they just nailed it - like, “No, I don't do that.” And then Kelsey's just like, “Oh yeah, I did. I did everything wrong just to ship stuff. Like I'm good with it.” It was so funny talking to him about that. 

KubeCon Experience and Community

Well, let's talk a little bit about KubeCon before we hop into all things platform engineering. You're the first person I've gotten to chat with like post KubeCon. How was your KubeCon experience?

It was better than anticipated. Obviously last year was here in Chicago, so I was like it's going to be hard to top that because Chicago is my home, that’s where I live. But honestly, it was an awesome experience. 

This year, I think it was finally like… I know we've got a lot of like AI stuff and stuff… but it felt like the first year where I felt like I can actually help somebody with this problem. It wasn't LLMs and the type in a box and an image pops out, it was more like actually figuring out how to architect and build and support data teams and data and ML stuff. So that was good to see. 

On top of all the other stuff, Istio was awesome. Solo and like that whole company is like huge now. 

Oh yeah.

It's bigger than it was, with the gateways and stuff like that too - Gloo gateway. And then Prometheus too. So I don't know. I felt like this year I finally got to tickle a lot of my parts of the brain where it's like engineering focused, which was awesome to see.

So you were on the official track or the hall track?

I tried to do a little bit of both. I would try and dedicate my mornings (depending on when I got there) to go to a few talks, but then the afternoons were just like how do I kind of network with people, see where I'm going to be that night, talk with people. 

That's the second part of KubeCon that people really don't talk about. It's like, “Hey, this conversation’s good, you want to go grab dinner?” “Hey are you going to that event later?” It's like, “No, no I'm going to this event.” “Cool. I’ll go to that event too.”

You can just openly talk about like a day in the life of… you get so many people from companies like Nvidia, Apple, Google, Netflix. All these people are just there from the event and you kind of just have a place to talk shop. It's awesome, that's the reason I love KubeCon personally.

I only got to see one session. This is the burden of being a tech person that's working the sales booth at a conference: you have all these hopes and dreams of all the sessions you're going to go to and then people come to your booth and you're like, “I guess I'm not going to get to go.” 

So I'm always doing the hallway track. I'm just running into people in the aisle way talking. 

Talking on the tables.

Yeah, but the wild thing is you meet people from Apple and Nvidia and you kind of look at these huge companies…  you're just like, my gosh, they're selling so many GPUs, building nuclear reactors, and Apple's doing all great things… and then you talk to the engineers and they have the exact same like DevOps troubles that any of us do, right? 

Which is one of those things that's exciting to me. For the amount of debt many organizations face, it's still there and there's ways to work around it. You hear about these companies doing these great initiatives and efforts in platform engineering to scale their ops, but still having some of the same problems that we have. That's the stuff that I love - when you get that like nitty-gritty.

Well, it’s real, right? It's not like polished. You get the real perspective of like, “Oh man, every time I hear XYZ company it just seems like perfection,” but then you talk and like, “You know, they're just like me. They're at a bigger scale, but they're kind of struggling with some of the same things I am, or they're excelling in some of the things that I wish I could excel at.” I'm learning a perspective that I didn't know. 

Yeah. You'll meet somebody and just assume that they're doing all this magical stuff and they're like, “No, we've got this enormous Jenkins cluster that we hate.” It's just like, so you're “One of us. One of us.”

“One of us.” [laughing]

Oh my gosh. Yeah, the only session I got to see, it was great…  I can't remember the organization that was behind it, but it was one of the Karpenter sessions about using AI and Karpenter to optimize your worker pools in Kubernetes. One of my buddies, Josh Seifer, was giving part of that talk so I was like, I’ve got to make it over to that one just to see him. And then I was sitting there listening and I'm like, “Damn, we got to invest in learning a bit about this and helping some of our customers with it.” So it's very exciting.

Yeah. I mean, Karpenter as a whole… I remember it was at first a very AWS centric thing, but now I think it's kind of shifting to just Kubernetes. I mean, I think it was always a Kubernetes native thing, but really like that infrastructure as you need it kind of thing… you know, when people say it's expensive… I mean, in many ways with Kubernetes, you can actually kind of like right size your infrastructure, right? It gives you like the best way of kind of putting things where you want. Obviously seeing anything that can help manage this stuff is huge. I just love seeing that community grow personally.

I was excited to see it come out of AWS and get adopted elsewhere. It was one of those tools that when I first saw it, I was like, this is cool. Great for AWS customers. How do we do something equivalent elsewhere? And being able to see it get adopted across all the major hyperscalers was exciting.

Definitely something to check out if you all aren't familiar with it. 

Healthcare and Starting the Platform Engineering Journey

One of the reasons I wanted to talk to you is my background is healthcare. So I was originally a HIPAA security analyst decades ago, and I know the pains of getting good budget in operations and DevOps and, you know, being tied to legacy infrastructure. 

And so when I first started working on, you know, the series of like how people got to platform engineering, I wanted to kind of go back to my roots and talk to a bunch of people that are in healthcare companies that are seeing success building out platform teams. So that was one of the things that was really attractive about reaching out to you. 

I'd love to learn just a bit about the Aspen Group, how you got your platform initiatives started there. And would love to just kind of talk about some of the pains that you felt along the way and maybe some of the tips that you have for navigating tight budgets and legacy systems.

Yeah, gosh, now I'm going to have to rewind the clock back to three years ago. I've been here three years, almost three and a half years now. 

That's legacy experience right there, fellas.

I know! It feels like I just joined, but at the same time, it's kind of crazy because you learn so much. 

I mean, the first thing I'll say is within healthcare, the interesting thing is it's a lot faster paced than I thought it would be. I've worked in some regulated industries, but like, it's the pace of innovation that was always kind of like the interesting kind of component. When I first joined, I was like healthcare is going to be kind of annual releases. Certainly you can get there, but I think really it's what you build and how you like build a secure architecture. 

You also have to remember that the reason things are kind of slowed down or things go in a slow way isn't because like, you know, you have to do quarterly, monthly, annual releases. There's no like law that says you have to do that. It's really like, what do you feel comfortable delivering. In the sense that you're delivering an initiative for your customers, whether that be your internal customers, external customers, developers, corporate customers. So that was kind of the first thing that I kind of had to overcome. 

Once I overcame that, then I said, “Oh cool. It's just another kind of DevOps gig.” You know, kind of figuring out the world of platform, figuring out where tools can fit, and figure out how to get people to buy into this stuff. 

My mantra is always start with the smallest component where you can kind of show an initiative and you don't also cause a major company outage. Because the worst thing is you start this initiative, you move a major component… let's say your x-ray processing system and all of a sudden you're in support mode day one. It isn't to say you're not working, but at the same time you also have to like start and prove. You have to recognize that everyone isn’t at that vision. So you have to almost kind of sell that vision out a little bit. 

You’ve got to make your stamp on the company in the right way.

Yeah, you definitely don't want to start with an outage.

So this is the new guy, right?

Why did we hire him? He's caused this million dollar outage. 

I still remember like the aha moment I'll say is like, you know, we were building these pipelines out and I saw the best way to make a footprint (and you know people listening to this podcast are going to be like, “Oh what, he did this with Kubernetes?”), I said, “Hey, let's just move our web servers and our API gateways into Kubernetes. Everything else is the same. We're not moving off VMs. We're not getting off Windows. We're not doing any of that stuff. Let's just move like some of the static sites over and some of our API gateways.”

Little did I know that this unlocked basically a pattern to start deploying things a lot faster. The team before was already putting a lot of investment moving legacy stuff off of like Windows to like .NET and then ultimately onto Linux. But then having Kubernetes as kind of that next step was awesome because it went from it takes two to three weeks to get something out… And I remember there was a meeting where we were talking about, “Hey, we just acquired a brand (it was AZPetVet, it's one of our vet brands), How long is it going to take to get the logo on the corporate home page?” Within the end of that meeting, an engineer goes and says, “It's done.” And it's like, “What do you mean it's done?” “It's live.”  I remember people were like, “That took two to three weeks before. How did you do that?” “I just used the pipeline, the platform.” 

Having the engineer say that and not me was kind of the light bulb moment where people were like, “This is what it means to do DevOps or platform work.” And I think that was awesome to see. I got so excited and it built a path.

I mean, that's like the appetite, right? I feel like there's two levels of buy-in. There's that top-down buy-in, which sometimes comes across as a mandate, and that's hard. When you get a mandate, it's like, okay, somebody's made a promise to the org, I got to figure out how to do it. That buy-in is sometimes just verbal. It's like, I heard a word, I read it in CIO magazine or whatever. We're doing platform engineering now too, and that leaves the team to figure out how to pull it off, right?

But this bottom up - where you just do that MVP no matter how simple it is, but an engineer sees that velocity change and now they're hungry and they're talking about it. Like that's the marketing that I feel like a lot of organizations miss. They missed it in DevOps. I feel like what a lot of them are missing in platform engineering is building that excitement around a product that then other people go and champion for you.

It says a lot.

Yeah.

There's a lot more people that could go to meetings and present things than just me or my team, right?

I always judge an organization's health on a journey when the conversation is not, “Hey, when is that change going out? What is this?” It's more about the impact of that change. 

You know, the methods of delivery, that's the boring part. Like who cares? I mean, we certainly care about when changes are going out, but not in the sense to say, “Hey, we don't trust our base infrastructure enough to say we can deploy whenever we want or whenever we feel is the right call.” Maybe that's a better way of wording that. 

I think when the conversation shifts to being about the functionality and the improvements versus the minutiae of getting something out in the world, you know you're starting to mature. My lens is like, “All right, cool. People are now talking about API changes, contracts, things that a developer cares about versus is this going to be on the third week release to get this out.” It's kind of just a built-in thing and that's really the platform you’ve got to build.

Yeah, and I think for many orgs… like there's numbers, we all have numbers and metrics and whatnot… but I think that when you see people just talk about the efficiency, and it leaves the engineering team, right? Like that is a place that's hard for operations DevOps platform engineers to get recognition. 

I can't come along and be like, “Hey business side, we deployed Kubernetes. What do you think about that?” And they're like, “I'm not Greek. I don't even know what that means.” But when the PM is like, “Yo, my team's moving a lot faster than your team because we're doing this.” Like that's organizational understanding, right? That is, I think, that glimpse of DevOps working - when other people can brag about it without necessarily being full of buzzwords and numbers that people don't necessarily understand.

Yeah, and I think to that point, the example I mentioned earlier about like launching a website… again, you know, people are like, “You're using Kubernetes for the web server. Like you're crazy. That's insane.” Like there's kind of a whole little subculture of people - “It's only meant for like high compute.”

At the end of the day, you know, if you right size it enough, and especially at an enterprise scale, it's all about the efficiencies you gain that are not like dollars on compute. It's the dollars on the time wasted. If it wasn't in this, it would be some multi spanning auto-scaling group of web servers. I mean I have to patch Windows servers versus containers, right? There's a whole lot of other things that people have to consider. It's not just like using that.

I think that example, it was great, but it was very much like… you know, I used the word DevOps earlier and I think DevOps still very much exists… but it wasn't a platform, right? Like having a pipeline in Spinnaker to click a button and deploy, great. But I think what was lacking on that is the platform. And platform for me is the experience that an engineer gets. The first example, you click a button, you deploy. One thing in, one thing out. 

I think the value of a platform is, “All right, now what guardrails do I have? Do I allow my engineers to declare the size of their cluster? Do I let my engineers declare the environment variables that they need?” That wasn't in V1. A lot of that stuff was kind of the hard-coded components of the pipeline. And some of those pipelines still exist. 

I'm not saying, like, here I am, perfect health care pipelines. But people have to realize there's a path from proof of concept to delivery to ultimately getting things out there into the world. Don't get lost on the perfection. One of my old managers says, it's not his quote - don't let perfect be the enemy of good.

Yeah. I mean, every single one of those things that you do, if you're increasing that developer velocity, that's great. But at the same time, if you're tied to some horrible release cadence, as the ops side of the house, and you have a lot of work to get that out, these little tiny things that create more space for you, that's a feedback loop that's great. Now you have more time that you didn't have last week, right? So now you can improve on that little improvement that you've made. 

That's where I think the seeds of a good platform come from. You start with something simple, but now you've got that organic buy-in, you’ve got that grassroots. You’ve got a little more time for you. You’ve got a little more time for Dev.

It's about expanding that culture too. I think a lot of people see it (and I know we briefly talked about this like at KubeCon), but a lot of people say like it's almost like they go to the CNCF landscape and they like go shopping. It's like, check, check. Check, check, check, check, check. 

And it's like, great, you have a whole list of tools, but what's your end user's experience? Are you just giving YAML to engineers and saying, “Hey, go figure out how to do auto-scaling inside of Kubernetes?” Or “Hey, go figure out how to do OPA policies and define them in some sort of arbitrary repo?”

The goal, at least for me, has always been architect that away… and it's actually something that Kelsey mentioned in one of his early talks… architect things away, not because the engineer doesn't understand it, but because they shouldn't need to.

I'm the biggest fan and everything that we do is an open core model. If an engineer wants to give me free work and say, “I think if you make this a variable, I can expand upon that.” Great, so make me a pull request, I'll still review it. We have open source internally in the sense that it's not open to the public, but open within our nature. 

If people want to contribute a fix or review how things are done by all means, but that's not the expectation. I don't want someone's boss to say, hey, this person is spending three days of their week figuring out YAML files. No, that is not the goal of anything that we're trying to do.

Yeah, I feel like this is one of the things that is really hard too, because we're all hobbyists. It's like, okay, let me put down this work keyboard and pick up my other loud keyboard to start working on my fun side thing, right? We always want to learn stuff. So you can take an engineer away from the product that they're working on and have them learn as much Ops stuff… I'm sure they want to, some of them do, some of them definitely don't… but like what is the best service for the business, right?

The reality is… I feel like I've seen this at a lot of larger orgs, that they'll be a little bit afraid of taking things away from engineers. Like, well, if we take that option away, what if they need more IOPS? And it's like, well, a lot of people use a PaaS and they don't even have that functionality exposed to them and they're fine, right? So it's all about finding that right level of abstraction, like you said, for your team.

If somebody comes along and they're like, “Hey, something's not exposed that I need.” that right there, that to me is like the perfect essence of DevOps collaboration. That conversation, that PR that comes in, that person's learning about that system, not having that system and the maintenance of that system forced on them.

They're opting to come out and like try to help with it, right? And I think that's even like a good way to find the engineers that are going to be a part of your platform team, because a lot of times you're starting this initiative from the ops side - I need developers to work on this thing to start building more automation around it. 

Taking the Next Step

So, going from that grassroots to that executive level sponsorship, that can be quite a leap. How did you start navigating that path? You've got people that are excited, it’s starting to scale out across the engineering org. How do you take that next step to, we're actually going to have a platform team and we're going to go from not just working at the Aspen Group to what you're doing now, which is building a platform across your entire portfolio of products.

Yeah, I mean, I think the biggest thing from there is when you can sell the repeatability of what you deliver, because by doing so you make what you've built so common and well understood. The first few weeks of kind of learning a new thing kind of are, you know, problematic. But after they kind of learn it and they kind of get on it, that engineer, that person could in theory go look at something else on another brand. Or another resource can go work on another product. You're basically kind of leveling up everyone along with it too.

I mean, for me it's simple, it's kind of coming up with a contract. I mean, even before my time here at TAG, it would all be about building a contract. Like hey you have your local environment, I'm certainly not going to police what IDE you can use, I'm not going to police what tools you can install, but at the end of the day I want the contract to be simple. If you could build a container and it could run, I could deploy it. 

So then from there it goes, “Yeah, cool. I built a container.” But then what is that path to delivery? It's kind of like that artifact-minded promotion process. And it's not just code. We're talking about feature changes, things within feature flags, and even infrastructure as code. What's that method of artifact bundle to kind of get that progression and then visualize it. Then you can layer in all your processes because that's really what gets that buy in.

When you start saying, “Hey security team, you know how you said we should be doing this every time we deploy, but you know, it's kind of a manual script. Well, guess what? We can enforce it and maybe we don't block things, but I guess at the very least I can guarantee you anytime someone deploys, you're going to get this.” 
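One way to picture that "enforce it but don't block" idea is a deploy step that always runs the check and always records the result. This is only a minimal sketch: `deploy_with_guardrail`, `fake_scan`, and the artifact name are hypothetical stand-ins, not anything TAG's pipeline is known to use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DeployResult:
    deployed: bool
    findings: List[str] = field(default_factory=list)


def deploy_with_guardrail(
    artifact: str,
    scan: Callable[[str], List[str]],
    deploy: Callable[[str], None],
) -> DeployResult:
    """Always run the scan and keep its findings, but never block the deploy."""
    findings = scan(artifact)  # hypothetical security check, runs on every deploy
    deploy(artifact)           # guardrail, not a gate: the deploy proceeds regardless
    return DeployResult(deployed=True, findings=findings)


# Hypothetical stand-in for a real scanner.
def fake_scan(artifact: str) -> List[str]:
    return ["base image is 2 releases behind"] if "old" in artifact else []


deployed_artifacts: List[str] = []
result = deploy_with_guardrail("web:old-1.4", fake_scan, deployed_artifacts.append)
```

The security team gets `result.findings` for every deploy, which is the guarantee described above, while the decision to block stays a separate policy choice.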

For me, one of the biggest wins is… you know, it's a tale as old as time… DevOps is seen as kind of the antithesis of change control and stuff like, “They're going fast. There's going to be no way of tracking what they do.” No, if you do it right, you should audit everything.

Anything that occurs within a pipeline - I can go to a change engineer (who is really more on an operational side) saying, I can tell you down to the millisecond that that change occurred in production. And that way when you get that report or you're looking at your monitors and your dashboards, when you see that that time correlates, you should be able to say, “Oh, it was that change that occurred at this time.” And because I have such a great audit trail, I can actually track what was that change and what went into it down to the Git commit. So that's awesome. 

Especially when you see some outages and you're on there like, “What was changed?” I don't know, let's take this - this is where the error's coming from. Take that git commit. Go look in the pipeline. Go look in the git. And you're going to say, “Oh yeah, of course, this API path changed.” That's extremely powerful.
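The correlation described here, from a time on a dashboard back to a Git commit, can be sketched roughly as follows. The `DeployEvent` record, the sample audit log, and the 30-minute window are all assumptions for illustration; a real pipeline would read these events from its own audit store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class DeployEvent:
    app: str
    git_commit: str
    deployed_at: datetime


def last_deploy_before(
    events: List[DeployEvent], incident_at: datetime, window: timedelta
) -> Optional[DeployEvent]:
    """Return the most recent deploy within `window` before the incident, if any."""
    candidates = [
        e for e in events
        if incident_at - window <= e.deployed_at <= incident_at
    ]
    return max(candidates, key=lambda e: e.deployed_at, default=None)


# Hypothetical audit trail entries.
audit_log = [
    DeployEvent("api-gateway", "a1b2c3d", datetime(2024, 1, 10, 14, 2, 11)),
    DeployEvent("web", "e4f5a6b", datetime(2024, 1, 10, 14, 40, 3)),
    DeployEvent("api-gateway", "9c8d7e6", datetime(2024, 1, 10, 16, 5, 0)),
]

# Errors start showing up on the dashboard at 14:45.
suspect = last_deploy_before(
    audit_log, datetime(2024, 1, 10, 14, 45, 0), timedelta(minutes=30)
)
```

Here `suspect` is the `web` deploy at 14:40:03, and its `git_commit` is the place to start looking.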

So I think like to that, selling it up to leadership of being standardized, having these kind of common ways, and then bringing everyone else along. If you do it right the pipeline is for everybody not just for the DevOps team, not just for the platform team, not just for the developer. But you build in methods for these SMEs in these key areas like quality, security, app, architecture, the networking, to have their say on how to properly guardrail the pipeline. Again, not a gate, I'm not saying yes or no, I'm saying a guardrail where things can kind of be expanded upon. 

If you do that the right way, you build like this culture of people communicating and talking versus kind of like, “No, you can't go to prod if you click that button.” 

Yeah.

That's kind of the big thing for me.

What I loved about that, going back to that Kubernetes to launch a logo example is, they're in a meeting, this person's added the logo, and it's a logo, but in an org where it might have taken three weeks before or a quarter just because of the pipeline, that's a big deal. People can see that. 

But what's wilder about that story that you just told me is the number of organizations I know that… even if they had that pipeline, even if they had it going to Kubernetes… that they would have had to have stopped, gotten somebody else to approve it and possibly somebody else. Like they get caught in that gate hell of just like gate after gate after gate, right? 

And that auditability, that trust that you start to build up on your team can start to replace those gates. How do you get to a point in an org where you, like… besides just the audit trails… how do you get to the point where you have that trust in these teams? That these are their production systems and they own them and they should be able to ship to them when it warrants being shipped to without kind of falling through this series of approvals.

Not to say like we deploy any time we want. It's almost like the mindset of like, if I know there's a problem and I know the fix, do I really have to get approval to fix it? 

If the building's on fire and I know that by me using this fire extinguisher… does that really mean like I have to, “Hey, can I use this fire extinguisher to put out the kitchen fire?” No, put out the kitchen fire and then ask why that kitchen fire happened there. 

It's not to say you circumvent process. Like there's still RCA's. There's still reasons why things break. But at the end of the day, my commentary is usually… very rarely, more rarely… I don't think anyone wakes up in the morning saying, “I can't wait to take down production today.” I don't think that's anyone's book. We all feel ashamed. We all are very protective. Like, we all have a shared interest in keeping a platform running. Otherwise you're on bridges all day, you're on outages. 

But how do you kind of have that like right balance of you can deploy whenever you want and you have these checks and balances? I think it is level setting. 

What are some base requirements to kind of get there? Some people are looking at things like IDPs to do scorecards, right? Hey, you can deploy whenever you want, but you have to have some things. 

Like for me, I feel certainly a lot better if the app that's deploying whenever they want has a lot of the baseline things. Do you have on call? Do you have proper logging set up? Are you using the logging framework and the most up-to-date version that we released internally? Do you have security scans like with SonarQube or Snyk or whatever?

It's a very visible number that you don't have to be technical to understand. It just says, “Hey, look, this app has a good scorecard and it looks like it's healthy. Let's reward that effort.” Because it's not easy to do that, right? 
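As a rough illustration of how baseline checks like these could roll up into a single visible number, here is a minimal sketch. The field names and the equal weighting are assumptions, not how any particular IDP computes its scorecards.

```python
from dataclasses import dataclass


@dataclass
class AppStatus:
    name: str
    has_on_call: bool
    has_logging: bool
    logging_lib_current: bool
    has_security_scan: bool


def scorecard(app: AppStatus) -> int:
    """Percentage of baseline checks the app passes, each weighted equally."""
    checks = [
        app.has_on_call,
        app.has_logging,
        app.logging_lib_current,
        app.has_security_scan,
    ]
    return round(100 * sum(checks) / len(checks))


# Hypothetical app: everything in place except an up-to-date logging library.
billing = AppStatus(
    name="billing-api",
    has_on_call=True,
    has_logging=True,
    logging_lib_current=False,
    has_security_scan=True,
)
score = scorecard(billing)  # 3 of 4 checks pass
```

A team could then tie "deploy whenever you want" to a minimum score, which keeps the rule readable for non-technical stakeholders.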

Yeah.

I mean, how many apps… I know for a fact, even ourselves, we have apps that probably aren't the most up-to-date version of Python. Maybe they're a rev back or two. But what's your base standard and how do you kind of level that stuff up? That to me is kind of how you sell the maturity aspect of it. 

If you just say, “Hey, we're deploying fast.” Great, but is deploying fast really the goal or is it deploying in a safe way fast? And I think that's really where I think a lot of people get lost. They see the numbers and they just want to see, “Hey, look, we deployed 20,000 times in a month.” Great. How many outages do we have? 

19,999. [laughing] 

Yeah, exactly. 

So we're going to move fast and not break things. 

Yeah, that's the goal. 

Sorry, Zuck.

Yeah. 

There was a brief point in time where people did wake up and think I'm breaking production today. And that was that era, the 2015 to 2018, where you had like the chaos engineer. Everybody's like, “We're doing chaos monkeys.” Everybody's hiring chaos engineers. And people are like, “We're actually not as high of a scorecard as… we're not quite ready for that, are we?” 

Would you like a chaos monkey in your infrastructure destroying servers randomly?

Just give a junior engineer root access to AWS. Let's see what happens.

I never want to knock that theory because you really need to have the resiliency in your infrastructure to understand.

Yeah.

I still remember when I was, I think, doing my first kind of Kubernetes gig at Redbox and me and my platform engineer at the time, we launched this like Space Invaders game. It was like something on Reddit or something, and every time you shot a thing it would destroy a pod. 

Oh no.

So we gave it to our CTO and he was like, “This is fun, what am I doing?” I was like, “You're blowing up prod right now.” He's like, “Delete it immediately.” [laughs]. At the time it was a canary and not really prod, but it was cool to see that like hey we can cause an outage in a region and we have such sophistication that we know we could fail over because we took that time to understand the affordances. 

We took the time to understand how our app performs -  look there's going to be an increased latency out of East, our database takes an extra hundred milliseconds to get to it. We understand that but we have like methods of understanding when the application is unhealthy. Or, in the famous words of “degraded state” - we know what that degraded really means. It's not just there's an outage and we don't know. We know what degraded means, or we try to know as much as we can.

That's one of those great places too where it's like that shows… the way that you talk about it, the way you present it to the org… whether you're leaning into really good RCA's and blameless or whatnot… being able to identify that stuff quickly and to not just say, “We broke it, we rolled it back, we fixed it quick.” 

That level of information to somebody outside the engineering org is scary. Like there's a lot of nothing there, right? But being able to come up and say, “Hey, we know exactly what it is down to the millisecond, it was this service. We changed that.” Like, “What are we going to do about it? How are we going to stop that from happening again?” It's like, “Well, shit happens. But the point is we're able to figure it out quickly and roll it back. And like, we have that level of visibility, and that's the power of the platform. That's why we need to invest in this team and what they're building, because we can know this.”

Mistakes will happen, but being able to find those mistakes and remediate them as quickly as possible is the name of the game.

And I think that… to the sake of the platform, we all in the industry have to be careful - and I mean this with all my intent, because I was a former DevOps engineer. We certainly don't want Platform Engineering to become what happened to DevOps. 

My hard take on it is DevOps became the label that people slapped on to keep people at a company and get their comp up. Certainly, let's get people's comp up. That's great. But the reality of the fact is it's not just a title. It's an initiative and a goal that we're trying to build.

That experience has to take into account all these things because then from there it's kind of like the next level of maturity, right? You're not just saying, “Hey, we're doing Jenkins or we're doing CircleCI and now we have automated click of buttons.” No, it's really an end-to-end experience. 

What makes people gravitate towards a Shopify or a GoDaddy? It's easy, it's simple. Now, I'm not saying that those are the most code friendly, I would say. They're very easy to get started with. But how do you kind of take that same concept but for the technical audience? And build it so I don't have to worry about that component. Logging recommendations? I already have it. Environment variables? Those are already kind of provided to me. I don't have to figure out how to know what region my code is in - the pipeline and the platform give that information to my app, and all I just have to figure out is how my app reacts to that information. So like declarative delivery and dynamic configuration, figuring out the functionality.
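
As a rough illustration of that "the platform gives the information, the app only decides how to react" idea, here is a minimal Python sketch. The variable names `PLATFORM_REGION` and `PLATFORM_ENV`, the region value, and the timeout numbers are hypothetical, chosen only for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RuntimeConfig:
    region: str
    environment: str
    log_level: str


def load_config(env: Optional[Mapping[str, str]] = None) -> RuntimeConfig:
    """Read values the pipeline is assumed to inject as environment variables."""
    env = os.environ if env is None else env
    return RuntimeConfig(
        region=env.get("PLATFORM_REGION", "unknown"),  # hypothetical variable name
        environment=env.get("PLATFORM_ENV", "dev"),    # hypothetical variable name
        log_level=env.get("LOG_LEVEL", "INFO"),
    )


def database_timeout_ms(config: RuntimeConfig) -> int:
    """The app only decides how to react, e.g. tolerating extra latency in one region."""
    return 500 if config.region == "us-east" else 250  # illustrative numbers


config = load_config({"PLATFORM_REGION": "us-east", "PLATFORM_ENV": "prod"})
print(config, database_timeout_ms(config))
```

The application code never works out where it is running; it reads what it was given and adjusts its behavior.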

For me, it's make the platform so appealing that if an engineer and an engineering team truly want to go off the beaten path, they better have a good reason for it. Because they're not going to get support, they're going to be out there on their own. But make it so compelling that you don't want to use any other solution.

Yeah, yeah, I love that. Yeah, like they've got the golden paths. If you want to go hiking, like go hiking, but watch out for rattlesnakes is all we're saying.

Exactly.

Developer Experience and Platform Extensibility

As far as like developer enablement goes… you said earlier, you don't want to force people into an IDE… like developers, we have so many different traits about us, right? Like from the IDEs, even to like the level of like platform that we want…

Like tabs versus spaces, come on, I know it's a joke, but jeez, we can't even agree on formatting our own code.

Dude, we can't agree on anything. There's a big disagreement internally at my company about this.

I'll tell you what, there's a lot of people that like tabs and they're absolutely wrong and eventually they'll come around to it that's fine.

Oh, ho ho ho ho ho.

Ooh. Okay, I gotta go. [laughs]

We have a lot of different tastes. We all have our development environment set up exactly the way we want. I'm weird, I develop locally in Kubernetes - that's how bought in I am. 

But when we're delivering platform for a team, or even five different organizations like you're doing, how do you zero in on the right level of abstraction for your engineers? 

Is it just delivering something and then letting them come along and do that collaboration and PR thing and letting it work itself out? Or are you setting a vision for this platform and doing feedback surveys to figure out how it should be exposed, whether it's a CLI or whether it's a Spinnaker UI? How are you getting that interface that makes the people happy with their interaction for the agility they're about to get?

Yeah, I think as platform engineers, we have to recognize the success of our platform is based on the consumers of them. You know, if the team really enjoys using it and they see that there's constant improvement, they're going to keep using it or they're going to want to keep investing time in making it better. 

I think for me, it's always when you get into like the sophomore phase of like a pipeline or a platform, it's really like the prescribed way of, “Hey, we're doing Kubernetes, we're installing some Google Cloud fun and we're doing this.” - that's the boring part now.

Now it's, how do we get people working together? How do we get people talking? When you start seeing engineers like, “Hey, you know what? We need to talk about a shared component on this that scales across all of us because we don't want to figure out logging 20 different times.” I certainly don't want to have to figure out Serilog drivers in .NET ever again if I don't have to. 

Lost me at .NET, sorry guys.

Well, .NET Proper now is actually pretty good. It used to be .NET Core. It's actually not bad.

You took me back to my banking era briefly and I had a minor heart palpitation.

You should look at .NET 8 and .NET 9. Trust me, I come from the land of Python, Go, and Java from before… Java in itself is in its world right now. I think as a whole, it's actually good. 

Figuring out how we work together, that is really the big thing. So for me, where I draw the line is… certainly we can try and enforce things at a developer level, but really, if you do the platform right, those pockets of engineers or that leadership should be able to then build upon the platform to extend that to their own side.

I don't want to be the one to tell a manager or a VP like “Hey, your team can't do this because they're not following the platform.” The platform should be extensible enough that it should take that requirement. Hey, you know what, we build a lot of things with OpenFeature so maybe that way we can continue using different frameworks like LaunchDarkly or whatever and it gives us control. Hey instead of doing things natively with Splunk logging we'll use something like Fluent Bit to send logs to Splunk, right? 

Those are functionalities in platforms that the engineers can extend upon. So then that way they can have that conversation and take it to their own world and then kind of expand upon that. 

So to me, that's kind of the value of what I see in platform engineering. It's not just the pipeline to deploy. That's table stakes. Like your platform has to be able to deploy things effectively and have enough features and functionality. But when people and teams can then take that and then have those internal conversations and then figure out what a team standard is going to be… like, “Hey, the payment team is going to use Python and Python is our support language. Here's our processes. Here's our rituals.” - that to me is really awesome. Cause that again is, you know, expanding upon what a platform really is.

What I really like about that approach is you're building this base level and you're actually kind of giving people… like just the Fluent Bit example, right? We'll do Fluent Bit and we'll send it to Splunk - that's what we do. But like, you almost have this like level of extension in your platform where it's like, “Hey, we support Fluent Bit.”  

So the thing that's wild about teams, and I'm assuming this just gets even more complicated as you spin up to five different orgs…  like I've seen plenty of companies where it's like, we have one product, but we have eight teams, and the requirements of those teams are very different.

So you have this one team that comes along… maybe it's the payment team… maybe they have a tool that for whatever reason is spitting PII out in the logs. They don't own the tool, right, they've bought it. And so like now, if you've kind of forced them into this is where your logs are going to go… And it's like logs seem like such an afterthought to so many platforms. It's awesome hearing that like that's your level of abstraction, because I know companies right now where they're sending logs to a place they don't want them to go, because that's the way their platform team works, right? 

To be able to say like, “Hey, you know what? I love that I've got this base level of abstraction for our logging, but I can't send them to Splunk for this one service because we don't own it, there's PII just dumped into the logs. Like I need something else or I need to maybe put in a plugin to scrub it.”

S3 or dump it in a bucket for now until I can figure out what I need to do with it later, right?

And that little thing right there is something that like, if you're not designing for it, you're not thinking through it, like that's the thing that pulls that person off of your golden path. 

So now, like if you've forced them into making a logging decision that they don't like, you've kind of just forced them into making a whole bunch of operational decisions they're definitely not going to like when they're trying to support. And probably recreate 95% of what you've done just so they can get this one service logging the way that they need it.

To your point, if you do it right, it's a feature in functionality. For me, it gives an engineer where they can just say, “I don't know, the platform does it.” 

Recognize that some of these apps have different regulatory compliances. When someone, the auditor says, “Hey, how do you do logging in that application?” “Hey, we have a standardized way. It’s the platform.” “Hey, it's a standardized way. It's the platform.” “Hey, it's a standardized way. It's the platform.” 

One, it shows a pretty solid level of maturity. It's not like, “Hey, I'm just writing to standard out and pooping it into a bucket.” No, there’s a process. Then when they ask the platform team, “So hey, how do you do logging?” “Well, that depends. For PII, some apps… we don't log PII so we filter it out with this mask. Here's our pattern, here's our path.” 
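
A mask like the one mentioned here could look roughly like the following Python sketch, using the standard library's `logging.Filter`. The two regular expressions (email addresses and US Social Security numbers) are simplified assumptions for illustration; a real deployment would more likely scrub at the log shipper and cover many more patterns.

```python
import logging
import re

# Simplified, illustrative patterns; real PII detection needs far more coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_pii(text: str) -> str:
    """Replace anything that looks like an email or SSN with a placeholder."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


class PiiMaskingFilter(logging.Filter):
    """Mask PII in every record before a handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then mask the result.
        record.msg = mask_pii(record.getMessage())
        record.args = ()
        return True  # keep the record, only its content was changed


logger = logging.getLogger("payments")  # hypothetical logger name
handler = logging.StreamHandler()
handler.addFilter(PiiMaskingFilter())
logger.addHandler(handler)
logger.warning("refund issued to %s", "jane@example.com")
```

Attaching the filter to the handler rather than to each call site is what lets an application team answer "the platform does it" when an auditor asks.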

We also have things like, “Hey we have to have auditability and long-term logs for XYZ components.”  Maybe those go into a bucket, like I don't know about you all but hot storage of logs is the number one burner from some of our spend here. 

Oh yeah.

So how do we kind of meet that regulatory compliance of you need to be able to go back three, four, ten, fifteen… whatever number you get from the organization that you decide… but also try and keep some sort of cost-effectiveness? And that's where you kind of get that features and functionality.

Oh my gosh, you get on the cost. That's a whole other topic I'd love to talk to you about but we're actually running out of time, aren't we?

Oh wow.

See, I told you, dude. Old hat. I pop out of a garbage can like Oscar the Grouch. We're old hat now. We just blow through this thing. 

Cloud Costs and Infrastructure Decisions

So cost is another one that is interesting because I feel like a lot of orgs get that cost, like, whenever it comes through the cloud. How are you thinking about cost across your five orgs at the team level? Are you kind of surfacing that information to individual teams, or are you more focused at the org level and their general cloud spend?

Yeah, I mean, generally speaking, I alluded to it earlier with environment variables, but that also extends to tags and labels and standardization. I like to think that, in general, most organizations kind of go through this kind of shift, especially when they start going to the cloud. 

“Hey, we need to run it so they see data center costs versus cloud.” Great. But if from there, you're able to then break down that cost by org, by vertical, by team, by (if you can somehow) person, you can then start seeing what the most expensive line items are and how to kind of pivot. 

Most orgs when I walk in, just see the cloud spend and I'm like, “Holy crud.” That's either very high or “What level of rigidity do we have? Do we have any standardization there?” So for me, we have all these kind of like tags and labels applied to things. It's just a matter of the question. So then you can then say, “Hey, how much was AWS for this brand from this month to this month?” You can start breaking that down. Again, it's more so from us doing the right thing, due diligence in terms of this cost.
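
Once resources carry consistent tags, answering a question like that is mostly a group-by. Here is a minimal Python sketch; the line-item shape, the `brand` tag key, and the dollar amounts are made up for the example and do not reflect any real billing export format.

```python
from collections import defaultdict
from typing import Dict, Iterable


def cost_by_tag(line_items: Iterable[dict], tag_key: str) -> Dict[str, float]:
    """Sum cost per value of one tag; resources missing the tag land in 'untagged'."""
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        label = item["tags"].get(tag_key, "untagged")
        totals[label] += item["cost_usd"]
    return dict(totals)


# Hypothetical line items, shaped roughly like rows from a cloud billing export.
bill = [
    {"service": "compute", "cost_usd": 100.0, "tags": {"brand": "brand-a", "team": "payments"}},
    {"service": "storage", "cost_usd": 50.0, "tags": {"team": "platform"}},
    {"service": "compute", "cost_usd": 25.0, "tags": {"brand": "brand-a", "team": "scheduling"}},
]
print(cost_by_tag(bill, "brand"))
print(cost_by_tag(bill, "team"))
```

The "untagged" bucket is often the most useful output early on, since it shows how much spend still cannot be attributed to anyone.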

It's always improving. We don't have the perfect levels of tags and functionality. We certainly have things that we've kind of like, “Oh, we should have tagged it this way. We should have added this label.” But I think that is something very important. 

As you're building your pipeline and your auditability, consider that too, because that's all going to come back. Especially when you start talking about the maturity. Remember, once this like kind of train keeps moving and people are going fast, it's going to be fast. Like for us it went from you know maybe one to two to three apps and now we're close to like 60-70 apps across all. So like it went from like zero to nothing in like a year and a half almost two years.

And you get that network effect too. As you start to roll out your cost tooling, you do it once and boom, five organizations now or 10 teams, 20 teams now have this information available to them. And the wild thing about cost is in many orgs the Cloud costs bubble back to the DevOps team, whether it's their workload or not. And being able to get… it's just like, your team…

You enabled it, you fix it.

Yeah, it's like, “You guys got a huge budget. You're spending $30 million a year on cloud.” It's like, “Yo, I'm running one Jenkins box. That's all your apps.” Like, “You're just blaming it on me, because I got the credit card that AWS is going on.” 

But when you start to be able to categorize that, you start to see some interesting things. I've seen companies that are like, “Oh, I didn't realize we were spending that much money on a tertiary service that makes us no money.” It's just like, you can start thinking about the accounting of it all. Are we spending too much time and money on a service that maybe isn't that important? 

It's hard to see that when you're just seeing a cloud bill and you're not breaking it down to a team and their different services.

Yeah, and I would even say that… this kind of goes back to the maturity comment of when you hear SRE teams and SLOs and SLIs… you can even take this, there was a slide that I saw the other day from a Google SRE training and it kind of talked about the levels of availability and kind of how you come up with your nines. A lot of us kind of… I know I'm certainly guilty of it in previous roles, where I'm able to say, “Yeah, we're 99.99% available.” And like, was there any science behind that? I don't know, that's what Google said they are, so that's what I am too. You all want to be like five nines, but do you really know what that means? 

Then from there, you're able to then have strategic conversations like, “Hey, do I really need tri-region for this app?” “Eh, no, if it's down for an hour, it's probably like a $5,000 loss. Like, yeah, we could figure that out.” Let's maybe get that app down from three regions to one, right? 
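
The arithmetic behind the nines is simple enough to sketch in Python. The $5,000-per-hour loss is the figure used in the conversation above; treating outage cost as a flat per-minute rate is a simplifying assumption.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def expected_outage_cost(availability_percent: float, loss_per_minute_usd: float) -> float:
    """Yearly loss if the whole downtime budget is used, at a flat loss rate."""
    return allowed_downtime_minutes(availability_percent) * loss_per_minute_usd


# 99.9% allows about 525.6 minutes a year, 99.99% about 52.6, 99.999% about 5.3.
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    cost = expected_outage_cost(target, loss_per_minute_usd=5000 / 60)
    print(f"{target}% -> {minutes:.1f} min/year, about ${cost:,.0f} at risk")
```

Comparing the "at risk" figure with what each extra region costs is one way to ground the "do I really need tri-region for this app?" conversation.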

Yeah.

But within one region, maybe I want it to be multi AZ. So like, maybe that's my level of availability. But then you have other apps that are like, if it's down for a minute we lost $10 million (I'm just putting some fictitious number out there), so it's in your best interest that that thing should be the highest cost in your infrastructure. But then also, what are you doing to kind of level set that in terms of availability, right? 

So I think there's kind of a knob that isn't just shut down and delete VMs or tag things, right? 

There's a level of availability and I think that's, from the cost perspective, that next maturity level of, “All right, cool, so it's expensive, I hear you, but the reason it's expensive is because you have it replicated in three regions. Can we go down to two? Yeah, we can. So we'll save half the cost by doing that.”

It's funny. I had a company… I won't name them… but I worked at a company previously that our product broke the home page of a very big company. 

Everyone guess.

I was just kind of round-robined of like the people to like be like you're going to fix this now. And I was told by the CEO that it could never go down again. And I was like, each one of these nines costs a lot more money. And he's like, “Never.”

I was like, “Zeros are a lot more expensive than nines. Like, you can tack a nine on there, it's getting expensive, but as soon as you switch to two zeros and a one, you're talking crazy numbers.”

I actually wrote a blog post about one part of this system a few years ago that kind of went viral. It's called “From $erverless to Elixir”. Like, we'd originally built the interface for this thing in AWS Serverless. And just the API Gateway bill jumped from zero to like $32,000 within like two months because more of our customers started using it. 

More of them wanted this more resilient system than the one that we'd had before that. And that was just the API Gateway. The Elasticsearch behind it was like, I think it was about $200,000 a month or something like that. It got very expensive very quickly.

Keep talking about AWS costs, you're going to have Duckbill and Corey Quinn here talking about this stuff. 

I know.

Corey Quinn is great on Twitter because he's such a big champion in the AWS community, but he loves ragging on them. Like, API Gateway, what are you guys doing? This is crazy.

It was funny because like it was like all of a sudden it was like, okay, we've gotten to like… I mean, we can't guarantee 100%. It's hard to do… but like we've got enough stuff in place like those nines are like way out there. 

And all of a sudden I get an email about a month later. It's like maybe we can tone down those nines a bit. I was like, “I was trying to tell you.” It's like, we spent… 

We're down for three seconds in a year.

Yeah, it's like we spent more money on these nines than we were being… we had a financial SLA with the customers. So it's like we spent more money on these nines than we had to pay out for an outage. It's like, “Yeah, yeah, yeah, accounting bro - you gotta do it sometime.” 

Closing and Contact Information

Well, I really appreciate the time today. This was super fun. Before we go, I would love to know like where can people find you online and is there anything that TAG needs help on? Are there any open source projects you guys are working on that you want eyes on? Are you guys hiring?

Yeah, no. If anyone wants to find me, I'm justjoelv on X. And I'm also Joel Vasallo on LinkedIn. So find me there. You can't miss me… I think the sword that he talked about earlier… I'm the kind of crazy guy with the sword in my profile picture. So that's me. 

But no, yeah, I mean, the Aspen Group is always looking for great opportunities. We certainly have a lot of development kind of occurring from .NET to Python to even some apps being in React and JavaScript frameworks right now. There's so many of them. But yeah, we're always looking and expanding. So I would always encourage folks to check out the company. 

And then really the biggest thing is just keep focusing on the community, folks out there. That's the biggest thing. The whole reason the CNCF stuff kind of is where it's at - there's a whole organization behind it. But focus on building a community. Focus on getting out there. Focus on talking. Because that really is what drives what we do in our day-to-day, that's for sure.

Awesome. Well, everybody thanks for tuning in. It's been great having Joel with us today. I was so fortunate to meet you in real life. I can't wait to… as soon as I'm in Chicago, I'm tracking you down… in a fun way, in a fun way, that was not any sort of threat. We're going to get a deep dish.

We'll get some deep dish and then I'll pair it up with some Malört. So we'll be fine.

Oh my gosh. I just heard about this over the weekend. One of my friends is from Chicago and I was told that it is the worst alcoholic beverage ever, but it's apparently like becoming a thing like a meme worthy drink that people are doing.

Yeah, when I went to Utah, I went with a bottle of Malört for one of my friends who lives in Utah. And I said the two things of this are: (1) I had to check a bag to bring this here; and (2) I had to spend 20 bucks to give it to you because you'll never drink it again. And he took one sip of it and was like, “This is the worst thing!” and he poured it out. I was like, “Yep, that's kind of what I figured.”

There's this little thing that happens at ElixirConf where there's a small group of people that all get together and they bring alcohol from where they're from and they have like a little party.

That’s cool.

I feel like there needs to be one for KubeCon and you can just wreck the party by bringing some Malört.

Me and a co-organizer for my local GDG community here in Chicago, we've done that before. And yeah, we haven't been invited back. So maybe that's a sign.

Thanks Cory for having me. I really appreciate being on the podcast and, yeah, thank you.

You're definitely getting invited back here. I loved talking with you today. And this is the last episode in the healthcare series. So a super fun one. It's been really awesome talking to so many healthcare companies, kind of going back to my roots, seeing how much the world's changed in 20 years. 

The next series I'm going to be working on, if anybody's listening… kind of goes hand in hand with Trash Ops… is nightmare production stories. So if anybody's listening and you've got a great story… we can bleep out names, we can bleep out companies. 

We have the witness protection black screen and voice modulator.

There's going to be so much editing. Drew's gonna love all of this. He's going to have a blast with it. 

Yeah, if you've got nightmare production stories, go ahead and reach out to me, Cory@massdriver.cloud. 

Thanks so much for tuning into this episode. And thanks, Joel!

Important Links:

Featured Guest

Joel Vasallo

Senior Director of Platform Engineering at TAG