vCluster with Lukas Gentele: Rethinking Kubernetes Multi-Tenancy

Are your platform teams constantly saying "no" to requests for new Kubernetes clusters? The traditional approach to Kubernetes multi-tenancy forces organizations to choose between cluster sprawl or restrictive namespaces - neither of which fully meets the needs of modern development teams.

Lukas Gentele, CEO and co-founder of Loft Labs, shares how vCluster is transforming the way organizations handle multi-tenancy in Kubernetes. By running virtual Kubernetes control planes inside namespaces, vCluster enables teams to experiment with different versions, operators, and configurations while maintaining efficient resource usage.

Key topics covered:

How vCluster solves the limitations of namespace-based multi-tenancy
Running multiple Kubernetes versions in the same cluster for testing and gradual upgrades
Managing bare metal GPU resources efficiently for AI/ML workloads
Balancing standardization with developer autonomy in platform engineering
Using virtual clusters for cost-effective testing across multiple Kubernetes versions

Whether you're a platform engineer looking to say "yes" more often or a development team seeking greater autonomy within Kubernetes, this discussion offers practical insights into modern multi-tenancy approaches.

Guest: Lukas Gentele, CEO & Co-Founder at LoftLabs

Lukas Gentele is the CEO and Co-founder of Loft Labs, which delivers Kubernetes-native tools, functionality and frameworks purpose-built for platform engineers to manage, activate and optimize their platform stack. Gentele is a dynamic leader with wide-ranging expertise in enterprise architecture, distributed systems, and developer productivity solutions. Prior to Loft, Gentele served as the co-founder and CEO at covexo GmbH and Webmans. Gentele often speaks at conferences such as KubeCon, writes articles for leading industry journals, and likes to share his experiences at meetups. Gentele holds a Bachelor of Science in Computer Science and Information Systems, and a Master of Science in Computer Science & Management of Enterprise Information Systems, both from the University of Mannheim.

Lukas Gentele, LinkedIn

Loft Labs

vCluster Slack channel

Links to interesting things from this episode:

Transcript

Intro: 00:00:04

You're listening to the Platform Engineering Podcast, your expert guide to the fascinating world of platform engineering.

Each episode brings you in depth interviews with industry experts and professionals who break down the intricacies of platform architecture, cloud operations and DevOps practices.

From tool reviews to valuable lessons from real world projects, to insights about the best approaches and strategies, you can count on this show to provide you with expert knowledge that will truly elevate your own journey in the world of platform engineering.

Cory: 00:43

Hey everybody, welcome to the . I'm your host, Cory O'Daniel, and today I'm joined by Lukas Gentele, CEO and Co-Founder of Loft Labs. Lukas has been deep in the world of Kubernetes and development platforms for years, and Loft has been a go-to platform for teams that are looking to tame the chaos of multi-tenant Kubernetes. Whether it's spinning up thousands of clusters or supporting next-gen AI cloud providers, Lukas and his team are tackling some of the hardest problems in modern infrastructure. He's spoken at KubeCon, contributed to the open-source ecosystem, and has strong opinions on what doing platform engineering right actually means.

So Lukas, welcome to the pod. Thanks for coming on.

Lukas: 01:20

Good to be here, Cory. That was an introduction. Wow.

Cory: 01:24

Hey, you know what? I'm a performer. What can I say?

Lukas: 01:28

I appreciate it.

Cory: 01:29

You've got a super interesting background and I'm excited when anybody wants to come on and talk Kubernetes. It is one of my favorite pieces of technology and I am definitely a fan boy.

‍But before we hop into that, I would love to hear a bit about your background. What got you into operations, platform engineering, and what led you to co-founding Loft?

Lukas: 01:48

Yeah. Let me start at the very beginning. I started coding when I was like 13 or something like that, you know, fascinated by building my own website. That's how it started. And then hitting the limitations of what the website builder was permitting me to do. And then just going deeper from there, all the way to, you know, in college, I ran my own mail server and I was tinkering with hardware in a co-located data center. Spun up Kubernetes the hard way, I guess, as Kelsey Hightower would call it.

Yeah, I went pretty deep on that end. All the way from web technology through, you know, running dedicated Kubernetes clusters on bare metal. Lots of fun. I started the entrepreneur journey pretty early, when I was about 16 or so, I started my first single-person little company in high school. So I was pretty hooked on that.

Towards the end of college, you know, you're kind of thinking about, “Am I going to just take a job at a bank or start working for Google or Amazon? Or what am I supposed to do?” And it was pretty obvious for me that I wanted to start a company. And Kubernetes was really exciting. I worked on a lot of really exciting Kubernetes projects with my fellow students. And one of the most talented engineers I've met in college was my co-founder and CTO Fabian. So we started working on this project called DevSpace, which is a developer tool for Kubernetes to extract away some of that dev workflow, some of the tedious pieces in there. And we open-sourced it. It was the first time we, you know, open-sourced a project and kind of contributed such a major piece, rather than just the lines of code. And we were immediately hooked by, you know, people sharing it and all the feedback we got and people just starting to run with it. That's really the journey in a nutshell.

C‍ory: 03:37

I hear this story a lot – I got in early, I was doing some development, building websites, wanted to start my own company, started doing some website stuff. I always think like, “Oh man, did I squander my youth?” I started developing around the same time, I'm in high school, but I just screwed around on AOL, just trying to make IAM bots to make chat harder.

Lukas: 04:01

Well, that sounds like a fun task too.

C‍ory: 04:03

I mean, it was fun, but there was no money to be had. It was just being a doofus on the internet.

This is very cool, very cool. So let's break down Loft Labs. Could you walk me through what Loft Labs does, and particularly, what is vCluster?

Lukas: 04:15

Yeah, vCluster is a way to virtualize Kubernetes. It's a central piece of our product stack. We're really working to virtualize the entire cloud native stack. vCluster is really the central piece when you map out the stack, you know, like when you look at the CNCF landscape or any AWS architecture diagram, right? You have at the bottom, infra and provisioning. You have in the middle orchestration, which is things like Kubernetes, but also things like service mesh to orchestrate your network. And then you have things on the top that are, you know, application definition, application development, your CIA, your registry, your developer tooling, those kinds of things… your packagers like Helm, right?

We started in the middle because we saw a lot of inefficiency in Kubernetes. When you think of the early days of Kubernetes, it was always how big can the cluster get, right? Like the scheduler was the important piece. Go back to a Kubernetes blog and you’ll see blog posts about Kubernetes scheduler can now handle 500 nodes, now it can handle a thousand. It was like, how big can the cluster get?

Kubernetes really originated from this idea of we're getting rid of these pet VMs and instead we're going to have containers dynamically allocated on this network of machines. Kubernetes is the orchestrator.

Most enterprises today are not set up that way. Instead, they have 500 little three-node, five-node Kubernetes cluster. And we looked at this and we're kind of like, “Why?” You’re really not getting the benefits of Kubernetes because instead of having these pet VMs, you now have these pet clusters and they consist of three VMs. So you three-x’d the problem, right? Rather than actually building the solution that Kubernetes was supposed to be.

So we set out to solve that problem and dug into why people are not sharing clusters. Why is multi-tenancy not a thing? And it turns out it's really hard to make multi-tenancy really work in Kubernetes. And vCluster, you know, is a virtualization layer that really helps you achieve this. Sometimes when I tell people, you need virtualization for Kubernetes, folks are like, “But doesn't Kubernetes already have users and permissions, RBAC, namespaces, right? Isn't it possible to share a cluster? I don't think anybody who asked that question has ever tried, to be honest.

When you look at a Linux host, it also has users and permissions and folders, right? But sharing Linux host without virtualization is equally hard. So why would sharing a Kubernetes cluster be any easier only because you have r-vik, right? It's actually really, really tough if you're trying to make it work.

Cory: 06:50

Yeah, I've seen both these worlds. It's funny, you're talking about getting the clusters bigger and bigger, and I remember when, I think, GKE supported 5,000 nodes. And I was working with a customer… they were at a limit, and they were like, “We need to put more nodes in this cluster.”

‍They were running this beast of a cluster. It was very impressive. But we do see this sprawl of like, “Okay, I've got a cluster for each team.” And it's like, okay, you've sequestered everybody off, but I don't know, especially when you're in something like EKS, it's like you're paying for a lot of control plane.

‍On the other side of this, you know, we have this concept of a namespace in Kubernetes and like, “Why is the namespace not enough?”

Lukas: 07:31

Yeah, honestly, you're bringing up GKE, right? And I blame them a little bit for the proliferation of clusters, because to be honest…

C‍ory: 07:37

You heard it, people. I'm just kidding.

Lukas: 07:39

I mean, they were the only ones that had no cluster fee. So the fee per control plan you're talking about, you know, that wasn't around in the early days, right?

And that was a great thing because people were exploring Kubernetes. They could spin up a lot of clusters, which is obviously great. But it also makes you think less about, “What am I doing here?” versus if there's a fee attached, like today. Obviously GKE reversed that decision. In EKS, AKS, everybody has the same cluster fee pretty much today because it does cost, you know, a certain amount of compute and obviously maintenance etcetera to run a control plane. So yeah, people are much more conscious of spinning them up.

It's really not just about the control plane. It's also about the nodes, right? Because when you spin up a cluster, you know, that cluster is pretty bare bones unless you put things like Istio, OPA, Prometheus, right? Like you have to put all of these tools in each one of these clusters. So there's a certain amount of compute capacity just reserved for that. Whether that Prometheus or that Istio is doing something or it just runs it’s going to consume a certain amount of CPU and memory, right? And now multiply that by 300 because you have 300 clusters. That's a lot of ways to compute, right?

C‍ory: 08:52

It is.

Lukas: 08:52

That's how I look at these separate clusters. You asked about namespaces. Why are they not good enough? If we were to just like throw everything together, build this giant 5,000-node cluster you're talking about, right? I think in some production scenarios, that is easier to do. If you have like a large SaaS running in one Kubernetes cluster, right? Like that's probably possible.

But the scenarios where you have multiple different applications, you buy vendor applications or your teams use Dev and CI environments on Kubernetes, right? Like all of these kinds of scenarios, you really want to segregate things. But if you do that with namespaces, you take a lot of the power away from these users because suddenly everybody has to have the same Kubernetes version. Nobody can do anything at the cluster level. Like CRDs are completely off-limits, right?

C‍ory: 09:42

Yeah, that's a tough organizational orchestration you kind of get into. I mean, even just how do I roll out a Helm chart that depends on a custom resource. It's like you have no control over the CRD… getting that version… now you have entire teams in lockstep on versions moving through stuff. Which some people may like, but it's one of those things that also hinders progress, right?

‍As soon as you're like trying to test this new version, are you going to test it on everybody at the same time? That is a poignant point, I love that.

Lukas: 10:14

It makes cluster upgrades, CRD upgrades, controller upgrades, like all of these things become pretty risky now and they have to be heavily coordinated. But we live in a world where, you know, you want to crank out multiple releases a week. We don't live in a world anymore where you do like a big release once a year or once a quarter or something like that, right?

People are continuously shipping and innovating, so you have to remove these kind of lockstep upgrade deadlocks that folks have. You know, you have this spear tip team that is like really cutting edge and wants to bring your organization forward, and they want to use the new version of Istio and they want to try this thing with Argo, right? But they don't have the permission to change that inside a single namespace. And then you have other teams who, you know, are kind of like, “Oh, we're not ready for this Kubernetes upgrade.” and they may be holding things back. It's really hard to balance that in a multi-tenant cluster when you just have namespaces at your disposal.

With a vCluster, you launch a control plane that runs inside of that namespace. So you get a control plane that runs inside a pod. And then all of these objects… you know a CRD is just kind of a made-up object, it's just a Kubernetes resource, right? It's not tied to… you know, like a container is tied to a process on your Linux kernel as actually something running there. But a namespace or CRD, those are entries in your etcd, in your data store, right? That's all they are.

So why not store them in a virtual environment and then only have the pods, the thing that actually runs something, shared on the multi-tenant cluster in your namespace? That's the idea behind the vCluster. Let's keep all of the higher level objects in the virtual space and then the real objects… like a service that does obviously pod-to-pod communication and the pod that actually starts the containers… those should run on your namespace. Everything higher level should run in this virtual environment. That's the idea behind the vCluster.

C‍ory: 12:09

So you're taking just a vanilla Kubernetes cluster. Could you even take some of managed ones, like EKS/GKE? You're putting the cluster into a pod inside of somebody's namespace. Now they have these virtual CRDs. They interact with these higher level, like these cluster-level resources.

‍So let's say you have like an operator, where does the actual controller for that run? Does it then run individually in each one of the namespaces alongside… is it running in the virtual cluster, or is it running in the actual namespace alongside my workloads?

Lukas: 12:40

Yeah, that's an excellent question. The answer is both are possible.

Cory: 12:44

I love it. It depends.

Lukas: 12:46

It depends, the classic answer, right? I would say the default is, so you're launching that controller, that controller is just going to be a deployment or state for set or something like that, that you're deploying to your cluster. And that one you deploy to the virtual cluster, the pods that actually host the binary or whatever makes up that controller, that pod is run by the underlying cluster, but that pod has a service account and talks to the API of the virtual cluster. So when your controller has a watcher on a CRD, it's watching that state, that etcd state of the virtual cluster rather than the real one. But the pod is actually run by the real cluster.

It's really fascinating because you're kind of blending the worlds, right? And then there's the opposite mode that we have as well, where you say, “Hey, we want to have certain CRDs available to our tenants and we as a platform team want to manage them in the underlying cluster.” The vCluster has a so-called synchro, which means when you create a CRD in the vCluster and the synchro is enabled for that CRD, the CRD will be copied in the underlying cluster. And then the underlying cluster’s controller is going to reconcile it.

So let's take the example of a cert manager. Something pretty standard that doesn't change very frequently, right? But lots of teams need certificates. So let's say you run cert manager at the underlying cluster and your platform team says like, “Hey, we're going to run and maintain this for everybody.” We are the certificate authority for folks. We have all the credentials etcetera, but teams should self-service their certificates.

What you can do is you can enable CRD syncing for the certificate resource. That means your teams can create the CRD inside the vCluster. It gets synced down to the host cluster, the underlying host cluster creates the actual Kubernetes secret that now holds the provision certificate. And that secret is going to be synced back into the vCluster. So that team created a certificate, got a secret out of it, and there was no cert manager controller ever running in that v… they don't see it, right? It's kind of like magic. It's just as if the controller was there, but you can't see it.

C‍ory: 15:04

Yeah. I got a couple more questions on this. You're getting my nerd juices flowing.

Lukas: 15:08

We're going pretty deep already. This is awesome.

Cory: 15:10

Sorry. No, I like it. So, I love that flexibility, right? Because one of the things that always kind of gets me about Platform and a lot of the platform tools that are out there is like, it's hard to just pigeonhole your entire company into a product, right? So that flexibility of like, I do have a team that wants to manage the security of all of our certificates, and that is an extremely valid concern and a valid take, right?

‍Can you kind of mix these two options? Like, let's say I have that spearhead team and they're like, “Okay, there's a centrally managed operator that the platform team's managing, but I want to experiment and move forward with like a next version.” Can I take my namespace and say, “Hey, we need to override it for this version so we can test something out.”?

‍Maybe it's even the platform team that wants to test it out in isolation, right? And then bring it over to be the ones who are… Does it support like that kind of, I guess, progression as well?

Lukas: 16:00

Yeah, it does. Every vCluster is completely independent of the other vClusters running on the host. So we have what's called a vCluster YAML. That's the config file that really makes up the configuration of that vCluster and how it behaves. So you could spin up, you know, a hundred vClusters, which all use the central, you know, CRD syncing with your central cert manager. And then you can spin up a vCluster that has that sync disabled. And on that one, you can just deploy your own cert manager.

Actually, the scenario that you just outlined of a platform team wanting to test certain things is something we've seen very commonly. People are like, “Okay, we want to test against different versions of Kubernetes because… you know, let's say you have a hundred vClusters, there’s a good chance, and that's actually one of the benefits that these vClusters run with different Kubernetes versions. And it's independent of the host cluster. The only thing they have to agree on obviously is what is a pod. So the pod resource needs to be pretty standardized, but it pretty much is in Kubernetes, to be honest.

I mean, we may be adding some fields in Kubernetes from time to time. I remember ephemeral containers was added at some point, right? I think it was like two years back or so. And when that happens and you have a vCluster with a newer version and you're using that particular field… ephemeral containers are part of the pod spec… then it's going to tell you, “Hey, your host cluster doesn't support that. You can't do that.” But other than that, it's pretty flexible. Unless you actually use… you could even have a newer version that supports that field… but as long as you're not using it, you're good. But once you start using that feature, obviously your underlying cluster needs to support that feature if it comes to these container-level specifications. But that's a very, very minimal constraint. Other than that, all your CODs, your namespaces… you can even have, you know, RBAC in the host cluster and, you know, things like OPA running in the host cluster, restricting privilege containers. But you have another OPA running inside the vCluster, adding an additional layer. Like the admission control loop is going to happen twice. First on the virtual level and then on the real level for these pods. It's really fascinating what's possible with Vcluster.

Cory: 18:06

Yeah. Okay, this is the hard-hitting question. You ready for it? This one's tough. Have you used the exhibit, “Yo, I heard you like Kubernetes, so we put Kubernetes in your Kubernetes” meme for your advertising yet? Do you remember this from Pimp Your Ride? Do know what I'm talking about? No? Oh my gosh, I hope somebody is in my age group. I'm gonna send you an amazing meme that you can use for an advertisement after this call.

Lukas: 18:35

We did use, however, vCluster Inception. You know the movie Inception? Because people are like, can I put a vCluster in a vCluster now? Because technically it should sync the pod to the second vCluster and the second vCluster syncs it to the host. And it actually works. We've done it a couple of levels deep, actually.

Cory: 18:52

What is reality anymore? That is very, that is very cool.

Lukas: 18:54

Yeah, you don't know when are you on the actual real cluster.

Cory: 18:57

Okay, I'm making you… I love getting on the meme generator. I'm doing that for you right after this call. Very cool.

With these teams that you're starting to see adopt vCluster, what do these teams look like? Are these traditional operations teams? Are these more platform teams that are trying to figure out how to extend Kubernetes to their engineering partners? What type of folks are starting to adopt vCluster?

Lukas: 20:18

Most of the time this is championed by the platform teams because they're in charge of handing out infrastructure and enabling the engineering teams across the org. They have a lot of control over the AWS account, etcetera. So it's the right spot for us to be in.

They are usually in this unfortunate situation. Either they have to hand out a bunch of different little clusters… which is going to hit everybody's budgets and in some organizations, you still have to get budget approvals for this, et cetera, because a Kubernetes cluster isn't necessarily something really cheap.

They really can't be too free in handing out Kubernetes to everyone. Maybe a team says, we need four clusters because we want to work on different things in parallel or we want to split things up. It's really, tough because they have to essentially say no a lot of the times.

The alternative is to create namespace-based offerings. And that one, they can be much more freely in terms of like, “Hey, everybody can get 20 namespaces.” Namespace doesn't cost anything. But then you have all these constraints that limit folks. And then they come to you again as a platform team and they say like, “Hey, we would actually need this Kubernetes version.” or “We would love to have this operator running. Can you deploy this for us?” And the platform team is kind of like, “Ehh, we're responsible for it.” Or in the worst case, they're like, “Oh, we already have something running.” That's the wrong version, right? And, as we all know, there's only one version of a controller and one COD version that can run in the same cluster. So you really have to say no a lot in these cases as well.

We help platforms teams say yes a lot more. They say, “Okay, yes, you can have a cluster, but it's a virtual one.” And ideally, you know, the teams don't even realize that that didn't get an EKS or GKE cluster, instead they just got a virtual cluster.

Cory: 22:11

‍Yeah. I mean, since it's working on the Kubernetes API, virtual or not, it works with all the tools that they're familiar with, right?

Lukas: 22:17

Yep. It's a certified Kubernetes distro. You know, the CNCF has all these compliance test suites, and we run the entire suite every time we do a release. So we're really making sure that we're compliant.

When you move an application or a CI pipeline from one cluster to another cluster, the only thing you have to do is switch out the kubeconfig, effectively. You're planning it at the virtual cluster instead of the real one, but we don't require you to change your application or re-architect anything. The goal is really to make this as seamless as possible. Which is like the essential Kubernetes promise, right? That we're abstracting things away from the differences of the underlying hardware on cloud providers and things like that.

C‍ory: 22:59

This is reminding me of… sorry, it's not reminding me, it's triggering a memory in me from a company I worked at previously. Not the giant customer of mine that had the massive, massive cluster, but I worked for another company and we had some pretty high uptime requirements. And I remember whenever we'd go to do a Kubernetes migration… we were one these companies, it's like we really yielded the benefits of orchestration. We really leaned into containerization and the applications themselves were all fairly trivial applications at massive scale. There weren't these newfangled ETL pipelines, AI, etcetera, it was just handling web traffic. Whenever we'd upgrade a Kubernetes version, we would do a little spearhead test of, OK, let's run some stuff over on this newer Kubernetes version over here. But then we just had so much stuff to kind of all interweave to make the suite of products that we offered that our migration process was we never actually upgraded a Kubernetes cluster.

‍We would bring up in a completely new cluster running the version that we wanted, and then we would cut over applications and then just kind of like bifurcate traffic across them until we saw that everything looked good there. We'd move everything over. And then once it was there and stable, we'd eventually shut down that old one. So we were just kind of just constantly getting new Kubernetes clusters out.

‍Now, this actually sounds like a fantastic way for a team to actually test and upgrade without going through like an extreme amount of pain. Given that I'm not, you know, jumping into a new feature in this next version, and I'm just, “Okay, we're trying to go from 127 to 128.” or whatever. I'm trying to go between a couple of versions, maybe because of a CVE or security or whatever, maybe it's EOLed… I can put the virtual version in front of it, see the applications working there. And once I've moved everything to the virtual version, I can bring up the version of that cluster to match. And now you can start using that net new functionality in the new version, yeah?

Lukas: 24:46

Yeah, a hundred percent. We've seen this a lot. We've even seen people, you know, in their CI pipelines, let's say you run end-to-end tests, for example, and you are writing a controller and you're working with CRDs, right? You have a very cloud-native application. Maybe you are a vendor in the Kubernetes space, right? And you’ve got to make sure that whatever you're writing there works in so many different Kubernetes versions.

So we've seen people spin up, you know, like eight different Kubernetes… like vClusters with different Kubernetes versions. And then run end-to-end tests in there and then dispose the vClusters afterwards, right? Like truly ephemeral cluster. Just imagine someone was going to spin up eight EKS clusters on demand for every CI pipeline, for every pull request, right? That sounds ridiculous, but a vCluster is just a container. So spinning up eight of them and then getting rid of them afterwards is like, it's a no brainer, right? It's really straightforward to do. It takes 10 seconds to launch a vCluster, or less.

Cory: 25:37

‍Yeah, that is very cool. And that hits home, too, because I actually manage a library called Bonnie, which is like an operator framework for Elixir and Erlang. And we actually use K3S, but we have a pretty interesting test matrix of testing the Elixir versions, the OTP versions, and then the Kubernetes versions. But we spin up this big fat grid of them to do all this testing. It sounds like we just… Does it work on K3S? Can I put a vCluster in K3S?

Lukas: 26:00

Yeah. We run pretty much any certified Kubernetes distro and we actually started out by putting K3S…

Cory: 26:04

Whoo!

Lukas: 26:07

vCluster is really control plane and pods. So we switched out, you know, essentially to scheduler because… we send pods to the host and then the host schedules these pods, right? So the vCluster doesn't have a regular Kubernetes scheduler, it has a synchro instead. But for the rest of the control plane… you know, controller manager, API server… we were kind of looking at like which one are we using, which distro are we using. And we did start out with K3S.

C‍ory: 26:32

Okay, very cool.

Lukas: 26:33

So in each vCluster, was a half of a K3S cluster running, right? We then supported K0S and vanilla Kubernetes. Today we're gravitating towards telling people to use vanilla Kubernetes because we've realized that has the fewest limitations. Obviously it's the most upstream, quickest updates, etcetera, there's no lag if there's a security issue coming out. So we're definitely encouraging people to use the vanilla version, but we did start out with pure, you know, K3S as the control plane that runs inside of the vCluster.

Cory: 27:04

That's very cool.

In the pre-interview, you had mentioned that you all are seeing AI pulling you in some surprising directions. How are AI native startups and infrastructure projects using Loft and vCluster?

Lukas: 27:18

I think the AI space is very exciting because they need bare-metal performance in a lot of cases. And we're talking about really expensive hardware, right? It's not like your average CPU that is comparably cheap. These H100s and these GPU racks, that's really, really expensive hardware. So if you don't want to use virtualization or virtualization is not really possible, you need that bare-metal performance. You don’t really have a GPU specified hypervisor. Then the question is, how do you spin up clusters? How do you dice things up? And one great way to do that is to spin up one large Kubernetes cluster that has all of your GPU nodes and now launch virtual clusters on top in order to segregate things. That's making a lot of these operations a lot easier.

The question that arises though…and that's one of the latest things we shipped and we're really, really excited about this… is how do you isolate things on the node level now. Because obviously, you have containers running for multiple vClusters on the same node. And that's a tricky challenge. You know, if you don't want to go the virtualization route and, you know, even if you have virtualization, you may want to think about nested virtualization, which can be quite ugly or doesn't work in the public clouds. So there has to be a way for these bare-metal… you know, especially like expensive hardware like GPUs… to share it efficiently at the node level but still make sure people don't get ineach other’s way.

That's why we worked on this project called vNode. It's our way of going a layer down in the stack. vNode is this new product we launched. It's essentially virtual nodes and you can combine it with virtual clusters and then you get isolation at the control plane level but also isolation on the node level.

The idea of vNode is really to package things into a vNode when they belong to the same vCluster. So you have one node but you have three vClusters that have pods on that node, so we wrap them into a vNode to securely isolate them.

Cory: 29:27

That's very cool. And did you say that that works on the hyperscalers as well?

Lukas: 29:30

Exactly, yeah. It works in pretty much all the public clouds. We tested it in, you know, GKE, EKS, AKS. We have instructions on how to run it in each one of them. One really important thing for us was we didn't want you to have requirements like nested virtualization works or we didn't want you to have like, “Here's your custom VM image that you have to run in your hypervisor.”... like we don't want any of these constraints. So the only requirement we have is you choose the right… you know, like your cloud providers, usually have a specific default image for your nodes in EKS or something like that.

The defaults right now have too low of a Linux kernel for what we need. So we do ask you to switch that – you know, hit that dropdown and select the right image. But all cloud providers have by default in their set of images for these nodes, the right kind of image that we need. The only constraint we really have is that you use container D. So currently OpenShift does not support it because OpenShift uses Cryo as container runtime. We require container D version 1 or version 2. Version 2 is pretty cutting edge, but still works. Both of them are supported.

The other constraint we have is the nodes need to have Linux kernel version 6.1 or higher. And that's why we require you to switch that default image in some of the cloud providers. But over time, 6.1 will be the default. I think probably in a year from now, you spin up an EKS… I think even if you spin it up with eksctl, right, like the CLI tool, it already by default chooses the higher version. If you do it through the UI, the web console, it actually gives you a little bit of an older version.

So that's the only constraint. Otherwise, it works completely in the public cloud and obviously in your private cloud as well.

Cory: 31:15

And is vNode also open source, or is this a part of the Loft Suite?

Lukas: 31:19

No, vNode is part of the commercial offering, it's not an open source project.

C‍ory: 31:23

Okay. So let's say I'm a user of vCluster today. When do I start moving towards using a managed version like Loft?

Lukas: 31:32

You know, vCluster open source is an amazing project with a lot of benefits baked into the open source. And I'm a true believer of like your open source project needs to be a true innovation and really a game changer for folks and providing a lot of value on its own.

So the question is, what is the commercial offering beyond that? And I think that needs to have an equal amount of value, right? So we really carefully built the commercial offering. We have what's called vCluster platform. It helps you manage your fleet of virtual clusters and for certain common standards we have vCluster templates, for example, that allow you to say, “Hey, this is what the average vCluster should look like in our org.” Or here's three versions of it, so standardize a little bit more.

It also has certain things like your vCluster typically in open source runs with SQLite as the backing store. So we use Kine, which is what K3S uses under the hood as well, to talk to relational databases rather than NCD. SQLite is so straightforward, you just have a PV for each vCluster, very straightforward to spin one up and have the state captured, that virtual state, without firing up an entire etcd cluster. And the alternative is for folks if… SQLite obviously is single file data stores… if they’re hitting certain limits, they would need to now hook up an etcd cluster and manage that. Which is also kind of annoying.

Cory: 32:53

Yeah [laughing].

Lukas: 32:55

I see you had some pain running etcd clusters.

Cory: 32:58

Yes.

Lukas: 33:00

Welcome to the club, that's definitely an issue. We saw that immediately. We sometimes get requests from folks that say like, “Hey, we have 300 vClusters running. We're thinking about spinning up 300 etcd clusters.” And I was like, “Oh, I'm not sure we should be doing this.”

Cory: 33:16

Someone's got a lot of time on their hands.

Lukas: 33:18

Yeah, exactly. It’s a pretty risky path, so many etcd clusters.

One of the things we have is, we call it embedded etcd. It actually is an etcd that runs inside the vCluster container. And now you can scale that container horizontally from one to three to seven, etc. and that vCluster becomes an HA etcd cluster, but the etcd is managed by vCluster. That means you’re offloading a lot of that responsibility, you know, to us and our software rather than completely managing it yourself. It's much more lightweight. And then you can get even beyond that, with the commercial offering, you can also offload it into a relational database like RDS, for example. And that way you have a global RDS and AWS, which is, you know, completely resilient, super scalable. Like infinitely scalable backups, rollbacks, like all of that is supported. And your vClusters can share the same RDS instance, so you don't need a separate RDS, cause that would be cost-prohibitive.

So we're doing some really smart things around these kind of topics to make the at-scale operation of vCluster easier for these platform teams. So if you have a certain amount of vClusters running, that's really where we can help you with the commercial offering, beyond the open source innovation.

Cory: 34:34

Very cool. Honestly, this is super exciting to me. Like this is, I think one of the cooler projects I've seen in a while. Like just the idea of like how it can be... I mean, it seems a super flexible tool, but like whenever I see anything that helps ease the testing in like day two, particularly around Kubernetes integrations, like that's super rad. I'm really excited about this.

Lukas: 34:56

That's awesome to hear.

Cory: 34:59

One of things we did talk about early on in the pre-interview was about doing platform engineering right. So I would actually love to know, what is one of the things you see… The teams that are doing well, what is a common trait that you see across those teams? And what do you think is something that, if you're just getting into platform engineering, that maybe you want to let go of for a minute?

Lukas: 35:23

There's a really important factor here in platform engineering. And, in my opinion, it's striking the balance between offering, you know, the paved or golden path or like the recommended way in the org and at the same time, giving people autonomy and freedom to explore and experiment and push ahead with ideas. Because when you're thinking about it… if a hundred percent of your engineers are only choosing the paved path, then who is innovating on what the paved path of tomorrow will look like? So my take is you want these, I call it transparent abstractions… you want to abstract and make things easy for folks getting started and for the average kind of team just focusing on getting their things shipped with Kubernetes. And Kubernetes is the platform of choice for a lot of platform teams out there.

But then the question is, how do you make it possible for… how do you not build this path that locks them completely in, abstracts Kubernetes in a way that they can't actually get to the container logs anymore? You know what I mean? They're really struggling with that untransparent layer. So let's create a transparent layer and let's give them areas where they can… you know, sandboxes and a vCluster to play around with, right?

Sometimes we get this question from teams… and I get where it's coming from… like, “You know, we handed out vClusters. Should we start, by default tenants, a cluster admin in those vClusters? Should we start locking them in?” And we're like, “Maybe don't do that, or don't take all of their autonomy away.” because you want them to have that freedom to explore and iterate on things. I think that's a fine balance between guard rails and being an obstruction and slowing innovation down.

C‍ory: 37:14

I mean, I love the idea of guardrails, but like the analogy of guardrails is a funny one. It's like once you've hit a guardrail, you're actually too late. Like you didn't go off the cliff, but like you've still damaged the car.

Lukas: 37:27

Right.

Cory: 37:27

And I remember, when I was 16, I crashed my dad's car into a guardrail. And you know what? He was very happy that I didn't die. But he was also extremely pissed off at the same time.

‍Guardrails are great, but they are going to restrict you, and sometimes you do need to go off-road. A little off-roading never hurt anybody, as long as you're not off-roading off a cliff or into another lane.

‍What’s something that you see teams get hung up on as they're trying to level up their DevOps maturity towards platform engineering that maybe is one of those things that's not worth the effort or could be hampering you when you're trying to become more efficient?

Lukas: 38:07

Honestly, this may be a controversial take in the CNCF open-source community, right? Like I'm a big proponent of open source, but I think sometimes teams have this, “100%. We’ve got to build everything and everything needs to be open source” kind of rule almost baked in. Whether it's like in an individual's head or whether it's even like a company type policy, right? Sometimes it's better to go with, “Hey, this commercial offering actually would get us there faster and would get us to a full-blown solution quicker.” So you may want to consider that, right?

I think platform engineering, obviously… I'm always saying like the platform builders, they want to build things. It's kind of their job to build these platforms, to run these platforms, right? But that doesn't mean they need to do everything from the ground up. You don't need to reinvent the wheel on everything.

So I think that's one of the things I would strongly consider – like a non-open-source route is not necessarily per se a bad route, right?

For example, like one of the reasons we,make vClusters so… like we always think about what is the path backwards for folks as well. We want to take that burden away from them to think about, “Oh, this is going to lock us in.” I think products need to be designed that way because if the product is great, people will stay. There should not be an artificial way to lock somebody in.

We're trying to ease some of that concern by things like, “Okay, you can migrate these data stores” and things like that, right? And you can always say we're not going to run embedded etcd anymore, run by separate etcd. If you're deciding against it at some point.

That's some of the things that I've seen where, sometimes, a doctrine almost is a little bit in the way of getting to the actual target faster.

C‍ory: 39:58

Yeah, for sure. I love that.

‍Awesome. Well, Lucas, thanks so much for coming on the show today. Where can people find you online?

Lukas: 40:05

Yeah, you can find me in the vCluster Slack channel - head to slack.vcluster.com, just hit me up, send me a DM. You can message me on LinkedIn. And I'm pretty much on all the other social channels like X and others, you name it.

Cory: 40:19

Awesome. Well, thanks so much for coming on the show today, and have a great day.

Episode 28

16th Apr 2025

vCluster with Lukas Gentele: Rethinking Kubernetes Multi-Tenancy

Transcript

Listen for free

About the Podcast