Smart TV Testing Made Simple

Episode Description

Testing smart TV applications presents unique challenges that traditional web testing approaches can't solve. Dave Lucia, CTO and co-founder of TV Labs, shares how his team built a platform that virtualizes televisions and set-top boxes to help media companies test their smart TV apps on physical devices.

Learn about TV Labs' innovative architecture and how they handle everything from camera-based testing systems to their custom Lua-based DSL for faster test execution. A key highlight is how choosing Elixir as their primary technology has enabled TV Labs to build a robust orchestration system. The language's built-in capabilities for fault tolerance, process isolation, and distributed computing make it particularly well-suited for managing concurrent connections and real-time state across multiple devices.

The discussion also explores practical insights about system architecture, including how TV Labs leverages Phoenix presence for real-time device state tracking and achieves microsecond-level performance for message broadcasting.

Episode Transcript

Hey everybody, welcome back to the Platform Engineering Podcast. I'm your host, Cory O'Daniel, and today I have one of my favorite people on the planet with me, Dave Lucia, CTO and co-founder of TV Labs. Dave, welcome to the show.

Cory, it is a pleasure. I missed you. Thank you for having me on the show.

I missed you. I miss you so much. So Dave and I… Did we know each other at all before we actually met in real life? Have we been bullshitting with each other on Twitter?

 I don't think so. 

That was a serendipitous meetup. Is that what that was?

It wasn't so serendipitous. Todd Resudek, a mutual friend of ours, was like, “Dave, I feel like you would really like Cory and vice versa. You guys should hang out.” And then I think we went to the bar.

I think we were both depressed or confused about what to do about life. So Dave and I met on the day that the pandemic was declared. And we were both sitting at a conference where I think we’d just like shaken hands with or hugged people that we knew from all over the world. And then the news is like, “Y'all are going to die.”

And I think even at my talk, I like made fun of COVID because they hadn't announced it was a pandemic yet. Or I think I made people, like, hug each other, high-five, like, in the crowd. And then I remember meeting you and you're like, “Do you like Mezcal?” And I was like, “I do like Mezcal.” And then we went to the bar. And then I think we both got COVID.

I mean, I know I did, at least.

I couldn't place anything for the next three weeks. It was in San Francisco. And that was when, while we were there, that cruise ship that had all the people on it was stationed in the bay because they weren't allowed onto land. And they were just… everyone was locked onto that cruise ship. That happened while we were there.

I was like, “I don’t know what to do.” My wife called. She was like, “Yo, you can't get on an airplane.” So I rented a car to drive home from San Francisco.

And then I was like sitting outside the car place. And I'm like, “There's no fucking way I'm gonna drive a car back to LA.” So I switched to the airport. I'm like, “Whatever. I'll die. I'll die before driving this Yugo back to LA.”

Oh, man. Well, I'm excited to have you on the show. We've been talking about having Dave on for a while. And I don't know, children and time zones. And what else can we blame it on? 

I won't go there. I won’t go there on a podcast. Okay.  

The TV Labs Platform

Yeah, we won't talk about that on the pod. But TV Labs is pretty cool. So why don't you tell us what TV Labs is? Because it's pretty neat. And then I want to know... Yeah, let's start there. And then I got a whole bunch of questions. 

Okay, TV Labs. We are a company that virtualizes televisions, set-top boxes, and other connected devices in our device lab. And we work with media companies who build smart TV apps and help them test their smart TV apps on physical devices virtualized through our platform.

So you can think of us as like a BrowserStack, but specifically focused on the television and set-top box category of devices. And we have a few different products where we allow you to directly pick a type of device, we'll match you with a physical one, and then initiate a connection where you can then interact with that device as if it's right in front of you. We port forward everything from the device onto your machine, so you can use device-specific SDKs and APIs.

Then we have an automation product where you can programmatically write tests, hook it up with your CI/CD, and then run it on physical devices on TV Labs. So that's what we are in a nutshell.

That's very cool. You said BrowserStack – for people that aren't familiar with BrowserStack, effectively it's almost like Selenium, but where you're driving TVs instead of, like, a ChromeDriver or something.

Okay. So WebDriver is something that you can use to drive TVs, and typically it's used through something called Appium. We do support that. So we have what we call our Appium proxy, which allows you to write Appium-based tests and run them against our physical hardware.

We also have vision-based testing, where we allow you to visually navigate applications as a black box. This is useful if you're in QA or you're trying to write high-level integration tests to make sure that an app is performing, but you don't actually control the internals of the app, and so you don't want to encode very in-the-weeds knowledge about the applications you're testing.

We support both styles of development. WebDriver is more kind of like a communication protocol for a device.

Testing Approaches and Architecture

So like pre-TV Labs, how would people test, you know, apps that are going to run on an Apple TV or a Fire Stick or whatever?

A lot of it is you have a physical device in your office, your device lab, maybe one at home, and you'll connect to it over your local network and run an Appium WebDriver test. Something like that. Or it's manual. You're fully doing it manually. There's different styles of it. 

There's other companies out there that will do a similar style of virtualizing the televisions, but we kind of take it to the next level by really automating everything around the device life cycle of turning on, turning off and everything in between to make the whole process really seamless and easy so that you could just get onto the devices you need and don't really have to care about the maintenance. 

The maintenance of these devices is really hard and complicated because televisions, they break all the time. They get into weird states. You have to turn them on and off. Sometimes you can't pair with them in development mode. Or they're just old and don't support like newer styles of communication. 

Our goal is to be able to support the latest and greatest TVs, which actually tend not to be the problem ones. The problem ones tend to be the devices that are 10 years old that are still on the market and still have like 4% or 5% or 6% market share that, you know, just enough of your customer base is pissed off that your app is not working. 

Okay, so let me ask you a question. This one's, I don't know… 

Oh boy.

I don't know if you have the answer to this question. This is a difficult one. Why is it that when you look at companies like Apple and Apple TV, Netflix, Prime – I'd say these are some of the biggest companies. I mean, they are in the FAANG, right? They got their letters in there, right? – do any of these motherfuckers have a single Redis instance running anywhere? Because none of them remember where I am in the episode.

They don't remember what I'm watching. I can't even figure out what streaming service I was watching the thing on. And when I finally figured it out, they have no idea what episode I was on last time. What is going on over there? Do you got any insights into these people? Did I just put you on the spot? 

Well, you put me on the spot. The one that infuriates me the most, I'm not going to name-drop them. But one of the biggest companies out there, media companies, their app does the worst job it could possibly do of like serving me the show that I want to watch. So I will watch three or four different series on this app.

And I have to go through multiple menus to find the thing that I know that I'm looking for. It doesn't make any sense. So Redis missing sure, but just like basic UX is also a problem in this domain. And I don't know why. 

I don't know, I feel like they're trying to piss people off. They're like, we want… I feel like they have some vested stake in cable and they're like, “We've got to drive people back to cable.” Like, we were trying to watch White Lotus last night, and I was like, “I'm gonna lose my fucking mind.” I don't remember what it's on. I'm asking Apple TV, I'm like, “Show me White Lotus,” and it's like, “White Lotus on Prime,” and I'm like, “Sure, I got Prime,” and it was like, give us money. I'm like, “Give you money? I already give you money for Prime. What are you talking about?” I was like, “I've been watching this shit for free for like three weeks.” It's on HBO.

Well, this is why we exist. We're here to help these companies improve their product through testing to make them more reliable. And maybe through hosted Redis.

I will throw a free Redis… Apple, if you guys need a Redis come to me, I will give you a free Redis instance just so you can remember where I am in the show. 

Okay, so behind the scenes, you actually have physical TVs. I'm assuming – what we can't see in your office is that they're all just in your office, you've got 14 of them all over the walls?

They're on the ceiling. 

So you actually… where are these things? Like you just have a bunch of TVs that you guys actually own. Is there just like a huge office with TVs all over the walls? Like, what is this?

Yeah. So our device lab is exactly that. It's a massive amount of televisions and set-top boxes, gaming consoles, that are all hooked up to our proprietary system that includes… actually, we've got our own, like, printed circuit boards [holds one up] and a bunch of hardware that are connected to these devices. And a compute running with each one of these devices that connects up to our web platform. And our web platform is kind of like the coordinator of… a user wants to connect with an Apple TV from 2020, we pair them with a physical device, and then a WebRTC connection is initiated between the physical device and the user. And then all communication happens through there. Any events that are emitted get emitted back up to our web platform and processed and digitized, all through that platform.

The office itself is physical hardware with a bunch of stuff hooked up to these TVs to be able to communicate with them in any way possible. But also to be able to manage them, so that we can, like, remotely power cycle them, so that we can shape the network traffic, even pair two devices together so that you're able to do different styles of testing.

All of that is kind of like a pairing of our hardware and our software working together.

That is pretty cool. When do I get a tour of this thing? 

Next time you're in New York City, you let me know.

Okay. I must spend some of those VC bucks on a ticket, I want to see this thing.

I mean, I remember you guys when you first started talking about this a few years back and I was like, “Oh, this is going to be pretty rad.” I've always been jonesing to see what this actually looks like at scale, like physically.

So is it just like desks with TVs everywhere? Are you racking TVs? Are they actually mounted on walls? Like how do you guys organize all this? And what is your cable management game like? 

Cable management game – there's still room for improvement. We're on, I think, our fourth or fifth iteration of how we house these boxes or these televisions.

It depends on the style of device. We support televisions. We support set-top boxes, mobile devices, gaming consoles, and all of these have different needs in terms of how we need to house them to communicate with them.

Set-top boxes – like an Apple TV is what I'm talking about – we're just able to attach an HDMI capture card, and then all the communication through our PCB is hooked up to the device. And that's it. That's all we need.

It gets more complicated when it's a television because televisions don't have a high-quality HDMI out that we could just plug into. So we actually use a camera system and we point a camera at the device. And then we do a bunch of processing on device to transform that camera image into something that looks flat and perfectly rectangular and HD and is color-corrected before it comes back to you. 

Oh, that's pretty cool. 

It is pretty cool. One of our hardest challenges early on was building that entire pipeline. And actually recently we just rewrote the whole thing. Like I got a prop for everything [holds up a “REWRITE IT IN RUST” print], we rewrote it in Rust. 

Aaaaaaagh.

The way that we house televisions, we built a box. This box has your standard VESA mount mounted inside of it. And then there's cameras pointed directly at that device. And then all the ways to communicate with it, whether it's IRCC or, you know, vendor-specific SDKs, all happen within that hardware. And then there's also cooling involved, there's light separation and even separation for IR and things like that, all kind of accounted for within the hardware.

Our current boxes are kind of like big, ugly and bulky. We actually went from like Erector Set style. Then we got like really pretty and really nice looking, like sleek black boxes. To then like flight case boxes that look like something that you're touring with in a band and you threw your trombone in – because you're obviously in a ska band. And then now we're moving kind of back towards that Erector Set style. It's been an iteration based on our needs and kind of like where we've seen issues and how we plan to scale our physical footprint.

Yeah, that is pretty cool. I remember when you first started talking about the cameras, like how many cameras are on the screen at once? 

Right now, four.

Four. Okay. And so like when you look at the raw feed, you probably see like a little bit of the monitor, it's probably got like curvature. And so all of that is corrected on device? Or corrected like when it's sent back?

So four cameras are connected into our compute that sits next to the box. Those come in, they go through V4L2, you know, the Linux camera drivers. They then get processed through what we call our stitcher, which takes those four cameras, that are overlapping and have like fisheye distortion and light saturation, and those get stitched together and then flattened out and cropped so that it looks like a perfect rectangle. That uses a bunch of computer vision techniques and various types of math in order to do that. It's called homography.

It goes through an undistortion layer, a color correction layer, and then that moves to the video coding part, either going to WebRTC or going to an HLS sink, and then routed to the user in whatever way.
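To make the ordering concrete, here is a conceptual sketch of those stages as an Elixir pipeline. The stage names are invented for illustration, and the real pipeline is Rust and GStreamer, not Elixir; this only encodes the sequence Dave describes.

```elixir
defmodule TVLabsSketch.Pipeline do
  # frames: four overlapping camera captures of the same TV panel
  def process(frames) do
    frames
    |> stitch()            # homography: merge the four overlapping views
    |> flatten_and_crop()  # produce one clean, perfect rectangle
    |> undistort()         # undistortion layer
    |> color_correct()     # color correction layer
    |> encode()            # video coding, then a WebRTC or HLS sink
  end

  # Stubs standing in for the real Rust/GStreamer elements.
  defp stitch(frames), do: hd(frames)
  defp flatten_and_crop(frame), do: frame
  defp undistort(frame), do: frame
  defp color_correct(frame), do: frame
  defp encode(frame), do: frame
end
```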

So I'm trying to like, I've obviously never worked on…

video processing. 

Yeah, well, actually, I've been around.

I've kicked off FFmpeg before.

I was going to make a joke that is just… is too just between us and it would have made no sense to anybody else, but I'll skip that joke until we turn off the camera. 

What is the testing experience like? Is it like BDD-style testing where I'm like, “Okay, you know, given the screen is visible, I click on White Lotus, and I expect it to play”? Is that the QA type, like, based on the visual black-boxing? Is that what it's like? Or is it like with ChromeDriver and WebDriver? Like, are there some sort of selector equivalents that you can, like, tap into?

So yes, is the answer. Yes. And yeah, all of it. 

All of it.

Styles of Testing at TV Labs

I think of those as like different layers of abstraction. And we kind of have a fork in the road where we support two like fundamentally different styles of testing. Like the white-box versus black-box, I don't even know if that's the right way of describing it.

One is where you have no idea of the internals of the system. And that's our vision-based testing. And so for that, it's very much… we rely on visual cues. So you take screenshots of your application. And then we have a drag-and-drop workflow editor and annotation designer, where you take these screenshots, and then you annotate them and you say, you know, this logo placement and this text on the screen probably means I'm on the home screen. And here's a bunch of other variations that might mean I'm on the home screen in this particular state.

You can use color, you can use recognition of an image, you can use motion on the screen. Or we have a bunch of different LLM integrations where you could say, “Hey, I'm on this screen; given the element in focus, navigate to this other place,” and it will generate the series of commands to navigate from A to B. All this is black-box vision-based testing.
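To give a feel for what one of these annotations might look like as data, here is a hypothetical sketch in Elixir. Every field name here is invented; TV Labs' actual workflow format isn't shown in the episode.

```elixir
# One "screen" definition: visual cues that, together, mean "I'm on the
# home screen", plus named variants for other valid states of that screen.
home_screen = %{
  name: "home",
  cues: [
    %{type: :image, match: "logo.png", region: {0, 0, 320, 120}},
    %{type: :text, match: "Continue Watching"},
    %{type: :motion, region: :hero_carousel}
  ],
  variants: ["home_signed_out", "home_with_promo_banner"]
}
```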

The other route is I know the internals of the system, I'm using an Appium or a WebDriver under the hood, and it's more selector-based, HTML-based, DOM-based navigation, where it's, you know, click on this particular element on the screen, wait for this event, wait for this element to pop up, assert this, assert that.

Those two paths allow different kinds of capabilities. The BDD and all of those other things that you're talking about are all layered on top of the Appium-style testing. Whereas with the vision-based testing, we can collect a competitive analysis report as one of our products, where we actually take all the top apps in the app store, and we collect different metrics on them and publish. We collect hundreds of these reports on a ton of different devices, across all the top apps… and I think we have like 12 metrics that we're recording right now. We run this report monthly, and this all uses our vision-based system, because we don't have access to the internals of those apps to be able to do that style of testing. So with that vision-based testing, we unlock things that you can't do with the Appium and other styles of testing.

That's pretty neat. So you got these TVs, you got the back end. Like, when somebody makes a request, like, let's say they're queuing a test from CI/CD. I mean, I'm assuming you only do one… like one test is taking a TV, right? 

That's right. 

So if I got a test suite of, you know, 100 tests that are going to run on this TV, like that TV is mine for the next…

40 minutes.

Yeah, 40 minutes. Okay, okay. They don't have those fast tests?

They’ve got a lot of slow tests. 

Oh, man.

There's some fundamental issues with, like, the protocols being used too. Because they're kind of meant for… like, I have a device farm and… like Appium and WebDriver specifically, they make you run a request from wherever the request is being made in your CI down to the device, round-tripping every time, as opposed to, like, sending down a script and executing it on the device itself. And because of this, it just adds so much time to the cycle time of a particular test.

With our vision-based system, we actually wrote our own DSL in Lua. And we just send down a full Lua script and execute that on device. And so we're actually able to test much faster because of it. 

So there's kind of optimizations that we have in mind for how to improve even the Appium style of things with hooking into certain features there. But there's just a fundamental difference in the architecture of these technologies.
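Here's a rough sketch of that “send the whole script down” idea over a Phoenix channel. The module, topic, and event names are hypothetical, but `push/3` and `handle_in/3` are standard `Phoenix.Channel` APIs; the point is one message down and one result back, instead of a round trip per command.

```elixir
defmodule TVLabsSketch.ScriptChannel do
  use Phoenix.Channel

  # Each device's sidecar joins its own topic over the WebSocket.
  def join("device:" <> _device_id, _params, socket), do: {:ok, socket}

  # Platform code sends {:run_script, lua} to this channel process; we push
  # the entire Lua script down in one message instead of one round trip
  # per WebDriver-style command.
  def handle_info({:run_script, lua_source}, socket) do
    push(socket, "run_script", %{script: lua_source})
    {:noreply, socket}
  end

  # The device executes locally and replies once with the full result set.
  def handle_in("script_result", %{"results" => results}, socket) do
    IO.inspect(results, label: "test results")
    {:noreply, socket}
  end
end
```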

OK, so I got my TV. I got it for 40 minutes. And so effectively, like, that's mine. 

So when you guys have to scale, like, let's say, OK, we've got a new customer. We need to bring in X amount of new units or whatever. What is that like with your box system? Like, is it just popping a TV in there and, as far as provisioning is concerned, like you're done? Or do you have to kind of like go in and instrument for like the exact dimensions of the TV or like some sort of color profile in the TV? Or is that all kind of handled “in post”? Can I say “in post”?

In post, sure.

That felt cool. 

So the device provisioning – our system is designed to be able to handle kind of any manufacturer's TV, but optimized around 55 inches, since nearly every TV line has a 55-inch model. Our boxes are mostly made to handle that 55-inch variety. But then when it's time to – “Hey, we have a new customer, they have a specific need for this.” – we're also monitoring capacity. So maybe we're getting a lot more usage on LG, and so we want to scale up our LG fleet, and we're not getting so much usage on our, I don't know, our TCL fleet, so we scale down our TCLs.

We can just take the device out, pop a new device in. We reconfigure it in our web platform, which then sends the configuration down to the device. So the device gets bootstrapped by our web platform. And then there is a series of calibration mechanisms that can run to calibrate the cameras with that particular device. So it'll run through a series of different screens. It will calibrate off of the images… a known image, and then the device will just be ready to be used. And that could be done within minutes. It's very fast.

And so all of this has been rewritten in Rust? No, it's not all been rewritten in Rust.

No, I would never rewrite all that in Rust.

You gotta rewrite everything in Rust. 

So there's your back-end system, and then there's the on-device system. That's what's written in Rust, yeah?

No, only a very specific part of it is written in Rust. So for the most part, our system is written in Elixir. We've got some Go, we've got a bunch of Python, we've got Lua.

I didn't know you were a forward-thinking CTO with that Golang in there.

Oh, our CLI. Why would you write a CLI in anything but Go?

The Rust is limited to our video pipeline itself. It is using this framework called GStreamer, which is actually written in C, but has a whole Rust abstraction around it. That and some of the plugins are the only Rust in our code base.

Okay, are you using like Rustler for that, or is that like an independent running service?

It is using Rustler. We went through multiple different iterations of how this works. I actually have an idea for another iteration, because with C under the hood, deadlocks and memory corruption are a real problem. But it is actually implemented as a native implemented function (NIF) in Elixir that has the Rust embedded inside of it via Rustler, which is a little crazy.
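For reference, the Elixir side of a Rustler NIF is tiny. This is a generic example with made-up module, app, and crate names, not TV Labs' code; the stub body is replaced by the native Rust function when the NIF loads.

```elixir
defmodule TVLabsSketch.StitcherNative do
  # Rustler compiles the Rust crate at build time and loads it as a NIF.
  # :tvlabs_sketch and "stitcher" are placeholder otp_app / crate names.
  use Rustler, otp_app: :tvlabs_sketch, crate: "stitcher"

  # This Elixir stub is swapped out for the native Rust implementation at
  # load time; it only raises if loading the NIF failed.
  def stitch_frames(_frames), do: :erlang.nif_error(:nif_not_loaded)
end
```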

Okay. So how do you have that architected? Is that all a part of like one Elixir application? Or is that an independent service and you're making some sort of web service calls to it? Or is that all just…

Well, let's talk architecture. 

I know, I wanna see how it's done before I let you unveil the Elixirness of it, which I'd love to get into. But yeah, let's talk architecture, baby.

The web platform. Single monolithic platform. We call it Sauron, because it's the eye of Sauron. It looks down on all the devices to manage it. 

The software that's running on our device is… actually, there's nearly a dozen services running on each device, all in Docker. But the main coordination service is what we call the sidecar, also written in Elixir. And that makes a WebSocket connection up to our platform.

Okay. 

That WebSocket connection is where all the signaling happens between the device and the web platform. So when you want to initiate a connection – I want an LG C2 – we say, “Great.” That makes a request, our matchmaking system (which we call the demand system) will then find the first available device for your request and initiate a session, which basically means there's a transaction that establishes the connection and warms up the device. And then depending on what you're doing… If it's an access session, it's going to establish a WebRTC connection. If it's an automation session using that vision-based system, it's going to send down a Lua script and execute the Lua script on the device. Or if it's an Appium session, it's going to run through our Appium proxy and then send down messages over the WebSocket to the device directly.

So it'll basically take an HTTP request coming from your CI through the proxy, turn that into a payload that gets sent down the WebSocket, and then reply via HTTP.
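Here is a minimal, in-memory sketch of that demand/matchmaking idea as a GenServer: find the first available device matching a request and reserve it for the session. All names and the state shape are invented for illustration, not TV Labs' actual code.

```elixir
defmodule TVLabsSketch.Demand do
  use GenServer

  # State shape: %{device_id => %{model: "LG C2", status: :available | :busy}}
  def start_link(devices), do: GenServer.start_link(__MODULE__, devices, name: __MODULE__)
  def request(model), do: GenServer.call(__MODULE__, {:request, model})
  def release(id), do: GenServer.cast(__MODULE__, {:release, id})

  @impl true
  def init(devices), do: {:ok, devices}

  @impl true
  def handle_call({:request, model}, _from, devices) do
    case Enum.find(devices, fn {_id, d} -> d.status == :available and d.model == model end) do
      nil ->
        {:reply, {:error, :no_capacity}, devices}

      {id, device} ->
        # Reserve the first match for this session.
        {:reply, {:ok, id}, Map.put(devices, id, %{device | status: :busy})}
    end
  end

  @impl true
  def handle_cast({:release, id}, devices) do
    {:noreply, Map.update!(devices, id, &%{&1 | status: :available})}
  end
end

# {:ok, _} = TVLabsSketch.Demand.start_link(%{"tv-1" => %{model: "LG C2", status: :available}})
# TVLabsSketch.Demand.request("LG C2") #=> {:ok, "tv-1"}
```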

Okay. So essentially you got this core system, this monolith, which by the way – love it.

I love it too. 

It's a great name. 

And it's a monorepo.

Oh, oh, I gotta go. 

Oh, you're not a fan of monorepos. 

No, dude, you gotta have as many git repos as possible. 

Oh, I didn't say we didn't have as many git repos as possible. Well, maybe not as possible, but we still have several dozen repos. 

You’re like, “Look, all of our code is in one repo. We just created 300 other repos that are empty, just for fun.”

You need as many git repos as possible so you can like fully lean into like your service naming convention. Like if you do a monorepo, you get one cool name and then you’re done.

We don't even have a cool name. 

You said… 

No, no, that's a service. Our monorepo is called “platform”. 

Oh. Not cool.

Not cool. But you know what it is, it's the platform. 

It's the platform. It's where all the stuff is that you use. 

Why Elixir?

I know that you are a person that is a fan of Elixir, let's say. You like it a tad. Your background was C#, right? That's where you were before?

I was doing JavaScript over at Bloomberg.

It was JavaScript. For some reason…

Yeah, well, C++, then JavaScript.

Okay, oh, yeah, yeah. I was, sorry, I didn't, I was doing the C++++ thing. I didn't realize you C++'d to JavaScript land.

Yeah.

TV Labs isn't your first foray into Elixir. You were previously at SimpleBet. You did Elixir there as well. Like, what was the appeal of Elixir to you initially? And then from there, I'd love to know what made you think, like, this is the right language and system for designing TV Labs?

In college, as part of a computer science curriculum, we took a programming language course, which got me into Haskell and functional programming. So, that was, like, my first taste of functional programming. Then I forgot about it because I got into Bloomberg. They, you know, indoctrinate you with C++ and Fortran. And actually JavaScript is a core part of the Bloomberg technology. They were doing server-side JavaScript before Node was even a thing.

Ooh.

There's some talks about that out there, so that's public knowledge. 

I was working on Bloomberg's trading platform for the first half of my stint at Bloomberg. And then when I moved over to web, we were doing a bunch of cool stuff on Bloomberg.com. We completely re-platformed the whole thing. We were doing… we wrote our own single-page app framework – this is kind of like before React was popular, or really, it came out while we were building it – called Bloomberg Brisket, by the way, which is still a great name.

So, we were doing all of that, and, I mean, JavaScript was just fine, but I was not excited about it. I also felt that it was just fundamentally limiting in terms of performance with the whole event loop model, and wasn't particularly excited about the platform. 

And so I got interested in two functional languages that were kind of getting a little bit of attention at the time. One was Elm, which I was playing around with quite a bit, which solved the front-end programming side of things. But on the back-end side of things, I got really interested in Elixir, because it was my first introduction to pattern matching, which is a language construct that is now actually in a lot of languages, but at the time was kind of coming onto the scene as the new hotness.

It is the new… It's always the hotness. 

It always will be. Yeah.

Iconic.

So I started building some chatbots and some cool stuff with Elixir, and there was just a turning point where I was like, this is all I want to do. So this guy, Josh Topolsky, who was one of the founders of The Verge and had come on to Bloomberg and helped us relaunch Bloomberg Business Week and Bloomberg.com, he ended up leaving. I think he famously told Mike Bloomberg to go fuck himself in a meeting, and I think that's how he got fired – Sorry, Josh, if that's not the true story. 

But anyways, he left. He started his own media company called The Outline, and it just so happened that they were looking for an Elixir founding engineer, because their CTO (who's now a great friend of mine) Ivar Vong, he was a Ruby guy. I think you've met Ivar. 

Oh yeah.

He wanted to build The Outline in Elixir, and so I was like, “This is it, I'm doing it.” Went over to The Outline, and kind of the rest was history.

So The Outline was my first startup; I went to SimpleBet after that. I was doing sports betting and real-time odds creation in Elixir – which worked extremely well. Did another media company for about a year, and now TV Labs.

So that's kind of like how I got into Elixir. But why for TV Labs? Elixir is just, I think, the best orchestration language out there. If you need to manage a bunch of concurrent things, you need to have a lot of data moving around, you want things to be isolated and be able to send messages in between, you have a bunch of servers located all around the globe and need stateful processes to be able to coordinate between them – Elixir is your go-to.

That's exactly the problem we have at TV Labs. Managing a lot of state, a lot of things concurrently, signaling all these problems. Elixir just maps onto this problem really, really well, and I find that to be the case for a number of different problems. So Elixir is that hammer. It's a solution looking for a problem, and it has a lot of problems where it's a good fit. 

I'm a big fan as well, folks. I think everyone knows that. But it's not just a language…  it is more than a language, right? 

For the people that aren't familiar with Elixir, can you tell us a bit about the Beam and what it provides over just a standard, functional programming language? 

Yeah, okay, so there's a lot of things that are unique about Elixir, but really when we're talking about the unique things of Elixir, I'm gonna focus more on the underlying platform, which is the Beam. The Beam VM, which is the Erlang VM.

So Erlang was created in the late 80s at Ericsson, which is a telecommunications company, and open-sourced in, I think, the late 90s. Their problem was writing telephone switches and making sure that they had really high uptime and that they weren't dropping calls. And through a series of experiments and iterations, they arrived at a platform that was process-based. So everything in the Beam VM is run in the context of a tiny, lightweight, green-thread-type process.

So on a single machine, you can have millions of processes. They're, under the hood, scheduled by usually a scheduler per core. That's basically pulling down a process, churning through a bunch of machine code instructions, and then dumping it back to the end of the queue and going to the next one. And so this gives it the property of balancing work really well.
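A quick demonstration of how cheap these processes are, runnable in `iex` (going to actual millions on one node requires raising the VM's `+P` process limit; the default is lower):

```elixir
# Each spawned function becomes a lightweight Beam process, scheduled
# preemptively across cores. 100k is routine on a laptop.
pids =
  for i <- 1..100_000 do
    spawn(fn ->
      receive do
        {:ping, from} -> send(from, {:pong, i})
      end
    end)
  end

# Round-trip a message through a random one of them.
send(Enum.random(pids), {:ping, self()})

receive do
  {:pong, n} -> IO.puts("process #{n} replied")
end
```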

So going back to my world in Node.js, you'd accidentally write some shitty code where it was blocking the event loop, like you forgot to do a promise, because we didn't even have promises at one point. Everything was just shitty callbacks. 

Oh, baby.

And then we had Bluebird, which was the first promise library. 

Anyways, in Node, you can accidentally write synchronous code that's waiting for IO and the whole machine, like the whole process is just deadlocked, waiting for that. This can’t happen in Elixir, because processes, if they're doing IO, they just get preempted and go to the back of the queue. And so everything can be consuming resources concurrently and not negatively impacting each other. There's caveats to this, don't get me wrong, but that's the basis of the system.

It's also based on this concept of fault tolerance, where processes can be observed and linked together. So you can really gracefully handle failure. The ethos of the Beam world is let it crash.

So rather than worrying about handling every edge case, you just say, “Hey, let it crash. The system will reboot itself.” And it can reboot itself at arbitrary levels. So you just stick a supervisor in, you put your processes under the supervision of the supervisor, and then if they fail, it will restart them. Or you can even set rules, like if this process fails, also kill all of the sibling processes and restart them all together as a unit. So that fault tolerance is a huge piece.
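A minimal runnable example of that supervision idea, using Agents as stand-in workers. `:one_for_all` is the “kill the siblings and restart them as a unit” rule Dave mentions; `:one_for_one` would restart only the crashed child.

```elixir
# Two Agent workers under one supervisor. With :one_for_all, killing either
# child takes down the other and restarts both together as a unit.
children = [
  %{id: :cache, start: {Agent, :start_link, [fn -> %{} end, [name: :cache]]}},
  %{id: :stats, start: {Agent, :start_link, [fn -> 0 end, [name: :stats]]}}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_all)

# "Let it crash": kill one worker...
Process.exit(Process.whereis(:cache), :kill)
Process.sleep(100)

# ...and the supervisor has already replaced both with fresh processes.
true = is_pid(Process.whereis(:cache))
true = is_pid(Process.whereis(:stats))
```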

You then get functional programming and message passing, which is really interesting and gives you nice isolation properties. But then what I think is also really unique about the Beam VM is its clustering and runtime capabilities, where you can have many Erlang nodes that seamlessly mesh together, and then you can transparently communicate between processes.

So I can have a process in EWR and LAX – if we're talking about airport codes – and I can communicate between them. And from the perspective of the programmer, they could be exactly on the same box. I communicate with them nearly exactly the same way. 
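In code, that cross-node transparency looks like this. The node names and the registered `:device_sidecar` process are assumptions for the sketch, but `Node.connect/1`, `{name, node}` sends, and `Node.spawn/2` are standard.

```elixir
# Assumes two named nodes sharing a cookie, e.g. started with:
#   iex --name ewr@host1 --cookie secret
#   iex --name lax@host2 --cookie secret
Node.connect(:"lax@host2")   # the nodes mesh; returns true

# A process registered locally on the LAX node can be messaged with a
# {name, node} tuple, exactly like a local send:
send({:device_sidecar, :"lax@host2"}, {:power_cycle, "tv-42"})

# Or run a closure on the remote node directly:
Node.spawn(:"lax@host2", fn ->
  IO.puts("hello from #{Node.self()}")
end)
```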

So from all of these different underlying properties, you get a lot of really robust tools for building distributed systems in a highly available fault-tolerant way. 

So, a ton of fault tolerance there – like, a really interesting system. I've gotten to use this quite a few times over my years as well.

And it's interesting – one of the things that I've always personally struggled with is that I am at the same time an Elixir fanboy and I am also a Kubernetes fanboy. And like, honestly, a lot of times when I'm designing systems, I have a problem figuring out, when using both, where to draw the lines on my fault tolerance and scalability. Because I know I can do this at the Kubernetes level, but also… like, the language and VM itself also does this. Which is pretty wild, right? And I would say… I don't know how much you agree, but I honestly feel like it's probably one of the most, like, DevOps-oriented languages you could get.

You can take something like Kubernetes and make, you know, a slow language or a language that's not quite as powerful as Elixir very fault tolerant, very resilient. You can take Ruby, Node, whatever, throw it in there. But you can go really far on Elixir with just some VMs and not much beyond that, right? Like, the language even has the ability to dynamically upgrade, right? Without taking downtime, right? Which is pretty wild.

I mean, it's painful to do sometimes. I don't know if you guys are doing it. I've done it in the past and I was like, hey, you know what? 

I don't do that. 

I'll just ship a new version and shut the other one… Yeah, it's a pain in the ass. I think it's a pain in the ass.

But it's wild. Like the level of effort that went into designing this system, because the phone system can't go down. Like if we're on a phone call and somebody wants to upgrade the phone system, like you can't hang up on everybody. 

Yeah. 

That's what you're building on – this language that was designed to withstand pretty much any interruption, right? So you're using it. It's really important for you guys. You've got a ton of WebSocket connections.

Was that your main go-to – like, okay, it's just the WebSocket connections? Or was there something about the ability to cluster those nodes that you're leaning into? Like, what other technical architecture decisions went into picking Elixir?

Well, a big piece of it is that the WebSocket connections allow us to do real-time communication really easily. So for example, when a device connects up to our web platform and establishes that WebSocket connection, we then use a feature of Phoenix, the web framework for Elixir, and we use its presence features to announce, “This device is connected and it is in an available state.” And then that information is broadcast around the cluster via CRDTs, so that no matter what node you're on, you can read the state of that device, show it in a UI or whatever, without having to directly go to that device and communicate with it. Or have a bunch of really complicated rules around device tracking.

So there's features like that, like Phoenix Presence, that give us some kind of unique, right-out-of-the-box capabilities to do real-time state propagation.
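A hedged sketch of the Presence flow Dave describes: a device channel tracks itself with metadata on join, and any node in the cluster can list it. Module names are hypothetical, and the Presence module is assumed to be started in the app's supervision tree.

```elixir
# Presence module backed by the app's PubSub (assumed to be supervised).
defmodule TVLabsSketch.Presence do
  use Phoenix.Presence,
    otp_app: :tvlabs_sketch,
    pubsub_server: TVLabsSketch.PubSub
end

defmodule TVLabsSketch.DeviceChannel do
  use Phoenix.Channel
  alias TVLabsSketch.Presence

  def join("devices:lobby", %{"device_id" => id}, socket) do
    send(self(), {:track, id})
    {:ok, socket}
  end

  # Announce "this device is connected and available". Presence replicates
  # this state around the cluster via CRDTs.
  def handle_info({:track, id}, socket) do
    {:ok, _ref} = Presence.track(socket, id, %{state: :available})
    {:noreply, socket}
  end
end

# From any node in the cluster:
# TVLabsSketch.Presence.list("devices:lobby")
# => %{"tv-42" => %{metas: [%{state: :available, phx_ref: ...}]}}
```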

Another is the broadcasting. So, you know, you love Redis; you can have a Redis instance and you can do PubSub over Redis. But with Elixir, if you have your nodes clustered together, I can just do a Phoenix PubSub broadcast, and then every UI that might care about some piece of information… Let's just say I'm building a dashboard for showing live sessions for my org. I broadcast the message “New session started,” all the UIs can just be listening for that message, push the update over Phoenix LiveView, which also uses WebSockets, and boom, I have a real live-updating UI with like two lines of code.
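That dashboard scenario, sketched with Phoenix PubSub and LiveView. The names are illustrative; roughly, the “two lines” are the subscribe and the `handle_info`.

```elixir
defmodule TVLabsSketch.SessionsLive do
  use Phoenix.LiveView

  def mount(_params, _session, socket) do
    # Line one: subscribe to the org's session topic.
    if connected?(socket) do
      Phoenix.PubSub.subscribe(TVLabsSketch.PubSub, "org:42:sessions")
    end

    {:ok, assign(socket, sessions: [])}
  end

  # Line two: every subscribed LiveView, on any clustered node, receives the
  # broadcast and pushes the diff to the browser over its own WebSocket.
  def handle_info({:session_started, session}, socket) do
    {:noreply, update(socket, :sessions, &[session | &1])}
  end

  def render(assigns) do
    ~H"""
    <ul>
      <%= for s <- @sessions do %>
        <li><%= s.device %> started by <%= s.user %></li>
      <% end %>
    </ul>
    """
  end
end

# Wherever a session starts:
# Phoenix.PubSub.broadcast(TVLabsSketch.PubSub, "org:42:sessions",
#   {:session_started, %{device: "LG C2", user: "cory"}})
```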

So because we're doing so much real-time stuff, all of these capabilities that are really easy with the Beam VM because of the clustering capabilities and the message passing capabilities made it just like a really obvious fit.

The alternative, you know – I can do it in other languages and on other platforms using Redis and message queues and things like that. We also still use those technologies, don't get me wrong, but we get a lot of out-of-the-box features with Elixir without adding any new dependencies.

Yeah, the presence stuff is pretty rad. We use that in Massdriver, like when somebody is interacting with the canvas.

So for anybody who's not familiar, my day job outside the podcast is working on this infrastructure as code platform, and you can diagram your infrastructure. So when you're diagramming that, we actually show that to anybody else who's looking at the diagram. It's rad because like I can add, you know, a new Terraform module or whatever, and everybody else sees it. That's great. That's what we want.

But for the backend, it's not this complicated architecture where we're trying to like store that this happened someplace, find everybody that's listening to it. It hits essentially our GraphQL endpoint. And anybody who's in that channel, which is effectively the canvas, we're able to just dispatch that event to everybody almost instantly. It hits everybody at the same time. 

The response rate of this popping up on your canvas… like, if you have two of them next to each other, it is insane. And there's a ton of really awesome demos out there showing how fast Phoenix Presence is.

Did you see the rainbow one where I was doing all the colors? That one was over the top. What was the OG one for the chat channel? Wasn't it like a million connections on like a T3 medium or something like that? Like a million…you remember that one? 

Yeah, there was like… “The road to 1 million Websocket connections.” I think was the name of the article. 

What's cool about this is, like, it's also very fast. So, to set expectations, the Beam is not the fastest runtime out there, because of the preemption and because of other considerations. It's never gonna compete on raw performance with something native. Like, Go is gonna smoke Elixir in terms of raw performance. However, when it comes to IO and all this message passing stuff, Phoenix PubSub runs in microseconds.

If you're on the same node and you're broadcasting a message, all the processes on that node are gonna receive it within microseconds and then be able to update your UI. It is instantaneous, and can happen in like milliseconds in terms of getting it onto your screen.

I remember I… Sorry, this just triggered like a very specific memory. 

Oh my God. 

We had this service at… gosh, I think it was... I can't remember if it was when I was at DealScience or ClickTrips, because I've worked with like the same team everywhere. I just bring the same people out. I have engineers I like working with and I just bring them.

But a buddy of mine saw… like, we rolled out… Oh, it was at ClickTrips. We rolled out this Elixir service, and like the service had been taking like milliseconds before, and all of a sudden there was the little µ. And I remember just being like, “What the fuck is this character?” Because all of a sudden it was just insanely fast. But then when we were looking at metrics in Datadog, there was like the little µ… I don't even know what the character's called. He's like, “What the fuck is… what does this mean?” I'm like, “It means it's extremely fast is what it means.”

I mean, it is very, very fast. We had that service… I wrote a blog post on this, “From $erverless to Elixir”. It was a system that we designed in Lambda, and it was a great system. It was also kind of a testing and uptime system for a bunch of customers that we had in the travel space. And we were processing just tons of gigs of data a second. And we moved from Lambdas, which were costing us a fortune, to Elixir.

The entire system just… I don't know, not 10X, the opposite of 10X – one-tenth? Like, the response times were just insane once we cut over to Elixir. But then the other thing was, we went from a system that was costing us $90,000 all in… I think in the article I say like $30,000, but that's just our HTTP processing… to, I think, about $150 of spare Kubernetes compute. Like, it was just running in extra compute that was kind of sitting on those boxes anyway.

And it took us a matter of days to rewrite this service, and the cost savings were wild. But the user experience on the other side, for people that were running this tool that we had, it was just like, “This thing is just lightning fast. Like, what'd you guys do?” And I was like, “Used a language that was designed for lightning fast.”

Developer Experience and Debugging

I'm with you that Elixir is one of the best Ops languages out there. And something that comes up time and time again is where there's some sort of production issue, and it's like, “How do I replicate the state? Like, how do I know what's going on in the system?” And the Beam has really incredible introspection tools. 

So let's say that you're able to get SSH access onto a production box. Whether or not you can do that with compliance is another story.

75 people just died, as soon as you said that. You just murdered… 

Let's imagine this is even just local development, production off to the side – we're not even allowed to do that. But in local development, let's just say you were able to reproduce a production scenario. You didn't have, like, some logging in place, but you're like, “Shit, what is happening here?” I can go in, have a REPL into the running system, find the process that is having the issue, and then pull the state out of the process and see exactly what's going wrong. And then I can even change the code.

I can do this in a production system if I want… which would be very bad… but I can inject some code in and it will hot reload the code, keeping the state of the system for all processes on that node. And so I could use that to like hot patch a production system. Which not saying I've done before… but I've definitely done before.
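What that looks like from an IEx shell attached to a running node, reusing the hypothetical `Demand` module from the earlier sketch (substitute your own process). `:sys.get_state/1` and the `r` helper are standard; the inspected output is just the sketch's state.

```elixir
# In an IEx shell on the running node:
pid = Process.whereis(TVLabsSketch.Demand)

# Dump the live state of the process (works for GenServers and friends):
:sys.get_state(pid)
#=> %{"tv-1" => %{model: "LG C2", status: :busy}}

# After editing the module's source file, recompile and hot-load it in
# place; process state on the node is preserved, only the code changes:
r TVLabsSketch.Demand
```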

In addition to that, there's tracing tools that existed before, like, any of the lower-level system tools. Like, I'm thinking of DTrace or eB… Is it eBPF? eBF? Whatever the hell it's called.

The Beam has a lot of tracing tools built in, where I can say, “Hey, any function of this name, whenever it's called and it has this particular argument, print it out for me. Or hook into it when it happens.” And that's something I haven't seen, to that extent, in any other language or platform. And when things go wrong, this is extremely powerful.
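OTP's built-in `:dbg` module does exactly this kind of “trace this function when it's called with this argument” work, live, with no restart. A small example, again against the hypothetical module from the earlier sketch:

```elixir
# Start the default tracer and trace function calls in all processes.
:dbg.tracer()
:dbg.p(:all, :call)

# Print every call to handle_call/3 in the sketch module; the :x shortcut
# also shows return values and exceptions.
:dbg.tpl(TVLabsSketch.Demand, :handle_call, :x)

# Or only fire when the first argument matches a particular shape:
ms = [{[{:request, "LG C2"}, :_, :_], [], []}]
:dbg.tpl(TVLabsSketch.Demand, :handle_call, ms)

# Turn everything off when done.
:dbg.stop()
```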

Yeah, it is nice to be able to pop into something, especially if it's truly mission-critical, where you're like, “Oh, I don't wanna stare at some logs and try to figure out what it is.” If there's a break-glass scenario for a truly mission-critical system, you can get in and get access to some of this information without taking it down, or without, you know, performing archaeological tasks on some logs after they get written. It is amazingly powerful when it comes to the debugging stuff.

I use… well, I used to use – before prod access was taken away from me – I used the REPL all the time to troubleshoot things. It's amazing. It's really good at what it does. SOC2 be damned.

Awesome, man. Well, thanks for coming on the show today.

Thank you for having me. 

Whereabouts are you on the internet nowadays? 

Well, me personally, I'm on all the apps, but you can find me at davydog187 on Twitter and Bluesky – follow me there. And tvlabs.ai is our product. Check us out at tvlabs.ai.

Heck yeah. Awesome, man. Well, thanks again for coming on the show and I'm gonna bug you for like 10 to 15 more minutes once I hit stop. 

All right, see you then. Thanks, Cory.

Thank you.

Links

Featured Guest

Dave Lucia

CTO & Co-Founder at TV Labs