Learn how Conductor uses HashiCorp Nomad to drive industry-leading compute scale that enables VFX studios to render massive amounts of content on the cloud.
Speakers: Jonathan Cross, Carlos Robles
Hello, everyone. Welcome to our presentation of "Making Movie Magic with Nomad." In this talk, we'll explore how Conductor Technologies uses Nomad to enable high-scale visual effects (VFX) rendering on the cloud.
I'm Jonathan Cross, a lead software engineer here at Conductor. I'm humbled to be speaking to all of you at HashiConf EU. I've been using so many of HashiCorp's tools throughout my career, and being able to speak about one is a huge honor for me.
At Conductor, I work on our backend, with a concentration on our core domain, rendering, and dealing with all the complexities of doing that on multiple clouds.
And I'm Carlos Robles. I'm a senior DevOps engineer at Conductor, and it's my job to make sure we build and scale in a way that's consistent, repeatable, and secure. And as always, on the operations side to put out any fires that flare up and make sure they don't flare up again.
I've worked in a variety of industries, but this is my first time working in entertainment and VFX. And I can say there's lots of fun and interesting technical challenges to overcome.
I'm honored to speak at HashiConf EU 2021, and I'm excited to share with all of you what Conductor does and how we do it. Let's get to it.
Founded in 2017, Conductor is the premier platform for visual effects rendering on the cloud. We've been used on projects like Blade Runner 2049, Deadpool, Game of Thrones, Hellboy, Stranger Things, and Welcome to Marwen.
Projects like those partner with Conductor to leverage cloud computing for the VFX rendering. We make that possible by providing a seamless, scalable, and secure VFX rendering platform.
Customers only pay on-demand for the minutes they render. That works very well for our customers, artists and studios who are looking to complete their projects, ranging from digital media campaigns and commercials to big-budget TV shows and movies.
We make it easy by providing a suite of tools and plugins that work directly in line with all of their existing software and workflows. All that means is that our customers can easily submit their render jobs, monitor their progress, and download the finished render scenes when they're done.
Let's talk about how Conductor makes rendering on the cloud easy and how Nomad fits into that solution.
It's useful to note that the pipeline workflow a VFX artist might follow is similar to the development workflow a developer might follow. And that developer workflow that a lot of us are familiar with starts with a code base that you make a local branch on and contribute to. Generally, before you push up your changes, you run some local testing to make sure everything still looks good. Once you do push your changes up, there's going to be some automated processing in place to push your code out to release.
The VFX workflow is similar. An artist is going to start with a set of assets—that's the models, objects, and textures in the project—and they're going to create or modify a scene with these assets.
Before submitting a scene for production-quality rendering on the render farm, an artist would generally run a local low-resolution render to ensure there are no errors and that the render output looks the way they expect it to. If it does, they go on to submit their scene at Conductor for rendering.
The scene that's submitted is a job made up of a collection of tasks for the frames that need to be rendered. Conductor will first only render what's called the scout frames, which are the first, middle, and last frames of the submission. This lets the artist download them and get another chance to inspect the output to confirm it looks good. if it does, we can go ahead and render the rest of the frames.
In this context, Conductor can be thought of as a continuous deployment platform for VFX workloads. We automate the process of converting rendered assets and dependencies into a finalized render output in much the same way that a CI or CD process is going to convert a code base and its dependencies into a built-in packaged app that can be deployed.
Once the job is submitted, we take over. There are 3 major steps to producing the rendered output that the artist can download:
The first step is uploading. Customers upload their scene dependencies via our tools and plugins to cloud storage, in this case, Google Cloud Storage. They have the option of using their command-line tools, our point-and-click user interface, or our custom plugins that are built into the render software. They can just do everything from inside that software.
As part of the upload, we ensure that no files are duplicated and we re-upload any files that have changed. The uploaded raw asset's going to look something like the image you see on screen now, where you have raw assets like the car models and the textures and some pre-selected camera view.
Once that upload is complete, the second step is syncing. Our sync process copies their scene dependencies from Google Cloud Storage to a disk or a set of disks. Once this sync is complete, we can spin up the required render nodes and attach these disks to them for rendering.
The last step is rendering. We produce the finalized output that we see on the bottom right here. This finalized output is copied over to Google Cloud Storage. From there, the customer can download their frames at their convenience, again using our Conductor tools.
In order to meet our customer rendering needs, we have certain requirements that we need out of the scheduling backend that runs our Windows workloads. It's important that we can scale, and quickly, not just up, but also down. Because we charge on-demand by the minute, we don't want to spend any time or money on compute unless it's actively rendering.
And we need to be secure. This means that we run all of our render commands in a secure sandbox environment to prevent any sort of malicious use.
Observability is key because it means that we have visibility into what the scheduler is doing and we know what our key performance metrics look like. We can establish that baseline behavior and better respond to issues before they become real problems.
On the rendering side, we need to support customer rendered data, which means being able to attach the rendered scene files and their dependencies to render nodes so that they're available at render time.
And we need to support our render licensing backend. Most of the render software packages that we support have these unique and specific licensing requirements that we fulfill with our licensing backend.
Lastly is, of course, the render software itself. We have a render software backend that ensures all of the various packages and their differing versions are available at runtime.
As you can see, we have a lot of requirements, and we're a very small team. It's just Jon, myself, and our manager, Francois. So we need the backend to be as hands-off as possible, so we can spend less time on operational overhead and more time actively improving our service.
So we felt that using managed services would minimize our overhead, and we wanted the flexibility to scale across all of them as needed, as well as not be locked into any particular one.
So we designed our platform to accommodate a variety of scheduling solutions, both cloud-based and self-managed. In our setup, the Conductor API manages all the submitted jobs and their tasks. It's the API you submit your jobs to, and it's the one that keeps track of your job's progress throughout the render process. The render API manages the status of each orchestrator, and it's the one that submits jobs to each of them.
The render API defines a set of methods that we expect every orchestrator to implement, and then we write that implementation for each orchestrator. This allows us to add as many orchestrators as we need across as many providers as we support, and we can enable or disable them based on their availability or compatibility with our platform.
It also allows us to abstract away what an orchestrator means. Each orchestrator could be a scheduler, like pictured here, or it could be a region. But in practice we get as granular as the individual cloud zones.
By default, all the orchestrators are going to be enabled. If we have an outage in US-Central-1A, then we can disable that zone across all our orchestrators and prevent any jobs from being placed there. If we have an outage of an entire region like AWS US-East-1, then we can just disable all the zones in that region.
This multi-cloud, multi-region orchestration architecture is what allowed us to pilot managed services and Nomad as our render farm solutions. There were a lot of lessons that we learned from using managed services that led to our success with Nomad.
With orchestrators to submit jobs to, we need to be able to scale quickly to run those jobs. We initially used the auto-scalers that came with the managed services we were piloting, but we found that they were more optimized for long-running, predictable workloads, and not the unpredictable batch-type of workloads that we were submitting.
We also didn't have the visibility that we wanted into the auto-scaling behavior. So we decided to write our own. The auto-scalers follow the same design pattern as the render API. It's one interface with a defined set of methods that we then implement for every orchestrator.
As a result, the auto-scalers all work in the same way across all orchestrators. They check for pending jobs and scale up the needed instance types, and they check for any idle instances and scale those down.
Of course, we have logic in place to control how aggressive we are with that scaling. In practice, each auto-scaler runs in one region of a particular orchestrator, and it monitors the zones in that region for scaling needs. We can scale up and down in each zone independently of the others.
Now that we have Nomad as an orchestrator and have implemented its render and auto-scaling methods, we can move on to building our Nomad cluster.
Our first step is to make a base image for our Nomad instances. We make 2 base images, 1 for Nomad servers and 1 for our nodes. Our focus will be on the node image.
As you can see, we keep the base image fairly light. We only install 4 things, starting with the CentOS base, we install Docker, Nvidia GPU Drivers for our GPU customers, Nomad of course, and the device plugin, which we'll talk about in a bit, and finally our render licensing service backend.
So we have 1 pair of server and node images for all our regions. And we have a startup script that does some cloud discovery to join the right cluster and with the correct text.
For building the Nomad cluster infrastructure itself, we use Terraform, with the Terraform Cloud backend. We have a module for our servers, our services nodes, and our render nodes. Each module run is split into a separate Terraform workspace, which lets us modify individual pools separately without affecting the others.
With Terraform, we build our clusters, each of which looks something like what you see on screen. We have the 3 Terraform workspaces: servers, services, and the render-pools. You can see that the building blocks for our pools are Google Cloud Managed Instance Groups, or MIGs, which are represented by the blue hexagons.
In this example, we have 2 zones in that region, but it's not uncommon to run 3 or more. The servers and services pool are both 1 single regional MIG that distributes instances across all available zones.
The services pool is where we run our long-running services, like our auto-scalers. Then we have our render pools, which are made up of lots of MIGs, 1 for each instance type we support. Every zone, for example, Zone A, will have CPU MIGs to start at single-core instances and go all the way up to 160 cores.
For our Nvidia GPU customers, we have separate GPU MIGs to start with 4-core, 1-GPU instances and go all the way up to 64 cores with 4 GPUs.
And lastly, we offer both standard and preemptible versions of all the instance types. The number of MIGs you need to build is double what's pictured here.
Preemptible instances are the same as standards, but they have a major difference. They're much cheaper, because they come with a small risk that the instance might be terminated while you're rendering, or preempted, if the cloud provider needs to claim those resources back.
But because it's generally a small risk and because preemptibles are so much cheaper than standard, this makes financial sense for a lot of our customers.
Now that our cluster's built, we can start to submit jobs to it for rendering. The render API is what translates the Conductor job parameters, for example, how much CPU, how much RAM, to an instance type on Google Cloud, and it submits a Nomad job for every Conductor task.
It means that every frame is a job on Nomad. When the Nomad job is submitted, it's populated with the constraints to match the instance that it needs based on instance type, the zone it's in, and whether it's preemptible or not.
The last job constraint in our submission has to do with the Nomad device plugin feature and our render software licensing backend. As I mentioned before, our platform offers licensing built-in and charges on-demand by the minute.
We always want to make sure that there aren't going to be any issues with licensing while our customer renders run. To prevent this from happening, we use a custom Nomad device plugin to check that our licensing service backend is ready and all green. If it is, the device is marked ready, and the last constraint is met for the node to begin accepting those jobs.
While the Nomad device plugin check is happening, we're also loading all of our system jobs on every render node, because they provide all of the supporting services that work alongside the render.
The first one is the license checker. This is an additional license check on top of the device plugin. While the device plugin prevents jobs from being placed on a node that isn't ready, the license checker performs that same check, but it terminates the instance if it continues to fail that check for a pre-specified amount of time.
We use vector logging from Timber.io, now Datadog, for getting logs off the instance and onto our dashboard, so our customers can view the output logs over their render commands.
We use Telegraf to send our metrics to a central influx database for visualization with Chronograf and alerting with Kapacitor. Lastly, we run a custom OOM detector that notifies us of any rendered jobs that OOM, that is, run out of memory, so we can proactively reach out to those customers and recommend that they increase the amount of compute resources they're giving their jobs.
Once all the constraints are met and the job is placed on a node, the render agent can begin the process of running the render. And now my colleague, Jonathan Cross, will give us a rundown on how that process works and how it performs on Nomad. Take it away, John.
Thanks, Carlos. I'll be covering what happens once all the constraints are met and the orchestrator has allocated our job.
Much like how our render API abstracts, we instituted the same methodology with the agent, that should only have to fulfill a contract or more specifically an interface to be able to work on a multitude of environments.
We are very inspired by HashiCorp and how their products give developers the ability to extend via plugins without having to be built into the main binary. We'll touch on how we went about this later on.
Let's get into the first thing the agent does, which is discovery, determining where it's running, or in most cases, which cloud. The initial discovery will instantiate all the packages required to operate successfully. Very much like an orchestrator where it attempts to realize the desired state, we also instruct the agent on what it should be working on.
In this example, the agent has found out it's running on GCP. It then communicates with our API to inform that it's started the job. As Carlos showed earlier, customers use tools to upload their assets to optic storage. Eventually, that gets written to some type of file system that the agent interacts with.
On GCP, we're using persistent disks for AWS FSx for Lustre, and a customer on their own datacenter could want to use NFS. It's all the same with the agent doing the procedure. Here on GCP, the agent will attach and mount all disks pertinent to the job to run properly. Once the file system is recreated, we are ready to run the given arbitrary command.
You all know how dangerous this is, accepting random commands from potentially unknown users. We've taken several measures to try to protect us from malice. Docker has several security features such as namespaces, true root, cgroups, and capabilities and can use tools like AppArmor or SELinux for additional protection.
It also has a default seccomp sandbox profile. This profile disables 40-plus system calls out of 300 or more and is generally flexible enough for almost any application to run without causing issues. For those unfamiliar with seccomp, it's based on eBPF, which allows programs to run small sandbox applications in the kernel with rules.
In the case of seccomp, we can filter any system calls that we don't allow and prevent them from succeeding, and additionally kill the program that made that call. Before the arbitrary command is run, the agent will further reduce the allowable system calls down to 150 and allow no new privileges, meaning the trial process cannot elevate its status beyond the user running that command.
We also run an additional set of iptable rules to disable any outgoing egress calls, while protecting against leaking cloud metadata or anything within our VPC.
Once the command has started, all the stdout and stderr from that process or its children are outputted by the agent. with additional sanitizations so our logging sidecar can ingest it.
We have recently incorporated RPC functionality into the agent. This allows us or our customers to extend functionality that fits the environment it's in. Without having to be built into the agent, it could be in any language.
With the ability to opt in to the agent's built-in package, we can avoid rewriting code and focus on specific implementation points. One of the exciting aspects of this feature is that we can run workloads using different runtime environments.
gVisor is an application kernel that helps lock down file systems, networks, and system calls by rewriting only the parts of the kernel that's required. Google uses this to run App Engine, Cloud Functions, and Cloud Run.
Currently, we are using this for CPU-based workloads where it's compatible. We are excited to roll this out fully, once gVisor has GPU support. While we have started with gVisor, due to how it's OCI-compliant, we can easily and quickly try out different runtimes such as Kata or Firecracker.
Now that you've seen how our platform works, let's go into how it performs.
We wanted to get an idea of how switching to Nomad helps our customers and operators, so we decided to do a pre-standard case benchmark. There's a saying that goes, "There are lies, damned lies, and benchmarks." So please don't interpret this as an absolute.
The agent is using the same exact code, procedure, packages, and package providers on both orchestrators, along with our custom auto-scaler, using the same timings, integrals, cooldowns, and metadata checks or constraints.
Using the canonical BMW car blender that I'm seeing on the same instance types and frame count, we'll see how Nomad and GKE performed for this very normal job on the Conductor platform. The best of 3 runs on each orchestrator are shown with standard deviation if applicable.
Here we are measuring the amount of time it takes to go from Conductor pending to the first confirmed agent start, or Conductor running. Nomad has an advantage here. Mainly due to its machine image being customizable in our CI/CD process using Packer, resulting in a 63% improvement.
GKE is performing steps during instance creation via user data that we can't control. This increases the wait time considerably while costing Conductor. We don't start billing until our agent has started. Had this been a Kubernetes built with a tool like kops, we do believe this would be much closer.
A further breakdown on percentiles: With 50% of the tasks being under 2 minutes for Nomad. GKE fares pretty well for 50% of its tasks as well.
As we continue to scale, we start seeing a slowdown on both, but much more noticeably on GKE. We measured, on average, how long it took to go from Conductor start to finish, per task. Nomad performed slightly better here, due to being able to acquire more CPU and memory when creating cgroup.
GKE has percentage-based reservations, meaning you cannot get the entire machine's resources, which is why it has a slightly longer average runtime. Again, if this was Kubernetes we operated or owned, this would be nearly identical.
Here we are showing the total time from going Conductor pending to all frames or tasks completing. With the additional user data overhead on hundreds of machines, we see considerably longer time to finish the entire job on GKE. Nomad, benefiting from this machine image and the batch scheduler using a power of 2 choices to quickly rank instances, we see a 42% improvement.
Finally, we measure the peak instance count of the clusters to run this job. In conjunction with the total runtime of the benchmark, having a smaller cluster size and finishing quicker saves us and our customers' overhead cost.
Nomad stayed very consistent in its sizing, with a standard deviation of about 20 instances. This attributes to the previously mentioned points. While we saw GKE's best being very close to Nomad in size, its standard deviation was much larger, by almost 5X.
We still see a need for Kubernetes as a service for batch-related work due to being ubiquitous for almost all cloud providers, and for good reasons. We certainly might reevaluate that once HashiCorp platform Nomad exists.
With that being said, for a long time we believed, with such a small team, we couldn't get into the business or the operational overhead of running Nomad, that managed services would save us time. When we finally ran into blockers with managed services that made us beholden to their release cycles, we knew it was the right time to consider a switch.
We went from 0 to production in a month and migrated most of our GCP customers to Nomad. So for those who think that having to own or operate a cluster will be a time sink, we found Nomad as easy as going with managed services, but with much more flexibility and hope many more consider giving it a try.