See how Epic Games uses a Nomad-based pipeline to handle tens of thousands of concurrent “cooks”.
Hi everyone. My name is Paul Sharpe. I'm a technical director at Epic Games in the Unreal Cloud Services division. This talk is Cooking with Nomad: Powering the Fortnite Creator Economy. I'm a firm believer in “context is everything,” so I'd like to set some context real quick, as I know not everyone may be familiar with Epic Games or Fortnite.
Epic Games is one of the biggest names in the gaming industry, originally founded in 1991 by our CEO, Tim Sweeney. Since then, the company has greatly expanded, with thousands of employees and numerous studios across the globe. Epic has created some of the most famous game titles in the industry, such as Unreal Tournament, Gears of War, and Fortnite. However, it's also just as famous for its 3D creation tool, the Unreal Engine.
While Epic has been known for its games over the years, the Unreal Engine is the heart of our company. It's an industry leader, the recipient of many prestigious awards, and even holds a Guinness World Record for the most successful video game engine of all time. Unreal Engine has been the foundation for thousands of video games, from some of the largest AAA titles to smaller indie projects.
But it's more than a video game engine. We call UE a 3D creation tool because it enables the creation of rich 3D experiences and environments for a broad set of applications, not just games. In automotive, it helps BMW design in real-time. And at Rivian, it actually powers their in-vehicle experience.
The engine has also transformed special effects for film and television, allowing scenes to be adjusted and rendered as they shoot the footage. It's right there on the spot. It's also used in architecture for design and 3D and VR walkthroughs, and it's used in all kinds of simulations for medicine, academic research, and more.
What kind of presenter would I be if I didn't do a quick shoutout for my org, Unreal Cloud Services, which is the intersection of the engine and online services? We specialize in cloud-based solutions involving the Unreal Engine, as well as projects and features that help it operate in cloud environments, like containerizing the Unreal Engine.
UCS started as a small startup-like org within the company, and we focused on Epic's pixel-streaming products, such as MetaHuman Creator and Twinmotion, which were primarily used in non-gaming industries. But after we established our reputation as an org that reliably delivers, we expanded our scope to include game-related products, as well as creating and expanding on real-time collaboration features for engine-related products.
Getting back to Fortnite, Epic launched it in 2017, and it rapidly became one of the largest video games in the world. It now has over 500 million accounts, rivaling some social networks we know.
Fortnite started off as more of a cooperative survival game with an emphasis on building structures to survive the environment. But shortly after launch, a mode called Battle Royale was released. This is a competitive, last-person-standing, 100-person arena with the same building elements. This mode only took two months to create, and it was an instant hit, propelling Fortnite to the top of the charts.
The willingness to move quickly and experiment with different gameplay modes was critical to Fortnite's success. It set the tone for the coming years, where the game evolved into a whole world with a vibrant story told through chapters, a wide variety of game modes, and a place that brings players together through live events such as in-game concerts with various famous artists. Fortnite became a set of experiences, not just a game.
This spirit of experimentation, coupled with Fortnite's evolution to be more about experiences, was the catalyst for Fortnite's creative mode. This mode allows players to build their own games and experiences, utilizing the assets and building blocks provided, on their own private Fortnite islands.
Epic has always been passionate about building tools and platforms for other studios and content creators, so we created an ecosystem, both financial and technical, for these creators to publish, manage, and market their Fortnite islands.
As the market and interest for these kinds of experiences grew and took shape, we began to expand on the ideas and technologies involved with the goal of building the foundation for an open ecosystem where players can discover and participate in a vast variety of meaningful experiences together, with no gatekeepers whatsoever. This ecosystem is at the heart of Tim Sweeney's vision for the Metaverse, and it required new tools and new monetization to entice and enable creators to participate. Thus, UEFN was born.
In March of this year, we launched Unreal Editor for Fortnite and the new Creator Economy at the Game Developers Conference. UEFN builds on the original concepts introduced in Creative mode. But instead of working with a curated set of assets and engine functionality, it lets creators use many of the bells and whistles in Unreal Engine 5 with any asset, including custom ones that people can make themselves.
UEFN also introduces the Verse programming language to enable more advanced gameplay functionality and take advantage of what the engine's capable of. With these expanded capabilities, the sky's the limit on the types of experiences that can be created, including ones that look and play nothing like Fortnite.
Epic Games is very serious about this product and is doubling down on Tim's vision of the open Metaverse by launching our new creator economy, where Epic sets aside 40% of its net revenue to fund participants in this economy. Creators receive money based on player engagement with their experiences. Six months after launch, UEFN has been gaining in popularity, with over 700,000 islands already created. To give you an idea of what's possible, here's a clip from our live demo at GDC. The game you're about to see was created entirely with UEFN.
That was pretty cool, right? They created that entirely in UEFN, and the engineers at GDC played through it live and showed how it was made. It was pretty neat. Notice it looks nothing like Fortnite; it has a hyper-realistic style. But at this point you might be asking yourself: alright, cool story, but what did that have to do with cooking or Nomad?
When we say cooking, we mean the process by which assets are compiled or converted into the platform-specific formats required for the engine to load them on a given platform: Windows, PlayStation, Xbox, Switch, and the like. Unreal Engine comes with that functionality for cooking assets baked in. Alright, that joke.
So, everything you saw on that demo required the Unreal Engine to perform operations like compiling shaders, converting audio formats, translating static meshes, and more. These operations are extremely resource-intensive and normally are done on a developer workstation with all the cores and all the RAM or on a giant render farm.
Unreal Engine will gladly consume every resource you give to it, and it can take a lot of time for these operations to complete, depending on the complexity of the project. To speed up this process, the engine can use a cache to avoid regenerating assets and will update the cache when it produces new cooked output. Cooking your UEFN island takes a large amount of compute resources and a lot of time, and it involves caches, a lot of input and output data, and the use of a tool that potentially has double-digit gigabytes of SDKs bundled with it.
Given the potential size of the metaverse where millions of creators are crafting these experiences, how do you build a platform for doing these kinds of operations, potentially with tens to hundreds of thousands of simultaneous workflows in flight? Well, it turns out it's with Nomad. We'd already been using Nomad at UCS for serving our pixel-streaming products like MetaHuman Creator and Twinmotion, and for processing images in a product we have called RealityCapture.
We had battle-tested operational experience with it in production and familiarity with its API and tooling. So, when UCS was approached to take on this cooking pipeline for UEFN, we knew Nomad could be the perfect fit for this type of workflow we had to support.
It met all our project requirements: reliable container orchestration available through an API, multi-stage workflows for executing batch jobs, control over container runtimes to configure compute isolation, solid file system isolation to secure user-generated content (remember, anyone can bring their own assets and whatnot), and the ability to operate on Windows.
Attempting to do this with Kubernetes would've been a monumental amount of extra work, particularly due to the multi-stage workflow, file system, and Windows requirements. With Nomad, we knew we had the parameters we would need, without any additional hoops to jump through.
Let's talk about the first requirement we had: orchestration. These cooking sessions are one-off batch jobs, executing containers that could potentially run for extended periods of time. The UEFN services involved in managing a user's cook request needed a reliable API to manage these jobs: creating them, getting their status, and canceling them if they exceeded an SLA.
Nomad's control plane was a great fit, not only because of its clustering capabilities, which provided redundancy for the API, but also because it has its own reliable queuing mechanism built into the scheduling workflow. We wouldn't have to implement any queuing mechanisms or rely on something like SQS; we could just trust the control plane to maintain and process the queue while providing numerous metrics to give us insight into the queue state. This greatly simplified our development process, as we didn't have to create separate services to manage this for us.
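As a rough illustration of the create, status, and cancel operations described above (a sketch, not Epic's actual service code), the requests below target Nomad's HTTP job endpoints. The agent address and the job ID are assumptions, and the requests are only built, not sent.

```python
import json
import urllib.request

# Assumption: a Nomad agent reachable at the default local address.
NOMAD_ADDR = "http://127.0.0.1:4646"


def register_request(job: dict) -> urllib.request.Request:
    """Build (but do not send) a request that registers a job."""
    body = json.dumps({"Job": job}).encode("utf-8")
    return urllib.request.Request(
        f"{NOMAD_ADDR}/v1/jobs",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def status_request(job_id: str) -> urllib.request.Request:
    """Build a request that reads a job, including its status."""
    return urllib.request.Request(f"{NOMAD_ADDR}/v1/job/{job_id}", method="GET")


def cancel_request(job_id: str) -> urllib.request.Request:
    """Build a request that stops a job, e.g. after it exceeds its SLA."""
    return urllib.request.Request(f"{NOMAD_ADDR}/v1/job/{job_id}", method="DELETE")
```

Sending any of these with `urllib.request.urlopen` would hand the job to the control plane, which does the queuing itself, so the calling service needs no queue of its own.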
Additionally, Nomad gives you full control over which scheduling algorithm is used for placing containers. For our use case, given the intensity of the operations we're talking about here, bin packing wasn't an ideal use of our fleet of EC2 instances, and these instances are gigantic. We take EC2 instances at the whole-host level because noisy neighbors are bad for this use case.
Using the spread algorithm instead allowed us to fan these cooks out horizontally and minimize CPU contention by landing on a box that wasn't fully loaded. This was important because, due to the size of the containers involved and how long it takes Windows to pull said containers, we couldn't rely on reactive auto-scaling. Instead, we had to optimize the fleet size based on traffic trends. That meant making sure we were utilizing as many instances as possible to optimize our cloud spend.
Another feature that was valuable in this process was routing by node class. This allowed us to have different instance types for different types of cooks, further optimizing our cloud spend.
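To make the difference between bin packing and spreading concrete, here is a toy placement model. It is a deliberately simplified sketch, not Nomad's scheduler: every host has the same number of slots and every cook takes one slot.

```python
def place(jobs: int, hosts: int, slots_per_host: int, algorithm: str) -> list:
    """Return how many jobs land on each host under a toy placement policy."""
    load = [0] * hosts
    for _ in range(jobs):
        candidates = [i for i in range(hosts) if load[i] < slots_per_host]
        if not candidates:
            raise RuntimeError("no capacity left")
        if algorithm == "binpack":
            # Prefer the fullest host that still has room.
            target = max(candidates, key=lambda i: load[i])
        elif algorithm == "spread":
            # Prefer the emptiest host.
            target = min(candidates, key=lambda i: load[i])
        else:
            raise ValueError(f"unknown algorithm: {algorithm}")
        load[target] += 1
    return load


print(place(4, 4, 4, "binpack"))  # [4, 0, 0, 0]
print(place(4, 4, 4, "spread"))   # [1, 1, 1, 1]
```

With four cooks and four hosts, bin packing stacks all four on one machine, where they compete for CPU, while spreading gives each cook a host to itself.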
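One way routing by node class could look when submitting jobs through the JSON API is a constraint on the `${node.class}` attribute. This is a sketch only: the cook types and class names below are hypothetical, not Epic's.

```python
# Hypothetical mapping from a kind of cook to the node class that should run it.
NODE_CLASS_BY_COOK_TYPE = {
    "standard": "cook-standard",
    "heavy": "cook-highmem",
}


def node_class_constraint(cook_type: str) -> dict:
    """Build a job constraint that pins a cook to nodes of one class."""
    node_class = NODE_CLASS_BY_COOK_TYPE[cook_type]
    return {"LTarget": "${node.class}", "Operand": "=", "RTarget": node_class}
```

The returned dictionary would go into the job's `Constraints` list, so that, for example, heavy cooks only land on the larger instance type.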
If I had to pick one feature that was critical for this project's success, it would definitely be task lifecycles. The concept of having a main task with hooks to launch other tasks around it was precisely what we needed for a cook workflow. If we were using Kubernetes, we would've had to extend it to provide this functionality. With Nomad, we just had it out of the box.
The steps provided in lifecycles are as follows. First, you have prestart tasks. These are run before the main task executes. Any failure in this step will cause the main task to be skipped, which makes these tasks useful for any kind of initialization or staging of an environment.
Then, you have poststart tasks. These are run after the main task starts and are useful for sidecars that you would want to run beside the main task. You can even flag the task as a sidecar, and Nomad will keep it up and running for you and will clean it up when the main task exits.
Finally, we have poststop tasks. These are run at the end of the job after all the previous tasks have exited. These tasks are always executed even if there were errors in the previous steps, which makes them great for any additional error handling or cleanup that you need to do.
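The ordering rules just described can be modelled in a few lines. This is a toy, sequential sketch of the semantics as stated in the talk (in a real allocation, poststart tasks run alongside the main task), not Nomad's implementation.

```python
from typing import Callable, Dict, List

Task = Callable[[], bool]  # a task returns True on success, False on failure


def run_lifecycle(
    prestart: Dict[str, Task],
    main: Task,
    poststart: Dict[str, Task],
    poststop: Dict[str, Task],
) -> List[str]:
    """Return the names of the tasks that were launched, in launch order."""
    launched: List[str] = []
    prestart_ok = True
    for name, task in prestart.items():
        launched.append(name)
        if not task():
            prestart_ok = False
    if prestart_ok:
        # The main task only starts when every prestart task succeeded.
        launched.append("main")
        for name, task in poststart.items():
            launched.append(name)
            task()
        main()
    # Poststop tasks run regardless of earlier failures.
    for name, task in poststop.items():
        launched.append(name)
        task()
    return launched
```

A failing prestart task skips the main and poststart tasks entirely, but the poststop tasks still run, which is what makes them suitable for cleanup and error reporting.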
Our cook workflow involves numerous steps. First is downloading any of the input assets, along with any related cached assets, and making them available to the engine to operate on. The prestart lifecycle step was absolutely perfect for this. We're able to launch our utility to download and validate all of the assets and required files, as well as create the directory structure for the rest of the tasks in the workflow. With how the lifecycles work, any errors here will skip launching the cook process, allowing us to fail fast.
The main task in our lifecycle is the cooking process itself. This is a containerized Unreal Engine 5 instance with UEFN-specific cook modules run in complete isolation because it's processing untrusted user-generated content.
We keep various platform-specific versions of these containers prewarmed on our nodes due to their gigantic size, which in turn is due to those bundled SDKs that I was mentioning before. Keeping these prewarmed allows the container to start almost immediately, which is important because we try to keep our cook times as short as possible for a favorable user experience.
After the cook starts up, we use the poststart step to launch a monitor process that will fail the job if the cook exceeds its SLA. Again, we're very cognizant about cook times for the user experience. Being able to cancel this task is actually one of the many ways that the Nomad API helped with managing these workflows.
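A monitor of this kind can be sketched as a small polling loop. This is a hypothetical outline, not Epic's sidecar: `is_finished` and `cancel` stand in for whatever calls check on and stop the cook, and the clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def watch_cook(
    deadline_seconds: float,
    is_finished: Callable[[], bool],
    cancel: Callable[[], None],
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the cook finished in time, False if it was cancelled."""
    start = clock()
    while not is_finished():
        if clock() - start >= deadline_seconds:
            cancel()  # the cook blew through its SLA
            return False
        sleep(poll_seconds)
    return True
```

In a real deployment, `cancel` might call the Nomad API to stop the job, and the process would exit non-zero so the failure is visible to later steps.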
The last phase is poststop. We use this for analyzing and reporting any Unreal Engine crashes, as well as syncing any new caches that were generated during the cook. We then upload the generated output to its final destination for loading back into the running UEFN session. I should mention that when you download UEFN and use it, you're in-game, in a 3D session. You bring up an editor, you do all this stuff, and you submit these cook requests.
When the cook is done, your assets will actually render live back into the session as you're editing it. It's really neat to watch. That poststop step was critical for that. We also use that step in case there are any errors: we can update the metadata so we can send a message back to the user that their cook failed.
One quirk of using task lifecycles is that all the tasks in a particular phase are launched at the same time, all together. There's no inherent way to run them in a particular order, as a kind of sub-workflow within the workflow. However, once again, we were able to utilize the Nomad API here. That last step uses the API to get the status of all the other steps that it's waiting for. Once they're complete, it can move forward with performing the uploads and the updates.
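Put together, the four steps map onto one task group whose tasks carry lifecycle hooks. The following is a sketch of what such a job could look like as a JSON API payload. The task names, image names, and datacenter are made up, and marking the monitor as a sidecar is an assumption rather than something stated in the talk.

```python
from typing import Optional


def cook_job(job_id: str, engine_image: str, tools_image: str) -> dict:
    """Build a batch job with prestart, main, poststart, and poststop tasks."""

    def task(name: str, image: str, hook: Optional[str] = None,
             sidecar: bool = False) -> dict:
        spec = {"Name": name, "Driver": "docker", "Config": {"image": image}}
        if hook is not None:
            # Tasks without a lifecycle hook are main tasks.
            spec["Lifecycle"] = {"Hook": hook, "Sidecar": sidecar}
        return spec

    return {
        "ID": job_id,
        "Name": job_id,
        "Type": "batch",
        "Datacenters": ["dc1"],  # assumption: placeholder datacenter name
        "TaskGroups": [{
            "Name": "cook",
            "Tasks": [
                task("fetch-inputs", tools_image, hook="prestart"),
                task("cook", engine_image),
                task("sla-monitor", tools_image, hook="poststart", sidecar=True),
                task("report-and-upload", tools_image, hook="poststop"),
            ],
        }],
    }
```

Because all four tasks live in one task group, Nomad places them in the same allocation on the same host.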
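The waiting logic can be sketched as a check over the `TaskStates` map that the allocation endpoint returns, where each task reports a `State` and a `Failed` flag. How the allocation JSON is fetched and how often it is polled are left out, and the task names in the usage note are hypothetical.

```python
from typing import Iterable


def tasks_finished(allocation: dict, task_names: Iterable[str]) -> bool:
    """True once every named task in the allocation has reached the dead state."""
    states = allocation.get("TaskStates") or {}
    return all(states.get(name, {}).get("State") == "dead" for name in task_names)


def any_failed(allocation: dict, task_names: Iterable[str]) -> bool:
    """True if any of the named tasks is flagged as failed."""
    states = allocation.get("TaskStates") or {}
    return any(states.get(name, {}).get("Failed", False) for name in task_names)
```

An upload step would poll until `tasks_finished(alloc, ["crash-report", "cache-sync"])` returns True, then decide what to upload or report based on `any_failed`.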
Nomad's variety of runtimes and the ability to control arguments passed to them was absolutely critical for the security of the cook workflows and the maintenance of our fleet.
Nomad supports a number of container runtimes, and even running tasks via exec, through the task driver abstraction. You can select and configure these drivers when you create a job spec, providing a very flexible toolkit for operating your services and applications. In our use case, being able to configure Docker's networking and security features was instrumental in completely isolating this cooking process. Again, user-generated content.
This configuration, in concert with Nomad's file system isolation features, protects creators from bad actors, whose code or assets could potentially break out of the cook and do nefarious things on the host or mess with other people's cooks happening on the same node.
Also, as I previously mentioned, we keep all those containers prewarmed. So, utilizing the support for these different runtimes, combined with Nomad's concept of a sysbatch job (an excellent job type that runs a batch job across every host in the fleet), allowed us to periodically sync the required images and clean up old ones to keep container creep in check.
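The core decision such a sync job makes on each host is a set difference between the images it should have and the images it does have. A minimal sketch, with made-up image names and without the actual pull and remove calls:

```python
from typing import List, Set, Tuple


def plan_image_sync(required: Set[str], present: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (images to pull, images to remove) for one host."""
    to_pull = sorted(required - present)
    to_remove = sorted(present - required)
    return to_pull, to_remove


# Hypothetical image tags for illustration only.
pull, remove = plan_image_sync(
    required={"cook-engine:5.3-win", "cook-tools:2"},
    present={"cook-engine:5.2-win", "cook-tools:2"},
)
print(pull)    # ['cook-engine:5.3-win']
print(remove)  # ['cook-engine:5.2-win']
```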
File system isolation is another absolutely key feature. When Nomad schedules a set of tasks for a job on a host, it's called an allocation. Inside that allocation, each task has its own working directory. The allocation itself additionally provides a separate directory meant for sharing data between the tasks.
Some of Nomad's task drivers, including Docker, support image isolation for these file systems. This means the file systems are isolated as machine images controlled and owned by the task driver's external process, like the Docker daemon, instead of by Nomad. These images are mounted into the task, ensuring that the task file system is only accessible by the task itself and that the allocation file system is only accessible by the set of tasks within that allocation.
This shared allocation file system was the final puzzle piece for isolating this cook process. We needed a way to get the inputs and cache files to the engine, and to let the subsequent reporting and uploading processes access its outputs and crash reports, all while preventing other cooking sessions from accessing the data. By utilizing this allocation file system, we were able to do exactly that.
The final requirement for this project was Windows support. Even though the Unreal Engine technically runs on Linux, we had to use Windows due to the various platform SDKs used during the cook. Luckily, deploying Nomad on Windows is a fairly simple task, and one that we already had a lot of previous experience with from the original cluster we used for MetaHuman and those other services.
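Inside a task, Nomad exposes the shared allocation directory through the `NOMAD_ALLOC_DIR` environment variable. A sketch of how the tasks in one cook could agree on a layout under it follows; the subdirectory names are hypothetical, not Epic's.

```python
import os
from pathlib import Path
from typing import Dict, Mapping


def shared_dirs(environ: Mapping[str, str] = os.environ) -> Dict[str, Path]:
    """Resolve the directories the cook's tasks exchange data through."""
    alloc_dir = Path(environ["NOMAD_ALLOC_DIR"])
    return {
        "inputs": alloc_dir / "inputs",     # written by the prestart download step
        "cache": alloc_dir / "cache",       # cached assets, refreshed after the cook
        "outputs": alloc_dir / "outputs",   # cooked output from the engine
        "crashes": alloc_dir / "crashes",   # crash reports read in poststop
    }
```

Because the directory belongs to a single allocation, another cook on the same host resolves a different path and cannot see these files.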
Nomad ships as a single binary, both for the servers in the control plane and the clients in the data plane. This makes packaging and deploying it a fairly simple procedure, relying on just configuration to define which role the node is fulfilling. For Windows nodes in our UEFN cluster, we didn't really need to do anything special for that. We just bake an AMI with the Nomad agent and whatever other operational agents we want in there for observability and whatnot — and we use NSSM to launch the agent and keep it up and running on that host.
Additionally, Nomad clients running on Windows actually have no limitations in the core primitives. All the job types, and pretty much all of the task drivers are available, too, as long as the underlying runtime is supported on Windows. This is in contrast to Kubernetes, where Windows nodes have limited functionality, and getting them up and set up inside a cluster is a far more complicated task.
Those were all some of the benefits, and while it was a great fit, developing the solution wasn't all rainbows and unicorns. After we got everything set up, we really wanted to scale up to about 20,000 concurrent cooks. These are ridiculously resource-intensive, these cooks.
We wanted to scale up to that size to prove that the cluster could withstand our predicted high-end usage patterns. This meant we needed a cluster with well over 2,000 hosts available for the jobs (see my previous statement). Again, these are beefy instances, but bringing them up and having them join the Nomad cluster was no big deal, and that just worked fine.
However, when we began our load test, even with smaller batch sizes, our cluster almost immediately started to choke. Jobs would get stuck in a pending state and stay that way for well over 30 minutes. What was weird was that the queue was fine, with no jobs in it, because they had already made it past that part of the scheduling workflow. They just sat there, pending forever. And many hosts wouldn't even get allocations assigned to them, even though we had plenty of capacity. We were using the spread algorithm, so they should have, but they didn't.
It seemed that the scheduling workers in the control plane were straight-up overwhelmed by the complexity of our jobs and the sheer number of clients they had to pore through to find the available capacity. We noticed many of the Raft metrics at that point were through the roof. Raft is heavy on disk I/O because of checkpoints to the file system.
We upgraded the control plane to much larger instance types, with all the cores and all the NVMe drives, to give Raft operations that oomph. You need cores for schedulers in the Nomad control plane because each scheduler worker in that workflow is optimized to consume an entire core by itself.
We upgraded, and alas, it didn't fix our issue, but it's alright. In the end, we worked around this by splitting the cluster into multiple smaller clusters. We went with a size of 500 nodes per cluster and changed the services that call the Nomad API to use a consistent hash ring to pick which cluster they schedule to. Once we rolled this out, everything was right as rain, and we were easily able to get to the 20,000 concurrent cook number with minimal stress on the clusters involved.
Using a consistent hash ring ended up being a win anyway. It's really easy to add and remove clusters from this ring, without any of the usual headaches that come along with federation. It also led us to spread our clusters out across multiple AWS regions, not just availability zones, for better reliability.
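For readers unfamiliar with the technique, here is a minimal consistent hash ring. It is a generic sketch, not Epic's implementation: the cluster names, the number of virtual nodes, and the choice of hashing a cook ID are all assumptions.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Maps keys to clusters so that adding or removing a cluster moves few keys."""

    def __init__(self, clusters: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (point, cluster) pairs
        for cluster in clusters:
            self.add(cluster)

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, cluster: str) -> None:
        # Each cluster owns many points on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{cluster}#{i}"), cluster))

    def remove(self, cluster: str) -> None:
        self._ring = [(p, c) for p, c in self._ring if c != cluster]

    def pick(self, key: str) -> str:
        if not self._ring:
            raise LookupError("the ring has no clusters")
        # First point at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Removing a cluster only reassigns the keys that were mapped to it; every other key keeps its cluster, which is what makes taking a cluster in or out of rotation cheap.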
Cluster size and scheduling were the brunt of the issues we saw, but there were a few minor things that we noticed. We had to write our own sidecar to time out the cooking task because there's no inherent way in Nomad to say this task can only run for this long, and if it does anything after that, kill it and error out. We had to do that ourselves.
Also, ordering tasks within a lifecycle step was another thing that we had to work around ourselves. The Nomad API helped us there, but it'd be nice if you could say, run this, then this, then this in a lifecycle step.
The Nomad API also provides an event stream, and it's an awesome thing to have in an API. You get events for every single thing that happens in your allocations and your tasks. If the Docker daemon is pulling an image, you get an event about it. You can glean a lot about what's going on in your system through this event stream. But it's really verbose. It's a giant JSON blob. When your cluster is in full swing, as it was in our load test, it's an absolute firehose of data to process.
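One way to tame that firehose is to reduce each frame to the few fields you care about as it arrives. The stream delivers newline-delimited JSON frames, each with an `Index` and a list of `Events` carrying a `Topic`, `Type`, and `Key`, with empty objects as heartbeats. The sketch below filters such lines; reading them from the HTTP response is left out, and the sample frames are invented.

```python
import json
from typing import Dict, Iterable, Iterator, Optional


def filter_events(
    lines: Iterable[str],
    topic: str,
    event_type: Optional[str] = None,
) -> Iterator[Dict[str, object]]:
    """Yield a compact summary of each event matching the topic (and type)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        frame = json.loads(line)
        # Heartbeat frames are empty objects and carry no events.
        for event in frame.get("Events") or []:
            if event.get("Topic") != topic:
                continue
            if event_type is not None and event.get("Type") != event_type:
                continue
            yield {
                "index": frame.get("Index"),
                "type": event.get("Type"),
                "key": event.get("Key"),
            }
```

Dropping the payload early keeps the consumer cheap; anything needing more detail can look up the allocation by its key afterwards.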
Here we are post-launch. It's six months and 14 million cooks later. So, what have we learned from all of this?
It seemed like a lot of our scheduling problems were due to the size of our job specs. In a lot of cases, for these complex maps, we'll sometimes have to download literally 10,000 files as inputs for the engine to work through. We were originally passing the locations of those files in with the job spec, which was then getting serialized to disk via Raft. That was the cause of a lot of our high Raft I/O: ginormous objects that should probably never have been there to begin with.
So, if you have a lot of parameters and a lot of data that you need to get to a task in Nomad, and you want to do it through a job spec, consider putting that data someplace else and letting your task pull it down when it runs.
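The size difference is easy to see with a little arithmetic. The sketch below compares a job payload that inlines 10,000 file locations in its `Meta` map with one that only carries a pointer to a manifest stored elsewhere. The key names, paths, and manifest URL are hypothetical.

```python
import json
from typing import Dict, List


def inline_meta(paths: List[str]) -> Dict[str, str]:
    """Job metadata that carries every input location itself."""
    return {"input_files": json.dumps(paths)}


def reference_meta(manifest_url: str) -> Dict[str, str]:
    """Job metadata that only points at a manifest the task downloads."""
    return {"input_manifest_url": manifest_url}


def payload_size(meta: Dict[str, str]) -> int:
    """Bytes of JSON the control plane would have to store for this metadata."""
    return len(json.dumps({"Job": {"Meta": meta}}).encode("utf-8"))


paths = [f"s3://example-bucket/project/assets/file-{i:05d}.uasset" for i in range(10_000)]
big = payload_size(inline_meta(paths))
small = payload_size(reference_meta("https://example.invalid/manifests/cook-123.json"))
print(big, small)
```

The inlined version runs to hundreds of kilobytes per job, all of which goes through Raft, while the reference stays around a hundred bytes no matter how many files the cook needs.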
Also, Nomad has a plethora of metrics, and they give you insight into all things about the control plane, the nodes, their allocations, and the like. They're pretty extensive. Understanding the Raft metrics, which we hadn't even looked at before, was really helpful. We didn't have to look at them when we ran our original cluster with MetaHuman because there was never an issue. This time we were like, oh, what are these Raft metrics? This is interesting. Why are they up and to the right? What is going on here? All of it's helpful. Familiarize yourself with it, because there are a million of them.
Going back to that event stream, it is a great way to get insight into what's going on. As an example, some of the UEFN devs contacted us and asked, why is my job taking 30 minutes to pull?
You're like, hold on, and you go check the event stream. It's because you used a container that wasn't pre-cached, and it took 20 minutes to download that. Sorry. But that was a great way to get insight into that. There's a bunch of other analytics data you can pull from it that may not be available in your normal operational data or in those other metrics.
Also, keep your Nomad version up to date. I know that's like a DevOps no-no. You're like, oh God, it's working. Don't change it. But Nomad's policy is that security updates only go back as far as the current version minus two. Always stay within that minus-two window. To the Nomad team's credit, they come out with a lot of cool, interesting features. So, you might be surprised at how much they could help you if you take the plunge and upgrade your cluster.
Apologies to anyone from Microsoft who's here, and feel free to come talk to me afterwards. But I got to say, the majority of issues that we've had in the past six months have come from running Windows, and particularly Docker's interactions with it.
The weirdest part of it is we get these syscall hangs. It's like, a Docker container tries to start. Oh, nope, the Windows kernel has said, no, no, no, I'm going to time that out. Never going to respond to your syscall. It's cool, don't worry about it. It's okay, Docker's great. Next time, I'm going to try to mount this volume to this task. And the Windows kernel's like, nope, sorry, syscall's going to hang again. Good luck. Have fun. These hangs would manifest as containers coming up and staying in pending, or containers coming up and erroring within ten seconds. It was a weird thing to work with.
Also, it required non-intuitive network tweaks. When you're operating at like 25 gigabits, you've got to tweak the kernel to handle that much traffic. But there were a lot of esoteric Windows network settings. Some of them we looked at and thought there's no way that's going to have any impact on anything. And it turned out it actually had a lot of impact on things. So, it required a lot of weird tweaking that didn't feel intuitive at all.
One of the more annoying issues we ran into is that on Windows the OS version of the container has to exactly match the host OS version. So, if you want to update the host OS, say to upgrade a version, or even a minor version, it turns into: great, we can do that, but now we've got to convert every single container in this workflow to that version. That's a lot more complicated than it sounds, given the nature of some of the containers in our workflow.
Also, resource isolation on Windows is not quite as sophisticated as on Linux, particularly in Nomad's interaction with it. In Nomad job specs on Linux, you can say, I want my task to have three cores. You just can't do that on Windows through Nomad. So, Linux definitely provides more isolation and a better situation there. The last bit is steep licensing fees. Windows is expensive, so watch out for that. But anyway, that's my talk.
Thanks for attending, and I hope to see you in the Metaverse.