In 2017, Roblox was growing rapidly, surpassing 30 million monthly active users (MAUs). Internal engineering teams were scaling as well — driving significantly higher levels of resource consumption, capacity demands, and frequency of changes.
Rob Cameron, technical director of cloud services, realized that their infrastructure could not keep up. Dedicated servers were leading to increased resource waste. The company’s reliance on manual workflows and homegrown tooling resulted in significant productivity bottlenecks. Prior to deploying Nomad, Roblox was challenged with:
- Deploying a new application would take up to eight weeks
- Manual scheduling via legacy tooling
- Adding additional resource capacity would take up to 12 weeks
- Management of ~10,000 on-prem servers by hand
- High Windows licensing costs (over the course of the migration, Roblox has saved more than $10 million in Windows licensing)
- Running non-containerized Windows applications
- Goal to migrate and containerize to Linux over time
Cameron knew that Roblox needed an orchestrator to modernize its infrastructure — a tool that could enable resource management, efficient scheduling, container adoption, and developer velocity at scale.
"Almost every weekend is the biggest weekend we've ever had in Roblox, our infrastructure had just become unmanageable to deal with application deployment in the old way," said Cameron.
Roblox evaluated Kubernetes, DC/OS, Docker Swarm, and HashiCorp Nomad. Roblox selected Nomad based on the following criteria:
Nomad’s simplicity enabled Roblox to set up a working cluster and deploy applications on bare metal in just four days. As a former consultant, Cameron knew firsthand the hidden costs of adopting trending technologies without properly evaluating their maintenance costs over time. Since the Infrastructure team was juggling multiple initiatives (including the migration to containers), Cameron valued Nomad’s operational ease of use and lean maintenance over more complex orchestrators such as Kubernetes.
While managed services were attractive, the cloud costs incurred by many software companies were extraordinarily high: 50 percent or more of their total revenue. Infrastructure costs, if left unchecked, were a serious obstacle to achieving profitability. Roblox wanted an orchestrator that it could operate itself on a lean budget, with a focus on cost savings. Nomad won with an operator-friendly UX, ease of use, and the ability to deploy to bare metal and cloud as a single, lightweight binary.
Flexible Workload Support
Roblox’s annual Windows licensing costs were rising to tens of millions of dollars. Cameron foresaw that, to lower costs, the company would eventually need to migrate segments of its Windows applications. Nomad’s first-class support for both Windows and Linux workloads was a big win for Roblox’s migration strategy from 32-bit Windows to 64-bit Linux.
The migration to Linux would allow Roblox to achieve greater developer productivity and finer-grained operational control. Nomad was able to remain in place as the single orchestrator, seamlessly deploying both Windows and Linux workloads before, during, and after the migration.
Today, Roblox has deployed Nomad on 11,000+ nodes in 20 clusters across bare metal and cloud — serving 100 million MAUs in 200+ countries with 99.995 percent uptime.
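To make the 99.995 percent uptime figure concrete, it can be converted into a downtime budget. The following is a minimal sketch, assuming a 365-day year; the function name is illustrative and not taken from Roblox or HashiCorp.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_budget_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime (in minutes) allowed by an uptime percentage."""
    return period_minutes * (1 - uptime_percent / 100)


# 99.995 percent uptime leaves roughly 26 minutes of downtime per year.
print(round(downtime_budget_minutes(99.995), 2))  # prints 26.28
```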
Roblox wants to maintain the performance of a large-scale platform without an overreliance on additional headcount. The operational simplicity of Nomad is the key to fast adoption and high productivity. Nomad allows Roblox to reduce the time spent learning, debugging, and fixing infrastructure so that engineers can spend more time adding value to its core gaming business. Because the technology is easy to learn, other teams within the infrastructure group are able to assist in managing the Nomad deployments.
- < 8 minutes to deploy an application globally
- ~30 minutes to onboard a new developer into deploying applications onto Nomad
- 4 SREs managing Nomad, Consul, and Vault for 11,000+ nodes across 22 clusters, serving 420+ internal developers
"We have people who are first-time system administrators deploying applications, building containers, maintaining Nomad. There is a guy on our team who worked in the IT help desk for eight years — just today he upgraded an entire cluster himself."
"That’s the value proposition that I hope people understand. People seem to get stuck on ‘I need to run Kubernetes because my friend runs it’ — but do you really use it? Can you operate it at the level that’s needed?"
With the right technologies and focus, Roblox successfully implemented its containerization strategy, which helps the company scale efficiently in dollars and personnel. By containerizing its legacy game engine, moving to 64-bit Linux, and adopting Nomad as the single orchestration platform, Roblox achieved:
- 150 to 200 percent resource utilization: run double the workload on the same hardware
- Over $10 million saved in Windows licensing
- Zero downtime to migrate application deployments from on-prem to AWS
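A migration like this leans on Nomad's node drain feature, which is available through the `nomad node drain` CLI command and the HTTP API endpoint `POST /v1/node/:node_id/drain`. The sketch below is not Roblox's tooling. It only builds (without sending) a drain request for a single on-prem node, so that the scheduler can reschedule that node's allocations onto the remaining capacity. The server address, node ID, and helper name are hypothetical placeholders.

```python
import json
from urllib import request

NANOSECONDS_PER_SECOND = 1_000_000_000


def build_drain_request(nomad_addr, node_id, deadline_seconds=3600,
                        ignore_system_jobs=False):
    """Build an HTTP request asking Nomad to drain one node.

    The request is only constructed here; nothing is sent.
    """
    body = {
        "DrainSpec": {
            # The API expects the deadline as a duration in nanoseconds.
            "Deadline": deadline_seconds * NANOSECONDS_PER_SECOND,
            "IgnoreSystemJobs": ignore_system_jobs,
        }
    }
    return request.Request(
        url=f"{nomad_addr.rstrip('/')}/v1/node/{node_id}/drain",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address and node ID, for illustration only.
req = build_drain_request("http://127.0.0.1:4646", "example-node-id")
print(req.get_method(), req.full_url)
# To actually submit it against a real cluster: request.urlopen(req)
```

Draining on-prem nodes in small batches while adding AWS nodes is one way to shrink local capacity gradually; the batch size and deadline would depend on the workloads involved.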
"It takes no effort. We take a data center in Nomad and split it across AWS and on-prem and just increase compute in AWS and decrease compute in our local data center. We drain nodes in Nomad and the applications/jobs migrate silently from on-prem to AWS with no understanding needed by the end developers."
As Roblox goes through its containerization journey, the company recognizes that the key to choosing technologies is maximizing business value and empowering engineers to solve the right problem. Nomad has enabled Roblox to scale rapidly to 100 million MAUs without additional operational overhead. With Nomad as the foundation of its platform, Roblox can continue to scale and reach more players globally.
"We didn’t want to choose any technology that requires the company to drive deep expertise, almost to the point where you have to be a code contributor back into the project to get what you want. Nomad is just very easy to adopt. For our developers who understand containers and microservices, Nomad is an immediate [tool that enables us] to move forward."
Rob Cameron, Technical Director of Cloud Services, Roblox
Rob has been solving hard technical challenges for nearly 20 years in the industry, consulting with over one thousand different organizations around security and scalable infrastructure. Before focusing on the technical challenges facing the gaming industry, he spent most of his career working for Juniper Networks in the security space.
- Majority on-premises (bare metal), AWS, GCP, Azure
- Workload type:
- Linux, Windows
- Container Runtime:
- Jenkins, TeamCity, CircleCI, Drone
- Data Service:
- CockroachDB, MongoDB, InfluxDB, ElasticSearch, Vitess, MSSQL
- Portworx, EBS
- Version Control:
- GitLab, GitHub, GitHub Enterprise
- Consul, HAProxy, Traefik
- Security management: