Find out how Oscar Health Insurance built its own CI engine with Nomad and Docker to achieve near-instantaneous CI job startups in their multi-gigabyte repo.
Developers have plenty of options for CI engines and tools for running tests or building artifacts. Tools like Buildkit, TravisCI, and the widely popular Jenkins.
Unfortunately, many of these tools aren't options if you are working with large repositories, like monorepos, or if you have sensitive information like Protected Health Information (PHI) or Personally Identifiable Information (PII).
Wyatt Anderson, a software engineer at Oscar Health Insurance, will show you how his team is able to work with PHI by building and testing thousands of targets on their CI system, which is built on HashiCorp Nomad. This fast, distributed, scalable, and maintainable system uses Nomad along with Docker to achieve near-instantaneous CI job startups within their multi-gigabyte repository.
Hi, everyone. Thanks for coming to my talk today. I’m really excited to be here. I don’t know if any other presenters have made this joke yet, but I feel like I’m presenting in the Palace of Versailles in this ridiculous room here.
So: talking about scalable continuous integration with Nomad and Docker. I’m going to give some introductions on myself and the company I work for, some background on the problem I’m going to talk about: what we tried and what didn’t work, talk about what we built with Nomad. And then what we learned as we built our internal product.
My name’s Wyatt Anderson. There’s my Twitter handle and website, if you’re interested. I work for a company called Oscar Health. Oscar Health is a health insurance startup in New York City. We are building a consumer-focused, technology-driven healthcare experience from the ground up.
I work on a team called Product Infrastructure, and shout out to my teammates who are here today and everyone else from Oscar. Product Infrastructure at Oscar is responsible for a lot of Oscar’s web application infrastructure, UI frameworks, CI and CD, and our authentication systems as well.
And today I’m going to focus on the CI and CD part of that. CI stands for “continuous integration.” Philosophically for us, that means we want to have our developers merging and integrating their changes with every other developer’s changes, as frequently as possible. And we want to be testing those as frequently as possible.
We don’t want long-running feature branches. We don’t want people off doing submarine development. We want to get all that code integrated in the trunk and tested and hopefully deployed as quickly as possible.
So we’re doing a lot of batch scheduling of our CI/CD jobs with Nomad, and that’s what I’m going to talk about today.
So how do you build a health insurer from the ground up? Health insurance is an industry that doesn’t see a lot of frequent innovation, especially from the tech side. So what you end up having to do is build a lot of stuff from scratch.
We want to build a new, better health insurance company that really innovates. We don’t want to do the status quo because that’s not really exciting. So building a lot of stuff from scratch, there’s a lot of complexity right out of the gate.
The minimum viable product for a health insurer is dozens of jobs and services. It’s really difficult to even communicate the scale of the problem that we faced. For us, it’s not just a web app and a backend and a few little jobs and services here and there.
Today, we’ve got dozens of web applications, hundreds of jobs and services, thousands of test suites, all worked on by 110 or so engineers and growing. So how do we organize that complexity in our codebase?
The hypothetical view of what our architecture might look like—and this is radically simplified: You might have core libraries and payments models and our member enrollment models and provider data all tied into claims processing and our member website, the marketing website, the internal website, and the provider website.
Traditionally, you might organize all of these things into separate Git repositories, and then use a microservice architecture to communicate between them. You might publish packages to an internal package repository so that you can share some of these core libraries and model code.
We took a different track and we adopted what’s called a “monorepo.” Monorepos became popular at Google and Facebook and a lot of other companies with a lot of time and money to invest in tooling. In a monorepo, you organize all of your application code in one giant repository.
And I know some people probably bristle at the mention of that. It’s not for everybody, but we’ve made an investment in it, and it’s worked out well for us.
The overall architecture might look very similar, it’s just that you have 1 repository out on the edge there. But the component pieces of your architecture, and of your codebase, are going to look very similar.
You might still build the same things, you’re going to have those same dependency lines between all your jobs and services. You’re still going to have directories full of source code. You’re going to have these core shared libraries in your models and your websites.
You still need to organize your monorepo. You need to build your own boundaries with tools and process. We use a build system called PANTS to organize our monorepo. Our build system lets us describe explicitly the dependencies between our systems. It lets us describe the way that we want to build our software, lets us describe the build process for Python, for Go, for Java, for doing code gen for APIs, and it’s critical for navigating a large code base in a monorepo.
It’s cool because we can introspect our build graph, we can ask questions of our build system about our source code. We can ask, “What depends on this?” or, “How do I build this?” or, “What tests do I need to run when I change these source code files?”
The build graph in your monorepo might look something like this. This is a very small subset of ours. We can generate graphs like this from our build system so we can see these dependencies and see how components interplay with each other.
The green rectangles represent software libraries that we’ve written. The purple rectangles represent third-party code that we’re importing. And that yellow rectangle represents auto-generated API code that we can pull in. And our build system can use this information to grab exactly the source code that we need, to download and compile the libraries that we want, to generate our source code, package it all up into 1 distributable ...
We don’t have to ship our entire multi-gigabyte monorepo to every machine that we want to run our software on, because we can package everything automatically with the information contained in our build system.
Build systems make it easier to build and test a complex codebase, especially in a monorepo. Testing is critical. It’s really important for our team. Testing lets us move quickly. It gives us the confidence that we need to deploy hundreds of times a day, to be able to release at any time.
Our engineers can drive their own release process. They can ship code whenever they want. They can deploy fixes, they can deploy new features, they can interact with other teams. And it’s given us the ability to move very quickly as an organization, but not break things.
Fast and accurate test feedback is critical for maintaining that developer velocity at Oscar. We want to make it so that our engineers and our developers can receive the results of their test runs as soon as possible, and with as much accuracy as possible. Because if you have to sit around for an hour, for 3 hours, for the build to complete, you’re going to go do something else. You’re going to get distracted.
We want to be able to iterate quickly. We want to integrate our changes quickly. And we want to ship features quickly and safely.
If tests take a long time to run, if test results are inaccurate, people aren’t going to write tests. They’re not going to be motivated to spend time thoughtfully writing tests for their code. And people won’t pay attention to the test results. They won’t wait around for build failures. They’ll ignore build failures because flaky tests happen, or because this test always breaks or whatever. “I’ve got to ship my code.” We want to avoid that.
Testing in monorepos can be challenging, because a lot of off-the-shelf testing systems like Jenkins or hosted services like Travis CI or Buildkite aren’t designed with monorepos in mind. If you’re on the order of a couple hundred megabytes or tens of megabytes, maybe that’s easy. But if you’re dealing with multi-gigabyte repositories, pulling all that source code from your Git server or even just doing checkouts on disk can be very I/O-intensive, and that breaks down.
We also have a lot of test volume in our monorepo. Because we have that very comprehensive build graph, we can be very comprehensive in our test strategies. We have a library called Flag, which is open source. It’s on our Git hub, and we use Flag for parsing command-line flags, environment variables, and INI files into application and service configuration.
We can use our build graph to determine that we have about 1,500 things depending directly on that flag library. So that’s 1,500 Python files, at least, that import that library and use that directly. But it spirals out of control when we talk about transitive dependencies of library code.
Transitive dependencies mean that you import something that imports something that imports something else that imports the thing that you’re interested in. So on that build graph I showed earlier, you can see that on the left-hand side we have a job on a service and then that imports 1 target and another target that imports the other thing.
So there are over 6,600 transitive dependencies of that library. So for people working on core application infrastructure, shared code, shared libraries, we want to be able to test everything that depends on those core libraries. Or even on a smaller scale; maybe you have 660 things that depend on your code, or 66, or 6.
We want to be able to figure out everything that depends on that code and test it all at once when you make a change. And a real big benefit of a monorepo is that when you propose a change to a core shared library or to any kind of code in your codebase, you can figure out exactly what you need to test and you can test all of that.
But that creates a lot of test volume, and you’ve got to scale that out somewhere.
So the resource demands for building and testing in a monorepo are somewhat high. We build the world every night. That means we build every buildable target in our codebase. So that’s binaries, applications in code that we might actually deploy. And we also run every test every night. That’s about 4,500 targets. Takes about 97 hours of compute time; it’s about 4 days.
We do that in about 45 minutes of wall-clock time. We could probably speed that up, but 45 minutes is a lot better than 4 days. But without that massive parallelism, even with 32 cores, that’s still 3 hours. And multiply that kind of weight by 100 developers and you run out of cores, and everybody gets impatient, and your organization can’t scale quickly.
Honestly, even 45 minutes would be way too long for the average. The complexity of the amount of compute that you’re going to need to run your build system can roughly be described as on the order of the number of changes per day. So that scales with the number of developers that you have, the number of changes they’re making to systems, the number of tests, and then the connectedness of your build graph: how many connections there are between certain things, how many things depend on other things, how intertwined your codebase is.
Thankfully for us, as we’ve grown and scaled and moved in the direction of a service-oriented architecture, the connectedness overall has gone down.
But in core libraries, core utilities, things like configuration or database connections or secrets management, those are still pretty connected.
For us, that means we’ve had about 900,000 build jobs in the last 30 days. That’s 60,000 per day and growing, 6.6 million build jobs in the last year, and then a few hundred per minute at peak.
This only grows over time for us. We’re growing really aggressively as an organization, we’re building a lot more software, we’re trying to modularize things. And so we need a way to manage this workload.
Early on, we used to run our tests sequentially, and we had one Jenkins job that would get kicked off whenever you submitted a pull request. The test eventually got to a point where it took 24 hours to run a build, and everyone stopped paying attention, and people didn’t write tests. We had to stop everybody and be like, “OK, we’ve got to fix it. We’re going to fix all these tests, we’re going to fix our CI systems, because this just isn’t sustainable.”
So we need to run our tests with massive parallelism. We used a system called Kochiku, which is an open-source build system that was developed at Square, the payments company. Kochiku was great and it solved a lot of problems for us, because it was one of the first open-source build systems that was designed for this sort of workflow, where you have one way to build things, but that one build job of “Run the tests” or “Build the software” can spiral off into hundreds or thousands of different kinds of parameterized build jobs for your source code.
Kochiku’s architecture looked like this. It has a web application, a partitioner. The partitioner is what is responsible for looking at the changes in your pull requests and your commit, seeing that you changed this file, this file, and this file, and mapping that to a list of things that you want to build and test.
Then those test jobs get distributed off to your workers, and they go to work and run your tests and build your code.
Challenges with that: disk I/O. There was a comment in the Ruby source code for Kochiku that says, “Clone the repo,” and in parentheses it says, “Fast.” And cloning a multi-gigabyte repository is never fast. So as we grew and scaled, the disks on our build machines got hammered.
We were using GP2 EBS volumes, which have burstable I/O capacity. And during busy times, we would exhaust all of our I/O capacity in the first few minutes of an hour, and then the I/O performance of the disk would drop to like 100 IOPS, and everyone could go home for the day because you can’t get anything done.
So we moved to ephemeral SSDs. We put some Band-Aids on this, but it still didn’t work super well. Kochiku’s job configuration wasn’t flexible enough for us either. We wanted to be able to parameterize our build jobs and say, “We want more RAM for these Java tests. We want more CPUs for this integration test. We want to be able to run multiple containers for a very high-level integration test. We want to build iOS binaries. We want to run tests in IE on Windows.”
That didn’t work. Even if we wanted to do that, it’s an open-source project. We could’ve tweaked things, but Kochiku was difficult for us to customize because Kochiku’s written in Ruby, and Oscar isn’t a Ruby shop, plain and simple.
So, I think, for a Ruby organization, Kochiku might be a great choice. For us, it wasn’t.
A natural fit for us was Nomad. We’re already big users of other HashiCorp products. We really love Consul, Terraform, Vault, Packer, Vagrant to an extent. So Nomad seemed like a great idea. Nomad’s got a great batch scheduler, and this is a huge batch workload.
So what we did is we just replaced those worker nodes with a Nomad cluster, and a couple of other abstractions that I’m going to gloss over.
So how does Nomad fit into our architecture? We’ve abstracted away some of the Nomad details with a broker system, where we have a job dispatcher where our CI system will send our build jobs to that, and then it will send the build jobs to Nomad and record the job ID. Nomad’s going to churn over all these build jobs. It’s going to run the test. It’s going to build your binaries.
And then we use nomad-firehose, which is an open-source tool designed to generate a firehose of all the events that are happening in your Nomad cluster. So every time a job gets scheduled, every time an allocation changes state, those are going to get sent to our external log of activity that’s a Kafka topic, where we can consume those events at our own pace, and then map that back to our own data models for builds and tests that we can represent to developers.
So if allocations get rescheduled because Amazon replaced a machine or canceled our spot instance, that’s very transparent to the user, and we abstract all of that away. We just want the tests to run. We want them to run in a timely fashion. We don’t really care too much about the details.
So what do the jobs look like? We run everything in 1 job that we parameterize. We use Nomad’s priority capabilities to prioritize small build jobs over very large ones. So if you’re changing only a few test files and you only need to run a few tests, that’s going to be very fast and it’s going to get prioritized ahead of someone who’s making changes to core library code and needs to run 3,000 build jobs.
We run 2 tasks inside of a task group. We run a log collector and then the tests themselves. We dynamically assign resources to the tests at schedule time, so most tests get about 2,800 MHz of CPU and 3.5GB of RAM. And then based on historical performance or a couple of other defaults, we can bump that up with a multiplier.
So, if you’re building Java source code and you’re running a large Java integration test, we can throw a lot of RAM at you. Some of our Go compilation jobs surprisingly take a lot of memory. And then if you just want a lot of CPU or RAM, you can opt in to that in the configuration for your build job.
We use the log collector because we want to be able to isolate our tests in the test container. Tests write logs, build artifacts, screenshots, video recordings to the shared allocation directory between the tasks. And then the log collector’s responsible for vacuuming all those up at the end of the test run and shipping them off to our system.
It’s kind of cool that we can isolate our tests, run with no Docker networking, because they don’t need to pull any dependencies down. They don’t need to communicate with the outside world at all.
So how do we make it fast? Because I still haven’t talked about how we solved the problem of dealing with this multi-gigabyte Git repository. Nomad wouldn’t help us at all if we still had to do 4,500 Git checkouts or something. Cloning a 3GB Git repo is not fast. You’re never going to make that fast. As we add more code to the repo, as we grow, that’s only going to get more and more painful.
So, if you’re building 4,500 things, what happens is someone will submit code for review, we’ll figure out that you need to build 4,500 things. But we want that to run as if they’re all running in that same workspace, where you have got the repository checked out at a certain commit. Doing that 4,500 times is wasteful and it’s not particularly fast.
So we kind of abuse Docker to give us a snapshot of the repository at a certain point in time. And then we’ve optimized the distribution of that image so that we can start things up really quickly and avoid doing any Git operations at all in the individual test run.
In fact, we nuke the Git directory from the eventual image that we’re running things in, because we really only wanted to say, “Take a tarball, build a layered filesystem, and then run the tests.”
We build a master image every 24 hours. That master image is about 10BG. It contains the repository checked out at master, contains all the dependencies, all your third-party library code, the build toolchain, any interpreters or compilers that you might need, and then any system dependencies.
Then every time someone submits a commit to test, we build a patch image based on the latest master image. So those patch images are only about 1MB, 5MB, and rarely they might grow into the tens or hundreds of megabytes if you’re testing a very substantial change. But in the average case it’s very small. It’s easy to distribute an image that size. And eventually they get cached on our Nomad client nodes.
Docker’s great at caching images, at least in the most recent versions of Docker. We only have to download that master image once every 24 hours, or when the machine boots when we’re scaling up and down our cluster.
So we’re retrieving or even interacting with that 10GB master image very infrequently. And then once your patch image is pulled on the machine, they’re cached in memory, they’re cached on disk, and it’s very efficient to set up a container with the filesystem that looks like you did a big Git checkout or a Git clone. But in fact you just complied a couple of filesystem layers.
So how do we scale this solution? Because we still have the problem of test volume, we still have tens of thousands of tests a day, hundreds of thousands, millions of tests a month. We use EC2. So we use i3.2xlarge EC2 instances.
The i3 instance types are for very I/O-intensive workloads. The instances have about 8 cores, 61GB of RAM, and 1.6TB of directly attached NVMe SSDs. So that’s not a network block device; that’s like an SSD that’s physically attached to the server that you’re using, so you get 80,000 IOPS or something.
We’re very resilient to any kind of failures in here. We don’t care about data loss or even the node disappearing. So we can use that ephemeral storage really effectively. It makes our I/O very fast. It makes all those Docker image operations very fast.
And then it also benefits our test runs, because if we’re doing integration tests with databases or storage systems, we can use that local storage and the extremely high amount of RAM here to make that fast.
We scale between about 2 and 100 of these client nodes, depending on developer load. We always go down to about 2 nodes at night, so we have some baseline capacity for developers who might be night owls. And then we can scale all the way up to 100. We don’t need to go beyond that, but I think we could pretty easily.
We’ve optimized our AMI so that we can boot these machines to ready in about 5 minutes. We could probably get a little more than that if we wanted to, but we don’t need to react that quickly to changing demand. We use systemd jobs to download that master image before we even start Nomad and Docker and format the ephemeral filesystems.
Because that image is large, we have to pull it from our Docker repository. That takes a couple of minutes. We don’t want Nomad to be running at that point, because if Nomad starts accepting those jobs, it’s going to block those jobs where there could be other Nomad machines that could be ready to go and work on those things.
And so you’re going to get 7 jobs binpacked onto that machine, and they’re just going to be sitting there waiting for this image to download, doing nothing. So we order things in a way.
I used to think systemd was nightmarish, but in this case it really helped us solve this problem.
And Nomad handles these disappearing nodes and rescheduled jobs really well. We tend to see either very rare terminations of our spot instances, or we’ll suddenly lose 15% of our entire fleet.
The nodes get garbage-collected with Nomad on a configurable timeout basis. Amazon is supposed to tell you when they’re going to terminate a node. You’re supposed to get 2 minutes of warning. I don’t think we’ve ever seen that happen. I think we usually just see our nodes disappear and it’s like, “OK, well, bye.”
So we use an external job that directly interacts with the EC2 API to scale and maintain the cluster capacity. Because of some historical reasons, we can’t use auto-scaling groups or spot fleets, but we haven’t really needed to. So we just watch our queue depth and watch for terminations, and then scale up or down as we need to.
What else do we like about Nomad? There are a lot of nice things that we’ve benefited from. The amazing HTTP API: It was really easy to build Nomad into our architecture and just treat it as another configurable component.
You can generate API clients for it, but we haven’t even needed to do that. We can operate just about any aspect of our Nomad cluster with this API. We’ve rarely had to do anything else. I don’t think we’ve touched the configuration management code for Nomad for a long time, because you can just do everything else on the API.
It’s got Linux, macOS, and Windows agents. Most of our builds take place in Linux, in EC2, which is great. Windows, same thing. If we want to run tests with Internet Explorer, we can do that. And then macOS. If we want to build our iPhone apps and test those, we can do that too.
I wish we could spin up a Mac VM in EC2, but we can dream, can’t we?
Flexible task drivers, not just Docker. We can do
raw_exec, we can run commands directly in the box. We can run JVM things. And as of Nomad 0.9, we can write custom drivers if we wanted to.
It works great with Consul and Vault. When these nodes boot up, they automatically join the cluster. They automatically provision. We can use Vault to inject secrets into the containers, because not only do we run builds and tests in the containers but we build these Docker images. So we need to retrieve credentials for our VCS and we need to retrieve credentials for our code review system so that we can interact with the changes the developers are pushing.
And in general, it’s been very easy to integrate into our existing stack. Like many HashiCorp products, it’s just a single Go binary. We can distribute that, we can run it with systemd on our existing Linux machines. And we haven’t had to rethink anything about the rest of our systems architecture in order to incorporate Nomad into our stack to solve this problem.
So that’s what we’ve done. We’ve learned a lot of things along the way. A couple tidbits. I don’t necessarily know if these’ll be applicable to you, but I’m going to share them anyway.
Size your Nomad server nodes accordingly. Once we started to scale up, we noticed that Nomad will use a lot of CPU. Throw a lot of CPU at it. Throw a lot of RAM at it. We currently run our server agents on 8 core machines with 20-30-something gigabytes of RAM. And they routinely sit at about 90% CPU usage.
Before we did that, we’d notice some starvation and poor scheduling performance. But since we’ve thrown enough CPU at the problem, it’s worked great. Smooth sailing.
Consider enabling leaveoninterrupt, to make your upgrades easier. This may have been solved otherwise, and some of the Nomad devs can confirm or not. But when we upgraded from Nomad 0.6 to 0.8, we hadn’t had this enabled. And our plan was to spin up replacement server agents and drain the old ones. But I couldn’t find a way to get the currently elected leader to gracefully leave the cluster because we hadn’t turned this on.
So when we stopped the jobs, we just had to do it in an ungraceful way. Since then, since enabling this, when we do a “CTL stop Nomad,” this just works as intended. We can spin up new machines, terminate the old ones. Easy.
And then consider externalizing your logs. Not just job log output, but event logs. So this is the nomad-firehose stuff I talked about earlier. In our experience, Nomad has worked really well when we haven’t mucked around with any of the allocation or job garbage-collection parameters.
We’re throwing pretty high scheduling volume at this system. If you’re expecting to be able to go and query jobs on the API, to look up allocation information, historically you might find that it’s been garbage-collected and Nomad’s going to tell you, “I don’t know what you’re talking about. This never happened.”
So consider externalizing that into a more suitable storage system, something like Nomad or RabbitMQ or whatever Amazon’s log system is. Just so that you can query that on your own terms and let Nomad keep its Raft synchronization small and concise and everything else.
So that’s what we did. Thank you very much. There’s no Q&A, but I’ll be around for a bit and then I’ll probably be at the social thing tonight. So if you have any questions, I’d love to talk about this. I’d really love to talk to anyone who’s facing similar problems.