Multi-Tenant Workloads & Automated Compliance with Nomad & the HashiStack at Exact Sciences
Dec 21, 2020
See how Exact Sciences uses HashiCorp Nomad, Terraform, Vault, and Consul ACLs to operate workloads within a highly regulated industry.
Exact Sciences uses Nomad to operate workloads within a highly regulated industry. In this talk, dive into the architecture of their HashiCorp environment supporting these strict requirements including Nomad namespaces, Sentinel, Consul/Vault ACL policies and build automation (Nomad job specifications, Terraform). The environment is built to operate workloads across many AWS accounts within the AWS multi-account strategy to facilitate access control, cost tracking, security, and compliance.
Hi, my name is Tim Feyereisen, and I'm a senior site reliability engineer at Exact Sciences.
Today, I'll provide an overview of our HashiCorp environment. We're going to see how working within a highly regulated industry led us to implement a multi-tenant framework for Nomad.
I studied engineering mechanics at the University of Wisconsin, where I ended up doing a lot of scientific computing. After graduating, I spent 7 years at a healthcare software company doing a variety of roles, from tech support and installs to development and performance engineering.
I've been on Exact's SRE team for 2 years, and we've done a lot in that time. I'm most proud of our team's execution on a total migration of our internal applications from a managed datacenter into AWS.
In my free time, I like to work with my hands and be outside.
Who is Exact Sciences? We provide early detection and screening for cancer. We operate labs across the country to screen for cancer and help guide treatment. We also do a lot of research in bioinformatics to develop better tests. Our goal is to deliver life-changing innovations that make cancer an afterthought.
Operating Workloads in a Highly Regulated Industry
Let's start with our high-level problem statement. We need to easily develop and operate workloads in a highly regulated industry, making sure that we have the highest levels of compliance and governance built into our environments.
By "easily operate," I mean minimizing the operational overhead of this environment by making it auto-healing and highly available and scalable.
To work within strict compliance requirements, we need to standardize and automate governance.
Finally, to rapidly develop and innovate, we want to give our engineers the right tools and not make it difficult to use the platform or comply with governance.
The Company's Agenda
Let's take a look at our agenda, starting with guiding principles, which are the high-level objectives of our infrastructure. These concepts underpin our decision making as it relates to infrastructure design and operation.
Then we'll expand on our goals as they relate to the environment.
Let's take a look at an example of one of our flagship internal applications, our lab information system. Hopefully this gives us some context as we discuss our goals.
This application runs as a Java binary on an EC2 instance. In adopting cloud-based design principles, we're refactoring this application to be a collection of services and tasks. This will both empower our engineers to innovate and simplify operations by leveraging the benefits of containerization and cloud services.
Our goal is to modernize the application without a total rewrite, so that we're breaking out pieces to run alongside the existing application with the same strong governance requirements that already exist.
Goals of the Migration
As we start to design a platform for this migration, we need to begin with our goals. The 3 goals I introduced earlier are really the guiding principles for our cloud operating model at Exact Sciences.
If we build out these goals into a more detailed view, we can start to understand our requirements better.
Within our streamlined operations principle, it's important that we have reliable deployments, single-pane monitoring, and a unified, scalable platform.
For the platform, we want to run many types of workloads from a single entry point. The platform also needs to scale indefinitely and be highly available.
Next, we need to be able to see at a glance what the health is for all jobs that are running on the platform.
Last, we want fully automated deployments, including rollbacks and the ability to support multiple deployment paths.
These 3 requirements are critical to operational excellence, but another benefit is that they free up our SRE resources so that we can spend more time innovating instead of deploying and troubleshooting.
Our next principle is to standardize governance, including cost tracking, access, and auditing. The first step of this is to be able to track costs. These costs need to be consolidated into a single view and grouped by business unit and/or service.
Next, we need very strong governance around access that begins with multi-factor authentication and least-privilege access. This access should be managed centrally from a single-source-of-truth system and provisioned with automation.
As much as possible, we need to build these components into the actual platform so that we're not deploying access piecemeal across existing infrastructure.
Finally, we need to audit every action taken in our environment. This includes authentication, but also secret retrieval, deployments, configuration changes, API calls, and more.
Our cloud footprint is growing so rapidly that unless we can bake governance into our infrastructure at a foundational layer, we're going to quickly fall out of compliance.
Our last guiding principle is empowering engineers. We hope to foster a culture of innovation by letting engineers choose their own technologies, launch their own infrastructure, and explore new services and technologies without limitations.
Giving our engineers the freedom of technology—call it bring your own tech—allows us to adopt new technologies quickly.
We want a platform that doesn't pigeonhole us into certain languages, runtimes, or environments.
Next, we want software engineers to launch their own infrastructure as needed, whether it's more compute, a cloud service, or a Docker container.
Finally, we want a platform where it's safe to explore and safe to fail. Giving engineers admin-like access seems like it conflicts with governance and compliance, but it doesn't need to if the platform is designed correctly.
As we expand on our guiding principles, we start to gather a lot of requirements for what our platform needs to accomplish.
Even though this list seems like a lot, it's not exhaustive. It does give us a good starting point, where we can begin to talk about specific technologies that can help us achieve our goals.
Achieving the Goals
Now that we've dissected our goals and gathered some requirements, let's return to our problem statement.
I propose we can achieve our goals by building a multi-tenant workload orchestration platform. Multi-tenant principles force us to standardize and pre-bake governance into our infrastructure, while a best-of-breed workload orchestration platform and the supporting tooling offer us rapid development and streamlined infrastructure operations. By expanding our goals into requirements, we've taken our problem statement and turned it into a mission statement.
Let's take a quick look back at our agenda with some additional context. The following sections expand on our goals as they relate to the platform.
Let's dig into the details now to see how this type of platform can be implemented, starting with operations.
Before talking infrastructure, we need to address 2 things: why we chose Nomad, and a bit of backstory on how our cloud organization is structured.
We started looking into Nomad due to the integration with our existing Vault infrastructure, but saw quickly how other Nomad features support our guiding principles. We really liked how we can run more than just Docker, and that namespaces offer us a great complement to our existing organization.
Here's the backstory on our cloud organization. When we talk about running anything in the cloud at Exact Sciences, we start the process by deciding where that workload fits into the diagram of organization structure shown in this slide.
Our entire platform, including AWS, HashiCorp, and other supporting services, inherits from this concept.
Governance is another extension of this concept, so that user access follows the same pattern. A user who has access to a particular account will have access to operate independently in that space, whether that's creating AWS resources, running Nomad jobs, or working with secrets.
In terms of the HashiCorp tools, these accounts map to namespaces. This philosophy of segmentation is commonly adopted in AWS as the multi-account strategy.
Beyond governance, there are other benefits to the strategy, which include limiting the blast radius for issues stemming from a compromised system and segmenting the network for easier traffic policy design and monitoring.
This is all great for governance and compliance, but it introduces a lot of complexity in managing so many accounts. We have somewhere around 100 accounts, and we're growing by about 6 per month.
There's no easy way to accomplish this without a lot of automation, so let's look at how we automate our HashiCorp infrastructure.
Within the AWS multi-account strategy, we launch Vault server into an enterprise secrets account, and Nomad and Consul into an enterprise operations account. Both of these are launched with Terraform.
Our Nomad clients are also launched with Terraform into specific business unit accounts, where they run workloads for our tenants.
One Nomad client cluster is deployed for each business unit staging environment. Launching our entire environment from Terraform means we can easily create dev and test environments for iterating on upgrades or new features.
We have 3 copies of this environment, 2 of which we regularly destroy and recreate. As we'll see in the governance section, all of the authentication components are launched as part of the environment provisioning.
Our clusters are launched as auto-scaling groups sourced from a single Hashi AMI. The machines are then configured through user data to control what services they run and how they connect.
This example shows the user data templating out the Nomad and Consul client config, meaning this example server is a Nomad client. This model has been extremely successful for us in terms of streamlining operations, specifically in being able to push out updates to Vault, Consul, and Nomad servers to quickly adopt new features.
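The slide's snippet is not reproduced in this transcript. As a minimal sketch of the idea, assuming a hypothetical `role` parameter passed through user data, the following shows how one shared image could render either a Nomad server or a Nomad client config. The function name, paths, and the `bootstrap_expect` value are illustrative assumptions, not the configuration used at Exact Sciences; the config keys themselves are standard Nomad agent settings in JSON form.

```python
import json


def render_nomad_config(role: str, datacenter: str, node_class: str = "") -> str:
    """Render a Nomad agent config (JSON form) for a machine booted from a shared image.

    `role` is a hypothetical user-data parameter: "server" or "client".
    """
    if role not in ("server", "client"):
        raise ValueError(f"unknown role: {role}")

    # Settings common to every machine built from the image (path is illustrative).
    config = {"datacenter": datacenter, "data_dir": "/opt/nomad/data"}

    if role == "server":
        # Illustrative server count; the real value is not stated in the talk.
        config["server"] = {"enabled": True, "bootstrap_expect": 3}
    else:
        # Clients register a node class so jobs can be constrained to them.
        config["client"] = {"enabled": True, "node_class": node_class}

    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_nomad_config("client", "dc1", node_class="lab-dev"))
```

Keeping the role decision in one small template is what lets every machine come from the same image: only the rendered config differs.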
Sourcing all of the machines from the same AMI means that we can focus a lot of energy on making it secure.
Details on the Clusters
Let's look at the clusters in more detail.
Vault server operates out of our enterprise secrets account launched as an auto-scaling group that maintains 5 healthy instances across 2 availability zones.
We don't run out of multiple regions yet, but are planning for this in 2021, as soon as we can support fast and reliable networking across regions. You can see in the code snippet how some variables are parameterized for the user data.
We use integrated storage for the Vault backend, where the leader IP address is a Consul DNS name.
This query pulls the Raft leader node IP from Consul, simplifying both disaster recovery and deployments, as we don't have to manage static IP addresses.
We also leverage auto-unseal with AWS KMS, which enables Vault servers to come online instantly after a scaling event, with no manual intervention.
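A rough sketch of the Vault settings described here, written as a Python function that builds the JSON form of the config. It assumes integrated (Raft) storage with a `retry_join` leader address given as a DNS name, plus the `awskms` seal. The DNS name, KMS key alias, region, node ID, and data path in the example call are placeholders, not values from the talk.

```python
import json


def vault_server_config(node_id: str, leader_dns: str, kms_key_id: str, region: str) -> dict:
    """Build a Vault server config using integrated storage and AWS KMS auto-unseal."""
    return {
        "storage": {
            "raft": {
                "path": "/opt/vault/data",  # illustrative path
                "node_id": node_id,
                # Join through a DNS name instead of a static IP address.
                "retry_join": [{"leader_api_addr": f"https://{leader_dns}:8200"}],
            }
        },
        # With a KMS seal, new instances unseal without manual key entry.
        "seal": {"awskms": {"region": region, "kms_key_id": kms_key_id}},
    }


if __name__ == "__main__":
    example = vault_server_config(
        node_id="vault-0",
        leader_dns="active.vault.service.consul",  # assumed Consul DNS name
        kms_key_id="alias/example-vault-unseal",   # placeholder key alias
        region="us-east-1",                        # placeholder region
    )
    print(json.dumps(example, indent=2))
```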
Consul server runs out of our enterprise operations account with the same availability as Vault server.
You might notice that while we leverage a lot of native AWS HashiCorp integrations, our retry join is DNS-based. Consul supports an easy retry join based on tags, but the provider is really meant to operate within a single AWS account.
Because we have clients running in many accounts, we rely on DNS in order for Consul servers and clients to join the datacenter. Instead of an internal DNS name, we plan to migrate this to Consul DNS, similar to our Vault server cluster, now that we've fully rolled out Consul DNS in our environment.
Other items to note here are the usage of ACL, Autopilot, and gossip encryption. ACL and gossip encryption support auditing and access, while Autopilot simplifies the rollout of new Consul versions.
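To make these settings concrete, here is a minimal sketch that builds a Consul server config in JSON form with a DNS-based `retry_join`, ACLs, gossip encryption, and Autopilot enabled. The DNS name, the server count of 5 (taken from "the same availability as Vault"), the default-deny ACL policy, and the placeholder gossip key are assumptions made for illustration.

```python
import json


def consul_server_config(join_dns: str, gossip_key: str, servers: int = 5) -> dict:
    """Build a Consul server config that joins by DNS name rather than by AWS tags."""
    return {
        "server": True,
        "bootstrap_expect": servers,
        # A DNS name resolves from any AWS account that can reach the servers,
        # unlike tag-based cloud auto-join, which looks up instances in one account.
        "retry_join": [join_dns],
        "encrypt": gossip_key,  # gossip encryption key
        "acl": {"enabled": True, "default_policy": "deny"},
        "autopilot": {"cleanup_dead_servers": True},
    }


if __name__ == "__main__":
    cfg = consul_server_config("consul.internal.example.com", "<gossip-key>")
    print(json.dumps(cfg, indent=2))
```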
Nomad server is also launched into our enterprise operations account. Here, we have the same DNS retry join as with Consul, which we also plan to migrate to a Consul DNS-based query.
In the Vault stanza, we can see the Nomad server is fed a Vault token on server boot. This token is generated in user data by logging into Vault with the EC2 instance's IAM profile.
We'll dig into the details here in our next section, but this is an important piece of the authentication workflow.
Finally, our Nomad clients are launched into our business units.
Our simplest form of the auto-scaling group is a baseline of 2 servers, with the ability to scale up to 10. We've adopted several other patterns to support different types of workloads.
We currently scale based on VM performance counters, but we're excited to investigate more Nomad-native ways to determine when we need to add more clients.
In terms of the actual config file, the most important piece is that we're registering a node class with the client. The value of the node class ultimately corresponds to the namespace for the client.
We use a Sentinel policy in order to make sure that a Nomad job that runs within a namespace is placed on the right Nomad client.
This introduces a requirement on our Nomad jobs, though, that they both have a node class constraint and that it matches the namespace.
In a later section, I'll discuss how we've approached the requirement to make it as transparent to engineers as possible.
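The Sentinel policy itself is not shown in the talk. The rule it enforces can be sketched in Python against a job in Nomad's JSON job format: the job must carry a `${node.class}` constraint whose value equals its namespace. This sketch assumes the node class is named exactly after the namespace and only inspects job-level constraints; a real policy might also look at group-level or task-level constraints.

```python
def node_class_matches_namespace(job: dict) -> bool:
    """Return True if the job pins itself to the node class named after its namespace.

    `job` follows the shape of Nomad's JSON job format (Namespace, Constraints).
    """
    namespace = job.get("Namespace", "default")
    for constraint in job.get("Constraints") or []:
        if (
            constraint.get("LTarget") == "${node.class}"
            and constraint.get("Operand") in ("=", "==", "is")  # equality operators
            and constraint.get("RTarget") == namespace
        ):
            return True
    return False


if __name__ == "__main__":
    compliant = {
        "ID": "report",
        "Namespace": "lab-dev",  # hypothetical namespace name
        "Constraints": [
            {"LTarget": "${node.class}", "Operand": "=", "RTarget": "lab-dev"}
        ],
    }
    print(node_class_matches_namespace(compliant))
```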
The combination of our AWS footprint and HashiCorp infrastructure drives our multi-tenant approach to the platform while enabling high availability, near limitless scalability, and, ultimately, streamlined operations.
Hashi and Governance
Now that we've looked a bit at our infrastructure, we'll move into governance. There are a lot of concepts here that we could get into at a very technical level. In this discussion, I'll focus mostly on the automation.
When we provision a cluster, we also provision the authentication, starting with a set of global authentication resources. One example of this is an AWS admin role, which administrators can use to obtain credentials through the cluster.
These credentials allow them to operate as administrators across all namespaces in Nomad, Consul, and Vault. For each namespace, we deploy AWS IAM auth roles for both the Nomad clients and our users.
Our users get tokens from Vault for the AWS account they operate in when they need to generate or view secrets.
These same concepts extend to ACL policies in Consul and Nomad. We leverage the Consul and Nomad secrets engines in Vault, which we can use to generate properly scoped ACL tokens based on an IAM role.
There are some tokens within our environment that remain static to allow for us to continue some operations if Vault were to be unavailable. These static tokens are stored in AWS Secrets Manager and are only accessible by either global administrators or the cluster servers themselves.
When a user needs access credentials, they first authenticate to AWS with MFA. These credentials are used to log into Vault, where they can then retrieve temporary access tokens for Consul and Nomad, leveraging the Consul and Nomad secrets engines.
To simplify this process, we developed a CLI tool for our engineers, which encapsulates the authentication workflow into a single command.
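The CLI tool is internal and not shown in the talk. As a hedged sketch of the sequence it wraps, the function below lists the Vault HTTP API calls such a login could make, assuming the AWS auth method and the Nomad and Consul secrets engines are mounted at their default paths (`auth/aws`, `nomad`, `consul`). The role names are hypothetical, and the signed AWS request fields that the login body needs are left out.

```python
def login_plan(vault_role: str, nomad_role: str, consul_role: str) -> list:
    """Return the ordered Vault API calls for one login, assuming default mount paths.

    Each entry is (HTTP method, path, request body or None). Nothing is sent;
    this only describes the sequence.
    """
    return [
        # 1. Exchange MFA-backed AWS credentials for a Vault token.
        #    (The signed STS request fields are omitted from this sketch.)
        ("POST", "/v1/auth/aws/login", {"role": vault_role}),
        # 2. Ask the Nomad secrets engine for a short-lived Nomad ACL token.
        ("GET", f"/v1/nomad/creds/{nomad_role}", None),
        # 3. Ask the Consul secrets engine for a short-lived Consul ACL token.
        ("GET", f"/v1/consul/creds/{consul_role}", None),
    ]


if __name__ == "__main__":
    for method, path, _body in login_plan("lab-dev", "lab-dev", "lab-dev"):
        print(method, path)
```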
We don't currently use single sign-on (SSO) for any of the HashiCorp tools, but with the recent announcements of SSO for Consul and, upcoming, for Nomad, we're planning to shift to a more standard login workflow leveraging our SSO provider.
There are 2 ways that a Nomad job can get access to Vault and Consul. The first is by interfacing directly with the Vault and Consul API.
In this workflow, the code is acting as a Vault or Consul client and authenticates to Vault using the EC2 instance profile. The instance profile is tied directly to the namespace policy, so the result is access only to secrets within the namespace.
The second method is to use the Nomad job-embedded templates, which requires that a job requests a Vault policy. In this case, the Vault client is the Nomad server, and we have to restrict the allowed ACL policy through an additional Sentinel policy.
This Sentinel policy requires that the Vault policy, which a job requests, corresponds to the namespace which the job is being launched into.
We have another requirement on Nomad jobs, similar to the node class requirement, in that the Vault policy being requested must match the namespace.
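This second rule can be sketched the same way, again in Python rather than Sentinel, over Nomad's JSON job format. The sketch assumes the Vault policy for a namespace carries exactly the namespace's name, which the talk implies ("must match") but does not spell out.

```python
def vault_policies_match_namespace(job: dict) -> bool:
    """Return True if every Vault policy requested by any task equals the job's namespace.

    Tasks that request no Vault policy are ignored.
    """
    namespace = job.get("Namespace", "default")
    for group in job.get("TaskGroups") or []:
        for task in group.get("Tasks") or []:
            vault = task.get("Vault")
            if not vault:
                continue
            for policy in vault.get("Policies") or []:
                if policy != namespace:
                    return False
    return True


if __name__ == "__main__":
    job = {
        "Namespace": "lab-dev",  # hypothetical namespace name
        "TaskGroups": [
            {"Tasks": [{"Name": "web", "Vault": {"Policies": ["lab-dev"]}}]}
        ],
    }
    print(vault_policies_match_namespace(job))
```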
We got the governance out of the way with only 2 requirements; that's not bad.
Let's move on to show how we've tried to simplify the development process for engineers while also maintaining compliance.
We have 2 common development workflows.
The first is to run a Docker container locally. We want to provide the same level of access to AWS, Consul, and Vault that an engineer would achieve by running their container as a Nomad job.
Engineers leverage our authentication CLI tool to easily authenticate and generate tokens, which they can then inject into their Docker containers.
For the second workflow, where they will run their code as a Nomad job, the authentication pieces are abstracted from the engineer deploying it.
When we run a Nomad job, the job automatically authenticates and receives a namespace-scoped policy. An engineer has access to these resources within the environment by virtue of having access to the Nomad namespace itself.
This second workflow, combined with our usage of Terraform, helps enable infrastructure as a service for our engineers.
Let's come back to those Nomad job requirements we introduced earlier.
This image describes our first iteration of helping engineers produce compliant Nomad jobs.
Instead of writing a true Nomad job spec, engineers write a Nomad job with 2 small changes. The first is that the business unit is a field at the top level. The second is that there's an additional field available within the task that handles Vault's secrets pathing.
When either the engineer or our build pipeline posts the job to our API, we template out the requirements into a valid Nomad job spec, 1 per environment within the business unit.
At this point, we can issue a request to our API and retrieve a valid Nomad job spec with all of our requirements met.
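The API behind this step is internal, so the following is only a sketch of the templating idea. It takes a simplified job with a hypothetical top-level `BusinessUnit` field and expands it into one Nomad-style job per environment, filling in the namespace, the node class constraint, and the Vault policy. The `<business unit>-<environment>` naming scheme is an assumption, and the extra secrets-pathing field is left out because its shape is not described in the talk.

```python
import copy


def expand_job(simplified: dict, environments: list) -> list:
    """Expand a simplified job into one Nomad-style job dict per environment."""
    jobs = []
    for env in environments:
        job = copy.deepcopy(simplified)  # never mutate the engineer's input
        business_unit = job.pop("BusinessUnit")
        namespace = f"{business_unit}-{env}"  # hypothetical naming scheme

        job["Namespace"] = namespace
        # Requirement 1: node class constraint matching the namespace.
        job.setdefault("Constraints", []).append(
            {"LTarget": "${node.class}", "Operand": "=", "RTarget": namespace}
        )
        # Requirement 2: each task requests the namespace's Vault policy.
        for group in job.get("TaskGroups", []):
            for task in group.get("Tasks", []):
                task["Vault"] = {"Policies": [namespace]}
        jobs.append(job)
    return jobs


if __name__ == "__main__":
    simplified = {
        "ID": "report",
        "BusinessUnit": "lab",
        "TaskGroups": [{"Tasks": [{"Name": "web"}]}],
    }
    for job in expand_job(simplified, ["dev", "test"]):
        print(job["Namespace"])
```

Because the requirements are injected at templating time, engineers never have to write the constraint or the policy name by hand.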
It works, but there's room for improvement, especially with 2 major recent changes.
The first change is that Nomad can now authenticate to Vault namespaces.
This means we can keep the same pathing for secrets within our embedded templates and get different environment-specific values based on which namespace we authenticate to.
The second change is related to our adoption of Terraform.
As we are exploring Terraform Enterprise, there is an opportunity to move our Nomad job specs entirely into Terraform and use variable replacement for namespaces and node classes.
We're also very excited about significantly improving our self-service infrastructure offerings.
We're actively pursuing both of these changes to further simplify our job deployment pattern.
What's most important here is that we've created a platform that has incredibly strict access requirements and workload isolation, while keeping a significant part of that abstracted from our engineers.
There's little overhead to get up and running in Nomad with all the integrations baked in. They're almost an afterthought to the people writing code.
Let's take a look back at our goals and summarize what we discussed.
Our first goal was to streamline operations. We showed how we automatically provision the clusters in a highly available and scalable way.
We've also succeeded in our goal of being able to launch workloads into many different AWS accounts from a single Nomad server cluster.
Our second goal was to standardize governance.
We accomplished this by heavily leveraging our existing governance standards in AWS. We minimize long-lived credentials as much as possible, and we launch all of our authentication infrastructure through automation.
Finally, we empowered our engineers by giving them sandbox environments and the freedom to explore different types of technologies.
We've also developed tooling and practices that significantly lowered the development overhead while maintaining our governance and compliance standards.
This has been a lot of content in a short talk. I hope that it gives you some ideas or concepts to think about. We're always making improvements to this platform, and I'd love to hear your thoughts, ideas, or questions.
Here are a few resources that we use when designing and building this environment. I'd recommend starting here, if you want some more technical information or design considerations:
Thanks for joining me. I'd like to acknowledge my teammates at Exact Sciences, Kevin Hubbard, Josh Kramer, and Brandon Peterson, for their contributions to this presentation on the design and implementation of our environment. Thank you as well to Bryan Schleisman, senior solution engineer at HashiCorp, who worked closely with us as we built this out.