Q2's Nomad Journey and Deep Dive
Dec 01, 2020
Learn why and how Q2 uses HashiCorp Nomad after their evaluation of several major orchestrators.
Walk through Q2's Nomad environment and how they leverage the Consul and Vault integrations. They retrieve secrets dynamically using the token auth method, plus dynamic certificates with Vault. The presentation will also explore templating, reusing container images between stacks, and cover some of the best practices found running Nomad on-premises. Lastly, this talk will cover their management of a fleet of VMs in their traditional VMware datacenter environments using modern methods.
I'm Cody Jarrett, and I'm the SRE Lead here at Q2. I've been with the company for about six years. I work on the team that's responsible for our architecture and operations of the orchestrator and distributed systems platforms at Q2. Today I'll be talking about why Q2 is using Nomad, and we'll dive into our environment to discuss how we use the tools. Finally, we'll share some of our best practices that we follow.
Who Is Q2?
We provide a comprehensive digital banking solution, which includes account openings, digital online banking, lending solutions, fraud protection, and a lot more. We enable financial institutions to provide rich data-driven experiences to every account holder. And we meet the needs of consumer banking, small business, corporate banking, and even FinTech companies.
To show our scale, one in ten online banking users in the U.S. uses Q2. $2.6 million worth of money movement flows across our platform every minute — or about $1.5 trillion annually — and 33% of the top 100 U.S. banks are Q2 customers. Most of this activity is happening within applications that are running on top of Nomad.
What Problems Did We Need to Solve?
Getting into our Nomad usage, let's talk about where we were about three years ago. Back in the 2016-2017 timeframe, our applications were running in a very dated monolithic style. There was a lot of manual and tedious work involved in deploying and upgrading our applications, and a high potential for mistakes as well. Apps needed a lot of care and feeding after they were deployed and running. Managing the lifecycle was often painful, and just ensuring that the app stayed running took a lot of time.
People responded to alerts and restarted the services by hand. Introducing new changes across the entire environment was nearly impossible and super-risky. If we wanted to mass-rollout changes to all customers at the same time, it took a lot of planning and work and had a high level of risk.
It was very difficult to roll back changes after deployments. It was a very manual task that involved a lot of unwinding of the application pieces that were changed. We needed a better way to scale out the environments to enable faster changes and self-service. It was clear we needed to overhaul as we continued to scale.
While evaluating solutions and the path forward, several important considerations had to be met. First, a large portion of our application is Windows-based. Our core business logic, for example, runs on IIS. We needed to have Windows and Linux support.
Although we prefer running containers, we didn't necessarily want a hard requirement to have to use them. We liked the idea of being able to run apps directly on VMs if the use case called for it.
We wanted to begin using workload orchestration for our current applications and support the future. We wanted to make improvements to workflows without massive and time-consuming rewrites of our existing applications. Because of the industry we're in, security and compliance enforcement was extremely critical. Finally, we needed the solution to be simple, easy to use, and maintain.
Selecting HashiCorp Nomad
We evaluated several of the major orchestrators. Ultimately, we chose Nomad because it met all of our requirements — it made the most sense in our environment. Let's talk about some of the outcomes we've been able to achieve with Nomad.
Increased Deployment Across the Board
Since deployments are much simpler, it's become much easier to stand up applications, which has allowed for better testing and experimentation and made applications easier to change in general. Deployments and changes also have a better history for auditing purposes.
We gained a lot of extra resiliency for our production stacks as well. One way we did this was by preventing single points of failure within application stacks and ensuring that multiple copies are running at the same time. Nomad automatically ensures the services stay running and healthy, and it handles VM failures and maintenance events for us.
Improved Standards and Security
Our customers expect extremely secure and compliant environments — and Nomad is helping us achieve that. We've been able to improve and enforce standards around all of our applications. App stacks are defined as code now — and version-controlled — and go through a peer review process.
We've been able to funnel changes through automation tools to enforce standards. Tools like Sentinel, for example. Sentinel has allowed us to enforce allowed drivers, allowed services, and standard config options within jobs. It's also allowed us to specify minimum and maximum resource standards.
Access to manage and deploy jobs is tightly controlled using Nomad ACLs and Vault. This means changes have an audit trail, and secrets that may have once existed in config files now have been moved to Vault. Nomad and Vault both have audit logging capabilities. That lets you have a better understanding of actions that take place in the environment — and it’s something that we use heavily.
Legacy and New Applications
This is one of the biggest wins we achieved with Nomad. We were able to move our existing application workloads into Nomad without costly, time-consuming application rewrites. This meant we gained a ton of benefits of using an orchestrator while parallel efforts began developing new and future applications that also run in Nomad.
Nomad let us enable more self-service around standing up our environments. This reduced load on other teams, which led to faster delivery. Nomad handles placements of the applications in the environment, which was one less concern for the application deployer.
Latest Technology Practices
Using Nomad, our SRE and platform engineering teams have been able to transition to some of the latest technology management practices. We've been able to move from the pets to cattle model. Underlying VMs are much more generic now, which is great for lifecycle and maintenance purposes. We've been able to shift more of our concerns to helping and enabling our Dev teams and other business units at Q2.
Let's dive into our environment next and see what that looks like. Today we're running over 7,000 jobs in Nomad across the environment. That translates into over 40,000 tasks across 1,500 VMs. We use most of the HashiStack today: Nomad, Consul, and Vault. We also use Terraform and Packer for VM management. Most of our Nomad infrastructure lives on-prem in very traditional (but modern) datacenters. We do have some workloads with cloud providers as well.
We have several Nomad and Consul clusters across multiple environments, and we federate those between datacenters. Our networks are heavily segmented and firewalled — and we align Nomad node clusters to these different networks.
We make use of the constraint parameter within jobs to match workloads to their appropriate networks. We also use Consul network segments to align segments to the different networks. This is so we don't send full gossip traffic across firewall boundaries and tiers.
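As a rough sketch of the constraint approach, a job can be pinned to the clients in one network along these lines. The `network_tier` client metadata key, its `dmz` value, the job name, and the image are hypothetical; the talk does not say which attribute Q2 matches on.

```hcl
job "customer01-frontend" {
  datacenters = ["dc1"]

  # Only place this job on clients whose agent config sets
  # meta { network_tier = "dmz" } (a hypothetical metadata key).
  constraint {
    attribute = "${meta.network_tier}"
    value     = "dmz"
  }

  group "frontend" {
    task "frontend" {
      driver = "docker"

      config {
        image = "registry.example.com/frontend:1.0" # placeholder image
      }
    }
  }
}
```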
We also like that Nomad didn't require us to use overlay networks. We mostly use bridge and host mode at the Docker level, which helps keep things simple. Nomad now does have CNI support. That's something we may look into in the future if we need it.
One cool thing about Nomad is you're not limited to running containers. There are different task drivers available like Docker, Podman, Firecracker, and even straight executables. We use the Docker and Exec drivers, plus we wrote our own Microsoft IIS App Pool driver. This lets us run a huge number of Docker containers on Linux VMs. And for Windows VMs, we run executables and IIS app pools directly on the VMs.
We tried starting with Docker on Windows but had way too many challenges there — like stability and loss of density. With the app pool driver, we maintain a fleet of VMs running IIS plus the Nomad and Consul agents. When we submit a job, the artifact stanza pulls down the app pool content and starts it up within IIS. Our next steps are moving away from IIS-bound applications to straight executables as we move to .NET Core.
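Q2's IIS App Pool driver is custom, so its configuration is not shown here. As an illustration of the "straight executable" direction only, the sketch below pairs the artifact stanza with the stock `raw_exec` driver on a Windows client. The job name, artifact URL, and `api.exe` binary are assumptions, and `raw_exec` has to be enabled on the client.

```hcl
job "customer01-api" {
  datacenters = ["dc1"]

  # Built-in node attribute; restricts placement to Windows clients.
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "windows"
  }

  group "api" {
    task "api" {
      # raw_exec runs the binary directly on the host, without isolation.
      driver = "raw_exec"

      # Downloaded into the task's local/ directory before the task
      # starts; archives such as .zip are unpacked automatically.
      artifact {
        source = "https://artifacts.example.com/customer01/api.zip" # placeholder URL
      }

      config {
        command = "local/api.exe" # hypothetical binary inside the archive
      }
    }
  }
}
```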
We generate Nomad jobs through a Jinja templating process and store those in version control. Here's a snippet of a job broken into two images in familiar HCL format. We generate a job file per Nomad region today but look forward to trying out the new multi-region deployments. We treat our physical datacenters as Nomad regions, primarily due to how the Nomad leader retrieves the Vault tokens from the local Vault cluster.
We leverage template stanzas very heavily within jobs. This lets us keep generic containers and app pools — and we layer on the customizations we need at runtime with template stanzas in the job files. This is using Consul template behind the scenes. That lets you do nice templating and looping. It also lets you pull secrets and certificates from Vault. You don't have to store those directly in job files.
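A minimal sketch of that layering, assuming a KV version 2 secrets engine mounted at `secret/` and a Consul key for a log level. The policy name, paths, and variable names are all hypothetical, not Q2's actual layout.

```hcl
job "customer01-frontend" {
  datacenters = ["dc1"]

  group "frontend" {
    task "frontend" {
      driver = "docker"

      config {
        image = "registry.example.com/frontend:1.0" # generic image, placeholder name
      }

      # Vault policies the task's token should carry (hypothetical name).
      vault {
        policies = ["customer01-frontend"]
      }

      # Rendered at runtime by Consul Template; env = true exposes the
      # KEY=value lines to the task as environment variables.
      template {
        destination = "secrets/app.env"
        env         = true
        data        = <<EOT
{{ with secret "secret/data/customer01/frontend" }}
DB_PASSWORD={{ .Data.data.password }}
{{ end }}
LOG_LEVEL={{ keyOrDefault "customer01/frontend/log_level" "info" }}
EOT
      }
    }
  }
}
```

Because the customization lives in the job file, the same image can be reused across stacks with only the template inputs changing.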
With Vault PKI, certificates are pulled dynamically when they're needed. And since they're treated as a lease, Nomad and Vault work together to automatically rotate and refresh those certificates when they expire.
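For illustration, a template stanza that issues a certificate from a Vault PKI mount might look like the fragment below. It belongs inside a task that also has a `vault` stanza. The `pki` mount path, the `customer01-frontend` role, the common name, and the TTL are assumptions.

```hcl
# Fragment: place inside a task block that also declares a vault stanza.
template {
  destination = "secrets/frontend.pem"
  change_mode = "restart" # restart the task when a new certificate is rendered

  # Certificate and key come from the same issue call, so they always match.
  data = <<EOT
{{ with secret "pki/issue/customer01-frontend" "common_name=customer01-frontend.service.consul" "ttl=72h" }}
{{ .Data.certificate }}
{{ .Data.private_key }}
{{ end }}
EOT
}
```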
Consul and Vault Integrations
With Nomad, we take advantage of the built-in Consul and Vault integrations very heavily. All of our jobs have service stanzas and include extra tags and health checks. On the Vault side, we use a token role in Vault so that within jobs, you can specify Vault policies. Those policies grant you access to specific paths in Vault. That way, you can use template stanzas to pull secrets or certificates straight from Vault.
We use performance replication with Vault to replicate secrets between datacenters. Teams update secrets in the primary Vault cluster and those replicate between datacenters.
We use Consul's and Nomad's secret engines in Vault so operators can log in to Vault with their LDAP credentials. They can then request a Nomad token and have temporary access to the Nomad clusters. The token is automatically revoked when its lease expires, and it has a full audit trail behind it.
For load balancing, we leverage a few solutions. We primarily use Fabio, which has been great. And if you're not familiar with Fabio, it's a Consul native software load balancer. Every service that gets registered in Consul has a special tag that Fabio is looking for. Fabio watches Consul, looking for all healthy services that also have that extra tag. It takes that tag, strips off the URI portion and adds a route to its route table.
We place several software load balancers in each network listening on standard HTTP ports. Then Fabio routes traffic to the applications which are running on dynamic ports within Nomad. We started with Fabio as a system job on every node when the environment was small. This worked great at first, but as we began increasing VM counts, the sheer number of Fabio instances started putting a high amount of load on Consul.
We've moved to a pool of dedicated VMs for running Fabio now in each network. And we refer to those as Ingress nodes. We're also beginning to evaluate another software load balancer called Traefik, which also has Consul integrations, leverages tags, and has a growing community behind it.
I wanted to show an example of a job with load balancing tags in it and what that looks like in Consul and Fabio. Here's what the tags look like in a Nomad job within the service stanza — there are routes automatically added to Fabio. The first part of the tag is the URI path. And the strip directive removes that URI on the proxied call to the service.
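The slide itself is not reproduced in this transcript, so the following is only a sketch of what such a service stanza could look like, using Fabio's `urlprefix-` tag convention with the `strip` option. The `/customer01` path, the container port, the image, and the `/health` endpoint are assumptions.

```hcl
job "customer01-frontend" {
  datacenters = ["dc1"]

  group "frontend" {
    count = 3 # three instances, so Fabio ends up with three routes

    network {
      port "http" {
        to = 8080 # container port (assumed); the host port is dynamic
      }
    }

    task "frontend" {
      driver = "docker"

      config {
        image = "registry.example.com/frontend:1.0" # placeholder image
        ports = ["http"]
      }

      service {
        name = "customer01-frontend"
        port = "http"

        # Fabio adds a route for /customer01 and removes that prefix
        # from the path before proxying to the service.
        tags = ["urlprefix-/customer01 strip=/customer01"]

        check {
          type     = "http"
          path     = "/health" # hypothetical health endpoint
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```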
Here's the Consul UI that shows three services are registered for customer01-frontend. And the Fabio route table shows three routes — each getting about 33% of the traffic. We see a source and destination column. The destination shows the IP address and dynamic port that the container — or service — is listening on.
Cluster Level Configurations
All the cluster-level configurations are managed through Terraform: Nomad ACLs, namespaces, Consul ACLs, Sentinel policies, and Vault policies, roles, and engines. That helps keep management practices very familiar and consistent. And we store those all in version control.
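As a small illustration of cluster-level objects kept in Terraform, the sketch below uses the Nomad provider to define a namespace and an ACL policy scoped to it. The address, names, and policy rules are hypothetical and not taken from Q2's configuration.

```hcl
provider "nomad" {
  address = "https://nomad.example.com:4646" # placeholder address
}

# A namespace for one application stack (hypothetical name).
resource "nomad_namespace" "customer01" {
  name        = "customer01"
  description = "Jobs for the customer01 stack"
}

# An ACL policy that allows submitting and reading jobs in that namespace.
resource "nomad_acl_policy" "customer01_deployer" {
  name        = "customer01-deployer"
  description = "Submit and read jobs in the customer01 namespace"

  rules_hcl = <<EOT
namespace "${nomad_namespace.customer01.name}" {
  policy = "write"
}
EOT
}
```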
We use VMware in our datacenters for VM hypervisors, and we leverage Terraform for provisioning. We have Packer pipelines that build VM templates, which are about 95% configured. We then use Terraform to build out the Nomad clients and leverage user data and cloud-init to lay on the final 5% of the configurations to make provisioning extremely quick and painless. For Windows Nomad clients, we use cloudbase-init and have the same user data method. This is helping us manage our on-prem environments the same way we're managing cloud resources, which has been great.
When it comes to monitoring, we initially started with our existing monitoring solutions. But we had some trouble regarding VM changes and rebuilds — and trying to follow containers around wasn't working that great. We very quickly moved to Prometheus and haven't looked back.
We run Telegraf as a system job on every node with several of its built-in input plugins. And we register Telegraf as a service within Consul. Prometheus automatically finds and begins scraping those agents as they come up and move around. This means VMs come and go and Nomad tasks move around — and we automatically scrape them and collect metrics no matter where they end up.
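A rough sketch of that pattern is shown below, assuming the Docker driver, Telegraf's `prometheus_client` output on its usual port 9273, and two example input plugins. The image tag and plugin selection are assumptions, not Q2's actual setup.

```hcl
job "telegraf" {
  datacenters = ["dc1"]
  type        = "system" # one allocation on every eligible client

  group "telegraf" {
    network {
      port "metrics" {
        static = 9273 # same port on every node
      }
    }

    task "telegraf" {
      driver = "docker"

      config {
        image        = "telegraf:1.16" # assumed image tag
        network_mode = "host"
        volumes      = ["local/telegraf.conf:/etc/telegraf/telegraf.conf"]
      }

      # Minimal Telegraf config: two inputs and a Prometheus endpoint.
      template {
        destination = "local/telegraf.conf"
        data        = <<EOT
[[inputs.cpu]]
[[inputs.mem]]

[[outputs.prometheus_client]]
  listen = ":9273"
EOT
      }

      # Registered in Consul so Prometheus can discover each agent.
      service {
        name = "telegraf"
        port = "metrics"

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```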
We run a couple of standard Prometheus exporters as well, like Blackbox exporter for API monitoring and Consul exporter for service status changes. Finally, for long-term retention, we use Thanos.
Here's an example of a Prometheus job on the left that connects to Consul and scrapes any service that's registered called Telegraf. On the right, a job that scrapes any service that's been tagged with scrape metrics. Those are services that our developers write, and they automatically get monitored because of the Consul service discovery and Prometheus integration.
We learned a lot of lessons along the way and wanted to share some of the job-level best practices with you.
In the restart stanza, we often set the mode to fail. That helps whenever there are node issues, such as a VM that's having a problem or is in a degraded state. If your allocation fails multiple times in a row, it will eventually go to a failed state. Nomad will then make sure that it starts up on a healthy node.
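A restart stanza along those lines might look like the fragment below. The attempt count, interval, and delay are illustrative values, not Q2's settings.

```hcl
# Fragment: place inside a group (or task) block.
restart {
  attempts = 2     # local restarts allowed within the interval
  interval = "10m"
  delay    = "15s" # wait between local restarts
  mode     = "fail" # after the attempts are used up, fail the allocation
                    # instead of continuing to retry on the same node
}
```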
You want to make sure that every single job has a health check of some kind. And ideally, maybe a shallow health check if you can — but something that tests that the application is working.
You want to enable check restarts (the check_restart stanza) because you probably want to restart tasks after a threshold of failed health checks.
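As a sketch, restarting a task after repeated failed health checks is expressed with a `check_restart` block nested in the check. The service name, the `/health` path, and the thresholds here are assumptions.

```hcl
# Fragment: place inside a task (or group) that exposes a port labeled "http".
service {
  name = "customer01-frontend"
  port = "http"

  check {
    type     = "http"
    path     = "/health" # hypothetical health endpoint
    interval = "10s"
    timeout  = "2s"

    check_restart {
      limit = 3     # restart after three consecutive failed checks
      grace = "60s" # give the task time to start before counting failures
    }
  }
}
```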
You may not want the default reschedule mode, which is exponential backoff. As the delays grow between attempts, it could take up to an hour before Nomad retries your job. We often use constant mode with a low delay between attempts, especially on some of our critical jobs.
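A constant-delay reschedule policy can be sketched as the fragment below. The 15-second delay is an illustrative value only.

```hcl
# Fragment: place inside a job or group block of a service job.
reschedule {
  delay          = "15s"      # wait this long before each reschedule attempt
  delay_function = "constant" # keep the delay fixed instead of growing it
  unlimited      = true       # keep trying until the allocation is placed
}
```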
And finally, one of the best resources these days is the HashiCorp Learn site to pick up more best practices on job configurations.
Nomad has offered Q2 an orchestration solution that's simple, easy to use, and stable. It was the key to solving Q2's deployment problems — and it's allowed us to scale for future growth.