Case Study

Abstracting Nomad at SeatGeek: Patterns for deploying to Nomad at scale

Learn from the lessons of the platform team at SeatGeek and how they improved their platform's ease of use for deployment, security, networking, and observability with HashiCorp Nomad, Consul, and Vault.

»Transcript

Hi everyone. My name's Jose Diaz-Gonzalez, and as he said, this talk is about abstracting Nomad — the patterns for deploying to Nomad at scale. Before I jump into the technical meat of the talk, I want to talk about SeatGeek — what we build and why we're building it this way.

»What is SeatGeek?

SeatGeek is a high-growth live event ticketing platform. If you go to any sports, concerts, theater, or animal sports types of events, you probably have heard of us or used us. We partner with some of the largest names in the industry, such as the Dallas Cowboys, the Brooklyn Nets, MLS, etc. 

I want to talk about the ticketing problem space for SeatGeek, who we are, who our customers are, and who we partner with — and that'll give you a little bit of context as to what we've built and why we've built it in this way.

»Who Are Our Customers and Partners? 

On the customer side, we have folks who buy tickets to live events — someone who wants to go to a concert, attend a sporting event to see a basketball game or something like that. We provide an experience for them to find the best seats available, ensure they have all the information they need when they attend an event — as well as offering a great in-venue experience. 

We've also been massively fortunate to establish partnerships with some of the largest properties in the world. That's the Dallas Cowboys or Manchester City, if you're a football fan on either side of the globe. We also partner with basketball teams such as the Brooklyn Nets and the Cleveland Cavaliers. Then we partner with organizations such as Brooklyn Sports and Entertainment Global, as well as Jujamcyn on the concert and theater side.

On the partnership side, we work with them on back-office operations: how you run an event, how you sell tickets, and how you provide that experience for folks coming into a venue, checking in, and purchasing tickets. That's a little bit about the business. Now we'll talk about more pragmatic stuff.

»Technical Background and Context

On the technical side, aside from using a ton of languages and frameworks, services at SeatGeek are fairly generic and homogeneous in how they are built and run. All applications that we run load configuration in roughly the same way. On the consumer side — that means seatgeek.com — we load environment variables into a configuration object and read that everywhere, regardless of the service.

For partner-facing services — the stuff that runs the back office for folks at our partners — there's typically a lot more configuration. But again, all of that is stored in XML files and loaded up in the same way. So, if you're on the consumer side, it's all environment variables. If you're on the partner side, it's all via configuration files. Pretty similar.

All applications also communicate in roughly the same way. If you're an app developer, you have the option to send messages over HTTP across our internal service mesh to another service. In rare cases, we communicate via TCP, and engineers can specify whether they prefer TCP or HTTP communication, depending on how they configure their services.

For longer-running tasks such as payment processing or email sending, those typically occur in a background process. An engineer can choose to send a message into one of our messaging brokers — typically RabbitMQ or SQS — and process that work asynchronously, either within their own service or in a separate service downstream.

Messages are largely sent in the same format. So if you're an engineer on a Python service, you can go to a .NET service, and it's the same object that you're looking at. Then the background processing frameworks we've built process those messages in roughly the same way, regardless of the language or framework.

Our resilience team has also gone to great lengths to standardize how we ship metrics and logs for our monitoring purposes. If your application makes an HTTP request to another service, we've enabled APM to track all of those requests across the stack. Applications are also shipping logs in a standard format wherever possible. Then we enrich them with platform-level data so that we can isolate issues to regions or instances or any other number of dimensions. 

Metadata about where messages and metrics are generated is also injected into the log messages themselves. That way, we can use it during incident investigation or after the incident.

This is all to say that the solution we're building towards, and what works for SeatGeek, makes quite a few assumptions about how our systems and services work, and assumes that we're not expanding the number of options folks can use. So, what works for us may not work for your organization if you're not building for a homogeneous environment.

»Platform Needs 

With that context, I'll talk about our actual Nomad adoption. In the early days of Nomad at SeatGeek, we mainly experimented with Docker-based deployments. At this point, we use more than just Docker; we're experimenting with other types of task drivers as well.

The earliest test of Nomad was to see how we could perform certain actions on the platform. The platform requirements are relatively simple. For most engineers, it comes down to: when an engineer completes a bug fix or a feature, how do they get those changes from their laptop in front of our customers? If we want a route from one service to another, how should that be configured? What do we expose there?

If an engineer wants to specify an application secret, for example, for payment processing, maybe they have a token or something like that — how do we expose that to an engineer? How do we ingest logs? How do we scale? Those are the kinds of things we were focused on really early on and what drove our initial investigation.

»Just a CLI Away

In the beginning, we started with the Nomad CLI. This was prior to the 1.0 era, so we didn't have access to HCL2. A lot of it was playing around with CLI calls and saying — all right, fine, let's set the image, set the count, run that job.

Today, if you're using HCL2 — and I'll have an example of how that works — it's a little bit simpler. But in our case, deploying a particular commit SHA just required changing the image property and then submitting the job. If you wanted to scale, it was pretty much the same: you set the count and submit the job.
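As a rough sketch of that workflow (the job name, registry, and variable here are illustrative, not ours), an HCL2 job file can parameterize the image so that a deploy is just a variable change and a resubmission:

```hcl
# Minimal HCL2 job sketch: deploying a commit means changing image_tag and
# resubmitting; scaling means changing count and resubmitting.
variable "image_tag" {
  type    = string
  default = "latest"
}

job "example-web" {
  datacenters = ["dc1"]

  group "web" {
    count = 2 # bump this and rerun the job to scale

    task "app" {
      driver = "docker"

      config {
        image = "registry.example.com/example-web:${var.image_tag}"
      }

      resources {
        cpu    = 200
        memory = 256
      }
    }
  }
}
```

Submitting a specific commit is then something like `nomad job run -var="image_tag=<commit-sha>" example-web.nomad.hcl`, and scaling is the same flow with a new count.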

Application configuration was provided via template stanzas, regardless of whether the application loaded XML configuration or environment variables. We had Consul-based DNS already, so everything was communicating via our internal DNS-based service mesh.
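For illustration, template stanzas along these lines (service names, ports, and file paths are placeholders) cover both styles from inside a task: rendered environment variables for consumer services and a rendered XML file for partner services, with addresses resolved through Consul DNS.

```hcl
# Sketch only: two template stanzas inside a task. The first renders
# environment variables; the second renders an XML configuration file.
template {
  destination = "local/app.env"
  env         = true

  data = <<EOT
MAPS_SERVICE_URL=http://maps-api.service.consul:8080
EOT
}

template {
  destination = "local/settings.xml"

  data = <<EOT
<settings>
  <mapsServiceUrl>http://maps-api.service.consul:8080</mapsServiceUrl>
</settings>
EOT
}
```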

Today, by configuring the correct HCL job file, these are much simpler tasks to execute for smaller Nomad installations. For many folks dipping their toes into Nomad today, you can get pretty far with just the Nomad CLI and HCL templating. We recommend using the Nomad CLI this way for the following use cases:

  • If you're using Nomad as a proof of concept within a new organization or a new deployment, it's a good place to start and play around. You can stand up a simple Nomad cluster and submit jobs to see how things run. In all honesty, you don't need large deployment tooling or a deployment platform if you're playing around with Nomad for the first time.

  • If you have a local home lab that you're playing around with, same thing. Just set up a simple Nomad cluster and submit those jobs directly. You probably have other things you're concentrating on where you can better spend your time.

  • Finally, if you're playing around with new functionality in Nomad — especially with new features that come out with every single release — before you build that into your tooling, play around with the Nomad CLI to submit those jobs and see how that runs and executes in your platform.

»Nomad CLI v2 

While these two options weren't available to us at the time, it may be interesting for you to look into Levant and Nomad Pack. On the Levant side, it's a slightly improved method of interacting with Nomad on the CLI.

It allows for a bit more flexibility around templating, including defining your own functions. It overcomes a couple of limitations in HCL itself around the sorts of things you can template out. It also has a couple of interesting ways of interacting with Nomad itself. I think it's a really good option for folks with smaller clusters or home labs.

Nomad Pack is also another entry in this space. If you're familiar with the Kubernetes world, it's pretty similar to Helm. If you're providing a library of applications or if your applications are fairly similar to each other and you just inject configuration, it could be a good option for you.

»Planning for the Future 

All that aside, once we had our feet wet with job specs and our most common workflows with Nomad, we started looking at expanding SeatGeek's footprint and usage of Nomad, so we asked ourselves a couple of questions: 

What were the pain points around deployments? If you're an engineer, how do you get your code up, and how do you know it's up and running? What did we feel comfortable supporting on the platform side? Certainly we can build the world, but if we can't support the world, then are we better suited not to build all of those features and functionalities? 

Then, finally, what is the best experience that we can provide to engineers? If we scope ourselves to a very small portion of what Nomad provides and provide a best-in-class experience for that, an engineer will be more excited and more likely to use the deployment platform to ship code.

»The Interface

We decided early on to allow engineers full access to what was deployed and how it was deployed. If you were an engineer, we expected you to understand the following concepts: You had to know how to write a Dockerfile. You had to know the various knobs in a Nomad HCL job specification. You had to know how Consul templating worked. You had to know how Vault and Consul integrated into our platform.

This was a complete 180 from our previous deployments, where an engineer had limited access to the underlying host and could only specify the command they wanted to run and a couple of operating system packages. With Nomad and HCL2, we provided a way for folks to completely define what they were running and opened up the entire world of what they could run on our platform.

In practical terms, this meant an engineer's interface to deployments was an infrastructure directory. Every single repository gets this infrastructure directory. You can define one or more job files. You can define one or more Dockerfiles that we would build for you. You can define configuration — whether that's file- or environment-based — for your applications, as well as files for configuring how applications are monitored.

»The User Journey 

The user journey for deploying to our platform was as follows: You would go to a deployment UI — we called it Deli; it was short for delivery. You'd find your application and environment configuration. If you wanted to deploy to production, you'd go to prod and find your app. Same thing for staging or any other environment. 

You'd select the commit that you want to deploy to that environment, which in certain environments was scoped to certain branches. In production, we would only allow the integration branch to be deployed, whereas in pre-production you could deploy pretty much anything. Then you'd click deploy.

Under the hood, this would involve slurping down that configuration, modifying all of the job specs, injecting any commit SHAs and environment variables, configuration, etc., submitting the job into Nomad and then tailing that job until it completed or didn't complete.

»Job Processing 

Initially, the injection configuration — the part that injected configuration into a service — involved roughly the following steps: We'd inject a template for environment variables or for application configuration, if that was on the XML side.

We would configure a couple of monitoring sidecars — things that would ship logs or metrics to our downstream services — and then vendor-specific template files that were germane to our platform. This was a lovely, readable 20-line for loop. There were a couple of functions that we called, but it was pretty easy to maintain and update — and, therefore, to debug and use overall.

This lovely piece of code eventually grew into a very large set of modules, spanning many thousands of lines of code and many thousands of lines of for loops. We had a gigantic inner for loop of 1,000 lines of code that was probably nested seven or eight for loops deep.

That covered stuff like the auto-scaling rules: how do we auto-scale for a particular environment? It covered monitoring configuration: if you're transitioning from, for instance, Graylog to Datadog, how do we handle that? It covered constraining specific options for particular teams or environments: we didn't allow raw_exec for non-platform engineers, for instance, so we had to constrain that, among other things.
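As one concrete example of the kind of stanza that middleware injected, group-level auto-scaling rules can be expressed as a Nomad scaling block. This is an illustrative sketch; the thresholds and the nomad-apm query are assumptions, not our actual policy.

```hcl
# Illustrative scaling policy of the kind a deployment middleware might inject
# per environment (all values here are assumptions).
scaling {
  enabled = true
  min     = 2
  max     = 10

  policy {
    cooldown = "2m"

    check "avg_cpu" {
      source = "nomad-apm"
      query  = "avg_cpu"

      strategy "target-value" {
        target = 70
      }
    }
  }
}
```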

That's all to say that this was very complicated for us to maintain. It was easy for us to break the deployment experience for engineers — and therefore easy for them to get upset with us for not providing the deployment experience we promised.

»Middleware 

The next iteration of our deployment middleware took inspiration from the Kubernetes world in the form of admission controllers. If you're familiar with Kubernetes, you have validating and mutating admission controllers: you submit your job, and a controller can mutate it, or it can reject it and say this is not valid for the platform.

In our case, we built a service called NAC. That's just that lovely middleware right there, and it's short for Nomad Admission Controller. It acts as a semi-transparent proxy to Nomad itself. Controllers within NAC were scoped to particular types of functions, such as a controller that disabled raw_exec jobs. There was a controller that injected the valid datacenters for a given environment. There was a controller that enforced versioned secrets for an application, and there was a controller that enforced default policies for the app.

In this particular case, I would recommend building a system such as this if you have a more mature platform team that is willing to write code and needs to have that central configuration for your platform. If you need to enforce rules, this is a great way to enforce those rules and put it in front of Nomad. 

An alternative to this is to lint your jobs prior to submission. If you're comfortable with engineers being able to override your rules, you can write the same sort of tool as a linter: it runs on the CLI, you put it in CI, and the job gets submitted whether or not it was in the proper format.

»Engineer Complaints

The above changes still describe our initial Deli-based deployment. But in 2021, we took a closer look at how engineers used our platform to deploy. They had the following things to say:

  • I don't understand all the options in the Nomad job specification. I usually copy-paste the configuration from one job to another. 

  • The HCL I configure and the jobs that get deployed aren't ever really the same, because the platform mutates them in between. So, I can't tell whether a given behavior is something our platform supports or something Nomad does.

  • I hate copy-pasting the same configuration from one job to another — and I always forget to add a particular stanza, or it's misconfigured for that particular type of job. 

At this point, we took a step back and looked at the various ways we could unify the configuration we expose to engineers. This culminated in a newer system that we call Rollout, in which we emphasized convention over configuration.

»Refreshed Approach 

Rather than providing the entire Nomad job spec and allowing folks full access to the platform — because we know that all applications are built and run in the same way — we can constrain that to a couple of small configuration options. This means that instead of having an 80+ line HCL job file that defines how to run the entire job, an engineer can write just 5-6 lines to specify: here's my single process that I need to run.

This method for abstracting the configuration into a simpler format is pretty common across companies. If you go from one company to another, you'll see that there's another configuration format that's similar to this.

The unfortunate part of this is that you can't share this configuration format, because it's very specific to how you run applications at your work. I'm happy to talk about this later if you want, and I'll show you my spec, but I'm pretty positive that it'll be a different spec if you go to your company or any other place.

»Exposed As APIs 

We also standardized and simplified how external systems integrate with deployments, resulting in many fewer commits to get something deployed. If you've ever worked with CI/CD pipelines, you'll notice that there are probably 10 or 11 commits before you get to running your tests. That's something that we wanted to avoid with Rollout.

Previously, I mentioned that, at our end, we embrace simplifying interfaces for deployments. With Rollout, we wanted to expose our deployment platform as an API — as opposed to something that pulled that configuration down for you.

In our case, the YAML file we submit is turned into JSON and submitted to an API. Then, on that end, we convert that into HCL2 and finally submit it to Nomad. So, if an engineer wants to deploy, they modify their YAML file, they can lint it in CI, and it gets submitted into our deployment platform. We then tail the logs for that deployment back to the CI job until completion, whether that's a pass or a fail.

I recommend this kind of platform if you have a team that has a lot of experience in writing code and is comfortable managing that platform as opposed to a team that's more configuration-driven. And if you have a team that needs to provide standards across your organization for different types of jobs. 

If you don't have standards across your organization or if you're trying to define those, this may not be a good fit for you. With all that in mind, I'm going to go into particular things we built within our platform. 

»Secrets Management

I want to give you a little bit of how that's evolved over time. For us, secrets management is pretty interesting. All applications require access to secrets, and Nomad 1.4 added the ability to store secrets directly within Nomad, so you don't strictly need Vault.

But in our case, we use Vault, and we wondered: how should we expose those environment variables to users? How do we expose configuration and ensure those applications are running in the proper manner?

»Secrets: Prototype

Secrets management at SeatGeek is pretty interesting. With our initial Nomad setup, we had template stanzas that did something like this: every single secret that you wanted to pull, you would pull from a path. We would read it in — it was great. This was all KV v1.
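A rough sketch of what that looked like (paths and key names are placeholders): each secret was an individual Vault KV v1 read inside a template stanza, repeated in every job that needed it.

```hcl
# Sketch of the early per-key approach: every secret is listed explicitly,
# and a missing key fails the render (and therefore the deployment).
template {
  destination = "secrets/app.env"
  env         = true

  data = <<EOT
PAYMENT_TOKEN={{ with secret "secret/my-app/payment-token" }}{{ .Data.value }}{{ end }}
BILLING_TOKEN={{ with secret "secret/billing-service/payment-token" }}{{ .Data.value }}{{ end }}
EOT
}
```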

It worked well for a playground application but had a couple of sharp edges. In our case, we required specifying every single key for an application. If you needed a payment token and you also needed a payment token for another service, you needed both those keys there. 

If you had multiple jobs, you also had to specify that template stanza over and over for each job, which could be frustrating. And if you were missing one of these keys, the deployment would fail because of how Consul templating works.

»Secrets: Deli 

Our first iteration of exposing secrets to applications and production environments was something that we call env.hcl. You could specify your data store configuration, a secret, and the Consul services you wanted to link. Under the hood, we would take that in, generate that template for you, and then inject it into every single job.

It's a pretty useful, easy way for an engineer to add a new secret to their application, a new data store link, etc. Of course, you still had to configure Vault access to that particular data store upstream, or inject that secret into that particular environment. But those are standard things.

To add a secret called lollipop, you'd add a secret stanza with the name lollipop in it. Then, an engineer would go to the corresponding path in Vault — under that app and environment — set the value, and trigger a deploy. Under the hood, when you deployed, our Deli service would read in your env.hcl, check for that environment variable within Vault, verify that it was there and had a value, inject the proper template, and then trigger the deploy for you. It worked pretty well.
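As a rough sketch (the stanza and field names here are illustrative rather than our real schema), an env.hcl could look something like this:

```hcl
# Hypothetical env.hcl sketch: one secret, one data store link, and one
# Consul service link. The real schema differs; names are made up.
secret "lollipop" {}

datastore "postgres" {
  name = "my-app-db"
}

service "maps-api" {
  env_var = "MAPS_SERVICE_URL"
}
```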

»Secrets: Rollout 

We wanted the new Rollout system to decrease the number of steps to add a new environment variable or secret. Previously, we would require a user to create a commit in the repository and then trigger a deploy.

In the case of the Vault-Nomad integration, sometimes someone would set a variable's value to update something that needed to be coordinated with a deployment. And because secrets are refreshed on a given interval, that secret would be refreshed prior to the deployment, causing potential issues at runtime.

In the new setup, we inject the templates like this: you specify the application, path, and version of the secrets. We're taking advantage of KV v2 at this point, and we inject that specific version of the secret.

All secrets are stored as JSON within Vault itself, at that Vault path. So, a user would set the lollipop key inside their environment, and then we trigger a deploy. Every single deploy is tied to a version of secrets, versus having that value just hang around.
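A minimal sketch of that template, assuming consul-template's KV v2 syntax (the path and version number are placeholders): the whole environment is one JSON document, pinned to an explicit version.

```hcl
# Sketch: render every key in the versioned JSON document as an environment
# variable, so each deploy ships a known, fixed set of secrets.
template {
  destination = "secrets/app.env"
  env         = true

  data = <<EOT
{{ with secret "secret/data/my-app/production?version=42" }}{{ range $k, $v := .Data.data }}{{ $k }}={{ $v }}
{{ end }}{{ end }}
EOT
}
```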

Under the hood, everything is now stored in that JSON dictionary. This allows engineers to quickly scan what secrets are available, which is great. Previously, they would ask: what is available? I don't know what the value is — and they'd have to go to several paths.

The secrets are now versioned — up to a hundred versions are kept. We know exactly who changed something and when it changed, and we can revert if we need to. Then deploying a secret is an explicit action an engineer takes, so they can coordinate it with application changes they might need to make. It works pretty well for us. If you have a similar setup, this might be a great avenue for you.

»Observability

Next is observability. Our application monitoring stack has gone through a couple of different iterations in how we provide monitoring for logs or metrics.

Initially, we had folks copy-pasting configuration into their applications. There was a configuration for a filebeat sidecar so they could ship logs. There was a configuration for a Telegraf sidecar so they could ship metrics, etc.

This worked fairly well, but it didn't work for very long. We found the application monitoring space progresses extremely quickly — at SeatGeek and externally. If we're using a tool, that tool may change how it gets configured, or we might find another platform or provider is more useful.

Every single time the resilience team investigated a new, better tool that provided a better experience, we required an engineer to make that change. That was taking away from engineer productivity and the ability to ship product to our customers. It was also, again, something they needed to copy-paste for every single job — and something we wanted to avoid.

»Observability: V1

The first generation of this was template-based, as I mentioned. You'd inject a file — a filebeat YAML file, a New Relic config, a Telegraf config, etc. Under the hood, we would read that job, read every configuration file, inject the proper sidecar depending on which configuration files were available, and then deploy.

If you had a filebeat YAML file, there could be some token we needed to inject; the deployment system would read that filebeat config, interpolate the variables, inject that template, and then submit the job.
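The injected sidecar was roughly an extra task in the group. This is a simplified sketch, not our exact configuration; the image, mount path, and filebeat input are assumptions.

```hcl
# Simplified sketch of a log-shipping sidecar task injected next to the app
# when a filebeat config was present in the repository.
task "filebeat" {
  driver = "docker"

  lifecycle {
    hook    = "poststart"
    sidecar = true
  }

  config {
    image   = "docker.elastic.co/beats/filebeat:7.17.0"
    volumes = ["local/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro"]
  }

  template {
    destination = "local/filebeat.yml"

    # In practice this came from the engineer-provided config file, with
    # platform variables interpolated at deploy time.
    data = <<EOT
filebeat.inputs:
  - type: log
    paths:
      - /alloc/logs/*.stdout.*
EOT
  }
}
```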

Early on, we would detect features based on the existence of these files. So if you wanted APM, for instance, we would inject the APM configuration only if you had the New Relic config in place. Or, if you wanted to ship logs, we would only ship logs if you had a filebeat config in place.

That was good because then an engineer could say, I am doing my operations in X way; I'm shipping my logs in Y format. It didn't work well if an engineer forgot one of these files. They would ask, why are my logs missing? Or why am I not collecting metrics? It didn't work well if your application was shipping things in different formats.

We have some applications where, for the most part, they might have Python logging to JSON, but they may also have an Nginx process running — serving some static content — and that Nginx process is logging in a different format. So it wasn't well suited for those setups.

»Observability: V2

Rather than letting this be open-ended, our deployment tooling now injects all this information. When we were still using Deli, we started injecting the configuration via the Dockerfile, which we found was the closest thing to the actual running process. If you're running Nginx, you can specify that my log format is JSON or Nginx format. If you're running your application, the standard is JSON format, but you can override that via a label.
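Expressed in a Nomad job for illustration (at SeatGeek the labels live in the Dockerfile itself, and the label names here are invented), the idea is just container labels that the downstream agents can read:

```hcl
# Hypothetical label convention: log, metric, and APM agents introspect these
# container labels to decide how to treat each process.
task "app" {
  driver = "docker"

  config {
    image = "registry.example.com/example-web:abcdef1"

    labels {
      "log-format" = "json" # an nginx process would override this to "nginx"
      "apm"        = "enabled"
    }
  }
}
```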

This has worked pretty well. Downstream, the agents we use to ship metrics and logs and provide APM services can introspect on those labels and then provide the services as necessary.

It does require a little more work in integrating those agents with your running applications, and it doesn't work well if you have non-container-based deployments. If you're deploying raw_exec jobs or running something else, maybe you don't have that capability there. So, I recommend that you only use this pattern if you're fairly homogeneous and have standardized on some specification you can introspect on.

One other thing we provide — in Deli at least — is the ability to override configuration on a per-job basis. So, if you have a single Dockerfile that runs multiple processes but one has a slightly different config, you can go back and override just that application — that process's setup.

Again, the one nasty thing about this is that you might need to duplicate that configuration. If you have multiple jobs and you need to specify that these three jobs are Nginx, you need to re-specify that over and over. That might get out of sync with your underlying codebase, so if your codebase later changes to a different format, you need to go and find all the places where you've referenced it.

Adding support for new configurations also means you need to expand the set of labels. So, if you want to expose HTTP processing or monitoring, that's now a new configuration that you expose.

»Observability: Rollout

In the Rollout world, we make a ton of assumptions about what the best setup is for folks. Then, we remove all of the knobs completely. If you're an engineer, we expect you to log in JSON. If you can't log in JSON, we need to talk and figure out a workaround for you. Rather than exposing all of these knobs and making things more confusing, we standardize the system so we can do things more efficiently.

Similarly, we make some assumptions about how metrics are collected and how we inject APM configuration into applications. One nice thing about this is we can have a unified answer as to why a system works in a certain way — why we're shipping X, or why we're ingesting something in Y format — because it's the same everywhere. We don't have to do anything application- or process-specific.

This has also cut down on a ton of support requests because an engineer can go from one team to another and have the same kind of experience across teams. The one pain point is when we're migrating systems or updating configuration: you have to upgrade everywhere, and it's not a piecemeal operation. You need to put a bit more thought and care into that pattern.

»Networking

At SeatGeek, we embraced microservices really early on — in 2010, we had microservices, and then in 2012, people were calling it microservices. And that was interesting for us. 

For us, it was largely an attempt to isolate older frameworks from newer code. In our case, we had a symfony 1.1 app that we didn't want to continue working on, and folks were writing Python code.

We started using microservices to push more of that logic into separate services. This happened before we adopted containerization and Nomad — by which point we had about 40 applications running maybe 100-200 services. We're quite a bit bigger now, but it's pretty much the same networking model.

»Networking: Origins 

Initially, engineers would set up environment variables to talk across services. We'd assign a well-known port to a service, and you could talk to localhost on that port on any given box — and we would then ship your request off to the correct service. That was all done via a load balancer that ran on every single server.

So, if you were on any given server, you could talk to the API or our maps service via the same port — there wasn't any guessing. A lot of that was hardcoded into your service. If you needed to talk to a service, you read a file that said this was the mapping — all right, great, I'm going to use that in my application.

But for the most part, folks were reading this via environment variables. The nice thing about that was that it allowed folks to locally run that service on different ports and then do cross-service communication. So, we were ready to shift into a more service mesh-based approach. 

»Networking: Deli

For our first iteration of service communication on Nomad, we piggybacked off the env.hcl file format. We had a way for a user to define a service. You could define that service on a given scheme or port, define what service it talked to under the hood, and expose that as a different environment variable depending on what you wanted to do. We would, of course, process that and inject the Consul template for usage. Then you would talk to whatever that registered service was. Pretty simple.
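As a sketch (the service and variable names are placeholders), the injected Consul template resolved the linked service to a healthy address and exposed it under the environment variable the engineer asked for:

```hcl
# Rough sketch of the injected template for a linked service: the engineer's
# code just reads MAPS_SERVICE_URL, and Consul fills in a healthy instance.
template {
  destination = "secrets/services.env"
  env         = true

  data = <<EOT
MAPS_SERVICE_URL=http://{{ with service "maps-api" }}{{ with index . 0 }}{{ .Address }}:{{ .Port }}{{ end }}{{ end }}
EOT
}
```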

I think this is straightforward if you're starting with Nomad and have Consul set up, or if you're using Nomad's integrated service discovery. This is probably the way you want to go. We found you have to be careful depending on how many services you have and how you're loading up those templates — especially if you're iterating over a lot of stuff.

Some things we found early on: With much smaller Consul clusters, we overloaded the Consul cluster and had a ton of outages because we were trying to render these very large, very complex Consul templates. 

Keep in mind that if you're deploying this kind of functionality in pre-production, your pre-production is probably a lot smaller than your production. So, multiply by whatever number of requests or services you're templating out. 

»Networking: Rollout

Our current iteration is fairly similar. An engineer defines their configuration in their Rollout YAML file, and then we inject that configuration and a sidecar for the engineer. We also inject the environment variable with the correct value, so under the hood, they talk directly to the service via Consul Connect. They don't have to do much of anything different.

That also applies to our existing Deli-based deployments. They define the env.hcl file, and we turn that into the correct Consul Connect configuration and inject it for the engineer.
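A minimal sketch of the Connect wiring that gets generated (names and ports are illustrative): the group gets a bridge network and a sidecar proxy, and the app is pointed at a local port that the proxy forwards to the upstream service.

```hcl
# Sketch of the generated Consul Connect configuration for one upstream.
group "web" {
  network {
    mode = "bridge"
  }

  service {
    name = "example-web"
    port = "8080"

    connect {
      sidecar_service {
        proxy {
          upstreams {
            destination_name = "maps-api"
            local_bind_port  = 9100
          }
        }
      }
    }
  }

  task "app" {
    driver = "docker"

    env {
      # Injected for the engineer: traffic goes through the local sidecar proxy.
      MAPS_SERVICE_URL = "http://127.0.0.1:9100"
    }

    config {
      image = "registry.example.com/example-web:abcdef1"
    }
  }
}
```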

This allows us to have much more secure communication across services. Hopefully, in the future, it also allows us to shut down requests to servers from outside of the mesh.

I recommend looking at this route if you're comfortable with Nomad's current Consul Connect usage instructions. Specifically, there are a lot of things you might want to configure on the Consul Connect side that aren't quite exposed to users via Nomad. 

If you're on something like Kubernetes, you have full rein over everything. But on the Nomad side, there are a couple of options that may be useful but aren't exposed quite yet and might require you to build tooling to push them out via APIs.

And that's all I've got — so, if you're still here, my Twitter handle is @savant. You can come and reach me after the talk or reach out on Twitter — and hopefully, someone's learned at least one thing here. 

Thank you.

 
