Distributed Configuration Management With Nomad at Samsung Research
Aug 02, 2019
Learn how Samsung Research used Nomad to orchestrate a datacenter platform that can now be rebuilt in 15 minutes with full data recovery.
Samsung Research uses every tool in the HashiCorp stack: Terraform, Vault, Consul, and Nomad. Schedulers such as HashiCorp Nomad are most commonly associated with running stateless services. For the Platform team at Samsung Research, however, Nomad provides much more than just this.
With a Nomad binary installed on servers, the team is able to bootstrap its entire datacenter platform in minutes using Nomad to control:
- DB clusters
- Schema upgrades
- Package installation
- and maintenance tasks
It also drastically lowers the barrier of entry for developers to configure their own instances of infrastructure and run their own code.
This talk by James Rasell, a distributed systems engineer for Samsung Research, will dive into his team's system architecture, discussing the rationale for decisions and tradeoffs made in using Nomad to control more than just the stateless API services the team runs.
Distributed Systems Engineer, Samsung Research
I’m going to be talking about distributed configuration with Nomad. This is a research project that I was working on recently, and I wanted to share it with you because we found Nomad fit quite a nice niche, in which we basically replaced a system, the init system, with Nomad.
I am a distributed systems engineer. I have a background in infrastructure and ops, not development, and I’ve spent most of my life on call. And I think because of that, I tend to automate that 20% that gets forgotten about. I don’t like having to wake up at 3 o’clock in the morning, having to restart a service, or having to just twiddle that little tool to make sure it works.
I’ve created a couple of Nomad tools for the community. Sherpa is an autoscaling tool. Its primary use is to autoscale Nomad jobs, either using an external webhook or an internal autoscaler. Levant is a deployment and templating tool, just to provide some kind of developer help. And Nomad-Toast, not named by me, is a notification tool that can take deployment actions or allocation changes and stick that into Slack for you.
» An overview of Nomad-Vault-Consul architecture
Let’s get into this. I’m going to give an overview of what the architecture looked like. This is probably one of the worst infrastructure diagrams I’ve ever written, but this illustrates it quite well.
On the left-hand side we’ve got the Vault cluster that we were using for secrets management. And it’s important to understand that the Consul here is used only as the Vault backend. This isn’t used by Nomad. I’ll get to the Consul for Nomad in a little bit.
In the middle we’ve got the core of Nomad service, and then on the right is this whole compute layer. This is different types of servers with different configurations.
This was all bootstrapped using Terraform. We used the libvirt and Vault providers. Then to do the config management, we were using Bash scripts. A bit old school, but we didn’t really need the full-blown configuration management deal. We didn’t want to have Ansible, Puppet, CFEngine, whatever you wish. It was just doing a minimal install.
This is what we got to: We could bootstrap that in 8 minutes. Discussing more about that bootstrapping process is a bit out of scope for today, but if you are interested in it, come find me later and I’ll be happy to chat about it.
» Architecture considerations
What were some of the considerations when we were coming up with this architecture? We did have the secure and segregated Consul and Vault cluster. As I said, we used it for PKI and for secrets. This might not be possible for everyone, if you’ve got cost constraints, or anything like that. We were running on physical hardware and virtualizing using libvirt, so it made sense to us to keep that whole portion separate.
Every server was running a Nomad client, even the Nomad servers themselves. I think I remember reading a year ago that they advised not to do this, because if your Nomad clients are using all of the resources or a lot of the resources on that server, you’re going to starve the Nomad servers, and they’re not going to be able to do their job.
You can get around this. The Nomad client configuration has a reserved parameter, and with it you can reserve resources so Nomad won’t schedule onto them. We also did this because it’s nice to be able to run maintenance tasks, reporting tools, and management across all your cluster just using Nomad. That was one of the primary reasons we did that.
And on that pool of workloads, we used meta parameters and class parameters, or class values, for the clients themselves. Some of the workload we were running was very hardware-specific, so we had to ensure that a particular batch workload was placed on a particular server. And using the Nomad client meta, we were able to make this general pool of workloads be highly flexible in where we were placing things.
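As a rough illustration of how node class, client meta, and reserved resources narrow down where a workload can land, here is a minimal Python sketch. It is not Nomad's scheduler; the `Node` type, the node names, and the `gpu` meta key are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical, simplified view of a Nomad client node."""
    name: str
    node_class: str
    cpu_mhz: int
    reserved_cpu_mhz: int = 0
    meta: dict = field(default_factory=dict)

    @property
    def schedulable_cpu_mhz(self) -> int:
        # Reserved capacity is held back, e.g. for a co-located Nomad server.
        return self.cpu_mhz - self.reserved_cpu_mhz


def eligible_nodes(nodes, node_class=None, meta=None, cpu_mhz=0):
    """Return nodes matching the class, every meta key/value, and the CPU need."""
    wanted_meta = meta or {}
    result = []
    for node in nodes:
        if node_class is not None and node.node_class != node_class:
            continue
        if any(node.meta.get(key) != value for key, value in wanted_meta.items()):
            continue
        if node.schedulable_cpu_mhz < cpu_mhz:
            continue
        result.append(node)
    return result


nodes = [
    Node("server-1", "control", 4000, reserved_cpu_mhz=3000),
    Node("gpu-1", "compute", 8000, meta={"gpu": "true"}),
    Node("cpu-1", "compute", 8000, meta={"gpu": "false"}),
]

gpu_hosts = eligible_nodes(nodes, node_class="compute", meta={"gpu": "true"}, cpu_mhz=2000)
print([node.name for node in gpu_hosts])
```

In this sketch the hardware-specific batch workload only fits `gpu-1`, and `server-1` is skipped for anything large because most of its CPU is reserved.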
That was the bootstrap phase, and so from here on in everything’s managed by Nomad. And what were some of the first things we did after the bootstrapping process? During bootstrapping we used this small Go process called gotun, and that was providing proxying between the Nomad servers and the clients and the Vault cluster. It has 5 stars on GitHub, but it worked for us.
And then it might have been quite a long time between bootstrapping and the actual use of the clusters. We just left this in place to provide connectivity. But we would stop that using a batch job through Nomad, and then we would start Fabio.
» Consul for Nomad
I’m a big fan of Fabio. We were using that with traffic shaping to provide access to Vault from the rest of the environment. This is where Consul for Nomad comes in, and we run it on Nomad. The Consul server is a service job, and the Consul client is a system job, because we wanted that everywhere.
I had a bit of a love-hate relationship with doing this. Sometimes I’d look at it and think, “Why the hell am I running Consul on Nomad?” But having Nomad manage Consul resource allocations if we needed to was quite nice. And in the end we preferred doing that to running Consul on the servers as a binary itself.
The top code snippet is the template section from the job that was stopping gotun. It’s not very pretty. This was an early example of that.
All we’re doing is embedding Bash scripts within the template section. Running this as a batch job will stop and remove the gotun binary. The Consul code there, if you’re familiar with Consul, is just writing a KV, and this was done at bootstrap time. It’s basically telling Fabio that when you’re coming to route traffic to Vault, just send 100% of that traffic to the service with the active tag.
Vault acts quite nicely in updating Consul if the active node changes. This was a nice way to ensure that we were always routing to the correct node.
The final snippet is from the Vault config file. This is just telling Vault that when you register with Consul, put these particular tags into your current Consul service registration. And you’ll notice we are using the proto of HTTP. That’s because we were doing SSL offloading at the Fabio layer. And that’s to do with the whole bootstrapping process. How can Vault itself run with TLS if Vault is the thing providing TLS, unless you have some seed cluster somewhere?
That was a lot nicer for us than running HAProxy. We struggled with HAProxy to the point where 3 of us had to look into the HAProxy source code, and that was not a fun day.
» TLS management
Some of this might be defunct after this morning’s announcements. I might look back at this in a couple of weeks and go, “What the hell was I thinking?”
TLS does take a long time to set up. It was difficult but it was something we wanted quite urgently. Even having TLS certificates with a TTL (time-to-live) of 10 years, so you could have just one cluster that sat there, a Vault cluster. You generate a 10-year certificate and you put that there and run Nomad using that for example. That wasn’t the proper answer for us.
We used short TTLs as default. We did that from day one. That was both on infrastructure applications such as Nomad, such as our database, but that was also on all of our internal applications. And some of our smaller applications, some of our maintenance tasks that ran for minutes, for seconds, they would use low TTLs. I think the lowest is 15 minutes. I could be wrong on that. And the highest one we had across the whole platform was 720 hours, which is 30 days.
As I said, this took a long time to get stable. I got asked by my manager, “What do you think is the likelihood of getting this in after we do everything?” And I told him it was pretty impossible.
To help manage the low TTLs, we wrote a small application that would initially perform TLS expiration checks. So you would be able to pass a threshold in days and it would check the certificate it was pointed to and it would tell you whether or not it was going to expire.
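The expiration check described here boils down to comparing a certificate's expiry time against a threshold. The following is a minimal Python sketch of that comparison, not the team's tool; the function name and parameters are assumptions, and extracting `not_after` from a real certificate is left out.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(not_after, threshold_days, now=None):
    """Return True when the certificate expires within `threshold_days`."""
    now = now or datetime.now(timezone.utc)
    # An already-expired certificate gives a negative remainder, so it also qualifies.
    return not_after - now <= timedelta(days=threshold_days)


now = datetime(2019, 8, 2, tzinfo=timezone.utc)
expiry = now + timedelta(days=10)
print(needs_renewal(expiry, threshold_days=14, now=now))  # inside the window
print(needs_renewal(expiry, threshold_days=7, now=now))   # still outside it
```

Because `threshold_days` can be fractional, the same check would also cover certificates with TTLs measured in minutes.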
We later added IP SAN checks. This allowed us to bring more servers into the cluster, and we could just trigger this job. And all the TLS certificates in the cluster which were affected by these IP SAN changes would then update their certificate. And if either of these checks failed, cert manager would go off to Vault, request a new certificate, grab that, and put it on disk. And then we would run any arbitrary commands, anything you want, which would then force that TLS certificate to be updated and be changed.
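An IP SAN check of this kind can be thought of as a set difference between the addresses a certificate should cover and the ones it actually lists. The Python sketch below assumes the SAN addresses have already been read from the certificate; the function name is hypothetical.

```python
import ipaddress


def missing_ip_sans(cert_ip_sans, required_ips):
    """Return the required IPs that the certificate does not list as SANs."""
    # Parsing normalises equivalent spellings, e.g. "::1" and "0:0:0:0:0:0:0:1".
    present = {ipaddress.ip_address(ip) for ip in cert_ip_sans}
    required = {ipaddress.ip_address(ip) for ip in required_ips}
    return sorted(str(ip) for ip in required - present)


cert_sans = ["10.0.0.1", "10.0.0.2"]
cluster_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # a third server has joined
print(missing_ip_sans(cert_sans, cluster_ips))
```

A non-empty result would be the trigger to request a new certificate from Vault and run the reload or restart command.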
We used this process quite successfully for everything that wasn’t managed by Nomad. We also had things like built-in jitter to avoid thundering herd. We also exposed structured logging and telemetry, so that we could monitor TLS certificates in exactly the same way that you would want to monitor any of your applications.
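Jitter in this sense just means each node waits a random amount of time before asking Vault for a new certificate, so the requests do not all arrive at once. A minimal Python sketch, with hypothetical node names and a made-up 300-second ceiling:

```python
import random


def jittered_delays(node_names, max_jitter_seconds, seed=None):
    """Give every node its own random start delay in [0, max_jitter_seconds]."""
    rng = random.Random(seed)
    return {name: rng.uniform(0, max_jitter_seconds) for name in node_names}


delays = jittered_delays(["node-1", "node-2", "node-3"], max_jitter_seconds=300, seed=42)
for name, delay in sorted(delays.items(), key=lambda item: item[1]):
    print(f"{name} renews after {delay:.1f}s")
```

The `seed` argument exists only to make the sketch repeatable; a real job would leave it unset.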
» The downsides
There were some downsides, as always. It used some interesting deployment logic to make sure that all nodes of a particular class had the job run on them. We were effectively replicating the system batch. It wasn’t hugely problematic for us. This was because we weren’t having all these ephemeral nodes that were coming in and out.
We were running on tin, so the members of that cluster were not changing particularly much. Applications that don’t support SIGHUP reload had to be restarted through that mechanism. And, no matter how good the automation is, I always get a bit squeaky when I know one of my applications is performing a restart at 3 o’clock in the morning.
So, if you are ever writing applications, please always plan for TLS reload via SIGHUP. And picking Nomad as an example, you may have TLS on the API, but you also have downstream TLS connections to Consul and Vault, for example. So being able to reload those is quite important, otherwise you’re going to have to restart the application anyway.
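To make the SIGHUP point concrete, here is a minimal Python sketch of an application that keeps its certificate in memory and re-reads it on SIGHUP instead of restarting. The class and function names are hypothetical, and a real server would also need to rebuild whatever TLS context it hands to its listeners and downstream clients.

```python
import signal


class ReloadableCert:
    """Holds certificate bytes in memory and re-reads them on demand."""

    def __init__(self, cert_path):
        self.cert_path = cert_path
        self.pem = b""
        self.reload()

    def reload(self):
        # Read the whole file first, so a failed read leaves the old value in place.
        with open(self.cert_path, "rb") as handle:
            fresh = handle.read()
        self.pem = fresh


def install_sighup_reload(cert):
    """Re-read the certificate whenever the process receives SIGHUP (Unix only)."""
    signal.signal(signal.SIGHUP, lambda signum, frame: cert.reload())
```

After `install_sighup_reload(cert)` has been called from the main thread, whatever rotates the certificate on disk can send SIGHUP to the process, and no restart is needed.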
» TLS bundle splitting
So I put bundle splitting in because we did it this way, and I am not sure if this is a decent way, but this is how we did it. When Nomad is requesting a TLS bundle from Vault, it uses Consul template, but it comes as a full bundle. You get the private key, the CA chain, the certificate in there. Most applications don’t accept a bundle.
We did write our internal applications to accept a bundle; that was one way we got around it. The application itself would then split the bundle and load it into memory how it wanted to. But for things like our database, for the things that we didn’t control, we used gomplate.
And we used that to perform the splitting magic that we needed to get these bundles into their component certificate and private key. This is just the very top section; there was a lot more than that. For anyone that uses Consul template, you will be familiar with the first section. We’re just requesting a certificate from Vault, and we are piping that to JSON so you get the JSON representation of that bundle.
The second is where gomplate comes in. And if we look at the data within the template, we are effectively printing out from the bundle, the private key of that JSON. And we’re writing that template file to disk. This is where it gets a bit sketchy.
When we were running the applications, we invoked gomplate as the primary application to run. The -d is specifying the data source, so in this case the JSON bundle that we want to take in as our data and split up. The -f is the template file to process, the one that was written in the previous slide. And the -o is the output file, the rendering of that file.
Then, at the bottom of this slide, we can call the application with those split files out. This worked quite well for us. We put a few fixes into gomplate itself to make sure that signals were handled properly and propagated to children. But if anyone knows a better way, please tell me, because that would be great. If not, this does work, and this was pretty solid for us.
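For readers who want to see what the splitting amounts to, here is a minimal Python sketch of the same idea, not the gomplate setup from the talk. It assumes the JSON bundle carries the `certificate`, `private_key`, and `issuing_ca` fields that Vault's PKI issue endpoint returns; the output file names are hypothetical.

```python
import json
import os


def split_bundle(bundle_json, out_dir):
    """Write the certificate, private key, and CA from a JSON bundle to separate files."""
    bundle = json.loads(bundle_json)
    parts = {
        "cert.pem": bundle["certificate"],
        "key.pem": bundle["private_key"],
        "ca.pem": bundle["issuing_ca"],
    }
    paths = {}
    for filename, pem in parts.items():
        path = os.path.join(out_dir, filename)
        # New files are created with mode 0600, since one of them is a private key.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(pem.rstrip("\n") + "\n")
        paths[filename] = path
    return paths
```

An application that only accepts separate files could then be started with the three returned paths.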
» Running a database on Nomad
Now I’m going to move on to CockroachDB. For those who don’t know what Cockroach is, it’s distributed SQL with horizontal scaling. It provides location partitioning, so that you can run global clusters. And it is cloud-native technology. So, if anyone is playing Buzzword Bingo, there you go.
I’m going to start this section by quoting Sir Kelsey Hightower, as he should be called, who spoke at HashiConf in 2016 and said that, “Most people get really excited about running a database inside of a cluster manager like Nomad; this is going to make you lose your job. Guaranteed.” I still have a job, so this isn’t exactly true. But I still do believe this is true for most traditional DBs.
That being said, if I had something like AWS RDS available to me, I’m probably going to use that instead of trying to maintain my own database cluster on a scheduler. But, counter to that, with more modern DBs like CockroachDB, like TiDB, I think it’s become easier and safer over the past few years to do this, to a point where we ran this happily, as our main data source in this project, without too much trouble.
Even when we got into trouble, when we had a moment when the TLS certificates may have expired on the Cockroach cluster, by doing some manual intervention and some kind of funky hacks onto the cluster itself, we managed to bring the cluster back up. We didn’t lose any data. It was there, it was fine, and so it proved to be quite resilient.
» Setting up the cluster
How do we set up the cluster? We had service placed across hosts using constraints. This ensured that we had HA and redundancy. We used ephemeral disks as well.
If anyone’s not familiar with what ephemeral disks are, we have it set up so that if an allocation failed, it would be put on the same server, the same physical host that it failed on. This meant that only the local and the data directories had to be copied across disk. It was a lot safer, it was a lot quicker, and we found this to work really well. It also worked really well when doing deployments and upgrades of the cluster.
Cockroach requires an initialization, just an init command, to be run. And again, we used Nomad to run this, but as a batch job. This was all part of apply-plan, and we could trigger it from our deployment tool. One thing that we always did (we paid careful attention to everything, of course) was pay careful attention to the job parameters in Cockroach, and we kept analyzing to make sure that we had timeouts set right and that we had the different stanzas configured separately. So we knew what we were getting into if something failed or if we were doing a deployment.
And this on screen is how a couple of the bits of the jobs looked. The top left is the actual DB cluster group section. You can see the ephemeral disk using “sticky.”
We are using a count of 3 and the distinct hosts constraint to force this to go across a number of hosts and not all be bin-packed onto one. And the constraint there is just an example. This was to make sure that this wouldn’t run on any hardware that was better suited to running another type of workload.
» Initializing the cluster
The second box on screen is the init command. And really, apart from pulling some TLS certs from Vault, this was all the job did. It would run in less than a second, and it just makes sure that it initializes the cluster on the first run. But now you’ve got the cluster. What do you do with your schemas and everything like that?
We store that table schema alongside our application code. And we wrote a small application that was used to apply the schema changes. It used gobuffalo/packr. And the schema change was effectively an up or a down file. And just including things like
We had the ups and downs so that we could roll back any problems that we encountered. It was also idempotent. So often, when we triggered a full deploy of any platform code that had been built, we could deploy with the migrations. We would ensure that we’ve got the latest changes. If there weren’t any changes, and they had all been applied, then the code would just go through it anyway, and it would be fine.
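The up/down and idempotency ideas can be shown with a small Python sketch. This is not the team's tool: it uses SQLite from the standard library as a stand-in for CockroachDB, and the migration contents and the `schema_migrations` table name are hypothetical.

```python
import sqlite3

# Hypothetical migrations: (version, up SQL, down SQL), ordered by version.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)", "DROP TABLE users"),
    (2, "CREATE INDEX users_name_idx ON users (name)", "DROP INDEX users_name_idx"),
]


def applied_versions(conn):
    """Return the set of migration versions already recorded as applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)")
    return {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}


def migrate_up(conn):
    """Apply every migration not yet recorded; return the versions applied."""
    done = applied_versions(conn)
    applied = []
    for version, up_sql, _down_sql in MIGRATIONS:
        if version in done:
            continue  # already applied, so re-running is a no-op
        conn.execute(up_sql)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        applied.append(version)
    conn.commit()
    return applied


def migrate_down(conn):
    """Roll back the most recently applied migration, if any."""
    done = applied_versions(conn)
    if not done:
        return None
    latest = max(done)
    down_sql = next(down for version, _up, down in MIGRATIONS if version == latest)
    conn.execute(down_sql)
    conn.execute("DELETE FROM schema_migrations WHERE version = ?", (latest,))
    conn.commit()
    return latest


conn = sqlite3.connect(":memory:")
print(migrate_up(conn))    # first run applies everything
print(migrate_up(conn))    # second run finds nothing to do
print(migrate_down(conn))  # rolls back the newest migration
```

Recording applied versions in the database is what makes it safe to run the migration step on every deploy.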
We even had a command that we could seed data for development purposes. Again, this was run through Nomad as a batch job. This was perfect for any integration testing, for local development, and also for doing demos. So that we could get some data in, bring it all up, automate it, and we could go.
Backups are probably one of the most common batch workloads to run on Nomad. The periodic stanza within Nomad’s batch job template allows you to configure the backup jobs to run to meet any SLAs you have, so you can set it to run every minute, every hour, every day. However you need. We use this mechanism to back up CockroachDB tables as well as our Consul Nomad cluster and our Vault cluster.
We wrote thin wrappers around the Cockroach dump command and the Consul snapshot command, just so that we could control a little better some of the naming when we put that file up into our external storage. And the application also has a restore command. If anyone remembers reading about the GitLab outage, this is kind of relevant here.
A backup isn’t a backup until you’ve tested that the restore works. It just isn’t. We have the restore command for 2 reasons. The first is so that we can test it quickly. The second is that, when you get into that DR situation, you’re not trying to stumble through a process that you’re not familiar with, running different commands that you aren’t sure of.
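As a small illustration of "test that the restore works", the Python sketch below takes a dump, restores it into a fresh database, and compares the rows. It uses SQLite from the standard library as a stand-in for the real dump and snapshot commands, and the naming scheme and table are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def backup_name(cluster, taken_at):
    """Build a sortable file name from the cluster and a UTC timestamp."""
    return f"{cluster}-{taken_at.strftime('%Y%m%dT%H%M%SZ')}.sql"


def dump(conn):
    """Serialise the whole database to SQL text."""
    return "\n".join(conn.iterdump())


def restore(sql_text):
    """Load a dump into a fresh in-memory database."""
    restored = sqlite3.connect(":memory:")
    restored.executescript(sql_text)
    return restored


def restore_matches(source, restored, table):
    """Compare every row of `table` between the source and the restored copy."""
    query = f"SELECT * FROM {table} ORDER BY 1"
    return source.execute(query).fetchall() == restored.execute(query).fetchall()


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO jobs (name) VALUES (?)", [("backup",), ("restore",)])
source.commit()

name = backup_name("cockroach", datetime(2019, 8, 2, 3, 0, tzinfo=timezone.utc))
restored = restore(dump(source))
print(name, restore_matches(source, restored, "jobs"))
```

Running a check like this on a schedule is one way to exercise the restore path before a DR situation forces it.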
Having this ensures consistency across both the backup and the restore. And with this in place and with the automated bootstrapping, you can build this platform in 15 minutes with full data recovery. And so “everything as code” works nicely.
» Ingress and discovery
Just a few miscellaneous points that I wanted to bring up that don’t really deserve their own section. Or I didn’t find anything as interesting to say about them.
As I mentioned, I’m a big Fabio fan. I’ve used it for as long as I’ve used Nomad. And we used Fabio to provide external access to all of our services and their UIs.
We added HTTP-style access to Fabio itself so you can secure UIs. We also added gRPC access features to Fabio. And for the UIs (the Consul, the Nomad, the Vault UIs) we ran community UIs instead of the official ones. We liked the idea of being able to constrain where the UI ran, rather than being dictated that it ran where the API was. But this was just one of our personal preferences.
» Service discovery
This is probably a bit old since this morning. For applications we didn’t maintain, we used Consul DNS service discovery. Very easy to plug into applications. For our internal codebase, we were running gRPC services. So we used the gRPC Consul resolver, and this worked brilliantly. It was 30 lines of code to hook it up. Very simple, and worked very nicely.
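Part of why Consul DNS is easy to plug in is that a service is just a predictable hostname of the form `[tag.]<service>.service[.datacenter].<domain>`. The Python sketch below only builds those names; the helper name is hypothetical, and the `vault`, `active`, and `dc1` values are examples.

```python
def consul_dns_name(service, tag=None, datacenter=None, domain="consul"):
    """Build the DNS name Consul serves for a service, optionally narrowed by tag."""
    labels = [tag] if tag else []
    labels += [service, "service"]
    if datacenter:
        labels.append(datacenter)
    labels.append(domain)
    return ".".join(labels)


print(consul_dns_name("vault"))
print(consul_dns_name("vault", tag="active"))
print(consul_dns_name("vault", tag="active", datacenter="dc1"))
```

An application that knows nothing about Consul can then be pointed at such a name, provided its host resolves the `consul` domain through a Consul agent.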
» Path to production
This was something instilled in me by one of my old managers, and it’s something I still hold dear. That entire platform, the entire infrastructure, the entire process could be built on my local development machine that I have in my house, at home. We were using the exact same processes, all the same tools.
And this was really nice because it allowed us to even test the infrastructure processes. We didn’t have to push that code and then see 30 red jobs in Jenkins before we figured out that it’s a typo. And this was really important in having this path to production.
All of our deployments as well, we use TeamCity. We use Levant, which I explained and described earlier, and we use Nomad-Toast. Again, Nomad-Toast is running on the Nomad cluster.
And this gave us better automation. This gave us better observability about what was going on.
And despite being a research project, we did have monitoring in place. We used our own application for this. Very small. But we had a Consul health-checking service that would then use the health checks to alert our Opsgenie, so that we can get woken up at 4 o’clock in the morning.
With that we also shipped all our logs to Humio, using a Filebeat shipper. It proved really good. We used the log-shipping pattern that is described in most of the reference architectures. We were running just a log shipper per server. And we found that to scale fine. And we ran that as a system job without any constraints, and that means that it will just run on every server, every client in the cluster.
» Wrapping up
From the bootstrapping phase, once that was done, we used Nomad to manage everything, and I mean everything. We even had particular people building Nomad jobs which would install packages on the underlying hardware.
And relying on a single mechanism for running these tasks, for me, simplifies the whole process. It ensures consistency. And it also lowers the barrier to entry.
If I have someone in my team that is more of an out-and-out developer, who doesn’t know Terraform, who doesn’t know Ansible, who doesn’t know Jenkins or things like that, and doesn’t know Nomad, in this environment, all they would have to do is learn a little bit of Nomad. Even just the little bit they really care about, which is the config and the commands. They can get my help or someone else’s help, and they’re configuring their instances, or they're running their application code really simply.
It’s true we did write a fair number of small applications to help with the tasks. And this isn’t to say that any of the tools we’re using were bad, just this really helps us smooth any edges out and streamline the process. And it made, for me, a platform that was very easy to deploy, really easy to maintain, but also very solid and worked very well for us.
Thank you. I hope you enjoy the rest of your conference, and if you’ve got any questions, I’ll see you out there.