Build your own autoscaling feature with HashiCorp Nomad
Jun 28, 2018
There's a common misconception that autoscaling means automatic scaling up or scaling down. But in most cases, it actually means metrics-based scaling. However, Joyent used HashiCorp Nomad to build true autoscaling that works for any cloud.
UPDATE 2020: HashiCorp Nomad now contains its own autoscaler functionality. Learn more in this blog post.
The concept of autoscaling for the cloud can be as simple as a distributed cron job. If you use a cloud, or a private virtualisation environment that doesn't support autoscaling, then you can use HashiCorp Nomad to build your own. In this session, Paul Stack, a software developer at Joyent, demonstrates how he used Nomad to build an autoscaling feature on top of Joyent's Triton cloud.
- Paul Stack, Software developer, Joyent
Okay, so my name's Paul Stack, I work for a company called Joyent, and I'm here to talk about how to build your own autoscaling feature using Nomad. Everybody know what Nomad is I'm presuming? Someone not know what Nomad is? Okay, so we'll look at it as we go through.
Autoscaling is supposedly a very simple concept that allows you to scale up or scale down a system in a simple manner, that allows you to optimize for cost. For example, every day in your development environment or your test environment, you can actually spin your entire environment up during the work hours, and only be paying for that time, and then at the end of the day, spin it back down so that you're not actually wasting money.
Autoscaling in GCP, Azure, and AWS
Autoscaling, it's an interesting name for the concept. A lot of major clouds have it, so Amazon has it. This is all Terraform configuration code [1:11], so if it looks a little alien, I apologize. I primarily use Terraform code. In Amazon, you basically pass a name of an Auto Scaling group ("Auto Scaling" is the AWS-specific product name), you pass the max size, the min size, and then the desired capacity of how many instances you want running. You can give it a specific health check and you can launch it based off a known configuration. It's quite a simple way, but you actually have constraints about the maximum it's allowed to run up to, and the minimum it's allowed to drop to.
Azure has a very similar concept. They have something called scale sets, and the Terraform configuration is slightly more long-winded in this case, only because it's a different configuration. It's not a mapping of one-to-one. But in Azure, the main part is that you see capacity equals two. There are no constraints, you can't say it runs with a maximum of five and a minimum of two, you actually say, "I want two machines running," and it will adhere to try and keep two machines running.
Google Cloud also has autoscaling, but in a slightly different manner. You use instance groups, and you use autoscalers based on those instance groups. As we can see, a lot of the major clouds have autoscaling.
Why is it known as autoscaling? It's not autoscaling, okay. Auto Scaling is what Amazon calls it, but of course, unless you configure other policies, it doesn't automatically scale. I soon realized there was a massive misconception around autoscaling. Mid last year I was doing some work for a company in America, and I implemented what they call Auto Scaling in Amazon, and the CTO was like, "Great. When it's at peak times, we'll add more machines." I was like, "No. It's going to adhere to keep three machines or five machines running," and he was like, "But it's autoscaling." I said, "No, that's metrics-based scaling." There's a huge difference in how people actually do it.
I actually believe, and some people who know me will laugh at this, I actually believe Microsoft have it correct—they call it "scale sets." They don't actually use the term autoscaling, because people then don't actually believe it's autoscaling.
What most people mean today by autoscaling is actually metrics-based scaling, and how you would do it looks as follows. This is an Amazon example [3:27]: you would create an alarm. When something happens within our system, in this case, when the average CPU utilization is greater than 80% for a period of five minutes, then we raise an alarm. Our system pings and says wait, there's something happening right now, therefore we actually have to be able to react to an event.
Based off of that alarm, you can do steps two different ways: firstly you can have an increased group size, and what we would say is—when we actually trigger this alarm, we will add 30% of our machines, and it would allow you to do that. The converse is that when we trigger a different alarm, we can remove instances when we meet a specific target.
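To make the alarm-and-step idea concrete, here is a minimal Go sketch. It is not how AWS implements alarms; it assumes one CPU sample per minute, treats "greater than 80% for five minutes" as five consecutive breaching samples, and rounds the 30% step up to whole instances. The names `Alarm`, `Breached`, and `ScaleByPercent` are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Alarm is a hypothetical threshold alarm: it fires when every one of the
// last Periods samples is strictly above Threshold.
type Alarm struct {
	Threshold float64 // e.g. 80 (percent CPU)
	Periods   int     // e.g. 5 one-minute samples
}

// Breached reports whether the most recent Periods samples all exceed
// the threshold. Too few samples means the alarm cannot fire yet.
func (a Alarm) Breached(samples []float64) bool {
	if a.Periods <= 0 || len(samples) < a.Periods {
		return false
	}
	for _, s := range samples[len(samples)-a.Periods:] {
		if s <= a.Threshold {
			return false
		}
	}
	return true
}

// ScaleByPercent changes current by pct percent (rounded up to a whole
// instance) and clamps the result to the [min, max] range of the group.
func ScaleByPercent(current, pct, min, max int) int {
	delta := int(math.Ceil(float64(current) * float64(pct) / 100))
	next := current + delta
	if next > max {
		next = max
	}
	if next < min {
		next = min
	}
	return next
}

func main() {
	alarm := Alarm{Threshold: 80, Periods: 5}
	cpu := []float64{50, 85, 90, 95, 88, 91} // assumed one sample per minute
	capacity := 10
	if alarm.Breached(cpu) {
		capacity = ScaleByPercent(capacity, 30, 2, 20)
	}
	fmt.Println("desired capacity:", capacity)
}
```

A scale-in alarm would be the mirror image: a second `Alarm` on a low threshold (with the comparison reversed) that calls `ScaleByPercent` with a negative percentage.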
Triton Service Groups: Autoscaling for the Joyent cloud
I mentioned at the start I work for a company called Joyent. We didn't have autoscaling. We have a cloud called Triton, and in January this year, my boss, who's here, I actually suggested that we needed an autoscaling feature that was an upstack feature based on top of Triton. We wanted to change how development worked within the company, and we didn't actually want to couple anything into our system, so that we could actually release in different schedules compared to the Triton core. We built an upstack service on top called Triton Service Groups, and this was our equivalent of autoscaling.
Now, because I've used quite a few clouds, it takes a little bit of each cloud that I liked. This is the scenario when you build in a feature that other people already have. You can take the pieces that you like from one and from the other. Again, we used the idea of templates. Templates are our equivalent of Amazon launch configurations. We didn't want to call it launch configuration because that belongs to Amazon, so therefore we have our own version of it. In this case, we have a template name, just for record in our database [[5:35]](https://youtu.be/9oqrk-18Dzs?t=5m35s). We have a machine name prefix, all machines will come up with a specific name prefix on the front of them, we have an image that we will create against, because Triton has the idea of images that are already pre-baked, and we have a package, so like the size of an instance, this was what it is. Then we attached some networks that we require.
Then at the bottom we have what's called a Triton service group resource. All we did right here is we basically said, "Bring this Triton service group up. We want three machines, and it's based on the image above." It's very simple, very easy. But the internals of it, we tried to keep even more simple. I mentioned that we were built on top of Triton. As we can see [[6:22]](https://youtu.be/9oqrk-18Dzs?t=6m22s), we have the Joyent public APIs. This is what SDKs hit, this is what the Triton CLI hits, we took it a level on top. This big black box at the top, we have three things in it. Firstly it's a metadata store. We needed to store our information somewhere. In our case, we actually chose CockroachDB because we wrote all of this code in Go, and Cockroach is a very simple, single binary system to deploy. It allows you to scale out very easily, just by running a binary and running a join command, so it's perfect for this use case. It was perfect for us to understand, and it's very simple for administrators to look after.
The next big thing that we chose is right in the middle [7:07]. We chose Nomad. Nomad is the perfect orchestrator for what we're trying to do right now, because we're treating it as a distributed cron job. When we built autoscaling, in a nutshell you define a level and every so often in a health check, it checks if that level is the same. If it's exceeded it removes an instance or x instances, if it's below it will add more instances. We can get Nomad to do that. We can pass control of Nomad over to do that.
Then lastly, we have API and a Fabio instance on the front, just because we needed a load balancer and we just needed an API. The API's written in Go, it's open source, you can go and have a look at it, and I'll show you the repository at the end. Then we needed a Terraform provider that actually spoke to this.
Let's have a look inside what it's actually doing. Firstly, if we go in and have a look, our system actually will generate a Nomad job from within the code itself. We will pass in certain parameters, so we'll pass in a data center of course, we have some rules, we have an artifact that's already pre-created, it's templatized so that we can download and we can actually run, and then we run a number of commands inside that CLI. We have count, we have package ID, we have image ID, we have name and we have template ID. We have just certain variables that are required when we interact with the Triton API in order to spin up nodes.
It will generate a Nomad job based on the name of the Triton service group, and the template ID. In our case, template IDs are immutable. Templates are immutable. We don't want you changing a template after instances have been launched based on that template. We took the Amazon approach here, because from all the years that I've been using autoscaling groups, it was very important that when you change the template, you may not want the instances to change straight away. You would only potentially want the instances to change on the next launch, on the recreate, so that we could actually make changes to the template as much as we want, and we could orchestrate when those changes were actually going to filter their way through the system.
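As an illustration of generating a Nomad job from code, here is a minimal Go sketch that renders a periodic batch jobspec with `text/template`. It is not the project's actual generator. The job-name format, the `tsg-cli` command path, the CLI flag names, and the artifact URL are all assumptions; only the `periodic`, `artifact`, and `exec` driver stanzas are standard Nomad jobspec features.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// JobParams holds the values the talk says are passed into the job.
type JobParams struct {
	GroupName   string
	TemplateID  string
	Datacenter  string
	ArtifactURL string // where the pre-built CLI binary is downloaded from
	PackageID   string
	ImageID     string
	Count       int
}

// JobName derives the job name from the group name and template ID.
// The "<group>-<template>" format is an assumption for illustration.
func (p JobParams) JobName() string {
	return fmt.Sprintf("%s-%s", p.GroupName, p.TemplateID)
}

// jobTmpl is a periodic batch job that runs the CLI once a minute.
// The flag names passed to the CLI are hypothetical.
const jobTmpl = `job "{{.JobName}}" {
  datacenters = ["{{.Datacenter}}"]
  type        = "batch"

  periodic {
    cron             = "*/1 * * * *"
    prohibit_overlap = true
  }

  group "scale" {
    task "reconcile" {
      driver = "exec"

      artifact {
        source = "{{.ArtifactURL}}"
      }

      config {
        command = "tsg-cli"
        args = [
          "--count", "{{.Count}}",
          "--package-id", "{{.PackageID}}",
          "--image-id", "{{.ImageID}}",
          "--name", "{{.GroupName}}",
          "--template-id", "{{.TemplateID}}",
        ]
      }
    }
  }
}
`

// RenderJob fills the template and returns the jobspec as a string.
func RenderJob(p JobParams) (string, error) {
	t, err := template.New("job").Parse(jobTmpl)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := t.Execute(&b, p); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	spec, err := RenderJob(JobParams{
		GroupName:   "web",
		TemplateID:  "42",
		Datacenter:  "dc1",
		ArtifactURL: "https://example.com/tsg-cli", // placeholder URL
		PackageID:   "pkg-id",
		ImageID:     "img-id",
		Count:       3,
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(spec)
}
```

Because the template ID is part of the job name in this sketch, a new template produces a new job rather than mutating a running one, which matches the immutability described above.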
We tried to be quite careful here. Because this was potentially for a public cloud, we did not want you changing something and then us inadvertently going and destroying your machines and recreating them, because we don't know what you have done with those instances since you have actually created them on our system, so it's important that that was the case.
The instance itself is extremely simple [10:10]. We have an ID, we have a template name as I said, we have a package, we have an image, whether the firewall's enabled, and then we pass in some user data and some metadata. We do not destroy any data. If you change your templates, if you change your instance size, if you change anything to do with it, even if you destroy your Triton service group, we do not delete any data. We simply archive it so that for later purposes, you can actually see when something is still happening in our system.
Then lastly, our groups are very simple. We have a group name, a template ID that's attached to it, and we have a capacity. We know when it's been created and we know when it's been updated, again, for audit trail purposes.
Design decisions for Triton Service Groups
Let's have a look at what else is going on here. How simple is simple? The CLI is probably one of the most basic Go programs I have ever written, and I said it at the start, it's actually written against the Triton public API. Why did we do that? We didn't want to maintain other versions of APIs. The APIs are there, the APIs are available, anyone who uses our cloud can write against that API. Our Terraform configuration writes against that API, Packer configuration writes against that API, so we just thought that we're another part of that chain that wrote against that API. We could leverage all of the pieces that the actual core Triton team had given us, so that we could do it going forward.
But of course, this gives us limitations. This is not a mature project, this is only a very short-lived project that was started in January as I said. There were only three people working on it. Originally for the first month, there was only me working on it, because my colleague was on paternity leave, so it was a case of we had to make trade-offs very early on.
Number one, scaling right now is serial. I really do not want to DoS attack my own public cloud. That would not go down very well with my company, and it would cause me some big problems internally. Until this became a more mature project, we thought let's do this in a serial manner. The API is fast enough that to spin up 10 instances, it still only takes less than a minute, so that's not too bad a way to do it. The plan was that if we needed three machines spun up or three instances managed, it would spin up three goroutines, and it would actually do everything concurrently. We just didn't quite get there yet.
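The concurrent version the team had planned might look something like the following Go sketch. It is a guess at the shape, not the project's code. It caps the number of in-flight API calls with a buffered channel, which is one way to avoid hammering the cloud API; the `create` callback and the `ErrQuota` error are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrQuota is a made-up error used in the demo below.
var ErrQuota = errors.New("quota exceeded")

// CreateConcurrently calls create for indexes 0..n-1 using at most limit
// goroutines at a time, and returns one error slot per call.
func CreateConcurrently(n, limit int, create func(i int) error) []error {
	if limit < 1 {
		limit = 1 // limit of 1 degrades to the serial behaviour
	}
	errs := make([]error, n)
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are in flight
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			errs[i] = create(i) // each goroutine writes its own slot
		}(i)
	}
	wg.Wait()
	return errs
}

func main() {
	errs := CreateConcurrently(5, 2, func(i int) error {
		if i == 3 {
			return ErrQuota // pretend the fourth create fails
		}
		return nil
	})
	for i, err := range errs {
		fmt.Printf("instance %d: %v\n", i, err)
	}
}
```

Keeping a per-call error slot means one failed create does not hide the others, so the next periodic run can simply try again for whatever is still missing.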
Number two. The health checking is extremely basic. I said we don't want to change your machines without you doing it yourself, because we don't know what is on your machines. We didn't get a chance to add features like they have in Amazon, where you can actually specify load balancer health checks. It just wasn't possible in the time period that we were given for this project, and it also wasn't possible given that we only had a few people. For this to be more usable for anyone that is actually going to use it in production, you would need to write your own check on top before you orchestrate actually bringing machines up and down, so it's really important that people know that.
Number three, and the most important, there is currently no idea of metrics-based scaling. Hands up, people who use Amazon metrics-based scaling. Okay. A very small number of people in the room, but hands up people who use autoscaling in some manner. A lot more. It wasn't a deal breaker to release without that at the start, but we actually still needed to plan for it going forward.
Demo: Using Terraform and Nomad with Triton
Let's have a look at actually what it's doing. I have a very simple Terraform configuration file right here [14:36]. Extremely simple. I'm not going to run this, I've actually prerecorded because I've been dumb with demos before, so I apologize, but I'm just going to show you exactly the same things all running on my machine.
Firstly we have an image that we have pre-selected. In this case it's called a base-64 LTS. It's just a predefined image. It could be Ubuntu, it could be CentOS, if you really wanted, it could also be Windows. From there, we can specify some instances, an instance template based on that. In this case we have an image, which is the data source defined above. We have a package and we have some networks, and whether the firewall's enabled or not, and then we have some tags. Then the last part is extremely simple. We have a group name, we have a template, and we have a number of machines.
We wanted specifically to keep this code as easy as possible, and if I 'Terraform plan' it, this should not fail, we can see it's going to create two things. First thing it's going to do is it's going to create a template, and the next thing, it's going to create a service group. Notice it's actually not going to tell you it's going to create two machines. This is just going to create two Terraform resources and pass this off to Nomad as the system that manages it for us. If we Terraform apply that, it's going to be extremely fast in order to create it. We can see that in less than a second, it's actually gone off and created some pieces of the puzzle.
Now I have Nomad running locally, in a Vagrant box, and as you can see, there are currently no jobs running in Nomad. If I run 'nomad status' at this point, we can see that two things have happened. The first thing is we actually have a top level job, which is a batch job that actually will manage everything. It's the top level piece of the puzzle. Then we have a periodic job that runs under that, and the cron in this case is every 60 seconds. Every 60 seconds it will actually check the state of your machines, and it will actually act based on that. We can actually see, if we go into the Nomad console again, if I refresh, we can see that it's currently running, and we can actually go in, the whole way down, and we can see exactly what it's doing and we can see the definition of what it's created and so on and so forth. Everything's there.
We also have CockroachDB running, where we can see that there are some tables in place. We have accounts, we have groups, we have keys, we have templates, and we have users, because we needed to specify all those pieces in order to actually spin the machines up and down. We don't want to keep any of your data, we don't want to keep any of your SSH keys, we don't want to keep anything. All we want to do is be a proxy to the underlying APIs, which actually handle it for us.
I can actually show you this in motion. In exactly the same way we're doing it, we have Triton instance list, in this case we're going to show that there are no instances currently running. It's a little slow, I apologize. No instances are running right now. I have a Terraform plan file. In this case, it's actually going to spin up 10 instances that are pre-configured. Now we're going to 'Terraform apply', and we're going to approve that plan. Here we've actually applied the resources.
Now I can actually start to interact with Triton again, and start to say, "Give me the list of instances that are currently running now, with the name 'srpol'." Now we should start to see instances spinning up. There's the first four, and we run it again, we'll see some more, and we run it again and we'll see some more. This is where the problem I told you about, this is serial right now, but it works, and we can actually, at the end, run Triton instance list and we'll see that some of the states will actually change from provisioning, and the whole way through to running.
It's actually managing our instances for us, and at any point in the demo, we can run Terraform plan again to detect drift. I'm actually going to change it from 10 machines to 15 machines, and I'm going to apply again. We can see it's extremely fast when we approve it. It's rolled that out in less than a second, this video is not sped up, I promise. It would look even better if it was. We can actually see that it's going to create five more machines. You can see the age of the machines, the new five, seven, six, four, two and one. You can see it's been done one at a time. There were a lot of plans that took us forward in order to do this.
The bad news. Unfortunately, since I submitted this talk to HashiDays, the project is no longer in place. It was my baby for the first six months of this year. I really loved it and it was everything that I'd done, but I've moved onto a completely different project right now. But all of the code is still open source. You can go and you can actually take all this code.
Lastly, this is the last check, we'll see what machines are actually running at this point, and we have 15 of them, and so on and so forth. Let's kill that. That's boring now.
Open sourcing Triton Service Groups and more design decisions
The repositories are as follows, https://github.com/joyent/triton-service-groups. Then we have one more, we have TSG/CLI. They're extremely simple repositories, they're extremely simple code, and you can go and you can play with them as much as you want.
I wanted to show you one thing inside the scaling, just to show you the limitation of what we had. Just to show you how simple that autoscaling can be, first thing we do is we know how many running instances there should be. I told you tl;dr is you define a metric and the autoscaling group will tell whether you're above and below, and actually act on that.
Second thing we do is get the length of instances that are currently running, and then we can get the expected instances. Then we have a scale count. Of course, we can have a negative number or a positive number at this point, and then, if we have instances to remove, we loop over. Here is where we would inject the code that would spin up multiple goroutines in order to do it, and we get specific errors coming back from it. If we need to scale up, again, all we do is iterate over and we use the API to create new instances, and then we tag the instances with what you actually require, with a name and so on and so forth.
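The scaling logic just described fits in a few lines of Go. The following is a sketch of the idea, not the code from the repository. The `Cloud` interface and the in-memory `fakeCloud` are invented for illustration; a real implementation would call the Triton API (or any other cloud's API) behind the same three methods.

```go
package main

import "fmt"

// Cloud is a hypothetical abstraction over whichever cloud API is used.
type Cloud interface {
	ListRunning(group string) ([]string, error)
	Create(group string) (string, error)
	Destroy(id string) error
}

// Reconcile compares running instances with the expected count and
// creates or destroys instances serially to close the gap. It returns the
// scale count: positive means scale up, negative means scale down.
func Reconcile(c Cloud, group string, expected int) (int, error) {
	if expected < 0 {
		return 0, fmt.Errorf("expected capacity must not be negative, got %d", expected)
	}
	running, err := c.ListRunning(group)
	if err != nil {
		return 0, err
	}
	scale := expected - len(running)
	if scale > 0 {
		for i := 0; i < scale; i++ {
			if _, err := c.Create(group); err != nil {
				return scale, err
			}
		}
	}
	if scale < 0 {
		// Remove the first -scale instances; the choice is arbitrary here.
		for _, id := range running[:-scale] {
			if err := c.Destroy(id); err != nil {
				return scale, err
			}
		}
	}
	return scale, nil
}

// fakeCloud is an in-memory Cloud used only for the demo.
type fakeCloud struct {
	ids  []string
	next int
}

func (f *fakeCloud) ListRunning(group string) ([]string, error) {
	return append([]string(nil), f.ids...), nil // return a copy
}

func (f *fakeCloud) Create(group string) (string, error) {
	f.next++
	id := fmt.Sprintf("%s-%d", group, f.next)
	f.ids = append(f.ids, id)
	return id, nil
}

func (f *fakeCloud) Destroy(id string) error {
	for i, v := range f.ids {
		if v == id {
			f.ids = append(f.ids[:i], f.ids[i+1:]...)
			return nil
		}
	}
	return fmt.Errorf("instance %q not found", id)
}

func main() {
	c := &fakeCloud{}
	for _, want := range []int{3, 5, 2, 2} {
		scale, err := Reconcile(c, "web", want)
		fmt.Println("want", want, "scale", scale, "running", len(c.ids), "err", err)
	}
}
```

Run on a schedule, for example from a periodic Nomad job, this function is the whole control loop: each run is independent, so a failed run is corrected by the next one.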
This is as simple as it gets. We introduced one other area, in the hope that we would use it going forward. In Amazon, two things happen. I talk about Amazon, just because it's the cloud I use the most. I've never really used any of the other major clouds for autoscaling. But when you look in the audit trail, two things happen, and you could hook in.
You've got a machine launch event happen, or a machine destroy event happen. Our company decided that it would be really good to show our users that health checks are actually happening at the predefined time, because you don't potentially know when Amazon is running health checks. If they say it's every 300 seconds, we presume it's every 300 seconds. If it's every 60, so on and so forth. We introduced the idea of a NoOp. We will actually tell you when the health check has run, and it will make sure that it's actually specified in the output, so that you can actually create events based off the back of that.
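A small Go sketch of the event idea follows. The event names and struct are assumptions, not the project's schema. The point is that a health check that changes nothing still produces a record, so the audit trail shows that the check ran.

```go
package main

import (
	"fmt"
	"time"
)

// EventKind names the three outcomes of a health check run.
type EventKind string

const (
	Launch  EventKind = "launch"
	Destroy EventKind = "destroy"
	NoOp    EventKind = "noop"
)

// Event is a hypothetical audit record for one action in one run.
type Event struct {
	Kind  EventKind
	Group string
	At    int64 // Unix seconds of the health check run
}

// EventsFor turns a scale count into audit events: one launch event per
// added instance, one destroy event per removed instance, and a single
// no-op event when the group was already at the expected size.
func EventsFor(group string, scale int, at int64) []Event {
	kind, n := NoOp, 1
	if scale > 0 {
		kind, n = Launch, scale
	} else if scale < 0 {
		kind, n = Destroy, -scale
	}
	events := make([]Event, 0, n)
	for i := 0; i < n; i++ {
		events = append(events, Event{Kind: kind, Group: group, At: at})
	}
	return events
}

func main() {
	now := time.Now().Unix()
	for _, scale := range []int{2, 0, -1} {
		for _, e := range EventsFor("web", scale, now) {
			fmt.Println(e.Kind, e.Group, e.At)
		}
	}
}
```

Anything that consumes these events, such as an email or chat notifier, can then treat a missing no-op as a sign that the periodic check itself has stopped running.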
It's very nice for audit trail history to show that your health check is passing. I understand it's a vanity metric for some people, but some people take solace in the fact that they know their system is healthy, and it's healthy every time the system checks it's healthy, rather than just reacting off the back of it. Of course, if you have an SNS or a notification system that picks up these, you can send events based on email or Slack channel integrations, or whatever you want based on that.
We wanted to keep this extremely simple. We needed to keep this extremely simple. I personally don't know how many people work in auto-scale in all of the other clouds, but when you have three people working in a small cloud, only in autoscaling, we didn't want to rewrite the wheel. Reinvent the wheel excuse me. We needed to leverage systems that were already in place. Luckily, Nomad was the perfect distributed cron job for us, and CockroachDB was extremely simple in order to actually set things up.
The last repository we have, we have what's called TSG infrastructure. This is currently private, and I don't understand why it's private, but I promise you I will turn this on. I will make this public as well, so people can actually go and have a look. This is effectively all of our Packer scripts, this is all of our configuration for Vault, for Consul, and for Nomad. Everything that we do, by default, in fact, sorry, my boss is at the back so let's just make it public without him seeing. It's public now, and people can go and have a look. That wasn't actually meant, it really wasn't, but everything that we do is actually in here. This allows us to spin the whole system up, just using a simple Terraform configuration or a Terraform plan, Terraform apply, and it will actually bring up our entire data center. If you want to do this in Triton, I believe it's something in the region of maybe 14 or 15 machines for the entire system, so it's not too bad. It's really not too bad.
True autoscaling for any cloud
The point of this talk is not that Triton now has autoscaling. The point of this talk is the fact that using and leveraging simple tools, you can apply these exact same principles to your own system. For example, if you use VMware internally, and you don't have the ability to auto-scale, or your ops (I'm sorry for the ops people in here, I'm going to literally say this), if your ops people don't have autoscaling enabled for any of your clouds, you can write this, run this locally from your machine, and still auto-scale against any cloud on the internet.
If I had a CLI tool that would actually integrate with Digital Ocean, I could run exactly the same autoscaling against Digital Ocean, using exactly what I have now. In the Nomad job we have a binary called TSG/CLI, you swap that out for any binary that you write against any cloud that you want, and you can actually automate and actually auto-scale based on that cloud.
Now I know this is dangerous, and I'm not telling you to go and actually run this on your local machine just to bypass things, what I'm trying to show you is that it's extremely simple. It doesn't have to be a ridiculously convoluted way of autoscaling in your system. You have a job to do, your job is to bring X amount of instances up. If your system doesn't support that, you can implement that very easily, using and leveraging existing tools. The only thing that you would have to do if you used our code specifically, is you would have to leverage different data models. Ours are based specifically towards Triton, so we use package IDs and image IDs. If you wanted to write your own system against Amazon, if you wanted to do it as a side project, then you would have to change it to instance type and image IDs. You can do this, and you can do this in a very easy manner.
There's one more thing that I just wanted to show [27:10]. Everything, as I said, is running on my local machine. We have Nomad running inside Vagrant, and as I SSH a few times, this is actually a Nomad client running on my local machine, and this is a Nomad server running on my local machine. We can actually say 'nomad agent-info', and we can actually see that everything is running locally using the 192.168 address.
We didn't want people to actually be able to run this so easily. The reason I'm actually able to run this right now is because I set an environment variable. I did export TSG_DEV_MODE=1. It skips all authentication, not against the public cloud, just against my local machine, and this actually allows me to do this. As we were building this, we were actually using this and testing this out on our local machine, and against the cloud. We actually found that we needed, at some level, to be able to store some account information. What I mean by that is, in the same manner as in Amazon you have organizations, organizations can have users, and multiple users can spin up instances within that organization. We needed, as a second level, to introduce some level of account details.
We don't store anything other than what organization you work for, so in our case, Samsung or Joyent was the top level org, and then the user would be stack72 or it would be Kristoff, or it would be Justin, and that was the only information we stored. We didn't want to have to integrate PKI information in any way, we didn't want to have to store SSH keys as I said earlier, because starting to do that means that you then have to start rotating them. What this is doing under the hood is this will go off to your Joyent account and it will create a specific SSH key required to interact with Joyent, because that's how Joyent interaction is done, via the CLI, so it will do it via SSH-based authentication, and it will spin it up that way.
We wanted to make it very clear what account this was currently running under, and we also wanted to make sure that if you deleted the SSH key, that was it. If you wanted to stop the access to your account, you could, so we wanted that in place.
Anybody have any questions? I have just less than five minutes. None? Everybody just want coffee? Thank you so much for your time. Enjoy the rest of the conference. If you have any questions, please get in contact.