Breaking Down Barriers by Improving the UX of Terraform: An Under Armour Story
Sep 19, 2017
Kyle Rockman discusses how Under Armour allows non-technical people to use HashiCorp Terraform to spin up new services. Here’s how...
For an introduction to Terraform for non-technical readers, start with Terraform for the Non-Technical by HashiCorp solutions engineer Sean Carolan.
What if you wanted anyone in your organization to be able to use Terraform? Not only the Operations team, or your developers, but even the less technical people.
When Under Armour achieved this, it removed a lot of the friction that might slow engineers from serving customers. Hear from Kyle Rockman how they did it, while still maintaining best practices, such as security.
To make this possible, Under Armour built a templating system, based on Terraform, delivered as a service. By employing a React/Redux UI, REST API, Docker containers, Consul, Vault, S3, and some Python glue, it makes Terraform “something that even a project manager could use,” Rockman says.
Don’t miss the demo, at 21:50.
Tools & Cloud engineer, Under Armour
I wanna start this presentation off with a poll. Raise your hand if you use Terraform at your organization. Okay, keep your hand raised if you let any developer at your organization use Terraform. Now keep your hand raised if you would let anyone at your organization use Terraform. Today, I'm gonna show you how we made Terraform easier to use for everyone, while still ensuring best practices are in place at our organization.
Hi, I'm Kyle Rockman, and I'm on the Under Armour Connected Fitness team. I develop internal Platform as a Service systems for our engineers, to make their jobs easier and more enjoyable every day. I do have a caveat, I'm a human being, so I'm working on limited information just like all of us, so I might be wrong about some stuff. All I ask, is that you think about what I'm going to talk about, and see how that might help you at your organization.
Today, I'm going to show you how we crafted a solution at Under Armour Connected Fitness, that allows anyone at our organization to manage their infrastructure, while still adhering to the best practices set forth by the infrastructure and security teams at Under Armour. The major principle underwriting everything that we do at Under Armour is this. To make all athletes better through passion, design, and the relentless pursuit of innovation. Yes, I work for Under Armour, but I work for a very particular part of Under Armour called Connected Fitness.
» The challenge: Empowering engineers to use infrastructure
We build and maintain some apps that you may or may not be familiar with. Those apps are MapMyFitness, MyFitnessPal, Endomondo, UA Record, and UA Shop. They are mobile and web applications that allow you to do different things, like log workouts, track your sleep, track nutrition, shop for clothes, etc. About three years ago, Under Armour decided that it needed to move into the digital space, so it went into acquisition mode, and it purchased a startup here in Austin called MapMyFitness. About a year later, it then purchased a startup in San Francisco called MyFitnessPal, and then a company in Copenhagen called Endomondo, and smashed them all together and called them Under Armour Connected Fitness.
Over the last three to four years, the infrastructure teams have been coming together, and trying to reconcile all these different technology stacks that we use, and provide a single tooling platform for everyone to build the apps of the future on. But, what does that mean for me as an infrastructure engineer? What are my goals? Well, my goals can be summed up in this sentence. To empower UA engineers to frictionlessly deliver excellent software experiences directly to our consumers. Some of the key points of this philosophy are, make the powerful simple, make teammates more effective, things will break, never let them break the same way twice, iteratively create the platform in the open, so that all developers can contribute to it, and data is sacred. We have a term at Under Armour, protect this house.
When we are breaking down a problem at Under Armour, often we first look to the principles that we want to achieve, for the problem that we are trying to solve. This helps drive our direction for research and experimentation. The problem the infrastructure team is trying to solve stems from the fact that we are a small team, trying to support an ever-expanding engineering organization with ever-expanding needs. That's a pretty big problem space. But today, we're gonna focus specifically on the area of provisioning infrastructure in a cloud platform.
The first principle that we're going to achieve is self service Infrastructure as Code. Most of us should understand what Infrastructure as Code is at this point, but we wanted to take it a step further and make it self service, such that anyone in the engineering organization doesn't need an infrastructure engineer to help them spin something up. That means that we're gonna have to offload a lot of knowledge burden to everyone else in the organization, and some people, they're not gonna swallow that well, they have lots of other priorities. We need to also reduce the learning curve, or the barrier to entry, for everyone in the organization. Often, you do this by putting in place guidelines, or guardrails and tools, processes, stuff like that.
What happens though, is these things often slow down the engineering organization, or even slow down power users. So, we need to make sure that what we build, stays out of the way of power users, in a system that has these guard rails. The last thing that we want to achieve, is to make the right way to do something, the easiest thing to do. This is essentially best practices enforcement through adoption.
» Choosing one stack
What are some of the solutions that we could use to fix this problem? Mainly the self service infrastructure one, for the entire engineering organization. Under Armour, when it purchased all these companies, they were in different clouds. Some were in Rackspace, some were in AWS classic, some had custom private solutions, so we all decided that we wanted to standardize on one cloud solution, and we chose AWS. We could just use the AWS API, that's difficult, right? There's a lot to it, there's a lot you'd have to build around it, but thankfully there are libraries for the API. Also, we were using SaltStack for most of our configuration management of the stuff that was in Rackspace, so we thought, "Hey, why can't we just blend together SaltStack with Boto3, the Python AWS API library, and go to town?"
That was actually the first iteration of this system, and it was definitely hard to get people to adopt it, you had to know a lot about everything that was going on, it wasn't really easy to work with. So, we kept looking. Around this time, Amazon came out with a thing called AWS CloudFormation, oh, pretty cool, right? Configuration as Code, they're handling it for you, sounds great. Maybe it was just the fact that we tried it so early on, but we ran into a lot of problems. We had issues where we couldn't really introspect what was going on, we ran into edge case situations, where we had to contact AWS support, and only they could fix it. That leaves us in a bad position if something really goes wrong, so we kept looking.
We knew that internally, we were using a HashiCorp tool called Packer to do our Vagrant images for our local development environment, so why not just extend that? Why not build AMIs? That brings you into this whole new world of immutable infrastructure, right? It's great, but it slows you down, right? The process to get a change in requires a build, requires a new instance to spin up, and our engineers wanted to move faster than that. Around that time, HashiCorp came out with a new tool called Terraform, looks pretty good, right? Terraform does have a few sticking points, when you wanna scale out the usage to an entire engineering organization, not just a few skilled Terraform engineers. The first one being configuration and state file management. It's a lot of files to deal with, do you put them in a Git repository, do you put them in multiple Git repositories, what about public, stuff like that.
Then, there's the provider ecosystem and the massive API plane that you can use to create these common architecture patterns. Well, how do we deliver that, and make it such that it's easy for anyone to use these common architecture patterns that we wanna create at Under Armour? Well, HashiCorp came out with an answer for that, it was called Modules. It takes inputs and outputs, and you have standardized configuration, but it can be a little difficult to tweak, there's no logic on top of it, so you're kind of limited in what you can do inside of it. It became a little difficult, but it definitely helps with the common architecture pattern problem.
» The distributed engineering problem
Another problem that we have at Under Armour is that we're a globally distributed engineering organization, so we work in multiple time zones. When you're trying to share work in progress changes, or edge-case situations where you're having a problem, that can become difficult. You either have to check the code into a branch, have the other person pull it down, run Terraform locally, or you can put it into a CI process, but that kinda makes it a black box. There's things you can do, they didn't feel good.
» Enforcement of standard usage
The last thing is that there's actually no enforcement of the standardized usage of it. You could be using a different version of Terraform than I am, based on when we downloaded it to our local machine if you're not paying attention. You could fat finger a command, or something like that. These are things that we wanna fix. Without further ado, I'd like to introduce you to the implementation that I believe applies these principles and fixes these problems. Introducing, Estate.
I have a co-worker who's a big linguistics nerd, and he was telling me how I aptly named the tool for Under Armour. He told me that Estate comes from the Latin word for status, or status of an owner, with respect to property, especially one of large extent with a big house on it. If you know anything about Under Armour, we kind of have a house thing going on. First I wanna cover some of the high level features of the tool, and then we'll dive deeper into some of the subcategories.
» Rolling our own code on top of Terraform as a service
It has automatic file management and a grouping scheme, and a templating system that adds logic on top of whatever Terraform you wanna write. Terraform as a service means that everyone is running the same version of the tool, not on their local machine. Then, we can provide a REST API that allows anyone to build any integrations with the system, without needing the infrastructure team, they can just use the API.
A UI means we get that single point of contact for everyone in the organization, and you can share URLs. If someone's having a problem with something, just pop a URL into Slack, click, I can see it, boom. Lastly, we want to get out of the way of power users, so we make it like a command line interface, you can run any arbitrary commands. All of this adds up to delivering business value, faster with less risk.
Let's dive a little deeper into the UI and REST API features. As I said before, we have a way to manage the configuration and state file, and you can view all this data straight from the UI. You can also view the plan and apply output, so that you can share that with someone. Because we have that deep linking capability, you can give them a link directly to the plan output if you're questioning what's going on, or you want help with something. You can give a link to the state file and see data about the state, if they have access to it. APIs allow for those out of band integrations to be written. An example of one is that a team wanted to import their existing Terraform data into this tool that we built, and they were able to do it right through the REST API.
Terraform as a service means we're standardizing the usage, so no more fat fingering commands, automatic usage of the plan output during apply, and we enforce which version of Terraform you have to use, because you're not running it on your local machine. Now let's cover some of the data management aspects of the tool. As I said before, we have this kind of loose grouping that we call namespaces. It's basically the place where you do all your work from. You can group the Terraform configuration however you want, but this is the building block unit, a namespace defines all of these things. That allows us to then layer an auth system on top of the tool, to allow access control to the namespaces, and because we have that auth system, we have the concept of a user, so someone can grab a lock on the namespace, and now anyone else who comes into the tool and can see that namespace can see that person is editing that namespace, and knows, "I need to wait."
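The namespace lock described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not Estate's code: the `Namespace` class, its fields, and the user names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Namespace:
    """Hypothetical stand-in for a namespace record."""
    title: str
    locked_by: Optional[str] = None  # username currently holding the lock

    def grab_lock(self, user: str) -> bool:
        """Take the lock if it is free or already ours; report success."""
        if self.locked_by in (None, user):
            self.locked_by = user
            return True
        return False

    def release_lock(self, user: str) -> bool:
        """Only the current holder may release the lock."""
        if self.locked_by == user:
            self.locked_by = None
            return True
        return False


ns = Namespace("example-namespace")
print(ns.grab_lock("first-user"))   # the first user gets the lock
print(ns.grab_lock("second-user"))  # the second user is told to wait
print(ns.locked_by)                 # anyone can see who holds it
```

The lock here is advisory: it tells other users who is editing, which is the behavior described in the talk, and it makes no claim about how conflicts are enforced in the real tool.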
We implemented automatic state file management. Now, some of you more experienced Terraform users might think, "Why did you do this? Terraform has a solution for this. Terraform remote state, Terraform backends." Well, we actually ran into some issues with this. When the backends change came out, we were implementing it, and we had a very large namespace that we were testing this on that created a lot of state. We decided to put the state into Consul, into the Consul key values, and it actually errored out, because there was a limit to the size that it could put into a key value.
What this all does is it scoops up all this data and stores it in a database. That's really powerful, because that means that you can use your robust database management tools to move this data around, replicate it, encrypt it, protect it, whatever you need to. It might be something that you're more comfortable with. We're really comfortable running Postgres at our organization, but if you're more comfortable with something else like MySQL and you run it well, you can use that too, and if you're in AWS, you can leverage Amazon RDS.
Now I wanna move on to the templating system. This caused a lot of internal strife, because we're like, "Why not just use modules, why build something else on top?" To me, the biggest thing that the templating system gives you is that you get logic. At the time, that was not really possible in the Terraform module system. So, we layered the logic engine Jinja2 on top of your raw Terraform. If you've ever used the Jinja2 library, it's a really robust templating system. One of the things that we also wanted to do at Under Armour is standardize on a DSL. Terraform comes with HCL, which not everyone in our organization may be comfortable learning. We know that HCL can transition down to JSON, and we were using SaltStack before, where most of the configuration is in YAML. YAML can go to JSON, and if Terraform can handle JSON, why not allow you to write all your templates in either HCL, YAML, or JSON, whatever you're comfortable with.
The local development story for this templating system is based around having a way to specify a form, and then a set of Terraform that gets rendered based on the inputs of the form. We implemented JSON Schema, so that you can codify your HTML form. When you're writing a template, you write JSON Schema, you write your template body, with your Jinja2 logic on top, and then that gets put together and gets output as JSON. Inside of the tool, it's all within the UI. You get this really cool WYSIWYG editor experience, where you can flip over to editing, making changes to your JSON Schema or to your Terraform, and then you can flip back to the experience that someone who's instantiating this template in a namespace would have, where you can test what the form is doing to the rendered Terraform, and stuff like that. That makes it really fast and really iterative, and you don't have to constantly commit changes.
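The idea of putting logic in front of Terraform and emitting JSON can be sketched in plain Python. This sketch does not use Jinja2; an ordinary `if` stands in for the template logic, and the function name, the placeholder AMI id, and the optional Elastic IP are assumptions made for illustration. The output follows Terraform's JSON configuration layout (`resource` → type → name → arguments).

```python
import json


def render_service(service_name: str, instance_type: str, with_eip: bool) -> str:
    """Return Terraform JSON for one instance, optionally with an Elastic IP."""
    config = {
        "resource": {
            "aws_instance": {
                service_name: {
                    "ami": "ami-00000000",  # placeholder AMI id, not a real image
                    "instance_type": instance_type,
                }
            }
        }
    }
    # The "logic on top": a resource that only exists when the input asks for it.
    if with_eip:
        config["resource"]["aws_eip"] = {
            service_name: {"instance": "${aws_instance.%s.id}" % service_name}
        }
    return json.dumps(config, indent=2, sort_keys=True)


print(render_service("api", "t2.micro", with_eip=True))
```

Because the result is plain JSON, the same structure could equally be written by a template author in YAML and converted, which is the point made in the talk about supporting HCL, YAML, or JSON.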
As part of making your changes, when you save the template, you get prompted for an automatic semantic versioning bump. You can choose major, minor, patch, and it will automatically take the current version number that the template was on, and bump it by that increment. We automatically enforce version management. When you're instantiating a template in a namespace, the templates are gonna change separately from your usage of it. That means you're gonna get outdated. There's a problem there, we need to be able to see when it's outdated, and the user may not know what changes happened in that template. So, when they update their template instance in their namespace view, they can actually see the diff before they accept the changes into their namespace.
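The version bump on save can be sketched as a small function. This is a minimal illustration, assuming versions are plain `major.minor.patch` strings; `bump_version` is a hypothetical name, not part of the tool.

```python
def bump_version(version: str, part: str) -> str:
    """Bump a 'major.minor.patch' string by the chosen increment."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown increment: {part!r}")


print(bump_version("0.1.1", "minor"))  # prints 0.2.0
```

A minor bump resets the patch number and a major bump resets both lower parts, which is the usual semantic versioning convention.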
Then, all of this data is also stored in the DB, so anything that you've done to it previously, for encryption, or protection, or backup, it all applies to the templating system as well. To me, the system is starting to sound pretty great, but I often worry that too many guard rails may be limiting power users. Let's touch on some of the power user features. Like I said before, we treat it like a command line tool. You can run any arbitrary command right from the UI. There is an extension system, we call it Overrides, that you can do to the templates. If there's Terraform written in the template, and they only expose a few fields, and you wanna say, modify the max ASG size, you can do that with an override, without having to re-update the template or re-revision the template. If you have some special use case, you have a way to override that.
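One way to picture an override is as a small dictionary merged over the rendered template output. The talk does not say how Overrides are implemented, so the recursive merge below is an assumption, and `apply_override` and the sample values are hypothetical.

```python
import copy


def apply_override(rendered: dict, override: dict) -> dict:
    """Return a copy of `rendered` with `override` merged in, key by key."""
    merged = copy.deepcopy(rendered)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested blocks so sibling settings are preserved.
            merged[key] = apply_override(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


rendered = {"resource": {"aws_autoscaling_group": {"web": {"min_size": 1, "max_size": 2}}}}
override = {"resource": {"aws_autoscaling_group": {"web": {"max_size": 6}}}}
result = apply_override(rendered, override)
print(result["resource"]["aws_autoscaling_group"]["web"])
```

The template itself is untouched, so no new template revision is needed to change the max ASG size for one namespace.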
It's just a Django application. If you've ever worked in the Django stack, you know that it's highly extensible, with a huge plugin architecture, you can build on top of it, and add any kind of things that you need to. Now I wanna show you the architecture of how we run this in AWS. We have multiple instances in different availability zones inside of an Auto Scaling group, fronted by an Elastic Load Balancer. These instances can talk to an RDS Postgres instance and a Memcached ElastiCache instance, then there's a Route 53 entry, fronting the ELB, that every user goes to.
Now, I'm gonna flip over to demo the tool for you. Okay, so here it is. I'm gonna log in. We'll flip over to the namespaces, I have a couple namespaces here. A couple things on this is, you can make namespaces only accessible if you're in a certain group, so I can't actually see what's in the HashiConf 2, based on the user that I'm logged into. You can also lock namespaces, you can see which users have that lock. This user is working on this namespace. This is kinda the main view that you work from. This is a namespace, you put all your Terraform for whatever kind of general grouping that you see fit for your organization in here. You can do things like, grab the lock to say I'm working on it, you can view the Terraform files, and you can see it's just an editor with raw Terraform in it, you can do whatever you would do with regular Terraform files.
Here I have something that defines a service. Then we will save these changes. Right from the UI, I can run plan. I don't have to know what the command is, I don't have to know the way our company uses it, I can just say, "Hey, plan this. Run Terraform for me with these files." What's happening underneath the hood is a worker container's getting stood up, the command output is being streamed through the cache system to the UI, and then we also store the plan output in the cache system, so that it gets automatically used during apply. We can see here that we get all the nice ANSI coloring that you would get in the command line as well, but we added an additional layer. The background changes color, so this is blue, it means everything is good. If something bombed in the plan, we'd get a big red background. If someone didn't know how to read the Terraform output, they could say, "There's something going wrong here, it's a red background."
This is gonna create a couple things in AWS. We'll run this, and this is doing the exact same thing. Spinning up a Docker container in the background, laying down the plan output file, handing it to Terraform apply, and streaming the output to the UI. And it will make stuff. We can see, green, great, it must have worked. If we really wanna dig in further, we can read about the apply output and everything that it did. The other thing that happens during apply is the state file gets written down. The system is smart enough to detect that, suck it up, and now we can view it from the UI as well, so we can see the state file. We have that auth system layered on top of this, so only people who are granted access to view this namespace can actually see the state files, so it's protected.
Remember I talked about getting out of the way of power users? We have this experiment feature that accepts arbitrary commands. You can do literally anything from inside of it, but you're safe because you're inside the context of an isolated Docker container. I can see which version of Terraform we're using, and that will do the exact same thing. Spin up the Docker container, take what you wrote down, run it as a bash script, and give you the output. You can do a Terraform graph command, and get that right from here, pump that into Graphviz and see what it just created.
I didn't show this, but you can actually instantiate new things. You have the ability to instantiate regular files, which are just like blank files essentially, and then the entire template list is listed here, so you can choose which templates to instantiate in your namespace. Before we do that, I wanna show you the template editor. Let's flip over to templates, and show you the template editor. It is a very similar UI, similar layout. You have your template body, you can choose which mode you wanna be in, this one's written in HCL. The only thing here is that this one just creates a bunch of variables. Just like the other namespace where we wrote it all down, now we have it in this template, and it can be instantiated multiple times.
Then, to show you more of the form, I'm gonna show you a different template that actually has some JSON Schema. We leveraged JSON Schema, and this is gonna create a form that has two fields, a service name and an instance type. This instance type is actually gonna be restricted to certain values, so that we don't have to worry about the user knowing which things to type in. Then, it has all the Terraform with Jinja2 on top of it, written in HCL, to allow you to plug in these values however you want. Let's flip over to the preview. This is how it would look when you actually instantiate it in your namespace. Now I can play with the template, before I've even made any changes, and it's live rendering. If I type in something, you can see that on the rendered side, it's rendering as I type. So I can see errors happening in real time as I'm making changes. I can use the overrides to make sure that things are doing what they need to be doing, all before you even save anything.
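The plan-then-apply handoff can be sketched with the standard Terraform CLI flags (`terraform plan -out=FILE`, then `terraform apply FILE`). The functions below only build the command lines and pick a background color; they do not run Terraform. The plan file name, the function names, and the exact color rule (blue for a good plan, green for a good apply as seen in the demo, red on failure) are assumptions for illustration.

```python
from typing import List

PLAN_FILE = "plan.tfplan"  # hypothetical file name for the saved plan


def plan_command() -> List[str]:
    """Command that saves the plan so apply can reuse exactly that plan."""
    return ["terraform", "plan", "-input=false", f"-out={PLAN_FILE}"]


def apply_command() -> List[str]:
    """Command that applies the saved plan instead of re-planning."""
    return ["terraform", "apply", "-input=false", PLAN_FILE]


def background_color(action: str, exit_code: int) -> str:
    """Map a finished command to the UI background color."""
    if exit_code != 0:
        return "red"
    return "blue" if action == "plan" else "green"


print(" ".join(plan_command()))
print(" ".join(apply_command()))
print(background_color("plan", 0), background_color("apply", 1))
```

Applying a saved plan file means the changes that run are the ones that were reviewed, which is what removes the chance of a mistyped command between the two steps.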
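A two-field form like the one in the demo could be described with a schema along these lines. The schema keywords (`type`, `required`, `properties`, `enum`) are standard JSON Schema; the allowed instance types are made up, and the tiny `validate` function handles only those few keywords as a stand-in for a real JSON Schema validator.

```python
SCHEMA = {
    "type": "object",
    "required": ["service_name", "instance_type"],
    "properties": {
        "service_name": {"type": "string"},
        # Hypothetical allowed values; a real template would list its own.
        "instance_type": {"type": "string", "enum": ["t2.micro", "t2.small", "m4.large"]},
    },
}


def validate(form: dict, schema: dict) -> list:
    """Return a list of error strings; an empty list means the form is valid."""
    errors = []
    for name in schema.get("required", []):
        if name not in form:
            errors.append(f"{name}: required")
    for name, rules in schema.get("properties", {}).items():
        if name not in form:
            continue
        value = form[name]
        if rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: must be a string")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: must be one of {rules['enum']}")
    return errors


print(validate({"service_name": "api", "instance_type": "t2.micro"}, SCHEMA))
print(validate({"service_name": "api", "instance_type": "x1.huge"}, SCHEMA))
```

Restricting the instance type with `enum` is what lets a form render it as a fixed set of choices, so the user never has to know which values are valid.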
Then, if you make some changes, we'll just change the description, add test to it, when you save you're prompted for your version increment. If we look before, we're on version 0.1.0, or 0.1.1. If we apply a minor increment, we should go up to 0.2.0 automatically. There we are.
Now let's look at a namespace that actually has these templates instantiated in it. Here, we have the two templates, we have the configuration template, the whole rendered output, I can see it, there's no form, just the description of what that template does. Then the other one, had the actual form here, just like the preview view, you can see it here, but now I have this big flashing thing up in the corner. Oh yeah, we made a change to that template. This is outdated. I can click on this, and now I get a diff, of the changes that happened to the template since my version, and I can view it and make sure that these are the things that I wanna accept into my namespace. We will go ahead and do that.
One of the other things that we wanted to implement on the system is the ability to disable resources. Normally, what you have to do is delete that file, and if you're using a Git repository, that means it's not there, so you have to use git log, or look at the history somehow, to figure out that that file existed and what was inside of it. What we wanted to do was make it easy from the UI. I'm going back to the namespace where I had just provisioned some stuff, and I'm gonna go to the service file, and I'm gonna hit disable. What this actually does is it tells the system that's gonna lay down the files to not lay this file down. It makes it so that I can still see what's going on in the file, but I know that it's not being used, because it's been disabled.
Now if I plan, it should say that we're gonna destroy all those resources we just created. Cool. I can still see what was there, without having to dig into Git or whatever, and I can still use the namespace. I can continue, I can do blue green deployments or whatever, right from here. Apply that, make sure that I clean up, be a good cloud citizen. While that's going, I'm gonna show you the documentation for the API. The API, it's a fully Swagger API, you can view the documentation, you can actually play with the API right from the UI, putting in the parameters that you would need to, and all the different resources are documented out. Then if you've ever used a Django application, there's also an administrator view, where you can see the data in the backend. I can go and play with the data, and if I'm an administrator I can make it such that this role is not actually applied. Now I can go into that namespace if I go view the site. Now, this HashiConf 2 doesn't have the owning group, so I can go in and see it if I'm an administrator, see what's going on in it, that kind of thing.
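The disable behavior can be sketched as a filter applied when files are written to the working directory. This is an illustration under assumptions: the record layout (`name`, `content`, `disabled`) and the function name are hypothetical, and a temporary directory stands in for the worker's filesystem.

```python
import tempfile
from pathlib import Path


def lay_down_files(files: list, workdir: Path) -> list:
    """Write every enabled file into workdir and return the names written."""
    written = []
    for entry in files:
        if entry["disabled"]:
            continue  # kept in storage and visible in the UI, but never written
        (workdir / entry["name"]).write_text(entry["content"])
        written.append(entry["name"])
    return written


files = [
    {"name": "network.tf", "content": "# network resources\n", "disabled": False},
    {"name": "service.tf", "content": "# service resources\n", "disabled": True},
]
with tempfile.TemporaryDirectory() as tmp:
    print(lay_down_files(files, Path(tmp)))
```

Since Terraform never sees the disabled file, the next plan treats its resources as removed from the configuration, while the file's content stays readable without digging through version history.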
» The future of Estate
Back to the presentation. After we implemented this internally, we got some pretty good feedback. I just wanna read one of the quotes. "Estate has allowed us to speed up the time to deliver a new service by allowing anyone on the team to provision standardized architectures." You can have engineers who know more about Terraform and what's going on create the templates and things like that, and then someone who's less skilled, or maybe even a product manager, can be like, "I need to spin up a new service, and our services are S3 Buckets with CloudFront." Just instantiate the template, type in the name of the new thing that you're spinning up, plan, apply, done, you have your thing.
It's all rainbows and unicorns, right? Not actually. The tool works really well, but we still have our issues that we run into from time to time, or things that we wanna fix about it. Such as the eternal question of, we have this Cambrian explosion of namespaces, which namespace houses this resource? We've talked about, if you have this all in a Git repository, it would be much easier, right? You could just do a search of all Estate, or the configuration files. We're thinking about implementing a way to dump all the data from the tool into Elasticsearch, and then you could create more complex queries and visualizations on top of your Terraform usage.
Delete is still hard, right? You have to make sure that people understand that deleting the data in the database doesn't actually delete the cloud resources. Now Terraform came out with something, terraform destroy, and we could maybe hook up the tool to automatically run terraform destroy when you delete the namespace. Something I didn't mention before, but when you delete a namespace, it doesn't actually delete the data in the database, it's a soft delete, so you can still get it back, safety.
Provider credentials, they're still in the Terraform, unless you use some sort of other integration. We've kicked around the idea of starting to use Vault, and do a Vault integration so we can automatically provision STS tokens with a specific role for whatever you have, stuff like that. Then, because Terraform isn't 1.0, there's still possible backwards incompatibilities that happen. We could have issues, we had issues when they implemented the backend system. We upgraded to it, then we started playing around with it on some of our namespaces, and then we ran into that size limit on the Consul key value. That's actually what drove us to build the system of automatic state file management.
I do have a caveat, we are hoping to open source this tool in the coming months, I'm still working with Under Armour Corporate to make it happen. So, stay tuned to the GitHub Under Armour Organization, and you may see it there. Thank you.