How Instruqt is Powering Arcade-Themed Learning Machines with Terraform
Dec 19, 2018
See how Instruqt, a cloud + DevOps learning platform, is already using Terraform 0.12 for their sandboxed learning environments.
Instruqt is an online learning platform for DevOps and Cloud technologies. They use HashiCorp Terraform heavily for provisioning their sandboxed learning environments on any cloud that the user needs to learn on.
In this talk, the founder of Instruqt, Erik Veld, will illustrate how his company uses Terraform with a high level of expertise in their platform. He'll also look at the newest release of Terraform (Terraform 0.12) and explain how it's already greatly simplifying their codebase.
Developer Advocate, HashiCorp
Erik Veld: My name’s Erik Veld, and I work for a company called Instruqt, and I’ll do a talk on [Terraform](https://www.hashicorp.com/products/terraform "HashiCorp Terraform") with this machine here. Hopefully everything goes well. It’s one big demo, so let’s get started.
Let me choose my presentation. People were saying on Twitter that I should do “Hard” but I need the slides, so I’ll do “Easy.” If you wanna take a look at the rules, you can check out the machine downstairs, and you can actually play Terraform and Vault challenges and learn something along the way.
I work for Instruqt, which is a learning company. We have an online learning platform for DevOps and cloud tooling. You can get hands-on experience with it, and we spin up real environments for users to play in, so every user gets a sandbox where they can play with the tools and then in that environment, we give the user challenges to solve. So you get a sort of puzzle, you have to do something, and then we validate that you’ve done that.
This arcade runs on that platform. It’s just an alternative frontend, which looks a little bit more gamey and arcadey, but it’s the actual same thing. The idea for Instruqt started at HashiConf back in Europe in 2016, so we’re coming full circle now, with the machine being here.
» Playing a Terraform challenge on an arcade machine
To better explain what Instruqt is, let me just show you by going through a challenge. Here we have an assignment. We need to create a plan based on Terraform config in our home folder, and we need to write the plan out to file. So let’s get started.
We have some Terraform files, and we have “database,” so there’s no “state” yet, so let’s do a Terraform init. And if we do a Terraform plan with -out and I hit Enter now, I would have the correct answer. But, just to show you that we actually validate things, let me try it like this. OK. It checked into my environment, tried to see if there is a plan file. There wasn’t. So let’s try it again.
We use Terraform everywhere in our company and platform. We built the entire platform with Terraform, and everything is as code, so we do infrastructure as code. And the reason we do that is because it’s easy to automate, it’s declarative, so if you write the plan files and you apply them, you know what’s going to happen before you actually apply it. There are no runtime dependencies there. And it creates reproducible environments, so every time you apply a plan, it has exactly the same result.
And then if you have the files as code, it means that you can put them in version control. Then, if you make a mistake, you can always go back to an earlier version and apply that, and everything’s fine again. And then if you write your Terraform code as modules, you can share them with other people, or you could use modules that other people wrote from the module registry.
But sometimes features are not yet supported by Terraform providers, so what do you do?
We like to use things when they’re alpha and beta, because that’s the kind of thing you shouldn’t run in production. So how do we get around that? Well, for instance, we use local provisioners. When something is in gcloud alpha or beta, if we can’t use it in Terraform yet, then we can wrap those commands in a local provisioner. Then at least we have it as code in our source files. And then as soon as it becomes available in Terraform, we can replace that with the actual resources.
Or you can create your own provider. For instance, we created a pooled AWS account provider, where we can create accounts beforehand and then grab one, and give that to a user, so they don’t have to wait for that.
But another good option is to supplement the official providers that come with Terraform with the community ones. We use a lot of those, and we’re now able to do 1 apply and have our full environment up and running. One example of a community provider that we use heavily is the Kubernetes provider by sl1pm4t, who I believe is sitting down there, so thank you. We have used your provider heavily. Another is the G Suite provider by DeviaVir, which we use, for instance, to create Google Cloud accounts for users.
To give you an idea of how to use these community providers, let me just show you by doing the challenge. So, we have some code here. If we do a Terraform init right now, we’ll get an error message because we don’t have the G Suite provider, so what can we do to get that? I have downloaded it here already.
We first have to create the directory that it said we should, “plugins.” And we’re on Linux, so we need to create that and then—I have it already on disk—we want to copy it to the directory we just created. So if we then do a Terraform init, everything is fine.
It’s as simple as that to use the community providers. Just download them for the OS you are on, copy them into the directory, and then you can use them. If we then do a Terraform plan, you’ll see that we can now create a G Suite user by using the community provider, and we didn’t have to do anything difficult. So that should be the correct answer.
» An environment defined by users
We use Terraform a lot more than just to create our platform. We actually abuse it inside our platform as well, because we use it to create those sandbox environments and to manage it.
We have an environment that is defined by users, because everybody can create content. And then we apply that configuration at runtime to give the user the environment. So, as I said, anybody can create content. And then we do the actual plumbing underneath it, so you only have to define what you want in the environment and then we’ll make sure that the platform can do what it wants.
So basically it looks like this. We create the platform with Terraform, we create user environments in that with Terraform, and then we just completed a challenge with Terraform, on the platform. So it’s a little bit Terraform inception. We really love Terraform.
To give you an idea of how Terraform does this, let me give you a look behind the scenes by showing the logs of the arcade machine that is standing in the Diplomat Room. So let’s use our CLI, and we want the one for Terraform Arcade, and let’s grab it. People have been playing, which is nice. You see a lot of Terraform code scrolling by. There are a lot of logs in the last 2 hours. Let me scroll up and show you what gets applied.
So if we go to the top of this “apply,” we have Terraform initializing the state, then we create the user environments with the Terraform code that was created by the users, and we spin up, for instance, for the Terraform. It runs on Kubernetes. We create a namespace for the user, we add keys etc. There are “services,” there are “pods.” Basically, you define the container, and then we generate the code to actually do this. Let me show you how we do that.
Since we allow anyone to create content and they can put anything in there, basically it’s a black box to us. We need to provide them the tools to be expressive in creating these environments. So we created a Web SDK for people that are not so comfortable with the command line and the CLI that you just saw me use to tail the logs. And we wanted to make creating these environments more accessible, so we created a simple template that the user would have to fill out, and then with that we would generate Terraform code.
So you basically specify, “I want the container ‘this name, this image, etc.’ or ‘this VM’ or ‘this Google Cloud project’ or ‘this AWS project,’” and you would get it. And you can just specify any Docker image or any VM image, and we would just handle that, because it’s very easy to do with Terraform.
Let me show you what this template looks like. We need to create a track called “Terraforming Postgres,” 2 containers, “Postgres” and “Terraform,” and then we need to expose the ports. Should be simple. So, nothing here. So, let’s use the CLI, and we want to create a track, it should have a title, “Terraforming Postgres.” There we go. So it created a skeleton for us to use, so we don’t have to do everything from scratch, and in there it created 2 files.
So it has the track file, which is basically all the things you visually see in the UI, like the assignments that I’ve been showing you here, the notes, the tabs, the descriptions, etc. And more important for this challenge is the config YAML, which is just a really simple YAML file. It has an entry for containers, there can also be an entry for VMs or projects, and you can just specify the name, the image, shell, memory, ports, etc.
Let’s grab this, and we need to grab Terraform, so let’s call this one “HashiConf Terraform,” and unfortunately it does not have bash. So then we have Postgres 9.6, because we want to do things with that later. Let’s choose a small version, Alpine. That actually does have bash; that’s nice. And we need to specify some ports. There we go. “Default Postgres port.” Should be enough. That’s how easy it is to specify this environment.
» Using Golang templates to generate Terraform code
But that’s not Terraform code, so how do we get there? We use Golang templates to generate Terraform code from that simple config, and it allows us to do things like looping, if statements, etc. And then the resources are defined as Terraform modules, and then we generate a main.tf file that implements those modules and fills it with the details that it got from the config YAML file.
For us, Golang was the obvious go-to choice, because we do everything in Golang. All the backend code, everything is coded in Golang. But you could use the templates for anything. And in Golang, it’s really simple. You basically have a data structure. In this case, you have a name for an event, there’s a little string that defines the template, which then displays the name in there, and then we create a template from that, like parsing the string, and we execute the template with the data that we defined above, and then eventually it will print out “HashiConf 2018 is awesome.”
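The steps just described (a data structure, a template string, parse, execute) can be sketched with Go's standard `text/template` package. This is a minimal illustration, not the code shown on stage; the `Event` type and the `render` helper are names assumed for this example.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// Event is the data structure handed to the template (name assumed for this sketch).
type Event struct {
	Name string
}

// render parses the template string and executes it against the given data.
func render(tmpl string, data interface{}) (string, error) {
	t, err := template.New("event").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render("{{ .Name }} is awesome.\n", Event{Name: "HashiConf 2018"})
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out) // prints "HashiConf 2018 is awesome."
}
```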
And in other languages, it’s usually just as simple.
To show you what we have as templates and how the config YAML fits in, let me just show you how the code is generated. I have the templates on disk here. Let’s take a look at the containers module since we just used that, and let’s take a look at a services template. As you can see here, it’s basically a normal Terraform file. It just has some things that will be replaced by Golang, which is the double curly braces. For instance, if you look at the bottom, there are the ports that we loop over. So it grabs the container that we specified in the config YAML, loops over the ports, and then for each of them creates a “ports” entry in the Terraform code.
Here we have our “Terraforming Postgres,” which is still the same config YAML. So if we would run the generator on this, it should populate that template with this code. Then we can take a look at what it generates. So it created 3 directories: a core thing that does some of the plumbing, and then 1 for each of the containers.
If we look here and take a look at the services that were generated from that template, it basically looks exactly the same as we had before, just all the things that we want to loop over, etc., or we want to have complex conditional statements, we can put those in the templates and then generate this code. That should solve this one.
If you don’t believe that there are checks behind all of these, if you want to see me do it on “Hard” later, I’ll do that in the hallway.
» Separating states
So then we have this code, and it gets applied at runtime. So what do we do with the state? Because we want to separate that state for all the participants, we just don’t want that in 1 big bundle, because that would be horrible. So we separate the state per track per user—we call that a participant—and they all get their own little state file in Google Cloud, because we run on Google Cloud and we use their storage buckets.
And we do this by configuring Google Cloud’s storage backend every time a user does a start. So we basically initialize the backend with a directory in Google Cloud’s storage, then we create the state file in there. Which now sounds really stupid, because there’s the Terraform workspace command, but, when we started doing this, it didn’t exist yet. And for a long time, GCS was not supported as the storage backend. I believe it was only S3 and Atlas at the time. So we had to tool around that.
And since Terraform manages the state for us, for the environment, it creates it and then if we want to get rid of all of the environments, all we have to do is a Terraform Destroy, and it takes care of all of that. So we only have to keep track of when was the environment created. Every track has a TTL based on how long all the challenges take, and then afterwards it automatically gets cleaned up.
Now I’ll show you how easy it is to do now. We used to code around this, create buckets, create directories. But now all you have to do is “Terraform workspace new track-user.” Done. No more maintaining your state yourself, separating it. Terraform handles all of that for you.
This works great mostly, but there are some things you have to keep in mind when you do a lot of parallel Terraform runs, because even though you have defined a remote state, for instance, in Google Cloud there is still state that is stored locally, like which workspace you’re in, which backend is defined, etc.
So if you do a lot of parallel runs and they end up with the same backends, that’s horrible. So we create separate local directories for each of the runs, so they get put into a temp directory, so we make sure all the environments are correctly separated and there’s not some weird state, where people lose a VM or get somebody else’s.
So let me write a simple bash script to show you what we do in Golang code but then in bash. So we have a Terraform file here that defines the backend. And let’s write some bash. So, in bash, let’s grab the participant as a parameter. We need to create a directory, so let’s just use the variable we just took. We need to grab the current directory, because that’s where our Terraform code is. And then we can start doing the operations, right? So first we need the directory to be created where we want the state files and workspace to live. Then let’s move to that directory and we can start doing our Terraform calls.
Let’s create a workspace for that participant, and then we can do a Terraform init in the directory where we were before, and then we can go back. So we’re nice and clean, back in the same state where we were. So if we run this, and we pass in a participant (a track and user combination), you’ll see that it creates the workspace, it creates the state file, and on disk, it stores it in the .terraform directory, and you have the environment file, which specifies the workspace. And then in here we also have the state file, which then correctly specifies that backend.
So what it’ll end up doing is create a separate state file, named after the environment, inside that bucket. So you just end up with a bucket full of .tfstate files. That should solve this one.
This system works really well for creating all the user environments. It’s all automated, the state is handled with Terraform, it’s relatively quick depending on the resources you try to spin up. That’s all really great.
» Pooling instances to speed things up for users
Only we don’t make the environments; users do. And not all the environments are made equal, and neither are the tools in them. Some tools use a lot of resources and take a really long time to start up, and our users don’t want to wait for that. So we had to come up with a way to get around that. We had to either speed up the black-box application that the users spin up, which is almost impossible, or we had to have the instance already running and give them that, so you would lose the startup time. And we wanted to find a way to do it with Terraform.
So we first looked into writing a Terraform provider to do that, but it was a little bit iffy writing a Terraform provider that then basically runs on other Terraform providers to do things. The way we solved it is to create pooled instances of the heavy application servers, and then at runtime, when a user starts a track, we would import that into the state file, if it was available, and then use that in the state.
So basically, we check if there’s an instance available, and if there is, we grab it. Otherwise we first create one and then grab it, and we claim it so no other users can end up with the same VM. Then we import it and we do the apply, and then we fill up the pool again so the next person could get a running VM again.
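The claim logic described above can be modeled in a few lines of Go. This is an in-memory sketch only: the `Pool` and `Instance` types are hypothetical, `create` stands in for applying the pooled VM with Terraform, and flipping `Available` stands in for the import and apply that changes the label. It is not Instruqt's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// Instance is a stand-in for a pooled VM and its "available" label.
type Instance struct {
	Name      string
	Available bool
}

// Pool keeps Target instances available for the next user.
type Pool struct {
	Target    int
	mu        sync.Mutex
	instances []*Instance
	created   int
}

// create stands in for spinning up a new pooled VM.
func (p *Pool) create() *Instance {
	p.created++
	inst := &Instance{Name: fmt.Sprintf("pooled-%d", p.created), Available: true}
	p.instances = append(p.instances, inst)
	return inst
}

// availableCount counts unclaimed instances (caller holds the lock if needed).
func (p *Pool) availableCount() int {
	n := 0
	for _, inst := range p.instances {
		if inst.Available {
			n++
		}
	}
	return n
}

// Claim grabs an available instance (creating one first if none exists),
// marks it as taken so nobody else gets it, then refills the pool.
func (p *Pool) Claim() *Instance {
	p.mu.Lock()
	defer p.mu.Unlock()

	var claimed *Instance
	for _, inst := range p.instances {
		if inst.Available {
			claimed = inst
			break
		}
	}
	if claimed == nil {
		claimed = p.create()
	}
	claimed.Available = false

	for p.availableCount() < p.Target {
		p.create()
	}
	return claimed
}

func main() {
	pool := &Pool{Target: 1}
	fmt.Println("claimed:", pool.Claim().Name)
	fmt.Println("still available:", pool.availableCount())
}
```

The mutex matters for the same reason the talk stresses claiming: two users starting a track at the same moment must never end up with the same VM.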
Let me show you, manually, how this would work. I have 2 directories here, and the track we created before. Let’s go into the pooled one. We have a compute instance, a VM that we want to create, and it has a label that says “available is true.” So, that’s a machine that, when we create it, is available for users to claim. So, let’s do a Terraform init, and then an apply with “auto approve.” This instance will then be created. It has the label that it’s available, so then when users start a track, we can try and search in Google Cloud for the instances that have a label with “available = true.”
So what I’ll do after is I’ll check with gcloud commands to see, “Can we find this instance running?” with the label we set. And then we’ll go to a participant directory, where we’ll simulate the user starting a track, and then we will import this instance that we just spun up and put into the pool and add that to the user.
So if we do, “gcloud” in the project where we are, and we do “compute instances list,” and we want to filter on the labels “available = true,” because we don’t want instances that were already taken by other users. So OK, we found one. It’s the one we just created, “wanted.” So let’s go into the participant directory. This is a user that just clicked “start” on their track, so first we have to create state, and there’s nothing in there yet. So we then want to grab the instance that we just created with Terraform.
So we want the Google compute instance called VM—that was the name of the resource in the file—and then we need to specify where it lives. So it’s the “project,” it’s in “Europe” and it was called “wanted.” So now that VM gets imported into this state, and if we then do a Terraform plan, we will see that it will try to change the label from “true” to “false,” so it’s no longer available.
So let’s apply this. And this takes a lot less time than when we have to create a VM from scratch and then wait for the application to boot. Instead of waiting 10 minutes for an application to start, we can take it and it takes seconds. So that should be correct.
» How we do checks
So we have the user environment, all the things are running in there. For instance, we have the track we created with the Postgres and the Terraform. But how do we do things like check? Challenges consist of metadata like the track YAML file I showed you before and lifecycle hooks.
When the user starts a challenge, we run a setup script. And it prepares the environment for that user so they can complete the challenge. And then when they hit the check button, we execute the check script, wherever you defined that it should run. And we validate that it’s correct. If it’s incorrect, we give back the message to the user. And then eventually if you have the correct answer, we’ll run the cleanup so you can tidy up after yourself.
Let’s add a challenge to that track we created before and see how that works. We have 2 shell scripts, which we can use already. And they will be executed in the Terraform container that we specified before. So you have the setup Terraform script, which runs the setup in the Terraform container, and then the check Terraform script, which runs the check there. So let’s take a look at the setup script. In this case, it’s just a bash script. It creates a file on disk for the provider and it creates the database file.
So it’s basically the challenge that I did at the start. This is how the setup of that was done to put the files in my environment at that time. And then when you hit the check button, we execute this. So it’s sh in this case because the Terraform container doesn’t have bash. And we first check, “Is there a .tfstate file?” which means that you actually did an apply. If you didn’t do an apply, then we can immediately give you feedback on that. And then if you did apply, we check the Terraform state to see if the database called “users” is actually in that state.
Let’s create a challenge. Let’s just call it “apply.” It created a skeleton for us to start working with. And we still have the directories from generator, but let’s go into the apply directory where we have the scripts. Well these are the sample scripts, we don’t need those because we just got the other ones we wanted. So let’s remove these and put in the ones that we want to use. Let’s specify where we want to move them.
OK, let’s see what we can do. If I take a look at the track YAML file, you’ll see that it populated the challenges area there with this challenge, with the name that I supplied, and with tabs and notes, etc. So let’s see if I did a good job.
It found some problems still, because in the templated code, it also created like a scaffold for tabs. But it still refers to the shell that was in the sample scripts that we had. So let’s fix that. So here on the bottom we have the shell, and it tries to go to hostname shell, but we don’t have that. We have Terraform. OK, simple. Now everything’s good. We can push it, and it did get applied. The other one is added to the arcade, but that wasn’t working anymore. So I have to skip that one.
If we now do it with Terraform validate, that’s fine. So now the track got pushed to the platform, and it’s live for users to play immediately. They can now hit the start button and do it.
» What will change with Terraform 0.12 this year
So this is what we do right now, but they just released the Terraform 0.12 alpha. And that changes a lot for us, because we can skip a lot of the stuff that we do now.
If we look at how we generated the code, we have loops, we have conditionals, etc. But Terraform 0.12 supports all of those. So that basically means that all of the generation steps that we do now, we can get rid of those, which is awesome. For instance, if we define our track configuration as complex variables, you can have a list of containers, with name and image, a list of ports, the number. Everything is nice and tight.
This won’t work until later this year, so don’t try to do this yet. If we loop over the resources with Terraform instead of Golang, we can just loop over the modules. So we loop over the containers and then it’ll grab the module that we define. And then for each of the containers in that loop, it’ll create a container in the state file.
And we could then get rid of the main.tf file because everything is basically generic, and we could just create a .tfvars file that inputs that data structure we want. That’s the only thing we then have to generate.
So let’s download and install Terraform 0.12. Unfortunately it’s not out until later this year, but they did release the alpha. So that’s always a good idea. So we have Terraform 0.12 code here. And right now we’re running Terraform 0.11.8. So that’s why the check failed because it’s checking what version it outputs. But I have Terraform here on disk.
OK, we’re running the alpha. Everything is great. Then we need to remove the providers, because you have to rebuild the providers for Terraform 0.12. And we had to build our own Kubernetes one to make the demo work. So in this directory we have Terraform 0.12 code. And as you can see here at the top, we have a nicely specified complex variable. At the bottom, we still have all the old stuff, like how you used to do it, separate variables, etc. Now we have a nice complex one, really simple.
Let’s take a look, for instance, at the services the way we had it before. Instead of having the loop in our template, we can now just have a dynamic block for port. And for each of the ports specified in the container, we can loop over those and create a port like this. No more Golang needed. So if we do a validate now, OK, it’s actually valid. So Terraform 0.12 works. It’s not just in the slides. That’s cool.
So let’s see what happens if we do a plan. And as you can see, in the template or in the main.tf file, we have multiple ports specified. We had port 22 and 80 specified in there. So when we did the plan, it generated multiple ports just by looping over them, as you can see here at the bottom. So that basically means that we can get rid of all the Golang templating just by using Terraform 0.12.
We’ve been using Terraform for a while. And the main takeaway that we’ve had is defining infrastructure as code is the best way possible. It makes it repeatable. You can automate it. You can share it with other people, and it’s version-controlled. In case you mess up, you can always go back to previous versions.
If the official providers don’t support a resource that you want to use, or that you can’t find, then you can always use a community provider or even wrap CLIs that you have with a local provisioner; that works as well. Worst case, you can always code your own provider. It’s not that hard. It’s basically wrapping API calls.
And keep local state in mind if you do a lot of parallel runs, like if you’re going to automate all your Terraform runs. And a lot of users are going to create environments at the same time. Make sure that the local state is also split.
And finally, Terraform 0.12 makes all our lives easier. And it makes defining infrastructure a lot more expressive and fun. So start using these new features that it supports.
So that’s my talk. If you have any questions, you can come talk to me in the hallways. I’ll be around. And if you want to see me do it on hard, I’ll do that. Thank you.