Case Study

Simplify Your Terraform Codebase Using Terraform 0.12 and Terraform Enterprise

Take a look at how Instruqt is using Terraform 0.12 and Terraform Enterprise to provision arbitrary sandbox environments.

Instruqt is an online interactive IT learning platform. Their users can define and create sandbox environments based on real infrastructure and tools. In these sandboxes, they can guide their users through their product using challenges.

At Instruqt, co-founder Adé Mochtar and the rest of his engineers have been long-time Terraform users and fans. They use Terraform extensively, both to provision their core infrastructure and for provisioning the sandbox environments for their users.

In this talk, Adé will show how Instruqt has adopted Terraform and Terraform Enterprise to provision arbitrary sandbox environments. Hear about the challenges they ran into, like state management, parallel execution, and code generation; and then learn how they solved these challenges.

Thanks to Terraform 0.12, they no longer need to write Go templates to inject parameters into their resources; they can just use HCL. And with Terraform Enterprise, they've removed all of the complexity that comes with maintaining and customizing their remote state management.

Transcript

I'm Adé Mochtar. I'm CTO of Instruqt and, as CTOs do, I focus mostly on the technical side. But also I have a keen interest in learning, and especially learning new technology. And that's really what Instruqt is about.

Instruqt is an online IT learning platform. What does that mean? We have the capability to spin up sandboxes with real infrastructure, and that can be anything from containers, virtual machines, Google Cloud projects, AWS accounts, anything that you want or that you need to run your software and teach it to your users.

In these sandboxes we are able to spin up challenges to provide to our users to make them think a little bit about your products and to let them experiment hands-on. While doing it, we validate that what you are doing is correct, and if you get stuck we give you a little bit of a hint to move forward.

A HashiStack sandbox

So how does it look? If you go to instruqt.com/hashicorp you'll land on a page like this where you'll see a couple of "tracks," as we call them, and these tracks on the screen are all about Consul, so you can learn a little bit about Consul Connect, for instance. You can even go to the hub where there are a bunch of our tracks used as demos. And if we click on one of these tracks you can drop into the sandbox environment where we show you running a Consul instance which we have pre-populated with some services and a certain scenario and a certain task we give you to solve.

It's nice to be back here at Westergasfabriek, because back in 2016, we did the first HashiConf EU here in Amsterdam. Xebia, the parent company of Instruqt, was co-organizing, and that's where we got the idea for something like Instruqt.

We created a HashiContest around the four major tools of HashiCorp and we said, "Hey, we need to challenge these users a little bit—give them the tools and let them play around with it." We called it the HashiContest, and one of the cool things was that we got a lot of feedback that it was really cool as a competition, but also as a really effective learning tool. That's where the seeds got planted for Instruqt.

How we built Instruqt

We mainly build on these great technologies. We have Kubernetes, we run Terraform, of course, and we use a lot of Golang. This talk is mainly going to focus on Terraform.

We use Terraform basically everywhere. It not only manages our platform—of course we run on GCP, and if we spin up a lot of infrastructure, we use Terraform for that—but we also manage all the user sandboxes. Every time someone starts our tracks, we spin up a sandbox and we use Terraform for that. And we even have tracks to teach you about Terraform. It's Terraform all the way down.

Digging into the sandboxes

Let's focus a little bit on the user sandboxes, because that's where the interesting part is.

Terraform creates these sandboxes on the fly, and we insert some parameters where needed. For instance, for the Consul Connect track, we insert some configuration about which container to run, so we have a Consul container, we have some environment variables, some ports to expose, stuff like that.

To give you a little bit of an overview on how that looks, we create a Kubernetes namespace, and inside that namespace we can run several pods. We have some configurations, some secrets, and we can also spin up some other infrastructure that's linked to it. We can spin up VMs, we can spin up GCP projects or AWS accounts. And all that is exposed through our proxy to the user that's playing our tracks.

But how does Terraform know what to build? For that we use Terraform modules. We use modules to create an abstraction around the infrastructure that we spin up, and, as I said, we support several different infrastructure components, ranging from containers to virtual machines, to full GCP projects and AWS accounts.

To give you an example of what's required for a container, these are typically the parameters that you want:

  • Name

  • Image

  • Memory

  • List of ports

  • Map of environment variables

We want to give it a name, which is also used as the hostname; which image you want to use, so for example, it can be a Docker Hub image, it can be something from a private registry; how much memory does it need; a list of ports to expose; and also a map of environment variables that are available.

Injecting parameters, pre-0.12

However, up until Terraform 0.11 we relied on Golang templating to inject these parameters. What did that look like? We have a Terraform resource, in this case it's a replication controller. Back when we wrote this thing, there was no Kubernetes deployment resource yet, so we used the replication controller. And we had a bunch of Golang template syntax in there. We basically range over a list of environment variables and inject them as blocks in our Terraform configuration.

Because we can spin up multiple containers, we also had to template out the module. And since we generate the module, we also have to generate a source location, so this is a lot of Terraform generation.
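The slide with the template is not part of this transcript, so here is a rough sketch of what that kind of generation can look like with Go's text/template package. It is not Instruqt's actual code: the `Container` and `EnvVar` types, the `renderContainer` function, and the shape of the generated fragment are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// EnvVar is a hypothetical name/value pair for one environment variable.
type EnvVar struct {
	Name  string
	Value string
}

// Container holds the hypothetical parameters injected into the template.
type Container struct {
	Name  string
	Image string
	Env   []EnvVar
}

// containerTmpl is an illustrative fragment of Terraform configuration,
// not a complete resource. The range action emits one env block per variable.
const containerTmpl = `container {
  name  = "{{ .Name }}"
  image = "{{ .Image }}"
{{- range .Env }}
  env {
    name  = "{{ .Name }}"
    value = "{{ .Value }}"
  }
{{- end }}
}
`

// renderContainer fills the template with the parameters of one container.
func renderContainer(c Container) (string, error) {
	tmpl, err := template.New("container").Parse(containerTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, c); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderContainer(Container{
		Name:  "consul",
		Image: "consul",
		Env:   []EnvVar{{Name: "MODE", Value: "dev"}},
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```

One weakness of this approach is visible even in the sketch: the template inserts values as raw text, so a value containing a double quote would produce invalid configuration, and Terraform tooling cannot check the file until after it has been rendered.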

Terraform 0.12 simplifies things

Luckily there's Terraform 0.12, which can simplify things a lot. Let me show you.

There are 3 features that really help us with this:

  • Complex types

  • Dynamically generated nested blocks

  • Improved conditional operators

The first one is the complex types. Before 0.12, you could use strings, lists, and maps, but they were fairly limited. In 0.12 we get arbitrary complex types, and best of all they are also usable as module inputs and outputs.

What does that look like? We define our variable, in this case for the container, which is an object, and we basically specify all the parameters I just showed you. This is nice, clean syntax—good, readable, a lot better. Before, this used to be 5 different variables, much more difficult to read.

The second one is the ability to generate dynamic nested blocks based on the variable definition that you just saw.

What does it look like? In this case I've upgraded my replication controller to a deployment, but I've also added some dynamic blocks based on the container environment. The good thing is we no longer need to template our Terraform code, which is a big help. It also helps with validating the code: we can now run terraform validate, and it will say sensible things rather than struggling with all the Golang templates, which is a nice benefit.

And lastly, the improvements on the conditional operators, especially the lazy evaluation, are very nice. Previously, the memory was a separate variable, which had a default value. With complex types, it's not possible to have a default value for one of the sub-values, but luckily there's a conditional operator which says, "If this parameter is null we can specify a default value," in this case 128MB.

Less templating needed

We succeeded in removing all the templating from our module. We still have a lot of templating here. The good thing is we no longer need to generate our modules. We can use a pre-defined module as a source, which is really cool, but we still have a lot of other parameters that need to be templated. But luckily there's more to come, eventually, I hope.

Resource and module for_each can simplify things even more. This is not currently part of 0.12, but it will be released soon, I hope. What we can then do is basically get rid of all our templating, switch to a simple for_each statement in our module definition, and move all parameters to either a .tfvars file or pass them in on the command line as JSON or something. And once we've done this, we remove all templating needs from our system.

I'm hoping for it later this summer, but you never know.
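To make the "pass the parameters as JSON" idea concrete, here is a minimal sketch of how a Go backend could write the container parameters as a JSON variables file instead of rendering templates. Terraform loads a file named terraform.tfvars.json automatically. The `containers` variable name, the `Container` struct, and its field names are assumptions for illustration and would have to match whatever object type the module actually declares.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Container mirrors a hypothetical object-typed module variable.
// Memory is a pointer so that "not set" is encoded as JSON null,
// which lets the configuration fall back to its own default.
type Container struct {
	Image  string            `json:"image"`
	Memory *int              `json:"memory"`
	Ports  []int             `json:"ports"`
	Env    map[string]string `json:"env"`
}

// tfvarsJSON encodes the containers under a single top-level variable
// named "containers" (an assumed name), keyed by container name.
func tfvarsJSON(containers map[string]Container) (string, error) {
	b, err := json.Marshal(map[string]map[string]Container{"containers": containers})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := tfvarsJSON(map[string]Container{
		"consul": {
			Image: "consul",
			Ports: []int{8500},
			Env:   map[string]string{"MODE": "dev"},
		},
	})
	if err != nil {
		fmt.Println("encoding failed:", err)
		return
	}
	// Write this string to terraform.tfvars.json in the run's working directory.
	fmt.Println(out)
}
```

Because the values go through a JSON encoder rather than string substitution, quoting and escaping are handled for you, which is the main practical gain over the template approach.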

Simplified workflow with 0.12

Terraform 0.12, for me, is not just about the configuration language. It's also about simplifying workflow, and there are a couple of things I want to highlight because it really makes your life easier.

First of all, upgrading your codebase is very easy, so big props to the team creating the tooling and the documentation, because it really helps you upgrade your configuration.

What I did is I ran terraform 0.12upgrade and it handled most changes. I only had a few things that I needed to update manually, but that was mainly because I didn't read the docs and didn't follow the steps. I should have upgraded my plugins, my providers, before I ran the upgrade. So don't be like me; just follow the steps and probably you won't need to change anything manually.

What's also very useful is the context-rich error messages, and I'm going to give you a couple of examples. The reason this is very helpful is that, when upgrading to 0.12, you're most likely going to use the new complex types for your variables, meaning you will have a lot of variable references that are no longer valid, and Terraform can really help pinpoint those. That really helps in testing out your module and upgrading your source code.

Another challenge we used to have before 0.12 was reading Terraform plan output. These are a couple of examples. Say I have two resources that are being changed. One is a config map in Kubernetes which contains some JSON structure, and the other one is a list of services.

On screen is a fairly small example of a JSON structure, but I've seen very large structures, and you need to copy this out and do a manual compare, which is really a pain in the ass. And also if you have a lot of properties that are being updated, it's not instantly visible what's going to change, so luckily there's also Terraform 0.12 to the rescue.

Especially the JSON diff, as I'm going to show you, is really wonderful because this will help you and save you a lot of time. They structure the JSON and tell you which property of the JSON is going to change and how it's going to change, and this is much more readable. And also for the list of services, it's much clearer which services are getting edited and which ones are going to be removed. Yeah, big save on time.

Helpful collaboration features

Something else I want to talk about is not necessarily related to 0.12, but I think it's a very interesting topic because we run a fairly atypical use case for Terraform. When handling states, this is something that you can do, but I think there are better options.

Let's go back to our platform. We have the platform where we run our core infrastructure, the core services, Kubernetes configuration, or the DNS configuration, all stuff like that, and these changes are very infrequent. But once we change them, these changes can be very high impact. What we want here is good peer review, a good workflow to validate that the changes that I'm making are the right changes, and that the team member can verify that I'm doing the right things.

Currently we store our state in GCS buckets, which sort of suffices, but we don't really get any collaboration features. That's where I think Terraform Cloud and Terraform Enterprise come in because they offer some really good collaboration features for running complex plans on your infrastructure, so you can be more sure that the changes you're going to make are not going to break anything, because that will make your users very sad.

Just a quick example. I took this slide, this picture, off the website. I don't have a running example for you, but the fact that you can have a discussion around Terraform plans much like you would in GitHub pull requests is very helpful. You can double-check that you're not destroying something that you shouldn't, which is really helpful.

User sandboxes and state

This is where things get interesting state-wise. Because every sandbox that we create is basically a Terraform run. And because we create sandboxes for every user that plays one of our tracks, these runs can, and thus will, be executed concurrently, and that creates a whole new set of problems that you don't have if you just have a single plan that you apply.

Our first idea was, "Let's store the state in a GCS bucket, in a subdirectory per sandbox." Easy, right? You just do the terraform init with a backend configuration, you specify a prefix, and we insert one of our sandbox IDs and we're done. So we thought.

We ran this for a while and all looked well while developing, but once we rolled this out to production we got some interesting results. Some users were saying, "Hey, the sandbox that you gave me, it's exactly the same machine as my neighbor got." And we were like, "No, that's not possible. We spin up a sandbox just for you. It's unique. You should be the only one to access it." But still they said, "Look here, I'm touching a file, and the other guy can see it." So that was really weird.

What happened is that the configuration of which backend you use, or which workspace you use, is still local to your Terraform run. So what you need to do to prevent these shared sandboxes, is that every Terraform run must be initialized in a separate subdirectory. Otherwise you will just refer to the same state file. You will overwrite which backend configuration you're using, and you get partial applies where you can share sandboxes.

Our fix is simple. "We already have the unique sandbox ID, so let's make a temporary directory to store the Terraform state in and use that to initialize Terraform."
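A minimal sketch of that fix in Go, assuming the backend shells out to the terraform CLI: each sandbox gets its own temporary working directory, so the backend settings that terraform init records locally cannot be overwritten by a concurrent run. The `newSandboxInit` function, the `sandboxes/<id>` prefix layout, and the bucket name in the example configuration are hypothetical; only the `-input=false` and `-backend-config` options of terraform init are taken as given.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// newSandboxInit creates a private working directory for one sandbox, writes
// the Terraform configuration into it, and returns a `terraform init` command
// bound to that directory. The command is returned unstarted.
func newSandboxInit(sandboxID string, config []byte) (*exec.Cmd, error) {
	// A fresh directory per run keeps the locally stored backend settings
	// of concurrent runs apart.
	dir, err := os.MkdirTemp("", "sandbox-"+sandboxID+"-")
	if err != nil {
		return nil, err
	}
	if err := os.WriteFile(filepath.Join(dir, "main.tf"), config, 0o600); err != nil {
		return nil, err
	}
	cmd := exec.Command("terraform", "init", "-input=false",
		"-backend-config=prefix=sandboxes/"+sandboxID) // hypothetical prefix layout
	cmd.Dir = dir
	return cmd, nil
}

func main() {
	// Placeholder configuration; the bucket name is made up.
	config := []byte("terraform {\n  backend \"gcs\" {\n    bucket = \"example-state-bucket\"\n  }\n}\n")

	cmd, err := newSandboxInit("abc123", config)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(cmd.Dir)

	// A real caller would invoke cmd.Run() here, followed by plan and apply
	// in the same directory.
	fmt.Println("would run", cmd.Args, "in", cmd.Dir)
}
```

The later plan and apply steps for the same sandbox have to run in the same directory, so in practice the directory path would be stored next to the sandbox ID and removed only when the sandbox is torn down.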

When remote state backends fail

Another interesting problem we ran into is that, at high concurrency or under high load, these remote state backends sometimes fail. We get vague error messages like "TLS handshake timeout" or "I/O timeout," and this is really annoying because we rely on these Terraform runs succeeding to spin up the training environments. It's not like, "I did my apply for an infrastructure change and I can do it again if it fails." No, there are people waiting for the sandbox, and we preferably have it up and running in a couple of seconds.

When this fails this is really annoying. We started looking at some alternative backends, mainly 2: the http backend, a slightly lesser-known backend, and the remote backend, as part of Terraform Enterprise or Terraform Cloud.

First, let's look a little bit at the HTTP backend. The cool thing is that you can implement this yourself. The only thing that you need is a REST server that supports a “GET” and a “PUT” and a “POST” and a “DELETE,” and you can nicely integrate it with the rest of your infrastructure. The cool thing is, you have one less runtime dependency in your infrastructure. But it does require some custom work.

Then we have the remote backend. The cool thing about that is that you don't need to customize anything; it just works out of the box. You get a nice API you can talk to, so it's part of Terraform Enterprise. That really is also something that's interesting to use, because you don't need to do anything for it.

But we chose the route of the HTTP backend, and we've just started implementing this. The documentation for this is very limited. It's pretty much limited to: "State will be fetched via 'GET,' updated via 'POST,' and purged with 'DELETE.'" And that's about it. Luckily, with some reverse engineering and some other examples we found on GitHub, we have a working implementation. I'm not going to show it here; if you're interested come find me afterwards. I can walk you through what's necessary.

The cool thing is it's only about 100 lines of code, and that's because we already have a lot of the infrastructure in place. We have our backend system that we can plug this into. We already store a lot of information about the sandboxes. This is just another field in our database with a large JSON blob. That's really cool.
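For readers who want a starting point, here is a minimal sketch of a state server that handles the three documented verbs. It is not Instruqt's implementation: it keeps state in an in-memory map instead of a database field, it leaves out locking and authentication, and the `/state/<sandbox-id>` URL layout and all type and function names are assumptions. Answering a GET for an unknown sandbox with 404 is based on my understanding that Terraform treats that response as "no state yet"; verify this against the Terraform version you run.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// stateStore keeps one state document per sandbox ID in memory.
type stateStore struct {
	mu     sync.Mutex
	states map[string][]byte
}

func newStateStore() *stateStore {
	return &stateStore{states: make(map[string][]byte)}
}

// ServeHTTP treats the URL path after /state/ as the sandbox ID.
func (s *stateStore) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimPrefix(r.URL.Path, "/state/")
	if id == "" || id == r.URL.Path {
		http.NotFound(w, r)
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()

	switch r.Method {
	case http.MethodGet: // fetch the current state
		state, ok := s.states[id]
		if !ok {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(state)
	case http.MethodPost: // store the state sent in the request body
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		s.states[id] = body
	case http.MethodDelete: // purge the state
		delete(s.states, id)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

// do sends one in-process request to the handler and returns status and body.
func do(h http.Handler, method, path, body string) (int, string) {
	req := httptest.NewRequest(method, path, strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	store := newStateStore()
	code, _ := do(store, http.MethodPost, "/state/abc123", `{"version":4}`)
	fmt.Println("POST status:", code)
	code, body := do(store, http.MethodGet, "/state/abc123", "")
	fmt.Println("GET status:", code, "body:", body)
}
```

To serve real traffic, the same handler could be passed to http.ListenAndServe, and the Terraform configuration would point the http backend's address argument at the sandbox-specific URL.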

Lessons learned

In my opinion, Terraform 0.12 is really powerful. It makes the language a lot more powerful and a lot more readable, with far fewer strings to look at and much cleaner syntax, and especially the for-loops and the complex types make it a really powerful language. And even then, it still maintains the declarative side of things, so it's not a full programming language. I think it's the best of both worlds.

Terraform 0.12 also provides a lot better feedback on plans and on errors. Especially if you're going through an upgrade scenario, this feedback is very welcome. You're going to be changing a lot of things, and every second you can save by having a better error message is valuable.

And finally, I'm really a fan of Terraform Cloud's collaboration features. I can't wait to start using them in production for our platform. I got a chance to play around with it a little bit, but I'm really excited about starting to use these things.

Thank you.
