Creating a Terraform Provider for Just About Anything
Dec 19, 2018
Learn how to contribute to a Terraform provider or create your own from this walkthrough.
Terraform is an amazing tool that lets you define your infrastructure as code. Under the hood it's an incredibly powerful state machine that makes API requests and marshals resources.
In this talk, Eddie Zaneski, manager of developer relations at DigitalOcean, will dive into the inner workings of Terraform and examine all the elements of a provider—from the documentation to the test suite. You'll walk away with the knowledge of how to contribute to an existing provider or create your own from scratch.
Zaneski will also take a look at some of the things he encountered while working on the DigitalOcean provider and the lessons learned from the community.
Manager of Developer Relations, DigitalOcean
How's it going, everyone? My name is Eddie Zaneski and I serve the developer community at a company called DigitalOcean. Quick show of hands here, who has used a DigitalOcean tutorial before? That's a lot of people. Despite what many people think, we are not a tutorial company, we are a cloud provider, so we've got all those things that you need to do cool stuff.
Today we're here to talk about Terraform. I'm also joined by a few awesome colleagues from DO, especially Tom right over there. Wave your hand, Tom. We have a huge Enterprise Vault deployment internally, so if you want to talk about roll-your-own Vault deployment, come see Tom at our booth over there.
» What is Terraform?
What is Terraform? Can I get a gauge of the audience, who here has used Terraform? Hands up. That's almost everyone. Who here has contributed to, or written, a Terraform provider? OK, so we don't need to talk too much about what Terraform is. Maybe we should talk about how Elon Musk wants to terraform Mars, but all right.
I like to describe Terraform as a giant state machine. We can talk about "it's infrastructure as code," we can talk about a lot of things, but at the end of the day, Terraform is a great state machine. There's that little terraform.tfstate file that you see, and the nice thing is that any resource that is backed by some form of an API, whether that's a JSON API, gRPC, or an XML API, can be turned into a Terraform provider.
What Terraform does at its heart is really just marshaling resources between a JSON payload and an internal Terraform struct called a resource, which we'll talk about. As awesome as this tool is, its roots are simple and it's not that complicated to pick up. So if Terraform is a state machine, what is a Terraform provider? Like I said, any resource that is backed by an API can be turned into a Terraform provider.
» Examples of Terraform providers
There are something like 90 Terraform providers that are supported. Here's a quick example. You have a provider, and then you have this attribute in there for a token, and then a resource. You've all seen HCL before. You can manage resources, you can manage your servers. You can also manage things like Kubernetes with the Terraform provider for Kubernetes. Instead of writing all of that giant Kubernetes YAML, you can now define your Kubernetes services in HCL, which is a lot better. There's not a ton of repetition, there's not a ton of nesting.
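As a sketch of that provider-plus-resource shape, here's roughly what it looks like with the DigitalOcean provider. The specific values are illustrative, not taken from the talk's slide:

```hcl
# Configure the provider with a token (ideally from a variable,
# not hardcoded in the file).
provider "digitalocean" {
  token = "${var.do_token}"
}

# Then declare a resource for the provider to manage.
resource "digitalocean_droplet" "web" {
  image  = "ubuntu-18-04-x64"
  name   = "web-1"
  region = "nyc3"
  size   = "s-1vcpu-1gb"
}
```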
Has anyone done this yet? Has anyone used HCL with Kubernetes? Have you liked the experience? That's a thumbs-up. It's not only infrastructure- and service-related things, either. There's the GitHub provider that everyone's probably seen, and the cool thing is that GitHub is just an API. We have a provider for it, so we can manage that resource via Terraform state. Can you imagine, on your first day of work, you come in for engineering onboarding, and all you have to do is make a pull request to a single repo to get access to everything you need?
You can define your users in here, you can define your repos. Instead of having to hunt down that one person on your DevOps team who can create GitHub repos, you can just make a PR to your GitHub repos repo, and it's just a giant Terraform file. Then you can pull in all the users that are already in there as data sources, grant access that way, and create teams that way. You can start to see where we're going with this.
Then we can do something like the marvelous Seth Vargo has done, where he created a Terraform provider for Google Calendar. What if we took all of our Google Calendar events and represented them in some kind of Terraform state? He wrote an awesome blog post that y'all should check out. It worked really well.
What other things can we play with? What other types of providers can we make? Who here has heard of the Philips Hue light bulbs? I wrote a Terraform provider for the Philips Hue light bulbs. We use the attributes at the top to find the bridge that it's going to talk to and then we pull in a data resource. Who here has used data resources? Do I need to explain them?
With data resources, you have a resource that is being created and it's mutable; you're messing around with it. A data resource would be something that you don't want to modify but you want to import as read-only. There's an import command where you can import a resource, but then when you run terraform destroy, that resource gets destroyed. A data resource is just a permanent read-only thing.
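To make the distinction concrete, here's a small sketch using the DigitalOcean provider (the names are illustrative):

```hcl
# A managed resource: Terraform creates it, updates it, and deletes
# it when you run terraform destroy.
resource "digitalocean_domain" "example" {
  name = "example.com"
}

# A data source: a read-only lookup of something that already exists.
# terraform destroy never touches the real thing.
data "digitalocean_droplet" "existing" {
  name = "web-1"
}
```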
I pull in the light_id of my kitchen, and then I set the color to some kind of awesome fuchsia, and boom: now we're managing an actual resource in the real world with Terraform. What if we also did something like managing our to-do lists? I created another Terraform provider; this one's on GitHub. I used a to-do list client called Todoist, and it's got an API behind it. What if we implement these things so we can see them in the real world?
I have a quick demo that I'll show you of this Terraform provider. We'll start with that main.tf file, and we'll pull in the provider. Then you pass it an API key, but it's configured through an environment variable so you can't hack me. Then we've got a resource; we'll call this todoist_task, and I'll name it talk. Then we can have some content: "Give talk", boom. We save that: terraform apply. It wants to create a resource, boom. If I pop over to my to-do list, we can see that it's created a task. We can go in there, and we can say, "Give talk right now" and update it, and it will update the content. You get where this is going.
We can do things like mark it as completed. "Completed" equals "true," and we save that, and it's gonna disappear 'cause it was marked as completed. Let's bring it back, and then we can do something like the data resource I was talking about. We'll have todoist_project, so I can categorize these as projects. I'll call this my talks project, and the name of it will be "talks." I'm going to show you how this works, not just show it off.
Then we have a project_id. We'll do some interpolation: data.todoist_project.talks.id. Save that. It's mad about something, of course. Oh, it's 'cause it's already created. So terraform destroy. The API's a little wonky; you can't change a task's project once it's created. Created, boom. If we look down here, "Talks" popped up there.
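Putting the whole demo together, the finished main.tf would look roughly like this. This is reconstructed from the talk, so the attribute names are approximate and may not match the real provider exactly:

```hcl
provider "todoist" {
  # api_key is read from an environment variable, so it never
  # has to appear in this file.
}

# Read-only lookup of an existing project.
data "todoist_project" "talks" {
  name = "talks"
}

# A managed task, filed under that project via interpolation.
resource "todoist_task" "talk" {
  content    = "Give talk"
  completed  = false
  project_id = "${data.todoist_project.talks.id}"
}
```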
We took a resource that already existed, which was our project, pulled it in as a data resource so we could reference that in another resource. That's the power of Terraform. Let's dive into how that works.
» Why would we want to create Terraform providers?
Why would we want to create these Terraform providers? Can I get a show of hands, who here has made a Terraform provider? That's a few of you. How many of you have made one for internal stuff? Just about the same amount of people, a few less.
Some of the things you might want to build a provider for: maybe you're a cloud provider and you want to have a Terraform provider for your customers to use. Maybe your customers have asked for one so they can manage your resources inside HCL and Terraform. Or you have a resource in an API that makes sense to manage with Terraform. One of the nice things you can do with this is skip a frontend entirely. If you're working with HCL in Terraform configuration files, you don't need someone to write that React frontend. You also don't need to create little duct-tape apps that wrap the API you have. Everyone's had to write a little one-off app that they run and it does all the magic.
With Terraform you don't need to do that, because Terraform is acting as your frontend API client. The next reason is internal use. Building internal providers for Terraform is where I really see the power here. We have a Terraform provider called terraform-provider-dointernal. This is a very shallow shim, a fork of our public one, and it adds a few additional fields. Terraform providers are written in Go, and they're simple Go plugins: binaries that get compiled and spin up a little RPC server that talks to the main Terraform process. It's really cool under the hood, and I have some resources at the end so you can read about it.
But the way this actually works is, we have an internal repo that pulls in the public repo, and with Go, you can just add a file with the same package name and compile it, and everything magically gets compiled together. You don't need to do some crazy overwriting; Go will just compile everything with the same package name.
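Here's a tiny stdlib-only sketch of that Go trick. In a real provider the init function would live in a separate internal-only file that shares the package name; the resource names below are made up for illustration:

```go
package main

import "fmt"

// In the public provider's file: the package-level map of resources.
var resourceMap = map[string]string{
	"digitalocean_droplet": "public droplet resource",
}

// In a real setup this init would live in a separate file with the
// same package name; Go compiles both files together, so the internal
// file can register extra resources with no overriding tricks.
func init() {
	resourceMap["digitalocean_droplet_internal"] = "internal-only droplet fields"
}

func main() {
	fmt.Println(len(resourceMap)) // public plus internal
}
```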
We have a few additional fields that are represented, like hypervisor placement, which can influence where our VMs get placed physically across our server racks, or registering into the Chef server for internal management droplets. We can also expose some beta features to internal people, because our teams in DO are using Terraform to manage some of our infrastructure.
I was talking with someone I met last night who works at Target and they have some big beefy F5 load balancers. If anyone's ever worked with F5 load balancers, you know that you define your rules via ASIC firmware. You have to go in, you have to write a bunch of ASIC firmware, compile that and you flash your load balancers so they have your rules.
These guys have built an API that handles all of that, so they can go in, make API requests, and that API will compile their firmware and flash it onto their F5 load balancers. Now he's interested in building a Terraform provider, because then you can add your rules to your HCL and you don't have to have whatever frontend you were using before. You don't have to have your curl request, you don't have to use Postman. Hopefully some of the wheels are starting to turn, and you're thinking of some internal APIs that you can build a provider for.
A couple of people here have contributed to a provider. You might want to contribute so you can fix a bug or add a feature that's missing. Quick plug for Hacktoberfest: it's a thing we're running with GitHub and Twilio. If you make 5 pull requests during the month of October, we'll send you a sweet T-shirt, so check it out.
You might also want to maintain a fork of a provider that exists, so we maintained a fork of the Cloudflare provider, because there were some enterprise features that weren't implemented yet. We have a legacy account, so we have some different older policies that aren't supported via the new API. If there is a provider that you need to make some changes on you can fork it and just manage it yourself.
» With Terraform, everything is convention first
The important thing to understand when it comes to working with Terraform is, everything is convention first. I learned Go through working on Terraform, and lots of people have told me that was a very big mistake, because originally Terraform was written by some Ruby developers. They're definitely great at Go now. Fun fact: the original Terraform code was written inside the DigitalOcean office by Mitchell [Hashimoto], because he was in New York for the summer, it was really hot, and they wanted to come to DigitalOcean's air-conditioned office. Everything was done through convention. This is something that you have to accept. If you're a hardcore Go developer, you're going to see some of the functions, and you're going to see some underscores in there with some snake casing, and you're going to cringe a little. You just have to embrace that convention comes first.
» Where should you start?
Where you need to start is with a Go API client. For that Todoist provider I showed you, I had to build a Go API client. You should look at it. It's definitely not well written, because I don't know Go that well. But this is where you start: you need to have a strong API client, because the important thing is to separate your API's logic from your Terraform provider's logic. If there is any type of weirdness in your API, you want to address that in the API client. You don't want to have to put shims and duct tape in your provider; you want to abstract these things as much as possible. Your client needs to have superb error handling, 'cause we're going to talk about what to do when shit breaks. You should have some kind of logging mechanism so you can quickly identify whether something that's broken is related to Terraform, your API, or your client. Start with a strong API client.
» Skeletons/guides available for getting started
There are skeletons available for starting a provider. The guide down here is the most important resource to read, so check that out for getting started. It's got a nice walkthrough where you build a Hello World provider. The skeleton is simply a function that returns a Terraform resource provider, the Schema we'll get to in a second. The skeleton is very basic, and there's a main file that just registers your plugin, so it's not worth getting into.
What you need to do to get started is read other providers. This is where you're gonna find the most, 'cause outside of that guide there aren't that many resources out there. Look at the DigitalOcean provider, look at the Amazon provider; the Google provider is very well done. All of the top cloud providers are extremely well done, so look at those providers, dig into the different types of resources, and then look through the schemas.
» Schema is the most important thing to understand
Schema is the most important thing to understand here. This comes after that line of the Func declaration: you return a schema.Provider. Everything in here is done through resource maps. You have maps of a property name to a schema declaration. In that Todoist provider, remember I said you could declare the API key through an environment variable. This is what that looks like: you have this tiny little schema, and Terraform immediately knows how to take the variable and assign it where it needs to go. It works magically under the hood. The important thing to note is that all the type enforcement for Terraform is done in the schema, so while you're developing your plugin and your Go functions and your CRUD functions, you need to trust... this is where the Go developers are gonna say, "Ugh!" You need to trust that the data coming in is what you declared it to be in the schema. You're hesitant at first, but at the schema level, Terraform won't slurp up that HCL file successfully if it doesn't match the schema, so you have to trust that everything coming in is mapped to the schema correctly. Or else you're gonna have a lot of garbage code.
The types that are available are booleans, integers, floats, strings, lists, maps, and sets. Sets are where it gets really cool. You've seen that in HCL you can have nested resources, like the attendee on the Google Calendar invite: it had an attendee block, and then an email with my email address. The attendee is a set type, and below it you can have a whole other set of schemas. This is really how you can be expressive with HCL and Terraform. There is something jokingly called the Terraform standard library. This is everything that lives under /terraform/tree/master/helper. There are a ton of packages in here that will make your life better. You're going to work with these nonstop. One to call out is everything in the schema helper: under the schema folder there's a lot in here, and you've got to read through the code. It'll help, believe me, it will very much help. The code is all very well documented.
Read the schema package. A couple of notable objects to read are the Provider, the Resource, and the ResourceData. ResourceData is what gets passed into your functions, which we'll talk about in a second. There are validators that I didn't know existed until down the road, like "is the integer between these values?" There are a lot of these helper functions that are already written, so if you're thinking, "Oh man, I really need a convenience method right here," there's a chance that it exists. Check the helper directory, and the testing library, which we'll talk about later. It's got a bunch of random helpers, like random strings and random SSH keys and stuff. Read through the source.
The next call-out is the meta parameter. I'm going to show you what these CRUD functions look like inside Terraform, and you're going to see something called meta, which can be very confusing coming in. This is where the Go developers are gonna cringe. Because Terraform is an SDK and a library, it has to work with every type of API client out there that you're going to use, so it uses a lot of empty interfaces where you have to type-cast and type-assert things. The purpose of the configure function is to return what they call the meta. In this case, it's my Todoist REST API client. That gets assigned here, and this fires up when you first launch your provider. You're going to see meta a lot; just remember that meta is your client that you define in your configure function.
The very basic bare-bones unit is the resource, and you can see here it's just a map that you have in the top-level schema: you map the resource name to a function that returns a resource schema. You see this pattern a lot. The same thing goes for the data sources and your importers; they all have a data-sources map or an importers map. It's a common practice.
We've got a few slides with a bunch of code on them, so bear with me. This is for that Todoist task resource. This is the actual schema that represents that Todoist task item. It returns the schema, and it's got that content attribute in there that I showed you. It's got a type that's declared, and it's required, so it needs to be there: you need to have content. All of these helpers, all of this functionality, is built right into the SDK that you're working with. Same thing with completed: it's optional, and it defaults to false if it's not there. Below that we have the bread and butter, the CRUD functions: create, read, update, delete. All this is doing is marshaling resources into Terraform state objects. With a CRUD representation, Terraform knows how to manage your entire API.
This is what it looks like for the create. It takes in a ResourceData object. Calling back to earlier, this is a very important one to read: it's just a giant hash map that has a bunch of helper methods on it. You see here we grab the meta out of this create function that gets passed in, cast it to a client, and then we take that resource data and grab out the content. We're not doing type checking, because that's done at the schema level; we type-assert that it's a string, and then we can go on and create our task and check the error. The nice thing is that all these CRUD functions return an error. We'll talk in a little bit about what happens when errors are returned, but you don't need to do anything complex. You just return an error if something's wrong, and Terraform knows how to process that error based on the lifecycle method.
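Since the slide isn't reproduced here, this stdlib-only sketch mimics the shape of that create function. The resourceData and todoistClient types below are toy stand-ins for the SDK's *schema.ResourceData and the real API client, but the meta type assertion and the d.Get pattern are the same:

```go
package main

import "fmt"

// Toy stand-in for the SDK's *schema.ResourceData: a bag of
// attributes with an ID, plus Get/SetId helpers.
type resourceData struct {
	id    string
	attrs map[string]interface{}
}

func (d *resourceData) Get(k string) interface{} { return d.attrs[k] }
func (d *resourceData) SetId(id string)          { d.id = id }

// Toy stand-in for the real Todoist API client.
type todoistClient struct{}

func (c *todoistClient) CreateTask(content string) (string, error) {
	return "task-123", nil // pretend the API responded with this ID
}

// The CRUD shape: meta arrives as an empty interface and gets
// asserted to the client; d.Get's result is asserted to a string
// because the schema already enforced the type.
func resourceTaskCreate(d *resourceData, meta interface{}) error {
	client := meta.(*todoistClient)
	content := d.Get("content").(string)

	id, err := client.CreateTask(content)
	if err != nil {
		return err
	}
	d.SetId(id)
	return nil
}

func main() {
	d := &resourceData{attrs: map[string]interface{}{"content": "Give talk"}}
	if err := resourceTaskCreate(d, &todoistClient{}); err != nil {
		panic(err)
	}
	fmt.Println(d.id) // the ID the fake API handed back
}
```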
The last call-out is at the bottom here: the read function, which we're going to look at next. What you want to do with these is make them as composable and reusable as possible. The read function will take in a meta and your resource data, and it'll return an error, so it's safe to call. Instead of writing a bunch of similar functions with duplicated functionality, you wanna reuse these as much as possible. We're just passing in the same d that we get above, so we can reuse the read functionality.
The read is simple as well. We grab the client, we grab the ID out of the resource data, and we make an API call to get that task. And this is the marshaling: we take that resource data, and we set the content to the task.content. This is it; this is how Terraform works. Sometimes you have a bit more logic in here, but you are just marshaling resources from some kind of API response into this Terraform ResourceData object. It's really easy.
Updating has a bunch of convenience methods in here. We have the HasChange method, which will check the resource and tell you, "Content has changed." Then you can go through and update the content. You can abstract this a little bit, but you're really just working with these CRUD functions.
Same thing with delete. You take in the resource, you grab out the ID, you delete it. If it's successful, you return no error.
» Error handling
Error handling. This is extremely important. Going back to the API client: you need to make sure that your API logic is separated from your Terraform provider logic. If there are Band-Aids you need to put in place, fix your API, fix your client; try not to have this in your provider. For your provider to be successful and easy to use, you need extremely robust error handling, and it needs to be fault tolerant. You need to handle that random Cloudflare 506 that comes back as an HTML page, for some reason, from what should be a JSON response. You don't have to handle that specific case, but you need to have some error method that knows how that gets handled.
You want to make sure you log all the things. There's great logging functionality built into the helpers. You can just set TF_LOG=INFO, or debug, or other log levels, and it will print out a bunch of stuff, so you're going to use this a lot as you're building your providers. Coming back to it: you need to quickly identify what's your API's fault versus what's Terraform's fault versus what's random. If we look at this little stack trace here (this is from the Terraform docs), you're going to see this type of thing a lot: a giant panic stack trace when something goes wrong.
They tell you the key part of this message is the first two lines, so note that you want to identify what is your responsibility as quickly as you can, just like reading any stack trace, and then jump right into the actual resource file and line.
You'll see these stack traces a lot. Quickly identify what's your fault. Partial state is another thing that you'll have to deal with. This is where maybe your resource was created or updated, but only part of it was: the error happened after part of the resource got updated. There's a quick call-out from the docs here: "If the Create callback returns with or without an error and an ID has been set, the resource is assumed created and all state is saved with it."
That is very important. Read through the page at the bottom here; they have a very thorough explanation of how these errors get handled. Just to make sure we understand, look at the very bottom there, at that SetId function. As long as that is successful, and as long as an ID is set, no matter what other errors are happening, Terraform is going to assume that your resource was created successfully. There are a lot of caveats to learn, so read the docs at the bottom there and build in extremely well-done error handling.
» Testing
Testing. This is the biggest pain point that I have run into while working on these providers, and it's not that it's not well done. Think about it: how do you provide a testing framework to test against a real API that's creating real resources in real time? This is where you're going to spend the most time, understanding these test frameworks, and the test framework is well done; it's just that there's a lot here. So walk through it with me real quick. The resource.Test function is how we declare a test. It takes in a resource test case. It has a PreCheck function; this is where you set up your API client and do a bunch of other stuff. You declare your providers, and these are usually defined in one shared test config file. Then you declare a check destroy function, which runs after all your steps.
Terraform will basically run terraform apply on your test case. It will run through all of your test steps to make sure things are done, and then it will make sure that all of your resources are cleaned up: you provide a function to assert that your resources are actually destroyed. The resource test steps here are the bread and butter of the testing. Every test step has a Terraform config. You can see here we want reusability: this is an acceptance test, that's the testAcc prefix, with a config like TaskConfig_basic, and you'll have a lot of different configs. They look like this: you can define a function that takes in the values to template, and you use a format string to print it, but you're just testing a Terraform config in string form.
You don't have to declare your provider; that's all done at the start. You take a basic function like this, you plug it into the config for the test step, and then the framework has a bunch of helper methods that can do things like check a resource attribute, so it knows how to check that the Todoist task's content matches whatever content you expect it to. You're spinning up resources, asserting that they got defined the way you wanted, and then moving on to the next step. You can have many steps for a single config, and you use these steps to simulate things like updates and modifies.
HashiCorp has a TeamCity server, which is JetBrains' CI/CD server, and so all of their providers are continuously tested. One of the requirements to be an official provider is you have to provide them with the actual API key that they can use to test a provider. This is constantly spinning up resources, asserting that everything is working, and that they're assuring the quality of these providers.
Test flow: It starts with that precheck, then it goes through the test steps, then the destroy, then check destroy. You take one config per step, use steps to simulate updates and deletes. Reuse as much as possible while you're writing these early on. It's very easy to get a bunch of spaghetti helpers, a ton of repeated code, so just approach this with the mindset of, "I need to be reusable. I need to be composable." Abstract as much as you can into these tiny functions that you can work with easier.
The Makefile: you definitely want to copy and paste a Makefile from the skeleton or other providers. Fun fact, we didn't realize that you could run a single acceptance test. We thought you had to run the entire test suite for way too long. You can in fact run a single resource test: you do make testacc and pass the TESTARGS variable with the name of the test. It saves you a lot of time, because these test suites can take, I think ours takes 79 minutes to run. They run in parallel, but a lot of the steps are sequential, because you're modifying resources, so the test suite takes a long time. Take advantage of the fact that you can run one test at a time.
» Docs
Docs... It's very easy to get started with these. There is a magic website folder in all the providers, so we'll look at this real quick. We've got the magic website folder here, and this is the index page: you just have a very simple nav bar with links to the rest of the resources. The individual resources are under the docs folder, with d for data sources and r for resources. These are just simple Markdown files that you've seen before and are used to working with. It's just a simple Markdown file with ERB that gets parsed out.
The TeamCity job runs and it takes all of the magic website folders from all the providers, compiles them, and deploys them to the Terraform docs website. They have a really cool CI/CD job that's set up. You don't really have to think about it. If you want to get started contributing to a provider, this is a great place to start, so dive into the docs.
» Process notes
Process notes... I think that HashiCorp mentioned this: you want to engage them early. If you are working on what you want to be an accepted community provider, not an internal one, engage HashiCorp early. They'll give you the resources, they have a program they'll get you involved in, and they have a Slack for people who are working on providers, so you can get access to other folks who are also building providers. Talk to HashiCorp early. They use Travis CI for GitHub testing. Releases are done via Slack, which is pretty cool. You pop into that Slack channel and you're like, "Hey! Can someone release version 1.0 of the DigitalOcean Terraform provider?" And then a nice HashiCorp employee comes along, and they're like, "Yeah, I got you." There's a lot of magic that happens. They did a really good job structuring this program to support the 90+ providers they have.
The strconv package: this is for going from ints to strings and strings to ints. You're going to use it a ton, especially because all the IDs in Terraform need to be strings. Even if your resource has an integer as an ID, it needs to be a string inside Terraform, so you're going to do a ton of converting back and forth, and Go will get mad at you if you don't do it right.
To get around this, you can use custom UnmarshalJSON functions; you can Google it. You can unmarshal into the struct like you're used to, but you declare your own UnmarshalJSON function, and you can have a temporary struct in there that converts the ID from an int into a string, and then it's magic. Use those; you'll save a lot of converting on the fly. Have a solid Go API client, fix design problems with your API if possible, and understand Terraform modules. We get a bunch of GitHub issues where... there are a few caveats with modules, like even though a variable takes a list, you still always need to wrap your lists when you're passing them through a module. Understand how Terraform modules work, because your community and your users are going to take your provider, use it in a module, and then raise issues and bugs about it.
HCL 2.0 has a null type, which is really great: you can set an attribute to null instead of the crazy Go default of 0 or an empty string. Don't be afraid to copy and paste from other providers and from your own provider. To add a new resource right now, I just copy from our droplet, which is our server resource, do a find-and-replace everywhere to change the name, and then worry about the logic. So copy and paste a ton.
Focus on composable and reusable functions. Then there's something called sweepers. This is the "oh shit" thing that runs afterwards to make sure that all of the resources are deleted, so you don't have stragglers hanging around. Some resources: look at other providers, and there are the Terraform docs, guides, and source code. Again, read the source code. There are a couple of videos, both really good, where they walk through building a Terraform provider hands-on, live. One of them is a little out of date, but the concepts are still there; the API changed a little bit, but watch these videos. If you're looking for someone to work on a provider, a consulting company, or you just want to pay someone to get it done, OpenCredo is a shop that we worked with for a little bit. They did some consulting and education for us, and we highly recommend them as a dev shop to work on providers.
With that, I am out of time. Right on the dot. You can find me on the internet @eddiezane. The link for the slides is at the bottom here, so if you want to grab the slides. Other than that, thank you all so much for listening, and hopefully this is helpful.