Case Study

Terraform abstractions for safety and power at Segment

Calvin French-Owen shares what he and his team at Segment learned during their journey with HashiCorp Terraform—focusing on how to harness Terraform's power with effective safeguards.

When it comes to managing infrastructure, few tools are more powerful than HashiCorp Terraform. DevOps teams can create, update and destroy thousands of pieces of infrastructure—all with a single command.

But with great power comes great responsibility: wielding that much automation can seem like a daunting, dangerous prospect. You might ask, "Am I willing to trade DevOps speed for added risk?"

But in reality, this is a false tradeoff. By using Terraform creatively, it’s possible to build a workflow that's simultaneously powerful and safe—not to mention far more reliable than making changes manually.

Calvin shares how Segment changed its module abstractions, modified its state management, and added techniques to ensure all infrastructure changes are safely reviewed and applied. He also offers ideas for structuring your own modules and state.

Transcript

Today, I'd like to share with you a few of the things that we've learned over our journey with Terraform, particularly when it comes to harnessing Terraform's power and still working with it safely. So, we've got a bunch to get through today. Let's dive in.

First, I'd like to talk briefly about our journey with Terraform at Segment to set the context. Segment is an analytics API that serves thousands of businesses across the US and worldwide. Currently, we're running about 349 services on top of ECS in production, we hit about 14,000 containers at peak load, we process 90 billion messages per month, and in terms of peak request volume, it's about 100,000 requests per second. As far as our infrastructure is concerned, we're running everything on AWS, and we're using ECS to schedule our containers, so this talk will skew toward that. But most of these lessons apply no matter how you're using Terraform.

In terms of Terraform itself, our journey began about two and a half years ago when we started using Terraform. This was back in the 0.4 days, so it was very leading edge. At that point, the conventional wisdom when something broke was for someone on the team to just say, "Oh, upgrade your Terraform version, it will probably fix it." Since then, stability has gotten a lot better. We now have about 30 developers interacting with Terraform weekly, and I'd say there are 30-50 applies, or changes to infrastructure, that happen every single day. Across all of that Terraform, we're managing tens of thousands of AWS resources: pretty much everything from our networking setup to containers, services, and auto-scaling rules. At this point, pretty much everything we run is managed via Terraform.

And so in this talk in particular, I'd first like to discuss why safety is such a big deal and what makes it so important. Then I'll cover a few Terraform nouns just to make sure we're speaking the same language, then discuss the two key aspects of Terraform safety, safety with your state and safety with your modules, and finally tie up with a few ideas for how to be safe elsewhere. So first up, why is safety such a big deal? Now, if you're like me, you probably originally thought that safety is all about preventing downside. You're preventing outages, preventing security leaks, preventing issues that might arise from some infrastructure change that does something you don't anticipate. But actually there's more to the puzzle, and there's a little more nuance than just that.

I was recently reading a paper that came out maybe four or five months ago, which basically looked at how developers choose software. In particular, the authors were looking at developer tools and asking, "Okay, across all these different axes, whether it's productivity, speed, or ease of use, what makes developers adopt certain tools?" And the result they found was kind of surprising: it turned out that the overriding characteristic that defined developer adoption was whether the tool could cause some sort of unknown risk. In particular, they found that even if your tool makes developers three to five times more productive, or empowers them to do things they couldn't do before, the key thing you have to get right to promote adoption is actually making the tool less risky to use. What follows out of that is an interesting lesson: you can only have adoption if first you have safety.

Now, that's interesting with Terraform, right? Because with Terraform there are a lot of things you can do that seem potentially really scary. You can tear down your entire infrastructure with a single command. Sure, you still have to type "yes," but still. Or maybe you can easily delete a load balancer, or tear down some necessary database instance, or change your network rules so that suddenly things are open to the internet. There's a lot of scary stuff, and so it's up to us, as users of Terraform, to figure out how to use it safely. The question we often get from people is, "Okay, yes, this Terraform thing seems really scary; suddenly I have all this automation around changing my infrastructure." But our answer to that is: it doesn't have to be.

And here are a few of the things we've learned to make Terraform feel safer and therefore gain more adoption within your organization. So to start off, let's cover a few Terraform nouns just to make sure we're speaking the same language here. If you've already used Terraform, this will be a bit of a review, but I'll cover it in the next three minutes and then we'll move on to the interesting stuff. For those of you who are unfamiliar, Terraform is basically infrastructure as code, or infrastructure as configuration. You write some configuration about what you would like your infrastructure to look like, and then you make changes to that configuration, which get propagated to your production setup.

In practice, this looks something like this: on the left we have a bunch of our different configuration, whether that's for instances, networking, security groups, et cetera; we run the CLI command, terraform apply; and it magically applies those changes to our running production infrastructure. So first off, there's the language that Terraform is written in, which is the one I'll be using in most of these examples today: HCL, short for HashiCorp Configuration Language. You don't have to know a lot about it other than that it compiles down to, and is totally compatible with, JSON. It effectively represents all of your resources as key-value pairs.

We can configure that HCL with what are called variable blocks, which are just dynamically configured inputs. As an example, here I have my access variable. First we use the keyword variable and then we give it a name, and then we can add a type to it, default values, and a description for other developers who might use that variable later. Variables are used to configure what are called resources. Resources are just configuration for any cloud entity. This might be an instance, it might be a load balancer, but it can be other things too: it could be a Datadog alert, a dashboard, or some sort of machine image. Really, anything that's an entity, a thing that lives in the cloud, is backed by a resource.
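Here's a minimal sketch of what a variable block like that looks like, written in the 0.11-era syntax this talk uses; the name, default, and description are placeholders rather than the exact variable from the slide:

```hcl
# Hypothetical variable block: a typed input with a default and a
# description that other developers (and terraform-docs) can read.
variable "instance_type" {
  type        = "string"
  default     = "t2.micro"
  description = "EC2 instance type to launch for this service"
}
```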

Resources take inputs as configuration, and then they produce outputs once they're created in your infrastructure. They can interpolate variables to adjust that configuration. As an example, here I have a bastion instance, and there are three key parts that define it. First, there's the resource keyword; then the type, which here is aws_instance; and then the ID, which in this case is bastion. That ID is actually really important because it's what Terraform uses to track this object in its state. I'll get to more of that shortly.

Additionally, you can see that we have these inputs defined here, which govern how this instance is created. You can see that I've passed an AMI and an instance type from various variables, and I've also configured it manually with a few literals, like saying I want this instance to be monitored. All of these govern how our instance will actually look in production. And then finally you can configure outputs. Outputs allow you to take the fields created on a given resource, in this case the instance's public IP, and make them available to be used elsewhere in your Terraform.
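A sketch of what that bastion resource and its output might look like; the variable names here are assumptions, not the exact slide contents:

```hcl
# Hypothetical bastion resource: inputs come from variables and literals,
# and the created public IP is exposed as an output for use elsewhere.
resource "aws_instance" "bastion" {
  ami           = "${var.ami_id}"
  instance_type = "${var.instance_type}"
  monitoring    = true
}

output "bastion_public_ip" {
  value = "${aws_instance.bastion.public_ip}"
}
```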

Now, resources themselves are a little unwieldy to manage, so we combine them into higher-level groups that we call modules. A module is just a reusable collection of resources that can be passed its own set of inputs and produce its own set of outputs. Here's an example from one of our services that's running in production. This is for Maglev, an internal service that we have. The general idea is that you keep the source field as the first field in your module, which says, "Hey, this is the source code that I'm pulling this module from. This is the definition for it."

Then additionally you have all sorts of other inputs down below that you use to actually configure that module to boot up your service. In our case that might be a set of ECS services, a load balancer, et cetera; I'll talk more about how those all fit together later. Finally, once you've got your code, your resources grouped into modules, and your inputs and variables supplied, you run a plan. This creates the diff between what exists in your infrastructure today and what your desired configuration is. Once that plan is generated, you can apply those changes to make them in your production infrastructure.
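A hedged sketch of a module invocation along those lines; the source URL, module name, and inputs are illustrative assumptions rather than Segment's actual code. Running terraform plan against it would show the diff, and terraform apply would make the changes:

```hcl
# Hypothetical module invocation: the source field points at the module's
# definition, and everything below it configures this particular service.
module "maglev" {
  source = "git::https://github.com/example/terracode//modules/service"

  name          = "maglev"
  image         = "maglev:latest"
  cluster       = "${var.cluster}"
  desired_count = 2
}
```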

You might be thinking, "Okay, I've got resources, modules, plans, et cetera, but how does Terraform know which changes to make?" For that it uses what's called the state file, or .tfstate if you're running it locally. You can think of this as a giant JSON file that describes all of your infrastructure. It will list everything from your instance IDs to your EBS volumes to whatever else is configured about your instances through Terraform. And under the hood, Terraform will actually construct a giant dependency graph tracking how those dependencies and resources fit together.

So if you first need to create a network, it will create the subnets, then you can pass those subnet IDs into individual instances and auto-scaling groups, and it will create those next. But as the user, you don't really have to worry about any of this. You just run a plan to generate the diff, then apply to apply the diff. Under the hood, Terraform loads your desired configuration, loads whatever state you have stored, calculates the diff, and then uses the cloud APIs for whichever services you're using to update the current state to match your desired state. And finally it updates the state file. So in short, Terraform applies diffs in your configuration to manage your infrastructure and match what's running to your desired state.

So, part one: let's talk about state. In our experience, state has been kind of the major, how do you say it, foot-gun of dealing with Terraform. People who are just starting with Terraform kind of start playing around with it, and they get to this realization where they say, "Well, Terraform wants to manage everything. It wants to know about everything in my infrastructure and, if there's something there, possibly tear it down. So the main question I have is, how do I keep it from destroying my existing infrastructure?" And our answer to this has been setting up different AWS accounts, which my teammate Gary actually came up with originally.

The general idea here is that, if you're starting on Terraform from a completely blank slate, you can actually set up new AWS accounts for your dev, stage, and prod that are all completely isolated. These don't talk to your current setup whatsoever. There's nothing in them; they're completely greenfield. Then you take your new production account and peer it with what's living in your old production account, and do the same thing with your stage and dev. What this gives you is confidence as you start applying Terraform to these new accounts, knowing that it can't tear down anything in the old account. It just doesn't know about those resources and can't destroy them no matter how hard it tries.

And the beauty of that is that it gives you safety. You feel really good about booting up these new accounts knowing that no matter how you modify the state, it's not going to tear down something that's running in production. So if you're starting with Terraform, I'd recommend booting up that new account, setting up a peering connection, and then just starting to create resources and moving them as you need from one account to the other. In fact, we still do this internally using Terraform Enterprise, where we keep individual states on a per-team basis, and each of those states applies to a certain team and a certain environment. For instance, here we have Cloud Sources Production as well as Cloud Sources Stage.
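For illustration, a peering connection from the new, Terraform-managed VPC to the legacy account's VPC might be declared roughly like this (0.11-era syntax); it assumes an aws_vpc.main resource exists in this configuration, and the peer VPC ID and account number are placeholders:

```hcl
# Hypothetical peering connection from the new account's VPC to the VPC
# in the legacy production account.
resource "aws_vpc_peering_connection" "legacy_prod" {
  vpc_id        = "${aws_vpc.main.id}"
  peer_vpc_id   = "vpc-0123456789abcdef0"
  peer_owner_id = "111111111111"

  tags {
    Name = "new-prod-to-legacy-prod"
  }
}
```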

So if you're using Terraform, it's really important to nail at least basic state management. And there's one golden rule of basic state management: always use remote state, a remote backend. You never want state to live locally. We've been down this road, and we know how it ends. We used to commit all of our state into GitHub, and when someone had to make a change, they'd run a plan and apply it locally, and they would get kind of the giant diff that you see here. As a result, you start ending up with bad things like merge conflicts when two people try to run it at the same time. You start having to merge giant thousand-line diffs. You start ending up with people overwriting each other's changes and maybe not even seeing them. All in all, it's a very bad deal and one that you basically never want to run into.

We found two workarounds for this, and we've used both over time. One: you can use Terraform Enterprise, which is kind of a remote backend that manages all of the state for you. You can also use S3 as a remote backend, or even Consul. Honestly, it doesn't matter a whole lot which one you use, and you can choose whatever is right for you. But no matter what, you're going to want to have this remote state set up. The reason it's so easy to fall into the local-state trap is that initially, when you start developing with Terraform, there are maybe one or two developers sharing all the state, and it's really easy for them to just pass the state back and forth and keep the status quo of committing it all the time. As you scale out to more people, that's no longer the case.
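As a minimal sketch, configuring S3 as the remote backend looks roughly like this; the bucket, key, and region are placeholders, and Terraform Enterprise or Consul would be set up through the same backend block:

```hcl
# Hypothetical remote backend configuration using S3, so state never
# lives on a developer's laptop or in the git repo.
terraform {
  backend "s3" {
    bucket = "example-terraform-state"
    key    = "team-foo/production/terraform.tfstate"
    region = "us-west-2"
  }
}
```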

So, what remote state provider should you use? This honestly depends a lot on your business. If you're looking for the absolute lowest price, you can run S3 or Consul yourself, and if you want really custom configuration around certain workflows you have, those probably work really well for that. For us, internally, we've started switching over to Terraform Enterprise pretty much everywhere, because we like a few of its key features. One is the out-of-the-box dashboard and plan-and-apply flow. The second is that people are no longer applying locally, where if you have two people applying at once, or making changes and then forgetting to commit them, you start running into issues. Basically, we now have one central pipeline that runs and governs our infrastructure management, and for us that's been definitely worth it.

We're also excited to see the new Terraform Enterprise, which just came out today and is backed by APIs, so it should help take care of some of these configuration problems. Of course, there's more to state management than just that. Moving your state remote still doesn't solve the problem of having a giant blast radius, where you can blow away a bunch of your infrastructure. We've moved through two iterations of this. At first, we had individual states per service. In particular, we had this kind of core state that contained our VPCs, our networking, our security groups: basically the things that didn't change that often and that our core infra team would be working on.

And then on top of that we had individual services, each backed by its own state. So your auth service would have its own state, the API has its own state, the CDN has its own state. If you make a change to a given service, it can only affect that service in terms of blast radius. And there's kind of a read-only relationship between this core state and these other services. But as we've evolved a little bit, we realized that having each of those services own its state ends up being a little tricky to manage as well. In particular, we run integrations for about 180, maybe even over 200 now, different analytics tools. Each of them has its own single worker running a customized piece of code, but those workers share 95% of the same code as kind of a common library and a common runtime.

So if we want to make a change across all of them, we have to go and update 195 of these different states. It's not too fun. So instead we've split out into shared states that are grouped by team. The thought here is that a given team is always on top of which changes are being made, so it makes sense to have that team be able to make those changes unilaterally and then plan and apply once, such that you're able to make those changes all in one go without hurting other teams who are working with that code.

Now, as far as how we manage read-only state, there are many ways to skin a cat here. The first is to use the terraform_remote_state resource, which we used fairly early on. The general way this works is that you define a terraform_remote_state resource along with the backend it's associated with, and from there you can pull remote state that you've previously stored as part of some other plan and apply, or some other Terraform that's fronting it.

In our case, we had one common one per environment. So we had one core Terraform state for stage, one core Terraform state for production, and one for dev. And down below you can see how that's actually referenced: we grab this remote state first, which is the very first thing that happens, and then we're able to actually reference its outputs by saying, "Hey, terraform_remote_state, get me this zone ID that exists for this account."
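A sketch of that pattern in 0.11-era syntax; the backend details and the zone_id output name are illustrative assumptions:

```hcl
# Hypothetical read of a shared "core" state, plus one reference to an
# output stored in it.
data "terraform_remote_state" "core" {
  backend = "s3"

  config {
    bucket = "example-terraform-state"
    key    = "core/production/terraform.tfstate"
    region = "us-west-2"
  }
}

# Elsewhere, an output exported by the core state can be interpolated:
#   zone_id = "${data.terraform_remote_state.core.zone_id}"
```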

The second technique, which was introduced after that and became kind of an innovation, is data sources. I'd say if you're doing this today, data sources are a much better way to go. Data sources allow you to pull data from your production AWS or Google Cloud accounts and actually reference values that might have been built outside of Terraform. In our case, we have a pipeline that builds Packer images and then puts them into our AWS accounts; we're able to pull those same AMI IDs and use them within Terraform without having to do all sorts of trickery around managing remote state. If you'd like to use one, it's as simple as taking the data resource and saying, "Hey, I want to reference a particular output of it," and it will automatically be refreshed every time you run a plan.
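For example, a hedged sketch of looking up a Packer-built AMI with a data source; the owner and name filter are placeholders:

```hcl
# Hypothetical data source that finds the most recent Packer-built AMI
# in this account by name.
data "aws_ami" "worker" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["worker-*"]
  }
}

# The AMI ID can then be interpolated wherever it's needed:
#   ami = "${data.aws_ami.worker.id}"
```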

Of course, including these data resources across all the states for each team gets a little tiresome. So we've iterated on that with a third technique, which we call shared outputs. Here we keep a shared infra module, which is essentially just a container module to hang our hat on and keep some variables around. With this, we first reference that module, pass it some configuration, and then we're able to reference outputs that are backed by data sources. That ensures that we don't repeat ourselves: we use those data sources once and configure them there. My teammate Julian actually set up a bunch of these, so you're able to pull whichever of these outputs you actually need.
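A rough sketch of that shared-outputs idea, with hypothetical file paths and names: the container module wraps a data source once and re-exports it, and each team's state references the module instead of repeating the data source:

```hcl
# Hypothetical modules/shared-infra/main.tf: look the value up once and
# re-export it as an output.
data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["production"]
  }
}

output "vpc_id" {
  value = "${data.aws_vpc.main.id}"
}

# A team's state then references the module once and pulls what it needs:
#   module "shared" {
#     source = "git::https://github.com/example/terracode//modules/shared-infra"
#   }
#
#   vpc_id = "${module.shared.vpc_id}"
```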

So, in summary for state safety: use separate AWS or Google Cloud accounts. This ensures that you're not making changes across accounts; you're not accidentally modifying some database in prod when you meant to be modifying the one in stage. We keep a state per environment just for that purpose. You might also want to consider states per service or states per team. At Segment, we found states per team to be a bit more manageable, but if you're really looking to limit the blast radius for particular services, you might want to think about giving them their own state. No matter what you do, if you remember one thing from this talk, please use a remote state manager like Terraform Enterprise or S3, and limit your blast radius where possible. In our case we use read-only state with a combination of data sources and shared outputs, but there are multiple ways to do it, and all of them have slight pros and cons.

So, part two of safety, and the other big aspect of Terraform that's a little hard to wrap your head around at first: modules. We've evolved our modules a lot over time at Segment. One thing that I really appreciate (I've been writing a bunch of Go recently) is that the Go community is actually very prescriptive when it comes to writing Go and what good Go looks like. They have all sorts of blog posts on how to name packages, how to name variables, how you should write Go, and how Go should be packaged. In some ways it's almost kind of annoying, but it actually makes writing Go feel very safe. Unfortunately, Terraform doesn't quite have that yet; Terraform modules still feel like the Wild West in some ways. So I figured, why not share a few ideas about how we write modules at Segment.

Now, as I mentioned before, we have individual per-team repositories, each of which is backed by its own state. And we have this one repo, which we call Terracode, that combines all of our shared modules and variables. It's the central place where primarily our infra team is making modifications, and then individual teams consume those modules. So you'll see at the bottom, they end up sourcing whatever modules they need from this GitHub Terracode repo. If we look inside its modules folder, we'll see that there are actually a lot of modules, depending on what it is you need to boot up. Pretty much any time we boot up something more than once, we'll end up writing a module for it.

We keep modules around for services and workers, but also for things like booting up a Kafka cluster, a Consul cluster, maybe an Aurora or RDS instance, or even our bastions or NGINX. Honestly, in our case we've found it better to write modules so that you can actually start reusing functionality, rather than duplicating resources all over the place, where you then have to start tweaking each one and maybe reorganizing its API. So let's dive into one of them.

Let's take the example of our worker module. Now, a worker at Segment has a very specific meaning. Remember, we're this kind of analytics fan-out and ingestion service. In our case, a worker typically consumes from a queue and then either makes HTTP requests elsewhere or publishes to another queue. The worker first has a README, which is common across all of our modules. The general idea here is that we're creating a high-level document, so if you're a developer who wants to use this module, you don't have to worry about what's going on internally or under the hood. You can just look at the README, figure out the inputs you need, and then look at the outputs. We figured that generating a bunch of these by hand is actually a really annoying thing to do, particularly if you have tens or hundreds of modules.

So my co-worker Amir created a library called terraform-docs, which will actually parse the HCL, look for any variables, and then automatically generate this documentation as Markdown or via CLI output. If we look a little deeper, we separate our inputs and outputs into different files. Here you can see the input.tf, which looks pretty standard: it's just a list of variables along with a description for each of them, which is then pulled by that terraform-docs tool I was just talking about and turned into the README. Outputs are handled similarly. The one wrinkle is that in many cases what we end up doing is creating nicer output variables with interpolations, so that developers on the team don't have to worry about doing all of that parsing and interpolation themselves.

Additionally, we found that for complicated JSON structures, in particular our task definition, which I'll talk about a little later, it's best to output the rendered JSON directly as an output from the template file. While it might not buy you much in terms of what your infrastructure is doing, it does allow you to run that JSON through a parser or a syntax checker to verify that, yes, everything works, and you've set up all of your maybe hundreds of lines of JSON in a way that makes sense.
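A hedged sketch of that idea: render the task definition through a template and expose the rendered JSON as a module output so it can be checked. The template path and variable names are assumptions:

```hcl
# Hypothetical rendering of an ECS task definition template, with the
# rendered JSON exposed so it can be run through a parser or linter.
data "template_file" "task_definition" {
  template = "${file("${path.module}/task-definition.json.tpl")}"

  vars = {
    name  = "${var.name}"
    image = "${var.image}"
    cpu   = "${var.cpu}"
  }
}

output "task_definition_json" {
  value = "${data.template_file.task_definition.rendered}"
}
```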

The other thing we do is specify several levels of API depending on what the client wants. This interpolation here might look kind of crazy, but bear with me, I'll walk through the internals; just know that to the outside developer, it's really easy to use. In particular, this statement is calling coalesce, which takes the first non-empty or non-falsey value. First it looks at the CPU variable: is it passed in? We know that if CPU is passed in, chances are the developer using this module really wanted that particular CPU value, so it's the most fine-grained override. Then we switch to kind of a middle override: the resource allocation variable.

For this we keep low, medium, and high, kind of T-shirt sizes, which my teammate [inaudible] came up with. His idea was that most services fall into one of these three buckets. Instead of having our developers think every time, "Okay, here's the memory I need, here's the CPU I need, and here's how to bin-pack them efficiently," why not give them a few default knobs that they should use? For everything else they can use very custom variables, but we encourage these resource allocations. And then finally, if none of that is passed in, we fall back to just 64 shares of CPU, something sane as a default.
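A sketch of that layered override under stated assumptions (the variable names and size values are placeholders): an explicit cpu wins, then the T-shirt size, then a default of 64 CPU shares:

```hcl
# Hypothetical layered CPU override inside the worker module.
variable "cpu" {
  default = ""
}

variable "resource_allocation" {
  default = ""
}

variable "cpu_sizes" {
  type = "map"

  default = {
    low    = 64
    medium = 256
    high   = 1024
  }
}

locals {
  # First non-empty value wins: explicit cpu, then the t-shirt size,
  # then a sane default of 64 shares.
  cpu = "${coalesce(var.cpu, lookup(var.cpu_sizes, var.resource_allocation, ""), "64")}"
}
```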

Additionally, we give each resource what we call an enabled variable, since you can't directly enable or disable a module. Each resource gets this passed in through the module as a way to specify whether we want something in dev but not prod, or in prod but not stage, or vice versa. I wanted to put this up very quickly to give a sense of what the task definition looks like. This is an ECS task definition, and as you can see, it has a lot of crazy interpolations happening. This is the case where you actually want to render the JSON output coming back from your resources. In our case, Julian and Shaw have done an amazing job of hiding all of this complexity, which you'd ordinarily need to understand as a developer booting a task definition on ECS, and instead reducing it to an image, a number of CPU shares, and a name for your service.
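Going back to that enabled flag, here's a rough sketch of the pattern; the alarm resource and its thresholds are just a hypothetical example of something you might toggle per environment:

```hcl
# Hypothetical "enabled" pattern: a module can't be switched off directly,
# so each resource inside it keys its count off an enabled flag.
variable "enabled" {
  default = true
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  count = "${var.enabled ? 1 : 0}"

  alarm_name          = "${var.name}-high-cpu"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  period              = 60
  evaluation_periods  = 2
  statistic           = "Average"
}
```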

Additionally, we bundle things like auto-scaling rules and IAM roles per service, so you get both of those completely out of the box. The idea is to give an abstraction to developers so that they don't have to do anything extra to build these services and only have to specify the inputs they really care about. So what does this look like in practice? We first source from our shared worker module, the one I was referencing before, and then we pass in the Docker image that we need; in our case this is tied to integrations-consumer 0.4.29. We give it a cluster to run on; in our case we pretty much run everything on a shared cluster that we call Megapool, which is already configured with all the right defaults. But if people want to, they can configure their own clusters; in some cases we have write-heavy workloads for things like writing files to disk or running Kafka. By default, though, everything just gets megapooled.

We have this kind of T-shirt size for resources, whether it's low, medium, or high. We also allow people to tweak the auto-scaling if they so choose. In this case, you can see that this worker depends strictly on the Kafka partition count for this particular topic, so we limit both the min and max to match that count exactly, so we have the same number of consumers as we have Kafka partitions. Finally, we have service-specific configuration down below, which allows us to pass in things like configuration flags or environment variables, basically anything that a developer would normally need. Notice that some of these are actually pulled from other modules, some are put in as literals if they don't change, and some are interpolated as variables depending on the environment.
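Putting that together, a worker invocation along those lines might look roughly like this; the module source, input names, and values are illustrative assumptions rather than Segment's exact configuration:

```hcl
# Hypothetical worker module invocation from a team's state.
module "integrations_consumer" {
  source = "git::https://github.com/example/terracode//modules/worker"

  image               = "integrations-consumer:0.4.29"
  cluster             = "megapool"
  resource_allocation = "medium"

  # Keep the consumer count equal to the Kafka partition count.
  min_count = "${var.kafka_partition_count}"
  max_count = "${var.kafka_partition_count}"

  # Service-specific configuration: flags and environment variables.
  env_vars = {
    KAFKA_TOPIC = "${var.kafka_topic}"
  }
}
```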

As a last step, let's see how we add IAM permissions to this worker. Remember, each worker already has an IAM role associated with it, which means when it runs, it automatically gets a role that it assumes in production. We figured we wanted an easier way to manage these permissions for which APIs this worker should be able to access. Here you can see an example where this module is granting read access to the Segment archives bucket and associating that with the integrations-consumer task role we just talked about. If we look under the hood, this is actually generating a bunch of different policies depending on which permissions this object needs, whether that's just list, read, write, or read and write.

And even further under the hood, what that's doing is generating these AWS-style policies, where it can list the buckets, it can get certain objects under a prefix, and maybe it can only write them or only read them. It basically manages these best practices around policies for you. And what's great about that is, if you're an end developer, you don't need to know about AWS policies, you don't need to know how they work, and you don't need to know about scoping them properly. All you need to know is, "I want to use this bucket policy module that I've got, and I want to specify the bucket and level of access," and you're off to the races.
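As a hedged sketch, the calling side of a module like that could look as follows; the module name, inputs, and the task role output are assumptions, not the exact interface described in the talk:

```hcl
# Hypothetical bucket-access module: grants the worker's task role read
# access to the archives bucket and generates the scoped IAM policy
# under the hood.
module "archives_read_access" {
  source = "git::https://github.com/example/terracode//modules/s3-policy"

  bucket = "segment-archives"
  role   = "${module.integrations_consumer.task_role_name}"
  access = "read"
}
```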

Now, we've actually had amazing success with this, and I'd like to start open-sourcing it some time over the next six months. We've pre-baked a lot of these policies for individual developers so that they don't have to worry about them. We've created default policies for EC2, ECS, KMS; basically every service we use has a set of default policies that give you everything you need out of the box. Additionally, if you're looking for more module ideas, we've open-sourced what we call the Segment Stack, a set of Terraform modules you can use out of the box to deploy a VPC and services on top of ECS auto-scaling groups, a multi-AZ cluster, with nothing more than three lines of configuration.

So, to wrap up safety with modules: definitely use them as logical units of resources. Use simple defaults everywhere to hide complexity from your end user, and make things variables as much as you can to allow users to override them. If you write it more than once, make it a module, or maybe even if you've only written it once, depending on what it is you're building. Share your modules, which should be easier with the Terraform Module Registry. And additionally, you can check out some of the tools we wrote if you're curious to learn more: terraform-docs and the Segment Stack.

So in closing, I'd like to leave you with a few ideas about what safety looks like elsewhere, potentially outside of your running infrastructure. Remember, I talked earlier about how safety is a requirement, but it also enables adoption. And when you have adoption of something like Terraform across all of your services and your whole development team, what does that buy you? What if you have a common substrate for defining infrastructure and cloud services? Well, for us this takes a few different forms, and one of them is alerting. In our case, we use Datadog to manage all of our server monitoring: metrics get piped there and pulled from CloudWatch, we hook that up to PagerDuty and Slack, and it automatically alerts every time something goes wrong with a given service.

We were running into this issue where people would often forget to add alerts for a service. They'd say, "Hey, I realized we have this problem in our infrastructure, and what should have taken three minutes to find the root cause of, if we'd had the proper alerting, ended up wasting three hours because we had to hunt through this rabbit hole of different services to finally find the one that was unhealthy." So in our case we said, "Well, we've already got our services defined in Terraform, why not add alerts there too?" Because remember, Terraform supports not only infrastructure providers like Google Cloud, Amazon, and Azure, it also supports providers like Datadog. So we just hooked up our idea of a monitor, set it up, and suddenly, in a three-line change, you get alerting out of the box. You just include the Slack channel, the PagerDuty on-call rotation, and the dashboard link. And because all of our services use the same service module, they get alerting for free, which is pretty amazing.
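A minimal sketch of what baking a monitor into the shared service module could look like; the metric query, channel names, and handles are placeholders rather than Segment's actual alert definitions:

```hcl
# Hypothetical Datadog monitor created alongside every service, routed to
# the service's Slack channel and PagerDuty rotation.
resource "datadog_monitor" "unhealthy_service" {
  name  = "[${var.name}] service is unhealthy"
  type  = "metric alert"
  query = "avg(last_5m):avg:aws.ecs.cpuutilization{servicename:${var.name}} > 90"

  message = "${var.name} looks unhealthy: ${var.dashboard_url} @slack-${var.slack_channel} @${var.pagerduty_handle}"
}
```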

What about cost management? In our case, we published a blog post, or rather my co-worker [inaudible] did, maybe six months ago, talking about how he managed to save over a million dollars on our AWS bill over the course of a year. As a follow-up to that, we talked about how we actually managed to identify which resources were costing us the most money and, in particular, pair those with particular lines of business. In our case we want to know, "Okay, for 1,000 events that come through the API, what does that cost us in terms of API compute, but also what does it cost us to send to Google Analytics, to put it into our Redshift instance, to put it in DynamoDB," et cetera.

In order to do that, we had this really complicated pipeline that would pull from the AWS billing CSV, which is literally a million-line CSV file containing instance hours (and now instance seconds, I suppose), and it would basically roll these things up. But we still didn't know how to tag particular instances or containers and correlate them to lines of business. And so you might be thinking, "Hmm, well, we've got all of our infrastructure defined in this thing called Terraform, and we want to tie that infrastructure to business lines, so why not add a product area variable and start tagging our infrastructure that way?"

And that's what we did, similar to what was mentioned in the keynote this morning. We basically took this variable, and instead of using a tool like [inaudible], because it didn't exist yet, we just had something in CI that would check the configuration and ensure that the product area was set. As additional flavor for where this can go: you can take things like keys and secrets management, and for every service automatically generate the right policy so that it's able to access whatever your key and secrets management store is. If services at the module level are your new abstraction, you can keep adding stuff under the hood, and no one on the development team needs to worry about it.
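A sketch of how a product area tag could be threaded through, with hypothetical variable and tag names:

```hcl
# Hypothetical product_area input attached as a tag for cost attribution.
variable "product_area" {
  description = "Line of business this service belongs to, used for cost reporting"
}

resource "aws_instance" "worker" {
  ami           = "${var.ami_id}"
  instance_type = "${var.instance_type}"

  tags {
    Name        = "${var.name}"
    ProductArea = "${var.product_area}"
  }
}
```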

So in closing, I'd like to leave you with one last thought. About six months ago, I wrote a blog post whose whole idea was: what is our package manager for the cloud? It seems like we've got all these amazing cloud resources we can boot up, and we've got all this great, mature software: things like Kafka, Consul, and in our case Mongo or Postgres. What we want to be able to do is boot this up in any cloud in only a few minutes, without having to worry about much configuration. And my thought for what that could look like was something like Terraform, where you could potentially keep a module, and because Terraform works across clouds, Terraform could be this uniform abstraction layer that's able to apply to Google Cloud and Amazon and Azure and wherever your software is running, effectively making you cloud agnostic.

I guess I've been upstaged a little bit, because this is in fact what the Terraform Module Registry that was announced today brings you. But I think what's more interesting here is that when Terraform is your substrate, effectively any cloud becomes your cloud. And while that's an interesting idea, there's a whole host of other things that exist just outside of Azure, GCP, and Amazon. In fact, when Terraform is your substrate, any cloud product becomes your cloud product. If we want to swap out Datadog for some other alerting and monitoring tool, we can do that under the hood, and none of our service definitions have to change. If we want to swap out PagerDuty for some other on-call rotation tool, we can do that and not have to change anything. Terraform provides us this uniform abstraction layer that makes managing infrastructure both powerful and safe. Thank you.
