Closing Keynote: Terraform at Google
Oct 07, 2019
Learn about the workflow Google engineers use for Terraform plan-and-applies, and hear how the company migrated from Terraform 0.11 to 0.12.
Google Cloud Platform has seen a 404% increase in HashiCorp Terraform adoption over the last year (Vault usage was up 500%). Terraform is also one of the tools that Google uses internally to manage infrastructure on-premise and in their cloud.
Watch the closing keynote by Google developer advocate Seth Vargo to understand just how extensive and widespread Terraform usage is at Google. The talk will cover the workflow Google engineers use for Terraform plan-and-applies, and it will also share how the company migrated from Terraform 0.11 to 0.12.
Seth Vargo, Developer Advocate, Google Cloud
Welcome. Thank you all for staying. I realize this is the last talk of the conference. I'm standing between you and alcohol. I also recognize that many of you have had so much technical content—so much architecture content—that the last thing you need is more of that. So for this closing keynote, I want to keep it technical, but I want to keep it at a high level. I want to talk about how infrastructure brings us together.
As Rosemary said, my name is Seth. I work at Google; I'm an engineer there. I work on our developer relations team. As she also said—previously, I worked at HashiCorp. I was an engineer at HashiCorp. Armon might be able to confirm this—I think my official job description was wearer of many very distinctly-colored hats.
I worked at HashiCorp for four and a half years. I did everything from training, education, evangelism, advocacy, lots of engineering—as Rosemary said. You might recognize me from the DevOps vs. SRE series—or from some of the stuff I've done with Vault and secrets management.
I left HashiCorp, the company, about two years ago to join Google. But I didn't leave the HashiCorp community. I still spend time working on pull requests, working with the community, and building content that helps you use these tools in production and development environments.
I like to quote my colleague Kelsey Hightower here, which is, “same job, different T-shirt.” To prove that to you—the top two are from when I worked at HashiCorp, and the bottom two are from when I worked at Google. I still make the same hand gesture every single time.
I truly believe in the power of community and openness, and the open source bit that comes with that community. Honestly, to be selfish for a moment, I built my career off of community. I would not be on this stage if it were not for the open source community, or for getting involved in it back when I worked at CustomInk and Chef Software. Not only have I built my career on it, but I've made some of my closest friends through the open source community.
I've gone to conferences and had conversations with people that feel like family, and I realized, "Wow, this is the first time I've ever met you in real life." That's the power that open source and infrastructure as code has in our community. Tools like Terraform enable us to collaborate in almost a whole new language—and in a new way that fundamentally changes the way we as a community and industry operate.
» Key HashiCorp usage stats within Google
To that end, I could stand up here and give you all of the things that my marketing organization wants me to talk about. But I don't want to do that. I don't want to stand up here and tout amazing metrics. Like we added 64 new resources to our Google Terraform provider. That's a 137% increase in coverage since last HashiConf. I don't want to stand up here and say that we've seen a 404% increase in Terraform adoption. Most amazingly, Vault usage is up almost 500% since last year at this time.
I could stand here and make crazy announcements. Like our Graphite team has worked tirelessly to bring Alpha and EAP support to our Magic Modules Project. Meaning you're going to be able to consume Alpha APIs and EAP APIs on Google Cloud using our Terraform provider. But I don't want to do any of that.
I want instead for you to forget I work for a cloud vendor. I want you to forget for a moment that Google is a cloud provider, a vendor, a sponsor. Google is also a company with thousands of engineers who build and deploy software every day. Sometimes I have to build on-prem. My on-prem is a massive infrastructure. Sometimes I have to build on cloud. That's the Google cloud that all of you are familiar with. You’ll find that Google, the company—the organization—faces the same or very similar challenges that you all face with multi-cloud and hybrid cloud.
Today I want you to forget that Google is a vendor, and I want you to forget that I'm doing the closing keynote. Instead, I want you to listen to me as someone who works at a company that has thousands of engineers, that uses tools like Terraform to do infrastructure as code. And I want to tell you a story.
With that, I'm going to deliver this powerful message: Terraform is one of the tools that Google uses internally to collaborate on, provision, and manage infrastructure on-premise and in our cloud. We use Terraform 0.12, and we use Google Cloud Storage as our storage backend. We'll talk a little bit more about 0.12 later. There's a little asterisk there with a winky face.
» Consuming Terraform at Google
So how does this work? Our users, who are software engineers and site reliability engineers, invoke Terraform directly from their workstation. They run terraform plan, and they run terraform apply right from their local laptop or local workstation. Alternatively, users have the option of automation, which is terraform apply -auto-approve in a loop. Some of you may already recognize this as tfyolo. It is an actual shell alias that I have.
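As a rough illustration, a loop like that could be written as a shell function. This is only a sketch: the name tfyolo comes from the talk, but the TF_BIN, TFYOLO_MAX_RUNS, and TFYOLO_SLEEP variables are hypothetical additions so the loop can be bounded, and nothing here describes Google's actual automation.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a "tfyolo" loop: apply with -auto-approve repeatedly.
# TF_BIN, TFYOLO_MAX_RUNS and TFYOLO_SLEEP are illustrative, not from the talk.

tfyolo() {
  local tf_bin="${TF_BIN:-terraform}"      # binary to run
  local max_runs="${TFYOLO_MAX_RUNS:-0}"   # 0 means loop forever
  local pause="${TFYOLO_SLEEP:-60}"        # seconds to wait between applies
  local runs=0

  while true; do
    # -auto-approve skips the interactive confirmation prompt.
    "$tf_bin" apply -auto-approve "$@" || return $?
    runs=$((runs + 1))
    if [ "$max_runs" -gt 0 ] && [ "$runs" -ge "$max_runs" ]; then
      return 0
    fi
    sleep "$pause"
  done
}
```

Calling `TFYOLO_MAX_RUNS=1 tfyolo` would run a single unattended apply in the current directory. The loop stops on the first failed apply, which is a design choice of this sketch rather than something stated in the talk.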
This is interesting. This is an organization with thousands of software engineers, and I'm telling you that we let them run Terraform from their local laptop. And there’s this automation which is effectively unmonitored. If you push that code into source, it's going to auto-approve.
That brings up an interesting question: if developers can run and execute Terraform from their local laptop, what is the security model? How do we prevent users from doing really bad things, either intentionally or unintentionally?
I'm going to show some examples. First, I'm going to apologize because I'm going to make fun of Terraform here for a little bit. There are certain things Terraform does that let a malicious or inexperienced user damage a system. For example, here I have a null resource where I'm using the local-exec provisioner. I'm using it to write exit into my bashrc. This means anytime someone tries to spawn a Bash session, it will just exit. It's a great time. Ask me how I know. These are the things my coworkers troll me with.
» Provisioners are out
Because we let our users execute Terraform directly, we can't really allow provisioners. Provisioners are out. The way that we do Terraform execution internally, we don't allow our users to consume provisioners. It's important to note that we don't patch Terraform here. We run Terraform open source, but we run Terraform open source in a sandboxed environment.
When Terraform executes most of its things, like providers and provisioners, they're executed as RPC calls over localhost to a plugin. I can't really discuss how our sandbox works. But you could replicate our sandboxing behavior using something like Fargo or AppArmor to block those local RPC connections from executing, thus preventing that provisioner from running. That's how we prevent provisioners from executing.
» Trusted providers
But you might be thinking to yourself, "Well, I don't know. What about something that's not a provisioner? There's this cool local file resource that can do the same thing, Seth. Have you really improved your threat model at all?" And I was like, "Yes, that's a great point."
But I don't want to pick on the file system too much. I'm doing a little bit of picking on the file system. What about this one? This is a fun one. This is a data source that reads your own local state file and then posts it to a random HTTP endpoint. I know that none of you have any secrets in your state file. But if you did, and someone put this in a module on an open-source repo that was otherwise harmless, they could potentially get access to those credentials.
We have a lot of trust in our employees, and they can use third-party modules. Unfortunately, that means that we have to have an allow list of providers, and of resources within those providers. There's only a certain subset of things that we allow execution from, from a security perspective. We do that using the same sandbox.
Many of you may not know this, but Terraform is also incredibly extensible. You can build your own plugins. You can package them, and Terraform can execute them the same way that it executes the Google, Amazon or Azure provider. We use internal providers to talk to our inventory management systems and our asset management systems.
Whenever someone is executing Terraform, we can keep track of what they are doing and why they are doing it in some internal systems. We've written custom providers for that. Those are in. Provisioners are out. Only a subset of providers are allowed, and we have our own custom providers that are obviously enabled in our sandbox.
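To make the allow-list idea concrete, here is a minimal sketch of a static check over configuration files. It is a different technique from the sandbox enforcement described in the talk: a crude grep-based scan, not a real HCL parser. The function names and the allow-list file format (one resource type per line) are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical static check: list resource/data types used in .tf files and
# reject any that are missing from an allow-list file (one type per line).
# This is a grep-based approximation, not the sandbox mechanism from the talk.

used_types() {
  # Prints the TYPE of every `resource "TYPE" "NAME"` or `data "TYPE" "NAME"`
  # block found in the top-level .tf files of directory $1.
  grep -hE '^[[:space:]]*(resource|data)[[:space:]]+"[^"]+"' "$1"/*.tf 2>/dev/null \
    | sed -E 's/^[[:space:]]*(resource|data)[[:space:]]+"([^"]+)".*/\2/' \
    | sort -u
}

check_allowed() {
  local project_dir="$1" allow_file="$2" rc=0 t
  for t in $(used_types "$project_dir"); do
    # -x matches the whole line, -F treats the type name as a fixed string.
    if ! grep -qxF "$t" "$allow_file"; then
      echo "DENIED: $t" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

A configuration that declares `resource "local_file" ...` would be rejected by `check_allowed` unless `local_file` appears in the allow-list file. A check like this only catches what is written directly in the scanned directory, which is one reason runtime enforcement is the stronger control.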
To recap, we run Terraform open source. We do compile it ourselves from source, so we don't download the binaries from releases.hashicorp.com, which is painful for me because I wrote that service. We instead execute the compiled binary that we compile from Terraform open source within the context of our sandbox.
There's another interesting aspect that I think is unique to Google, and that's that we run in a monorepo. How many people run in a monorepo? How many people run in a monorepo and don't work at Google? That's what I thought. One interesting thing about the monorepo is that the open source Terraform source is checked into this mono repository. That's where we build and compile Terraform from source.
» Single version policy
In addition to a monorepo, we have this thing called the single-version policy. This means at any one point in time, we only want one version of a particular piece of software to be available in that monorepo.
No worrying about versions
There are some good things about this. For example, we don't have to worry about versions. There's never any conflict over whether you are using version 0.11 or version 0.12, because there's exactly one version of the Terraform source at any given time.
No version conflicts with the team
We also don't have to worry about conflicting with the team. How many folks have accidentally had someone run with a slightly later version of Terraform? Then Terraform says, "No, this state file was written by a newer version, can't use this." Then everyone is forced to upgrade. We don't have that problem because everyone uses the same version of Terraform, and it's controlled centrally.
No need to pin providers
Additionally, the same way we don't have to control Terraform versions, we don't have to control provider versions. We use exactly one version of the Google provider, and exactly one version of provider X and provider Y, because there's only ever one version at a time.
Delay for new features
Now that I've sold you on a monorepo, let's talk about the less good things. Monorepos tend to have this delay for new features. If we're going to introduce a new feature, we have to make sure it works for everyone, because we can't break anyone. Oftentimes that means there's a delay to introduce new features and new versions.
A bad release breaks everyone
If there's a breaking change, whether it's intentional or unintentional, that bad release is going to affect everyone because everyone is going to be on that new version.
Bad bugs are disastrous
To prove to you what I mean by disastrous, here's a real issue on GitHub in the open source Terraform Google provider. A seemingly harmless change could cause projects to be deleted and recreated. You might not be familiar with GCP. In GCP, a project is like the folder or container that holds all of your resources: all your data, all your infrastructure.
When you delete it, you delete all of that. This was a potentially minor change that could have had catastrophic impacts. Fortunately, we were able to catch this before it rolled out broadly, but this would have affected everyone.
Since upgrades are incredibly risky, how do we test them before making them available to everyone? Now—slight trigger warning—there’s some computer science and some mathematical symbols about to happen. I said this is a non-technical keynote, but there's going to be some math. Brace yourself.
» Proof by assumption
So I have this theorem that I would like to present to all of you, which is that upgrading Terraform from version one to version two is safe. I'm going to attempt to prove this to you through some very hand-waving mathematics.
First, we will start with a proof by assumption. That's not a real thing, for those of you following along at home. Our proof by assumption is that terraform plan is a perfect simulation of the world. We know this not to be true, which is why it's a proof by assumption. But we're going to go with it anyway. Meaning that if the terraform plan output says it's going to do something, the terraform apply will also do that something. That's this proof by assumption that we make.
So given this theorem, and given this proof by assumption (assuming that plan does the right thing, apply will do the right thing), we can represent a Terraform execution as a function. If we imagine that terraform plan is a function that you're calling from your code, the input to that function is its configuration: the set of Terraform files that it's consuming. We can represent this like you see on the screen. It accepts one argument, a configuration.
I warned you all, there are some math symbols. We know from calculus and theoretical computer science that two functions are equal if, for all inputs, they produce the same output. Everyone with me so far? In other words, we know that Terraform one and Terraform two are equal if the Terraform one plan output is equivalent to the Terraform two plan output.
It's important to note that the output doesn't have to be no changes. The output just has to be the same. If Terraform one says, "Yo, I'm going to destroy 6,000 resources," and Terraform two says, "Yo, I'm going to destroy those same 6,000 resources," we consider those equivalent because the behavior has not changed.
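The equivalence described above can be written out in symbols. The notation is introduced here for illustration and is not taken from the slides: $\mathcal{C}$ stands for the set of all Terraform configurations under consideration, and $\mathrm{plan}_{v}$ for the plan function of Terraform version $v$. It relies on the talk's stated assumption that plan faithfully predicts apply.

```latex
\[
  \text{safe}(v_1 \to v_2)
  \iff
  \forall c \in \mathcal{C} :\;
  \mathrm{plan}_{v_1}(c) = \mathrm{plan}_{v_2}(c)
\]
```

Note that the condition compares the two outputs with each other, not with an empty plan, so a plan that destroys resources still satisfies it as long as both versions agree.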
So, when we're asking ourselves, “How do we know if it's safe to upgrade Terraform in our mono repository at Google?” we have to ask ourselves, “Does the upgrade from version one to version two produce the same plan output for all of our Terraform projects in our mono repository?” For those of you that don't speak theoretical computer science, I have translated it to Bash.
We have this function called safe_to_upgrade, which finds all of the projects in our monorepo, which is represented by ... because it is complex. Suffice it to say that we have a way to identify all of the Terraform projects in our mono repository. Then we very simply iterate over all of them. We run Terraform version one plan on those configurations and Terraform version two plan on those configurations. If that output is the same, we succeed. If that output is different, we either produce a warning or a failure depending on what type of upgrade we're doing.
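The slide itself is not reproduced in this transcript, so the following is only a sketch of what a function like that might look like, not the code shown in the talk. The TF_V1 and TF_V2 variables, the find_projects helper, and the rule that a project is any directory containing .tf files are all assumptions standing in for the elided monorepo lookup.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of safe_to_upgrade: compare plan output from two
# Terraform binaries across every project directory under a root.
# TF_V1, TF_V2 and find_projects are illustrative names, not from the talk.

find_projects() {
  # Stand-in for the elided lookup: every directory that holds .tf files.
  find "$1" -type f -name '*.tf' -exec dirname {} \; | sort -u
}

safe_to_upgrade() {
  local root="$1"
  local tf_v1="${TF_V1:-terraform-v1}"   # should be on PATH or an absolute path
  local tf_v2="${TF_V2:-terraform-v2}"
  local rc=0 project out1 out2

  while IFS= read -r project; do
    [ -n "$project" ] || continue
    out1="$(cd "$project" && "$tf_v1" plan -no-color 2>&1)"
    out2="$(cd "$project" && "$tf_v2" plan -no-color 2>&1)"
    # Equal output means equal behavior under the "plan is perfect" assumption.
    if [ "$out1" != "$out2" ]; then
      echo "DIFFERENT: $project" >&2
      rc=1
    fi
  done <<EOF
$(find_projects "$root")
EOF

  return "$rc"
}
```

In practice, raw plan output tends to contain details that differ between versions without any change in behavior, so a real comparison would likely need to normalize the output first. This sketch always returns a failure on any difference; the warning-versus-failure distinction from the talk is left out.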
It's important to note that this is not a complete test. It doesn't test new features. But it verifies that the update doesn't break any existing configurations, which is what we care about when upgrading.
SREs run this across all of the dev instances of all of their projects which don't have access to user data. We use modules like many of you do. In production—where customer data and user data is available—only the SREs that have access to that project, and the developers who have access to that project—and have been approved—can run those Terraform configurations in production.
But because we use modules, and in development, SREs have access to those same modules, we execute the configuration across those modules in development mode. We don't have SREs who have access to the keys to the kingdom. Instead, they can only see the things that exist in development. But it's still a good test.
» Migrating to Terraform 0.12
This finally brings me to Terraform 0.12, like most of you. 0.12 was amazing. Like, all 14 months that we were promised 0.12, it was great. I'm going to stay on this slide for a little bit to let that sink in.
We had a plan. Many of your internal users, and many of you, were like, "Man, I can't wait to use that new 0.12 feature." We had users internally that were like, "When is 0.12 coming? I'm on that D list." It was annoying. So we came up with this plan: we're going to run 0.12checklist, which is the Terraform 0.12 checklist command. We're going to run it across all of our internal repositories. We're going to make sure that it works. For anyone it's not working for, we're going to work with those teams to figure out what they need to do to upgrade.
Then we're going to provide 0.11 and 0.12 binaries to our users. Now note, we're only providing the binary, not the source code. I'll talk about this more in detail in a little bit. But our single-version policy applies to source code, not necessarily to binaries.
We're going to continue patching Terraform configurations in our mono repository until the 0.12 upgrade command succeeds. Then we're going to go to all of the users, and send out a PSA and say, "Hey, it's time to upgrade." Then we're going to switch the default from 0.11 to 0.12 and deprecate 0.11.
That journey looked something like this, as you can see on the slide. But to quote the famous quote that I don't know who to quote, "In theory, theory and practice are the same. In practice, they're not."
There are two main challenges with this approach. The first is that Terraform 0.12 isn't backward-compatible, and Terraform 0.11 isn't forward-compatible. I don't mean to talk badly about the team. They put a ton of effort into making this as easy as possible, with the checklist command and the upgrade command. These changes needed to be made to Terraform's core to support the sustainability of the project. That doesn't make it any less painful. But I do think that HashiCorp went above and beyond what even another company might have done to try to make this transition easy. But it was still a little bit painful.
Step zero was straightforward in our transition. We added this to our CI/CD pipelines for all of the Terraform projects—which was run the checklist command. If it doesn't succeed, we're going to print either a warning or a failure, depending on the project. If it succeeded, we're going to say, "Hey, you're probably ready for 0.12. You don't have anything to worry about." And we did this well in advance.
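A CI step along those lines might look like the sketch below. The function name and the TF_BIN and CHECKLIST_MODE variables are hypothetical. It also assumes that `terraform 0.12checklist` exits non-zero when a project still needs changes, which should be verified against the Terraform 0.11 release actually in use.

```shell
#!/usr/bin/env bash
# Hypothetical CI step: run the 0.12 checklist and warn or fail per project.
# TF_BIN and CHECKLIST_MODE are illustrative names, not from the talk.
# Assumption: the checklist command exits non-zero when changes are needed.

check_012_ready() {
  local project_dir="$1"
  local tf_bin="${TF_BIN:-terraform}"
  local mode="${CHECKLIST_MODE:-warn}"   # "warn" or "fail"

  if (cd "$project_dir" && "$tf_bin" 0.12checklist >/dev/null 2>&1); then
    echo "OK: $project_dir is probably ready for 0.12"
    return 0
  fi

  if [ "$mode" = "fail" ]; then
    echo "FAIL: $project_dir needs changes before 0.12" >&2
    return 1
  fi
  # Warn-only mode reports the problem but does not break the build.
  echo "WARN: $project_dir needs changes before 0.12" >&2
  return 0
}
```

Running with `CHECKLIST_MODE=fail` turns the same check into a hard gate, which matches the talk's description of printing either a warning or a failure depending on the project.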
May 28, 2019
Next, we froze all of our Terraform providers. These are like the Google provider, our internal provider. Because as we know from every software development story, we don't want multiple things in flight at the same time. If something's broken, we need to be able to zoom in to a very specific iteration, or a very specific thing that broke.
We wanted to make sure the providers were completely frozen. We decided that we were going to freeze providers on May 28 of this year, and we were not going to unfreeze them until 0.12 was rolled out. This meant that if there was an update to the Google provider, for example, it would not be rolled out until the 0.12 upgrade was finished. Also, a slight side note. I don't know what heathen is keeping this diary. We all know that ISO 8601 is the only proper way to represent dates.
June 11, 2019
We started making many LSCs (large-scale code changes) to prepare projects for Terraform 0.12, while 0.11 was still the default. We built some internal automation that would test the 0.11 to 0.12 upgrade at scale. Most users didn't experience any interruption to their workflow. The SREs were working in the background to refactor projects and submit change requests, or pull requests, to our mono repository to prepare these configurations for 0.12.
June 20, 2019
We found some bugs—at the scale that we're using Terraform, there are bugs. I think anyone on the Terraform team will admit that there were bugs between 0.11 and 0.12. At the scale that we're using Terraform, we found all of them. We submitted some patches. We submitted some issues upstream. We also wrote some really dope regular expressions to get around some of that stuff. When you write a good regular expression—I don't know if you all feel like I do. When you get that thing right with the backslashes and the question marks and the capture groups—there’s something super-satisfying about a good regular expression.
July 8, 2019
We brought the source code for Terraform 0.12 into our internal mono repository, replacing the 0.11 source code. What does that mean? At this point, we can't build 0.11 binaries anymore. Because we build from source we have exactly one version of source, and we've replaced that source with 0.12. Any new binaries moving forward will be 0.12, but we still have the 0.11 binaries available for users to consume. We made a 0.12 preview build, and we made that available to our internal power users, who were like, "When is 0.12 happening?"
July 16th, 2019
We updated the default Terraform binary to be 0.12. If you were starting a brand-new project from scratch and you executed Terraform, you would get 0.12 Terraform. The Terraform 0.11 binary still existed, but it was marked as legacy in our internal package system. If you tried to execute Terraform 0.11, you would get a warning message that said, "Hey, you are using a deprecated version of this package. You should upgrade as soon as possible."
There were still some projects that couldn't upgrade. They didn't have time or bandwidth, or they were on some other deadline. They were able to pin to this older version, while we continued to upgrade other projects.
July 26th, 2019
We made 0.12 officially available for use in automation systems. That's our tfyolo. You might be asking yourselves, why did it take so long? Well, think about it. Prior to this, when you were executing Terraform, you had to do so in the context of a human being in the loop. If someone ran Terraform, and all of a sudden Terraform was like, "Yo, I'm going to delete everything," a human could stop that.
But on July 26, we brought Terraform 0.12 as the default into our automation systems, which are continuously actuating on Terraform. They're continuously executing terraform apply -auto-approve. At that point, we had to be very sure, and we had to really believe, that the software was not going to damage things, because there's no longer a human in the loop. Again, this is our while true; do terraform apply -auto-approve; done.
August 8, 2019
We sent an internal PSA email to all users that the Terraform 0.11 binary is going away…soon.
September 1, 2019
That was 10 days ago for those of you following along at home. We removed the Terraform 0.11 binary from our internal source repository. At this point, all of Google is on Terraform 0.12. From May 16 until September 1, we were upgrading from 0.11 to 0.12.
It wasn't all sunshine and rainbows. For example, there are some potentially small changes in 0.12 that have disastrous consequences when it's used at scale. This is probably the biggest of them, which is that in 0.12, there was a type system.
» Infrastructure brings us together
Finding and fixing those bugs was a multi-person multi-team effort. I'm not just talking about within Google. I'm talking about Google's partnership with HashiCorp, working with open source communities, working with our own customers that use Google, or they use Terraform to provision infrastructure on Google cloud. Working with our internal customers—they use Terraform to provision their services on Google on-prem and Google cloud.
To give you a little bit more context as to why I'm giving this talk: Terraform, and specifically in this context the 0.12 upgrade, brought us all together. Terraform 0.12 wasn't an easy upgrade task. Many in the audience that have gone through the upgrade at a large enterprise probably have experienced this as well.
It's totally a worthwhile upgrade. You should totally upgrade, but you should also know that like any software upgrade, it's not all sunshine and rainbows.
To give you an understanding of the scale at which we're using Terraform at Google, and also the number of humans that were involved in collaborating on this upgrade: the 0.12 upgrade took over 10,000 human-hours scaled across four very broad teams at Google. Those were site reliability engineering, software engineering, professional services, and corporate IT. I can't tell you exactly how many lines of code of Terraform we have internally at Google. But hopefully, these numbers give you an understanding of the size and the scale at which we're leveraging the tools.
So that brings me back to the very beginning, which is that infrastructure brings us together. It brings together teams internally. Whether you want to call it DevOps or SRE or infrastructure as code, it forces collaboration to happen. It brings customers and vendors and cloud providers together because it gives us a single language to communicate.
Terraform is fundamentally how infrastructure automation humans collaborate. It's important that every once in a while, we take off our vendor hats. We, the collective we (the community, HashiCorp, cloud providers, individuals, that guy named Joe, that girl named Sally), have to make this work. It's fundamentally critical to our industry and to our success as practitioners that we make this work, and we have to make it work for everyone. It's not working…for everyone.
There is one more thing. I am very excited to announce that HashiConf 2020 will be announced by Rosemary in a minute. Thank you all.