Learn how Discover Financial manages over 2000 workspaces in Terraform Enterprise to create a self-service infrastructure as code environment.
Thank you very much. Happy to be here. My name is Angel Naranjo, and I'm from Discover Financial Services. I'm here to do a presentation on managing Terraform Enterprise with 2,000+ workspaces at Discover Financial.
To get started, you may be wondering to yourself: 2,000 workspaces seems like a lot of workspaces. How exactly do you go about managing so many workspaces and infrastructure? Make sure you're asking the right questions.
When you're managing so many workspaces, you're also managing over 2,000 state files, which cover all your key infrastructure. At Discover, we use Terraform and Terraform workspaces to manage our cloud infrastructure, including GCP and AWS.
With all this key infrastructure, these state files hold a lot of sensitive information. So, you want to make sure when you're managing Terraform Enterprise (TFE), you have no unanswered questions. You want to make sure everything is secure and managed properly since these state files are critical to your organization.
To get started, I want to do a quick overview of the type of questions I'll be going over in today's presentation. We're going to be starting off very big with the idea of managing Terraform Enterprise and going down very small to the actual workspace level.
Make sure your environment is set up properly so all the workspaces that are being created within Terraform Enterprise are secure. To do this, we're going to look at two things: The Terraform Enterprise architecture we have at Discover Financial Services, and how exactly do we handle disasters in case something happens.
This is the architecture we have currently at Discover Financial. For us, we decided it would be best to deploy our infrastructure within AWS. If people are familiar with Terraform Enterprise, this is very similar to what is recommended by HashiCorp. We always try to follow what is recommended by the vendor to make sure everything's managed properly.
Going on to the disasters themselves, you want to take into account two different things. First, we want to take into account the Availability Zone (AZ). In AWS, our EC2 instance runs within one single AZ. In case that AZ goes down, we want to be able to redeploy the EC2 instance in a different AZ so the workspaces stay available. With 2,000+ workspaces, an outage means a lot of teams lose access to a lot of resources. We have to make sure these workspaces are available to all of our users.
Second is a region failure. This is more of a major situation, and we want to make sure our workspaces and state files are available when a region goes down. We do two things. The first is RDS read replicas, which make sure the workspace metadata, such as the workspace name, is available in another region in case one region goes down. We're actively transferring all the data to a replica in a second region.
Secondly is the S3 bucket. Within your Terraform Enterprise environment, all the state files are stored within an S3 bucket. You want to make sure you utilize S3 replication to actively copy these state files into another region.
We have all of our data available in another region, but that data's going to be useless if we don't have a way to redeploy Terraform Enterprise in another region quickly and successfully.
How do you make sure you deploy your Terraform Enterprise environment in a quick and successful manner in case a region goes down, so that your workspaces are easily available to be consumed?
To do this, we have set up a deployment automation. We set up a Jenkins pipeline. All our Terraform Enterprise infrastructure is managed by Terraform, and this pipeline is set up so that you first generate a plan of all the infrastructure you want to recreate. That's recreating the Route 53, the EC2s, the load balancers.
You generate a Terraform plan. And, before you move on to the actual apply, our pipeline is set up in a way that we want to scan the actual plan. We want to make sure whatever the engineers are redeploying is secure.
Let's say, for the S3 buckets. We want to make sure those are encrypted — and not only the S3 buckets — but make sure the EBS volume and the RDS is encrypted. Only when all this infrastructure is properly configured and encrypted, our Jenkins pipeline allows us to move onto the next step, which is the Terraform apply.
Our pipeline checks to make sure everything is good and then deploys our infrastructure within AWS. Here, two things happen: The state file is stored within a separate S3 bucket. Then, all the infrastructure gets redeployed in AWS, including the Route 53 records, the RDS database, the S3 bucket, and, finally, the EC2 instance.
When the EC2 instance gets created, this triggers a Chef cookbook. We don't just want to automate the creation of the Terraform Enterprise infrastructure. We want to automate the actual deployment of the Terraform Enterprise application. So, the Chef cookbook gets triggered when the EC2 is created, and that gathers all your files, the air-gapped files, the Replicated files of your Terraform Enterprise environment, and deploys them within the EC2 instance. So, as soon as you trigger the pipeline, at the very end, your Terraform Enterprise application is available to be consumed.
Let's go back a little bit to the actual scanning of the Terraform plan. What happens if the actual plan detects something that it doesn't like — let's say the S3 bucket is not being encrypted? We set up our Jenkins pipeline to make sure if that does happen, the pipeline fails.
With so many workspaces and so many state files containing a lot of sensitive information, we want to make sure wherever we deploy our Terraform Enterprise application it's as secure as possible. That's why it's very important we have fail-safes in place, so that if an engineer forgets to encrypt their S3 buckets or the RDS database, the pipeline doesn't let them create it.
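The plan-scanning gate described above could look roughly like the following. This is a minimal sketch, not Discover's actual scanner: it assumes the plan has been exported with `terraform show -json` and loaded into a dict, and the rule table and sample data are illustrative. The S3 check is left out on purpose, because how bucket encryption shows up in a plan depends on the AWS provider version.

```python
from typing import List

# Assumed rule table: resource type -> attribute that must be true.
ENCRYPTION_RULES = {
    "aws_ebs_volume": "encrypted",
    "aws_db_instance": "storage_encrypted",
}


def find_unencrypted(plan: dict) -> List[str]:
    """Return addresses of created or updated resources that are not encrypted."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue  # ignore no-op and delete-only changes
        attr = ENCRYPTION_RULES.get(rc.get("type"))
        if attr is None:
            continue  # no rule for this resource type
        after = change.get("after") or {}
        if after.get(attr) is not True:
            violations.append(rc["address"])
    return violations


# Tiny hand-written stand-in for real `terraform show -json` output.
sample_plan = {
    "resource_changes": [
        {"address": "aws_ebs_volume.data", "type": "aws_ebs_volume",
         "change": {"actions": ["create"], "after": {"encrypted": True}}},
        {"address": "aws_db_instance.tfe", "type": "aws_db_instance",
         "change": {"actions": ["create"], "after": {"storage_encrypted": False}}},
    ]
}

violations = find_unencrypted(sample_plan)
# A pipeline step would exit non-zero here so the apply stage never runs.
print("FAIL" if violations else "PASS", violations)
```

Treating anything other than an explicit `True` as a violation means values that are unknown at plan time also fail the gate, which errs on the side of blocking the apply.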
This is a quick recap of what happens on your first deployment to a new region. Either the pipeline sees what it likes, and it deploys the infrastructure within AWS. Or it sees something that's not configured correctly, and it fails. Well, that's just the first deployment. That's to make sure the actual infrastructure is very secure. But that's not the only thing that we have to be checking to make sure it's secure.
For any subsequent deployments, we add an extra step at the very beginning, right before the Terraform plan. This is a Terraform taint command. We want to recreate this infrastructure on a monthly basis or, depending on your organization, a weekly or biweekly basis. Tainting the EC2 instances marks them to be recreated on the next apply. We do this for two reasons, the first being the AMI.
At Discover, we're always updating the AMIs that are being used for our EC2 instances. These updates include security patches. We want to make sure we're taking advantage of these security patches since, as I keep mentioning, these state files are critical at Discover, and they hold critical information. So, we want to make sure there's no way to get unauthorized access. That ensures that our AMIs are secure.
Secondly is the Terraform Enterprise application itself. As many of you might be familiar, Terraform Enterprise gets updates from HashiCorp on a monthly basis. We want to take advantage of these updates, whether they are security patches or actual new functionality in Terraform Enterprise. That's why for every subsequent deployment, we want to make sure we are refreshing the EC2 instances and recreating them. That covers the Terraform Enterprise architecture. Let's dive a little bit deeper.
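The taint step that runs before the plan on subsequent deployments might be assembled like this. It is only a sketch: the resource address `aws_instance.tfe` and the plan file name are assumptions, and the function only builds the command lines rather than running them. Recent Terraform releases recommend the `-replace` option on plan and apply over `terraform taint`, so a newer pipeline may prefer that form.

```python
from typing import List


def build_refresh_commands(instance_addresses: List[str]) -> List[List[str]]:
    """Build the CLI steps for a refresh run: taint each instance, then plan."""
    commands = [["terraform", "taint", address] for address in instance_addresses]
    # The plan is written to a file so a later stage can scan it and apply it.
    commands.append(["terraform", "plan", "-out=tfplan"])
    return commands


# Hypothetical address of the EC2 instance that hosts Terraform Enterprise.
for command in build_refresh_commands(["aws_instance.tfe"]):
    print(" ".join(command))
```

Each inner list can be passed to `subprocess.run(command, check=True)` in a pipeline step, so a failed taint stops the run before the plan.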
What about Terraform Enterprise organization — the place where all the workspaces are going to be living?
The way we set it up is a single organization within Terraform Enterprise with multiple teams. We want all of our teams to live in one environment, and they manage their own workspaces, where each team only has access to their own workspaces, and they just live within the same organization. The reason we went with a single-organization, multi-team structure is that we want to take advantage of two things offered by Terraform Enterprise.
Firstly, the Terraform private module registry. At Discover, we have teams that create Terraform modules for all the Terraform code — either for AWS or GCP — and we want to make sure our teams are utilizing those Terraform modules when creating their infrastructure. Some teams might not be super familiar with Terraform, so we want to make it as easy as possible for them to consume Terraform.
Secondly, it's the Sentinel policies. With Sentinel, we want to make sure that when infrastructure is getting created through our public cloud providers, we have some guardrails in place. With so many teams at Discover, a lot of infrastructure's going to be created, such as EC2s and Lambdas, and it's hard to keep track of all this infrastructure that's being created.
With Sentinel, we have policies such as mandatory tagging. We want to make sure when people are creating infrastructure with Terraform and deploying it through Terraform Enterprise, that each resource is tagged properly — including the team name, cost center, and team email. We want to make sure that in case something happens, we know who owns what infrastructure within our cloud environment.
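Real Sentinel policies are written in the Sentinel language, so the Python below only illustrates the logic of a mandatory tagging check against plan JSON. The tag keys are assumed names for the team name, cost center, and team email mentioned above, and the sample data is made up.

```python
from typing import Dict, List

REQUIRED_TAGS = ("team_name", "cost_center", "team_email")  # assumed key names


def missing_tags(plan: dict) -> Dict[str, List[str]]:
    """Map each taggable planned resource to the required tags it lacks."""
    problems: Dict[str, List[str]] = {}
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after") or {}
        if "tags" not in after:
            continue  # this resource type exposes no tags attribute
        tags = after["tags"] or {}
        missing = [key for key in REQUIRED_TAGS if not tags.get(key)]
        if missing:
            problems[rc["address"]] = missing
    return problems


sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.app", "type": "aws_instance",
         "change": {"after": {"tags": {"team_name": "payments",
                                       "cost_center": "1234"}}}},
        {"address": "aws_lambda_function.job", "type": "aws_lambda_function",
         "change": {"after": {"tags": None}}},
    ]
}

problems = missing_tags(sample_plan)
for address, keys in problems.items():
    print(address, "is missing", ", ".join(keys))
```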
That's very good. But if you notice from the workspace organizational structure, all these teams live in a single org. So, how do we handle team permissions within those workspaces within that single org? You may have noticed we separated each team into three different sub-teams: read-only, developer, and owner.
For read-only, there might be times where someone wants to audit someone's workspace. We want to give people access to be able to read the state file and variables within the workspace, but not make actual changes to the workspace. They would be given access to TFE read-only.
Secondly, is developer: This is where most of the developers and engineers for a certain team will have access. The developers are the ones that are constantly making changes to their Terraform code and redeploying the infrastructure. So, we want to give them access to be able to do runs and applies within Terraform Enterprise.
With owners, even though we have the developers that make changes to the workspace, we don't want them to be able to delete the workspace. These state files are key to Discover, so if they delete these workspaces, they could cause issues.
So, we have this separate team of owners where only a select few people per team should be given access to make changes to the actual workspace configuration — and delete the workspace when it's no longer needed.
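One way to express the three sub-teams is as a mapping onto Terraform Enterprise's fixed workspace access levels. The mapping below (read-only to `read`, developer to `write`, owner to `admin`) is an assumption about how such roles could line up, not a statement of Discover's configuration, and the IDs are placeholders. The function builds the request body for the Team Access API (`POST /api/v2/team-workspaces`) without sending it.

```python
# Assumed mapping from the talk's sub-team roles to TFE access levels.
ACCESS_BY_ROLE = {
    "read-only": "read",   # view state and variables, no changes
    "developer": "write",  # queue runs and apply
    "owner": "admin",      # change workspace settings and delete it
}


def team_access_payload(role: str, team_id: str, workspace_id: str) -> dict:
    """Build the JSON body that grants a team access to a workspace."""
    return {
        "data": {
            "type": "team-workspaces",
            "attributes": {"access": ACCESS_BY_ROLE[role]},
            "relationships": {
                "team": {"data": {"type": "teams", "id": team_id}},
                "workspace": {"data": {"type": "workspaces", "id": workspace_id}},
            },
        }
    }


# Placeholder IDs; real ones come back from the teams and workspaces endpoints.
payload = team_access_payload("developer", "team-EXAMPLE", "ws-EXAMPLE")
print(payload["data"]["attributes"])
```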
We have multiple teams in the Terraform organization. But how do we onboard more teams? We want to attract more teams to start using Terraform Enterprise instead of Terraform open source. So, we want to make sure we have a simple process for onboarding teams.
We set up a simple form that the users fill out, including their team name and the team owner. Once they fill out this form, it gets sent to a Jenkins pipeline — and this pipeline does the actual team creation via Python. In Python, we create the TFE teams and the associated LDAP groups.
You may have noticed under the Python symbol we have something called the TFE SDK. At Discover, we have this custom Python module set up as a way to interact with the Terraform Enterprise API. Terraform Enterprise exposes an API to create workspaces and create teams, and we want to be able to easily add this functionality to our pipelines.
Most of our developers are used to programming in Python, so we developed this custom module to do a lot of the Terraform Enterprise APIs. I'll be going more into this a bit later, and you'll be seeing this TFE SDK as we go on through the presentation. But this is something to keep in mind.
Lastly, when we create the Terraform teams through this TFE SDK, each team is associated with a specific LDAP group. Let's say the read-only has a read-only LDAP group, the developer has a developer LDAP group.
We have these teams created, and the way we give access to these teams is such that the user has to request access to the LDAP group itself. Once they have access to this LDAP group, they are automatically added to the Terraform Enterprise team. This makes it a simplified process and an automated way for users to get added and removed depending on what LDAP groups they have within the organization.
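The team creation step could be sketched as follows. The TFE SDK itself is internal to Discover, so this is not its code: the naming convention is hypothetical, and using the `sso-team-id` attribute to tie each team to a directory group is an assumption about how the LDAP association might be made. The function only builds request bodies for the Teams API (`POST /api/v2/organizations/:organization_name/teams`).

```python
from typing import List

ROLES = ("read-only", "developer", "owner")


def team_payloads(team_name: str) -> List[dict]:
    """Build one team-creation body per sub-team for a newly onboarded team."""
    payloads = []
    for role in ROLES:
        name = f"{team_name}-{role}"  # hypothetical naming convention
        payloads.append({
            "data": {
                "type": "teams",
                "attributes": {
                    "name": name,
                    # Assumed: the directory group shares the team's name.
                    "sso-team-id": name,
                },
            }
        })
    return payloads


for body in team_payloads("payments"):
    print(body["data"]["attributes"]["name"])
```

With a one-to-one naming rule like this, the form only needs the team name and owner, and everything else can be derived by the pipeline.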
Now we have multiple teams within Terraform Enterprise, and multiple teams are getting onboarded to Terraform Enterprise. But this might cause a lot of workspaces to be created. As I mentioned in the title of my presentation, there are over 2,000 workspaces currently within Terraform Enterprise, and now that we have such a simplified process to add more teams, more workspaces are going to keep being created. So, how exactly do we manage the workspace creation process?
For the workspace creation process, we do what's recommended by HashiCorp. You want to take into account four things: how often the resources change, their blast radius, who needs access to them, and whether they are common configuration.
You have to take these four things into account when deciding whether you want to create a new workspace or make the resources that you want to create as a part of an already existing workspace.
We want to take into account how often this resource is being updated. If this resource is being updated on a daily basis or very often, we don't want to put it in the same workspaces with other resources that are not updated as often.
Some resources are critical to your cloud infrastructure. If they're destroyed, they will have a big blast radius, and they'll damage a lot of your infrastructure. You want to make sure these are separated from the other resources.
You might want to create a whole application, but it includes multiple parts. Maybe your engineer should only have access for creating certain parts of the infrastructure and not other parts. So, you want to make sure the engineer only has access to the workspace that holds those specific resources.
Finally, it's common configuration: Some resources and infrastructure within AWS and GCP are going to be reused across different environments. We want to make sure those commonly used resources are in their own separate workspace.
These are the four things that you need to take into consideration when creating your Terraform workspaces. Only when you do so can you decide whether you want to create a new workspace or be reusing an already created workspace to add more infrastructure and resources.
We want to make sure we properly manage when workspaces are created. But over time, some of these workspaces may no longer be in use. A team deletes all its infrastructure but doesn't delete the workspace, or maybe a team creates a couple of test workspaces that are hogging up your environment. You want to make sure you have a clean and simplified environment within your organization, especially if you're managing over 2,000 workspaces.
For this, we have a cold storage process for when we want to delete workspaces that are no longer being used. We decided on a two-year policy. If your workspaces haven't been updated or touched in two years, then my team will delete that workspace and store it within an S3 bucket to be archived.
But we want to make sure that we exclude certain resources. Some workspaces hold key infrastructure that may not be updated on a daily basis, but they're key to your cloud infrastructure, including the IAM policies, VPCs, and networking.
You want to make sure that, for any automation for which you have cold storage there's a way to exclude certain infrastructure. This is just a general list. Maybe one team might come up to you and say we want to make sure that this workspace doesn't get deleted as well.
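The two-year rule with an exclusion list can be sketched as a small filter. The dictionary keys and workspace names here are illustrative, not the shape returned by the Terraform Enterprise API, and the two-year window is approximated as 730 days.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Set

RETENTION = timedelta(days=730)  # roughly two years


def select_for_cold_storage(workspaces: Iterable[dict],
                            excluded_names: Set[str],
                            now: datetime) -> List[str]:
    """Return names of workspaces untouched for longer than RETENTION."""
    cutoff = now - RETENTION
    stale = []
    for ws in workspaces:
        if ws["name"] in excluded_names:
            continue  # key infrastructure is never archived automatically
        if datetime.fromisoformat(ws["updated_at"]) < cutoff:
            stale.append(ws["name"])
    return stale


workspaces = [
    {"name": "app-old", "updated_at": "2021-06-01T00:00:00+00:00"},
    {"name": "app-new", "updated_at": "2023-10-01T00:00:00+00:00"},
    {"name": "network-core", "updated_at": "2019-03-01T00:00:00+00:00"},
]
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
stale = select_for_cold_storage(workspaces, {"network-core"}, now)
print(stale)
```

Keeping the exclusion list as plain data makes it easy to add a workspace when a team asks for it to be kept.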
You want to take this exclusion into account when you're putting workspaces into cold storage. To put these into cold storage we have a custom Python script that utilizes this TFE SDK — that I mentioned earlier — that grabs all the metadata and state files from that workspace via the Terraform API.
Once it gathers all this information from Terraform Enterprise we compress it to a ZIP file, and upload it to the S3 bucket. We don't want to delete the workspace. We want to have it archived somewhere in case someone wants to reuse that workspace even though it hasn't been used for over two years.
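The compression step might look like the sketch below. The archive layout (a `workspace.json` file plus numbered files under `states/`) is a hypothetical convention chosen for this example, and the upload to S3 is left as a comment because it depends on the AWS client in use.

```python
import io
import json
import zipfile
from typing import List


def build_archive(metadata: dict, state_versions: List[bytes]) -> bytes:
    """Pack workspace metadata and its state files into one ZIP, in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("workspace.json", json.dumps(metadata, indent=2))
        for index, state in enumerate(state_versions):
            # Zero-padded names keep the state versions in order.
            zf.writestr(f"states/{index:05d}.tfstate", state)
    return buffer.getvalue()


metadata = {"name": "app-old", "terraform_version": "1.3.9"}
states = [b'{"serial": 1}', b'{"serial": 2}']
archive = build_archive(metadata, states)
# The bytes in `archive` would then be uploaded to the cold storage bucket.
print(len(archive), "bytes")
```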
That's good, but now that we have it archived, how exactly do we retrieve the workspace? You want a simplified process, so that if a team requests that an archived workspace be added back to your Terraform Enterprise environment, you're able to do this quickly and easily. To do this, we have a cold storage recovery process which is very similar to our cold storage process, where we utilize the TFE SDK to recreate the workspace.
We have a Python script that extracts your ZIP file from your S3 bucket and decompresses it. In there, you're going to see a lot of information: The workspace name, the Terraform version it's using, and all the state files. This is where the SDK comes in and grabs all that information. It recreates the workspace, and it adds all your state files to that workspace properly.
Once that's all completed, it's added back into Terraform Enterprise as if it was never gone to begin with. This is very important because we want to make sure that when we re-add the workspace, they don't have to change anything in their infrastructure to start consuming it again.
It's very important to make sure the naming convention and Terraform version are kept the same, so it doesn't break anything in your customer pipelines. This leads us to our very last thing for this presentation.
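The recovery step can be sketched as turning an archive back into API request bodies. This assumes a hypothetical archive layout of `workspace.json` plus ordered files under `states/`, and it only builds the bodies for the Workspaces and State Versions APIs rather than sending them. The workspace name and Terraform version are copied through unchanged, which is the point made above.

```python
import base64
import hashlib
import io
import json
import zipfile
from typing import List, Tuple


def restore_payloads(archive: bytes) -> Tuple[dict, List[dict]]:
    """Read an archive and build workspace and state-version request bodies."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        meta = json.loads(zf.read("workspace.json"))
        names = sorted(n for n in zf.namelist() if n.startswith("states/"))
        raw_states = [zf.read(n) for n in names]

    workspace = {"data": {"type": "workspaces", "attributes": {
        "name": meta["name"],                            # same name as before
        "terraform-version": meta["terraform_version"],  # same version as before
    }}}

    state_versions = []
    for raw in raw_states:
        doc = json.loads(raw)  # serial and lineage live inside the state file
        state_versions.append({"data": {"type": "state-versions", "attributes": {
            "serial": doc["serial"],
            "lineage": doc["lineage"],
            "md5": hashlib.md5(raw).hexdigest(),
            "state": base64.b64encode(raw).decode("ascii"),
        }}})
    return workspace, state_versions
```

The State Versions API expects the workspace to be locked while a state is uploaded, so a full implementation would lock the new workspace, post each body in order, and unlock it afterwards.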
Throughout this presentation, I mentioned the TFE SDK throughout multiple different pipelines. Maybe in the future, when we're managing our infrastructure, we're going to be adding more pipelines. We have this TFE SDK where we have all of our common code. We want to make sure that for all these pipelines we create for managing our infrastructure, we're creating it in such a way that we're able to reuse it in multiple pipelines.
We made this custom Python module that we version when we have new updates and new functionality that we want to add to it. A lot of pipelines will utilize this TFE SDK as a way to interact with the Terraform Enterprise environment.
In the future, we might want to create a new pipeline as another way to manage our environment. We want to make sure that when we update this SDK, it doesn't negatively affect our already-created pipelines. We want to make sure when you have code dependencies that are being used through multiple pipelines, you version it such that you create a new version for your newer pipelines — and your old pipelines using older versions are unaffected.
This is very important because when you're managing so many workspaces, you want all your pipelines that you're using to manage these workspaces to be unaffected. This leads off to the very last thing.
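One simple way for a pipeline to protect itself from a breaking SDK update is to check the installed version against the range it was written for. The sketch below assumes the SDK follows `MAJOR.MINOR.PATCH` numbering, which the talk does not state; in a real pipeline the installed version string could come from `importlib.metadata.version` with the SDK's package name.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '1.4.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed: str, minimum: str) -> bool:
    """True if installed has the same major version and is at least minimum."""
    have, need = parse_version(installed), parse_version(minimum)
    return have[0] == need[0] and have >= need


# An older pipeline written against 1.3.0 keeps working on 1.4.2,
# but refuses to run against a 2.x release it was never tested with.
print(satisfies("1.4.2", "1.3.0"), satisfies("2.0.0", "1.3.0"))
```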
There are three things that I hope everybody learned, and I want to reiterate as a part of my presentation.
There are a lot of teams within Discover that are utilizing Terraform Enterprise. Because of this, they're going to keep giving you a lot of customer feedback. For example, on the workspace archiving cold storage tool. We want to take that into account because that's key to telling you whether you're managing your pipelines properly within your Terraform Enterprise environment. If you're not properly archiving your workspaces, your customers will complain.
When we refresh our Terraform Enterprise environment, we want to be using the latest version of Terraform Enterprise as a way to get security patches and new features.
Once Terraform Enterprise adds new features, such as Run Tasks or Drift Detection, we want to make sure your team is taking advantage of them and is experimenting with these features to make them available for your customers to use, because your customers will come back to you and say, “I saw this new feature within Terraform Enterprise, how do I use it?” This is very important when you're managing your workspaces as you want to keep all your customers happy.
Automation is very important when managing all these workspaces and the Terraform Enterprise environment. If you do not automate everything, you are running everything manually. Running that manually is going to be a slow and tedious process, including the process of refreshing your Terraform Enterprise environment. If that's a manual process, that's going to take one to two hours instead of 30 minutes if it's automated.
If that's the case, then your customers will complain and say, “Why can't I use Terraform Enterprise right now?” It's down. It's unavailable to use. That's something you want to avoid. That’s the same for your team creation: Make sure when teams are created the entire process is automated, so the teams are created fast and are easy for your customers to consume.
No one wants to ask to be onboarded onto Terraform Enterprise if one of your engineers has to manually create the team and it takes 1-2 days for them to create it. Then, the new customers that want to be in Terraform Enterprise will say, “You know what, never mind. I'll use Terraform open source instead.”
Ensure that you're automating all your processes properly. Then, in the future, when you add new functionality as a way to manage all these workspaces, make sure that's automated as well.
And with that, I end my presentation. I want to say thank you. I hope everybody learned something new today. I enjoyed my time here at HashiConf. I hope you guys do as well.