This case study reviews the lessons learned and a variety of use cases for Terraform and Vault in managing a multi-billion dollar PCI compliant environment.
GitLab's Nicholas Klick and DataRobot's Apollo Catlin once worked together at a company with a multi-billion dollar PCI compliant environment managing over 20 million credit card records. The complexity of maintaining this compliance architecture at that scale was challenging, and the legacy environment they encountered involved a lot of manual steps. They set to work implementing HashiCorp Vault and Terraform to automate more compliance workflows and simplify the auditing process.
Terraform's infrastructure as code documented in version control means your teams automatically have the documentation and the visibility mapped to each of your PCI requirements. Vault's audit logs tell auditors where secrets are going and who has access to what. Terraform can even carry out live system firewall verification and verify that your cloud storage (such as S3 buckets) is provisioned in a secure way by default.
Whether you are looking to make a cloud migration in a PCI environment, develop more agile means for reviewing in-scope infrastructure changes, or aim to automate your compliance testing, the lessons from this talk should prove valuable.
Nicholas Klick: Thank you all for coming out today. Quick overview—we’re going to do a brief intro and then talk about why Terraform and Vault matter for compliance. Then get into some examples. First, we're going to talk about how we got here. How did we get to using Terraform and Vault for our PCI compliance? Reasons?
Apollo and I met at a company where we were managing a payment pipeline that was transacting billions of dollars a year. We also managed a credit card vault with over 20 million records. Unfortunately, this legacy infrastructure was not without problems—as you can imagine.
Manual updates were happening in a lot of places, and there was a lot of drift in the code because of that. There were black boxes and silos where particular infrastructure engineers were the only ones that knew what was going on. It appeared to us that compliance was an afterthought. It was also unclear where the PCI scope was around the different systems.
Looking at that situation, we decided we needed to seek out tools that would help automate our infrastructure and make security and compliance processes a bit more manageable—and Terraform and Vault came to mind.
Terraform and Vault have numerous high-level benefits. For one, Terraform is good for transparency. If you're getting all your infrastructure into code, you can review that code. You can know exactly what your firewall configuration is—how all your EC2 instances or your RDS instances are set up. It helps with preventing the drift in the code and allows for easy segmentation of what you consider to be in scope versus out of scope—and also scoping different environments and accounts from each other.
The Terraform workflow of—plan, apply, verify—is a great fit for CI/CD pipelines. We'll have some examples of that later. It's easy to see and test changes before you promote them to production. The plan files themselves can provide that changelog documentation that your auditor might be looking for.
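As a rough illustration of using plan output as changelog documentation, here is a minimal Python sketch (not from the talk). It assumes the plan has been exported with `terraform show -json` and reads only the `resource_changes` entries; the resource addresses in the sample are hypothetical.

```python
import json
from typing import List


def summarize_plan(plan_json: str) -> List[str]:
    """Return one changelog line per resource the plan would actually change."""
    plan = json.loads(plan_json)
    lines = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        # "no-op" and "read" entries do not modify infrastructure, so skip them.
        if actions in (["no-op"], ["read"]):
            continue
        lines.append("/".join(actions) + ": " + rc["address"])
    return sorted(lines)


# Hypothetical sample, trimmed to the fields this sketch reads.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_security_group.app", "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
        {"address": "aws_db_instance.cards", "change": {"actions": ["delete", "create"]}},
    ]
})

if __name__ == "__main__":
    for line in summarize_plan(SAMPLE_PLAN):
        print(line)
```

Attaching a summary like this to the merge request gives a reviewer, and later an auditor, a short record of what each apply was expected to change.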
Apollo Catlin: Vault has different compliance benefits. If you are sticking with the building policies in HCL, you get some dependability with infrastructure as code. Applying those—that's also easy to automate—something we might show. It makes the production security a highly composable, easy to build system. Instead of starting with that shared key that's on every single production server. You're going to need to move past that when compliance time comes.
If you have a system like Vault in place, you're not necessarily relying on totally homegrown built processes for managing this stuff, and you can automate this tool without any additional custom work. Furthermore, Vault provides a strong, inherent encryption as a service in all it is doing.
Another important point for compliance is policy mapping and all of its auth frontends. You can map these things together and build a relatively powerful RBAC system that's independent of any backend authentication system that you might need to connect to: the auth backends of your LDAPs, basic user, password groupings. That allows you to access Vault, and you can map that to your secrets backends. I'm not talking about the KV ones, but some of the more advanced ones like your database backends, your AWS backends, GCP. They're able to regularly generate tokens that are compliant with the policies you and your auditors have come up with.
Once you're starting compliance, security is going to come along. You're going to have some increasing security requirements. Some that really saved us include built-in key rotation. You can use those functions relatively widely. You can use this with basic Vault Auto-Seal key rotation mechanisms.
Or you could easily write scripts to rotate PKI certs and develop new PKI certs—or build in PKI certs—new SSH certs—and automate on top of that. But it's easy to rotate with that stuff. In PCI compliance, we have to re-key our credit card vault annually. With Vault, we can do it faster. We can do it more frequently—but ultimately, it saves us a bit.
The Auto-Unseal in the Cloud feature is great for high security, but it also gives you that operational compliance to sleep through the night if Vault just happens to go down. You don't lose your secrets for that long. This exists in the Vault Enterprise product. But similarly, you can have automated sealing mechanisms that can be triggered off of your SIEM or other security log information. You can automatically have Vault locked down if you need that.
The PKI toolchain in Vault is great. You should all be using it. It's nice instead of running OpenSSL scripts all the time. Great tool.
Terraform and Vault help meet specific compliance requirements. We might say PCI—and we're going to say compliance—interchangeably. We don't mean certain compliance that you might've run into like Sarbanes-Oxley, which was mostly financial. But this does apply to PCI, SOC, and any of these compliance frameworks that have technical implications for real operations.
Vault, as I think I may have mentioned, makes secrets management highly operational instead of checking things into a Git secret file or using something like a password manager, which is not necessarily an operational tool. There's a lot of power there.
Key rotation; building employee RBAC authentication mechanisms. The two big ones for us are always going to be encryption at rest and encryption in transit, and for encryption in transit, connecting Vault up with something like Consul Connect can give you that.
There are some nice features that you should look into. SSH certs—if you aren't using SSH certs, you should be. You're probably using SSH keys—taking that next step to add the SSH certs; strong move.
Building ephemeral database credentials, that's great. Just saying, “I need to get into the database now. My role dictates that I can get in there.” But let's only give you creds for enough working time. Generally, I see people doing operational patterns of, "I'm going to get an eight-hour database or production system credential,”—basic working day length—and build that pattern into your workflow instead of having long-lived access credentials.
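To make the eight-hour credential pattern concrete, here is a minimal Python sketch (not from the talk) that builds the settings for a role in Vault's database secrets engine. The field names (`db_name`, `creation_statements`, `default_ttl`, `max_ttl`) follow that engine's role API; the connection name, the grant, and the helper itself are hypothetical, and actually sending the payload to Vault is left out.

```python
from typing import Dict, List

WORKDAY_SECONDS = 8 * 60 * 60  # "basic working day length" from the talk

# Plain string, not an f-string: Vault itself fills in the {{...}} placeholders.
CREATE_LOGIN = (
    'CREATE ROLE "{{name}}" WITH LOGIN PASSWORD \'{{password}}\' '
    "VALID UNTIL '{{expiration}}';"
)


def build_db_role(db_name: str, grants: List[str],
                  ttl_seconds: int = WORKDAY_SECONDS) -> Dict[str, object]:
    """Build a role definition whose credentials expire after ttl_seconds."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return {
        "db_name": db_name,
        "creation_statements": [CREATE_LOGIN] + list(grants),
        # Setting max_ttl equal to default_ttl means a lease cannot be renewed
        # past the working day.
        "default_ttl": f"{ttl_seconds}s",
        "max_ttl": f"{ttl_seconds}s",
    }


if __name__ == "__main__":
    role = build_db_role(
        "payments-postgres",  # hypothetical connection name
        ['GRANT SELECT ON ALL TABLES IN SCHEMA public TO "{{name}}";'],
    )
    print(role["default_ttl"], role["max_ttl"])
```

Keeping the role definition in code means the credential lifetime is reviewed like any other change instead of being set by hand.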
Auditing: logging Vault changes with syslog or file. There are some more advanced ones in Vault Enterprise. Hooking that thing up to your security systems, anything your SecOps team might have, such as a SIEM, and auditing those changes out of there. This is going to be something that your auditor might want on an annual basis. You might want to be looking for an auditor that's looking for some of that stuff on a more regular basis.
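As an illustration of what can be pulled out of those audit logs, here is a minimal Python sketch (not from the talk). It assumes the file audit device's format of one JSON object per line and reads only the `type`, `auth.display_name`, `request.operation`, and `request.path` fields; the identities and paths in the sample are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def access_counts(lines: Iterable[str]) -> Counter:
    """Count (who, operation, path) across audit log request entries."""
    counts: Counter = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        entry = json.loads(raw)
        # Each call is logged as a request and a response; count it once.
        if entry.get("type") != "request":
            continue
        who = (entry.get("auth") or {}).get("display_name") or "unknown"
        req = entry.get("request") or {}
        counts[(who, req.get("operation"), req.get("path"))] += 1
    return counts


# Hypothetical sample entries, trimmed to the fields this sketch reads.
SAMPLE_LOG = [
    json.dumps({"type": "request", "auth": {"display_name": "ldap-alice"},
                "request": {"operation": "read", "path": "database/creds/readonly"}}),
    json.dumps({"type": "response", "auth": {"display_name": "ldap-alice"},
                "request": {"operation": "read", "path": "database/creds/readonly"}}),
    json.dumps({"type": "request", "auth": {"display_name": "ldap-alice"},
                "request": {"operation": "read", "path": "database/creds/readonly"}}),
    json.dumps({"type": "request", "auth": {"display_name": "approle-ci"},
                "request": {"operation": "update", "path": "sys/policies/acl/payments"}}),
]

if __name__ == "__main__":
    for (who, op, path), n in sorted(access_counts(SAMPLE_LOG).items()):
        print(who, op, path, n)
```

A summary like this is the "who has access to what" view an auditor tends to ask for; in practice the same entries would more likely be shipped to a SIEM than parsed by hand.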
Similarly, we can automate the policy application and that changelog going through the code. Any good auditor is going to be accepting that as that system production changelog that you're going to be asked for—at least on an annual basis in some of these, certainly more frequently with others.
Nicholas Klick: Terraform can help meet a lot of the general compliance requirements as well. It supports the audit. Your auditor is going to ask for your server configurations. They're potentially going to ask for your firewall, and that's all being reviewed in your Terraform code. Every time you're changing the firewall, you have an MR or PR to review that. It demonstrates those documented firewall changes.
Terraform also helps with continuous compliance because once you have your compliant network set up, you can write tests against it—we'll have examples in a bit. And because you're avoiding that drift, you can help maintain continuous compliance.
Then you have clear network security and visibility because your infrastructure is in code. With role-based access or user group access, because you're setting that up in Terraform, it's clear, and all the changes to that are documented.
PCI has a bunch of requirements. In our situation, we were hosted on AWS, and these were specific AWS resources that we utilize—and each one of them mapped a specific requirement. Because we're using Terraform to set up those resources we had that documentation and the visibility to map to each one of those PCI requirements.
We've been talking about high-level theory, let’s dive into some specifics. First of all, I've got a couple of examples where we're integrating a few different things. We've got some Terraform code, we have some Test Kitchen and aws_spec scripts to write specs against the Terraform code. We're running all of that through GitLab CI/CD, and we're generating AWS resources.
In our GitLab pipeline, we have four stages: verify, validate, plan, and apply. We’re gonna dive into that code a little bit. You can see in the verify stage where we're utilizing Ruby 2.6 for the image and we're also installing Terraform, and doing a bundle install to make sure that our Test Kitchen is set up. We're also running our actual kitchen converge and kitchen verify, and those are going to execute the aws_spec scripts. This is further down in the gitlab-ci.yml. We're running terraform validate to make sure that the Terraform code is valid. The plan is executing terraform plan, and apply is apply. It's straightforward.
In this first example, we're going to be looking at a compliant firewall. This is going to test a live system verification. Terraform is generating the resource, and we have a spec that makes sure it always remains the way that it is. By putting this in your CI/CD, you can prevent the accidental firewall changes because your tests are going to fail if you make a change that isn't compliant. I'll go into that code.
Here I'm creating an AWS security group—I've got two different rules in it. One is allowing all outbound to 443. On the SSH port, I'm only accepting traffic from inside that VPC. In the aws_spec code, I'm writing tests against that to make sure that—in the security group—the outbound should be opened to 443, inbound should only be opened to my CIDR space for that specific VPC.
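The talk's spec was written in Ruby; as a language-neutral illustration of the same check, here is a minimal Python sketch (not the speakers' code). It assumes the security group is given as a dict in the shape returned by the AWS API's describe-security-groups call (`IpPermissions`, `IpPermissionsEgress`, `IpRanges`, `CidrIp`); the CIDR values are hypothetical.

```python
import ipaddress
from typing import Dict, List


def _ports(rule: Dict) -> tuple:
    return (rule.get("IpProtocol"), rule.get("FromPort"), rule.get("ToPort"))


def firewall_violations(group: Dict, vpc_cidr: str) -> List[str]:
    """List every rule that breaks: egress only tcp/443, SSH only from the VPC."""
    vpc = ipaddress.ip_network(vpc_cidr)
    problems = []
    for rule in group.get("IpPermissionsEgress", []):
        if _ports(rule) != ("tcp", 443, 443):
            problems.append(f"egress not limited to tcp/443: {_ports(rule)}")
    for rule in group.get("IpPermissions", []):
        if _ports(rule) != ("tcp", 22, 22):
            problems.append(f"ingress other than tcp/22: {_ports(rule)}")
            continue
        for ip_range in rule.get("IpRanges", []):
            source = ipaddress.ip_network(ip_range["CidrIp"])
            # SSH sources must sit entirely inside the VPC's address space.
            if not source.subnet_of(vpc):
                problems.append(f"ssh open to {ip_range['CidrIp']}, outside {vpc_cidr}")
    return problems


# Hypothetical groups for illustration.
COMPLIANT = {
    "IpPermissionsEgress": [{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
                             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
    "IpPermissions": [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
                       "IpRanges": [{"CidrIp": "10.0.1.0/24"}]}],
}
OPEN_SSH = dict(COMPLIANT, IpPermissions=[
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])

if __name__ == "__main__":
    print(firewall_violations(COMPLIANT, "10.0.0.0/16"))
    print(firewall_violations(OPEN_SSH, "10.0.0.0/16"))
```

Run in CI, a non-empty result fails the pipeline, which is how an accidental firewall change gets caught before it reaches production.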
When I run it, you can see at the bottom of that there's also another test in there for an S3 bucket—which we'll talk about next. But there you can see that the three tests will run. In GitLab, the pipeline looks like that when it goes out.
Apollo Catlin: Aws_spec, like some other tools I'll talk about in a minute, can run however you need it to. We're ultimately relying on a kitchen-terraform server layer under this. You can always write tests in two ways with these systems: you can write them to be running against the live system, as we did. You can also write the same things to be running in a bootstrap test environment, reusing your same production code, the same as what we did here.
With the first example, we ran tests against what we would expect out of existing live code. We can take that, add a test fixture for kitchen-terraform to run. It creates a resource off of what should be reusable module code. This is a pattern I would recommend for the ability to test before your code is hitting production.
Kitchen-terraform will create resources and check them, so you're not working in a mock environment, which can always become a whole abstraction problem. The value we can get out of something like Terraform is creating these resources and testing them. But we can also test our code before it goes out.
We have a pipeline, we know it failed. We had a dynamically created test bucket for this—and Kitchen is built so that, even if this test fails, we still tear down that resource. That's because we are testing an instance of an S3 module in this case. We failed the test for having the server-side encryption. We can patch that, go back in here, get clear and go forward with that.
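The failing check described here can be sketched in a few lines of Python (not the speakers' Ruby spec). The sketch assumes the bucket's encryption settings are passed in as a dict in the shape of the S3 get-bucket-encryption response, or `None` when no configuration is found; fetching it from AWS is left out.

```python
from typing import Dict, Optional

ALLOWED_ALGORITHMS = {"AES256", "aws:kms"}


def has_default_encryption(encryption_response: Optional[Dict]) -> bool:
    """True if any rule applies server-side encryption by default."""
    if not encryption_response:
        return False
    config = encryption_response.get("ServerSideEncryptionConfiguration", {})
    return any(
        rule.get("ApplyServerSideEncryptionByDefault", {}).get("SSEAlgorithm")
        in ALLOWED_ALGORITHMS
        for rule in config.get("Rules", [])
    )


# Hypothetical response for an encrypted bucket.
ENCRYPTED = {
    "ServerSideEncryptionConfiguration": {
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    }
}

if __name__ == "__main__":
    print(has_default_encryption(ENCRYPTED))
    print(has_default_encryption(None))
```

The same function can be pointed at a throwaway test bucket created from the module or at the production bucket, which is the reuse the talk describes.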
This is a basic test pattern. But with that same simple workflow, you’re going to be elaborating for tons of environments, tons of specific cases. But it is very quick, very easy to build something up like this. You can get tests that are testing your live infrastructure or testing the code you're writing before it goes out.
We don't quite have a demo here. But it's almost the same pattern you would be running with your Terraform and that lifecycle. You can employ it for maintaining your Vault policies as well. You'd have a folder of your Vault policies, and you'd be pushing those, writing those, continually.
Testing policies definitely gets a little more complicated if you don't have something like Sentinel. It's certainly something doable, but I would say the power here is that you're continually running that thing; just in case someone has done some manual policy changes, those would be potentially overwritten—and you're not ending up running any policies in production that haven't been vetted and gone through CI and approval process.
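Since the talk had no demo for this part, here is a minimal Python sketch of the idea under stated assumptions: policies live as `.hcl` files in one folder, the file name is the policy name, and the live policies have already been fetched into a dict. The function names are hypothetical, and the calls that would read from or write to Vault are left out.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Vault's built-in policies; this sketch never proposes deleting them.
BUILT_IN = {"root", "default"}


def load_policies(folder: Path) -> Dict[str, str]:
    """Read every *.hcl file in the folder, keyed by file name without suffix."""
    return {p.stem: p.read_text() for p in sorted(folder.glob("*.hcl"))}


def plan_sync(desired: Dict[str, str], live: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the (action, policy name) steps that make live match the repo."""
    actions = []
    for name in sorted(desired):
        if name not in live:
            actions.append(("create", name))
        elif live[name].strip() != desired[name].strip():
            # A manual change in production gets overwritten by the reviewed copy.
            actions.append(("overwrite", name))
    for name in sorted(set(live) - set(desired) - BUILT_IN):
        actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    desired = {"payments-read": 'path "secret/data/payments/*" { capabilities = ["read"] }'}
    live = {"payments-read": 'path "secret/data/payments/*" { capabilities = ["read", "update"] }',
            "default": "", "tmp-debug": 'path "*" { capabilities = ["read"] }'}
    print(plan_sync(desired, live))
```

Running this on a schedule and logging the resulting actions gives both the enforcement and the changelog: anything that was not reviewed shows up as an overwrite or a delete.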
This is one of the things that auditors are going to be looking for in your PCI, SOC, and ISO setups: that changelog around these types of access changes and permissioning.
I want to talk about a few specifics that will help in this infrastructure as code Terraform world—that will help you in any stage of starting a compliance project. You can certainly map it to the crawl, walk, run pattern. If you're here because you're starting a compliance project or you're moving into some more advanced stages—you don't have Terraform yet—I think we have a few things that might help you out.
Before we start going on to this, compliance is not about checking boxes. It is about continually improving your security and your compliance stance. Compliance is going to change. Even if you had a perfect year last year—you got everything done in your compliance report—there are going to be changes. That's most likely on any given year. Being able to build from that is certainly powerful once you’ve got stuff going.
But when you're starting, you want to put yourself in a position to start that continuous improvement. Taking any big project on like this that touches not only how people are using tools, but how the tools are executing— it becomes more challenging the larger bite you want to take off.
Getting yourself in a position to start small and build up—or if you already have a relatively sufficiently sophisticated system—getting in a position to continually improve is going to help you out as well.
I want to say about the “checkboxes” thing that it does matter when you're choosing an auditor. It's pretty much always something you have some level of control over. But you want to look for people that do not have a one-size-fits-all compliance solution that they're just handing to you.
It's especially something that you might run into with the more technical individuals involved in the compliance processes. Generally when you're getting those types of, “you just have to do the stuff,” you are not going to be getting an auditor that's going to understand your systems. One example that I've seen a lot recently—and I've heard a lot of people talk about—is the move to Kubernetes; where your auditors are generally just going to understand—maybe—VMs, but sometimes they won’t even understand the cloud.
A good example is also VPCs—security groups, subnets, subnet routing, right? This is ostensibly where your firewall lives. But I've also run into auditors, they're like, "You can't run in the cloud without a firewall appliance,” which isn't exactly how that's going to work.
You want to find an auditor that's going to work with you and help understand your systems and understand your processes—and be able to map things like changelogs to the code that you're changing and the approvals there. This is in contrast to someone wanting an Excel sheet that you continually fill out and add a new line, and everyone has to write.
With terraform plan, this is the quick, easy, great thing to start with. If you are using Terraform already in the system, it's a good idea to get a cron set up on that. We've set them up in GitLab scheduled pipelines before. It's a great, simple process.
If you have Jenkins, you can do it there, or if you have any other centralized automation solution. You don't need to explicitly use cron. But running this stuff regularly and alerting if your plan file is showing anything is great. You have that power to know stuff is changing outside your code and outside the review process. It's a great quick drift prevention thing. If you aren't doing it, you should have some really simple thing that's running and sends you alerts: Slack alerts, emails, whatever you like.
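A scheduled drift check can be as small as the following Python sketch (not from the talk). It relies on `terraform plan -detailed-exitcode`, which exits 0 when there is nothing to change, 2 when the plan contains changes, and 1 on error. The function names are hypothetical, and the command runner is injectable so the logic can be exercised without Terraform installed.

```python
import subprocess
from typing import Callable, Sequence

PLAN_CMD = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]


def run_terraform(cmd: Sequence[str]) -> int:
    """Run the command in the current directory and return its exit code."""
    return subprocess.run(list(cmd), capture_output=True, text=True).returncode


def check_drift(run: Callable[[Sequence[str]], int] = run_terraform) -> str:
    """Return "in-sync" or "drift"; raise if the plan itself failed."""
    code = run(PLAN_CMD)
    if code == 0:
        return "in-sync"
    if code == 2:
        # Something differs from the reviewed code: this is the case to alert on.
        return "drift"
    raise RuntimeError(f"terraform plan failed with exit code {code}")
```

In a scheduled pipeline or cron job, calling `check_drift()` and sending a Slack message or email when it returns "drift" is enough to cover the pattern described above; the alerting call itself is left to whatever tooling is already in place.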
Sentinel's great. I've played with it—the Sentinel simulator—and it would be great to use the Enterprise thing at some point, but that's a great tool. You can be going entirely to a compliance as code framework with that product.
Other good options are your ServerSpec, InSpec, AWS spec. As I mentioned, there's, I believe, an Azure spec as well. Those patterns fit well in testing, but you can also write tests that are oriented towards your production infrastructure instead of that test infrastructure. You can generally run the same test code against both: test instances and your production infrastructure.
Terratest is another great one that is oriented towards that verification of your production systems. And also I know Goss is another I should be recommending. I haven't used it, but I've heard it's good. Terratest is not as focused on testing temporary test infrastructure—but is powerful and deep for testing any of your live infrastructure. You can build those tests around the idea of meeting all of your compliance goals—build a suite out of it. It does work quite well for that.
If you're beginning—and you're here—you may have started compliance. You might not be using Terraform yet. You might have come here because you really like Vault or using Consul. Terraformer is a tool made by a team at Google. There have certainly been a lot of importing tools that have come out in the past. Terraformer is one of them.
You can do the piecemeal HashiCorp Terraform import process. But if you're trying to start an infrastructure as code process and get that testable, Terraformer is a great way to import your existing infrastructure into infrastructure as code and start moving from there. It's a nice way to start, and I would probably recommend starting with it. Firewall review is one of the bigger ones you run into in PCI. So, do an import of all of your subnets, all your VPCs, all your security groups, all your route tables. Get an import of all that stuff into the state, have it export the files, and start testing it.
Terraform and Vault— it's easy to get these tools and start using any of the features—any of the things we talked about here. You don't need to go with both of them. You don't need to do both of them at the same time. You don't need to do everything in them. But starting a process of getting a highly reviewable compliant infrastructure will save you a lot of time once you're getting down to those last couple of days before your compliance checks come in and stuff.
These tools also have strong, built-in best practices. You can be focused on the Terraform application, the Terraform life cycle workflow, the Vault processes. Even simplifying a lot of the security things.
That's a good point too; generally, I would say to everyone in here who is building a compliance system, you should be striving not just to get compliant, but you should be making this a lot easier on your developers and the users who are going to have to do this. It's not necessarily impossible, but it is something that you should be valuing in it because the more people dislike being compliant in your system, the harder it will be for you to be compliant.
Terraform—and Vault also—but much more Terraform—makes it easy to test real systems and verify your systems instead of doing that test pattern.
They're both easy to automate. I am a huge fan of using CI/CD systems for this. But generally the more you automate, the less drift you're going to have. You're also going to be more compliant. Particularly in situations where you are overriding manual things that people might have done. That's a big one. Every time you have one of these things, you can write a test and get regressions in your code every time it runs to make sure it doesn't continue.
If you're not going the infrastructure as code—compliance as code—route, you only end up with postmortems. It's good to not end up with a second one that's the same way. Terraform and Vault make easy, reviewable, testable compliant infrastructures—and easy to build it. They're a strong platform to start automating your compliance process and easy to build in a way that will continually assure you're building a compliant system.
So, that's it. Thank you all.