How Many Tools Does It Take to Build a Cluster: Terraform, Vault, and Consul in Government IT
Jul 21, 2019
Get a deep dive into a complex government IT use case where Terraform, Vault, and Consul are helping simplify tooling and automate away many manual processes while maintaining compliance.
In government IT, and particularly the US Air Force, systems are very large and complex, and security mandates that there are different controls and reassurances covering all aspects of the system. Applications exist on VMs, on-prem, in the cloud, in OpenShift, and many other platforms and environments. In this setting, it's human issues—a lack of understanding of tooling and process—not technological issues that pose the biggest threat to system reliability.
In this HashiConf EU guest keynote, Women in Linux co-founder Tameika Reed takes a deep dive into this air-gapped use case and how they used HashiCorp tools (Terraform, Consul, and Vault) to unify tooling knowledge, automate manual processes, and prevent drift in configuration, security, and patches.
CEO and founder, Women in Linux
So a little bit of background about me—he gave you the introduction—I also worked for a company called Expansia. The main topic that we're going to talk about today is how we're trying to solve problems in government, in particular for the U.S. Air Force. The topic here is called the journey—and I should have put up there a continuous journey as well because that is exactly what it is—it’s a continuous journey.
We're going to talk about my journey with Terraform (this is a good segue for Terraform) and the things that I, and the team as well, had to do to get Terraform installed on-prem.
» Overview of the environment
We're in an air-gapped environment. When I say air-gapped, I mean there's no Internet connection. Think of it as you're sitting in a trunk with your laptop trying to deploy Terraform or Consul or anything like that—and with a straw to breathe out of. That's the picture I want you to paint and have that in your head the whole time.
The environment that we're in—we’re deploying VMs, sometimes from templates, sometimes they're manual installs, sometimes they're a clone of something. You may have Puppet modules to deploy. The Puppet modules are based on NIST standards. Anybody in here familiar with NIST standards? Compliance as code, and so forth. Cool.
Sometimes things come up, and you have to make manual changes. Sounds familiar? Anybody in here still making manual changes? Probably. Maybe. Could be. Then we have a multi-environment deployment.
When I say a multi-environment deployment, I mean that it could be on-prem, it could be in the cloud, it could be something with OpenShift, it could be anything. So how do you prepare workers and train people on something in this type of environment and have consistency, and not have drift in your configuration, drift in your security, drift in your patches, and so forth?
We want to get rid of the manual testing after deployment as well. When we talk about manual testing, we mean having a mindset of manual testing for security and manual testing for your configurations with Terraform, Vault, and Consul. We want to get away from doing all of those things manually.
Some hardships that we've had are still ongoing; we're still pulling apart the onion. Logging into VMs, making changes in production, longer approval processes. We'll go into this with Sentinel. Monolithic applications and their dependencies: we're using monolithic applications as we deploy, but we want to break those up into microservices.
Again, we're talking DevSecOps and being siloed. You may have one team that's doing security for a particular group versus security being brought in from the beginning all the way through. And having better team collaboration. As we see with being siloed, there's not a lot of collaboration with a lot of teams sometimes—and sometimes there are.
» Goals for improvement
We talked about government transition, and this is where we're starting to get into the meat and the potatoes. We want to start getting away from having siloed teams and push for collaboration between internal teams. You can think of this like the U.S. Air Force as a hypothetical. Imagine how many IT teams you have in that particular environment. Imagine how many teams you have in your environment where you're not constantly talking to each other.
You may have one team in one section doing DevOps, or you may have another team over here doing DevOps, but you're using two different stacks. There's not a lot of communication. We want to push more to have internal team collaboration because that leads to quicker deployments, better feedback. Alignment as well.
We want to make sure that, as we deploy, we have testing throughout our pipeline from the beginning all the way to the end. And testing is not just functional testing to make sure the application works, but also testing of your code and testing of everything throughout the process.
» Understanding the system architecture
I put this here to define what the system architecture looks like for us at this current moment. System architecture for us—if you go online and look—I made some changes and I’ll highlight some of those changes in here. One of the changes that I did put in here was DevSecOps. That wasn't necessarily in the system architecture diagram that I followed.
Also having architecture development phases. And I say that with the meaning of when you're doing a deployment, especially if you start talking about moving from monolithic—or moving from a situation where you're going into microservices or deploying on-prem or multi-environment—you have to have a baseline.
And that baseline can be, “Hey, I know I need to have some security testing in there—I know I need to be able to create my machine images from the beginning to end, testing those machine images.” Also testing for security—testing for where I'm storing those at, what type of revision repository—how often am I going to deploy these machine images and what kind of revisions am I going to have for those?
Continuous security. For us it is deploying the VMs using a template. You can use Terraform for that. Or you can use Terraform and deploy your actual infrastructure.
Right now we're deploying just the infrastructure to VMware, using Packer to do our builds, and I'll discuss this a little later in this talk. We're going through and saying, "We're going to template out Packer and then we're going to deploy images from there."
With that we'll have security injected from the beginning. So instead of having security at the end, security can be put in at the beginning with Packer. That means security from scanning your RPMs and scanning your ISOs, and also your config files in terms of static code analysis and so forth.
This is also where we get into doing policy as code. We use Terraform Enterprise, so policy as code is huge for us. A lot of the upcoming announcements that just came out with Terraform are something that we really need and I'm excited about those things.
» Image deployment tied into JIRA
Last but not least—throughout our entire pipeline—is always having feedback to the teams and always having feedback into JIRA. So as you deploy throughout the process, there's a feedback loop and quicker feedback loops because now you have that attached to JIRA. Right now we're setting up where we can have our Packer deployment or machine image deployment tied into JIRA.
This is a wild idea, a concept, that we came up with because we are using Red Hat: we need to have the Satellite server pulling in all the ISOs, all our packages, and so forth.
From there we'll roll into scanning those things, scanning those packages, scanning those ISOs. If we find any vulnerabilities in those, we immediately know upfront there's an issue. So now we don't move forward in our pipeline. We fix that. So now we've implemented pulling in those packages and implementing security upfront.
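A gate like the one described here can be sketched in a few lines of Python. This is only an illustration, assuming scan results have already been collected into simple records. The `Finding` type, the artifact names, and the advisory IDs are hypothetical, and no particular scanner's output format is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One vulnerability reported against a package or ISO (hypothetical shape)."""
    artifact: str
    advisory: str


def gate(findings):
    """Return (ok, message). Any finding at all stops the pipeline."""
    if not findings:
        return True, "no vulnerabilities found; continue to the build stage"
    # Report each affected artifact once, in a stable order.
    names = sorted({f.artifact for f in findings})
    return False, "stop and fix before moving forward: " + ", ".join(names)
```

Calling `gate([])` lets the pipeline continue, while a single `Finding` is enough to stop it. A real setup might only block above some severity threshold; that choice is not stated in the talk.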
» GitLab and Jenkins
The next phase that you see is we have GitLab and we have Jenkins. Now you may ask yourself, why do you have both of those? I told you it's a journey. So as we go along that journey, we will pick GitLab or we will pick Jenkins. For right now we have both.
How would that look? Well a file is a file, right? So for us it's as simple as writing. You don't have to have a batch group. Any script can be called from Jenkins or any script can be called from GitLab.
» Using Packer
Once you get into this situation, we start checking our code in and then kicking off a build. That's where you see number one, where we start using Packer. Packer has the ability to build out Docker containers, AMIs, and also OVAs.
Then we move into phase two, and phase two is saying, “Hey, if I'm on-prem, I can use Terraform to deploy to wherever I need to go. In this case we're on-prem, or in a local situation I can use Vagrant.”
The next thing that you see is STIGging. When I say STIGs, I'm talking about NIST standards, security as code, and compliance as code. Here we can use PowerShell if we have a Windows box. We can also use Puppet, or we can use Ansible scripts. And then the next thing you have is Twistlock or Nexus IQ or ACAS.
With those three, there are some features that you get in Twistlock that you don't get inside of Nexus IQ. But what you do get out of all of those is the ability to apply some policies within those particular items and say, “Hey, if you pass or fail after I do a particular scan, and I can define what that scan is, then you move on and I have what I consider a complete machine image that has been secured from beginning all the way to the end and available for use.”
You can even take that one step further and say, “Hey, let me define what an application would look like and apply an application on top of that as well, and provide that in your staging environment.” And then say, “I've STIGged the image, I've deployed the VM, and I've also applied the application on top of that.” Throughout that staging process, you would now see that you have security implemented all the way from beginning to end. We're using Terraform for that, and I'll show you this next.
» Policy as code
Policy as code. We want to implement this—as we talked about in the image policy. We want to create it; we want to be able to deploy to different environments—quicker time for responses in terms of failures—in terms of successes. Knowledge transfer through code—how did you do it—being able to share with other teams, and also bringing ops and developers together for quicker feedback loops on deployments.
We talked about it earlier, and we heard about Sentinel. How does that apply in our world? In our world, policy as code is huge and something that is ongoing every day.
As you see at the bottom, and I talked about this, you're checking your code in, building out, whether it's a Docker container or a VM, and then using Sentinel to say, “Hey, I can deploy to the staging environment.” If not, I'll stop and say I can't deploy to that staging environment. But I have that ability now to say yes or no. I can go back and say I want to scan that environment as well, and get that feedback loop to JIRA. That's what that looks like.
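Sentinel policies are written in HashiCorp's own policy language, which is not reproduced here. As a rough illustration of the yes/no decision described above, the following Python sketch evaluates a set of named checks against a build record. The `Build` fields and the policy names are assumptions made for the example, not anything taken from the team's actual policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    """Facts about one build that policies can inspect (hypothetical fields)."""
    artifact: str
    scan_passed: bool
    stig_applied: bool
    target: str


# Each policy is a named predicate over a Build. All names are illustrative.
POLICIES = {
    "scan_passed": lambda b: b.scan_passed,
    "stig_applied": lambda b: b.stig_applied,
    "known_target": lambda b: b.target in {"staging", "production"},
}


def evaluate(build, policies=POLICIES):
    """Run every policy and return the decision plus the names that failed."""
    failures = [name for name, check in policies.items() if not check(build)]
    return {"allow": not failures, "failures": failures}
```

The list of failed policy names is the kind of detail that could be pushed into a JIRA ticket so the team sees why a deployment was stopped.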
One of the bigger issues in this project is having a multi-environment. With that multi-environment, I'm not sure if anyone here plays with Azure Stack or AWS Outposts. Outposts is coming out, I think roughly around the September/October timeframe is what they were looking at. But as we go forward, we want to be prepared to play in those environments, and we will be. We will have that option, and we will have that pipeline already set out.
» Terraform Enterprise advantages
Terraform Enterprise: one advantage that we have seen is that we're able to hold state, where otherwise we can't hold state unless we're using vRealize. But then there are other issues that go along with that.
Control policies on the environment. We need to have the ability to audit. Now if we have those policies that are written out in code, we can see who was able to push the button on their environment and say, “Hey, we deploy to staging,” or, “We deploy it to production.”
The next one is see who actually did what and have that feedback in JIRA. Super important.
» Vault Enterprise advantages
We're also using Vault. We're using Vault Enterprise for a couple of things. One, we need PKI integration. The other one that we need it for is Terraform. We also need it for controlling logging into servers.
When we're talking about logging into servers—in some cases you may need someone to log into a server depending on if it's a monolithic application and it's been around forever—and that's what they've been used to. You want to get away from that. But at least we can have that controlled on Vault saying that you only have 30 minutes to log in and check and see what was going on and then log out. We can put that on Vault.
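The 30-minute window can be pictured as a lease with an expiry time. The sketch below is not Vault's API; it only models the idea in plain Python, and the names `issue_lease` and `lease_is_valid`, along with the user and host values, are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# The 30-minute window mentioned in the talk.
LOGIN_TTL = timedelta(minutes=30)


def issue_lease(user, host, now):
    """Record who may log in where, and when that permission runs out."""
    return {"user": user, "host": host, "expires_at": now + LOGIN_TTL}


def lease_is_valid(lease, now):
    """A lease is usable strictly before its expiry time."""
    return now < lease["expires_at"]


if __name__ == "__main__":
    start = datetime.now(timezone.utc)
    lease = issue_lease("operator", "legacy-app-01", start)
    print(lease_is_valid(lease, start + timedelta(minutes=10)))
```

The point of the model is that access ends on its own: nobody has to remember to revoke it after the person has checked what was going on.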
The other thing we do here is tying that back into Kerberos or LDAP, or even, in this case, Windows, which still uses Kerberos underneath.
» Problems and collaboration
I had the joy of working with Terraform Enterprise on-prem and working with Amanda, Ginny and Todd. I’ll give you my issues that I had—and how we came around and got this to work—and we're still continuing this process too.
So, what were the issues that I had? I had to set up an environment, and I started Terraform Enterprise at least 40 times. Why? Because the first 39 times it didn't work.
And there's no shade to anyone here at Hashi, but it was just something that we had to walk through and figure out. Some of the things that we had to figure out were: what Docker version were we using? Were we using the Red Hat native Docker version, or were we using a Docker version from the community edition or the enterprise edition?
What OS? Is SELinux turned on? If SELinux is turned on, what do we need to have set up in SELinux? Object stores. Were we using the object store or were we using the mounted directory for Terraform Enterprise? When you're doing the install, is my Vault set up correctly for my external Vault or do I need to use the internal Vault when I'm doing this?
We're also in an air-gapped situation. So once we got finished, the providers that we needed had to already be wrapped up in another Docker container, or wrapped up in an install TAR file.
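Wrapping files up for transfer into an air-gapped network usually goes together with a way to check that nothing changed on the way in. The following is a minimal sketch using only the Python standard library, assuming the provider files already sit in a local directory. The function names and the directory layout are hypothetical, and this is not the Terraform Enterprise air-gap bundle format.

```python
import hashlib
import tarfile
from pathlib import Path


def bundle_providers(src_dir, out_tar):
    """Pack every file under src_dir into a gzip TAR and return a
    manifest mapping each relative path to its SHA-256 digest.
    out_tar should live outside src_dir so it is not packed into itself."""
    src = Path(src_dir)
    manifest = {}
    with tarfile.open(out_tar, "w:gz") as tar:
        for path in sorted(p for p in src.rglob("*") if p.is_file()):
            rel = path.relative_to(src).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
            tar.add(path, arcname=rel)
    return manifest


def verify_bundle(tar_path, manifest):
    """On the air-gapped side, confirm the archive matches the manifest."""
    seen = set()
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            if manifest.get(member.name) != hashlib.sha256(data).hexdigest():
                return False
            seen.add(member.name)
    # Every file listed in the manifest must also be present in the archive.
    return seen == set(manifest)
```

The manifest would travel alongside the archive, so the install step inside the air gap can refuse a bundle that is incomplete or altered.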
And then the next big phase was, how do we automate everything that I talked about and install Terraform from checking the code in to GitLab all the way to a deployment. So what does that look like?
This is my matrix that I set up—and I'll tell you about it. It's three different versions of Terraform that I went through. Four different versions of Red Hat that I tested this on.
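Written out as code, a matrix like this is just a cross product of the two lists. The version labels below are placeholders, since the talk does not name the exact versions, and the `result` field is an assumption about how one might track each attempt.

```python
from itertools import product

# Placeholder labels: three Terraform versions and four Red Hat versions.
TERRAFORM_VERSIONS = ["terraform-a", "terraform-b", "terraform-c"]
RED_HAT_VERSIONS = ["rhel-a", "rhel-b", "rhel-c", "rhel-d"]


def build_matrix(terraform_versions, os_versions):
    """Return one row per (Terraform, OS) combination, all initially untested."""
    return [
        {"terraform": tf, "os": os_name, "result": "untested"}
        for tf, os_name in product(terraform_versions, os_versions)
    ]


if __name__ == "__main__":
    for row in build_matrix(TERRAFORM_VERSIONS, RED_HAT_VERSIONS):
        print(row)
```

Three versions against four gives twelve combinations before other variables, such as the Docker flavor or the SELinux setting, are added as further dimensions.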
I'm setting up iptables in terms of what particular ports needed to be open. When you install Docker, Docker writes to iptables. What particular ports need to be open on that? What bridges were open?
Vault. When I talk about an external Vault, I mean setting up your external Vault and then connecting that back to Terraform. Was that configured correctly as far as the AppRole? Also the internal Vault as well.
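One quick way to answer "which ports are actually open" from another machine is a plain TCP connect test. This sketch uses only the standard library. The host and the port list in the example are placeholders, not the ports any particular product requires.

```python
import socket


def check_ports(host, ports, timeout=1.0):
    """Try a TCP connection to each port and report True if it was accepted."""
    results = {}
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise.
            results[port] = sock.connect_ex((host, port)) == 0
    return results


if __name__ == "__main__":
    # Placeholder host and ports for illustration only.
    print(check_ports("127.0.0.1", [22, 443]))
```

A check like this only shows reachability from where it runs. A firewall rule can still block the same port from a different network segment, so it complements reading the iptables rules rather than replacing it.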
The next part of this was: did I set up my database correctly? Because there is a database involved with this. Was that set up correctly when I was doing my install? And did I have permissions to read and write? And last but not least, disk speeds. A key factor in this was finding out, when we were setting this up, that slow disk speeds slowed down the install of Terraform and made it fail. So we went through and tested the disk speeds for that.
The summary of those issues: after communicating back and forth with HashiCorp, we got it stood up. We also did a lot of collaboration back and forth on the issues that we had, and some of those issues were addressed in Terraform Enterprise. So I'll take the yay on helping someone else on their next deployment.
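A rough disk check can be done before an install by timing a synced write. The sketch below is an assumption about how such a test might look, not the method the team used. The function names and the idea of comparing against a minimum are hypothetical, and no official threshold is implied.

```python
import os
import tempfile
import time


def measure_write_mbps(directory, size_mb=16):
    """Write size_mb of data into directory, force it to disk,
    and return the observed throughput in MB/s."""
    chunk = b"\0" * (1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(size_mb):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # include the time to reach the disk
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)  # always clean up the scratch file
    return size_mb / elapsed


def disk_is_fast_enough(mbps, minimum_mbps):
    """Compare a measurement against a minimum chosen by the operator."""
    return mbps >= minimum_mbps
```

Running `measure_write_mbps` against the directory the installer will use gives a number to record in the test matrix. A dedicated benchmarking tool would give a fuller picture, since a single sequential write says little about random I/O.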
And the other one is, it was a shared environment that we set up so that we can all test this back and forth and share this with the actual team. We can go forward and say, “Okay, yes, we installed Terraform Enterprise, it does work in this, we can use this, and automated installs.” That was the team collaboration on that with Hashi and Expansia.
The next one is the team and training, which are some of our issues. As you saw on the actual slide with the image policy, you saw a lot of tools.
We can all agree that when you have a lot of tools that's a lot of learning and that's a large learning curve. So how do we get around that learning curve? How do we take care of each other as we do this?
Group learning. With us as a group—and with any group—I strongly recommend that you learn who you are as a person in terms of learning. Because that really is going to drive how you learn. If you are a person that likes to see videos, or if you're a person that likes to sit down in groups, or whatever—that's going to drive your team. And that's very important.
The other one is emotional intelligence. That was super huge with us in terms of the successes and failures. I'm going to say this—and you could take this with a grain of salt—but don't beat yourself up or don't have fear of failure. You're going to fail. I installed 39 times and failed. I was relentless. Just to give you an idea.
The other one is having support with your team. We had to have team buy-in. We've got to have management buy-in. You have to sit back and have leadership—allowing people to go out and understand how Vault works, how Vault integrates with this, how Consul works, how it all integrates with each other—how does GitLab work? All these different tools—we're up to I think 10 tools right now, and it potentially could be more tools. So having that understanding of how each tool works is super important.
» Daily team building growth
The next one was daily team building and growth. This is key. Having the ability to say, “Hey, we did this with Terraform,” is super huge in communicating, and so is putting in JIRA tickets exactly what you did: screenshots, the whole nine.
Remember we're in an air-gapped environment. That's a lot of writing and translating that out into a JIRA ticket. Not super fun, but you actually need it. And that's how you build around those errors and build around that team.
» Shifting metrics left
Next, what we wanted to do was be able to shift our metrics left. When I say that (Paul talked about this), the question is how do we scale, right? When we're talking about scaling, we're talking about having a lot of teams, and we have a lot of teams that we're working with.
So we have to be able to get feedback from Terraform immediately when something fails. We also need to know, when a package comes in, whether it passed or failed. We need that immediate feedback.
But on top of all of that, when we do deploy and we do have applications out there, we need to make sure those applications work. So things like Consul, observability, and micro-segmentation really play a key role because we do have multiple teams.
One of the teams that we have coming into the fold wanted to be able to test 10,000 concurrent connections on a Kubernetes cluster and make sure that they're getting those metrics back. Well, it goes back to the system architecture that I talked about earlier: it comes down to better architecture design.
How do you design for that upfront? What does that look like? What are you currently doing? How can we change those things? These are the things that we're going to use as we go forward. And also for Consul and observability to give us metrics on what's going on. Being able to trace where we deployed—but now the whole system is shutting down on us. How can we trace that back?
This is where we start introducing that. And remember we're on a Kubernetes cluster, so we need to have some namespaces set up where we're dividing up different projects and so forth.
» Future goals
These are the things that I think are future goals for us. My whole team may or may not agree, but this is what I'm seeing. A multi-cloud strategy—and this is how Hashi comes into play with this—there is an actual article out there from the government.
I don't know if everybody has heard of the JEDI project. The JEDI project in the States is a $10 billion project. And that $10 billion project was bid on by Amazon and the other cloud providers. And I think who's going to get it is going to come out in August or so; this is what they're saying in the paper.
But with that project, there was one sole winner for that for 10 billion. The next one is we're going to do CTE. They're doing a CTE project and that's going to be a couple of billion. I think Hashi looks really good for that.
The next one is building cloud-native infrastructure. The main one here that I'm super excited about is KubeVirt. What KubeVirt allows you to do is take a monolithic application and deploy it onto a Kubernetes cluster, and that is super huge. It means being able to get off of something that's on a bare-metal machine, move it over to a VM, and move it over to a cluster.
What you see here are things that are on the horizon. This is how I see Hashi flowing with us from Consul, Vault and Terraform. And then as we continue to grow along, I'm sure there'll be other things that we want to implement. So together, this is what I see. Hashi, Air Force, and Expansia.
I'm Tameika Reed. Hope you enjoyed. You can find me on Women in Linux—Instagram, LinkedIn and so forth.