Adopting Shared Services: Terraform and Vault at the US Dept of Defense
Oct 07, 2019
Imagine migrating 3,000 applications to the cloud (30% of them written in COBOL) for the US Department of Defense. And multi-cloud is a requirement. It turns out, HashiCorp Terraform and Vault are great tools for this use case.
Due to the unique mission requirements inside of the US Department of Defense, multi-cloud architecture is not only a need but a requirement to meet edge computing complexities. Designing end-to-end infrastructure deployments that span multiple cloud providers, various on-premises datacenters, and numerous commercial edge computing resources is no small task. This infrastructure needs to be secure, scalable, and most importantly, managed as code.
Cloud Service Provider Lead, Department of Defense
Good morning, everyone. It's an honor and a privilege to get to speak in front of everyone here today. This is my first HashiConf. I've been a HashiCorp user for about 11 months now. So it is an awesome opportunity to get to stand in front of the mass here and explain what we've done the last year and some of the things that we've learned.
I am not amazing at presenting like Herman Leybovich of Klas Telecom Government is (see his presentation Microservices for Military Applications). But we’ll travel through this together and make it work.
» Adopting shared services
I want to paint a picture. Think back to the environment that you currently work in. Suppose that tomorrow you started a new job and walked into an organization with, say, 15,000 users and about 3,000 applications. Out of those 3,000 applications, let's say the core framework for 30% of those is COBOL. We're picking languages that are older than me. So keep that in mind.
You meet your CTO, you meet a lot of the teams, and they say, "We want to go to the cloud. We've heard some things that the cloud can offer us. We want to cut some costs and we want to be able to rapidly evolve. Make it happen." How do you do that? Doing that in the public sector is a hard problem to solve, when the public sector is anywhere from 7 to 10 years behind most traditional companies today.
Today I'm going to be talking about how we went about accomplishing that, how we went about adopting a shared-services architecture to help us onboard new ideas, onboard new service teams, and change our enterprise and how we operate.
» Cloud adoption with limited resources
How do you support enterprise adoption of cloud without enterprise support? You work in an enterprise that supports 15,000 to 20,000 users, but you've got 2 guys on your team that are supposed to manage the entire cloud adoption. How do you do that?
Most people would probably exit and find a new job. But how do you accomplish that? How much work do you put in that can be replicated? The question that we formulated was, "How do you do DevOps for enterprises that don't do DevOps?"
How do businesses and enterprises adopt the HashiCorp suite of tools if they've never even heard of DevOps? That's how we went about trying to solve our problems.
When we sat down with industry partners, we said, "These are the problems we're having. These are the areas that we want to concentrate in. What's the best way to go about this? We have minimal skillsets. We have a very quick time for adoption that we need to execute in, but we're brand new to this space. How do we go about accomplishing that?"
We looked at how industry practices at scale and came up with shared services. Why did my enterprise need shared services? It comes down to 3 points:
Size and skills of the workforce
As I was coming into the position, we had no cloud experience whatsoever. I was handed the job coming from a datacenter and networking background and was expected to integrate all of the great things that you get with cloud with no skilled workforce behind that. And we needed it up and going quickly.
» The challenges of the public sector
The public sector faces a different challenge because we not only operate in your normal cloud providers, like AWS, Azure, and GCP in the commercial regions, but we also have other regions inside of that that are what we would consider air-gapped. They're disconnected from the internet, so you have a different set of problems that you have to address.
That means work that I do 1 time gets replicated 4 times. When I have an environment that is working in commercial Amazon or commercial Azure, I need clear, concise workloads that work in commercial AWS and commercial Azure, but that I can also put in the secret regions of AWS.
With a small workforce of fewer than 6 people, how can we write code, how can we deploy applications, and how can we support onboarding new users in the most efficient way?
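As a rough illustration of "write once, deploy to several environments," here is a minimal Python sketch. It is not the team's tooling. The target names, the `air_gapped` flag, and the `artifact_source` values are all hypothetical, and the talk does not enumerate the actual targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One deployment environment (hypothetical fields)."""
    name: str
    air_gapped: bool


# Hypothetical targets, for illustration only.
TARGETS = [
    Target("aws-commercial", air_gapped=False),
    Target("azure-commercial", air_gapped=False),
    Target("aws-airgapped", air_gapped=True),
]


def render(workload: dict, target: Target) -> dict:
    """Return a per-target copy of a single workload definition."""
    config = dict(workload)  # shallow copy so the shared definition is untouched
    config["target"] = target.name
    # Air-gapped regions cannot reach the internet, so artifacts must come
    # from an internal mirror (assumed name) instead of a public registry.
    config["artifact_source"] = (
        "internal-mirror" if target.air_gapped else "public-registry"
    )
    return config


def render_all(workload: dict) -> list:
    """Expand one workload definition into one config per target."""
    return [render(workload, target) for target in TARGETS]
```

The point of the sketch is that the workload is described once and the per-environment differences live in a single place, so a small team does not maintain separate copies by hand.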
The shared-services model that we adopted matured over this last year.
» Why shared services
I think everyone at some point has heard of shared services. It's a common core set of technologies offered by IT departments that get used by multiple teams and commonly executed to save money. You're not replicating the same thing over and over again.
This is how most non-agile IT departments work. And I think you can find this even today across a lot of businesses; they still execute on this model where a service team or an end user needs to be able to launch a new application or they need some type of service exception.
So they put in a ticket. That ticket then will get routed through a chain. For us, it would go to the cyber department instantly, and if the cyber team said, "This is a valid request. We're OK with it," it'd go to operations.
From operations it would go straight into, "What team is handling this? Is this going to be our systems team, our networking team?" And from there we would boot up a VM like anyone else would. We'd put an application on it. It would have to have firewall exceptions. At no point in this do you even see development testing.
This is how a lot of the public sector has operated for the last 5 to 7 years in doing software deployment and infrastructure deployment. There is no development and testing environment for that. How do we change that? How does the HashiCorp suite of tools enable me as a cloud service provider to have a core set of services that support adoption, comply with governance, and match the skillset that we have?
The shared-services concept starts with some type of source control management. The core of most shared services fits in one of these brackets: source control, provisioning, security, networking, and runtime.
The shared-services concept is usually managed by larger teams. You have your networking team, you have your cyber team, your platform teams, a lot of different teams. In our case, we didn't have any of these. We had about 3 people that were familiar with cloud concepts and with writing code and with the overall ideology on how you institute this.
» Transforming with the Hashi stack
If you take this and transpose over it the HashiCorp suite of tools, you get the common tools that all of us know and love. It starts with source control, then moves up to provisioning with Terraform and Packer, security and secrets management with Vault, networking with Consul, and runtime environments with Nomad and Kubernetes.
This is an architecture that most everyone in this room understands. But out of a group of 1,000 administrators and developers that we worked with, there was maybe a handful of people that had heard of these tools. It sets a precedent.
How do you take an enterprise that's never heard of DevOps, and give them the HashiCorp suite of tools that are at the very forefront of how we do business in the DevOps industry? How do you do that, and how do you integrate that in? How do we begin to institute the software where we can get very quick iterations but also enable our service teams to be able to expand and grow?
» Focusing on Terraform and Vault
Today I'm going to be talking about the shared services in depth around 2 core HashiCorp products: Terraform Enterprise and Vault Enterprise.
This same strategy can be applied with the open-source versions of these tools. But we are looking at it to service a group of 15,000 to 20,000 users. Open source wasn't an option at scale with the workforce that we had.
We didn't have the developers that could bolt on some scaffolding. Out of the gate we needed something that would help us adopt quickly, would give us the governance that the public sector requires, and also shorten the skills gap that was needed for our team to execute on this.
When I was told, "Let's go to the cloud," there's very little context behind that.
We had brought in a new CTO, we had restructured our leadership a little bit to be able to inherently move faster in software development, but we weren't prepared for the things that come along with developing in cloud infrastructures that used native cloud tooling.
When we went to institute this architecture, we were net-new in the cloud. Our networking team, our cloud governance team, our CI/CD team, and our platform team all started on the same exact day.
Everyone here who has been in the industry for a little bit would say, "You don't do business that way." And it's hard to set in principles and technologies. So where do we make up time? Where do we get that efficiency?
A lot of industry partners told us, "Anyplace that you can have efficiencies, that you can use multiple times, that should be the first thing that you go after." How we adopted shared services and how we went about that with HashiCorp’s suite of tools really started around identifying the bare-bones services that we needed to be able to adopt our movement into the cloud.
We needed to identify services, all the things that were happening in a parallel effort. What did we need to do as a service provider to enact and support service teams that, at the end of the day, support the true end user? We had to identify those core services, and everything stemmed from Terraform. That went beyond Terraforming an Amazon EC2 box to, "How do we go about legitimately standing up accounts?"
This is a net-new environment. We adopted the AWS landing zone concept, where you have multiple accounts for multiple services. We go from 1 account on a Monday to 1,000 accounts a month later, trying to adopt a mentality that grows. How do you do that with the least amount of work?
With Terraform, we saw very early the value of having a structured process for that service. We needed to understand what the service was going to provide internally to us as a team, and then we could take that exact service and turn it around and focus it for our users.
A lot of what you will see follows the same HashiCorp mentality: The tools were designed to answer a problem internally, and then they were adopted and supported by the community after that. And that was the same approach that we took out of the gate with Terraform.
We needed Terraform to build the constructs of a cloud environment. Before we could bring in any service team, any developer, any application, we needed to be able to put good governance in place. We needed to have IAM policies, all the things for a secure cloud environment.
» Choosing Terraform Enterprise
What's different about enacting Terraform Enterprise in a shared-services concept compared to Terraform open source? Why did we pick Terraform Enterprise over Terraform open source?
It came down to 2 major areas. I talked about governance as 1 of the 3 things that I needed, and with HashiCorp’s suite of tools, the enterprise version of Terraform includes Sentinel, which is the easiest way for us to deploy infrastructure, secure infrastructure, while putting governance on top of that.
The time that I invest in a tool pays off in multiple areas. Most teams will openly admit that cybersecurity is usually the hardest team to work with. You never walk into a meeting with cybersecurity and have them say yes to everything. It's usually hard no's, and you have to prove why this works. It's especially that way in the public sector.
How do we go about allowing developers and service teams to deploy infrastructure and do all this in a commercial cloud that's not protected? How do I sell that to my cybersecurity team? And that started with Sentinel.
Sentinel was our No. 1 choice on why we were going with the enterprise suite of tools out of the gate. It allows us to put governance in place and execute in environments with very small teams.
We needed our workforce to be able to execute in a cloud environment where no one knows Terraform. Now, we could have developers learn Terraform on the side, but it's a little scary bringing new teams in, giving them Terraform, and they don't know how to use Terraform, and then setting them free in the cloud. That's how you go to $1 million really quick in your spend budget.
How do we enact good governance, good security controls, and good budget practice with that?
The second part of us picking Terraform Enterprise was the need to tie it into core services. I needed to have the ability to tie into my ticketing system like ServiceNow. I needed the ability to have single sign-on and use current IAM (identity and access management) providers that we have and really have that core integration without a ton of development from my team.
The second part of that is private modules. If my entire workforce doesn't know the difference in AWS between a t2.micro and an m5.8xlarge, and they deploy 11 of them, it becomes a problem.
With private Terraform [modules](https://learn.hashicorp.com/terraform/getting-started/modules.html "Introduction to Terraform Modules"), we were given the ability to say, "We can take the cloud experts that we have, the people that know Terraform, and we can provide a one-stop shop that's self-service. You don't have to know Terraform to use what Terraform can do."
When you're in the public sector, trying to bring new technologies, a lot of times, is very hard. How do you bring that to your workforce, enable them, and tell them, "You don't have to learn anything new. You need to put the name of the application, you need to put what team you're on, and you need to say whether this is going to be a production app or not"?
That allowed us to generate Terraform modules that were purpose-driven, that could be written by people that understood how to use Terraform and understood how to deploy inside of the cloud.
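The three-question self-service interface described above can be sketched in Python. This is only an illustration of the idea, not the team's private Terraform modules. The names `AppRequest` and `render_module_inputs`, the instance sizes, and the tag keys are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sizes picked once by the people who know the cloud,
# keyed by whether the request is for production.
VETTED_SIZES = {True: "m5.large", False: "t2.micro"}


@dataclass(frozen=True)
class AppRequest:
    """The only three things a requesting team has to supply."""
    app_name: str
    team: str
    production: bool


def render_module_inputs(request: AppRequest) -> dict:
    """Expand a three-field request into a full, pre-approved input set."""
    if not request.app_name.strip() or not request.team.strip():
        raise ValueError("app_name and team are required")
    environment = "prod" if request.production else "dev"
    return {
        "name": f"{request.team}-{request.app_name}-{environment}".lower(),
        # The requester never chooses an instance size directly.
        "instance_type": VETTED_SIZES[request.production],
        "tags": {
            "team": request.team,
            "app": request.app_name,
            "environment": environment,
        },
    }
```

For example, `render_module_inputs(AppRequest("Payroll", "Finance", False))` yields a development-sized configuration named `finance-payroll-dev`. Sizing, naming, and tagging decisions stay with the module authors.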
That goes to Point 3 on this slide, "Prioritize modules for internal teams."
Our cloud service provider team sat down and said, "We're getting ready to deploy Terraform, Vault, and all these things in a production environment. What is the No. 1 thing that's going to replicate every time? Are we going to use EC2 the most? Are we putting this in EKS or AKS? What is our most commonly used module? Let's stand that up. Let's figure out what we're going to gain from the beginning, where we can gain efficiency that other teams will be able to use." And we developed that internally.
To deploy Terraform Enterprise in a cloud environment that can't always use a SaaS offering, we had to go with private Terraform Enterprise. We had to deploy and manage Terraform Enterprise not only from a commercial perspective, but in other security domains. We had to Terraform private Terraform.
Then we looked back again on, "What modules can we use that are available outside and that we can tailor to what we need to do, write it one time, and figure out where it fits in our module? And then we can reuse it multiple times."
And the fourth point on the slide is "Executing and publishing." We wanted to work out a lot of the kinks internally within our team before we offered this service to any other service team.
The whole point of shared services in our environment is, "How do we provide services and tools that comply with our policy and our vision, to teams that don't understand this environment?"
» The role of Vault
The tool that I haven't talked much about yet is Vault. Vault has been pivotal in how we've deployed some of our architecture, but it's also been a point on how we've looked at our current infrastructure.
A lot of things I've talked about are net-new deployments in the cloud, but I haven't talked about my current everyday workload that's running in an on-prem datacenter in VMware that is not automated. My organization ran an application discovery, and we looked and said, "How many apps today could even be containerized?" Out of 5,000 apps, about 25 of them could go to containers today.
That's a very large spread in applications. Then we looked at the same thing, "How much infrastructure do we have now that we are currently using that isn't being utilized correctly?" And that drove how we continue to adopt the tool.
When we brought in Vault, we asked ourselves, "What do we have in our current infrastructure that's Vault-like, that manages secrets, that takes care of SSH keys, that gives us the ability to generate dynamic at-request tokens?"
And we discovered that there wasn't an application that existed for that, but that also meant that wasn't required in our on-premises environment. So how did we stand up a model that allowed us to provide this service and also explain how we were going to use this and integrate it in new applications?
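To make "dynamic at-request tokens" concrete, here is a minimal Python sketch of the underlying idea: a credential is generated when it is requested and stops working after a short time-to-live. It is not Vault and does not use Vault's API. `TokenIssuer` and its methods are hypothetical names for illustration.

```python
import secrets
import time


class TokenIssuer:
    """Issues random tokens that expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._expiry = {}    # token -> clock value at which it expires

    def issue(self):
        """Create a fresh token at request time and record its expiry."""
        token = secrets.token_urlsafe(32)
        self._expiry[token] = self._clock() + self._ttl
        return token

    def is_valid(self, token):
        """Return True only for a known token that has not yet expired."""
        expires_at = self._expiry.get(token)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._expiry[token]  # forget expired tokens
            return False
        return True
```

Because nothing long-lived is handed out, a leaked token is only useful until its TTL runs out, which is the property that static, shared credentials lack.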
With our shared-services adoption, everything that we started out with was net-new. It was deployed in AWS, Azure, or GCP, and this gave us the ability, after we started standing up core services, to really control net-new projects.
It could go through a proper production environment. It could go through development. We could test code out. We could deploy with our 2 strong tools, Terraform Enterprise and Vault Enterprise, with net-new applications that weren't affecting users. We didn't have to migrate any services over.
These were all net-new additions to our enterprise. If that were always the case, it would be an easy transition. But that's not how most public-sector organizations, or most enterprises that have not even heard of DevOps, are going to operate.
We had to look at, How do we bring in the current workloads that we have? What do we focus on after we get a shared-services model that stands up and provides infrastructure provisioning at request from a developer or a service team, with secrets management?
Once we stand that up, how do we provide that internally to our teams that are starting to hear about cloud and want to incorporate their application to their service, but it might not be ready to move to the cloud yet? How do we offer that?
The service teams—networking, cyber, CI/CD, platform—provide the core infrastructure, but at the end of the day it's only unidirectional. They're only looking at the cloud. They're not having to fix problems behind them.
» Bringing teams onboard
When you change the model and you go to where you're supporting service teams that have applications that are deployed on-prem and also net-new applications that are getting deployed inside of the cloud, how do you service that? What is the best way to go about that?
And that was our hardest problem to solve, and we're still working on that. How do we take Terraform, where a lot of the community is built around AWS, Azure, and GCP, and strengthen that around VMware?
Or how do we bring teams in and say, "We want you to adopt the newer model. We want to bring you into a production pipeline that works for you"? How do we take you from an environment where you left-click and build a VM and you hope it works with the service, to an environment where you are deploying through provisioned infrastructure that is secure, that's been tested, that you know your application or your service is going to execute as advertised?
One of the large realizations that we've come to is that having a stack that you understand inside and out is critical.
So Terraform and Vault: When we looked internally at VMware and how we were going to run our applications and bring those in, the No. 1 thing we did was we went back to, "What are the core services that are running right now that teams currently do on a day-to-day basis?"
If we can change their lives and their workflows on a day-to-day basis, we will gain advocates for our model, and we'll adopt more people in. I can apply more governance to my entire enterprise if I get more people on our shared-services platform.
That has been a key point this year: How do we do that? How does any organization go about that? And our first answer is: Get wins where you can get them. If you have a workforce where the average age of your developers and your system admins is around 58 to 62, you figure out what makes their lives easier so they can watch ESPN, and then you backtrack from there.
We really had been stuck around the infrastructure provisioning point, because that is the easiest way to be relevant to the public sector. In the public sector, a lot of the issues that you see aren't up the stack or at the application level or in the runtime environment. Those are issues, but a lot of the root-cause issues, they're at the infrastructure level.
They can't provision, they can't test, they can't stand up a production environment and test because their development environment looks nothing like their production environment. So how do we replicate that on an internal, on-premises, private cloud and give developers and service teams that opportunity?
It always comes back to: We will develop it internal to the stack, deploy it in the cloud, give them a chance to figure out their application and how that application works, but we don't require them to really know Vault or Terraform.
They're figuring out the upper-level problems with their applications now. We try to abstract as much detail as we can from them and give them the ability to really get into the environment, figure out what their applications do or how that service team provides service to end users.
Then, as they grow and understand that environment, we start peeling back the curtains, letting them design custom modules. We use the mentality of everything has got training wheels. If you haven't been deploying in the environment, if this is your first time, we're not letting your training wheels off. You're going to show us that you can work in this environment and you know what you're doing before we're willing to let you publish custom modules and execute at that.
From a pure governance perspective, that helps our team out tremendously.
One of the strong points is being able to leverage the Hashi stack. The more that you use enterprise service tooling, the more it allows you to ingrain Sentinel policies that are not written by my team. I get to bring cyber into the equation very early and drive policy as code for a lot of the services that we're running.
I can sit a service team down with someone on my team that knows Terraform, that knows Vault, and I can sit them down with cybersecurity, and they come up with a policy that's not written on a piece of paper and published on a wall that someone just walks by. It's an actual policy that is followed and executed.
I can ensure that adoption of the cloud and adoption of these toolsets on-premises is done correctly and safely, but at the end of the day provides the service teams and the users the environment that they're hoping for.
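The difference between a policy on a wall and a policy that is executed can be sketched in Python. This is not Sentinel syntax, and the plan format is a simplified list of dictionaries, not Terraform's real plan output. The allow list and instance limit are hypothetical values that a service team and cybersecurity might agree on.

```python
# Hypothetical limits agreed between the platform team and cybersecurity.
ALLOWED_INSTANCE_TYPES = {"t2.micro", "t3.small", "m5.large"}
MAX_INSTANCES_PER_PLAN = 5


def check_plan(resources):
    """Return a list of violation messages; an empty list means the plan passes."""
    violations = []
    instances = [r for r in resources if r.get("type") == "aws_instance"]
    for resource in instances:
        instance_type = resource.get("instance_type")
        if instance_type not in ALLOWED_INSTANCE_TYPES:
            name = resource.get("name", "<unnamed>")
            violations.append(
                f"{name}: instance type {instance_type!r} is not approved"
            )
    if len(instances) > MAX_INSTANCES_PER_PLAN:
        violations.append(
            f"plan creates {len(instances)} instances; "
            f"limit is {MAX_INSTANCES_PER_PLAN}"
        )
    return violations
```

A deployment pipeline would refuse to apply any plan for which `check_plan` returns violations. A request for 11 oversized instances is stopped before it reaches the cloud bill, and the rule lives in version control where cyber can review it.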
» Spreading the self-service mentality
So how do you take this shared-services adoption and this model to an organization that doesn't know DevOps? The answer is the model itself.
You take the small workforce that you have that knows Terraform, that knows Vault, that understands cloud architecture, and you start from a very solid base and then say, "We're going to take our common workloads, our common environments, our common daily executed tasks, and build that internally for our team. We start integrating that into our entire enterprise, and slowly we feed that to customers. We build the advocates, and we also meet our goals of transitioning to a cloud environment. But at the same time we grow with scale and are able to have users that understand what is going on."
It's a self-service mentality, which is really hard for a public-sector environment that is using ITIL fundamentals. With shared services you really get the option. Once you start integrating your services, you tie these services in with ServiceNow or any other big provider that you have inside of your IT department, you get to slowly add those in.
Whenever you bring in a new team, figure out where they fit in your stack. What can you Terraform for that team? What can you open up in Vault to allow that team to offer a service to your greater enterprise in a self-service fashion?
If we can give teams, service teams and service providers, end users and developers, a self-service, safe, secure environment, we make our jobs easier, and at the same time we have high customer satisfaction.
Thank you. It's been an absolute pleasure to get to speak to everyone today.