Case Study

A multi-tenant, cloud-agnostic platform for the federal government

Blackstone Federal is a government consulting firm working on major cloud initiatives. The flexibility of the HashiCorp stack has delivered significant productivity gains in these IT environments.

Imagine deploying a greenfield, multi-tenant, CSP-agnostic cloud platform using HashiCorp and related products. The purpose of this project is to provide a platform for a strategic federal government agency in the security sector, accelerating its mandate to move components to the cloud.

With a small team of engineers and supporting roles, such a platform was built out, and the results were impressive. This included a comprehensive collection of DevSecOps tools used to manage infrastructure and deployments, including HashiCorp's Packer, Vault, Terraform, and Consul. These services were also made available to the tenant applications, greatly reducing the time and money each tenant would otherwise spend deploying yet another variation of the stack required to operate modernized applications in the cloud.

Infrastructure provisioning is end to end, starting with secure networks in a transit/spoke configuration, including fully configured Cisco routers.

In this talk, Mike Augustine, senior architect at Blackstone Federal, discusses the technical details of this hypothetical platform, starting with the architecture and highlighting the key aspects that make it all come together. And of course, he steps into example code just enough to show the organization and design principles and to showcase how the HashiCorp products can be used from start to finish.

Transcript

Provisioning a Multi-Tenant, CSP-Agnostic Cloud Platform for the Federal Government—that is the title. We don't have time to read it all, but we'll get into it, and I promise we'll break it down as we go along.

All right, a little about me. Mike Augustine: I'm what I call a functioning architect contracting with the federal government. I say "functioning" because I not only draw, write, and talk, but I also code. You may also know me by my conference name, Big Head. In case you thought you already knew me, I wanted to point that out.

All right. Blackstone Federal—just a minute on that—is an awesome DC contracting company. When I came in, they promised we were going to do innovative things for the federal government. I was a little skeptical, but so far, it has actually come true. And we're always hiring.

All right. So we have to go into some disclaimers, because we are talking about the federal government and some of the agencies and things that I've done work with. I'm not actually going to be able to demonstrate or show any artifacts from those projects. I'm not going to be showing you code or anything we've done under contract for that. And I am in no way representing the federal government. If we were to do that, there'd be a lot more red tape and paperwork to go through, and we didn't want to do all that. I'll still give you an idea of what it's like to be there.

Along those lines, there's also an acronym alert, because I did say the federal government, and everything you do there involves acronyms. Combine that with the technologies we're all dealing with, and it becomes quite a job to figure out what's going on. I won't necessarily spell out the acronyms; you'll have to do what I do and Google them later and try to keep up.

I will let you in on CSP. One of the things I realized when I presented my topic was that some people didn't know what that meant; apparently, it's more of a federal government acronym. It stands for cloud service provider, and we're talking about your AWS, your Azure, things like that. Some of you may know that one.

Okay, so we'll get into it. We're going to walk through a hypothetical project: the steps of provisioning a cloud platform for the federal government. We'll go through some of the requirements from the federal government side, get a feel for what it's like, and then dive into what the architectural designs and artifacts could look like. We'll show some pictures and diagrams and get into at least some code snippets.

All right, so here's our project: Right off the bat, it's greenfield; they're not there yet in the cloud. It's a nice thing from a DevOps engineer perspective, because you're actually going in and getting to draw on a blank slate. We will have progressive government thinking and support behind our innovation. You may be thinking I've just made that up for this hypothetical situation, but it's actually a possibility. I've seen it happen, and it is happening, at least in pockets, in the federal government.

Part of what we're doing gives us a meaningful mission, because we're now doing something for the people—for y'all—we're doing important things for the nation and even internationally. And it's a lot different from a lot of things that I've been involved in, and you have too, where the bottom line is more of the target.

While we're going to do this, we might as well have an awesome team, with all the supporting teams and structures that we need. We'll take a look at that. This could be our awesome team. This is an example of a team that I've been working with. It's not exactly the makeup of the team I'm with right now, but it will give you an idea of what's going on. One of the takeaways I want to share is that sometimes you may think you're the last Jedi, and then you realize there's a whole other generation coming up.

Requirements from the federal government perspective

The first item here is about supporting lift-and-shift: this is a reality. If they are going to get to the cloud, they're not necessarily going to refactor on the way there. That's the ideal world, and a lot of us would like to do that, but there's going to be a lot of just "get it in the cloud," because there are mandates and such, and then they'll have to come along later and refactor. Part of that is you won't see a lot of talk here about containers, because they're not doing a lot with containers now, and they're not going to do that as part of the entry to the cloud either. There are containers, and it will happen, but it's not all there yet.

Talking about tool-agnostic: what we're saying is that a lot of times there are preset requirements for tools they are already using, or must use because of various regulations and that sort of thing. So we can't just dictate our set of tools; we can't force them into one combination of things that we like.

And then there are the secure boundaries: in the multi-tenant situation, we're trying to bring multiple agencies and projects onto one platform. Even within the secret worlds, it's imperative that other people can't see their stuff or touch their stuff, so that's the #1 requirement for them. There's a lot of data center and government baggage coming along, and we have to deal with all that.

What about that long title?

I promised you we'd start to break that down. Let's go through that as part of our description of requirements and all.

Multi-tenant:

  • What we're saying is that we want a way to bring agencies, projects, and applications into their own environments in the cloud, and do it in a way that supports reuse and increases velocity, especially compared to what happens today. There'll be a lot of different types of applications: small, few-machine setups up to whole web farms of multi-tenant web hosting and that sort of thing.

  • Now, they're going to come with their own technical capabilities, because they're doing these projects. They own their own applications in the end, and they want to be a part of the infrastructure as code and a part of the provisioning; they just don't want to have to go through everything that we've had to go through to get there.

  • And then, of course, they'll have multiple environments: what you're used to, a dev, test, prod, and a lot of times, if they have the ability, they would like to have multiples of those, which today is kind of a luxury in a data center and paperwork-provisioning environment.

CSP agnostic:

  • We all know what CSP means now. The agnostic part gets into a lot of what HashiCorp has been talking about this week and has been positioning the company for: the ability to go to different cloud providers. Sometimes there will already be a preset cloud provider, because they have already contracted, or they have particular requirements, or there were very good salespeople, and so you have to deal with how they come in the door. A lot of times there will even be requirements, like we've talked about this week, to be positioned across multiple cloud providers for competitive-comparison reasons and for failover, disaster recovery, and such.

  • From our perspective, we still want to define the infrastructure specific to the provider. Following the HashiCorp way and how they've set things up, we don't want to try to do everything from one high-level abstraction that just defines "a machine"; we want to define an AWS instance when that's the case.

  • But we do want to avoid some of the tool and feature lock-in: if you take too much advantage of AWS-specific functionality, then when you go over to the Azure cloud, you're going to have to find a way to replace that with their mechanisms. There's a whole other debate over this; you may have been engaged in some of it. I'll be happy to have it at happy hour, but we can't make it part of this presentation.

Cloud platform:

  • A lot of people are doing this: A lot of the government agencies are trying to get to the cloud, they're trying to get out of their data centers, there's mandates, there's things going on, but they're doing the hard stuff over and over again. They're repeating the same mistakes, they're going through a lot that a lot of us have been through, and it really doesn't make sense. And this is even within agencies, multiple projects where they're doing it themselves. We want to provide a platform that's going to streamline all that, let them benefit from all the mistakes we've made, and then move on from that.

  • We have to provision everything end-to-end, everything has to be fairly complete.

  • And we have to make sure that security, practices, and those things are all reusable and inheritable, because that's also part of the whole benefit: it's so hard to get an approved system—and we'll talk a little more about that later—that they don't want to go through that each time for each application.

  • Of course, easy adoption is key, because, as I talked about with these starts and failures, if they can't do it and don't feel comfortable, they're not going to keep going.

The federal government

  • To start with, a lot of you are aware of GovCloud, AWS's version of a FedRAMP'ed cloud environment that is secure enough for government use. It's both a good thing and a bad thing. It's good because we couldn't be doing this if something like that didn't exist; Azure and others have similar environments. The bad part is that it takes a while for any new feature or new service to be approved and brought in, so these GovCloud environments are usually behind on the new things. And when you hear about something cool like Lambda, you're like, "Oh, we can't do that."

  • Process, paperwork, all those things you think of with the government: it all goes away in the cloud. Well, no, not really; it's still there. You have to deal with it. You can't get away from that; we're talking about the government. But there are ways you can deal with it that we'll talk a little bit about.

  • And then of course, there's integration points. You're going to have to connect into the government networks, you're going to have to use some of the facilities that are there, identity management, things like that. You can't be reinventing everything, you've got to play along with the existing infrastructure.

  • One of the things I also want you to take away, though, is what I've been saying: the federal government is being innovative. They are doing these things, there are projects and successes happening, and it's pretty cool to be a part of it.

I talked about the disclaimers and all, but I'm going to be a little risky, take a step, and take you into a government facility. We're going to look at an analogy for what I've been talking about: the starts and stops of different projects.

This is an example of a facility that I frequent. As you can see, there have been a lot of different attempts to modernize and bring things onto one common platform, one picture. They haven't gotten that far, some of them are incomplete, and it definitely doesn't look like they're going to be in one place anytime soon. And I would mention that that is the color green you thought they abandoned in the 1960s; it's still there.

Architecture requirements

  • This is a technology architecture project. Some of the things I want to mention start with these tagging strategies. Anyone who's been doing cloud work, which is most of you, would be aware that you need to get things like that down early. You're going to regret it if you have to go back and try to retrofit that later. You need to be able to identify all the resources very quickly, especially in a multi-tenant environment, and be able to do that systematically.

  • Security, from the beginning. This is something you want to get set before you even build your first instance. Make sure you're starting secure; make sure security's in the philosophy. You're going to have to do it at some point, and you don't want to go back later and try to make security work.

  • We mentioned a couple of these other things, but on owning your own destiny: part of what I'm talking about is where you can do things within your enclave, instead of having to go back to the government paperwork side of things. An example would be certificate management: if you could be running Vault with an intermediate CA that's trusted back to the federal government, then you can do certificate issuance and replace certificates in a reasonable timeframe, instead of going back to the paperwork and forms every time just to get one more certificate done. (A sketch of that setup follows below.)
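
To make that concrete, here is a minimal sketch of standing up such an intermediate CA with the Terraform Vault provider. The mount path, common name, and file name are hypothetical, and the signing of the CSR by the agency's root CA is assumed to happen out of band.

```hcl
# A minimal sketch, assuming the Terraform Vault provider; paths and names are hypothetical.
resource "vault_mount" "pki_int" {
  path                  = "pki_int"
  type                  = "pki"
  max_lease_ttl_seconds = 157680000 # 5 years
}

# Generate a CSR inside Vault; the agency's root CA signs it out of band.
resource "vault_pki_secret_backend_intermediate_cert_request" "csr" {
  backend     = vault_mount.pki_int.path
  type        = "internal"
  common_name = "platform.example.gov Intermediate CA"
}

# Import the signed intermediate certificate back into Vault.
resource "vault_pki_secret_backend_intermediate_set_signed" "signed" {
  backend     = vault_mount.pki_int.path
  certificate = file("${path.module}/intermediate_signed.pem")
}

# A role tenants use to issue and replace service certificates on demand.
resource "vault_pki_secret_backend_role" "tenant_services" {
  backend          = vault_mount.pki_int.path
  name             = "tenant-services"
  allowed_domains  = ["platform.example.gov"]
  allow_subdomains = true
  max_ttl          = "720h"
}
```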

Okay, so we've come a ways here; we've now laid out a basic diagram for the platform. Later on, you're still going to have to get into some serious technical and networking diagrams to prove to some people how things are working, but this is good for our purposes.

On the left are the tenants that are coming in; they're on their own networks and their own desktops. They're connected through the enterprise or government networks, and we have to get them connected into our GovCloud platform.

What we do here is use a hub-and-spoke network approach. A lot of you have seen this: a transit network that everything flows through, so that we have only one connection point from the outside and we control the flow within. This way, we can keep those tenant instances separated from each other, but still allow them to reach the shared services and management layers needed to run the platform. (A rough sketch of such a transit attachment appears below.)
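
To illustrate, here is one way the hub-and-spoke attachment could be expressed in Terraform, using an AWS Transit Gateway as a stand-in for whatever transit mechanism is actually approved; the names and CIDRs are hypothetical.

```hcl
# A minimal hub-and-spoke sketch, assuming AWS; names and CIDRs are hypothetical.
resource "aws_ec2_transit_gateway" "transit" {
  description = "Platform transit hub"
  tags        = { Name = "plat-transit-tgw" }
}

resource "aws_vpc" "orga_app1_dev" {
  cidr_block = "10.10.0.0/16"
  tags       = { Name = "orga-app1-dev-vpc" }
}

resource "aws_subnet" "orga_app1_dev_attach" {
  vpc_id     = aws_vpc.orga_app1_dev.id
  cidr_block = "10.10.255.0/28"
}

# Each tenant product instance becomes a spoke attached to the shared transit hub.
resource "aws_ec2_transit_gateway_vpc_attachment" "orga_app1_dev" {
  transit_gateway_id = aws_ec2_transit_gateway.transit.id
  vpc_id             = aws_vpc.orga_app1_dev.id
  subnet_ids         = [aws_subnet.orga_app1_dev_attach.id]
}
```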

For the purposes of what we're saying here, we're going to call these tenants "tenant product instances," so any one of the tenants, like Org A, could be, say, the dev environment for an application that a tenant is bringing over.

Getting into the shared service and management layer a little further, this is just an example. If you look at this again as a blank slate—this is also part of the DevOps dream—going in and building out all this tooling and the infrastructure that drives the rest of the platform is pretty good. I've left some of the items with just their purpose names, indicating both that I'm not naming specific products we've used and that you could plug different things into those roles. So you have a log collector; that could be any of a variety of tools, but it would be known as the service for log collecting. Of course, I've highlighted the HashiCorp products to show that those are a big part of this whole scheme, and that's what we're here about this week. You know we're using those.

Tool selection – lean evaluations

One thing we talked about is paperwork and things like that. Just a quick brief: you can get through some of these things; we've been able to see lean evaluations work. If you tried to evaluate all those product selections, get them approved, get them purchased, and put them into place, that in itself could be a couple-of-years project.

We have seen it work where we do some quick evaluations. We still have to do the AOA to get the competitive documentation, the TRM to get it approved, and the PO to get it purchased, but at least we can speed up the process going into that. We don't need long POCs.

This type of thing works pretty well. Vendors aren't totally excited about it; they might think they could win with a long POC, but that's not the engineers' opinion.

Structures are key

Structures are key; you want to set these things up. We're doing this multi-tenant, multi-cloud environment, and you want to make sure you can identify everything early on.

It's also going to be hard to go back later and try to retrofit if you don't spend some time on this. We all want to get into prototyping, getting those first things running and tested, but you want to focus on some of this up front or you'll hurt later.

And then, of course, all of this will be reused across the whole platform.

Tenant categorization

What are we going to call these things? How are we going to look at resources and know what they are? And how are we going to keep them separated, going into roles and security and things like that, by using a common nomenclature?

And this leads into the tagging strategy: like we mentioned, same values, same names.

Naming conventions: how we build the names of things, for example even an instance or a security group. We get a lot of heat sometimes for these long names. People say, "I don't know what it means," but eventually you're able to read them like sentences, it makes sense, and it gets away from some of our old habits of naming things LOGGING1, or maybe after Disney princesses, or whatever you've done in the past. (A sketch of one way to build such names and default tags follows.)
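
As one illustration, here is a minimal sketch of building readable names and a default tag set from a common nomenclature; the org, product, and tag keys are hypothetical, not the ones used on any real project.

```hcl
# A minimal naming/tagging sketch; org, product, and tag keys are hypothetical.
locals {
  org     = "orga" # tenant organization code
  product = "app1" # tenant product
  env     = "dev"

  # Reads like a sentence once you know the pattern, e.g. "orga-app1-dev"
  name_prefix = "${local.org}-${local.product}-${local.env}"

  default_tags = {
    Org         = local.org
    Product     = local.product
    Environment = local.env
    ManagedBy   = "terraform"
  }
}

resource "aws_security_group" "tools" {
  name = "${local.name_prefix}-tools-sg"
  tags = merge(local.default_tags, { Name = "${local.name_prefix}-tools-sg" })
}
```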

IaC repos

All right, now we have infrastructure as code, we have configuration, those types of things, and they're going to go into repositories. One way we can approach this is to structure the repositories in a way that makes sense for us.

And by putting the tenant product instances into their own set of repositories, we will be able to segregate that code and even separate out things for security reasons, only allowing access to the people that are supposed to have it, that kind of thing.

Here's an example of that

What you could do: The bottom boxes are the tenant product instance definitions, so at the bottom is a layer of the common pieces to the whole product instance, and then you need to build the networking, the compute, the connection, and security pieces.

By doing this, if you have network engineers and they have to own the network piece, you can separate that out; they're the only ones that have update privileges, and they're the ones who can push the network pieces out.

The cloud platform core, the core code, is really where a lot of the brains are, a lot of the stuff that you develop: the common modules, the common configuration, those types of things. In the end, for the most part, the tenants just need to configure the things they need and take advantage of all the work that's been done. They're not writing the Terraform code that builds Cisco routers; they're taking advantage of the fact that you've done that and can configure things to allow it. (An example of a tenant consuming a core module is sketched below.)
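
For example, a tenant-layer configuration might consume a core module roughly like this; the repository URL, module path, and inputs are hypothetical.

```hcl
# A minimal sketch of a tenant layer consuming a core module; URL and inputs are hypothetical.
module "network" {
  source = "git::https://git.example.gov/cloud-platform-core/modules.git//network?ref=v1.4.0"

  name_prefix = "orga-app1-dev"
  vpc_cidr    = "10.10.0.0/16"
  default_tags = {
    Org         = "orga"
    Product     = "app1"
    Environment = "dev"
  }
}
```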

Expanding that picture a little: like I mentioned, on the right side the product layer shares information with the rest of the pieces of the product instance. This is a lot of where I mentioned the tfvars; for those who are Terraform-knowledgeable, those are the substitution variables, and that's where the values go that we're then going to pass up.

The idea is that you can configure a lot of these tenant product instances just by changing the values, just by putting in the numbers and the specifics of what you need. Then, going up the left side, within their own compute layer or the other layers, they can do some of their own modules and configurations; they can add things that are not available in the core.

But as well, in a lot of cases, it will go up and use the pieces from the core, taking advantage of things like the module infrastructure in Terraform and common configuration languages.

Get into some code already!

All right, I hear you; we'll get into some code. But first, another picture. This is just to give an example of the product instance layer and the breakdown of the compute layer.

These are ways you could go about this, but this is where you get into the CSP separation—the provider separation—and this is kind of the Terraform way. I think it works best for these situations: you have to separate out the AWS and the Azure code, you have to have different ways to do things, and that way you're using an AWS instance resource in one and the Azure equivalent in the other. When you get to the configuration part, that's now common. Once you build a Linux machine, the way that you put MySQL or some other purpose on it is going to be the same, so you don't need to duplicate or change that; it's reusable across the different platforms. (A sketch of that split follows.)
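
Here is a rough sketch of that split, with one CSP-specific module per provider and the configuration layered on top staying common; image IDs, sizes, and names are hypothetical.

```hcl
# modules/aws/linux_node/main.tf -- the AWS-specific build of a node (values hypothetical)
resource "aws_instance" "node" {
  ami           = "ami-0abc1234567890def" # hardened image built with Packer
  instance_type = "t3.medium"
  tags          = { Name = "orga-app1-dev-tools01" }
}
```

```hcl
# modules/azure/linux_node/main.tf -- the Azure equivalent of the same node (values hypothetical)
variable "nic_id"         { type = string } # network interface built by the network layer
variable "image_id"       { type = string } # hardened image built with Packer
variable "ssh_public_key" { type = string }

resource "azurerm_linux_virtual_machine" "node" {
  name                  = "orga-app1-dev-tools01"
  resource_group_name   = "orga-app1-dev-rg"
  location              = "usgovvirginia"
  size                  = "Standard_B2s"
  admin_username        = "platform"
  network_interface_ids = [var.nic_id]
  source_image_id       = var.image_id

  admin_ssh_key {
    username   = "platform"
    public_key = var.ssh_public_key
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }
}
```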

So, to step through a few things here in example code: we're in the tenant product instance layer at the bottom of the chart, and this is just a lot of variable values being substituted at the different levels of that categorization. As we move into the specifics of this instance, we get into some of the networking values. Those would be used for the networking layer, but also when other parts of the compute need to know some of this information, it's available; it's been provided up front. (A hypothetical tfvars file along those lines is shown below.)
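
A hypothetical terraform.tfvars for one tenant product instance might look like this; every value is made up for illustration.

```hcl
# Hypothetical terraform.tfvars for one tenant product instance.
org         = "orga"
product     = "app1"
environment = "dev"

# Networking values provided up front, used by the network layer and
# readable by later layers that need them.
vpc_cidr        = "10.10.0.0/16"
public_subnets  = ["10.10.1.0/24", "10.10.2.0/24"]
private_subnets = ["10.10.11.0/24", "10.10.12.0/24"]
```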

The instance here has the pieces that would be part of everything that's running, to set the stage for the rest. Those would be (see the sketch after this list):

  • the tagging—setting up the default tagging like I mentioned

  • common modules—everything done the same, repetitive, repeatable

  • and then going out and using data resources to gather some of the existing information that's there so that you can pull that back using Terraform facilities, and then the other entities can use that as well.
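
Here is a minimal sketch of those stage-setting pieces, assuming AWS; the region, tag values, and lookup names are hypothetical.

```hcl
# Default tagging applied the same way everywhere, via the provider.
provider "aws" {
  region = "us-gov-west-1"
  default_tags {
    tags = {
      Org         = "orga"
      Product     = "app1"
      Environment = "dev"
    }
  }
}

# Data resources gather existing information so later layers can reuse it
# instead of re-creating or hard-coding it.
data "aws_vpc" "transit" {
  tags = { Name = "plat-transit-vpc" }
}

data "aws_ami" "hardened_rhel" {
  owners      = ["self"]
  most_recent = true
  filter {
    name   = "name"
    values = ["hardened-rhel8-*"]
  }
}
```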

Looking at building a tool station: we'll use the example of a tool station that's going to be within our enclave, the place where we can run some of our Terraform configuration and, as we'll talk about, Ansible as the configuration manager. And these are the variables that would go into it, to set things up, because they're going to be sent up to run modules that will take those values.

The service role assignments are an example of how you can use repetitive structures, map-type inputs, and, with Terraform, actually get into some looping structures and say, "These are the services I need to access from this instance." Then, when it goes up to the common layers and builds those, it makes the connectivity and uses the information provided there to know which services are involved. There's some cool stuff you can do when you get into it. (One way this could look is sketched below.)
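
One way the map-driven, looping approach could look is sketched here; the service names, ports, and security group IDs are hypothetical.

```hcl
# A minimal sketch of map-driven service role assignments, assuming AWS.
variable "tool_station_sg_id" {
  type = string
}

variable "service_roles" {
  type = map(object({
    port      = number
    shared_sg = string
  }))
  default = {
    vault   = { port = 8200, shared_sg = "sg-0aaa111" }
    consul  = { port = 8500, shared_sg = "sg-0bbb222" }
    logging = { port = 514, shared_sg = "sg-0ccc333" }
  }
}

# One rule per requested service, looped with for_each, so the tenant only
# lists the services it needs and the common layer builds the connectivity.
resource "aws_security_group_rule" "to_shared_service" {
  for_each = var.service_roles

  type                     = "egress"
  protocol                 = "tcp"
  from_port                = each.value.port
  to_port                  = each.value.port
  security_group_id        = var.tool_station_sg_id
  source_security_group_id = each.value.shared_sg
}
```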

On the tree side, I'm also pointing out that in this group there may be some compute modules that don't exist in the common layer, and so the tenant is adding those pieces into their own structure. You could look at it this way: at some point they could be approved; they've written the MySQL module and you don't have one yet, so it could eventually move up. And with a lot of the things we're dealing with, like Terraform registries and Ansible Galaxy, you could then have other things enter the domain for other people to use.

Here, we're actually in the code to build a tool station, and the thing I want to point out is that what it's really doing is just calling a module up higher and passing all those variables. We've been collecting all these variables, all the information that's needed, and it just passes them up. There's really no other code in this particular piece. (A sketch of what that call could look like follows.)
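
A sketch of what that pass-through call could look like, with a hypothetical module path and variables assumed to be declared elsewhere in the layer:

```hcl
# A minimal sketch: the tool-station definition only forwards variables upward.
module "tool_station" {
  source = "../../cloud-platform-core/modules/tool_station"

  name_prefix   = var.name_prefix # variables declared elsewhere in this layer
  subnet_id     = var.subnet_id
  instance_type = var.instance_type
  default_tags  = var.default_tags
  service_roles = var.service_roles
}
```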

Jumping into the core layer we talked about, just to show you an example: you'll notice that it's very similar. Again, similar structures and patterns; the trees look the same. Keep doing that: it makes life easier.

Again, you have the separation of the cloud platforms, and the configuration is common. Here I'm showing some of the build pieces, and we'll talk a little more about this, but getting into building images, you also have to have some separation there. There may be common code, but the way that you're going to make an Amazon AMI or other image types is going to be different according to Packer's rules.

Looking at that a little more, this is just a snippet of the networking modules: the layers that are in there, the different pieces. All of these would be built from the parameters supplied when the tenant asked for this. So we're going in now, and of course we're doing things with default tags. Then we're setting up subnets and other pieces, and the route tables specific to those. And it goes on.

We're doing things, as we mentioned, with Cisco routers: we can actually deploy virtual Cisco routers and then go in through Ansible configuration, using the IOS modules, and configure them to be part of the infrastructure. So it's all feasible, and it can all be done once and then reused. (A rough Terraform-side sketch of deploying such a router appears below.)
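
Here is a rough sketch of the Terraform side only, assuming AWS and a hypothetical Cisco CSR 1000v marketplace image; the Ansible IOS playbooks would then configure the router using the address exported here.

```hcl
# A rough sketch of deploying a virtual router; the AMI ID is hypothetical.
variable "transit_subnet_id" { type = string }

resource "aws_instance" "csr_router" {
  ami               = "ami-0csr000example00" # hypothetical CSR 1000v AMI
  instance_type     = "c5.large"
  subnet_id         = var.transit_subnet_id
  source_dest_check = false # required for a routing appliance
  tags              = { Name = "plat-transit-csr01" }
}

output "csr_router_mgmt_ip" {
  description = "Management address consumed by the Ansible IOS playbooks"
  value       = aws_instance.csr_router.private_ip
}
```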

In the modules, like running the tool station module, even here it's just going to pass control up to a common Linux module, because we don't need to rewrite how to build a Linux node every time. We have common modules that we pass things to. We pass up all the variables it needs, and then we can build that Linux node once, the way it needs to be.

Inside the Linux node—and some of the talks have gotten into how you can do some of these things—this is a pattern where the code builds configuration templates for Ansible and then invokes the Ansible configuration. This is where it's going to go in and install all those agents we talked about: it's going to put the Consul agents, the security-scanning agents, and so on onto every machine. In our DevOps world, this gets around the complaint that it's too hard to install all those agents. It's all done in one place, the same way, and everything comes up running. (A sketch of that template-and-invoke pattern follows.)
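
A minimal sketch of that render-a-template-then-invoke-Ansible pattern is shown here; the paths, playbook name, and the aws_instance.node it points at are hypothetical and assumed to be defined elsewhere in the module.

```hcl
# Render an Ansible inventory from a template, then run the playbook against it.
resource "local_file" "ansible_inventory" {
  filename = "${path.module}/generated/inventory.ini"
  content = templatefile("${path.module}/templates/inventory.ini.tftpl", {
    host_ip = aws_instance.node.private_ip # instance defined elsewhere in the module
  })
}

resource "null_resource" "configure_node" {
  triggers = { instance_id = aws_instance.node.id }

  # Installs the common agents (Consul, security scanning, and so on) the same
  # way on every machine.
  provisioner "local-exec" {
    command = "ansible-playbook -i ${local_file.ansible_inventory.filename} playbooks/linux_base.yml"
  }
}
```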

As that code comes back to the tool station module, it's now got an instance that's hardened and has all the common stuff on it. Then it builds a configuration and deploys a tool station purpose onto it. That could be your MySQL, that could be your web server, whatever, but now, again, it's following the same patterns.

This is just a snippet from an Ansible playbook that might load that tool station purpose onto the machine, going through and running the different roles. It's going to have the role of being a Git service, of being a Terraform builder, those kinds of things. It would roll through and bring that in. And a quick example from the build process: using things like Vault, we would have gotten some of the secrets through Terraform. They'd be available temporarily in the environment, not written down anywhere, and then you could pull those in for what Ansible needed. (One way to fetch and hand off such a secret is sketched below.)
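
One way to fetch such a secret with Terraform and hand it to Ansible only through the process environment is sketched here; the Vault mount, secret path, and key are hypothetical.

```hcl
# A minimal sketch of passing a Vault secret to Ansible without writing it to disk.
data "vault_kv_secret_v2" "git_service" {
  mount = "secret"
  name  = "platform/tool-station/git"
}

resource "null_resource" "configure_tool_station" {
  provisioner "local-exec" {
    command = "ansible-playbook -i generated/inventory.ini playbooks/tool_station.yml"

    # Exposed only for the life of this process, never written down anywhere.
    environment = {
      GIT_ADMIN_TOKEN = data.vault_kv_secret_v2.git_service.data["admin_token"]
    }
  }
}
```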

Okay. We're getting there. Down to how we would provision this, how we would actually get this tenant product instance up and running. One approach is to take a series of scripted steps that get the environments ready and then actually build, using the Terraform commands that a lot of you are familiar with. The idea is that these things are reusable: they take in all the parameters about what I am building, what this product instance is, and what values I need, using these other facilities.

We've talked about building out the environment variables. Then as we do what's called a builder assemble step here, it would pull in the repos that we've been talking about, assemble those to be available, and then would be able to pull it all together and start building the layers of the platform.

Then you get into the plan/destroy/apply, which is very similar to what you'd have to do anyway. One of the reasons to break things out like this is that you have the potential for errors at any of these levels, and as anyone who is working with Terraform knows, you're going to have to fight the first errors, then the plan errors, before you can get to the apply errors, and things like that. So you don't want to try to run one big massive thing.

This is just an example from a script like that; you can do some of this in Bash with looping and all. It could be done in Ansible or some other language, but this would be looping through and pulling in the repos to build a structure like I mentioned.

Okay. So let's talk about doing that. Here's another view of our platform. We've built this out, and we've got a tenant ready to come in. I want to mention that we actually built this out eating a lot of our own dog food, because that's what we'd want to do: the platform itself is just other tenant instances that need to be built, and once we have things working, we can use it for our own purposes. Where we need to have some kind of interconnect into the networks, a lot of that can be done and connected through the automation, and then built out, like a Cisco router. That's where the VRF comes in for virtual routing to our shared services in there.

Then we get in and start to run the build. We run the networking layer and it starts to put out the pieces: it builds out a VPC specific to this instance, builds out the subnets and other parts that go with that, and maybe there are other Cisco routers or other pieces that have to fall into place. Then we run the connection layer with the services; that builds the glue to allow access into the service layers that we need. Then we can build the compute layer, and it starts to put those machines in, and we have now created this product instance to the specs they needed for their dev or test environment.

Once we have this and everything is great, this might have been a temporary environment; they may not need it much longer. They could reverse through the destroy process, take those pieces out, and we'd be back to this again. We're saving that money; we've been able to bring it up and down. All of this takes just about as long as it did to talk through it, maybe a little longer, but when you compare it to the weeks and months it would take to provision that even in VM environments and data centers, it's an incredible leap forward for the federal government.

Instance creation and configuration

We could get into where you do some of these steps, but that's another debate we want to have later.

A lot of you are familiar with this type of approach, but what we're saying is that on the side you can build an image, and in the case we're talking about, we need hardened images. We'll talk a little more about that, but you would go through with Packer, and you could use Vagrant as a local virtualization tool. Packer builds the image. Then, as we've done with Ansible, you can do the hardening pieces; there are Ansible Galaxy roles out there that will do CIS hardening and those kinds of things. This way you can produce a hardened image from vetted artifacts in your repos—you don't pull anything straight from the internet. Then you have your image to use. (A sketch of a Packer build along these lines follows.)
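
A Packer build along these lines could look roughly like this in Packer's HCL format; the plugin versions, base-image filter, owner account, and playbook name are hypothetical.

```hcl
# A minimal Packer sketch: build a hardened AMI and run an Ansible hardening role.
packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.0.0"
    }
    ansible = {
      source  = "github.com/hashicorp/ansible"
      version = ">= 1.0.0"
    }
  }
}

locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "hardened_rhel" {
  region        = "us-gov-west-1"
  instance_type = "t3.medium"
  ssh_username  = "ec2-user"
  ami_name      = "hardened-rhel8-${local.timestamp}"

  source_ami_filter {
    owners      = ["123456789012"] # hypothetical owner of the vetted base image
    most_recent = true
    filters     = { name = "RHEL-8*" }
  }
}

build {
  sources = ["source.amazon-ebs.hardened_rhel"]

  # Wraps a CIS hardening role, e.g. one pulled from Ansible Galaxy and vetted.
  provisioner "ansible" {
    playbook_file = "playbooks/cis_hardening.yml"
  }
}
```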

As you go into the pieces we've talked about for building out the instance, you could have an orchestrator like Jenkins (I just kind of threw that in there for some controversy; it could be Nomad or CircleCI), but somewhere you use to run through those processes, build the Terraform, and then configure like we talked about.

One of the beauties of this: if you did get into the debate where you wanted the images built with more of a purpose on them, you've already written the Ansible code or the configuration management, so you could shift that over to the image build and you haven't lost any time. Of course, the tools we talked about, Vault and Consul as a service registry, are all part of that picture.

Image hardening

Hardening is very hard. You have to meet a lot of standards, and things don't always come out of the compliance checks the way you would hope.

In our environment, we also need to get to the point of being approved as secure. Here's a whole other acronym battle: you need to get an ATO, an authority to operate. It's very difficult, but there are a lot of things going on to help you get there. DevSecOps is a way to help that happen and shorten those life cycles. A lot of times it's like this. You all know that picture.

Here we have built our environment. We are happy. We are using modern techniques. It's all in a nice, modern, automated location. However, there could still be some bugs, some problems. In this case, if you happen to move anywhere near this device, it will activate the flushing mechanism and flush prematurely. You could have four or five good flushes before you step out of the booth a few minutes later. There are solutions for this; this is one of them. Obviously we have some sysadmins working in the facility, but I want to caution you: don't use that as your final fix. Go back into the infrastructure as code. Go back and fix it, so that you can rebuild it right the next time.

A couple of lessons learned, and then I just wanted to mention the HashiCorp way. Part of it is to figure out how HashiCorp is doing things before you really get concerned or upset about the way things work. It helps you understand it better.
