Automating Our Way Out of Federal Datacenters and Into the Cloud: Automated Image Pipelines
Oct 07, 2019
Learn how HashiCorp Packer was used to design a pipeline that could be used in federal agencies to migrate into AWS GovCloud and Azure Government.
Exiting a datacenter or pursuing hybrid cloud is a significant lift for an enterprise. Federal agencies in particular will encounter a number of unique challenges while verifying the integrity of systems as they cross the datacenter or cloud barrier.
Packer delivers a unique solution by providing a gateway from traditional datacenter image management processes to cloud-native ones. In this talk, you'll learn how Packer was used to automate the management and delivery of "golden" cloud images based on datacenter images, which in turn came from secure physical media.
Building on Chef to orchestrate the hardening and federal configuration points in the image, the Packer pipeline dramatically reduced the time infrastructure engineers spent on maintenance of gold images, and facilitated rapid deployment of new systems in the cloud. This talk covers how the pipeline was designed and integrated into AWS GovCloud and Azure Government.
Marcelo Zambrana, Cloud Solution Architect, Microsoft
Nicholas Mellen, Associate Manager, Accenture
Nicholas Mellen: Good afternoon, guys. We're here to talk about automating our way out of federal datacenters, using automated image pipelines. My name is Nicholas Mellen. As an associate manager at Accenture, I am currently supporting a variety of civilian, defense, and intelligence clients in their cloud operations or shifting their workloads to the cloud, as well as containerization.
Marcelo Zambrana: Hello, everyone. My name is Marcelo Zambrana. I'm a cloud solution architect at Microsoft. I specialize in application development, infrastructure, automation, and anything DevOps, basically.
» Why an image pipeline?
Nicholas Mellen: Why do we need image pipelines? What do they serve? Across a lot of our experience, which is both development and infrastructure, in commercial or federal, and with all of the major cloud vendors, we've seen some really terrible image practices going around:
Source verification/third-party trust concerns
Secrets managed via images on purpose
Secrets managed via images on accident
Image maintenance issues
In particular, I've run into secrets management in images, which is just an atrocious image practice; outdated dependencies or even kernel exploits that live on inside of images for years inside of an organization, unpatched; image sprawl and maintenance issues; just general management concerns with how many images and what types of images are floating around.
In the federal space in particular, you run into source verification concerns. With things like zero-trust networking and zero third-party trust, you can't really trust the source of an image that was just handed to you by the cloud provider.
How then do image pipelines help solve those problems?
We think that an image pipeline fundamentally treats images as an artifact: just a simple output of a pipeline, or a part of the process.
In that, we have a process architecture, where you've got a clearly defined and repeatable flow of images moving through this pipeline. You can clearly add and change capabilities, and alter your pipeline to add different security or application parts into it.
We also think that image parity is a real benefit of the pipeline approach. If you've done work with banks, healthcare, government, you are trying to solve the parity problem.
I think that the pipeline approach helps you quickly put the same thing in different clouds with one repeatable code-based process, which then gets into image segmentation. A lot of organizations out there are using images as a part of the base for each type of application.
Each RHEL app, each Python app, each Spring app, they've got that type base image. What if we were deploying new ones and supporting them through pull requests? What if we were managing them completely as code and bringing in new images and testing them with clearly repeatable processes?
And finally, probably the biggest benefit is continuous environment delivery: the ability to ship an image, let's say a new version of CentOS or a new RHEL image that you received, such as a minor patch to RHEL 7.
You can send that out to all of your cloud and on-prem environments with a single push and pull request. You can do these things on your timeline, not only when you have time, can get a story open for it, and have an infrastructure engineer to support it.
What about leaving datacenters? This is where the pipeline approach really comes into play.
A large amount of what we do with the federal government involves using cloud better, not just moving to it. And we think that using automation is a key part in the pipeline approach.
In our personal experience, we've even done things like application rationalization, where we built custom tools for scanning through a datacenter and trying to figure out what images we need to use.
But we've also used COTS products. There's a lot out there. Even ServiceNow is trying to get into that game of scanning through your environment, telling you what you have, giving you visibility.
Scope management. Every time we are asked to do new features for the federal government, oftentimes they wanted us to do it with the same number of dollars.
So we say, "Can I automate this change in scope? Can I automate my approach to those new parts, so I can try and squash technical debt before it even comes into play?"
We want platform to platform. We work with all of the cloud vendors all of the time across a multitude of clients. We need the ability to, inside that particular space, shift images around and to clearly repeat our processes for moving into them.
And, of course, testing. I think there's very little out there on the process of independent validation and verification, testing on images, and now we have a repeatable process for applying that in your organization.
» What is an image pipeline?
We're getting more into the nitty-gritty. As I mentioned, everything is managed as code.
In our approach, you're using your source code management solution with GitOps, usually, or some kind of Gitflow workflow for your code. You pair that with a CI/CD orchestrator that runs a process whenever you've put together and signed something that you want to go out to production inside of your cloud environments.
With that in place, your pipeline can trigger. You dynamically create a builder using an infrastructure-as-code solution, then apply configuration management changes that harden the image, maybe using federal standards, or add your organization's specific implementation strategies to it.
Then you test that image using a variety of testing suites or configuration management suites. Finally comes continuous deployment to the cloud environments, and to the on-premises environments if you still have them or are working with them.
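As an illustration that is not from the talk, the stages described above can be sketched as an ordered list of functions that each take the image artifact and hand it to the next stage. The stage names, the `ImageArtifact` type, and the placeholder bodies are all assumptions; in the real pipeline each stage would call out to Terraform, Packer, Chef, and the test suite.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ImageArtifact:
    """An image moving through the pipeline, with a record of completed stages."""
    name: str
    source: str
    completed: List[str] = field(default_factory=list)


Stage = Callable[[ImageArtifact], ImageArtifact]


def placeholder_stage(stage_name: str) -> Stage:
    """Build a stand-in stage that only records that it ran (hypothetical)."""
    def stage(image: ImageArtifact) -> ImageArtifact:
        image.completed.append(stage_name)
        return image
    return stage


# Order follows the description above: builder, hardening, testing, deployment.
STAGES: List[Tuple[str, Stage]] = [
    (stage_name, placeholder_stage(stage_name))
    for stage_name in ("create_builder", "harden", "test", "deploy")
]


def run_pipeline(image: ImageArtifact) -> ImageArtifact:
    """Run every stage in order; an exception in any stage stops the run."""
    for _stage_name, stage in STAGES:
        image = stage(image)
    return image
```

Treating each stage as a plain function keeps the flow explicit, so adding a new security or application step means adding one entry to the list.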
Let's turn it over to Marcelo, to talk a bit about our image pipeline approach that we've rolled out with certain clients.
» A cloud-agnostic approach
Marcelo Zambrana: Thanks, Nick.
Let's try to digest this diagram. When we started this, when we tried to build this architecture, we had one idea in mind, to be cloud-agnostic. And we also had the idea to support not just one cloud but multiple clouds, and on-prem, of course, because realistically, in most of our experience, we always need to support at least one more cloud, right? It is a fact.
You need to be able to support a hybrid environment.
It could be Ansible there, too. Kickstart configuration files for automating the initial setup and configuration of the initial baseline. And we start everything in GitHub.
GitHub allows us to trigger these Jenkins jobs. We have our Hyper-V environment, which is basically a Jenkins slave. It assigns that task to that slave and starts getting all the source code from GitHub, all the Packer scripts, all that stuff.
Depending on which cloud we're going to build, we have a different set of scripts for each one of them. And this is because, sadly, each cloud environment has its own requirements, initial base configurations that need to be there before you can import an image.
I tried to make it work with just one generic thing; it was a mess. It didn't work, so I said, "Let's divide and conquer." So now we have a set of scripts for each cloud environment we want to support.
One thing to note is that we are grabbing the initial CentOS baseline from the official repo. But that could be a local path in your internal network. It doesn't have to be, but it could be.
In our experience, sometimes it's really hard to get those base images from outside. That's why it could be there, but it could also be internal. We support that too.
Once Packer does all its magic, runs everything, we harden that using Chef recipes, and we have the VM ready, we export as a VHD file. And then we start the process to import those VHD files into each cloud environment and on-prem.
Again, this other configuration is in place mostly because, along with the import process, there is initial configuration that needs to be there. In Amazon, you need to have an S3 bucket, and you want to make sure it's secure, so you don't leak images outside.
We don't have open S3 buckets. In Azure we need a storage account, in GCP we need a bucket, things like that. And we also need permissions to access those resources. Some initial configuration needs to be there.
But we use Terraform to automate all that stuff. It doesn't matter if it's on cloud or on-prem, we automate all that base configuration and the import process, so we have everything ready once Packer is ready to import the image.
Nick is going to show us a demo of this process. But that's our architecture, and again, we try to do something cloud-agnostic that will allow us to extend that and support multiple clouds at the same time.
As you can see, we have a lot of tools in there. Nick is going to talk about why we're using Jenkins to orchestrate all these things.
Nicholas Mellen: Let's talk about Jenkins, because we've seen a lot of blog posting on this. Like people are using Jenkins for probably far more than they should. But for this particular solution, Jenkins was quite appropriate.
Marcelo Zambrana: Yeah, especially for our customers, government agencies. Jenkins is pretty much always there.
Nicholas Mellen: Jenkins is already there with most of our clients, and on the rare occasion that they're not already using Jenkins, they're a complete .NET shop, and they're looking at Team Foundation or something like that.
Among the other reasons Jenkins came into play here: we can have a stateful process in our pipeline. Inside each stage, and inside each shell declaration inside that stage, we have state. So we can easily work with tools like Terraform or Packer, and let each of them work with Chef.
We can support parallelization very quickly. It's an additional 3 lines of Groovy to support parallel build stages. And when we're parallelizing certain jobs like image management, uploading, things like that, to the multiple clouds, it's a big time-saver.
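The talk does this with parallel stages in a Jenkins pipeline written in Groovy. As a rough illustration of the same idea outside Jenkins, the sketch below fans one VHD out to several clouds at once using a thread pool. The function names and the fake uploader are hypothetical; a real uploader would call each cloud's storage tooling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def upload_in_parallel(
    vhd_path: str, uploaders: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Run one upload function per cloud concurrently; return results by cloud."""
    with ThreadPoolExecutor(max_workers=max(1, len(uploaders))) as pool:
        futures = {
            cloud: pool.submit(upload, vhd_path)
            for cloud, upload in uploaders.items()
        }
        # result() re-raises any exception from the worker, so one failed
        # upload fails the whole step.
        return {cloud: future.result() for cloud, future in futures.items()}


def fake_uploader(cloud: str) -> Callable[[str], str]:
    """Stand-in for a real upload: describes the transfer instead of doing it."""
    def upload(path: str) -> str:
        return f"{path} -> {cloud}"
    return upload
```

Uploads are mostly network-bound, which is why running them side by side saves time even on a single build machine.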
I would argue that there really weren't any other options that can address some of these needs. And of course we can use Jenkins to offload the actual work of deploying images and configuring things inside of them, exporting them into other formats and sending it to the cloud, on other machines that can be remotely controlled from Jenkins.
So I think that's a major way Jenkins plays into this.
Marcelo, do you want to talk a little bit about Terraform?
» Spotlight on Terraform
Marcelo Zambrana: Yes. As I was saying before, we did use Terraform a lot. We tried to automate as much as we could. And in order to support multiple clouds and on-prem, Terraform was the best choice.
Let us be honest. Who here likes writing and managing CloudFormation scripts or ARM templates? Anyone? Don't be shy. Nobody. Yeah, that's what I thought.
Again, we could use those cloud-native tools, but it wasn't something we wanted to do, because the learning curve of learning those things, maintaining those things, it was just not nice.
So we went with Terraform, and we also use Terraform Cloud. It was released recently for everyone. I highly encourage you to use it. It's really easy to start, and it allows us to manage the remote state. That's what we want; we want our global developers to use the same remote state.
It was great; it was fast, and everything was supported by Terraform. Especially for us, it was really easy to start with that, because it's Terraform.
» Spotlight on Packer
And Packer. It is a real worker here. At the end of the day this is an image pipeline, and that's what Packer does.
So with the ability to support multiple builders, provisioners, and post-processors, it was a nice choice for us. Even if it doesn't have something, you can extend it.
I think there was a nice session yesterday about how to extend Packer. So it was easy for us to choose Packer and to generate all these builders. It was nice, easy, and straightforward.
But all of this tooling, all this scripting, they had a goal. We had to meet a compliance request from security, and Nick is going to talk about compliance now, the fun stuff.
» Federal compliance benefits
Nicholas Mellen: The fun stuff. We were seeking DISA STIG automation in particular to start, but this branched out into other ways that we can meet compliance objectives inside federal agencies.
The first concept, which you might've already heard of and which is used even without automated pipelines, is gold images. These are central images that are trusted by agencies to operate their applications. We sought to build those gold images from the pipeline. That is the genesis of this talk and all of the other work that we've done here.
But there were other benefits. We were able to really reduce deployment overhead. Think of the infrastructure engineer who spent way too much time navigating through individual clouds and on-premises solutions to import an image and then work on it manually, making STIG hardening or other compliance changes. They can now focus on making those changes as code, and on being faster and more responsive to changes inside those formats.
But of course you can STIG many images with this approach. You can STIG all of your base images and start having new application types quickly rolled out on top of your gold image.
System parity. I hit on parity earlier and I just want to drive home that we are achieving parity much faster than other people working in this space because we have a pipeline constantly delivering the same image that can be independently verified inside each cloud or on-prem operating space.
Everything's automated and deployed as code. We can test that code, and we can validate that code and provide that as a metric to the government when we're describing what we're doing, what we're up to, and how we're supporting them.
And there's way more room to extend the solution. You could do additional security checks, for example, like what you see in the container security space with CIS guidelines, checking for private keys inside images.
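As one possible shape for the private key check mentioned above, the sketch below walks a directory tree (for example, a mounted image filesystem) and reports files containing a PEM private key header. This is an illustration only, not the check used in the talk; the function name is hypothetical, and reading each file fully into memory is a simplification.

```python
import os
import re
from typing import List

# Matches PEM headers such as "-----BEGIN RSA PRIVATE KEY-----",
# "-----BEGIN OPENSSH PRIVATE KEY-----", and "-----BEGIN PRIVATE KEY-----".
PRIVATE_KEY_MARKER = re.compile(rb"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def find_private_keys(root: str) -> List[str]:
    """Return sorted paths under root whose contents include a private key header."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "rb") as handle:
                    if PRIVATE_KEY_MARKER.search(handle.read()):
                        hits.append(path)
            except OSError:
                # Unreadable files are skipped in this sketch.
                continue
    return sorted(hits)
```

A non-empty result could fail the pipeline before the image is ever uploaded.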
You can go deep on this approach, but the main thing that we were seeking to address was DISA STIGs. Let me do my absolute best to show you a demo of what this type of pipeline looks like.
Marcelo Zambrana: One thing about the image pipeline, it takes time. We're going to talk about that in a minute, but it does take some time to build all these images, harden them, and then import them.
Nicholas Mellen: These pipelines take a while to run, because we're doing quite a bit of work on the backbone. Let's just walk through that. In a typical scenario that we are running with the government, we dynamically build the so-called Builder VM. If you guys are wondering about the technical implementation, it is a Win 2016 VM, using Hyper-V as the base for Packer.
With that now built, it comes preconfigured with the Jenkins JNLP agent, and then we reach out and it starts checking out the repo of our images.
Our little toy demo that we want to show you guys, we just wanted to show moving the image to each cloud, but you can take this way further and have folder-level organization of the types of images you want, and your pipeline automatically sends it to all 3 clouds or wherever else you're going.
We have local build directories that hold these VHDs, and we wipe them all out.
And then we move into running Packer build operations. Those of you that are familiar with Packer, in particular using Packer with Hyper-V, know that this is very resource-intensive.
Also it's, I don't want to say like a black box, but it's a bit harder to understand where those resources are being allocated at any point in time, unless you have a good view on the metrics of the server.
We did a bit of testing on this and found out that the Packer builds were faster if we ran them in series, not in parallel. That was an aspect of parallelization that wasn't necessarily helpful. We started off here with Azure, and we go through each cloud process here.
So there's an Azure build, an AWS build, and then a GCP build.
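To illustrate running the builds in series rather than in parallel, the sketch below composes one `packer build` command per cloud in the order described and runs them one after another. The template paths and function names are assumptions, not the authors' actual files.

```python
import subprocess
from typing import Callable, Dict, List

# Build order described in the talk: Azure, then AWS, then GCP.
BUILD_ORDER = ["azure", "aws", "gcp"]


def packer_build_commands(templates: Dict[str, str]) -> List[List[str]]:
    """Return one `packer build <template>` command per cloud, in build order."""
    return [
        ["packer", "build", templates[cloud]]
        for cloud in BUILD_ORDER
        if cloud in templates
    ]


def run_in_series(
    commands: List[List[str]], runner: Callable = subprocess.run
) -> None:
    """Run each command to completion before starting the next one."""
    for command in commands:
        # check=True raises if a build fails, so later builds do not start.
        runner(command, check=True)
```

The `runner` parameter exists only so the sequencing can be exercised without Packer installed.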
If you are going through the Packer build process, you've got the input where we're taking, in this case, an ISO and the Hyper-V, and then you need to apply that hardening profile, all of those different security checks that you want to do to the image.
We use Chef for that, and we usually do DISA STIG, the RHEL STIG, or the CentOS STIG. And we would run through all of that process.
We'd also apply agency-specific things. So if you've ever worked with DHS, for example, they have a different handbook, it's DHS 4300A, and that governs how you configure the systems for sensitive workloads.
And we configure all of that with Chef. We manage those recipes internally, and that is typically how that workflow works.
The output for all of that is VHDs. We ended up consolidating it all to be VHD, which is great because one format rules them all; it really helped us out.
Then send all of these to the right cloud storage solution. If you're using AWS, their process is, you bring it into a bucket first. For Azure, it's a storage account, and for GCP it's also a bucket.
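The per-cloud staging targets can be pictured as a small lookup that turns a VHD name into a destination address. The sketch below is illustrative only: the bucket, container, and account names are placeholders, and the Azure endpoint suffix is a parameter because Azure Government uses a different suffix (`blob.core.usgovcloudapi.net`) from the commercial cloud.

```python
def staging_uri(
    cloud: str,
    container: str,
    blob_name: str,
    account: str = "",
    azure_suffix: str = "blob.core.windows.net",
) -> str:
    """Return the storage address a VHD would be staged at for the given cloud."""
    if cloud == "aws":
        # S3 bucket and object key.
        return f"s3://{container}/{blob_name}"
    if cloud == "gcp":
        # Cloud Storage bucket and object name.
        return f"gs://{container}/{blob_name}"
    if cloud == "azure":
        if not account:
            raise ValueError("Azure needs a storage account name")
        # Storage account, blob container, and blob name.
        return f"https://{account}.{azure_suffix}/{container}/{blob_name}"
    raise ValueError(f"unsupported cloud: {cloud}")
```

For example, `staging_uri("azure", "images", "gold.vhd", account="acct", azure_suffix="blob.core.usgovcloudapi.net")` would point at an Azure Government storage account.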
You send it first to these buckets, and the next hop is ingesting it. One little plug I'd like to make here, for any Google fans in the audience: they have united these into 1 single step in their approach. So, we'll plug for them. It’s a beta feature at the moment.
But you can now go directly from VHD to final running image inside their cloud. Everybody else, it's a minimum of 2 steps.
Importing that image typically is a process of taking your VHD that you created, that's in the bucket, and making some kind of managed volume or a managed disk off of it, depending on the cloud provider, and then creating a test bench of it and snapshotting that.
There's your image. That is that second step, that step right after cloud storage offerings.
» The test rig
And then the test rig. I want to go back into Terraform because this is another important approach here.
Our process for testing and validating these images isn't some sort of checksum on the image. We're running this image in the target cloud or on-prem environment. We're building a test rig up with Terraform and then running a test suite against it.
We would write, let's say, a STIG profile in Chef, and then we would write a STIG testing profile in Ansible. This allows us to have a tool-agnostic approach toward the process of applying STIGs and analyzing them in our target environments through our pipeline.
We would test using things like Ansible, or possibly tests with regular Python if we're looking for different things on the OS. But that would be the final endpoint of our cloud pipeline.
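As a small example of what a "regular Python" check against a running test VM might look like, the sketch below parses the text of an `sshd_config` file and reports settings that differ from an expected baseline. The two expected values are illustrative placeholders, not the actual STIG requirements, and how the file is fetched from the test rig is left out.

```python
from typing import Dict, List

# Illustrative baseline only; a real profile would follow the applicable STIG.
EXPECTED = {"permitrootlogin": "no", "permitemptypasswords": "no"}


def parse_sshd_config(text: str) -> Dict[str, str]:
    """Map lowercased keywords to values, keeping the first occurrence of each."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sshd uses the first value it reads for a keyword.
            settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def failed_checks(text: str, expected: Dict[str, str] = EXPECTED) -> List[str]:
    """Return keywords whose value is missing or differs from the baseline."""
    actual = parse_sshd_config(text)
    return sorted(
        keyword
        for keyword, wanted in expected.items()
        if actual.get(keyword, "").lower() != wanted
    )
```

This sketch treats a missing setting as a failure, which is stricter than relying on defaults.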
The result is that your images now come from that single physical medium: an ISO that you could grab from a secure disk, a disk whose origin you don't really have to worry about. It can just go into the pipeline now, all the way into cloud images.
Marcelo Zambrana: But if you notice here, it took 1 hour and 42 minutes to finish this thing. It does take some time. We wanted the demo to be honest. It's a lengthy process, especially if we are supporting multiple clouds and testing the way we are doing it.
Nicholas Mellen: That's just an example of what we did. Marcelo, do you want to talk about our general lessons learned?
» Lessons learned
Marcelo Zambrana: It's not as easy as it looks. For sure, an image pipeline requires patience, a lot of patience. Each time we are testing each environment, it takes time.
Remember, each time we are creating a VM from a baseline official image. How long that takes depends on how big the disk is going to be, the VM that was in the Hyper-V environment, things like that. I think we started with at least a 30-gig disk. It was a pain. Just to test simple things, you had to wait like half an hour, or sometimes more.
Now I think we are down to a 10-gig disk, and the builds still take some time. And we played with some settings, like moving to a more CPU-focused VM, I/O, things like that. Probably we can tune it more.
But, again, you need patience with this stuff. It requires a lot of time; it's time-consuming. But once you have it running, it's easy to make changes and move forward.
But the initial environment takes some time.
Going back to the environment, each cloud has its own requirements. So that's something else you need to keep in mind. You need to do some initial configuration of VNets, subscriptions, storage accounts, subnets, IAM policies, things like that, to give the pipeline access to all these resources.
As I was saying, more clouds, more complication. So it takes time.
To make it work properly, especially when the work is distributed across multiple clouds and environments, there is a lot of complexity to handle. I don't know if you guys are familiar with Kickstart configuration files. How many are familiar with those for CentOS?
Testing those files is hard, because they change based on the version we are using. It is complicated, especially in a cloud environment.
Nicholas Mellen: A lot of back and forth.
Marcelo Zambrana: Yes, it is. It is pain.
And the import process. Each cloud has a different process to do this. For some reason Amazon requires a JSON file, I don't know why, but you cannot even name the final image. But you need to learn all those, have everything in place, and it takes time.
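For context, the JSON file in question is the disk container description passed to `aws ec2 import-image --disk-containers file://containers.json`. The sketch below generates a file in that shape; the bucket, key, and description values are placeholders, and the current AWS VM Import/Export documentation should be checked for the full set of fields.

```python
import json
from typing import Dict, List


def disk_containers(
    bucket: str, key: str, description: str = "gold image"
) -> List[Dict]:
    """Describe one VHD in S3 in the shape expected by `aws ec2 import-image`."""
    return [
        {
            "Description": description,
            "Format": "vhd",
            "UserBucket": {"S3Bucket": bucket, "S3Key": key},
        }
    ]


def write_containers_file(path: str, bucket: str, key: str) -> None:
    """Write the disk container description to a JSON file for the CLI."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(disk_containers(bucket, key), handle, indent=2)
```

Generating the file in the pipeline avoids keeping a hand-edited JSON file per image.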
Granted, Google is better. It does it all in just 1 command, but the import may take up to 2 hours. So it's a time-consuming process, and each cloud has its own limitations. It is not easy.
Also, we wanted to optimize this import process. We didn't want to add more CPU-consuming time to our pipeline, and we didn't want to do image file conversions, from OVA to VMDK, back and forth, things like that.
We found out that VHD was the common format for all 3 clouds, and for on-prem in our case. So we went with that, and so far it's working fine. We should be able to support all those formats, but for now VHD is common for all 3 clouds and it is working.
Moving on, design. Again, we wanted it to be cloud-agnostic. And one thing that we had to consider is that we could have used the official Packer build for Amazon or for Microsoft or Google and grab a Marketplace image.
But what's the fun with that? Especially for us, working with the government agencies, they won't allow you to use a Marketplace image. They will block you right away.
Most of the time, you don't even have access to the Marketplace. So that was a decision, between Marketplace image or independent ISO file. But this could be locally stored, so we went with locally stored. So we grabbed one, we hardened and everything.
We couldn't use the Packer builder for each cloud. We had to do something generic that allows us to export to each cloud. That was something that we had to consider. And again, Hyper-V versus cloud-specific: Hyper-V allows us to use the VHD format, and it was accepted in all clouds.
Building this initial Hyper-V environment also took some time because we're building Azure, Microsoft, so we should be able to do this anyway. But having all that configuration requires some initial complexity, mostly because of security.
We were focused on supporting an air-gapped environment like our customers. So even the initial Packer VMs that we create in the pipeline, those get internal IPs, that you can reach from your internal network.
You don't get public IPs at all in this pipeline. Everything is internal, so we had to account for that. But we added support for a NAT subnet. That way, if you are allowed to use external resources, we can route through it, go outside, and get those resources as well.
It was a nice balance. All that networking and everything was something that we had to think about. Our design was able to support both environments, air-gapped and not air-gapped. And again, we used Terraform for that, which was nice.
It could be extendable; that's the purpose. We can do a lot of modifications, just do it offline, it would be there. Or we can move all that to a different subscription or account and just use it there. That was nice.
From the beginning we had the idea to do everything as code as much as we could, instead of manually. That was a nice decision because it allowed us to automate and save time in the long run.
Those are some of the lessons we learned along the way.
» Future modifications in the works
But we have plans. This is a nice solution, but there are always next steps, right? Right now, we only support 1 subscription, 1 account, 1 project. But ideally you should be able to support endless subscriptions and accounts and projects.
Usually, the way it works, you have a management account or subscription, and other accounts and subscriptions are attached to this.
You just import those images to that specific management subscription and from then you share those images to everywhere. That's something that we have in mind, because if you start importing images everywhere, that's not going to work. It's hard to manage those things.
And we had an issue; we had to import images everywhere, and it was a pain, because, "Who imported this image? Why is it still here?"
Things like, "Oh, we forgot to import to these other accounts we didn't know existed." That's why we want everything centralized: just 1 management account, and from there share everywhere.
Right now our Jenkins slave is using service principals and Amazon keys. Jenkins has a nice plugin to store secrets, but a more elegant solution would be to have slaves on each cloud, and attach those identities to the slave itself.
Using Amazon roles, Azure managed identities, and GCP service accounts would be a more elegant solution. We also want to put some governance and policies in place: How long are we going to keep these images? How do we make sure development teams use the latest version?
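A retention policy of the kind raised here could start as simply as the sketch below, which flags images older than a cutoff and identifies the latest one. This is a hypothetical illustration of a planned feature, not something the talk describes as built; the image names, dates, and 90-day window are made up.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_images(
    images: Dict[str, datetime], max_age_days: int, now: datetime
) -> List[str]:
    """Return names of images created before the retention cutoff, sorted."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, created in images.items() if created < cutoff)


def latest_image(images: Dict[str, datetime]) -> str:
    """Return the name of the most recently created image."""
    return max(images, key=lambda name: images[name])


# Example inventory with made-up names and creation dates.
INVENTORY = {
    "centos7-gold-v1": datetime(2019, 1, 15, tzinfo=timezone.utc),
    "centos7-gold-v2": datetime(2019, 6, 1, tzinfo=timezone.utc),
    "centos7-gold-v3": datetime(2019, 9, 20, tzinfo=timezone.utc),
}
```

With a central management account, a check like this only has to run in one place.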
Using Chef Server. Now we are using Chef Solo. I think that was just because it's easy, but using Chef Server will allow us to do a lot more things. And along with that we want to support multiple provisioners, like Ansible and Puppet.
Just be agnostic. Again, everything is moving to containers and Docker, but the same architecture could be applied there. So the next step is to start creating these image pipelines for containers.
Nicholas Mellen: Every time we do one of these things, we get all of our code out there, from Nick and Marcelo. Everything that we can share, we try to build something similar to what we did and put it on GitHub.
You can check that out at the HashiConf 2019 repo. You can look at our previous talk and also our previous code, which was less about gold images and where everything starts, and more about STIG automation in the operations stage. So, looking at STIG drift, looking at how your systems may shift out of compliance, and how you can automatically remediate them without human intervention.
Marcelo Zambrana: This is mostly what happens after you create a VM using our gold image. How to keep that VM in compliance. Our ChefConf talk talks about how you can do that and automate that process.
Nicholas Mellen: Thank you, guys.
Marcelo Zambrana: Thanks so much, guys.