Dive deep into Petco’s method for reducing the cost of updating stateless on-premises virtual machines via automated redeployment using HashiCorp Packer and Private Terraform Enterprise.
Hello. I'm going to talk about automated server redeployment in a traditional on-premises, virtualized environment. My name is Chris Manfre. I'm a senior DevOps engineer with the Petco DevOps COE (Center of Excellence).
There are a couple of different ways you can go about updating your on-premises, stateless, virtual machines. When I say "stateless," I mean that the backing data for the service that they provide has been externalized—for instance, web servers and middleware that speak with a database.
You can update these servers via traditional hand patching, where someone logs into each of the systems, starts wrestling with packages despite there being a package manager, and then prays that the system comes back online on the new kernel after the final reboot. This is not a very good method. It costs a lot of time and people power.
You can automate your patching. This does save on costs, but as your systems age, the potential for configuration drift increases. You're no longer sure that what you have running is what you initially deployed.
Finally, you can do automated server redeployment. This saves on costs and results in higher-fidelity copies of your systems. Your VMs become fungible.
I'm going to share with you today how Petco is able to redeploy its Sumo Logic Collectors. Sumo Logic Collectors are log and metrics aggregators. We have a handful of them load balanced in each of our locations via F5.
The tools that we utilize for the redeployment process are GitLab for version control and CI/CD, Packer to do our Nutanix base images, Terraform to express our systems as code, and private Terraform Enterprise, which I will henceforth refer to as PTFE, for the Terraform execution.
First, I'll give you a quick summary of our Packer. Then we'll take a deeper dive into the Terraform so that you understand what resources we're creating as well as how they're organized. Finally, I'll share with you how we're able to do the redeployment with PTFE.
We build base images via Packer for Nutanix. We have this scheduled monthly out of GitLab CI/CD. We use the QEMU builder every month to produce an updated and hardened image. This artifact we copy to a centralized object storage, and then we use the Prism API in order to distribute the image to each of our Nutanix clusters. This gives us a consistent set of images in all of our locations.
We also pseudo-tag them in their names, either as rhel7_latest or with the date and time that they were created, like rhel7_2021-08-17_1-11-48. This allows us to pin a deployment to a specific version of a template so that we're guaranteed the versions of the packages that come within it. We can also subscribe a deployment to the latest version of the template so that we're always getting the most up-to-date packages.
This also allows us to rotate our images so that we can reap older images based on the age. This ensures that we're never deploying out-of-date templates.
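As a rough illustration of the naming and rotation idea, here is a minimal Python sketch. It assumes the timestamp layout seen in rhel7_2021-08-17_1-11-48; the function names and the 90-day retention window are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta

# Assumed layout of the date suffix, e.g. rhel7_2021-08-17_1-11-48
STAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"


def parse_image_name(name):
    """Split an image name into (family, created); the 'latest' alias has no date."""
    family, _, stamp = name.partition("_")
    if stamp == "latest":
        return family, None
    return family, datetime.strptime(stamp, STAMP_FORMAT)


def images_to_reap(names, now, max_age_days=90):
    """Return dated images older than max_age_days; the *_latest alias is never reaped."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for name in names:
        _, created = parse_image_name(name)
        if created is not None and created < cutoff:
            stale.append(name)
    return stale


if __name__ == "__main__":
    catalog = ["rhel7_latest", "rhel7_2021-08-17_1-11-48", "rhel7_2021-03-16_1-10-02"]
    print(images_to_reap(catalog, now=datetime(2021, 9, 1)))
```

Pinning then amounts to referencing a dated name, while subscribing to updates means referencing the latest alias.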
Petco has a collection of composable Terraform modules in order to do its on-premises deployments. They can be classified into 2 types: platform-specific modules and helper modules.
For our platform modules we have VMware and Nutanix. These are responsible for creating the virtual machines within vSphere and Prism Central. They also provide some standardized outputs. These outputs can chain into subsequent helper modules. The helper modules are platform-agnostic, meaning that they can take the outputs from any of our platform modules. And if we ever have a new platform, all we have to do is create a new module for it. Then we can slot that into our existing helpers.
The helpers are responsible for creating supporting resources for each of the servers that the platform modules deploy. They also make heavy use of provisioners running out of null_resources for places where we don't have a formal Terraform provider. They do things like remote-execs for post-deployment commands on the servers themselves as well as local-execs for API calls via Python on behalf of the systems. The Terraform makes heavy use of for_each. This allows us to uniquely describe and key each of our servers.
On the screen you can see our each_server variable as well as the server1, server2, and server3 objects. The Terraform variable type for this is a map of objects, but you can think of this like a big piece of JSON.
Now you can see our server1 object opened up a little more: the deployment, Active Directory, Chef, ServiceNow, and SevOne configurations.
The server1 deployment configuration looks something like this slide. Here we're able to specify the image name. In the case of our Sumo Logic Collectors we're using the rhel7_latest image to make sure that we're always getting the most up-to-date image. We can also specify other things like the CPU settings and the memory; if we needed extra disks, we could put that here as well.
Our Active Directory configuration looks like this. In the case of our Sumo Logic Collectors, we want to make sure that our server admins and our Sumo admins are able to log into the systems. So here we're able to authorize the server ACL groups that they belong to.
Our Chef configuration looks something like this. Here we're specifying our Sumo Logic Collector policy as well as the policy group that the systems get deployed into and the Chef client version that we're going to bootstrap.
Our ServiceNow configuration looks like this. Here we're able to assign who owns the systems as well as the location and environment that the systems are being deployed into.
Finally, our SevOne monitoring configuration looks like this. Here we're able to specify what ticket queues our event tickets go to and who to contact within PagerDuty in case these systems go down.
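To make the "big piece of JSON" picture concrete, here is a hypothetical rendering of one entry as a Python dictionary. Only the server keys, the five section topics, and the rhel7_latest image name come from the talk; every field name and value below is an assumption for illustration, not the actual schema.

```python
import json

# Hypothetical shape of the map of objects; field names and values are
# illustrative assumptions only.
each_server = {
    "server1": {
        "deployment": {"image_name": "rhel7_latest", "num_vcpus": 4, "memory_mib": 8192},
        "active_directory": {"authorized_groups": ["server-admins", "sumo-admins"]},
        "chef": {
            "policy_name": "sumologic_collector",
            "policy_group": "production",
            "client_version": "16",
        },
        "servicenow": {"owner": "devops", "location": "dc1", "environment": "production"},
        "sevone": {"ticket_queue": "infra-events", "pagerduty_contact": "devops-oncall"},
    },
    # server2 and server3 would follow the same shape under their own keys.
}


def section_for(servers, server_key, section):
    """Look up one configuration section for one server by its key."""
    return servers[server_key][section]


if __name__ == "__main__":
    print(json.dumps(section_for(each_server, "server1", "deployment"), indent=2))
```

Keying everything by the server name is what lets each module pick out only the section it needs for a given server.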
When we call our Nutanix module, our platform module, it looks something like this. Here we source it by Git URI so that, as the module gets developed, it can be in its own discrete repository. And we reference it by tag, so we're able to pin to a specific version of the module. This allows the module to continue to be developed and not have the code move out from under us.
This is very important for Petco, as a larger organization with many Terraform practitioners and multiple teams that utilize these modules. If you're a smaller organization, you might want to just reference by branch. If you're a single practitioner, maybe you just want to use a relative path and develop your modules within the same repository where you're consuming them.
A little further down from the source, you can see how we're able to ingest our each_server variable. Here we're putting in the deployment configuration for each of the servers and referencing it by that server key. The module itself is responsible for creating the Nutanix virtual machine resource. We also do automated hostnaming via an in-house-developed Petco provider. It has a large library of words, and it will concatenate 2 random words together in order to form a hostname. We get such classics as yaps-unsweetly. We also get some not-so-great names, but I won't discuss them here.
It also has a ServiceNow client, so I can check a prospective name to see whether it's already in use in our inventory. This custom provider we self-host in our own private provider registry on-premises.
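The hostnaming idea can be sketched in a few lines of Python. This is not the Petco provider: the word list is a tiny stand-in, the is_taken callback is a hypothetical placeholder for the ServiceNow inventory lookup, and picking two distinct words is an assumption.

```python
import random

# Tiny stand-in word list; the real provider is described as having a large library.
WORDS = ["yaps", "unsweetly", "amber", "brisk", "cobalt", "dapple"]


def generate_hostname(is_taken, words=WORDS, rng=random, max_attempts=20):
    """Join two distinct random words with a hyphen, retrying while the name is in use."""
    for _ in range(max_attempts):
        first, second = rng.sample(words, 2)
        candidate = f"{first}-{second}"
        if not is_taken(candidate):
            return candidate
    raise RuntimeError("no free hostname found; enlarge the word list")


if __name__ == "__main__":
    in_inventory = {"yaps-unsweetly"}  # pretend this name already exists
    print(generate_hostname(lambda name: name in in_inventory))
```

Passing the inventory check in as a function keeps the name generator independent of whichever system of record answers the question.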
The module also uses the Infoblox provider in order to do IP allocation and create an A record for the hostname and a PTR record for the IP address. It's also responsible for some post-deployment tasks like setting the hostname on the systems when they come up and doing some networking configuration.
As I alluded to before, these platform modules provide some standardized outputs, and that looks something like this. Here we're able to take the name of our Nutanix virtual machine resource; the ID, which is the UID as seen by the hypervisor; the fully qualified domain name; and the IP address. We're able to return that all based on that server key.
Our Active Directory module looks something like this. It's responsible for creating the server resource groups for each of the systems, as well as adding the authorized ACL groups to each of those resource groups. You can see how we're feeding in our Active Directory configuration based on that server key.
We're also ingesting the ID, the name, and the IP address of each of our systems from the Nutanix module output based on that server key. The module is also responsible for a post-deployment task of delivering the Vault AppRole ID for the join credentials, which Chef will later utilize in order to join the systems to our domain.
Our Chef module looks something like this. It's responsible for bootstrapping the Chef client, as well as applying the Chef policy.
Our ServiceNow module is responsible for adding the systems to CMDB inventory, and it also removes them and retires them when we do a destroy.
We have a delete_sumo module. It's responsible for cleaning up the systems from the Sumo Logic portal. It's the Chef policy that ends up adding them to Sumo. And it's this module that's responsible for cleaning them up after the servers get decommissioned.
We have a Nutanix affinity module. This module is responsible for pinning guests to certain hosts within our clusters that are licensed for the RHEL 7 operating system. We run in a mixed environment, so not all of our hosts are licensed for all operating systems. This module allows us to deal with that.
We have our SevOne module for adding our systems to the monitoring platform. It's also responsible for removing them on decommissioning.
With these collections of Terraform modules, along with Chef policy, we have code in order to be able to specify any arbitrary on-premises deployment. This code, along with a PTFE workspace for the Terraform execution, allows us to fully manage the lifecycle of these systems, from deployment all the way to decommissioning.
Next I'm going to share with you how the framework that we utilize for redeploying our Sumo Logic Collectors looks. It runs out of GitLab CI/CD. It's triggerable, meaning that other pipelines can request a redeployment via API. It's also schedulable, so we can do monthly redeploys in lieu of patching. And it's containerized. All the scripts and tools that we utilize for the redeployment process are rolled up into a tagged image, which gets run inside Docker runners.
When I say it's a framework, I mean that it's reusable: it's an importable .gitlab-ci.yml that can be utilized in multiple projects. And it's configurable. It's designed to be able to sequentially recreate any list of arbitrary Terraform resources that are managed within a given PTFE workspace. We just happen to be using it here for redeploying our Sumo Logic Collectors.
In order to configure the redeployment framework, we have to pass it a config.json. On screen you can see the config. We're passing in things like the hostname of our PTFE instance, the ID of our workspace, some timeout settings, and then the resources that we're going to be redeploying.
When you think about a redeployment, there are resources that you want to keep and there are resources that you want to recreate. Things that we want to keep include the IP address of each of our systems. Unfortunately, we don't have programmatic access to our F5 load balancers. So we want to make sure that the IPs in the backing load balancer pools remain the same. We also want to keep our hostnames. It doesn't make sense to rename our systems. We want to keep all the DNS records that we've created. It doesn't make sense to mess with those. Otherwise we might end up with some TTL issues.
It doesn't make sense to remove the systems from inventory. They can stay there. Same thing with our monitoring. They can stay in monitoring. And we also want to keep all the things that we created in Active Directory, which allow people to log in.
One thing that we do want to recreate for the redeployment, first and foremost, is the Nutanix virtual machine. We're going to end up tainting this resource so that we're 100% guaranteed to get a redeploy on the new template.
After the systems have been recreated, we need to do all the follow-on tasks, like resetting the host affinity, rerunning the Chef bootstrap and policy application process, and all the post-deployment things. The way we specify that looks something like this. You can see the resource list expanded a little more. In the name field we have the Nutanix virtual machine resource as Terraform would recognize it.
We also list all the dependent objects, those other things that we want to keep. And we're also passing in a little bit of additional information here, the server key, server1 in this case. There was 1 of these resource blocks for each of our 3 collectors, server1, server2, server3. I only have server1 expanded.
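A config of this kind might be loaded and checked along the following lines. The key names (ptfe_hostname, workspace_id, timeouts, resources, name, dependents, server_key), the example addresses, and the validation rules are all hypothetical; the talk only describes what information the file carries.

```python
import json

REQUIRED_TOP_LEVEL = ("ptfe_hostname", "workspace_id", "timeouts", "resources")
REQUIRED_PER_RESOURCE = ("name", "dependents", "server_key")

# Hypothetical example content; none of these values come from the talk.
EXAMPLE_CONFIG = {
    "ptfe_hostname": "ptfe.example.com",
    "workspace_id": "ws-EXAMPLE",
    "timeouts": {"plan_seconds": 600, "apply_seconds": 1800},
    "resources": [
        {
            "name": 'module.nutanix.nutanix_virtual_machine.vm["server1"]',
            "dependents": ['module.chef.null_resource.bootstrap["server1"]'],
            "server_key": "server1",
        }
    ],
}


def load_config(text):
    """Parse config.json text and fail early if an expected key is missing."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in config]
    if missing:
        raise ValueError(f"config is missing keys: {missing}")
    for index, resource in enumerate(config["resources"]):
        missing = [key for key in REQUIRED_PER_RESOURCE if key not in resource]
        if missing:
            raise ValueError(f"resource {index} is missing keys: {missing}")
    return config


def target_addresses(resource):
    """Addresses for one redeployment: the VM first, then its dependent objects."""
    return [resource["name"], *resource["dependents"]]


if __name__ == "__main__":
    config = load_config(json.dumps(EXAMPLE_CONFIG))
    print(target_addresses(config["resources"][0]))
```

Validating up front means a typo in the config fails the pipeline before any server has been touched.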
In addition to this configuration, you can also bring your own Python scripts. It has the ability to run arbitrary Python at certain points in time while the pipeline is running. First, there's a prescript, which lets you run Python before the redeployment of any resource. There's a health check, which runs after the redeployment of a resource. And that's used to confirm success to make sure that the thing that we recreated is up and healthy.
Finally, there's a postscript, which is a last chance to do any work before we move on to the next resource. There are some arguments that we pass to the script. The resource that's currently being redeployed from the config.json gets passed in as a Python dictionary.
We also retrieve the Terraform state at those points in time via the TFE API, and we also pass that in as a Python dictionary. There are some secrets that we need for the API calls that we make within these scripts. We end up retrieving these credentials from Vault via JWT, and we do this within our .gitlab-ci.yml. That looks something like this. You can see on screen we're configuring our variables for Vault, including the Vault address, the JWT role that we're going to be logging in as, and then the mount point and secret path where our secrets are stored.
In the before_script, you can see how we're able to log in via the JWT role that we specified, retrieve our Vault data, sign out and revoke our token, and then export the secrets as certain environment variables that our scripts will then consume. Here we're exporting the PTFE token, which our pipeline will use to drive the workspace, some SevOne credentials for managing the monitoring, and some Sumo credentials for managing the Sumo portals.
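The talk does this login inside the pipeline definition itself. The same sequence (log in with a JWT, read the secret, revoke the token) can be sketched in Python against Vault's HTTP API. This sketch assumes the JWT auth method is mounted at the default jwt path and that the secret lives in a KV version 2 engine; the function names are hypothetical.

```python
import json
import urllib.request


def http_json(method, url, token=None, body=None):
    """Minimal JSON-over-HTTP helper using only the standard library."""
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Content-Type", "application/json")
    if token:
        request.add_header("X-Vault-Token", token)
    with urllib.request.urlopen(request) as response:
        raw = response.read()
    return json.loads(raw) if raw else {}


def fetch_secrets(vault_addr, role, jwt, mount, path, send=http_json):
    """Log in with a JWT, read one KV v2 secret, then revoke the short-lived token."""
    login = send("POST", f"{vault_addr}/v1/auth/jwt/login", body={"role": role, "jwt": jwt})
    token = login["auth"]["client_token"]
    try:
        secret = send("GET", f"{vault_addr}/v1/{mount}/data/{path}", token=token)
        return secret["data"]["data"]
    finally:
        # Revoke even if the read failed, so no token outlives the job.
        send("POST", f"{vault_addr}/v1/auth/token/revoke-self", token=token)
```

The returned dictionary could then be exported as environment variables for the later scripts. The send parameter exists only so the flow can be exercised without a live Vault.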
At the very bottom, you can see where we're including our redeployment framework. Up first is our prescript, which looks something like this. It searches the Terraform state that gets passed to it along with the server key in order to find the hostname of the system that is currently being redeployed. This hostname, along with the credentials that we have retrieved from Vault, we use to put the system into maintenance mode. We want to make sure that we don't get any alerts when the system goes down for the redeployment.
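A prescript along these lines might look like the sketch below. It assumes the state dictionary follows the version 4 Terraform state layout (a resources list whose instances carry an index_key and attributes) and that the VM's name attribute holds the hostname. The resource dictionary key server_key and the open_maintenance callback, which stands in for the monitoring API call, are hypothetical.

```python
def find_hostname(state, server_key, resource_type="nutanix_virtual_machine"):
    """Find the VM instance keyed by server_key in the state and return its name."""
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed" or resource.get("type") != resource_type:
            continue
        for instance in resource.get("instances", []):
            if instance.get("index_key") == server_key:
                return instance["attributes"]["name"]
    raise KeyError(f"no {resource_type} instance found for key {server_key!r}")


def prescript(resource, state, open_maintenance):
    """Runs before a redeployment: silence alerts for the host about to be rebuilt."""
    hostname = find_hostname(state, resource["server_key"])
    open_maintenance(hostname)
    return hostname


if __name__ == "__main__":
    example_state = {
        "version": 4,
        "resources": [
            {
                "mode": "managed",
                "type": "nutanix_virtual_machine",
                "name": "vm",
                "instances": [
                    {"index_key": "server1", "attributes": {"name": "yaps-unsweetly"}}
                ],
            }
        ],
    }
    prescript({"server_key": "server1"}, example_state, print)
```

The same lookup would serve the health check and the postscript, since both also start from the server key.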
In order to do the redeployment, the first thing we do is we taint our Nutanix virtual machine resource. Unfortunately, there aren't any ways to do a taint in PTFE via the API. This is something that we have to do via the binary. In order to do that, first we query the workspace for some metadata using the TFE API and the workspace ID that we specified within our config. Then we use that metadata to render a dynamic backend remote file.
We use things like the organization and the workspace name and that token that we retrieve from Vault in order to render this backend remote. We also check to see what Terraform version the workspace is set to, and then we download that same version of the Terraform binary. Then, with this initialized backend, along with the Terraform binary that we downloaded, we do the taint. This ends up resulting in a new state within the workspace.
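The rendering step can be sketched as follows. It assumes the workspace metadata has the shape returned by the TFE workspaces endpoint (name and terraform-version under attributes, the organization under relationships) and that the token is written directly into the backend block, which the remote backend accepts. The function names are hypothetical.

```python
def backend_inputs(workspace_payload):
    """Pull the fields needed for the backend out of a workspace API response."""
    data = workspace_payload["data"]
    return {
        "organization": data["relationships"]["organization"]["data"]["id"],
        "workspace_name": data["attributes"]["name"],
        "terraform_version": data["attributes"]["terraform-version"],
    }


def render_remote_backend(hostname, organization, workspace_name, token):
    """Render a backend file that points a local binary at the PTFE workspace."""
    return (
        "terraform {\n"
        '  backend "remote" {\n'
        f'    hostname     = "{hostname}"\n'
        f'    organization = "{organization}"\n'
        f'    token        = "{token}"\n'
        "    workspaces {\n"
        f'      name = "{workspace_name}"\n'
        "    }\n"
        "  }\n"
        "}\n"
    )


def taint_commands(terraform_bin, address):
    """Commands to run, in order, in the directory holding the rendered backend file."""
    return [
        [terraform_bin, "init", "-input=false"],
        [terraform_bin, "taint", address],
    ]
```

The terraform_version value tells the pipeline which binary to download before running these commands, so the local taint is performed by the same version the workspace uses.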
Finally, we do a targeted apply on the Nutanix virtual machine object, and also that list of dependent objects, and this kicks off a run. The first thing the run does is it goes into a planning state. While it's in a planning state, we poll the state of the run, and we're waiting for the status to become planned and for the boolean is-confirmable to be set to true so that we can proceed.
In this output on the screen, you can see that it's still planning, and is-confirmable is currently false. Once the plan is complete, we end up confirming the plan via the TFE API, and that kicks off the apply process. Similarly, while the apply is running, we poll the status of the run and we wait for the status to turn into applied so that we know that it's finished.
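The polling logic can be sketched as a small loop. It assumes a run document shaped like the TFE runs API response, with status and actions.is-confirmable under data.attributes. The fetch_run callback stands in for the HTTP call, and the function names, failure-state list, and polling interval are assumptions.

```python
import time

FAILED_STATES = {"errored", "canceled", "force_canceled", "discarded"}


def plan_is_confirmable(attributes):
    """True once the plan has finished and the run can be confirmed."""
    return attributes["status"] == "planned" and attributes["actions"]["is-confirmable"]


def apply_is_finished(attributes):
    """True once the apply phase has completed."""
    return attributes["status"] == "applied"


def wait_for_run(fetch_run, run_id, done, timeout_seconds,
                 interval_seconds=10, sleep=time.sleep, clock=time.monotonic):
    """Poll a run until done(attributes) is true, it fails, or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        attributes = fetch_run(run_id)["data"]["attributes"]
        if attributes["status"] in FAILED_STATES:
            raise RuntimeError(f"run {run_id} ended in state {attributes['status']}")
        if done(attributes):
            return attributes
        if clock() >= deadline:
            raise TimeoutError(f"run {run_id} still {attributes['status']} after timeout")
        sleep(interval_seconds)
```

A pipeline would call wait_for_run once with plan_is_confirmable, confirm the run, and then call it again with apply_is_finished, using the timeout settings from the config for each phase.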
Once it's finished, everything is green. It looks good. Then we want to do a health check to make sure that the system that we just recreated is actually up and healthy and able to accept traffic. We check the state of the Terraform along with our server key in order to figure out the hostname of the system.
Then we use that, along with the Sumo Logic credentials that we retrieved from Vault, to check to see whether the system has successfully registered itself within the Sumo portal and is up and accepting logs. After we've confirmed that the system is healthy, we run our postscript, and this we use to close the maintenance window that we had opened within the prescript. Again, we're searching for the hostname in the state, and then, along with the hostname and the SevOne credentials that we retrieved from Vault, we close that maintenance window.
That completes the process for server1. Then we repeat this for server2; we do the same thing all over again. We do the taint, we do the targeted apply. Then we repeat for server3: we do the taint, the targeted apply, and then it's complete. This whole process ends up taking about 20 minutes. With very little effort on our part, we can do this with a click of a button or a scheduled task.
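Put together, the sequential flow amounts to a loop like the one sketched below. The step callables are hypothetical stand-ins for the prescript, taint, targeted apply, health check, and postscript. Stopping at the first failure is an assumption of this sketch, not something the talk states.

```python
def redeploy_all(resources, steps):
    """Redeploy resources one at a time; any exception stops the remaining ones."""
    completed = []
    for resource in resources:
        steps["prescript"](resource)
        steps["taint"](resource["name"])
        steps["apply"]([resource["name"], *resource["dependents"]])
        steps["health_check"](resource)
        steps["postscript"](resource)
        completed.append(resource["server_key"])
    return completed


if __name__ == "__main__":
    log = []
    demo_steps = {
        "prescript": lambda r: log.append(f"maintenance on: {r['server_key']}"),
        "taint": lambda address: log.append(f"taint {address}"),
        "apply": lambda targets: log.append(f"apply {len(targets)} targets"),
        "health_check": lambda r: log.append(f"healthy: {r['server_key']}"),
        "postscript": lambda r: log.append(f"maintenance off: {r['server_key']}"),
    }
    demo_resources = [
        {"name": "vm-one", "dependents": ["chef-one"], "server_key": "server1"},
        {"name": "vm-two", "dependents": ["chef-two"], "server_key": "server2"},
    ]
    print(redeploy_all(demo_resources, demo_steps))
    print("\n".join(log))
```

Working through the servers one at a time means that, behind a load balancer, only one collector is out of service at any moment, and a failed health check leaves the remaining collectors untouched.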
But we didn't arrive at this overnight; it was a journey. Each piece of code that we created enabled further and more complicated automation that we could layer on top of it. And the path that we took was roughly the same order in which I described the pieces of this automation. First we worked on the Packer. This gave us consistent images for our deployment automation. Then we worked on Terraform and Chef. This gave us code in order to be able to do consistent deployments.
Then, adding to that, we layered PTFE on top of that for a consistent environment for our Terraform execution and a centralized place in order to be able to store our Terraform state. Finally, we were able to layer our redeployment framework on top of all of this and be able to automatically drive the PTFE workspaces in order to perform these redeployments.
There were a lot of people that worked on this that I would like to thank. First and foremost, I would like to thank the Petco leadership team who gave me the time and opportunity to be able to put this talk together for you all. Without them, I would not be here. I also want to thank the following IaC squad members who contributed to this code base: Rajaram Mohan, who worked on a lot of the initial Python for scripts where we didn't have a formal provider; Chellapandi Pandlaraj, who did a lot of the kickstart in the Packer, as well as some systems tuning within Chef; Sridevi Potluri, who worked on some of the monitoring pieces; Gabe Oravitz, for all things Nutanix; Tammie Thompson, for helping us out with the networking issues; Lee Yeh, who did most of the Active Directory design work; and Pratheepa Murugesan, who helped us out with service accounts for automation.
There are also some former IaC squad members that I would like to thank that have in the meantime moved on to other opportunities: Chad Prey, friend and mentor, who got the seeds for all of this planted; Paul Grinstead, who got me started with Packer and Terraform and got Petco started with PTFE; Vivek Balachandran, for all things Nutanix and VMware; and last, but certainly not least, Venkat Devulapally, who got me started with Chef. Thank you, all.
I would like to leave you with a car analogy. Why spend time and resources and energy maintaining a fleet of used vehicles when you can just get a new set of vehicles every month instead?
Thank you very much.