Terraform for the Rest of Us: A Petco Ops Case Study
Oct 07, 2019
See how Petco uses Private Terraform Enterprise (pTFE) to provision VMware instances and control the process from IPAM to instance destruction, even with legacy workloads.
Terraform has changed the way Petco works in the cloud, but it also works well on-prem. Why compromise when you don't have to? Use Terraform to provision VMware instances and control the process from IPAM to instance destruction, even with legacy workloads.
- Chad Prey, Manager DevOps, Petco
- Paul Grinstead, DevOps Architect, Petco
Chad Prey: Hello, Seattle.
Paul Grinstead: Hey, Seattle. How's it going?
Chad Prey: We're really excited to be here to speak with you all. I know that the reason you all are here is because you want to hear about how we coaxed Terraform into making our VMware virtual machines.
But first, I'd like to tell you a short story. Before I started at Petco I worked at a place that's much smaller. In fact, one of our teams was the size of the entire engineering organization. And that's where I first learned the word "Terraform."
There were 2 engineers, Alex and Rod, and they came to me one day and said, “We're using this tool called Terraform and we want to show it to you.” They showed it to me and I thought, “Man, this is great. You did it, infrastructure as code.”
And that's our presentation. Thank you. No.
What happened is I got put into a different team. And I put Terraform on the back burner and developed this deep love affair with Vagrant.
Fast forward a couple years and a little bit of career success. I get this phone call one day. It's a recruiter. And they say, “We've got something for you. We looked at your LinkedIn. We think you're the right guy for the job. It's a big company. It's Petco Animal Supplies.”
I thought, “All right. Let's give it a try.” Having been introduced to Terraform, I felt there was still some work left to be done. And I got a chance to go back and reinvigorate that and make it happen for Petco.
But scale was the thing. It was cool to do it for the small organization; it's a lot more meaningful in my mind to do it when it's for a large organization.
Who we are
I want to take a minute to introduce Petco. We'll introduce ourselves. And then we'll continue along our journey.
I'm Chad Prey. I've been interested in automation since the Solaris JumpStart days. Any of you people out there use JumpStart? Yay, JumpStart.
From there I went to CFEngine2. Then to Puppet, back to CFEngine3. Chef, Ansible, back to Chef. And those configuration management tools change things.
But there was a paradigm shift. It was like when warp drive was invented. That's what it felt like when you first saw Vagrant and then Terraform. And we would just crush tickets using these tools.
My goal is I want to keep learning daily. I'm doing my best to remain helpful. And I would definitely say that the Petco team has been my inspiration for this.
A couple of fun facts about me: I'm a senior novice AMA District 38 dirt bike racer. And I'd like to say I'm a pet owner, a pet parent, but I'm not, because the cat that we have adopted us. Does that count? Anyway, enough about me.
Paul Grinstead: I'm Paul Grinstead. I am the DevOps architect for Petco. A little bit about me: I started in IT when I was 17 years old as an ops engineer for a really large ISP based in Los Angeles. I remember walking in this datacenter, and I was just a kid going, “Wow, look at all this equipment. What is all this?”
Even in 1996, I struggled with system life, patching, decommissioning. When I think back on all the roles I have had, that seems to be the common theme.
I, unlike Chad, have worked for big Fortune 500 companies, some of the biggest in the world. That's been my whole career. It's always been big orgs. Being in a datacenter wasn't very new to me.
I grew up in telecom. My parents both worked for the industry, and I have deep roots in design and architecting.
I've been with Petco for about a year, and it's been an amazing journey, not just for us as employees but for our team.
We have these moments where you have this high five or this huggable moment where you've worked so hard to get something done. You get it deployed out of production. You're like, “Man, that was awesome.” And then you have these other Debbie Downer moments where you're like, “Man, that stinks. That just doesn't work.”
A couple facts about me: I have 2 chocolate Labs, Shelby and Beanie. You'll see them in our presentation. We are a pet company, so I figured we'd throw our animals in there. And I build racecars as a hobby. If you're ever in Southern California, roll up.
Petco has been a part of pets and pet parents since 1965. The founder opened a mail-order vet supply company in San Diego. Over the past 50 years, Petco has been the trusted source for quality, premium pet products, and services.
We provide these products, services, and advice that keep pets physically fit, mentally alert, and emotionally happy. Everything we do is guided by our vision: healthier pets, happier people, better world.
This is everywhere at our company. You walk through the halls and it's plastered everywhere. It's so important to us.
But the thing we have to remember is, Petco is a very large corporation, but it very much feels like a family. We're very integrated; we all know each other.
I'm going to hand it back to Chad.
Chad Prey: You probably want to hear more about Petco, right? Probably not. Let's get into the code.
One of the things that I wanted to mention is that you're all in the cloud. You're already in these patterns. But there's Terraform for the rest of us. There's the cool kids, and then there's the rest of us. And the cool kids are sitting at the table having your lunch, and this presentation is for the rest of us.
We're solving real problems day to day with the tools that we're using. Let's get into it.
First of all, we have a slide for you:
- Where are we now?
- How did we get here?
- Where are we going?
We hope to answer these questions by showing you what we've done. There's no smoke and mirrors here.
When we were running through our deck, one of our engineers said, “That code is super-old. I'm embarrassed for you to show it.” If you notice it make sure you point it out. Tag us.
But that tells you how fast this space is moving. Because there's a release train, and you better get on it. You're going to get left at the station.
I took an interest in what HashiCorp CTOs Mitchell Hashimoto and Armon Dadgar were saying online, and I found some really great quotes and was going to present those to you.
But when we were finding logos for HashiCorp, we found this, which is the Tao of HashiCorp, which you heard Armon mention in his presentation. And it really resonated with us.
When you're selecting software, you need to be sure that the product offers the reason that you purchased it. It has to do that thing that you bought it for. However, some companies want to have something like a lengthy ProServe engagement or lock you in somehow. And, "Oh, yeah, we also solve your problem." But it's almost like an afterthought.
We get a bit frustrated when we find tools like that. And it's refreshing to be able to engage with a company whose tools everybody wants to use. It's really wonderful. It changes the way that we work.
Our Infrastructure as Code Squad values
I made this coffee mug, and all the people who were part of what we call the Infrastructure as Code Squad received one of these mugs. It has the Petco logo on one side and then "Equity at every step," which is one of our goals. We want to add equity at every step.
And these HashiCorp tools, the stuff that you're being shown today, they match our values and how we work. Hopefully, you'll see if we've done a good job. You'll see these themes repeat throughout the presentation.
One of the things is, automation creates opportunities to reduce operating costs without sacrificing speed or quality. One other thing is, you no longer have the luxury of holding on to outmoded mindsets.
Industry is not going to reward you for being the BOFH for your organization. That ship has sailed. The customer demands a certain level of responsiveness and level of service. Even your internal customers.
As service offerings become more commoditized, the customer doesn't have as much incentive to stay with the brand. They can just leave. "Boop! I'm gone. Thanks so much." I'm a fan of Peter Drucker, Stephen Covey, and Gene Kim. I have a copy of the Agile Manifesto principles hanging in my cube, and I can go on like this.
Paul Grinstead: I sit in front of him and he just talks all day. That's just the way it is.
Faster, easier system builds
Chad Prey: We told leadership that we can improve quality and go faster, and they said, "Show us."
The typical system that we used to build, just the system build took around 5 days. We now do that in minutes. Systems with low RPO/RTO (Recovery Point Objective/Recovery Time Objective) were fretted over. We had full backups, lots of management, and now we just rebuild them.
We tell our customers, "Push the button."
One of the things that you may hear in your organization—I know that we hear it—is, "I just want to get my job done." What I'm hearing the developer say is that they want to close their ticket. It's based on their lensing, the way that they're seeing their problem.
Private Terraform Enterprise (pTFE) allows the organization to leverage these tools to do that work that is abstracted from a developer.
Maybe they don't want to think about infosec. They're worried about their software architecture. They don't want to think about CMDB. Patching compliance, forget it. They just want to go fast.
The value of a hybrid approach
We built that into the deployment and the pTFE tools allowed us to do that. Petco has made a huge capital investment that hasn't been fully realized. I know a lot of you here are likely running VMware or have other on-prem solutions that you are trying to get all of the value out of before you transition into something else.
And there's a lot of life left in the systems if we can squeeze that value out using these excellent tools.
There are workloads that are better in the cloud. I will admit that. BI, it's going to be tough to do BI on-prem.
But there are also workloads that are better on-prem than in the cloud. That's why hybrid became our choice. And you'll see why pTFE became part of that.
Cattle, not pets
I tend to avoid jargon. But I wanted to introduce this concept of the "cattle not pets" pattern—which is ironic, because we're with Petco. Obviously, pets, you fret over them. You worry about them. Cattle is a different lifecycle. You don't have to take as much care.
The demo we're going to show makes use of this cattle-not-pets pattern. And ultimately this pattern lowers the threshold for things like operating system and application refreshes by building equity.
You're not going to be applying OS patches or application refreshes directly to the system; you just rebuild the whole thing wholesale. And it lowers your total cost of ownership.
The Petco toolbox
At the heart of the Infrastructure as Code Squad is a team committed to DevOps automation, and they crush it with GitLab. I just wanted to mention that when we're talking about our Petco toolbox. Our customers love us for bringing it on-prem, like we did something.
All we did was say, "Hey, this is really cool. You guys should use it." They're like, "Oh, thank you." And we're glad.
Our other on-prem tooling is the same stuff all you other large corporations have, but we've chosen to look at those resources a little bit differently. What APIs do they provide? How can we deeply integrate with them? How can we hook them together and safely provide these resources to our internal customers?
We got creative, and it's going to be an interesting year.
We chose HashiCorp tools. I asked our team, "How can we democratize our platform?" The team answered, "You democratize by empowering teams."
Terraform Enterprise allowed us to satisfy our customers’ change requirements and deliver code quickly. We can do both. You want everything tracked in ServiceNow? Fine, we'll do it. It's just that we're going to do it using their API. We're not going to have humans go through step by step and click through things.
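To make the "track it in ServiceNow through the API" idea concrete, here is a minimal Python sketch. It is not Petco's integration. It assumes the standard ServiceNow Table API endpoint for the `change_request` table and basic authentication. The instance name, service account, workspace name, and run URL are hypothetical placeholders. The function only builds the request, so nothing is sent until you call `urllib.request.urlopen` on it.

```python
import base64
import json
import urllib.request


def build_change_request(instance, user, password, workspace, run_url, summary):
    """Build (but do not send) a POST that opens a ServiceNow change request.

    Every request has the same shape, so reviewers always see the same fields.
    """
    payload = {
        "short_description": f"Terraform apply: {workspace}",
        "description": f"{summary}\nRun: {run_url}",
    }
    # Basic auth is an assumption; a real setup may use OAuth instead.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"https://{instance}.service-now.com/api/now/table/change_request",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_change_request(
        "example", "svc_terraform", "not-a-real-password",
        "sumo-collectors", "https://tfe.example.com/runs/run-123",
        "Add 2 Sumo Logic collectors",
    )
    print(req.get_method(), req.full_url)
    # To actually file the change: urllib.request.urlopen(req)
```

Because every request is generated from the same template, change requests end up looking alike, which is what makes them quick to review.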
You democratize also by reaching across silos, building trust, letting your customers know that you're along for the ride with them.
There are so many things that we've done because of that little phrase, "as the user." Base images got built. And like an M.C. Escher drawing, our tools deployed themselves. You've seen those 2 hands drawing one another?
Behind the scenes of self-service
One of the bigger challenges to understand is the pre-work required to be able to provision self-service enterprise Terraform workspaces. There's a lot that goes into it. You need to learn "sit" before teaching "stay."
We worked with HashiCorp and Nebulaworks to ensure we understood what we were building out. And to be sure the services were in place before buying the tools. And then we executed on what we decided.
In large corporate IT, good tools and partnerships are very important, but implementation is equally important. We pulled together a cross-functional squad of people from DevOps, infosec, storage, compute, and networking, and they make up a group called the Infrastructure as Code Squad.
We made a list of problems, put them onto a Kanban board, and started tackling them one by one. There were people that initially doubted, but they started to get onboard. First, seek to understand and then to be understood. There is a lot of work to be done here.
Paul is about to show you where we are now.
How it all works
Paul Grinstead: Thank you, Chad. In this video demo, we're going to walk you through the process of running Terraform inside of Terraform Enterprise with VMware and Chef using our CI pipelines that spit out an image for both VMware and AWS, along with our IPAM DNS solution, Infoblox.
The result will be a Sumo Logic Collector that's managed with infrastructure as code.
For those of you who don't know or maybe are a little bit curious, a Sumo Logic Collector is an agent that receives logs and metrics from sources, then encrypts, compresses, and sends them off to the Sumo SaaS.
Our on-prem collector endpoint requires configuration such as protocols and ports. Managed collectors can be deployed on-prem or in the cloud. At Petco, we do both. But our piece was about VMware.
I want to reiterate that at the beginning of this, it really comes down to having executive management understand what you're trying to produce and what you're trying to do. Without that high-level sponsorship, you're just never going to get anywhere. That's really important for us.
Thankfully for us at Petco, it worked out really well.
We use Packer to build a custom image for VMware and for AWS. We use a GitLab Docker Runner with a custom image for Python and PowerShell.
Using the Vault binary we log in with an AppRole to get our token. The token allows us to retrieve a secret. Those credentials are used to run Packer, Chef, CIS hardening, VMware tools, and the various other agents and helpers that we need.
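The AppRole flow described here has two steps: trade a role ID and secret ID for a token, then use the token to read a secret. The talk does this with the Vault binary. The sketch below shows the same two steps against Vault's HTTP API in Python instead. It assumes the AppRole auth method is mounted at its default path (`approle`) and that the secret lives in a KV version 2 engine. The Vault address, mount name, and secret path are hypothetical. The functions only build requests and parse responses, so you can read the flow without a running Vault.

```python
import json
import urllib.request


def approle_login_request(vault_addr, role_id, secret_id):
    """Step 1: POST the role_id/secret_id pair to the AppRole login endpoint."""
    body = json.dumps({"role_id": role_id, "secret_id": secret_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr}/v1/auth/approle/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_from_login(response_body):
    """Pull the client token out of the login response JSON."""
    return json.loads(response_body)["auth"]["client_token"]


def kv2_read_request(vault_addr, token, mount, path):
    """Step 2: read a secret from a KV v2 mount using the token."""
    return urllib.request.Request(
        url=f"{vault_addr}/v1/{mount}/data/{path}",
        headers={"X-Vault-Token": token},
        method="GET",
    )


def secret_from_read(response_body):
    """KV v2 wraps the key/value pairs in data.data."""
    return json.loads(response_body)["data"]["data"]
```

In a CI job, each request would be passed to `urllib.request.urlopen`, and the resulting key/value pairs would be exported as environment variables for the Packer run rather than written to disk.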
Post-creation, we tag it with "latest" but keep the old image around in case we have some reason to rebuild. We then copy the new image into the content library, and we use a custom Python script that knows where to distribute the template, either VMware or AWS.
This is all automated. We don't interact with it. This is all built in-house for us by Petco.
The image we created is now a shrunken image. It's a very small version of itself. We ran into problems in the beginning where our images would be massive—100 gigs, 200 gigs—and there's no reason to store that kind of thing.
We use a custom Terraform module written in-house to resize the disk and to attach or resize the storage. We can also attach persistent storage if we need to, depending on the application.
We also call our Infoblox Python script and Infoblox module, which take the count and output the requested number of IPs. This config can create 1, 2, 10, 20, however many boxes you need. It's just a matter of trying to figure that out.
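The core idea in this step is "give me as many free addresses as the count asks for." The sketch below illustrates that logic locally with Python's standard `ipaddress` module. It is a stand-in, not the Infoblox script from the talk: a real integration would ask the IPAM system for the addresses and reserve them there. The subnet and the in-use list are made-up example values.

```python
import ipaddress


def next_available_ips(cidr, in_use, count):
    """Return the first `count` free host addresses in `cidr`, lowest first.

    Raises ValueError if the subnet cannot satisfy the request.
    """
    if count <= 0:
        return []
    taken = {ipaddress.ip_address(ip) for ip in in_use}
    free = []
    for host in ipaddress.ip_network(cidr).hosts():
        if host not in taken:
            free.append(str(host))
            if len(free) == count:
                return free
    raise ValueError(f"only {len(free)} free addresses in {cidr}, need {count}")


if __name__ == "__main__":
    # Two addresses are already allocated; ask for three more.
    print(next_available_ips("10.0.0.0/29", ["10.0.0.1", "10.0.0.3"], 3))
```

Failing loudly when the subnet is too small matters here: it is better for the run to stop at planning time than to build ten VMs and only have addresses for eight.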
Once a server's been created, Chef will come in and bootstrap the collector based on the role. Register the collector with Sumo Logic, and you're done.
This is a condensed version of what's happening, so onward to our demo.
Building a VMware box: A demo
Here we have TFE with Terraform that we've already written. Our VCS connector to GitLab has already been established, along with the workspace. We use a standard GitFlow. Code is committed in a branch. The branch gets a pull request. And then it's merged into master after approval.
Our Terraform is now running its plan. Terraform builds the VMware box with CPU, RAM, disk, along with the network. These are all variables passed into Terraform and TFE. We provide this to our end users in template form, so they are 90% of the way there. Then the user only needs to provide some credentials and count, and they're off to the races.
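One way to picture the "90% of the way there" template is a set of defaults that the user overrides with only a few values. The Python sketch below merges hypothetical template defaults with user input and renders a JSON variables file of the kind Terraform can load (`terraform.tfvars.json`). All variable names and default values are illustrative assumptions, not Petco's. Credentials are left out on purpose, on the assumption that they would be stored as sensitive workspace variables rather than in a file.

```python
import json

# Hypothetical defaults shipped with the template.
TEMPLATE_DEFAULTS = {
    "num_cpus": 2,
    "memory_mb": 4096,
    "disk_gb": 40,
    "network": "app-vlan",
}

# Hypothetical inputs the user must always supply.
REQUIRED_FROM_USER = ("vm_count",)


def render_tfvars(user_inputs):
    """Merge user inputs over the template defaults and return tfvars JSON."""
    missing = [key for key in REQUIRED_FROM_USER if key not in user_inputs]
    if missing:
        raise ValueError(f"missing required inputs: {', '.join(missing)}")
    merged = {**TEMPLATE_DEFAULTS, **user_inputs}
    return json.dumps(merged, indent=2, sort_keys=True)


if __name__ == "__main__":
    # The user only states how many boxes they want.
    print(render_tfvars({"vm_count": 3}))
```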
Our plan's ready. And we'll go ahead and confirm our plan. One of the things that I like about confirming is that it's literal: Are you sure that you really want to do this?
This is the final step, and it gives our users the option of taking a step back, really thinking, and making sure this is what they want to do.
All of us that have used this application know that moment when you make a change and you're expecting to see 1 or 2 resources, and then it comes back and says, "53." You're like, "Holy moly." It's that pucker moment where you're like, "Oh, gosh." It's always best to take a look. You'll always have that opportunity before disaster takes over.
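The "I expected 2 and it says 53" check can also be automated. The Python sketch below reads a plan in Terraform's JSON format, as produced by `terraform show -json` on a saved plan file, and counts the resources that would be created, updated, deleted, or replaced. The threshold and the sample plan are made up for illustration. This is a simplified reading of the plan format, not a replacement for looking at the plan yourself.

```python
import json
from collections import Counter


def classify(actions):
    """Map a resource's `actions` list to a single label."""
    if "create" in actions and "delete" in actions:
        return "replace"
    return actions[0]


def summarize_plan(plan_json):
    """Count changed resources by action, ignoring no-ops and data reads."""
    plan = json.loads(plan_json)
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        label = classify(rc["change"]["actions"])
        if label not in ("no-op", "read"):
            counts[label] += 1
    return dict(counts)


def too_many_changes(summary, expected_max):
    """True when the plan touches more resources than the author expected."""
    return sum(summary.values()) > expected_max


if __name__ == "__main__":
    sample = json.dumps({"resource_changes": [
        {"address": "vm.a", "change": {"actions": ["create"]}},
        {"address": "vm.b", "change": {"actions": ["no-op"]}},
        {"address": "vm.c", "change": {"actions": ["delete", "create"]}},
    ]})
    summary = summarize_plan(sample)
    print(summary, too_many_changes(summary, expected_max=1))
```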
I want to walk you through the process of what's happening here. The video's going to go by in a flash, but I'll try to explain.
Once the VMware guest is built, it bootstraps to our Chef master. It next runs the default Sumo Logic Collector cookbook and a special role that defines the type of box that it is.
We have different types of collectors in our ecosystem. The Sumo Logic RPM is installed. Default configs are written in their default directories. The recipe then places the Sumo Logic config files based on the definition. Restarts the application. The cookbook then registers the collector with Sumo.
This follows our cattle-not-pets pattern. Going forward the server will not be patched. We will either decommission it or destroy it. Maybe taint the resources with our new image to redeploy. But we definitely are not patching anymore.
That's a high level of what we went through.
As you can see from our last part of our demo, we can log into Sumo, and we confirm that our new collector's ready to go.
We see that some of the inputs are grayed out. And this is because the collector is being run as infrastructure as code. There is no input for anybody to come in, no matter how high your credentials are, to override this.
I want to drive this home, how easy this is. With all my talking, and the magic of the video editing, this process is really only about 2 or 3 minutes. Granted, we have some development that has to take place, but I can cut 50 of these out in 10 minutes. Stack them out, send them on their way. It doesn't really matter.
A graphic flow of the process
Now that I've shown you the Terraform, Chef, and Sumo, what do all these steps look like? Well, it looks like this slide. Exactly.
We decided to render a graphic flow of this process. I'm not going to get into details of exactly what's happening here. Most of your workflows are going to go very similar to this. But imagine building all this by hand, doing this 3, 5, 10 times.
As a previous systems engineer, people would ask me, "What image do I use? What's the RPM gateway? What DNS servers do I use? Which version do we have to install? I don't have credentials."
It was this long, drawn-out process. You're constantly kicking the football over the fence and hope that someone else will pick it up.
Building infrastructure over and over can sometimes be boring and frustrating. Studying the flags over and over, remembering the steps, creating a checklist is painful. Having a human do this will cause drift and introduce inconsistencies in deployments.
Using Terraform along with the rest of the Hashi tools removes these barriers for Petco.
Believe it or not, writing Terraform and using HashiCorp's tools is easier than training a dog. But let us be serious about this. What you saw was a complete implementation of a Sumo Collector, start to end, without any person interrupting it or doing anything of the sort.
All this comes back to the what, why, and how of our presentation.
Petco wanted to reduce cycle time, reduce O&M, reduce deployment pain, and improve security. We're able to maximize the value of our on-prem VMware and Chef investments by using quality enterprise tools with infrastructure as code, so that we can create a repeatable, maintainable system.
And this is driven home for us every single day. This is our whole reason for being here.
I'll hand it back to Chad.
The rough patches
Chad Prey: All right. It's not all tail wagging. There has to be a little bit of Colonel Kurtz going up the river and the horror.
One of the things that you'll see is engineers need time to work effectively with this new mindset. You need to establish that mindset. You can't just give them the tool and hope that they're going to know what to do with it.
And change is hard. It's a paradigm shift. Like I said before, it's like having warp drive. Any Trekkie fans? And there are problems that you cannot anticipate. Some of the policy that you got right now just doesn't make sense.
The other thing is resource guardrails. You need resource guardrails in the cloud because you don't want your CFO coming down to your office saying, "Why did you deploy 50 R5.16XLs? We just missed our quarterly objectives because of your deployment." And this is a job for Sentinel. There's probably another talk on that. But I would definitely go and check that out.
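To show the kind of rule a guardrail enforces, here is a small Python sketch of the logic. It is not Sentinel code and not Petco's policy. It only illustrates the check that a policy-as-code tool would apply to a plan before the apply is allowed. It walks the `resource_changes` of a Terraform JSON plan, looks at new `aws_instance` resources, and flags instance types outside an allow-list or too many new instances. The allow-list and the limit are hypothetical values.

```python
ALLOWED_TYPES = {"t3.medium", "m5.large"}  # hypothetical allow-list
MAX_NEW_INSTANCES = 10                     # hypothetical cap per run


def guardrail_violations(plan, allowed_types=ALLOWED_TYPES, max_new=MAX_NEW_INSTANCES):
    """Return a list of human-readable violations; an empty list means pass."""
    violations = []
    new_instances = 0
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_instance":
            continue
        change = rc["change"]
        if "create" not in change["actions"]:
            continue
        new_instances += 1
        instance_type = (change.get("after") or {}).get("instance_type")
        if instance_type not in allowed_types:
            violations.append(
                f"{rc['address']}: instance type {instance_type} is not allowed"
            )
    if new_instances > max_new:
        violations.append(
            f"plan creates {new_instances} instances, limit is {max_new}"
        )
    return violations


if __name__ == "__main__":
    plan = {"resource_changes": [
        {"address": "aws_instance.big[0]", "type": "aws_instance",
         "change": {"actions": ["create"], "after": {"instance_type": "r5.16xlarge"}}},
    ]}
    for problem in guardrail_violations(plan):
        print(problem)
```

The point of running a check like this before the apply is that the oversized deployment is stopped by policy rather than discovered on the bill.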
The other thing that we noticed is, because of these tools, the engineers got excited, and they started creating these things that the rest of the organization needed time to be able to digest.
We like to think of this as an inchworm, where you'll have people go out and do something and then demo-retro it and pull the rest of the organization behind. And it moves in this inchworm pattern. Small changes delivered frequently.
Terraform did change Petco. We had a great deal of success with the business unit in 2017. And we earned their trust, and our customers wanted in on it. What we did was we gave them the reins. Teams are now much more engaged.
They really want to participate, and they're willing to participate and surprise.
One of the things that we noticed is we would go through our fleets of systems and we would find machines just lying around. What were these things doing?
Well, when it takes 5 days to get a server built, they had some emergency spare servers, like a Clif Bar in your backpack, just in case.
Those machines went away, and now all of a sudden our resource utilization goes down and we're able to do meaningful work for those systems. Our change requests are streamlined now; because of pTFE, they all look the same. It's really easy to review that. Our change review board is able to easily see what we're doing.
It's been over 2 years since I got that phone call. I'm glad I took it. We've done a lot, but there's still much more that's left to do.
pTFE and automations mean we have more time to work on our futures and initiatives. We're no longer spending so much time fretting over how we're going to get those boxes stood up.
And Petco is building on top of the stable platform that pTFE provides. We are building equity with every step. Every module release is just another bit of equity. We're retail. We're a pet store. But we still write some pretty killer code.
Paul and I did not do this alone. There's a bunch of people back at the shop. The Petco leadership, I asked them if they would be willing to trust me and Paul and the infrastructure-as-code team to build this, and they said yes.
Also, if you haven't met Naomi Watnick of HashiCorp, she is a super STAM (senior technical account manager). I don't think that we would be up here presenting to you without her assistance. She put us in touch with the people that we needed and got us the information that we needed from Hashi.
Also we had some great consultant help from Nebulaworks and Slalom. And the people on the right side of this slide are listed underneath Naomi, who's a de facto member of the infrastructure-as-code team.
I'd like to thank you all for your hospitality and for your attention today.
Paul Grinstead: Thank you, everybody, for attending. Please be safe out there.