Is Terraform a leaky abstraction? How should platform teams balance flexibility with prescribed workflows? Grainger shares its experiences in deciding how much Terraform devs should need to know.
Hi, everyone. Thank you for joining us for the last talk at HashiConf Global. We've had a great time here the last few days. We hope you have too.
My name's Emily. I'm a director of engineering at Grainger, and I lead our KeepStock engineering teams. And I'll tell you a bit about that in just a minute. And I've got my friend Dan with me here.
Hi. Dan Capetta. I am the director for enablement and platform engineering, and we'll learn a little bit more about what that is as we go through the talk as well.
We're here to talk about whether my team should need to know Terraform or not. The inspiration for this talk is a roughly four-year-long debate that I've been having with Dan and his teams about the best interface for the platform engineering teams at Grainger to expose to teams like mine. So, we want to talk you through a bit of our journey and what we've learned from it. But first, we want to explain a little bit about Grainger and what we do.
Grainger is the industry leader in maintenance, repair, and operations. We're a 13-billion-dollar company that stocks over 30 million product SKUs. We have 1.5 million of those SKUs in our distribution centers and 24 distribution centers spread out across North America. We have about 24,000 employees that help make sure that we get things to the customers. You look around a room like this, you'll see a lot of things that a conference center needs that Grainger can provide to keep their operations running smoothly, whether it's replacing an exit sign that might have gotten damaged or any of the other kind of consumable things you'd need to run this kind of operation.
And, I mentioned my teams support KeepStock. KeepStock is a part of Grainger's business that helps customers manage their inventory. So, we help customers have the right product in the right place at the right time. Doing inventory management's really a high-level optimization problem. It's about deciding when it's worth spending a little more money to have more product, because the risk is so high of stocking out of something important, or, when you need to optimize your space and make sure you're not taking up too much. But it's all about helping your customers keep their operations running.
We use a variety of different technology solutions to help our customers do that, and it's the software that my teams write. So we have a mobile app. We have an inventory management platform, a website, and we have a variety of different types of vending machines that we use to help our customers manage their inventory.
As I mentioned, one of the things my team does is build a delivery platform to help teams like Emily's deliver that software to our customers. We really have five major components of our delivery platform. The first is the tools that we build to support teams. We want to make sure that we're using reusable code, reusable libraries that really accelerate and simplify the development process. We build infrastructure automation leveraging Terraform to ensure that we're building services quickly, that they are able to scale, and that we are able to really meet the needs of the applications as they grow.
We have a variety of capabilities around driving quality into our software engineering process. We want to make sure that we're giving fast feedback to the engineers developing our software, and we're really focused on reducing the amount of manual testing that we're doing. We want to do all this in a secure and compliant way, embedding audit checks into the process, so we have traceability about the changes we're making in production. And all of these things are really focused on driving a good operational experience. We want to make sure that we've integrated monitoring, alerting, and logging into the applications that we're building, so that we're able to detect issues before they become problems, and really able to keep that good customer experience on the software that we're building.
Now that you know a bit about Grainger, and what we do, let's get in our TARDIS and go back in time to about four years ago. This will take us to when my teams first started building cloud-native applications, and when Dan's team was first starting to build out our delivery platform to help teams like mine do that. And, things started a little bit like this:
"Hey, Dan, I'm building an inventory system to show my customers how much inventory they have. To do this, I need a database. Nothing fancy, but can you help me with that?"
“Yeah, absolutely, Emily. We're here to build these kinds of database solutions for you and for other teams at Grainger. Here's your ticket, and we'll absolutely get to that as we work through our ticket queue.”
That sounded okay at first. You know, Dan's going to build my database for me. That means I get to focus on my inventory management system. I'm going to go work on that. It's going to be great.
And then, some time went by, and I realized that I had written a lot of code and still hadn't really shipped anything of value to my customers, because I still didn't have a database. And I hadn't heard from Dan in a while. So, I thought it was time to follow up.
“Dan, how's it going with my ticket? Can I have my database yet? I really want to ship this software for my customers. Can you help me out?”
“Emily, we've absolutely been working on it. We're not ready to build that database for you yet. We've been talking to the various stakeholders across the organization. We're making sure we're talking to security. We're talking to the ops teams to make sure that the automation we're building to get you that database meets all the requirements of the different stakeholders across the organization.”
“Okay. That sounds good. But, you're really blocking me from delivering value. Isn't there anything you can do, Dan, to help me out?”
“I mean, we have been iteratively developing the Terraform around building the RDS instance for you. We've been incorporating the learning that we have gotten from these different stakeholders. I mean, there's no reason we can't share the code that we've created so far, so you have something to start from. We're just not ready to build that database for you yet, but there's no reason you can't use what we've already started building.”
“Well, that sounds better. What would that look like?”
“We've started really creating this idea of accelerators. These accelerators are code repositories that have all the Terraform that we've written with all the information we've collected from those various stakeholders. We've built out documentation that's part of the repository. So, it's really this kind of one-stop shop that defines what we think a database should look like.”
“Okay. Cool. I'll try that out.”
So, again, went back to my team. I've got this Terraform accelerator, so I think I'm ready to ship my inventory management system. But, again, a fair amount of time went by and I realized I still hadn't delivered anything really of value to my customers. So, my team and I, we looked at each other and we said, "What have we been doing? Dan's not blocking us anymore, but we still haven't delivered anything. What is going on? What have we been doing?"
Here's what we realized: First, Terraform was a new tool for us, so we want to make sure that we understand the tools that we're using. So, we did some research on what Terraform was. And, we don't just want to use the tools that we're using, we want to use them in the best way.
So we did some research on that. And then, in the course of our research, we discovered that there's this kind of cool example of the kind of database that we want to create available on the Terraform registry, and maybe we should use that.
But then that led us to looking at, Do we really understand the kind of database we want? Are we doing this in the right way? And, we were looking at that as well. And then, yesterday afternoon, Jason, an engineer on our team who was most comfortable with Terraform, was out sick, so we just all went for coffee instead. And, at the end of the day, we've learned a lot.
Dan is not in our way. But, we still really haven't delivered a lot of value. So something still seems a little bit off.
“So, hold up, Dan. Wait a second. When your team first started, this is what I was promised: I was told that because of your team, my team was going to get to focus the majority of our time on creating customer value, and that that was going to be possible because your team was going to create platform capabilities for supporting infrastructure, so that I didn't have to worry about that. I feel like I'm in the weeds here. Is Terraform really the right abstraction for you all to be providing to me?”
“I mean, I'm not sure that Terraform is or isn't, but one thing I'm really pretty sure of is that I want to make sure that you have some mechanical sympathy for the infrastructure that you're using as part of your software solution. There's this idea that you don't have to be an engineer to be a race car driver, but you should have some idea of how the car works. And, I'm just really concerned that if it just happens and you just have a database, you're going to end up with a flat tire at some point, and that I'm going to be left being your operations team, which isn't why the platform team's here either.”
“I guess I understand that. We can give that a try.”
So, we took another pass. We got our database set up finally, and we felt pretty good about it. We got our inventory management system. It's supporting our vending machines, our website, our mobile app. Things are, we think, going pretty good. Until...
“So, Emily, remember how I've been talking with all those different stakeholders, and I've been having those ongoing conversations with security? They came back with a new requirement that we have to implement. We have to encrypt all of the data at rest when we put it into a public cloud. So, my team absolutely owns that accelerator. We're going to go make that change to that repository. The problem is, because of the way we forked it, I need you and everybody else that used that accelerator to go back and make the same change and reapply your Terraform to that database so that we get that encryption at rest.”
“Okay. I mean, we'll do it, Dan. Security's important. But, it seems pretty inefficient, doesn't it? Because every team like mine has to go make the same change.”
“It does. It does. We have to figure out a better way.”
So, this gets us almost up to date. Next, we'll talk about what we've done differently since that last example, and what we've learned overall.
One of the first things that we really focused on is making it easier for teams to adopt these accelerators or these starter kits. We created an engineering portal that helps curate and manage both the documentation and the starter kits, so that teams like Emily's can find the starter kits and really have a better understanding of how to use them. There's also a lot better automation around initiating or kicking off getting one of these starter kits put into a repository. So, instead of it being a manual process, it's something that happens through automation in minutes.
This solution really gives customer teams a much more self-service solution, and it gives teams that have a more complex use case the ability to start with a starter kit and learn the tools and make the changes they need to do their jobs.
One of the other things that we really focused on was, it can't just be around the technology and the process that we put in place. But it also has to be about the people, about the engineers that are using the solutions or the capabilities. So, one of the things that we've done at Grainger is build out an engineering effectiveness capability, and one of the things that that offers is our dojos.
Our dojos are set up to help change culture and practices, and to drive experimentation with the teams that come through. We want to really enable a culture and a mindset that is safe for teams to learn, where they can experiment and try new things in a safe environment, and learn quickly. And we want them to have an opportunity to practice and really hone their skills around the modern engineering practices that we want them to use, whether it's in Terraform or any of the other modern software practices we're trying to drive.
Another big shift that we've made is, in the early days of Dan's team, there was a lot of focus on starter kits. But that's really not enough. We realized that we needed to complement starter kits with libraries to help teams with their ongoing maintenance and life cycle.
Before Dan's team started building anything self-service, the effort for a team like mine to build something in the cloud was very high. It might take us weeks or even months to get a basic environment set up. And so, in the early days, Dan seemed focused on starter kits and that kind of first-day problem. And it made a huge impact. My team could then build a new system that was already in production in a day or even a few hours, and it was much better. But what we found is that the ongoing maintenance effort was still fairly high for teams like mine. And so, over time, Dan's team has pivoted to focus not just on the starter kit side of things, but also on libraries. And, because they publish versions and change logs, and give us fair warning, my teams can pick when it's right for us to upgrade, alongside the other delivery work that we have. And this has helped quite a bit.
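To make the versioned-library idea concrete, here is a minimal sketch of how a customer team might consume a shared database module and choose when to upgrade. The registry address, module name, input names, and version numbers are all hypothetical; the talk does not describe Grainger's actual modules.

```hcl
# Hypothetical call to a platform-owned database module.
# The source address, inputs, and version numbers are illustrative assumptions.
module "inventory_db" {
  source = "app.terraform.io/example-org/team-database/aws"

  # Pessimistic constraint: accept 2.x releases from 2.1 upward, never 3.0.
  # The team moves to a new major version by editing this line when its
  # own delivery schedule allows.
  version = "~> 2.1"

  name           = "inventory"
  instance_class = "db.t3.medium"
}
```

With a constraint like this, running `terraform init -upgrade` picks up the newest release that still matches. Unlike a forked starter kit, a change such as a new security requirement can ship as a module release instead of being re-applied by hand in every team's copy.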
Now, overall, we won't claim to have the perfect scenario figured out at Grainger for the right interface for a platform team. In fact, I really don't think that that exists. There's a lot of trade-offs. You have a lot of options. And there's no one solution that's going to work right everywhere. But, we've tried a number of different things, and we've come to some pretty strong convictions about some platform principles that are helping to drive the way that we make decisions now about the right interface.
One of those first principles that we really talk about is understanding the kind of solution you're trying to build at any given time. There are two types of use cases we think about:
That 80% use case, where a large percentage of the organization is going to use it.
The 20% use case where it's not something we're going to see a lot of adoption on.
The key here is to know that you have to manage those two things differently.
There's going to be things where, in the 20% space, the team that you're working with knows that they're getting in a car that's going to tear down a dirt road that's going to kick up a huge amount of dust, and they're mentally prepared to work with you and understand what they're getting themselves into. For the majority of teams, though, they're not interested in the process of going down the road. They're really focused on getting from point A to point B, and then in those cases, we really want to build a solution that enables a lot of people to get from A to B very quickly, safely, and predictably, because it's really about getting to the destination in this case and not about the journey.
The inventory management system that we used in this example is a pretty solid 80% use case for us. There are lots of teams at Grainger, not just mine, that want to build similar systems with databases. And so that's where Dan's team has spent a lot of effort, making sure that that path is very easy and quick to get started and maintain over time. In contrast, my teams care quite a lot about management and deployment of IoT devices, and we're the only ones at Grainger that currently care about that. So that's not a space where our platform teams have invested a lot of effort. That's something that my teams have done on our own — but, in close partnership with Dan's team, where it makes sense.
One of the other things that we realized, and tried to highlight in the example that we walked through, is that there's just no way a central platform team can build everything for everybody across an organization the size of Grainger. So, what we really have pivoted towards is this idea of enabling flexible contribution, but with appropriate guardrails. Because, without those guardrails, you lose a little bit of the trust that you have in the solutions that are being provided to the organization.
The analogy that I like to use is a reservoir. If we can get lots of different folks trying to solve the problem, each filling up the reservoir through the channel that they're contributing to, you fill the reservoir very quickly. If only one of these teams is trying to do it, it'll take a lot, lot longer. This open source / inner source model is working pretty well, and it really is focused on having good tests, with clear ownership about who's accepting the PRs, and on making sure that the quality of the contribution stays high, and that we're supporting teams as they learn how to contribute to these shared solutions.
And next, platform teams need to meet customer teams where they are, but be prepared for that to change. The kinds of problems the teams are solving are going to change over time. And even if the problems stay similar, the technology landscape around them is going to change, and the skills of the team are going to change. Or they're going to vary from team to team. So, it's important that the platform team is always paying attention to who its customers are and what they need now, and evolving the interfaces that it provides to best meet those customers' needs. You can also use coaching and dojos and training to help fill those gaps, where needed.
Also, platform teams should look at designing interfaces to prompt the right questions of their customer teams. I very much agree with Dan's premise that my teams do need to have some mechanical sympathy for infrastructure. If we don't, we're not going to make the right decisions about the software that we write that depends on it. But, there's also a lot we don't need to know. And, when you present a lot of really complicated, flexible options to a team that doesn't understand something, it's very easy to overwhelm them.
I think that one of the best things that platform teams can do is think about what controls do teams like mine really need to know and care about? We need to care about how our database scales. How it's replicated. We might not need to care about the details of the networking rules that keep it working with our VPCs. And so, making sure that when you design your interfaces, you help my teams focus on what we really need to care about. That will make a huge difference.
Lastly, always try to encapsulate the non-negotiable requirements. If there's something that our security team says that I'm not allowed to do, and no one else at Grainger is, don't let me do it. One thing that's tricky with this one is, sometimes your non-negotiable requirements are going to change. And so thinking about where you may be able to provide slightly higher levels of abstraction to protect your customer teams from that will help.
This is one of my favorite examples of a non-negotiable requirement: I came to LA from Connecticut, and Connecticut has a lot of these very scenic highways that generally run parallel to major interstates. This is a road that runs parallel to Interstate 95 in Connecticut, and some of the non-negotiable requirements that it has are these very pretty low bridges that are very rigid stone that are just immovable objects. Inevitably, somebody wants to go faster. They want to try to get from A to B more quickly, and this is the result, where that truck hits that immovable object, and you end up with, I think it's oranges, scattered all over the highway.
So, the goal here is to make sure that we're preventing those kinds of non-negotiable requirements from becoming something that someone tries to leverage. As we talked about earlier, one of those examples is encrypting data at rest. We absolutely have the variable for storage encryption in our Terraform modules, but as Emily talked about a little earlier, we're not trying to elevate this setting to a place where her team even knows where the setting is within the modules. We want to make sure we're calling out the variables around scalability, and make sure that those things are front and center within the variable files that we're using.
So, there's something like encryption, where we want to just make sure it's in there and we're controlling that it's happening, but we're not trying to make it visible to Emily and her teams.
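One way to picture that split, as a sketch only: inside a shared module, the encryption setting exists but is defaulted and guarded, while the scaling inputs are the ones a team is expected to set. The variable names, defaults, username, and the `postgres` engine are assumptions for illustration, and networking arguments are left out. The `validation` block is an extra guardrail added here, not something the talk describes. `aws_db_instance` and its `storage_encrypted` argument come from the Terraform AWS provider.

```hcl
# Inputs a customer team is expected to think about: sizing and resilience.
variable "name" {
  description = "Identifier for the database instance."
  type        = string
}

variable "instance_class" {
  description = "Instance size; raise this as load grows."
  type        = string
  default     = "db.t3.medium"
}

variable "allocated_storage_gb" {
  description = "Storage to provision, in GiB."
  type        = number
  default     = 50
}

variable "multi_az" {
  description = "Keep a standby in a second availability zone."
  type        = bool
  default     = true
}

# Present in the module, but not something teams are asked to set.
variable "storage_encrypted" {
  description = "Encryption at rest; required by security."
  type        = bool
  default     = true

  validation {
    condition     = var.storage_encrypted
    error_message = "Encryption at rest is required and cannot be disabled."
  }
}

resource "aws_db_instance" "this" {
  identifier        = var.name
  engine            = "postgres" # assumed engine
  instance_class    = var.instance_class
  allocated_storage = var.allocated_storage_gb
  multi_az          = var.multi_az
  storage_encrypted = var.storage_encrypted

  username                    = "app_admin" # hypothetical
  manage_master_user_password = true        # RDS stores the password in Secrets Manager
}
```

A team's own variable file would then list only values such as `instance_class` and `allocated_storage_gb`; the encryption default never has to appear there.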
And, of course, we want to make it very clear, Grainger does really love Terraform. We've gotten a lot of value from being able to declaratively provision our infrastructure and manage it through code. I'm still not 100% convinced that my team should need to know it, or be experts, but I don't want that to be confused with the value that we're getting from it at Grainger.
And with that, that actually wraps us up for today.