Learn the steps that Starbucks cloud engineering took in advancing its infrastructure as code capabilities to automate and simplify as much as possible for its engineers.
When Starbucks first started its cloud foundation services platform, the team had far too many repositories to manage and a lot of "snowflake" infrastructure. But there's a problem with infrastructure that is unique and special: it is easy to break.
Terraform modules helped the team create standardized, reusable blueprints for infrastructure. They also created a set of pipelines for their infrastructure, organized by component.
In this talk, Ryan Hild and Mike Gee from the cloud foundation services platform team at Starbucks share the details of how they used Terraform and Vault on their journey to better, more advanced infrastructure as code practices, including:
Modularizing declarative infrastructure code
Codifying the enforcement of standard tools, scripts, and workflows
Using the testing pyramid to decide what is needed
Observability for ops, architects, and business managers
Mike Gee: Hello, everyone. Thank you all for joining us today.
I'd like to start out by painting a little bit of a scene for you. The year is 2019. You're an engineer for a big company, and you manage a bunch of infrastructure: Vault clusters, Kubernetes clusters, Cassandra clusters, Kafka clusters. And you've Terraformed everything already, which is great: you've got your infrastructure in code. And you've spent, say, the last week or 2 deploying a big set of changes out to your environment.
And you come in one morning and you get a bunch of alerts that say you had a script that failed and caused an outage. If your infrastructure looked anything like ours did about a year ago, maybe 2 years ago, you're going to be getting ready for a pretty long week or 2.
At that point you're going to be getting ready to fix that bug, get it committed into your repo, pull request in. You're going to be testing it in pre-production, making sure that it works, and then slowly rolling it out to every single instance of that cluster you have out in your environment. Sounds great, right? What's wrong with that picture, though?
Ryan Hild: Well, it starts to feel a bit like an assembly line. You make your changes, and then all you're doing for the next week is rolling them through your environments. It gets very repetitive.
We've got very smart, very talented engineers on our team, and we think that there's a better use for their time. How can we put that time to better use? There are a couple of things. First thing that comes to mind is we can just hire more people, and then we get the job done faster. But it takes time to find people, and it takes time to train them, and it takes time to get them up to speed.
We've also taken a look at cloning. But, you know, there's a gestational period of 40 weeks, and then we’ve got to wait another 16 years after that before we can legally hire them, so it still doesn't quite work with our timelines here.
Mike Gee: Assume that you have the team that you have available today. You're probably not going to grow very quickly. How can we fundamentally change the way that we approach our infrastructure to basically regain some of that engineering time? That's what we really want to be doing, right?
We want to inspire you today to take a look at your infrastructure as code that hopefully you're working toward already or already have, and think about it from a software engineering perspective to maybe help you get some of that valuable engineering time back.
Ryan Hild: I'm Ryan Hild.
Mike Gee: I'm Mike Gee.
Ryan Hild: And we are part of the cloud foundation services platform team at Starbucks. What we've been doing for the past couple of years is providing infrastructure and services for some of our internal customers at Starbucks. Our big value add is that we allow teams to develop their applications and come in and host themselves on our platform, and we provide a platform that's built up and managed by us, and it's secure by default.
Mike Gee: A little bit about Starbucks. Starbucks is a fairly large company. I think most of you have probably heard of us. Based here in Seattle, we have over 30,000 stores across the globe, and in those stores we have 380,000 partners. "Partners" is what we call our employees. It's a pretty large company. We're spread across 80 countries, and every week we get up to 100 million impressions, or face-to-face interactions with customers.
Last year we had nearly $25 billion in revenue. As you can imagine, doing that basically a cup of coffee at a time, it's really important that the company runs like a well-oiled machine, that the technology behind the company is scalable and very reliable and secure.
Mike Gee: Our mission at Starbucks is to inspire and nurture the human spirit, 1 person, 1 cup, and 1 neighborhood at a time, and we try to make sure that, as Starbucks engineers, we practice this through our own technology as well. Our team was formed a few years ago around a specific purpose, and that purpose was to deliver the Starbucks Rewards program to the Japan market.
Ryan Hild: We put together a v0 iteration of our architecture for that, and this included organizing our code by modules. We had Vault, Kubernetes, Kafka, Terraform, all organized into their own repositories. We created repositories for the shared components that we used so that when we put together our networks, we could do that in a repeatable manner.
And then we put each of our environments together in a different repo so that we could get a picture of what all was deployed into a given environment just by looking at that repository.
Mike Gee: There were advantages to this approach. It was clear, when you were digging around the code in GitHub, exactly what type of infrastructure you were working with, where it was, whether it was production or non-production, etc. Also, with the component module design, we had a very clear release cycle: when we came up with new features, we'd tag those and get them deployed from pre-production through production.
And finally, like I said, infrastructure as code is really important, and we essentially achieved that from the beginning. We did have the advantage of going in greenfield, which helped.
Ryan Hild: There are also some downsides. We found that this organization led to us having a lot of different repositories to manage, so we had to constantly be changing which repository we were working with.
This also led to some variation in each repository. We would get situations where our engineers would make a change to deploy into a given environment, and then they wouldn't make that change in every other environment. So we were getting a little bit of drift between our environments, even though we were already using infrastructure as code in this way.
Mike Gee: Occasionally people would skip pre-production and go straight to production as well.
Shortly after we launched the Starbucks Rewards program in Japan, which was successful, we really focused on our next effort, which was taking that same project and scaling it up for North America.
For this next iteration, we wanted to solve some of the problems that we ran into with the first design, and one thing we decided to do was go with a monorepo structure. Instead of having repositories for every single environment and every single thing, we just had 1 repository for non-production and 1 repository for production.
Inside of those repositories, we defined our environments by a folder structure, and we put all of the Terraform module definitions inside of that. We also took outputs and remote state data references from Terraform, put them in a shared file, and symlinked that file everywhere, which had some benefits but also some problems, as you'll see.
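The shared remote state pattern described here usually looks something like the following in Terraform. This is only a sketch: the bucket, keys, and output names are hypothetical, not the team's actual values.

```hcl
# Hypothetical sketch of the shared remote state pattern.
# One component reads outputs published by another component's state file.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"           # hypothetical state bucket
    key    = "nonprod/us-west-2/network.tfstate" # the network component's state
    region = "us-west-2"
  }
}

resource "aws_instance" "vault" {
  ami           = var.vault_ami_id
  instance_type = "m5.large"

  # The consuming component now depends on the exact shape of the
  # network component's outputs -- the "rigid API" problem.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```

Because every consumer hard-codes the producer's output names and state location, splitting components apart later means touching all of them at once.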
Finally, we took our component modules repositories, from Vault, Kubernetes, Kafka, Cassandra, etc., and we made them de-ice-able.
Ryan Hild: I'm sure you all know what "de-ice-able" means. You've heard that term 100 times before, right? Probably not. We should probably go over that.
When you think about infrastructure, one of the common patterns in DevOps is you want to treat your infrastructure as cattle, not pets. Another way to think about that is you don't want to build snowflakes.
You do not want your infrastructure to be unique and different and special, because then anything that comes along could break it. We wanted to eliminate all of our snowflakes, and the tool we built to do that we called "the de-icer."
Mike Gee: What is on screen is a sample of what our early de-icer configuration would look like. It would reference the repository where the module code could be found. It would have the version tag that you wanted to release, the network that it belonged in, and then the various parameters that we defined for that component.
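The slide itself isn't reproduced in this transcript, but based on that description, an early de-icer configuration might have looked roughly like this. The block format, names, and values are all hypothetical reconstructions:

```hcl
# Hypothetical reconstruction of an early de-icer config.
# The de-icer transforms this into standard Terraform.
component "vault" {
  source  = "github.com/example-org/terraform-vault-module" # module repository
  version = "v1.4.2"                                        # release tag to deploy
  network = "nonprod-us-west-2"                             # network it belongs in

  # Component-specific parameters
  parameters = {
    cluster_size  = 5
    instance_type = "m5.large"
  }
}
```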
Ryan Hild: Back to our overall architecture of the v1. Again, we've had some benefits that we found through this. There were fewer repositories to manage, which was definitely one of our original pain points.
We found that achieving parity between our environments was now easier, because these configuration files were living right next to each other. And we were finally able to share resources more easily, because now, like we mentioned, it was just that symlinked remote state, and we used that throughout all of our projects.
Mike Gee: This design did have some downsides. Pull requests got really nasty. You'd be digging around in the code to figure out what changed; sometimes it would be 3 or 4 different components. It required a lot of discipline to make sure that our engineers were only changing specific things in the non-production repo.
It became challenging to automate, so when we did start thinking about automation, you have to dig around inside of a commit and figure out: What changed? What do you action? What do you ignore? Sometimes that can get very hairy.
And finally, because the components were using this shared remote state everywhere, we ended up with component code that had a really rigid API: it expected the output from another component to be there in a specific remote state file. If we ever wanted to tear it apart, it was going to be a nightmare.
Ryan Hild: I see a few people nodding their heads, like you've experienced this.
Mike Gee: Does this sound familiar?
Ryan Hild: We took a look at that, and we successfully delivered our North America launch. But we still wanted to continue to develop our platform forward, so we tried to identify some things that were good from that design and that we wanted to carry forward, but also we wanted to make some changes.
The biggest change we were dealing with was scale: our North America launch had about 4 times the infrastructure of our Japan launch. We decided to make another iteration, and when we did, we set out with some goals.
Mike Gee: Our goals for this next iteration: we wanted to make sure that our deployments were actually automated. Ideally, we didn't want engineers running terraform apply. We also wanted to make sure that operational tasks outside of Terraform were automated.
For example, when you deploy a change to a Vault cluster, an engineer would have to go destroy instances so new ones would come up. We wanted that to be automatic. We also wanted to have automated testing in place so that we knew that the system was healthy throughout this process.
And finally we were looking to make sure that we had consistency across all of our environments, that the same versions of everything were deployed everywhere, and that the platform from our customer's perspective internally was also consistent, so they were always interfacing with a known quantity.
Ryan Hild: These are the goals that we put together, and we decided to create a set of pipelines for our infrastructure. We organize this by component, and for the remainder of this talk we'd like to walk you through some of those changes that we made to each of our components, and what that process looked like, and where we ended up from it.
Mike Gee: Before we start talking about the specifics, if you're thinking about your own infrastructure right now, and you're thinking about, What could I adapt into a pipeline type model?, there are a few things we wanted to talk about that you might want to have as prerequisites when you're selecting something.
There are some "height requirements" before you start down this path. You want your infrastructure to be declarative, so you want to be using Terraform or Docker or something of that nature. You want standardized tooling and scripts, so that from pre-production through production, you're running the same scripts on every cluster, with parameters if needed.
Also you want to make sure that you have metrics available, so that as you're deploying things you can query and say, "Is this system healthy across the board?" If you already have some infrastructure that kind of matches that profile, that's a really good place to start. If you don't, pick one that's easy to adapt these things to, and then start with that.
Ryan Hild: The first thing that we put together was taking a look at automating our Terraform. How many of you out there are currently using Terraform? OK. And how many of you are automating your Terraform plans and applies? That's good.
If you haven't put together automation for your Terraform already, here are a couple of key points for you. You want to develop your scripts so that you have repeatable runs: the scripts should run the same way on a development machine as they do in automation, so that you can reproduce and debug any problems you run into. We found that it really helped to put these into containers: when you run in Docker, you get that consistent environment, that consistent setup, every single time.
For us, remote state management became very important. We had to be very prescriptive about how we were organizing our environments, and then our components within those environments.
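In practice, being prescriptive about remote state usually means a fixed naming scheme for backends, something like this sketch. The bucket, table, and key layout here are hypothetical:

```hcl
# Hypothetical sketch of a prescriptive remote state layout:
# one state file per <environment>/<region>/<component>.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/us-east-1/kafka/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks" # state locking
    encrypt        = true
  }
}
```

A fixed scheme like this lets automation derive the state location for any component from its environment and name alone.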
Mike Gee: Operational activities, rolling clusters, that type of thing—scaling that was really important for us. When you want to scale those things, it really means automating them, right? This means removing the hands-on activity. You don't want your engineers in there running scripts every time a change happens, ideally.
There are a few different things that we want to talk about as far as automating goes.
What do you want to automate? Ideally you want things that your team already has processes for. You probably don't want to go in and start automating something that you've never done.
It's highly recommended that you, at the very least, have something like a do-nothing script, which has worked really well for us. It's essentially a script that prints the manual steps out to a terminal; the engineer follows that process, hitting a key to go to the next step. It sounds kind of silly, but it works really well.
Then you can take those scripts and start plugging in your own script automation code inside of that, and eventually you've got a fully automated process.
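As a concrete illustration, a do-nothing script can be as simple as this sketch. The steps and cluster name are made-up examples, not the team's actual runbook:

```shell
#!/bin/sh
# A minimal "do-nothing" script: it automates nothing yet, but walks an
# engineer through the manual runbook one step at a time.
# Steps and cluster name here are hypothetical.

run_steps() {
  cluster="$1"
  for step in \
    "Check that $cluster is healthy before starting" \
    "Run terraform apply to update the launch configuration" \
    "Terminate old instances one at a time; wait for replacements" \
    "Run the post-deploy smoke tests against $cluster"
  do
    printf 'STEP: %s\n' "$step"
    printf 'Press Enter when done...\n'
    read -r _ || true   # tolerate EOF so the script also runs non-interactively
  done
  printf 'All steps complete.\n'
}

run_steps "${1:-vault-nonprod-usw2}"
```

Each read prompt is a placeholder: as you automate a step, you replace the prompt with the real command, until nothing manual is left.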
Then there are things like pre-deploy testing: Is the cluster in a good state? Am I ready to deploy?
Ryan Hild: Yeah, and update execution. We deployed a lot of Vault clusters, and we used autoscaling groups to do so. Our update process would go out there, update the launch configuration, and then terminate the old instances after the terraform apply happened. That was how we rolled out our changes to the Vault infrastructure.
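The launch-configuration half of that process looks roughly like this in Terraform (values hypothetical); terminating the old instances is handled by separate roller tooling, not by Terraform itself:

```hcl
# Sketch of the pattern described above. create_before_destroy lets a new
# launch configuration replace the old one cleanly; the old *instances*
# are then terminated out-of-band so the ASG brings up replacements
# on the new configuration.
resource "aws_launch_configuration" "vault" {
  name_prefix   = "vault-"
  image_id      = var.vault_ami_id
  instance_type = "m5.large"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "vault" {
  name                 = "vault-${var.environment}"
  launch_configuration = aws_launch_configuration.vault.name
  min_size             = 3
  max_size             = 5
  vpc_zone_identifier  = var.private_subnet_ids
}
```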
We'd also include things like post-deploy testing. After we finished an update, we'd run a few checks, run through some testing, and validate that our tenants could still connect to this cluster, generate new tokens, use the secrets inside of it.
We also had a rollback process documented. This was pretty much just doing the same thing in reverse, but putting us back into that known good state that we could work with.
When we focused on putting these things into code, we wanted to make sure that we were using tools that our developers were familiar with. If you have an engineer that has never worked with Golang before and they've suddenly got to debug why something isn't working in production, you might have a few issues there. So we made sure to choose tools that our team was comfortable picking up and learning.
Mike Gee: When you go to automate your operations, don't forget guardrails. Especially if you're dealing with data storage, where you could have corruption, make sure you use your lifecycle hooks, like prevent_destroy in your Terraform. Make sure you're testing the integrity of your data before and after your rollout.
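For example, the prevent_destroy lifecycle flag makes Terraform fail instead of destroying a stateful resource. The resource names here are hypothetical:

```hcl
# Guardrail sketch: Terraform will error out rather than destroy this volume.
resource "aws_ebs_volume" "cassandra_data" {
  availability_zone = "us-west-2a"
  size              = 500

  lifecycle {
    prevent_destroy = true
  }
}
```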
Ryan Hild: That leads us into talking about, How do we test our platform as a whole? We came up with a few different questions for that, which was basically: What kind of testing do I need? How many tests do I need? And what kind of tests am I going to be running?
When we took a step back and thought about, "OK, how do we answer these questions for our platform?" it really brought us back to thinking about where we were. Again, we'd deployed into North America and Japan. We'd set an expectation of security, and there was a stability aspect to what we were deploying.
Our developers are very used to how things were, and they didn't want things to break without them being warned ahead of time. So when we approached testing, we approached it as using it to increase the stability of our platform.
Mike Gee: Looking into that first question a little bit more—What types of tests do you want to think about running?—there are 2 categories that you want to think about. Functional tests: Is the system operating the way that you expect it to? And then nonfunctional tests: Is the system healthy? Is the user interface to that system what you expect it to be, etc.?
For functional tests, coming from a development background, you're looking at unit tests, testing small bits of the infrastructure; integration tests; and system and acceptance tests.
Ryan Hild: We also want to make sure we don't forget about performance of our system. If we have a really low latency right now—and that's kind of an expectation of our downstream consumers—we need to make sure we have that under test so that we don't break that without knowing about it beforehand.
We also need to make sure that we have good compatibility testing. Our interface to our tenants is just a Docker API, so there's a lot of range for different ways they could configure that. We needed to make sure that we stayed stable with the features that they were using.
Now that we've identified the tests that we need to write, the next question is, How many of those do I need to write? What helps here is thinking about the testing pyramid.
I've also recently seen the testing ice cream cone, where it's inverted. But the idea is that you think about how many tests you're going to write based on the speed at which they run and the cost to maintain them.
If you have a bunch of small, very targeted tests, those typically are referred to as "unit tests," and those typically run faster, and you'll run them much more often. When you put that together into higher levels and you test several components at the same time, that's what we refer to as "service tests."
Then, at the top of the testing pyramid, you have your UI tests, because that's where you write 1 test and it tests a bunch of functionality underneath the covers.
When you think about this and how to apply this for your team, what you need to understand is, What kind of value are you bringing to your team by writing this test? If you put together a framework that tests a component that's not going to get deployed, except for in maybe 1 or 2 instances, there might be better ways to spend your time.
Mike Gee: Finally, from the testing standpoint, I want to talk about how and where you're going to run these tests. Often you're going to have developers that are running them locally on their machine, and that's ideal, because you want to get fast feedback. You want to know if you broke something. It's not always possible, so you want to make sure they're easy to run.
Your objective is to make sure that people can figure out what's broken, right? So if you finally get to the pipeline stage and your changes seem to work locally, but something fails at, say, the deployment stage, you want to make sure that your developers can take that failure, rerun it locally, debug it, and figure out what went wrong.
Beyond that, you're going to have them running inside the pipeline itself. We talked about pre-deploy tests; you might have gates at each stage before you move on from a pre-production to a production environment. Then you're going to have one-off events, things like upgrading a major version of your data store and needing to migrate the data over. You might have scripts that you develop for those specific purposes.
Ryan Hild: We found that when we started writing tests for our one-off events, putting those into our repositories, and managing them like the rest of our code, those tests were reusable, both if we needed to perform a similar migration in the future and if we wanted to add another check to our smoke tests, so that after we're done deploying, we can make sure a thing still works the way it did before that change was introduced.
Mike Gee: Finally, we want to get into the meat of the talk, which is about the pipelines themselves. Here's another simple chart describing the flow that we developed internally.
What we were going for was a standard prescriptive process that we could apply to all of our components across the board, whether it was strictly Terraform code or Kubernetes deployment or whatever it was, kind of the lowest common denominator, and develop all of the scripts and tooling to match that pattern, and then stamp that out for every component.
What you get is basically multiple cycles at the end that promote through different stages. In our case we have alpha, beta, gamma, and prod stages, so that loop happens at each one: you promote the code, deploy it to that environment, and run all of your post-deploy tests.
And if they're successful for every instance of that environment, then you can proceed to the next one. If any single one of those failed, you would cancel that change, and the developer would want to figure out what went wrong.
Ryan Hild: What does that look like? Well, we wrote a configuration file, and we put together a format that was very prescriptive about the pieces we wanted to include. You'll notice in the code displayed on the screen that we have our stages defined, and these are very prescriptive names that we use. We always have a validation stage that performs our PR checks. We always have a build stage that produces the binary artifacts we're going to use and compiles any tooling we'll need during update execution or for smoke testing after we deploy.
We also have defined our publishing stage, which is how our artifacts are tagged or whatever it is that we chose to do for that particular component. But that's what marks it for being ready for deployment.
Then you can start to see at the bottom a little bit of how we define our deployments: we can define arbitrary groupings there and target specific pieces of infrastructure, deploying through them in sequence.
Mike Gee: That file you were just looking at would be a deploy config YAML file for a given component. As you can see from the sample folder structure, this would live inside the component repository as well.
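The on-screen file isn't reproduced in this transcript, but based on the description, a deploy config might look roughly like this. The exact schema and all names are hypothetical:

```yaml
# Hypothetical reconstruction of a component's deploy config.
component: vault

stages:
  validation:              # PR checks
    script: ./scripts/validate.sh
  build:                   # build artifacts, compile tooling
    script: ./scripts/build.sh
  publish:                 # tag artifacts as ready for deployment
    script: ./scripts/publish.sh

deployments:               # promoted in order, gated by post-deploy tests
  - group: alpha
    targets: [vault-alpha-usw2]
  - group: beta
    targets: [vault-beta-usw2, vault-beta-use1]
  - group: gamma
    targets: [vault-gamma-usw2]
  - group: prod
    targets: [vault-prod-usw2, vault-prod-use1]
```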
This is borrowing from our monorepo designs. We would define every place where there's a cluster. So each individual file represents an instance of that component, and inside of that would be a de-icer config with all of the parameters that are needed.
The nice thing about this design, when you look at a PR, there's maybe a single change that's happening if it's an instance size or something like that, and you get a small blast radius, which is good.
Ryan Hild: Coming back to that code we showed before, once we define those de-icer configuration files, we'd go ahead and reference every one of those that we wanted to deploy and manage. We'd reference it in our deployment stages at the bottom here.
Just like our de-icer tool transformed a configuration into Terraform, we also wrote a tool to transform this configuration into very opinionated Terraform. We'd run that tooling, then immediately run terraform init and terraform apply, and we would produce our pipelines within AWS CodePipeline.
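A heavily simplified sketch of the kind of Terraform such a generator might emit is below. The resource types are real AWS provider resources, but all the names and wiring are hypothetical:

```hcl
# Simplified sketch of generated pipeline Terraform. The real generated
# pipelines also include deploy stages with plan/approval/apply actions.
resource "aws_codepipeline" "vault" {
  name     = "vault-deploy"
  role_arn = var.pipeline_role_arn

  artifact_store {
    location = var.artifact_bucket
    type     = "S3"
  }

  stage {
    name = "Source"
    action {
      name             = "GitHub"
      category         = "Source"
      owner            = "ThirdParty"
      provider         = "GitHub"
      version          = "1"
      output_artifacts = ["source"]
      configuration = {
        Owner  = "example-org"
        Repo   = "vault-component"
        Branch = "master"
      }
    }
  }

  stage {
    name = "Build"
    action {
      name            = "Build"
      category        = "Build"
      owner           = "AWS"
      provider        = "CodeBuild"
      version         = "1"
      input_artifacts = ["source"]
      configuration = {
        ProjectName = aws_codebuild_project.vault_build.name
      }
    }
  }
}
```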
Mike Gee: The pipeline you're looking at now is defined in Terraform as well. There are providers for CodePipeline and CodeBuild, and we leveraged those to translate our pipeline definition into what you see here.
This is the top stage of the generated pipeline: essentially a source stage based on a GitHub webhook. When a commit goes into master, it triggers the pipeline, which first goes through a build stage.
What the build stage does is essentially bundle everything: it takes all of the Terraform templates and compiles any tooling we included, like the node roller for Vault (a tool that gets compiled in this exact process), plus any Docker images that get built, and they all get packaged up for future delivery.
Then we run tests for our build. We run Terratest: we stand up a temporary instance of that component, run the test suite against it, and make sure it does what we expect. If everything looks good, we finally publish that complete bundle as an artifact to the deployment stage.
Ryan Hild: Deployment stage looks a little bit like this. We have 3 phases. We have a plan, we have an approval stage, and then we go to an apply. Wonder where we got that from?
During our plan stage, we run terraform plan and perform a few other checks based on whichever component we're working with. We then send that output to our approval stage, where an engineer looks at it and decides whether it's good to go or not.
This is all done in parallel: when they move on to the apply stage, those clusters are all deployed in parallel, rolled in parallel, and then tested in parallel.
This grouping provided a lot of power. We can write focused tools that target a single instance, and they'll work on that instance the same way in alpha, the same way in beta, and all the way up through our production environments.
By the time any tooling or any code gets all the way out to production, we know that it's been tested against every single one of our clusters in exactly the same way it's going to be used.
We also found that by having an approval stage, we can hook into that and, say, if we wanted to output to Slack to tell our engineers that there's a new approval, that's where we started.
This allows us to do a lot of iteration. We can say, "Now go inspect those plan files that were generated and tell me how many changes were going through there." If we have other things that we want to check, we can add that into the pipeline and check those outputs and just do that iteratively in our pipeline.
Mike Gee: To go back to the visual of the pipeline diagram, this is what you end up with logically. We get consistent environments. We have a standardized config format that defines this thing. Same scripts are being run everywhere for every instance of that component. And it integrates really well with Terraform.
You're probably thinking, "How do I get there? It seems like a lot. Where do I start?" Think again about all of your infrastructure. Try and target something that you know is very simple. Our team started with Consul. Consul is a known quantity. It's a simple thing to manage operationally, and Terraform itself was also very simple. That was our first MVP (minimal viable product) for pipelines.
Once you get that out, don't stop there. Keep iterating, keep adding tests, keep adding scripts, and then start adapting your other components based on what you've learned. It's about just constantly making little changes. Don't try and do everything at once. It won't work.
Ryan Hild: This gave us a lot of power because it allowed us to evolve the design. Even though this is all part of our v2, the design of our pipelines has definitely changed over time. It started with a couple of minds building out the big idea, and then we gradually introduced it to the rest of the team so that everyone could collaborate.
If someone finds a feature of this pipeline that's not implemented yet, they have the power to go make a pull request and make that happen for everyone.
Mike Gee: In conclusion, if you invest your time in developing these types of automation, these types of pipelines, with a software engineering kind of approach, you're going to end up with a lot more time to do other valuable work. You can build new features. You're going to get rid of a lot of the operational tasks that are bogging you down. And you're going to get this really powerful ability to insert new logic.
For example, if you wanted the cost analysis features that have been demonstrated, you could build something like that into your pipelines. If you wanted auditing or compliance checks, you can do that as well. It's really completely up to you, and because you've developed the skills to manage these things internally, the sky's the limit.
For us it has really been a source of happiness. Everyone has more free time, and they're not doing stuff that they don't want to do, so you end up with a happier team altogether, which is really nice.
Finally, Starbucks is a technology company. We employ top-tier technologists. We're focusing on creating in-house solutions for opportunities that we have as a company, and it's really cool stuff. If you'd like to talk more with us after, come find us. Again, my name is Mike Gee.
Ryan Hild: I'm Ryan Hild.
Mike Gee: And thank you so much for your time today. Appreciate it.