Deep dive into the architectural patterns that Starbucks used to build secret and identity management capabilities for 100,000+ retail edge devices with HashiCorp Vault.
Speaker: Andrew McCormick
Brief disclaimer: I work for Starbucks, but the opinions expressed in this presentation are my own and not necessarily those of Starbucks.
I am the lead security engineer at Starbucks. I've spent most of the last 6 years doing systems and reliability engineering in our retail technology environment.
I have a ton of experience with the scalability challenges in distributed edge environments. I specialize in edge computing and security, and I'm obsessed with automation.
The first thing that I learned when I got into engineering at Starbucks is that the edge at that scale is really hard. To give you an idea of what I mean by retail scale, Starbucks has over 500 licensees and 16,000 stores in North America alone. Each one of these stores acts as a standalone entity and is very isolated, but there are over 100,000 edge devices spread out across thousands of networks and stores that we still need to consistently secure and manage.
There is a wide variety of use cases for compute in a retail store, from point-of-sale terminals to temperature sensors and everything in between.
For the sake of this talk, I'm going to assume each device is only running 1 application. But if there are more, you may have multiple identities for each physical device. Each of these use cases has different requirements, and we don't want to spend $1 million protecting a $30 asset or vice versa.
There also aren't a ton of solutions that are designed for edge environments, and when they do exist, they tend to be more of an afterthought than an intentional design. Vault is definitely not immune to this problem, but let's talk about some of the features of Vault that can help accommodate our scale.
Scaling Vault was a unique challenge for our team because we weren't just new to Vault. This was also the first large-scale platform we had built with Terraform on Kubernetes. We went all the way down the rabbit hole with Terraform and Helm to implement everything as code, from the infrastructure and network all the way down to Vault policy and identity.
Ramping the team up on all these new technologies while trying to design and build the platform at the same time was a challenge. But the result was a resilient and scalable infrastructure that simplified several other aspects of managing the system at scale.
Kubernetes provides the ability to scale our pods up and down to handle surges of load, while Terraform and Helm enable us to rapidly spin up new clusters, test new features, or onboard tenants.
Vault Enterprise has 2 features that were critical for resiliency and scalability in our case: performance replica and disaster recovery clusters. Vault performance replica clusters were a significant aspect of our scalability design for 3 main reasons.
They provide a read-only cluster for edge devices to connect to without exposing the primary Vault cluster.
They provide horizontal scaling and regional load balancing.
Replication filtering enables physical isolation between tenants.
This last one was really important for us because some tenants may have 3 stores and others may have 10,000. Performance replicas allow us to give a very large tenant their own physical cluster that isn't used by other tenants.
Disaster recovery clusters were also critical for meeting the uptime requirements of retail. Vault can be safely operated without disaster recovery clusters by taking frequent backups and establishing processes for restoring them.
But from a service resiliency perspective, it's hard to beat swapping in a warm cluster that already has all your critical data. In an environment like retail that requires constant uptime, this was a big win. But scalable infrastructure is actually the easy part.
The real challenge is at the edge. Vault's design brings a ton of value to DevSecOps in edge environments. Vault has a very flexible and extensible implementation of authentication methods and secrets engines. You can authenticate with traditional methods like AppRoles and TLS certificates, as well as modern machine identity methods like Azure AD and AWS IAM.
It also supports a wide range of secrets, from static key-value secrets to short-lived PKI certificates to dynamic database and cloud access credentials. All of this is built and supported by HashiCorp and the large community that's gathered around their stack.
On top of this, you can build custom authentication and secrets engines unique to your use case or to add support for a product that doesn't have officially sanctioned engines yet. In the past, if you needed to manage PKI certificates, database credentials, API keys, and passwords, it would require integrating several different systems to effectively manage all those secrets at scale.
The administrative overhead of managing all those systems is significant, and that doesn't even address how you connect them, rotate credentials, and provide secure, auditable access.
This whole system is designed around the idea that credentials and secrets should be short-lived and subjected to fine-grained least privilege to limit the likelihood and impact of a compromised secret being used in an attack.
These identity-driven automated secrets flows can be a major contributor to an effective and scalable zero trust edge environment. To enable this flexibility, Vault has a full-featured REST API that can be used to build seamless self-rotating secrets flows into your application stack.
There isn't a single function in the Vault platform that cannot be accomplished with the API. The API is enhanced by tools like the Vault agent, a major differentiating feature that allows a client to automatically authenticate to the API to manage its secrets.
If a static key is rotated on the backend, the agent will automatically pick up the change. If a dynamic secret is nearing expiry, the agent will rotate it.
The agent also provides a caching API that can serve as a local Vault responder in an edge environment and provide some resiliency if connection to the Vault cluster is lost.
This is all great, but I haven't really mentioned the elephant in the room yet. How do we bootstrap unique identities to hundreds of thousands of edge devices? We call this the "Secret Zero dilemma," and it all comes down to trust.
In an edge environment, the only reason I trust the device is because I can identify it, manage it, monitor it, and brick it if I need to.
The biggest challenge in solving Secret Zero at the edge is scalability. If I need a unique credential for each of those 100,000 devices, the process must be fully automated, because any human involvement will overload operations teams at that scale.
Furthermore, since we have such a wide range of use cases and security requirements, we need a set of repeatable patterns that can fit into our various use cases.
Before I get into the details of the first Secret Zero pattern, I want to briefly explain a construct HashiCorp created called "response wrapping." Response wrapping is a method of securely delivering a secret to a device across an untrusted network.
When requesting a response-wrapped secret from Vault, you'll receive a short-lived one-time use token instead of the actual secret. This token can then be used to authenticate to Vault and get the actual secret ID out of a temporary secrets store called "the cubbyhole."
This allows you to limit the risk of the secret being compromised in transit by delivering a short-lived accessor instead of the actual secret. It also gives you the ability to know if a wrapping token was compromised before it got to the end device, because an audit event will be triggered if anyone attempts to unwrap the same wrapping token twice.
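To make the one-time-use behavior concrete, here is a minimal toy model in Python. It is not Vault's implementation or API: the class `ToyWrappingStore`, its methods, and the audit log format are all hypothetical names chosen for illustration. It only shows the two properties described above: the secret travels as a short-lived token, and a second unwrap attempt is refused and recorded.

```python
import secrets
import time
from dataclasses import dataclass, field


class WrapError(Exception):
    """Raised when a wrapping token is unknown, expired, or already used."""


@dataclass
class ToyWrappingStore:
    """In-memory stand-in for the cubbyhole (hypothetical, not Vault code)."""
    _slots: dict = field(default_factory=dict)     # token -> (secret, expiry)
    _used: set = field(default_factory=set)        # tokens already unwrapped
    audit_log: list = field(default_factory=list)  # suspicious events

    def wrap(self, secret: str, ttl_seconds: float) -> str:
        """Store the secret and hand back a random single-use token."""
        token = secrets.token_urlsafe(16)
        self._slots[token] = (secret, time.monotonic() + ttl_seconds)
        return token

    def unwrap(self, token: str) -> str:
        """Return the secret once; any later attempt is logged and refused."""
        if token in self._used:
            self.audit_log.append(f"second unwrap attempt: {token}")
            raise WrapError("token already unwrapped")
        entry = self._slots.pop(token, None)
        if entry is None:
            raise WrapError("unknown token")
        secret, expiry = entry
        if time.monotonic() > expiry:
            raise WrapError("token expired")
        self._used.add(token)
        return secret


store = ToyWrappingStore()
wrapping_token = store.wrap("example-secret-id", ttl_seconds=60)
delivered = store.unwrap(wrapping_token)  # the device unwraps first
try:
    store.unwrap(wrapping_token)          # a replay of the same token fails
    replay_refused = False
except WrapError:
    replay_refused = True
```

If the device itself is the one that gets refused, that is the signal that something else unwrapped the token first.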
Also, keep in mind these architectural patterns I'm about to cover are intentionally vague, so you can think about how variants of these patterns might fit into your environment.
Now let's dive in. The first pattern is the orchestrator poll. I started here because it's the only HashiCorp reference architecture that could really be molded to an edge environment.
In a lot of these patterns, I'm going to assume that devices in the edge environment have already been provisioned with a role ID as part of the imaging process.
Once your trusted orchestrator sees that there's a new device needing a secret ID, it can request a wrapped secret ID from Vault and pass it down to the device, which then unwraps the token and uses the role ID and secret ID to log into Vault. This gives us a nice dynamic flow that allows the orchestrator to rotate the credentials frequently and allows us to leverage the identity of the orchestrator to connect to Vault.
The orchestrator would likely be using one of those cloud or datacenter authentication methods I mentioned previously, which gives us a higher level of assurance than the client can provide in this pattern.
The downside is many orchestrators can't integrate with external systems in this way, and when they can, this type of tight coupling may be undesirable. Any issue with the orchestrator can also create a downstream impact that causes Vault credentials on managed devices to miss rotation and expire.
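The three Vault calls in this flow can be sketched against Vault's HTTP API as follows. The endpoint paths and the `X-Vault-Token` and `X-Vault-Wrap-TTL` headers are Vault's own; the `VaultCall` type, the helper names, the default wrap TTL, and the role name are assumptions made for this sketch. The AppRole auth method is assumed to be mounted at its default path, `approle`.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class VaultCall:
    """Description of one POST request to Vault (hypothetical helper type)."""
    path: str
    headers: dict
    body: dict


def request_wrapped_secret_id(orchestrator_token: str, role_name: str,
                              wrap_ttl: str = "120s") -> VaultCall:
    # Orchestrator step: X-Vault-Wrap-TTL asks Vault to response-wrap the
    # new secret ID instead of returning it directly.
    return VaultCall(
        path=f"/v1/auth/approle/role/{role_name}/secret-id",
        headers={"X-Vault-Token": orchestrator_token,
                 "X-Vault-Wrap-TTL": wrap_ttl},
        body={},
    )


def unwrap(wrapping_token: str) -> VaultCall:
    # Device step: the wrapping token itself authenticates the unwrap call.
    return VaultCall("/v1/sys/wrapping/unwrap",
                     {"X-Vault-Token": wrapping_token}, {})


def approle_login(role_id: str, secret_id: str) -> VaultCall:
    # Device step: exchange role ID and secret ID for a Vault token.
    return VaultCall("/v1/auth/approle/login", {},
                     {"role_id": role_id, "secret_id": secret_id})


def send(vault_addr: str, call: VaultCall) -> dict:
    """POST the call to a Vault server and decode the JSON reply."""
    req = urllib.request.Request(
        vault_addr + call.path,
        data=json.dumps(call.body).encode(),
        headers={**call.headers, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

In the replies, the wrapping token is found under `wrap_info.token`, the unwrapped secret ID under `data.secret_id`, and the login token under `auth.client_token`. How the orchestrator hands the wrapping token to the device depends on the orchestrator and is not modeled here.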
In the rest of these examples, I'm going to use the Vault agent as the API client for consistency, but this could be replaced by your own application that integrates with the Vault API.
In the client-based pull pattern, your trusted orchestrator installs a bootstrapping package on every device that gets provisioned. This contains the Vault Agent and a shared enrollment AppRole. This role would be minimally scoped and only have access to request a wrapped secret ID for other devices that are in that scope.
The edge device logs into Vault with the enrollment AppRole and requests a unique secret ID for the desired role ID. Vault would return a unique secret ID to the device, which would then use it to log into Vault and access its secrets.
This is a nice and simple pattern that could work well for use cases with low security requirements or other mitigating controls. It's dynamic and client-initiated, which allows you to use a relatively low time-to-live (TTL) for your credentials and enables clients to self-heal if credentials expire.
It's also more flexible than the rest because it doesn't require any unique functions or features from the orchestrator. As long as you can push code to a device, you can use this pattern.
However, this is also the downside to this approach. Having a broad shared role that can be used to assume the identity of any other device within that scope isn't great from a security perspective. You have low assurance on the identity of edge devices, and non-repudiation is nearly impossible.
The enrollment AppRole should also be minimally scoped and rotated frequently, which brings a whole new set of logistical challenges in large-scale environments with many scopes and devices.
The batch approach should look pretty familiar to some of you, as it's a common pattern for integrating between multiple APIs. Here, we have a cronjob that runs on a schedule, let's say every hour. This job queries the orchestrator for a list of all trusted devices that need a Vault credential.
This is the first model that allows us to do some additional posture-checking for trust. It could be as simple as, "I manage this, therefore I trust it," or as complex as, "I manage it, it has all of my security tools, it hasn't been flagged as compromised, and it's never left the geofence."
The cronjob will then loop through the devices from step 1 and request a unique wrapped secret ID for each device. Once Vault returns the wrapping tokens, the cronjob will assign those tokens to each device using the orchestrator's API.
The orchestrator then kicks off a bootstrap install package that includes the Vault agent and the wrapping token. The client unwraps the token to get a secret ID and then logs into Vault to request secrets.
This pattern is nice because it's fairly simple and well understood. Your cronjob is effectively just a small API broker, and response wrapping protects the secret as it passes through the tooling. So the level of effort here is fairly low.
This can compromise provisioning at scale, though. If you're trying to provision thousands of devices every day, a 1-hour interval for this cronjob could create a significant bottleneck.
This also tends to result in somewhat longer-lived secret IDs, because you need to leave enough slack in the system to prevent downtime caused by delays or issues with the cronjob. There can also be some orchestrator limitations here, as not all orchestrators have an API that will let you get and set metadata or run packages.
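The batch job described above is essentially a loop with a trust filter. The following Python sketch shows that shape under stated assumptions: the `Device` fields, the posture rule, and the two callables standing in for the Vault and orchestrator APIs are all hypothetical, and the lambdas at the bottom are stand-ins for demonstration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Device:
    """What the job learns about a device from the orchestrator (assumed)."""
    device_id: str
    managed: bool
    flagged_compromised: bool
    needs_credential: bool


def passes_posture_check(device: Device) -> bool:
    """Example trust rule: managed and not flagged as compromised."""
    return device.managed and not device.flagged_compromised


def run_batch(
    devices: Iterable[Device],
    request_wrapped_secret_id: Callable[[str], str],
    assign_token: Callable[[str, str], None],
) -> list:
    """One cron run: wrap a secret ID per trusted device and hand it off."""
    provisioned = []
    for device in devices:
        if not device.needs_credential or not passes_posture_check(device):
            continue
        wrapping_token = request_wrapped_secret_id(device.device_id)
        assign_token(device.device_id, wrapping_token)
        provisioned.append(device.device_id)
    return provisioned


# Stand-ins for the orchestrator and Vault, for demonstration only.
assigned = {}
fleet = [
    Device("pos-001", managed=True, flagged_compromised=False, needs_credential=True),
    Device("pos-002", managed=True, flagged_compromised=True, needs_credential=True),
    Device("sensor-7", managed=True, flagged_compromised=False, needs_credential=False),
]
done = run_batch(
    fleet,
    request_wrapped_secret_id=lambda device_id: f"wrap-token-for-{device_id}",
    assign_token=assigned.__setitem__,
)
```

Only `pos-001` receives a wrapping token here: `pos-002` fails the posture check and `sensor-7` does not need a credential. Because the loop only runs on a schedule, the wrapping token and secret ID TTLs have to outlast at least one full interval plus delivery time.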
The last 2 patterns are prototypes that I'm still tinkering with in my home lab. But before I dig into the details, I should probably give a quick primer on TPM. TPM, or Trusted Platform Module, is a standardized and inexpensive hardware security module that is becoming increasingly common in mobile and enterprise PCs.
For this, there are 3 main aspects of TPM that are most relevant. The first is the endorsement key pair. I'll refer to this as "EK_pub" for the public key and "EK_pri" for the private key. This is a cryptographic key pair that is stamped into the chip at time of manufacturing and never changes. It provides a good mechanism for identifying a device throughout its lifecycle as it moves to different locations or users or gets reimaged.
The second is the storage root key pair. I'll refer to this as "SRK_pub" for the public key and "SRK_pri" for the private key. This cryptographic key pair gets generated when a system or user takes ownership of the TPM, and it's rotated anytime the TPM is cleared. This key tells us when ownership of a device has changed hands.
Last but not least, TPM can be used as a secure local secrets storage for things like certificates and credentials so we don't have to keep them on disk.
The second Trusted Computing Group link I gave you is for a proposal open for review about adding a new key pair to the TPM spec specifically for device identity and attestation that could be interesting in these patterns.
Now that we know what a TPM is, how can it help? The enrollment gateway model was the first pattern that made me think about TPM because I wanted to see if the shared key model could be improved by having a stronger assurance of identity attestation.
In this pattern, your orchestrator installs the bootstrapping package that contains the Vault agent; enrollment key; enrollment client, which could be a service or even just a simple script; and optionally, a role ID if not provided at time of provisioning.
The enrollment client would request a Vault secret ID by sending a fingerprint to the enrollment gateway that contains the EK_pub, SRK_pub, hostname, orchestrator ID, and Vault namespace.
The gateway would then validate that fingerprint against an orchestrator or CMDB to ensure that those public keys map to a managed device that has our security tools in place and meets our posture check.
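As a rough illustration of that validation step, the sketch below checks a fingerprint against an in-memory inventory. The field names mirror the fingerprint contents listed above, but the `InventoryRecord` type, the choice to key the inventory by orchestrator ID, and the rejection messages are assumptions; a real gateway would query the orchestrator or CMDB instead of a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """What the enrollment client sends to the gateway."""
    ek_pub: str
    srk_pub: str
    hostname: str
    orchestrator_id: str
    vault_namespace: str


@dataclass(frozen=True)
class InventoryRecord:
    """What the orchestrator or CMDB knows about a device (assumed shape)."""
    ek_pub: str
    srk_pub: str
    hostname: str
    vault_namespace: str
    posture_ok: bool


def validate_fingerprint(fp: Fingerprint, inventory: dict) -> tuple:
    """Return (accepted, reason) for a fingerprint."""
    record = inventory.get(fp.orchestrator_id)
    if record is None:
        return False, "unknown device"
    if record.ek_pub != fp.ek_pub:
        return False, "EK mismatch"
    if record.srk_pub != fp.srk_pub:
        return False, "SRK mismatch: TPM ownership may have changed"
    if (record.hostname, record.vault_namespace) != (fp.hostname, fp.vault_namespace):
        return False, "hostname or namespace mismatch"
    if not record.posture_ok:
        return False, "failed posture check"
    return True, "ok"


inventory = {
    "dev-42": InventoryRecord("EK-A", "SRK-1", "pos-001", "tenant-a", True),
}
good = validate_fingerprint(
    Fingerprint("EK-A", "SRK-1", "pos-001", "dev-42", "tenant-a"), inventory)
recleared = validate_fingerprint(
    Fingerprint("EK-A", "SRK-2", "pos-001", "dev-42", "tenant-a"), inventory)
```

The second fingerprint keeps the same endorsement key but presents a different storage root key, which is what a cleared or re-owned TPM would look like, so it is rejected.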
If fingerprint validation succeeds, the enrollment gateway would initiate a nonce challenge by generating a random nonce and encrypting it with the SRK_pub and then encrypting it again with the EK_pub. This allows us to validate that the authenticating device has access to the private keys without the keys ever having to leave the TPM.
The TPM will use EK_pri and SRK_pri to decrypt the nonce and then respond to the challenge by sending the enrollment key signed by the nonce.
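The layered challenge can be illustrated with a deliberately insecure toy. The sketch below uses textbook RSA with tiny primes purely to show the order of operations; it does not talk to a TPM, and on a real device the private-key operations would happen inside the chip. It also interprets "signed by the nonce" as an HMAC of the enrollment key keyed with the nonce, which is an assumption. The helper names and the enrollment key value are hypothetical.

```python
import hashlib
import hmac
import secrets

E = 65537  # common RSA public exponent


def toy_keypair(p: int, q: int):
    """Textbook RSA from two small primes. Insecure; illustration only."""
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))  # private exponent (Python 3.8+)
    return (n, E), (n, d)


def apply_key(key, value: int) -> int:
    """Textbook RSA: one modular exponentiation encrypts or decrypts."""
    n, exponent = key
    return pow(value, exponent, n)


def sign_enrollment_key(enrollment_key: bytes, nonce_value: int) -> bytes:
    """Assumed response format: HMAC-SHA256 keyed with the nonce."""
    return hmac.new(nonce_value.to_bytes(8, "big"), enrollment_key,
                    hashlib.sha256).digest()


# Toy stand-ins for the TPM keys. The SRK modulus is smaller than the EK
# modulus so the inner ciphertext fits inside the outer encryption.
SRK_pub, SRK_pri = toy_keypair(8191, 131071)
EK_pub, EK_pri = toy_keypair(524287, 2147483647)
enrollment_key = b"shared-enrollment-key"  # hypothetical value

# Gateway: encrypt a random nonce with SRK_pub, then again with EK_pub.
nonce = secrets.randbelow(SRK_pub[0] - 2) + 2
challenge = apply_key(EK_pub, apply_key(SRK_pub, nonce))

# Device: peel the layers in reverse order using both private keys.
recovered = apply_key(SRK_pri, apply_key(EK_pri, challenge))
response = sign_enrollment_key(enrollment_key, recovered)

# Gateway: recompute the expected response and compare in constant time.
accepted = hmac.compare_digest(
    response, sign_enrollment_key(enrollment_key, nonce))
```

A device that lacks either private key cannot recover the nonce, so it cannot produce a response the gateway will accept.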
Now that we've validated a device's identity and it has authenticated with its private key, the enrollment gateway can request a wrapped secret ID from Vault and pass it down to the device.
The device unwraps the token and logs in with the AppRole. Obviously, this is a much more complex flow, but it also provides a lot of value. We still have a nice, dynamic client-initiated flow that allows for short TTLs without requiring a ton of specific capabilities from the orchestrator, and we get the higher assurance TPM provides.
We do still have a scoped shared key that needs to be managed and rotated, but the risks associated with that key are lower because of the fingerprint validation flow and posture check.
We do need to have infrastructure that allows us to validate our fingerprint, though, so it's important that your CMDB or orchestrator knows the TPM public keys for your devices.
The complexity of the previous pattern got me thinking about how to cut out the middle layers. We didn't need all those extra points of integration, and potentially failure, if we used the TPM to authenticate to Vault directly.
In this flow, our bootstrapping package is really simple. All you need is your Vault API client. The edge device will send a TPM auth request directly to Vault with the same fingerprint from the last example.
Vault could then reach out to the orchestrator or CMDB to validate the fingerprint using a set of pluggable validators. If the validation check succeeds, Vault would initiate the nonce challenge described in the previous pattern.
Once the device decrypts and returns the nonce, proving it holds the private keys associated with the public keys, Vault would issue a token and that device could then request its secrets. This is a really interesting pattern because it has the potential to bypass Secret Zero delivery entirely.
As long as your provisioning process scrapes the keys from every new device and your orchestrator or CMDB tracks the device throughout its lifecycle, we can use the hardware identity of the TPM to establish and maintain trust.
Having this be a full authentication method also allows the device to automatically get blocked from authenticating to Vault if it gets flagged as compromised or noncompliant.
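One way to picture the pluggable validators is as a list of small functions that each get a veto before the nonce challenge runs. The Python sketch below is only a shape for that idea and not a Vault plugin: the validator names, the dictionary-based fingerprint, the `tpm_login` function, and the random string standing in for a Vault token are all hypothetical.

```python
import secrets
from typing import Callable, Optional

# A validator inspects a fingerprint and returns a rejection reason, or None.
Validator = Callable[[dict], Optional[str]]


def known_to_cmdb(cmdb: dict) -> Validator:
    """Reject devices whose EK_pub does not match the CMDB record."""
    def check(fp: dict) -> Optional[str]:
        record = cmdb.get(fp["hostname"])
        if record is None or record["ek_pub"] != fp["ek_pub"]:
            return "device not found in CMDB"
        return None
    return check


def not_flagged(flagged_hosts: set) -> Validator:
    """Reject devices currently flagged as compromised or noncompliant."""
    def check(fp: dict) -> Optional[str]:
        if fp["hostname"] in flagged_hosts:
            return "device flagged as compromised"
        return None
    return check


def tpm_login(fp: dict, validators: list,
              nonce_challenge_ok: Callable[[dict], bool]) -> str:
    """Run every validator, then the nonce challenge, then issue a token."""
    for validate in validators:
        reason = validate(fp)
        if reason is not None:
            raise PermissionError(reason)
    if not nonce_challenge_ok(fp):
        raise PermissionError("nonce challenge failed")
    return secrets.token_urlsafe(24)  # stand-in for a Vault token


cmdb = {"pos-001": {"ek_pub": "EK-A"}}
flagged = set()
validators = [known_to_cmdb(cmdb), not_flagged(flagged)]
fingerprint = {"hostname": "pos-001", "ek_pub": "EK-A"}

token = tpm_login(fingerprint, validators, nonce_challenge_ok=lambda fp: True)

flagged.add("pos-001")  # the device is later flagged as compromised
try:
    tpm_login(fingerprint, validators, nonce_challenge_ok=lambda fp: True)
    blocked = False
except PermissionError:
    blocked = True
```

Because the validators run on every login, flagging a device takes effect the next time it authenticates, without any credential having to be revoked on the device itself.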
This pattern also has the lowest complexity and highest scalability from a workflow and infrastructure perspective, without sacrificing cryptographic security.
The downside is that code complexity is very high. This could be a custom authentication method that's built and managed in-house, but I generally don't recommend that for anything in production, because there's so much that can go wrong with building your own authentication, and the level of effort is fairly high unless you do this a lot. Ideally this would be an officially approved and supported authentication method from HashiCorp.
This is still a very early idea that I'm prototyping in my home lab, so I'd love to hear feedback from the community about things I may have missed, how you could improve this design, and how you solve Secret Zero in your edge environments.
There was so much to take away from our experiences with Vault that it was hard to condense it all into 1 slide. First and foremost, this takes time and resources to do right, a lot more time than we thought it would take. Getting our first operational Vault cluster standing was quick and easy. The process of architecting a holistic secrets management solution for the retail environment that met all of our security and business and technical requirements? Not so much.
The security implications of a system like this require thorough planning and collaboration. It's crucial to spend the time to work through how to implement least privilege, effective auditing, and all the operational processes that are required to run a system like this at scale.
Short-lived, dynamic secrets are much easier to manage in a system like Vault. They simplify automation and remove a ton of operational overhead and risk. We started with 1-year PKI, and it didn't take us long to figure out that the shorter, the better.
I was a newbie to Terraform at scale before this started, so state management was a huge lesson learned. At the very beginning, we had a handful of large state files that contained our major components. But as we hit pitfalls with that, we started breaking these down into smaller, more atomic elements. This makes your configuration easier to manage and reduces the chance of unintended consequences from a change.
Finally, organizational change is hard. Vault's a very different and more modern way of managing secrets than most organizations are used to. It can require code change for your application teams. It can shake up existing processes that have been in place for years.
To push a change like this through, you need to really engage your stakeholders, prove the value of the solution and how it benefits them, and make sure that this really is the right solution for their use case. In an organization of our scale, automation is just about the best driving benefit there is.
Thank you very much for joining me. I hope you found this as interesting as I did.