Adobe has been running Vault Enterprise in production for two years, and the platform now serves over 130 teams. Learn about the best practices and pitfalls of using Vault from this large-scale use case.
Dan McTeer, who helped lead the operationalization of Vault Enterprise at Adobe during his time there, now works at HashiCorp and continues to look at user feedback to try and make Vault better.
In his HashiConf EU 2019 presentation, McTeer talks about the monumental scale of Adobe's Vault usage, which had over 130 teams using it before he left (and over 100 trillion transactions). The talk focuses on strategies he gathered for operational readiness in Vault so that you can deploy services quickly and easily, making sure you're gathering requirements properly along the way.
Get ready to learn about Vault adoption topics around...
Thanks for coming today. Just a bit of a disclaimer. I was recently an Adobe employee. I moved to HashiCorp about three weeks ago. I was offered an opportunity here to help progress Vault—move it forward, work with customers like you—get a pulse of what you have going on, what challenges you're facing.
While at Adobe, I was part of a large team where we specialized in shared services. So we did a lot of things—DHCP, DNS, message queues, log aggregation, identity management. There was a multitude of things we were involved in. My background is all technical operations, network engineering, system administration, even managing SRE teams.
This is a big month for me. This month marks the two-year anniversary that Adobe has run Vault in production. Kind of a big deal. It took us about six months to deploy once we had the purchasing finalized—but a big month.
I've been involved in Vault for quite some time on an operational level. I want to make sure everybody understands that I don't work for Adobe anymore. I do not represent them, but I am happy to answer whatever questions you have regarding anything I say here today. I want to be as helpful as I can to you and your own journey.
I don't know if anybody got a chance to see some of the other Vault presentations going on here. J.P. Hamilton spoke earlier today about operationalizing Vault. There was another good presentation by James Atwill from Hootsuite yesterday.
I don't know if anybody caught that as well. I was concerned that we would have a bunch of talks—talking about operationalizing Vault—and we would all end up in the same place. But they complement each other quite well—so I want to make sure I direct attention to those two gentlemen. Definitely take a look at their talks if you didn't get a chance to see those.
My presentation today is more around operational readiness because my team provided over 20 different services at Adobe. Operational readiness was a key part of making sure we could deploy those services quickly and easily—making sure we were gathering requirements properly.
Part of my team was in Romania, and part of my team was in the U.S. People ask frequently—and I didn't add a slide about this—I had 6 total people on my team by the time I left Adobe, just managing Vault.
I want to talk about what those steps are, as far as preparing to deploy a Vault cluster and what you do post-deployment. What you want to do is start with your customers. By the time I left Adobe, we had onboarded over 130 teams to our Vault service, and as any of you can imagine, there's just no possible way to gather requirements properly from all those people.
I know there'll probably be questions around how do we still be effective at this. My response is pick a representative group of people that you think will be your biggest or most complex customers, and go to them and start having those conversations with them.
The specific requirements that I list here on the slides are all requirements that were provided to us by our own users. We had to be able to deploy multi-cloud. Adobe had customers that we were under contractual obligation to not deploy their infrastructure in AWS for certain reasons. We had to be able to work in multiple places. Most of our digital marketing infrastructure was still all in the datacenter, so we had to run there as well.
Then we needed a standard way to consume things. Our developers and operational staff wanted to be able to retrieve secrets in a way that was standard across the board—no matter where they were running that infrastructure they were able to get it in the same way.
Then we needed something robust and performant. There were a couple other tools that we were running at Adobe—the way they were architected would not stand up to tens of thousands of requests per second or a legitimate amount of response time.
Our developers actually came back in and requested that we respond to secret requests in under 50 milliseconds. In most cases—based on our monitoring—we were able to do that on average in about 3 milliseconds from Vault—regardless of which location they pulled from. Worst case in most situations was about 14 to 15—so we were meeting that requirement pretty well.
Then again, tens of thousands of concurrent requests. Adobe Digital Marketing is the backbone of much of the digital marketing that goes on across the globe. 47 of the Fortune 50 are customers over there, some big names that I know you've probably heard of. Tens of trillions of transactions every year are going through that infrastructure, so we have to have something that can respond to that quickly and easily.
Each of these products has its own contracts with external customers, and we had to be able to meet the needs of those customers, essentially.
I apologize. There's just so much to talk about in this particular topic alone. This could go for hours. So please ask questions after. I'm going to have to breeze through some of this stuff.
Once you have your customer requirements, it's important to document that in an SLA—and make sure you're talking with your customers and getting them to agree to that SLA.
Let them know what your promises are to them and help them understand how you're keeping those promises. Make sure they understand the definitions behind those things. You almost have to be a lawyer in some cases when it comes to this stuff. But then make sure you understand what KPIs are important to you, what your objectives are in meeting those SLAs.
At the very bottom there, you'll notice, “Validate Dependencies”. What I mean by this is the network is a dependency. You may have concerns from some customers as far as whether or not the network is up at a high enough availability for you to guarantee the availability they need. Make sure you're checking with those dependencies before you're going to your customers and making promises to them, as far as what you're going to provide.
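To make the dependency point concrete, here is a minimal sketch of the arithmetic, assuming the service needs every dependency up at the same time. The availability figures are hypothetical and are not Adobe's numbers.

```python
from math import prod


def serial_availability(components):
    """Availability of a service that needs every listed dependency up at once."""
    return prod(components)


def downtime_minutes_per_month(availability, minutes_in_month=30 * 24 * 60):
    """Expected downtime for a 30-day month at the given availability."""
    return (1.0 - availability) * minutes_in_month


# Hypothetical figures for illustration only.
vault = 0.9999
network = 0.999

combined = serial_availability([vault, network])
print(f"combined availability: {combined:.4%}")
print(f"expected downtime: {downtime_minutes_per_month(combined):.1f} min/month")
```

The combined figure is always lower than the weakest dependency, which is why it is worth checking with the network team before promising a number to customers.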
I recognize this is like a 10,000-foot overview of some of our architecture—and so, again, please ask questions afterward if you have any. We split things up in three ways. We had our canary/sandbox environment first and foremost. This was an environment specifically for my team to evaluate new features, new builds, test certain requests from customers to see if certain things were possible.
There were other reasons for this, too. The reason that the canary environment was so good was because we could take our snapshots from production and quickly rebuild this environment to do all of our testing against something similar to production.
It also helped us validate things like disaster recovery on a regular basis. We knew that our snapshots were good.
And then dev stage was more of a way for us to introduce some of those features to some customers who wanted to test those before they went to production. We offered a lower SLA.
My background's operational. I certainly understand the struggles behind building an exact replica of production. What we did was focus on the features specifically of what we were trying to test against.
Replication was an important part of the Adobe architecture. We wanted to make sure we had a few clusters in each of those environments so that replication was running and we could test our use cases against replication. It's important to have the features—not necessarily the size—but the features, for sure.
This is a very high-level overview of what the architecture looked like. Our primary datacenter was on the West Coast. The blue dots and earwax-colored dots represent the clouds. We had three clusters in each of these geographic regions, and they were all replicating from our datacenter on the West Coast.
That gave us 11 performance secondaries and a single master cluster. Regardless of where the apps accessing these secrets were being hosted, this is how we were able to deliver sub-15 millisecond responses in most cases.
I want to take a minute and focus on this. I hate to beat this in. I was in Greece a couple months ago with some friends, and we rented a brand new Audi.
Us Americans are not used to the tiny underground European garages. We were all waiting outside the hotel one morning, and one of my friends started driving the car up, and you can probably only imagine what happens next—that cringey sound of metal on concrete. He pulls out of the garage, and the whole right front fender is just scraped down to the metal.
Luckily—and I'm like a lot of you, probably—where I usually don't pay for the warranty or the extra money for insurance. This time I did. 13 dollars a day. For $13 a day, we were able to take that Audi back and pay a $29 administration fee, sign a piece of paper, and walk out of the car rental place scot-free.
It's critical to understand that most of the time DR is such a small price to pay for what you get in return for it. These statistics I pulled from the Gartner site. 82% of companies do not have a disaster recovery plan in place for certain services, or for all of their services in some cases. $100,000 an hour is the average, on the low end, of what you'd lose during an outage. In Adobe's case, in some cases, we would lose that much per minute of downtime.
54% of companies have at least one 8-hour downtime event within any given year. 93% who have a significant outage in any given year go out of business within the following year. Just for the record, I would consider an 8-hour downtime event a significant outage.
This is what our disaster recovery looked like. We had our primary datacenter on the West Coast. We had a disaster recovery cluster in AWS on the West Coast. Then we had a disaster recovery cluster in a separate datacenter on the East Coast.
The reason I like this model so much is because we were facilitating nearly 800 million requests a month by the time I left Adobe, which isn't terribly high, but you can imagine what it would look like to shift that traffic suddenly from the West Coast to the East Coast if we only had one DR cluster, and it only failed over to the East Coast.
We had one there locally and in the cloud to give us some level of separation from our datacenter, but to quickly failover and to keep the response time similar to what it was—what people were already experiencing if they were coming back into our West Coast datacenter.
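For a sense of scale, the monthly figure can be turned into an average request rate. This is only back-of-the-envelope arithmetic assuming a 30-day month; real traffic is bursty, so peaks would sit well above the average.

```python
# Rough conversion of the monthly request volume into an average rate.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month

monthly_requests = 800_000_000  # "nearly 800 million requests a month"
average_rps = monthly_requests / SECONDS_PER_MONTH

print(f"average rate: about {average_rps:.0f} requests per second")
```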
If it's not beat into your head already—if you do anything this year, build disaster recovery behind your services—particularly Vault. I think we should start an annual disaster recovery horror story night for HashiConf or something where we can all gather around and tell each other about all the terrible mistakes I've made or whatever.
Automation. Super important. Lots of different reasons. I'm not going to beat this to death. Some of this is pretty standard. Like, have deployment templates—we use Terraform. Obviously, it was the easiest way for us to deploy multi-cloud. No, that's not a sales pitch. We used open-source Terraform.
We had other things in place—deployment pipelines to trigger infrastructure provisioning from places like Jenkins. Configuration management. We were a SaltStack shop. I don't imagine many people are around here. You all seem smart. No offense to SaltStack. It was a good tool when it worked.
Auto-unsealing is super important. Any of you who use Vault know that if a server goes into a sealed state, it's not accessible anymore. So we followed the HashiCorp guidance on AWS KMS auto-unsealing.
We had a Rundeck process where we could trigger it via notification, and alert Watson and say, "Hey, this particular node's in a sealed state. Send that to Rundeck." Rundeck would kick off a script that would hit AWS KMS and send the unseal keys to that host and unseal it automatically. Then we would get a notification in Slack or via PagerDuty to let us know we needed to go investigate that further.
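The detection half of that loop can be sketched roughly as follows. This is not the Adobe tooling: it only assumes that each node answers Vault's `sys/seal-status` endpoint with a JSON body containing a `sealed` field. The node names and URLs are hypothetical, and the demo uses a fake fetcher so it runs without a Vault server.

```python
import json
import urllib.request


def fetch_seal_status(base_url, timeout=2.0):
    """Return the decoded JSON body of GET <base_url>/v1/sys/seal-status."""
    url = f"{base_url.rstrip('/')}/v1/sys/seal-status"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def collect_statuses(nodes, fetch=fetch_seal_status):
    """Map node name -> status dict, or None when the node cannot be queried."""
    statuses = {}
    for name, base_url in nodes.items():
        try:
            statuses[name] = fetch(base_url)
        except (OSError, ValueError):  # network failure or invalid JSON
            statuses[name] = None
    return statuses


def sealed_nodes(statuses):
    """Return sorted node names that report sealed or gave no status at all."""
    return sorted(
        name for name, status in statuses.items()
        if status is None or status.get("sealed", True)
    )


# Demo with a fake fetcher (hypothetical hosts; no network access needed).
def fake_fetch(base_url):
    if "node-c" in base_url:
        raise OSError("connection refused")
    return {"sealed": "node-b" in base_url}


nodes = {f"node-{s}": f"https://node-{s}.vault.example.com:8200" for s in "abc"}
needs_attention = sealed_nodes(collect_statuses(nodes, fetch=fake_fetch))
print(needs_attention)
```

In a setup like the one described, the resulting list would be what gets handed to the job runner and the chat or paging notification.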
Here are some of the things we monitor. We had a couple instances where customers would—and when I say customers I'm talking internal users—would request 600-700 thousand service tokens at once—not understanding how to properly use the service, and it would fill up certain system resources that would cause problems.
Number of secrets created. I have a story I want to come back to on that one, for sure.
Tracer bullets were very popular at Adobe. Adobe has a motto internally and externally, and the motto goes, "Make it an experience." This was echoed even on internal teams.
Everything we did for our internal customers, we wanted to make it an experience, and we wanted to understand how their experience was with our service. It wasn't just, "Oh, I can't ping it anymore," or, "The CPU load's high." It was, "How is this actually working for you, and is it working to your expectations?"
We use tracer bullets to do things like, "I'm going to write a secret, and then I'm going to read it back from that same spot, and then I'm going to send that data to somewhere so we can track that long term."
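A tracer bullet along those lines might look something like this minimal sketch. The `write` and `read` callables stand in for whatever Vault client you use, the probe path is hypothetical, and the demo runs against an in-memory dictionary rather than a real cluster.

```python
import time
import uuid


def run_tracer(write, read, path="tracer/probe"):
    """Write a unique value, read it back, and return a record of the result."""
    value = uuid.uuid4().hex  # unique per run so stale data cannot pass the check
    start = time.perf_counter()
    try:
        write(path, value)
        ok = read(path) == value
        error = None
    except Exception as exc:  # any failure counts as a failed probe
        ok, error = False, repr(exc)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {"path": path, "ok": ok, "elapsed_ms": elapsed_ms, "error": error}


# Demo against an in-memory stand-in for the secret store.
store = {}
result = run_tracer(store.__setitem__, store.__getitem__)
print(result)
```

Each record can then be shipped to whatever system you use for long-term tracking, so you measure the round trip a user actually experiences rather than just host health.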
For the record, Splunk is the service we used for that sort of thing. We sent a lot of data to Splunk and used it for a lot of different things: various pieces of reporting, alerting. I'm happy to share details about that afterward.
Operations per second. Request volume. Vault does have limitations—especially based on your hardware—on how it can handle those things. Those are things you always want to be aware of, especially with regard to capacity planning, too. Knowing when you need to bump resources, essentially.
Sealed state. We already talked about that. And replication state. Performance secondaries were a very critical part of what we did.
That last item, best practices. We advised our users to treat Vault performance secondaries as though they were a service similar to NTP or DNS or something like that—where you would have multiple points that you could reach out to pull a secret from. So if your first choice on the list wasn't working, you could hit the second and the third.
With service tokens, the complexity comes in—the fact that you have to pull a token from each of those clusters to access secrets on each of those clusters. We posted code examples for our internal users, so they would understand how to do that.
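The internal code examples are not part of this talk, but the general pattern can be sketched as below. The endpoint URLs are hypothetical, and `fetch` is a placeholder for a function that authenticates against one cluster (obtaining that cluster's own token) and reads the secret from it.

```python
def read_with_fallback(endpoints, fetch):
    """Try each cluster in order and return (endpoint, secret) from the first that works."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, fetch(endpoint)
        except Exception as exc:  # remember the failure and move to the next cluster
            errors.append((endpoint, exc))
    raise RuntimeError(f"all endpoints failed: {errors}")


# Demo with a fake fetcher in which the first-choice cluster is down.
def fake_fetch(endpoint):
    if "us-west" in endpoint:
        raise ConnectionError("cluster unreachable")
    return {"password": "example-only"}


ENDPOINTS = [
    "https://vault-us-west.example.com",
    "https://vault-us-east.example.com",
    "https://vault-eu.example.com",
]

served_by, secret = read_with_fallback(ENDPOINTS, fake_fetch)
print(served_by)
```

Ordering the list by proximity keeps the common case fast, while the later entries only matter when the first choice is unavailable.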
With the introduction of batch tokens in some of the newer versions of Vault, there's no reason to have that sort of complexity anymore. But we did find that hitting multiple clusters generally provided about 99.97% availability by using the service that way. We gave them that best practice.
The problem is, and any of you who've dealt with end users know this, that they don't always follow your guidance. So we looked at our audit logs inside of Splunk and determined which customers were hitting a performance secondary, and only a performance secondary, in one location. Then we generated a report based on that and proactively reached out to those customers and said, "Hey, we notice you're not using our best practices. Please shift how you do things this way so that you can have better availability from the service."
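The report logic itself is simple once the audit data is exported. Here is a minimal sketch, assuming each audit event has already been reduced to a record with hypothetical `team` and `cluster` fields; the real Splunk query and field names are not shown in the talk.

```python
from collections import defaultdict


def single_cluster_teams(events):
    """Return {team: cluster} for teams whose requests all hit one cluster."""
    clusters_by_team = defaultdict(set)
    for event in events:
        clusters_by_team[event["team"]].add(event["cluster"])
    return {
        team: next(iter(clusters))
        for team, clusters in sorted(clusters_by_team.items())
        if len(clusters) == 1
    }


# Hypothetical sample of reduced audit events.
events = [
    {"team": "payments", "cluster": "us-west"},
    {"team": "payments", "cluster": "us-east"},
    {"team": "analytics", "cluster": "eu-west"},
    {"team": "analytics", "cluster": "eu-west"},
]

report = single_cluster_teams(events)
print(report)
```

Teams that show up in the report are the ones to contact about spreading their reads across more than one cluster.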
These were the most critical pieces of documentation I would say that our users requested:
Having a Quick Start Guide. We had our best practices documented. As I said, we would have multiple use case examples associated with that documentation so people would know how to use that.
One thing I do want to mention here that was helpful to us. We had a service level review document that we would have our customers fill out to let us know what their expectations were around the service—and then we would evaluate those during the onboarding process so that everybody was on the same page.
Migration scripts. James Atwill from Hootsuite. This was really good. I'll call out his presentation yesterday because he had some amazing innovative tools that he provided to his end users. I would highly recommend looking at that and getting some ideas there. Anything you can do to prepare before you launch a service like this is going to be helpful—going to make things easy.
We had users on CyberArk and Thycotic—we had scripts that would help them copy things off of those tools onto Vault. And then we had policy templates so they would understand how to use HCL to write a template for Vault to provision access and provision certain things for their users.
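As an illustration of what a policy template could look like, here is a minimal sketch that renders a read-only Vault policy for one team. The mount name `secret`, the per-team path layout, and the KV version 2 assumption are all hypothetical; they are not Adobe's actual templates.

```python
from string import Template

# Read-only policy for one team's subtree, assuming a KV v2 mount.
# In KV v2, reads go through <mount>/data/... and listing through <mount>/metadata/...
POLICY_TEMPLATE = Template('''\
path "$mount/data/$team/*" {
  capabilities = ["read"]
}

path "$mount/metadata/$team/*" {
  capabilities = ["list"]
}
''')


def render_policy(team, mount="secret"):
    """Render the HCL policy text for a team, rejecting unexpected characters."""
    if not team or not team.replace("-", "").isalnum():
        raise ValueError(f"unexpected team name: {team!r}")
    return POLICY_TEMPLATE.substitute(team=team, mount=mount)


policy = render_policy("payments")
print(policy)
```

Handing users a rendered example like this gives them something to copy and adjust rather than starting from a blank file.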
Email lists and groups. We had an email list for our admin. We also had an email list for our users. We would use the email list for our users for things like announcing downtime or maintenance. And then we had an admin list that any user could email that would spam out to all of us who owned the service—ran the service—so somebody could pick that up and respond to our users.
Our chat channels were by far the most effective. Users would come in there and ask questions, and sometimes other users would answer those questions. It reduced our support burden immensely and gave people lots of quick answers—lots of quick responses.
Ticketing was reserved for the more complex stuff that nobody had ever seen before, or reporting potential outages.
Once you build the environments out, make sure you test them. Make sure you verify them. Make sure they're working as expected. Make sure they're working as your users expect. You want to make sure you get all this in place before you launch. This will make things much easier on you and everybody else.
Quick story here. We had one user who used Vault in a way that probably most people would not think of using it, and inserted millions of K/V pairs over the course of a couple days—over 10 million, even.
That caused some problems. There's a couple things I want to call out—a couple reasons why this is important. The first one is that we were absolutely blown away by how well Vault handled this. This was not an expected use case, and our second highest user was not even in the neighborhood of thousands. So millions was a different story. Vault handled it extremely well.
The second part is, make sure your users understand how to use the service. Once that happened, we changed our processes. If they're a new group onboarding, they needed to have a 15-minute meeting with one of my admins to talk about their use case. That gave us opportunities to explain better ways to use the service so that we can make sure we're catching situations like that before they cause problems.
If you're migrating from one tool to another, this is the process we followed to make that happen. You want to start months in advance—communicating out to your users what the change is, why it's happening, where they can find resources to help them move.
Then you want to make sure you have migration scripts in place to help them easily move secrets from one place to another. Thycotic's server uses paths similar to Vault's, which makes it very simple to write a script. I think it took us maybe an hour to put together a migration script for Thycotic to Vault. Super simple. But make sure you have those things available.
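The core of such a script is mostly path translation. The sketch below assumes a hypothetical export of (folder path, fields) pairs with backslash-separated folders; the actual export format of the source tool and the target path layout in Vault would need to be checked before use.

```python
def to_vault_path(source_path, prefix="migrated"):
    """Translate a folder-style source path into a lowercase, slash-separated path."""
    parts = [
        part.strip().lower().replace(" ", "-")
        for part in source_path.replace("\\", "/").split("/")
        if part.strip()
    ]
    return "/".join([prefix, *parts])


def plan_migration(records, prefix="migrated"):
    """Build {vault_path: fields}, failing loudly if two secrets would collide."""
    plan = {}
    for source_path, fields in records:
        target = to_vault_path(source_path, prefix)
        if target in plan:
            raise ValueError(f"two source secrets map to {target!r}")
        plan[target] = dict(fields)
    return plan


# Hypothetical export records.
records = [
    (r"\Teams\Payments\DB Prod", {"username": "app", "password": "example-only"}),
    (r"\Teams\Analytics\API Key", {"key": "example-only"}),
]

plan = plan_migration(records)
for path in plan:
    print(path)
```

Producing the plan as a separate step makes it easy to review the target paths with the owning team before anything is written to Vault.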
Then make sure you have good documentation around moving from one place to another—where to find your quick start guide. All that stuff.
Then what we did, that fourth bullet point, we scheduled some meetings, some brown bag seminars. We did one with the CyberArk group and one with the Thycotic group, and we said, "Here's specifically how you move from this tool to this tool." Then we were there to answer questions. That was a helpful, critical part of the process.
Then we met with the admins of those two tools bi-weekly, and that's where we would talk about who moved off what tool to which tool and what their experiences are with that.
This is how we tracked who adopted what. It was a manual process, but it was helpful because we were required once a month to provide a tracking dashboard to upper management—discussing who adopted what and when; what their experience was, how the different services were running, if there were any outages, anything like that.
That was a good process. Super helpful and made things super smooth.
Then reporting. Again, the adoption tracking. I want to call out SLA. We had dashboards that were publicly accessible to our internal customers that would talk about different KPIs we had with each service—and those were accessible at any time. Most of our data we pulled via Prometheus and funneled into Grafana dashboards.
That was what we found to be the easiest route for Vault in particular. When we first started building these reports out, the telemetry support wasn't as good, so a lot of the monitoring came via Splunk or other things. But with the new updates and the new telemetry options that you have inside of Vault, Prometheus is an easy route.
Make sure you have dashboards for your users, but in addition to that, make sure that you're sending them a monthly report—letting them know how well the service is running or some of the things you've run into. Users seem to like that. Again, it's good to make this a customer experience and not just a service they use.
About a year and a half ago, we started an internal training program and had no idea how successful it would be. We started going to different major office locations for Adobe around the globe—and within that year and a half, we trained over a thousand users on Vault.
It was Vault 101. We had the deck provided by HashiCorp, and we would put our own little Adobe spice into that. That was hugely helpful. In some cases we saw within an hour of the training ending people would be helping each other out on our Slack channels. Saved us a ton of time, ton of support overhead. Something to consider if you have the resources to do it.
The last thing I want to mention is chargeback. I know many of you don't necessarily need this, but I heard J.P. talk about it earlier—how they had to chargeback. We were funded centrally for our operational services. But as many of you know—working in a technical operations organization—there is not necessarily any quantifiable business value you can provide or show and so it's hard to say, "We're making this dollar amount for the business," or anything like that.
So this was a useful model. We used a lot of the metadata features in KV-2 to associate cost centers with certain namespaces to quickly and easily put together Vault usage reports for the purposes of the chargeback. If you have questions about that, please come talk to me.
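The reporting side of a chargeback model can be sketched as a simple proportional split. The namespace names, cost centers, request counts, and monthly cost below are all hypothetical, and how the cost-center mapping is read out of Vault metadata is left out of the sketch.

```python
from collections import Counter


def chargeback(requests_by_namespace, cost_center_by_namespace, monthly_cost):
    """Split a monthly cost across cost centers in proportion to request counts."""
    totals = Counter()
    for namespace, count in requests_by_namespace.items():
        cost_center = cost_center_by_namespace.get(namespace, "unassigned")
        totals[cost_center] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {
        cost_center: round(monthly_cost * count / grand_total, 2)
        for cost_center, count in totals.items()
    }


# Hypothetical inputs for one month.
usage = {"payments": 600, "analytics": 150, "sandbox": 250}
cost_centers = {"payments": "cc-100", "analytics": "cc-200"}

bill = chargeback(usage, cost_centers, monthly_cost=1000.0)
print(bill)
```

Namespaces without a mapped cost center fall into an "unassigned" bucket, which makes gaps in the metadata visible instead of silently dropping that usage.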
Then the last slide. I want to make sure you understand that this isn't a situation where you put the service up and let it go and hope for the best, right? Make sure you're refining these things as you go along because you will learn things quickly and often—so please make sure you're refining your processes as you go along.
That's it. Thank you.