Vault as a large-scale shared service, at Adobe
Jul 16, 2018
The developers at Adobe found HashiCorp Vault to be "head and shoulders" above anything else they tested. Here's how they got Vault up and running as a hybrid cloud and large-scale shared service.
What's the best way to run HashiCorp Vault as a hybrid cloud and large-scale shared service at your organization? That was the question Adobe had to figure out when developer demand for the tool skyrocketed after its introduction. Chandler Allphin, the lead security engineer at Adobe, shares Adobe's lessons and pitfalls from building their own best practices and standard procedures for a shared Vault service. He also covers the reasons why they decided to move to the Enterprise version.
Lead Security Engineer, Adobe
I'm Chandler Allphin. I'm an engineer at Adobe with a passion for DevOps, which is the reason that I'm here. I'm really excited to be here. It's my first time in Amsterdam, and I'm enjoying it so far.
I'm excited to talk about Vault as a shared service. I gave a similar talk at HashiConf last year, but that one was more about our architecture: why we went with Vault, the decisions that were made, a little bit about why we ended up purchasing Enterprise, and the scale that we needed.
After that talk, I got a lot of questions from people asking, "Okay, that's all great, but why run it as a shared service? Are you able to keep up with the scale with some of the native features of Vault?" So I thought it'd be good to put together a talk on running Vault at large scale and as a shared service. I think this will speak to a lot of the bigger organizations out there. (But it's just as relevant if, as in the talk we just listened to, you're going from startup to enterprise and making that jump.)
» The dynamics of Adobe: Many acquired technologies
Before I dive too far into my talk, I want to describe the dynamic of Adobe. I don't think I need to explain to anyone what Adobe does. Whether you know us for PDF or Photoshop or Creative Cloud, whether you're a fan of our subscription model or not, I don't really care. I'm not a salesperson.
But what I need you to grasp is that we are a company of acquisitions. Back when we started, PDF was our thing. Now it's a different dynamic. A ton of acquisitions have come into the company, and that makes my job a little bit more difficult.
I work on the core services team at Adobe. We look across the organizations, see what services, technologies, and tools that people are using and figure out, "Does this make sense to run as a shared service?"
So when we spin something up, it's for hundreds of teams, and we have to look at a very large scale and ask whether it makes sense. Does secret management make sense as a centralized service? One of the first projects they gave me when I started at Adobe as a software engineer was: take a look at our secret management and see if this makes sense.
So—goals of my talk, I wanted to start and just kind of share a success story, what's been good about Vault, where we're at today, running it as a shared service, some of the decisions that we made, and then key wins. But the bulk of my talk is I want to talk about our lessons learned.
After my talk last year, I had a lot of questions on, "how are you doing this as a shared service?" So I'm going to be a little bit revealing up here and just talk about the pitfalls that we had, what we would potentially do differently, because at the end of the day we're all trying to tackle the same things. We've learned a lot of things running this as a shared service and are learning even more as we continue to develop.
And then I'll finish with a demo. In full disclosure, I want to set expectations. I know that yesterday someone used open source code as their demo, which was super awesome. That's not gonna happen here, so I'm sorry. The demonstration I'll be giving is not so much about the code that's written (it's really just some Jenkins and Python jobs working together) but more about the process of automating policy management, which, if you're familiar with Vault, is a bit of a pain point. So I want to talk through the decisions that we made there.
» The challenges to address
Challenges in secret management: this was three and a half years ago, when we were really looking. When I first joined Adobe, these were the things we were looking to address. The first one, by far, was that we were lacking a company-wide solution. It's not that we didn't have a secret-management solution; it's that we had several!
10, 15, 20 teams were doing a lot of different things, and we didn't have a central place where all of our audit logs were going. We didn't have great policy management either: when auditors came and asked questions about an application and the secrets it had access to, we ended up tracking down teams and looking for ways we could answer their questions.
So what we were really searching and hoping for was a solution that could handle the large scale of Adobe and fit all of the different use cases. And that's what we found with Vault. I'm not going to go in depth on everything we looked into and all the tools that we tried out, but Vault was head and shoulders above the rest.
Large scale, distributed, multi-cloud, I think these are all things that we're familiar with at Adobe. We have a lot of infrastructure in AWS, and then a partnership with Microsoft comes out, Adobe hearts Microsoft. And so, of course, we're in Azure. We also have our own data centers. We have to be this distributed multi-cloud hybrid cloud environment at a large scale.
So for us, it's not good enough to say our secret-management solution is going to be KMS (the AWS Key Management Service) with encrypted S3 buckets. That doesn't work. We have to have a cloud-agnostic tool. In steps HashiCorp's toolset. It just fits our needs really well. Large scale is the big piece there: it's being able to scale up to what our Experience Cloud is expecting.
Secure introduction with microservices: I think Armon did a really good job walking through what we're all trying to tackle here, moving from monolithic applications (some of those crusty applications are harder to get rid of than others) into a microservices architecture. We needed a secret-management system that could tackle secure introduction dynamically and securely, but also not be platform specific. We couldn't rest on, "Hey, we're on Kubernetes, so we can leverage this specific tool within Kubernetes." At Adobe we live by a thought from our lead architect. I'm sure he didn't come up with this. It's probably something he stole from somewhere.
The thought is that every technology decision, with time, is a bad decision. So we can't marry ourselves to just one system. We have to have our abstraction layers and our pieces in place so that if we ever need to drop whatever scheduler or orchestrator we're running and move to something else, we're able to do that.
Numerous use cases
We have a wide variety of use cases. This goes back to being a company of acquisitions. With every new acquisition (it was just announced that Adobe purchased Magento, and that was all news for us), we're reaching out to that company and figuring out, "Okay, what opinions do you have about secrets management, for example? What are you using now, and how can we help you get onto our infrastructure?"
If we have this limited use case and you have to integrate with us in this one way, then they're going to kind of laugh at us and not be as willing to move over. So we needed a tool that was flexible enough to fit a wide variety of use cases.
» Choosing Vault: Key wins
Where we're at today, we deployed Vault, started with open source, quickly realized that we needed Enterprise for replication purposes. We also leveraged the HSM functionality. We liked the disaster recovery functionality, all of that good stuff.
We've been running Vault Enterprise for a little over a year now. Very happy. We have 12 replicated clusters across Azure, AWS, and our private data centers.
Things are going great. We're working very closely with HashiCorp to continually improve this. It's not a perfect setup by any means, but it's accomplishing what we are trying to do. We now have a centralized place, a system that's highly available, replicated across multiple clouds, and when our auditors come and ask for a specific application, what secrets that has access to, we can go to one location and pull the logs and show exactly what's happening. And it's a huge win for us.
I talked about wanting to go over key wins. I think there are a lot. Policy management has been a big win for us, being able to better control what policies are out there and how people are rotating their secrets, how people are distributing their tokens to their applications and so forth.
One key win I wanted to point out here, which maybe doesn't seem as key to you in the crowd, but for us—we had an orchestration tool, we had schedulers, we had this microservices environment, and I could have drawn that all up with a really complicated diagram, but my point is—we have secrets: we're trying to get them to the end application and Vault is the tool that just fit in really nicely for us to fill that gap.
So going through Mesos and Marathon with DC/OS, our complicated approach is able to securely inject secrets. Our teams were happy. The secrets are centrally located, and we took the burden of introducing secrets to their applications away from them. But that's not the key win.
The key win came when my upper management came to me and said, "Hey, we're really sorry, but ramp up your team because we may be moving to a different orchestrator or different scheduler, and we're taking a look at these things." So what are all the changes we're going to have to make with Vault, ripping things out and all of that?
What do we have to do? The key win for me is that we have a very easy abstraction layer. We're going to change two API calls, and then it handles it just fine. Because we had centered ourselves on this centralized shared service and all teams were consuming it, our engineering teams didn't have to worry about a thing. Our schedulers were changing, but the teams didn't have to do a thing. It's just a couple of API calls that we're changing, and that was a key win for us. So enough of that.
» Lessons learned
From key wins to lessons learned: that's really what I want to focus on in this talk, because there were a lot of lessons learned. This is definitely like standing up here exposed. No one likes to talk about all the pitfalls and the mistakes that they made, but we learned a lot. We started a good two and a half years ago with Vault, and it was a different dynamic back then than it is today. So I want to talk about some of the things that we learned and where we're going from here.
I thought about it last night. What I could have done with this talk is stand up, say "Vault secrets, running Vault as a shared service, use namespaces." With Jeff talking about it yesterday, really a lot of the things, a lot of the pain points and the lessons that we've learned are solved by what's coming soon. Right? We don't have any timeline exactly on that, but according to Jeff, it's coming soon. Namespaces is a big deal and I'll talk through why.
Single-tenant vs. multi-tenant mounts
So this slide's gonna seem really weird at first, I'm sure, as Vault is today. But when we first started looking at Vault, rewind two years ago, it was, "Okay, we have these mounts. Vault's very UNIX-based and so we're just going to have this one mount that's gonna, for example, be a KV store. And we're going to build all of our teams off of this one mount and we're going to control policies to them through ACLs. They're going to be able to build out their own tree. We can then develop more policies around what they have in place and it's going to be great."
It didn't make a whole lot of sense to have single-tenant mounts, because then we'd just be handling configuration for several mounts when we could do the same thing with one. That all seemed to make sense until GDPR hit. Now we have these multi-tenant mounts, and every team has different needs for GDPR and where they can have their data. HashiCorp's response to that was mount filters, but that doesn't help us, right? Because we're on a multi-tenant mount. So my big recommendation, and this is very straightforward: I don't think anyone looking to start on Vault today would go with a multi-tenant mount. Or maybe you would. I would highly recommend not.
Another thing that dropped was Version 2 of the KV store. Now we have versioned secrets. Super exciting: five minutes after the release dropped, my Slack channel blew up with people saying, "Hey, this release just dropped five minutes ago. When do we have it in production?" I'm like, "Well, we'll take a look and see." Unfortunately, there are breaking changes in the API of the Version 2 KV store, because of the way they had to architect it. So when you're on a multi-tenant mount, you can't just upgrade the mount and be fine, because there are breaking changes involved.
So migrating everyone over to single-tenant mounts became very important to us. Our recommendation would be single-tenant mounts with mount filters, not just for the KV store but also for auth backends and transit, really any secrets engine that you're using. And this is table stakes at this point with namespaces, because now you have a namespace you can give your tenants, and they can do whatever they want within it. They can create their own mounts and their own filters, and you're able to delegate that out. The next big thing for us was developing a self-service system. This also seems like shared service 101, right?
Moving to self-service
When you're standing up a shared service, you really need to make it self-service as much as possible, or else you're going to run into a bottleneck. For us and the dynamic at Adobe, we didn't grow as organically as we wish we had. We started looking at Vault as an option for us and we stood up this beta service. And we told our developers, "Hey, this is beta. Don't take it to production. This is beta. We're just taking a look. We're looking at Vault Enterprise." Of course, engineers find something that they like, they do some little tweaks and loopholes and networking, and all of a sudden applications are in production using our Vault beta service. And we're standing there thinking, "Okay, now how do we productionize this with 40-plus teams banging on the door saying, 'We want to onboard and use it'?" So I showed the slide of our architecture and where we're at. That's how we were running Vault, but we didn't have a great self-service method.
What we had was, "Can you submit a ticket? Okay, go to JIRA and submit a ticket." If you can imagine any swear words you want in whatever language you want, that's what I heard back when I'd tell my engineering teams, "Can you go submit a ticket to onboard to our service?" They especially don't like that. I don't like that, and my team doesn't like that, because we're the bottleneck at that point. So how do you develop a self-service system for onboarding to Vault? That was our pain point: "How do we get to that point?" Especially with all the functionality that Vault gives.
You can leverage Vault for TLS, you can do signed SSH certificates, you have a basic KV store, you have the transit backend, you can manage AWS and federate your AWS credentials. Every single use case was different, and we couldn't really develop a one-stop little webpage where they click a couple of buttons and it provisions, because we were supporting so much functionality.
Our recommendation, and where we've gotten to today, is that we've stepped back. We have the basic functionality that we feel all teams need, and that's what you can get provisioned through the self-service portal. If you need anything past that, then come talk to us, and that's when it involves a ticket. This is also something that's solved with namespaces, and we're extremely excited about it because now I can go ahead and still configure and automatically provision their namespace. And [inaudible 00:16:42] grant them access to mount, create, do whatever they want within that namespace, and I'm no longer worried about, "Exactly what do you want to use this for?" As long as I have the higher-level [inaudible 00:16:54] policies in place to control to a point, they're kind of within their own little parameters. It's a huge feature that's dropping. The next piece was the new functionality. I don't know about you guys, but it's like every time I turn around, HashiCorp has a new release. And it's new functionality that I'm trying to learn about.
And I find out about the release from my Slack channel telling me, "Hey, we want this. Can you get it into production tomorrow?" We're learning about the functionality as it's coming out, working closely with HashiCorp to get some insight into what's coming. But this was another piece where you have this whole, full-featured product. Teams are used to running it themselves as an open-source tool, with the root policy, and without having to go through another team. Now they're forced to use a centralized service and they don't have all of that available. This was a pain point for us, and again it came down to maybe limiting our engineers a little bit more and letting them know, "Hey, we're going to support this portion, but if you want something else, we're going to have to have a conversation." Again, namespaces help me out quite a bit here, because teams can now explore the new functionality without creating a ticket. I don't really even want them to come talk to me. They have their own little demo environment that they can play around in.
Best practices probably sound straightforward, but something that was exposed to us is this: you have a lot of functionality, and teams that are very excited to use Vault and, again, are banging on the door. Then they come to you and say, "Well, how should we set this up?" We point them to a best-practices guide and hear, "Well, that doesn't work for us." When you're talking about so many teams, best practices almost became the limiting factor: "Well, that doesn't work because we have our environment set up in a different way." So having best-practice guides in place was something we could have benefited from earlier, and now that we have them, they cut out a big portion of the requests that we get and the meetings that we have to sit through.
The last one, and probably the most important piece, and I think everyone deals with this to a point, is policy delegation. Think about onboarding and put yourself in our customer teams' view. They've just gone through a fairly painful process. Now, I'm exaggerating. It's not that painful, and we automate it to a point. This is definitely not the state we're in today. It's a nicer process for our teams now. But back then, they had gone through this painful onboarding process, and now they've got everything: they've created their secrets, they've got a couple of app roles, and they're chugging along. They don't want to come back and talk to me.
They get a new application and they think, "Okay, I just need to create a policy and get it mapped to an app role in the environment," and they come and say, "Can you just point me to documentation on how I can do this?" And I let them know, "Well, you don't have access to do that. Again, submit me a ticket." Well, now we're past the point of them swearing at me: they're making fun of my mom, they're asking me to meet them outside, and I'm fearing for my life, right? So we had to get past this point.
» Demo: Policy management and Vault
We couldn't stay for very long in a stage where policy delegation or creation of policies remained in the ticket structure. So that's what I want to demo today, and I want to preface it with: it's not perfect. There's a reason we wouldn't ever really want to open source this: it's very specific to how we're doing it. We have a very specific naming structure, rules for how people can and can't create policies, and a lot of string parsing. It's not pretty, but we're excited because it got us to a point where, with namespaces, a lot of that can go away, and we can leverage the same type of functionality to manage our namespaces. I think I've sufficiently prayed to the demo gods, but we'll see. I'm doing this demo over conference wifi and VPN, so we'll see how it goes. There's not really a whole lot to it, so hopefully it works out.
So I've got this unicorn. I don't know if people can see that. Yes, no? Make it a little bit bigger? Is that okay? All right. I'm getting some thumbs up. So I'll go through the files in here. Again, this is specific but this is the thought process that we went through. It's something that I would recommend when you're looking at policy delegation and then maybe namespace delegation or after the onboard process has taken place.
We created this repo, the thought being, "Okay, delegating policies not being a thing really stinks. So how can we make this policy as code?" And get all of our policies into a repo so that now teams know what their policies are. It's very explicit and they can just create a pull request. Well, even with 40 teams and hundreds and hundreds of policy changes, even just looking at a pull request and merging it in became a big headache.
So we went to, "Okay, let's plug in some Jenkins jobs and some Python magic to do a little bit of inspection into what they're requesting. How can we automate this as much as possible?" So we have them define an approvers file. As part of your namespace, or our poor man's version of namespacing, give us a list of approvers. Normally it's three or four. For this demonstration, I'm saying, "Okay, I'm the approver for the unicorn team." Let's say we have a team at Adobe called unicorn, I'm part of unicorn engineering, and I'm the approver for any changes that happen within this directory. This directory is kept in a large repository in GitHub, and we treat what's on master as exactly how our policies are represented in our Vault environment.
We then have the cost center. This isn't super important to what I'm trying to demonstrate, but I wanted to show it. We needed a way, especially when we have several teams and several organizations, to be able to show, "Hey, Creative Cloud is using this much of our service and here's their cost center, so we can maybe start billing back." Our logs are centralized and every log entry comes in with a path, so it was simply a matter of mapping paths to cost centers, creating a simple lookup table in Splunk, and getting that over to our business intelligence team. So we have this cost center. When they onboard, they're automatically given that: "Hey, this mount is mapped to a cost center and it's being billed back to the team unicorn." We then have this mappings file. This mappings file is not known to Vault; there's nothing Vault-specific about it. It's just a place to keep track of what policies are mapped to what.
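The path-to-cost-center lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk says the real table lived in Splunk, and the prefixes, cost-center codes, and function names here are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping of mount-path prefixes to cost centers.
# The real lookup table lived in Splunk; these values are invented.
COST_CENTERS = {
    "unicorn/engineering/": "CC-1001",
    "unicorn/ops/": "CC-1002",
    "unicorn/": "CC-1000",
}


def cost_center_for(path, table=COST_CENTERS):
    """Return the cost center of the longest prefix matching path, or None."""
    normalized = path.lstrip("/")
    matches = [prefix for prefix in table if normalized.startswith(prefix)]
    if not matches:
        return None
    return table[max(matches, key=len)]


def usage_by_cost_center(request_paths, table=COST_CENTERS):
    """Count requests per cost center; unmatched paths count as 'unmapped'."""
    counts = Counter()
    for path in request_paths:
        counts[cost_center_for(path, table) or "unmapped"] += 1
    return counts
```

Matching on the longest prefix lets a team-wide entry act as a fallback while a more specific mount can still be billed separately.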
Let's take a look at this policy. I've got the Unicorn engineering LDAP policy, and it's granting access to the engineering folder. Again, this is very much policies 101 in Vault: all the CRUD capabilities are being granted to this LDAP group, and they also have the ability to read their own policy. But that's just a policy that exists in Vault. It's not mapped to anything and doesn't grant anyone access, so we need a way to map it. That's what the mappings.yaml file is for: it's where we say the Unicorn engineering admins team, or LDAP group, has access to this policy.
We do the same thing down below in operations. Operations teams, SRE teams let's say, will have a different approvers list defined, or maybe they won't. It really depends on the team. I have the approver Feathers. Don't make fun too much, because that's an actual username and he's here today, but I think Feathers represents him well. So we have an approvers.yaml, and if any changes are made in the Unicorn_ops folder, it's gonna wait for the approver @feathers to give the okay. Same thing here: cost center, mappings file, and operations ACL policy file, all the same.
Okay, so this Unicorn team has been created. They have no application roles. Let's say they've just recently been onboarded and didn't request anything; they've built out their KV store, and now they have a new application. This new application needs access to a specific secret down the tree, which is something our Vault admins would never know. It's specific to them, but it's gonna be new app, app user one secret.
And the only capability it needs is read. You can imagine all the different iterations of this that we see from a hundred-plus teams. We get these all the time, so they needed a way to create it themselves. So they have this new app, and they're gonna add an HCL file, new_app.hcl, as a new policy that should be committed to Vault.
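To make the shape of such a read-only policy concrete, here is a minimal Python sketch that renders Vault policy text from a dictionary. The `path "..." { capabilities = [...] }` stanza is Vault's policy syntax; the secret path, the helper name `render_policy`, and the capability allowlist (which may need extending for newer Vault versions) are assumptions for illustration, not the contents of the file shown in the demo.

```python
# Capability names accepted by this sketch; extend as needed.
ALLOWED_CAPABILITIES = {"create", "read", "update", "delete", "list", "sudo", "deny"}


def render_policy(rules):
    """Render {path: [capabilities]} as Vault policy text (HCL path stanzas)."""
    stanzas = []
    for path, capabilities in sorted(rules.items()):
        unknown = set(capabilities) - ALLOWED_CAPABILITIES
        if unknown:
            raise ValueError(f"unknown capabilities for {path}: {sorted(unknown)}")
        caps = ", ".join(f'"{cap}"' for cap in capabilities)
        stanzas.append(f'path "{path}" {{\n  capabilities = [{caps}]\n}}')
    return "\n\n".join(stanzas) + "\n"


# Hypothetical read-only policy for the new application.
NEW_APP_POLICY = render_policy({"unicorn/engineering/new_app/*": ["read"]})
```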
Now, the policy they've created doesn't do them any good unless they have some type of mapping saying that this token role, this app role, whatever it may be, needs to be mapped to the policy. So they're gonna go and edit their mappings file and say, instead of an LDAP group, we now have a token role that's gonna be called new app user, and it's gonna be mapped to the policy that I'm just creating. I'm not able to do this by default. I can't delegate access for them to write this directly to Vault as it is today. This is not something I can delegate securely.
And so they're able to create that, but then they come to a point of, "Well, okay, I've created this token role and it's mapped to a policy, but now I need some way to get it into my configuration management tool." Historically, that happens with our SREs, so they'll be the ones that need access to create this token and get it into our Jenkins job or our Salt environment, wherever it needs to live to securely introduce it to the application. So they've got these paths that need to be added to some kind of policy, one of the LDAP policies. They're gonna put them in the operations team's policy so that the operations team now has access to create a token. It's granting them read access to that one specific secret within engineering, but they're okay with that because it's their operations team, and they also have access to read the new app policy. So we'll go ahead and save that, and then it just needs to come through as a pull request.
So we'll go ahead and add that, commit it as adding a new policy and role, and then do a git push origin. So that's gonna push to our master branch and they're saying, "Okay, this is my policy as code. I know exactly what my environment's doing." It's very explicit, and so I would like to push this, and hopefully it's gonna come through. Maybe. Thinking. Live demos are awesome. So I've got this request. I'm gonna do a compare and a pull request on this. Where we used to be, every pull request that came in needed to be looked at to make sure everything looked good, then merged manually into this repo, and then it got provisioned. What we've moved to is Jenkins jobs that are triggered by any new pull request. A job reaches out and says, "Okay, this new policy is asking to be created. Is it granting access to anything outside of Unicorn?" It's doing string parsing, making sure that everything looks correct. It's syntax checking, making sure that nothing's gonna break Vault.
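The "is it granting access to anything outside of Unicorn?" check can be approximated with the kind of string parsing the talk mentions. The sketch below is an assumption about how such a check might look, not Adobe's code: it pulls paths out of `path "..." {` stanzas with a regular expression and flags any that fall outside a team prefix. A production check would likely use a real HCL parser and an allowlist for shared paths such as the team's own policy.

```python
import re

# Matches the opening of a policy stanza such as: path "unicorn/engineering/*" {
PATH_STANZA = re.compile(r'^\s*path\s+"([^"]+)"\s*\{', re.MULTILINE)


def policy_paths(policy_text):
    """Return every path named in a path "..." { stanza of the policy text."""
    return PATH_STANZA.findall(policy_text)


def out_of_scope_paths(policy_text, team_prefix):
    """Return the paths that fall outside the team's prefix."""
    # Normalizing to a trailing slash keeps "unicornish/..." from matching "unicorn".
    prefix = team_prefix.rstrip("/") + "/"
    return [p for p in policy_paths(policy_text) if not p.startswith(prefix)]
```

A pull request would be rejected automatically whenever `out_of_scope_paths` returns a non-empty list for any changed policy file.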
It's actually spinning up a Docker Vault that's completely separate, just in dev mode, and it's taking all of our policies and writing them all to this Docker Vault. So now we have this replica that we can just play with, and then it's testing all of these changes to make sure that everything looks good. I'm praying that the little Jenkins test environment we stood up actually works and this finishes, but once it's done all the syntax checking, it's gonna come back and let me know, "Hey, I've run this check. Everything looks good, and these are the changes that I would have made." If you think about it, it's like a poor man's Terraform plan type of thing. And it's not coming back, so that's great. So I'll log into Jenkins. I have Chris, one of my coworkers, taking a look. Hopefully it comes through for us.
I have this kind of test environment that's stood up and allowing us to run this. It's been a little bit slow. Okay, @feathers is running a retest for us, let's see. Okay. If not, I can just speak to it. But yeah, it's cloning the repository and running this Docker Vault, and you can kind of see it as it's running. Let's see if this retest goes through. Okay, replicating the active config to Docker. It's doing all of this verification for us, and if everything goes through, it's not gonna merge it to Vault or make any changes, because I don't want it to. No approvers have given approval; it's just doing the syntax check. It should come through and, okay, @feathers, good job. The retest worked, and for the new policy that it's adding, it's telling me, "Okay, these are the changes." Now again, this is the poor man's version. It's not pretty, but it gives us a view of what's happening. So because I'm an approver for some of those files, I'll go ahead and add my approval.
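The "these are the changes that I would have made" report is essentially a diff between the policies currently in Vault and the policies in the pull request. A minimal sketch of that plan step, assuming both sides have already been loaded into dictionaries keyed by policy name (the function name and data shape are hypothetical):

```python
def plan(current, proposed):
    """Compare {policy_name: policy_text} mappings and report what would change."""
    current_names, proposed_names = set(current), set(proposed)
    return {
        "create": sorted(proposed_names - current_names),
        "delete": sorted(current_names - proposed_names),
        "update": sorted(
            name
            for name in current_names & proposed_names
            # Ignore differences that are only leading/trailing whitespace.
            if current[name].strip() != proposed[name].strip()
        ),
    }
```

Posting this summary back to the pull request gives approvers a readable preview before anything is written to the real Vault.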
If I can type in front of all of you. So I'll go ahead and comment, and this is gonna kick off, I guess I can show this as well, the same type of Jenkins pipeline that's coming through. But now it's this policy merger, so it's gonna go and take a look and say, "Okay, what files were changed?" And it's gonna iterate and recursively call back through to see whether each file that's changed has been approved by someone in the approvers file. If not, then we're gonna deny it. I changed two files that I'm an approver for, so it should come back and tell me, "Yes, two of these files are approved. One of them is not, and it's waiting for @feathers to give the approval, and then this will be merged in." And okay, good. Look, it's going. This should be fairly straightforward, so it should give me something that kicks back, and it's gonna fail. So we might be out of luck here.
No, that's what we expected. What am I saying? Okay, so it's gonna fail. It's still waiting for approval here. It shows that it's been approved on all the others, but it's waiting for approval on the Unicorn ops. So Feathers is gonna go ahead and give us an approval, and we're gonna run through this same thing. This time, it's gonna check all of these files, make sure that they're all approved, and then it should just be merged directly into Vault. So not a whole lot behind the scenes. It maybe seems very straightforward, but for us this automates it now.
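The approval gate can be sketched as follows. This is an illustration under assumptions, not the actual pipeline: the approvers files are taken as already parsed into a dictionary keyed by directory, the directory names and the username `chandler` are hypothetical (only `feathers` appears in the talk), and a file with no approvers anywhere above it is treated as pending, so the default is to deny.

```python
from pathlib import PurePosixPath


def approvers_for(changed_file, approvers_by_dir):
    """Walk up from the file's directory to the nearest directory with approvers."""
    for directory in PurePosixPath(changed_file).parents:
        approvers = approvers_by_dir.get(str(directory))
        if approvers:
            return set(approvers)
    return set()


def pending_files(changed_files, approvals, approvers_by_dir):
    """Return the changed files that no listed approver has approved yet."""
    approved_by = set(approvals)
    return [
        path
        for path in changed_files
        if not approvers_for(path, approvers_by_dir) & approved_by
    ]
```

The merge step would run only when `pending_files` returns an empty list, which mirrors the demo: the first approval leaves the ops file pending, and the second clears it.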
And now I'm out of the ticket game. My teams aren't responding to tickets because it's just being done here fully and it's a pipeline. It's a poor man's version of namespaces. Once namespaces drops, I don't think we're gonna have to rip too much out.
Hey, okay, so we can see the approval went through. It's now merged. I should be able to go to my Vault and do a read on this new app policy and see that the policy I just wrote, which took no Vault admins to look at, was verified and written to Vault, and now it exists there. I can now go ahead and have my ops team create a token that's gonna be given to my application, and everyone's happy because no ticket was created. So this was a big deal for us. Thank you.
We really ran into this backlog when we became the bottleneck. No one likes being the bottleneck, but this got us out from being the bottleneck, and we're hoping that, as Jeff says, namespaces is coming relatively soon, because I think it solves a lot of our issues and we're really excited for it. So that's my talk. Thank you.