Learn about the process side of managing centralized, multi-platform Vault operations in a large enterprise.
Cisco, a company whose security expertise is world-renowned, uses HashiCorp Vault as its centralized, multi-tenant secrets management platform. They've built multiple Vault instances, onboarded tenants, met the operational and administrative requirements, and done much more in their progression toward maturing their secrets management.
J.P. Hamilton, director of cryptographic services at Cisco, will discuss his company's experiences during their Vault progression, focusing less on how Vault is deployed as a platform service and more on the business processes associated with a centralized model.
Cisco has named their centralized Vault service “Keeper” which is short for Secrets Keeper. From the inception of the project they wanted to create an internal platform service that teams across Cisco could leverage.
Arguably, the team delivering Keeper works harder on creating all the processes and documentation needed to provide Vault as a service than it does on the technical work supporting Vault itself. Therefore, the talk will explore:
Finally, Hamilton will review the lessons learned and the adjustments that they are making as well as some of the successes in delivering their Keeper program.
This is the logo we came up with for our service of Vault internally. I'm going to get into the history of this. I also want to give a little background on our team because it's weird that it's a centralized cryptographic services team. We're starting to see other companies create these and we've been at this for quite a while.
Some years ago, Cisco identified it was having a counterfeit problem. Products were getting counterfeited. We wanted to prove to our customers that when you bought a Cisco product, it was a true Cisco product. We started doing device certificates during the manufacturing process.
When we create a device, we do a serial number, and that serial number has a date, a time, a manufacturing location, and even the manufacturing line. That means we can't pre-batch these certificates and send them down to the line and let it consume them. We have to do certs in real time. If you're thinking about what that means, Cisco runs 24/7 global manufacturing operations in Asia, the US, Central and South America, and Europe. We have to have real-time PKI services running to support all of that.
There was a time when I had hair when this project started. But when we realized that if our services went down, we would shut down Cisco manufacturing globally, I was like, "Oh man, we got to up our game a little bit."
Shortly after we started doing device manufacturing certs, we started having to get into software signing. Software signing was a way of assuring our customers that the code running on the gear that they're getting was valid code that was produced by Cisco. That meant integrating with build teams.
If you guys have ever looked at a Cisco portfolio, there are thousands of products, and there are different variants of software for all of them. We had to build a mechanism to do software signing, to allow literally hundreds of build teams to connect to our systems and use a centralized signing function.
We took it on as a centralized function both in the PKI space and in the software space, because we were seeing different versions of software signing. We had laptops sitting under somebody's desk running the software signing engine, and we were like, "No, let's do this a little better."
We had that same problem in the manufacturing space. We saw different attempts to do things. Our executives said, "We want to centralize this and put some good standards around that," and that's how we started about 10 years ago.
We're also doing Cisco's licensing program. I don't know if you've heard, but Cisco acquires a lot of companies. We tend to buy a lot. At one point, we had tens of licensing solutions. It was a big struggle for us because we're trying to make it easier for our customers. We've collapsed into a single licensing solution, and our team provides the backend for that service.
I'm giving you this context because when we looked at what was happening at Cisco, we were going through a transformation. Cisco is moving out of a hardware business. We're getting into software. We're getting into "...as a Service," and we said, "We want to take on this challenge, but in a central way. How do we provide a function to these teams so they don't have to worry about it, and so we know as a company it's secure?"
We went off to build a key management system. We said, "This is what we're after. This is where we need to be." We're part of the security organization. But there's another part of the security organization, with a program where anybody who wanted to put a service into the cloud had to go through this process called the "Cloud Approval To Operate," or CATO.
So we said, “We're going to go talk to these client teams and find out what they're doing for key management.” There were about 70 teams going through this process. We quickly realized after talking to them, they didn't want key management. The number one product that they were using was open source Vault.
They wanted us to deliver a platform for doing an enterprise version of Vault, if we could. That's how we came around to our Project Keeper. We looked at it, and we said it's a secrets keeper. We shortened the name, and we said we'll call it Project Keeper.
For us, Keeper was designed from the beginning to be a multi-tenant solution. We worked hard with HashiCorp. I know we weren't the only voice in the room asking them to introduce namespaces and create a way for teams that consume Vault to have mastery of their own domain.
We were like, "Okay, how do we do that?" Before I go much further, every presentation that I've seen is very technical. I'm a manager. You're not going to see a lot of code up here. I got a bunch of technical guys from our team over here. They'll be glad to answer any questions and get into some of the details at another time. But we wanted a highly available service, right?
Manufacturing goes down—that's a problem for us. If engineering goes down because of our software signing services—that's a problem for us. If our licensing service goes down, it doesn't just affect our clients—we call our internal teams "clients"—it affects our customers.
It was a challenge: How do we build a Vault environment that we know is going to be resilient in operating, and meet the needs of our business units as they go out and operate these services?
We also wanted to follow the same rules as we take our services into the cloud. We had to go through all the reviews, all the security checks, to ensure that our service would meet all the security requirements that our information security team wanted.
We went through that process, and we realized we could help the other teams that are trying to go through it. By saying, "Hey, we're leveraging Keeper," they can knock off a bunch of the requirements and check them off. They won't have to go through any vetting and proof, and the things that you run into in a typical enterprise version of this stuff.
We have a lot of experience running HSMs—the manufacturing systems that we've had to build, the software signing systems that we've had to deal with, even the licensing systems. We're running almost 60 HSMs all high volume, all located in different locations. We have a fairly robust key management program, not in terms of a key management system, but processes and controls. And we said, "Hey, if we're going to have HSMs unsealing, we want to own those HSMs."
We provide those HSMs ourselves. We're not using a cloud-provided HSM service for that. Also, we needed to create an ability to operate at a very high tempo and high program performance level. Inside of Cisco, we have priority levels. P1 is the highest you can be—priority one.
Then there's application criticality. We run this application as a C1. This platform is our highest-availability service inside of Cisco. What does that mean? Well, hopefully, we don't have problems.
But when we do, it means that we'll have an engineer respond within 15 minutes, and that our target is to close that problem within an hour. As we're bringing on many different platform teams and services, we need to be able to be that responsive, to give them the service that they're going to expect of us, especially coming into our environment.
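To make those two targets concrete, here is a minimal sketch of how an incident could be checked against them. Only the 15-minute response and one-hour close targets come from the talk. The `Incident` record, its field names, the example timestamps, and the choice to measure both targets from the moment the incident is opened are assumptions for illustration, not a description of Cisco's tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Targets quoted in the talk for the highest-priority service.
RESPONSE_TARGET = timedelta(minutes=15)
RESOLUTION_TARGET = timedelta(hours=1)


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    opened_at: datetime
    first_response_at: datetime
    resolved_at: datetime


def met_response_target(incident: Incident) -> bool:
    # Assumption: the response clock starts when the incident is opened.
    return incident.first_response_at - incident.opened_at <= RESPONSE_TARGET


def met_resolution_target(incident: Incident) -> bool:
    # Assumption: the resolution clock also starts when the incident is opened.
    return incident.resolved_at - incident.opened_at <= RESOLUTION_TARGET


# Made-up example: answered after 12 minutes, closed after 70 minutes.
opened = datetime(2019, 5, 1, 3, 0)
example = Incident(
    opened_at=opened,
    first_response_at=opened + timedelta(minutes=12),
    resolved_at=opened + timedelta(minutes=70),
)
```

In this made-up example the response target is met and the one-hour close target is missed, which is the kind of split a tier-one report would need to surface.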
We learned from our experience with our other services that we needed to set up a set of environments. We typically like the normal process: dev, stage, and production. I'll show you production in a minute. But we also introduced a concept of non-prod. For us, the dev and stage environments are all for our team to operate and use and test.
In dev, it's a single cluster: us getting used to whatever version of Vault, whatever changes, however we're configuring it. Then the stage environment for us is to test it with replication across multiple clusters. Again, all internal for our team.
In our other services, we had other teams test with us, and we would play with our stage environment—let them connect and integrate against that. It caused problems because we were changing things. We weren't always pushing everything in stage into production. We were trying to make sure everything worked there.
We created a concept of a non-prod cluster. That is a cluster that we maintain at the same config that we do for our production Vault, and that's where we allow our client teams to come in and test and make sure they're comfortable with how things are going to work—and that they understand how it's going to integrate and work with us. It's a little bit different.
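The environment split described above can be sketched as a small data model. The audiences and purposes are taken from the talk. The `Environment` class, the `ENVIRONMENTS` list, and the `client_test_targets` helper are hypothetical names introduced for illustration, not part of Keeper or Vault.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    audience: str  # "internal" (the Keeper team) or "clients" (tenant teams)
    purpose: str


# Purposes paraphrase the talk; the structure itself is an assumption.
ENVIRONMENTS = [
    Environment("dev", "internal",
                "single cluster for trying Vault versions and config changes"),
    Environment("stage", "internal",
                "testing replication across multiple clusters"),
    Environment("non-prod", "clients",
                "kept at the production config for client integration testing"),
    Environment("production", "clients",
                "live service"),
]


def client_test_targets(environments):
    """Return the names of environments where client teams may run tests."""
    return [
        env.name
        for env in environments
        if env.audience == "clients" and env.name != "production"
    ]
```

Calling `client_test_targets(ENVIRONMENTS)` yields only the non-prod environment, which reflects the design choice in the talk: clients never integrate against the team's own dev or stage, because those change too often.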
I heard somebody yesterday talk about how they don't use dev and stage. I'm a little nervous about that. We have so many platform teams and so many services that make a lot of money for Cisco—they don't want to experience any downtime. By entering into the enterprise agreement, we also have excellent support from HashiCorp, and I promise I wasn't paid or asked to give you this note. I will tell you we have quite a few vendors in our space that help us out in delivering our services.
HashiCorp to me has gone from vendor to more of a partner because they've invested a lot of time in making sure that we're successful. They've had other teams from Cisco go to them asking to buy, and they point them back to us instead of trying to take advantage of that opportunity to sell.
It's been a good partnership. We appreciate that support. They've been great to work with. No more commercials. I promise.
Encryption scares me because of the compute that's going to be involved with that as that grows. I'll highlight the things that we're after. One—we haven't got our information security department yet on board with how to externalize our AD.
We're working with them. Other teams are working with them. We expect that to happen here soon, and that'll be a big step forward. I think a lot of teams will be more willing to move on and take on our service at that point.
We are having to deal with internal clouds. Cisco is a big tech company. Our IT department has 550 managers. There are 5,000 employees just in our IT department. The idea that we don't run internal clouds is crazy. We have multiple internal clouds that we operate. We're in the process, and hopefully by August, I think, is what we targeted, of having internal Vault set up in two separate clusters in two separate datacenters, servicing those applications. Hopefully, that'll happen soon with no problems.
Our future plans are to go into Europe. Asia Pac I put up there, but we're not sure whether we're gonna do that. We're starting to see some requests for having a Vault service in a FedRAMP environment as a platform. We're trying to see what that looks like. Some of our employees are not US nationals, and FedRAMP is for US government consumption. They have some requirements around who can service that. We're trying to see how that plays out.
One of the challenges we had when we started talking to these teams: I mentioned there were 70 teams going through this CATO process. It could be a team of five engineers, or it could be a team of several hundred engineers, depending on the services and applications that they're doing. We're trying to figure out how we scale for that.
We have to automate. There's no way around that. We're trying to figure out what's the right mixture of self-service. We would like to do self-service onboarding if possible. We have a lot of reporting requests that people want and things like that. We're in the process of building a client portal.
You can see what we've set up. The gray boxes at the bottom are DMZ spaces for us. We run a secure datacenter inside of a Cisco datacenter. It took a little while to get that connection up. That's where we host our HSMs.
That was our original architecture because we were looking at those HSMs as doing the unsealing keys. But we're starting to see some of the HMAC signing stuff for logs and things like that. We're very concerned about the connection speed we're going to see between our DMZ and the cloud, and we're probably going to move those HSMs.
Being Cisco, we have the ability to put things in other places. We're trying to move those HSMs out. We don't plan on leveraging any cloud provider HSMs for this.
We have a very robust DR infrastructure as well. As you look at our two clusters in AWS—east and west—we've got full backup DR clusters there. What's happening is you're seeing replication going across all four of those clusters and we've had a couple clients force us to stress test that. They work great—we're really excited about how that went.
There's one team that's primarily responsible for doing that in our org. Our organization's made up of about 40 folks. We're global in size, and this team has just eight people. They have other services that they've got to deliver for us, big services.
We have quite a few private PKIs that we deliver from the cloud. They're responsible for that. Our public TLS certificate services—they're responsible for that as well. That's a whole other presentation though. We'll keep that off of here.
It's a big ask of this team to deliver this, and there's no way one team can do that by themselves. We're a very heavily matrixed organization. It takes a total of all 40 of us helping each other out. But this team is a team that's responsible for the overall delivery of that service.
The other team I'm calling out here is completely focused on managing HSMs. We're running three different vendor HSMs. I've mentioned the numbers before. Each one of those things—it's a whole learning curve to operate, manage properly, and then do the key management that comes along with that.
One of the things we ran into early was nontechnical issues, and I know at this conference there has been a lot of technical discussion. But the nontechnical issues are a big challenge for us. We're a bunch of crypto nerds. We're not business process people. We're not marketing people.
We're having to onboard. How do we onboard all those 70 different teams and manage that? We've got eight people responsible for that service. We've been trying to look at different engagement processes. I'm going to dive into a bit of our engagement process because we've had to figure this out. If you're going to take on an enterprise platform, like Vault as a platform, these are some of the things you need to consider as you go.
The other was selling this program. I had to beg our executives to fund it, to start it. But the caveat back to us was, "You need to recover 100% of the costs of this program. You need to go charge these teams for the service that you're providing."
All of a sudden we're in sales mode trying to explain the value of what we're doing for them and how to get that money back. Part of this process requires them to give Cisco funny money internally and take care of what we're doing.
Then we ran into an interesting problem. I talked about our IT department, right? I'm telling you 5,000 employees, 500 managers. We never dreamed we would deal with a reseller situation inside of our own company.
We had teams that are standing up ecosystems to provide services to other groups, and they go, "Hey, we'd like to plug you in. What are you going to charge us? We're going to do volume discount for you," and things like that. These were challenges we never expected to take on. And again, how do we scale for such a critical service?
I put up here about tier one. We're running into this with all of our services. We saw a need because of 24/7 manufacturing, 24/7 software signing, 24/7 licensing. Now Keeper too. How do we keep our engineers focused on the things that we need our engineers' time for, and not on all the ash and trash emails and stuff like that?
We had to stand up a tier-one support service. We have tier-one people down in Costa Rica and in the Philippines now. Just five; it's not this big thing. They're down there helping cover a lot of the minor questions. They're using the service now, and building the knowledge base, and doing all that good stuff.
Here's our stab at a sales process internally, if you will. We decided there are certain steps that teams that are going to consume our products are going to go through. The first one is discovery, right? We track them. We get a team that emails us. We send them a note. We say, "Hey, here's a Wiki. Here's all the information you need to set up your system and get it operating."
Then there's a phase where they start calling and saying, "How do we use your system? How do we talk to your service? What's the cost?" We end up in these meetings. We call that our discussion phase.
Then the agreement phase—it's not like a sales contract, but it's like a sales contract. We literally hand a document over and have a director or manager sign something because they're going to have to commit to our financial folks that we can move the Cisco funny money internally around to cover the cost of the service.
I had a couple of business units call their lawyers to have us talk about our agreement document. I'm like, "We're still the same company. You're not getting into a contract with somebody outside."
Then we get into the implementation phase. Mostly they're going against our non-prod, testing against that. We get back into the technical side and then get them into maintenance and production.
We were slow to production. We started this journey about two years ago. We went live here in March with the service. It took a long time. To be honest, a lot of it's my fault.
After we kicked off the project and we were starting to see investment, we stopped. We decided to reorganize. We blew up the entire team and shifted around and got things in place. We were running traditional DevOps. Then, because we have some systems under a compliance team, we said, "No, this isn't going to work." We moved to a service management framework.
Ryan—who's sitting over here—he is a service manager. He owns this service. He owns all the people who deliver this service. He is responsible for it. He has to gather requirements all the way to the delivery of the service. It was the only way we felt we could scale for this thing.
That reorganization took time. It took time to get the people on board. It took time to get them hired, get them trained and knowledgeable in the service, and get it back up. That was a big reason why we were slow.
We did have some outside technical team dependencies. I mentioned that we operate inside of a datacenter, a secure datacenter inside of Cisco's datacenters. Our DMZs are in there as well. It was no minor feat for our IT networking guys to figure out how to get a VPC connection through all of their datacenter gateways and then back in through our gateways into our environment.
It took a while to get all that to work. The teams have been a little slower. The moment you mention that you're going to charge them, they're not happy. They don't want any more costs for their product. We have to go through a process of education again: here's the value of what it takes to come onboard with us.
We're now in this position of having to market our service. We're a bunch of crypto guys. We're not marketing people. We don't know how to do marketing material. The closest we got was the logo that you saw at the beginning. We're trying to figure that out.
The ugliest part has nothing to do with Vault, nothing to do with the team, the service, anything. It's managing cross charges. We go to our finance person. We hand them a spreadsheet of 70 different teams that we need to get money from, and they need to process that quarterly. It is a nightmare and a mess. I wish we could take it on as a centrally funded program and not deal with cross charges. It would be much smoother and cleaner, and would make the conversations a lot better for everyone.
Our team has been buried in these datacenters for years. Again, datacenter within datacenter—delivering and building our environment. We had the chance to go take on more modern technology and, "go do cloud, man"—everybody on the team was excited and wanted to jump in on it.
I loved how the team coalesced around it, and took off with this project and they loved it. The platform for us has been incredibly stable. We've had a lot of good testing over the last couple of months, and we're happy with it. I'm not going to advertise HashiCorp again, but they've been fantastic.
I don't know how to categorize—because I couldn't put it in bad, can't put it in good, and definitely not ugly.
We don't know what our capacity is going to be. We're not sure of the impact that encryption as a service, on top of handing out secrets, is going to have on our environment. And we're bringing on teams.
To give you an idea, we went live in March. In the first two months, we had 85 million transactions off of our Vault. That was with 6 teams out of the 70 that were coming on board. I don't know what we're going to be looking at as we keep growing and expanding—and when we break this thing or when we have to add other clusters. That's a little bit of a scary area for us.
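For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate. The transaction count and team counts are from the talk. Treating "two months" as 60 days and assuming load grows linearly with the number of teams are both assumptions made only for this sketch, and the second one is exactly what the talk says is unknown.

```python
# Figures quoted in the talk.
TRANSACTIONS = 85_000_000  # transactions in the first two months
TEAMS_ONBOARDED = 6
TEAMS_EXPECTED = 70

# Assumption: "two months" is treated as 60 days.
DAYS = 60
SECONDS = DAYS * 24 * 60 * 60

# Average observed rate over the period.
avg_per_second = TRANSACTIONS / SECONDS

# Average monthly volume per onboarded team.
per_team_per_month = TRANSACTIONS / 2 / TEAMS_ONBOARDED

# Assumption: load scales linearly with the number of teams.
projected_per_month = per_team_per_month * TEAMS_EXPECTED
projected_per_second = avg_per_second * TEAMS_EXPECTED / TEAMS_ONBOARDED

print(f"average so far:      {avg_per_second:.1f} tx/s")
print(f"per team per month:  {per_team_per_month:,.0f} tx")
print(f"projected per month: {projected_per_month:,.0f} tx")
print(f"projected rate:      {projected_per_second:.1f} tx/s")
```

Under these assumptions the first two months average out to roughly 16 transactions per second, and a linear projection to all 70 teams lands near 190 per second, or close to half a billion transactions a month. Real traffic is bursty and teams differ widely in size, so this is an order-of-magnitude figure, not a capacity plan.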
The other problem we've had has been logging—and that's been a self-inflicted wound, to be honest. We started using Splunk. We had one engineer primarily responsible—took us down a bad path. We didn't leverage it very well.
We said, "Move it out of the way—you bring in another tool." We brought in Graylog; started working with Graylog, liking Graylog, using it. Then Cisco went off and bought an enterprise license for Splunk. We’re now having to bring Splunk back into our world. It's nothing to do with the service at all. It's just been struggles for us.
Dedicated clusters. I mentioned FedRAMP. I mentioned our internal world. We're looking at better onboarding tools. Our goal would be to allow teams to do an automated onboarding method. I'm sure there's going to be some work with the migration and stuff like that. But we like that. We're trying to figure out how to support more use cases. We want this service to be very successful.
How do we provide the use cases to our customers? One of the big things is that AD integration—for us it will be big. We know that'll bring a lot. I already talked about hiring more team members. As we bring more dollars in for this service, we expect to expand that service.