Case Study

Secret Management in an Immutable World at Digital McKinsey

Published 7:00 AM UTC May 08, 2019

The Global Cloud Infrastructure division of McKinsey, a 90-year-old consulting firm, used HashiCorp products heavily in their own digital transformation. Learn how they use Terraform, Consul, and why they especially love Vault.

Transcript

Hey, everybody. Steve Jansen, here. I live in Charlotte, North Carolina, and work for a management consulting company named McKinsey & Company. You're probably wondering, "What is a management consulting company doing talking at a very awesome tech conference like this?"

I work with a practice called Digital McKinsey. Digital McKinsey has been a large investment in our world to help companies figure out, "How do I reinvent my business and how do I start engaging in new ways with the existing incumbent status that I have? How can I rethink how I'm doing my core business?" I may be an incumbent; a bank is a great example. You have someone like Capital One that wants to rethink, "What's the world of banking in the future?"

And then saying, "To power those things and enhance how I engage with my customers, how do I adopt things like Agile?" Most large companies are very good at what they've done for many decades, but they're generally rooted in practices like Waterfall, where they're very optimized for stability and where they see challenges such as, "How do I move faster? How do I innovate more?"

Digital McKinsey’s transformation

Our story, here, today, is part of our transformation. How does a large, 90-plus-year-old consulting company reinvent itself to do something like digital? If we're going to help our clients with that, we have to do it for ourselves. McKinsey was very bold in this and saw this as being a large opportunity for many industries, and so we invested quite a bit.

I joined as about the 15th person about five years ago. We now have over 1,000; I think it's closer to 1,100 people, now, in our labs group. Then Digital McKinsey, the larger practice I'm part of, is about 3,000 people. That's been a mix of people who are full stack developers, Agile coaches, UI/UX engineers, and cybersecurity experts.

We've done this at scale; it's not something we're trying to do in small numbers. In the last three years, we’ve served over 1,000 different clients on digital. We're in 60 countries. We're regularly quoted by some of the world's leading publications on what our position is on digital. A lot of this growth is thanks to very significant investment, including inorganic growth where we've acquired some best-in-class companies, particularly in the UI and UX design world, and analytics.

As we're doing this, I think it's important to know that McKinsey's very interesting. I feel very blessed to work there. Our culture is all about ethics and values. We take that very seriously. There are three values in our core set, here, and this is something we truly live every day. We call back to our set of values regularly in conversation.

Preserving client confidence and building enduring relationships based on trust go hand in hand, and then governing ourselves as one firm. That's important because we're a very flat organization. It's possible to do things in silos, and we try to pull ourselves back together in doing things as one firm.

The challenges for older organizations

Why does this matter? Well, we add a whole bunch of new profiles, people who are not traditionally MBA students who are then graduating and going on to work at a management consulting firm. We consider what feels different about having various cultures joining a very mature organization, and I'm sure banks that are going through digital transformation feel as much.

Testing and learning is very different from traditional IT in a large company. When I first joined, you go through a very awesome IT group that does their job very well. They have their objectives; they hit them well. It's usually around keeping e-mail up and running and having a mobile workforce.

Then you want to onboard something with them and, as we heard before with Jeff, spin something up. They're like, "Great, go through the governance, here, to bring something up for three years." You're like, "Just kidding. I needed it for three hours." That feeds into the next part where we see a different challenge around cycle time. Again, an organization optimized for stability will want to make sure they've got all the right things in place, and that's on a different time span.

The last part is the tools of the trade. For most management consultants, they're going to work very comfortably in Excel, and PowerPoint, and e-mail, and their BlackBerry. It's very different when you bring in developers and designers. You're getting a very different set of tools, like GitHub. It's going to be a very foreign concept to a place that's used to a different set of tools.

As we're trying to figure out how to go faster, we're like, "How do we bring in agility?" This is the key problem that most large companies are facing. "How do we bring in agility without cannibalizing what we're already good at?"

Don't get it wrong. Most large companies are very good, and they're there for a reason. They know that they want to innovate faster. They know they're facing threats from smaller, more nimble places that might be start-ups or might be coming in from a different industry, but they still have things they do well. So, how do we keep doing that and add agility into it?

Two-speed IT

The task I was faced with solving was, "How do we embrace something like public cloud where, traditionally, we would have said, 'That's scary. We don't feel like we can trust that'? How do we do that in a way that's not going to possibly jeopardize our core IT mission or jeopardize the confidence of our clients?"

For us, it was very simple. It was dogfooding. We go to clients, and we talk quite regularly about the concept of two-speed IT. Here's a quote from Apple Computer in its early days, which said, "Hey, we think that word processing and spreadsheets are the right answers for people to use on an Apple computer. We should use it in our own office, so don’t buy typewriters anymore."

Mine was more or less the same thing. We go to clients, and we say, "Hey, there is a concept called two-speed IT." You can debate the merits if that's where you want to be long-term and accept that you've got one speed that's optimized for stability and one that's optimized for agility. Maybe that's not where you want to be long-term; maybe you want to be an awesome, agile practitioner, like Capital One, who is often referenced in case studies. It was cloud-native.

But in this case, it's a great way to get there. You've got to start somewhere if you're thinking about shifting to an agile mindset, so we decided to use two-speed IT. My practice is not concerned with running our core e-mail systems. We're thinking about, "How do we engage with our clients so that we can probably get our toe in the water and start embracing things like the public cloud? We're going to start migrating some of our client-facing stuff to that, and all the greenfield stuff goes to the cloud."

Consul, Vault, and Terraform

But we hit a lot of hurdles along the way. It's AWS. You should be able to go into the console, hit some knobs and buttons, and everything comes to life. We hit some interesting things, particularly with risk and security, data at rest, and service providers. We went through many months of conversations around encryption key management.

Then we're used to a world where we have, like our clients, very complicated firewall rules (as mentioned before about edge firewall rules). I heard of one company recently that had millions of firewall rules at the edge. They have a lot of maturity in that. They trust them very deeply. How do we go to a world where you don't have that anymore, the castle walls and moats around you?

We take client confidence seriously. Cyber today is a hard game. The thing that scares me most is advanced threats that are highly targeted and come from sophisticated actors. I want to make their life difficult. I want to make it so hard that, even if they get their toe in the door, they can't move laterally and go on to the next part of my network.

One of the first things we decided we wanted to do along with dogfooding was to look at immutable infrastructure to help solve some of these problems. Like we said before, we want to treat our infrastructure code a lot like our application code. Why are we interested in immutable? We're all developers.

Immutable infrastructure feels a lot like how you do development. We don't try to patch binaries that we shipped or a JAR file that we ship to an environment; we have a new version, and we ship it out, and that's the version that you run. For us, this made a lot of sense as we were going cloud-native to say, "Well, rather than having long-running machines, let's replace them."

Terraform and Packer made this a lot easier, but we hit three challenges. First, you don't want to bake in the list of who works for you and who's allowed to SSH into your machines; that would be bad news or painful to change, especially as you're promoting it through environments. Second, with environment-specific configuration, such as secrets, you're probably not going to want to bake that into your machine.

And then log records. If you work in an industry that has regulatory or governance requirements, you probably need to retain your records for a certain amount of time. You can't throw the machine away. If you're keeping your logs locally, you have to deal with that.

Vault came to be the thing that, initially, we liked because, "Look, it looks like we can keep our secrets— It's an easy way to centralize them." We knew Chef, and encrypted data bags, and Chef Vault well, and we were thinking about it that way. "Okay, we'll do secrets."

The takeaway from this talk is that Vault, to us, does a lot more than this. If you're thinking about using Vault, or if you are using it already, you can go a lot further than that. Regarding administrative access for us, we would love to get to a world where we don't need to run SSHD anymore. We're not there yet. There are still things where we have to get into the machine locally to troubleshoot it.

In this case, we're using CoreOS. We like CoreOS because it's minimal; we're running Docker, that's all we need. I can outsource patch management to Core with their atomic updates. That's great, but it doesn't run PAM. This is a place where a lot of the traditional moves you would make in saying, "Well, I'll buy into LDAP and have all that work with my Linux distro," didn't work.

But Vault, having so many different options under the hood with its SSH backend, would let us say, "Okay, I can delegate access to a centralized service. We get all the good things from it, and it integrates with our primary identity and with the multi-factor auth that we already have running."

With the SSH CA backend, we can also do mutual authentication. Every time we had an auto-scaling event, we would get a warning about known hosts, and we used to ignore it and say, "Oh, yeah, it's just auto scaling." We were training ourselves to ignore a valid security warning. What if this isn't one of our machines? By doing SSH CA signing of the host certificate, we now know that, if we get a warning saying, "I don't know who this machine is," something went wrong. That's another way we were able to level up the security.

Then, of course, with Vault, we get all the great things around policies and we're able to control who has access to that. The list of policies you see there is 100% driven by our primary identity, which, in this case, is LDAP saying, "What groups are you in? What role should you have?" The great thing is it's all auditable. Now we have all of this because Vault is easy to audit. It's getting sent to a SIEM so that we can see who did what.

The next obvious part, probably not a surprise, is the second challenge I mentioned: "How do we configure our host?" Consul Template is awesome. I love it. We're able to put secrets and other configuration settings in Vault and also in Consul, and it just works.

To me, Consul Template has been such a force multiplier for us. One of the great tools that we love in the HashiCorp stack. When they added EC2 instance authentication with Vault, it made our life even easier. We had one less secret to manage.

The next part was, "How do we configure our actual workload?" We bought in on the Docker containers as a service platform model, and we built a wrapper around this where it feels more like a platform to our developers, to give them efficiency so that they are not thinking about lower-level problems.

Here, this is a file similar to what we heard from Capital One: it's in the repo, and it's a YAML file. Developers can declare, "What types of secrets do I want?" Today, it's lots of AWS stuff. We added a backend for Mailgun so that people have ephemeral secrets when sending out e-mail, and dynamic secrets for your app with PostgreSQL. Then we'll add those.

As other people's requests come in, we'll add additional backends as well. We found this to be very accessible to developers, to where, now, the way that we're handling secrets in our infrastructure as a cloud team is the same way our developers are using them, in a PaaS-like way. It's nice to get that consistency.

On top of this, we added in our cockpit, here, a self-service UI to where you could come in and say, "These are the secrets I want to change." We had frequent requests coming in from developers like, "Hey, I need to write a new secret. I've got an API key for some third-party service."

We realized, in typical consultant fashion, that this was low value-added work; we weren't adding anything by writing it for them. With Vault, we were able to do more sophisticated things like write-only access by policy, so you can change secrets yourself, along with status indicators to say, "The things you declared are green": the things in your YAML file show as green. Or, "We have things in Vault that are yellow: you have written them, but you haven't declared that you want us to inject them into your container."
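The green and yellow indicators described above amount to comparing two sets of names. Here is a minimal sketch of that comparison in Python; the function name, the input shapes, and the example secret names are hypothetical, and the talk does not say how the real UI computes this or how it shows a secret that is declared but never written.

```python
def secret_status(declared, stored):
    """Classify secret names as 'green' or 'yellow'.

    declared: names listed in the app's YAML file (hypothetical input)
    stored:   names that currently have a value written in Vault
    """
    declared, stored = set(declared), set(stored)
    status = {}
    for name in sorted(declared | stored):
        if name in declared:
            # Declared in the YAML file, so it is injected into the container.
            status[name] = "green"
        else:
            # Written to Vault but never declared for injection.
            status[name] = "yellow"
    return status


if __name__ == "__main__":
    print(secret_status(["db_password"], ["db_password", "mailgun_api_key"]))
```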

Encryption and key management

All right. The next interesting part of our journey is encryption key management. For some companies, this may be a non-issue. For other, regulated ones, this could be a real problem where, as great as AWS KMS is, or irrespective of what Google does with its zero-trust in its encryption, we may have fundamental issues about letting that root of trust leave our doors. We still want to own encryption at the root where we own the root key, and nobody else does.

This is where Vault has paid us quite a bit in dividends. We've had different ways we've approached this. Initially, we were told we had to write our own version of KMS for elastic storage. We had to do LUKS plus Vault and do a lot of complicated things to orchestrate all of that. Fortunately, Amazon came out with something called Bring Your Own Key.

We’re exploring what it would look like to add to the AWS secret backend in Vault today to have it to where, when I want to generate my own root key with my own entropy in Vault, possibly with something on-prem, possibly with something that's HSM-backed, I can do it right with Vault.

The beauty of that is that HSMs are an easy thing for your security team to say, "Go do," but the interface, the APIs around it are very cumbersome to use. Vault adds a nice layer on top of that to make it much more developer-friendly to where we could say, "All right. I want a new key. Let me call it foo. I'll maintain a record of it in Vault that I know is secure and then I'll send it off and have Amazon import it, and now I can use it with things like EBS, or S3, or other areas."

Then, secondarily, we had some services, like RDS. We use RDS PostgreSQL heavily, but with that it's hard to tell every individual client that they have their own encryption key, so we can tell them, "Go to the Vault Transit backend." It's encryption as a service: all the things that Armon mentioned earlier about making it much easier to generate keys, to maintain them, and rotate them.
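To make "encryption as a service" concrete, here is a minimal Python sketch that builds, but does not send, an encrypt request for the Transit backend's HTTP API. It assumes the backend is mounted at the default `transit/` path; the Vault address, token, and per-client key name in the usage lines are hypothetical placeholders, not values from the talk.

```python
import base64
import json
import urllib.request


def build_transit_encrypt_request(vault_addr, token, key_name, plaintext):
    """Build (but do not send) a Transit encrypt request.

    Assumes the Transit backend is mounted at the default "transit/" path.
    """
    body = json.dumps({
        # Transit expects the plaintext to be base64-encoded.
        "plaintext": base64.b64encode(plaintext).decode("ascii"),
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/transit/encrypt/{key_name}",
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical address, token, and key name, for illustration only.
    req = build_transit_encrypt_request(
        "https://vault.example.internal:8200", "example-token", "client-a", b"hello"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The application only ever handles ciphertext returned by Vault; the key itself stays in Vault, which is what allows one key per client without giving RDS any of them.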

We have two different ways to approach that problem of, "How do we make sure that we don't have key management issues for places that care about things like blind subpoenas against a service provider from a government body," or having a service provider go bad and having access to your data? Or durability concerns, where maybe the service provider says they're no longer going to honor your contract, you have to leave. You have all your data, but it's encrypted. How do you decrypt it to bring it to the next place?

Cloud identity management

The last takeaway is identity management. This was a bit unexpected, but we have a goal to get STS everywhere with AWS. We got burned by some long-term IAM credentials that ended up on github.com and did some bad things. I'm on a mission to say, "Look, I don't want to have to worry about what's out there. Let's use a short-term lease." Armon mentioned it earlier. I don't need this access forever. I need it for the next 30 minutes.

That's easy to do if you think, "Okay, I know how to set up federated auth in the console. That's pretty straightforward. I know how to say to my third parties, 'Here's the name of my role and a shared password to use when you want to get fine-grained access to my S3 bucket or my CloudWatch metrics.'" That's straightforward.

What do you do when your admins want CLI access to do things like run Terraform plans locally? How do you give them API keys? Amazingly, Amazon hasn't solved this yet, and for us, it was a simple issue of using the Vault AWS backend and then doing a markdown doc in a Wiki that said, "Here, take this piece of Bash script if you want or modify it however you want for your Z shell or whatever. Then, when you call this, as long as you can talk to Vault and you're authenticated with it, it will do all the work of issuing you STS creds. Fine-grained policies, it's auditable, all that."

I'd say, for us, we're getting to a world where we're having almost zero long-term IAM users. When we get to zero, I'm going to be pretty happy about that. That's another way that we can increase security. Here is an easy example: Calling a shell function, if you wrap it in eval, of course, it'll put it into your environment. It's made it pretty easy to do things like have a sandbox account that you can test out with Terraform locally.
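The shell function itself is not shown in the transcript. As a stand-in, here is a minimal Python sketch of the same idea: take the JSON that `vault read -format=json` prints for an STS credential and turn it into `export` lines that a shell can `eval`. The response field names (`access_key`, `secret_key`, `security_token`) are an assumption about the AWS backend's output and may differ by Vault version; the sample values are made up.

```python
import json
import shlex


def sts_exports(vault_json):
    """Turn `vault read -format=json ...` output into shell export lines.

    Assumes the credential fields are named access_key, secret_key and
    security_token; adjust if your Vault version names them differently.
    """
    data = json.loads(vault_json)["data"]
    env = {
        "AWS_ACCESS_KEY_ID": data["access_key"],
        "AWS_SECRET_ACCESS_KEY": data["secret_key"],
        "AWS_SESSION_TOKEN": data["security_token"],
    }
    # shlex.quote keeps the values safe to eval in a POSIX shell.
    return "\n".join(f"export {k}={shlex.quote(v)}" for k, v in env.items())


if __name__ == "__main__":
    # Made-up sample response, for illustration only.
    sample = json.dumps({
        "lease_duration": 1800,
        "data": {
            "access_key": "AKIAEXAMPLE",
            "secret_key": "abc/123",
            "security_token": "tok+en=",
        },
    })
    print(sts_exports(sample))
```

Wrapping the printed lines in `eval` in the calling shell puts the short-lived credentials into the environment, where tools such as Terraform pick them up.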

The next part was super exciting: seeing the TOTP backend. Not to have developers going in and calling it directly to implement TOTP for their 2FA, but actually for the root account. Just like Capital One, we have a lot of AWS accounts at our firm and it's funny that everyone has the same problem of, "I know I need to put MFA on my root account, but then I've got a team of people, and if you're in a geographically-distributed company, how do we make sure that at least somebody in the middle of the night can get access to this thing?"

I've got some team members where they will register a TOTP device or hardware device and throw it in a safe. You better hope that that time-zone's awake when you need it. Or two people, each one of them has a blue and green; one person has it, and another person has a copy of it. Again, you better hope that you keep that up to date so that, when that person quits, you're not down to one.

Here, when they added the TOTP backend to Vault, I'm like, "This is a perfect use case." We can do a shared password saved out-of-band somewhere else for root, but now we have a way to say, "Okay. I have a policy that allows a limited number of people to get the MFA code for the root account."

There are times you have to use root. If you want to submit a penetration test request to AWS, they only accept it as root, which is crazy ironic because it's a security function. And then, most importantly, it sends an event to a SIEM. Here's a small example of me with a sandbox getting a TOTP code, here [17:19].
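For readers who have not looked inside TOTP, here is a minimal Python sketch of what any TOTP generator computes from the shared secret and the current time, following RFC 6238 with HMAC-SHA1. This is an illustration of the standard, not Vault's own code; the 30-second step and 6-digit length are common defaults assumed here, and the secret in the usage lines is the RFC's published test key.

```python
import base64
import hashlib
import hmac
import struct


def totp(secret_b32, unix_time, step=30, digits=6):
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) for a base32 secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    # The moving factor is the number of time steps since the Unix epoch.
    counter = struct.pack(">Q", int(unix_time) // step)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    # RFC 6238 test key ("12345678901234567890"), base32-encoded.
    secret = base64.b32encode(b"12345678901234567890").decode("ascii")
    print(totp(secret, 59, digits=8))
```

Because the code depends only on the secret and the clock, holding the secret centrally and gating reads of the code by policy gives the same result as a physical device in a safe, with an audit trail on every read.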

Then, more importantly, with the SIEM, we're shipping that and are able to generate things like alerts so that, if you see that somebody called the sandbox, here, you can ask, "Who was it that did that?" Now at least we have a fighting chance to know not only that somebody logged in as root, but probably who it was. That's a unique thing that I think Vault's interestingly positioned to help solve.

Hopefully, your takeaway from this is that Vault is great in its core mission of helping to contain secrets, but I think there are a lot of other ancillary things that help you accelerate and go faster as you start to adopt the public cloud.

All right. Thanks.
