Case Study

Adopting GitOps and the Cloud in a Regulated Industry

Hear how BOK Financial adopted HashiCorp Vault, Terraform, and Packer to streamline their workflows in a compliance-heavy environment.

As companies move into public cloud, many aspire to use infrastructure as code yet struggle to make that a reality. Skill gaps and the urgency of rapid delivery often lead to manually provisioned and maintained cloud environments, resulting in operational burdens and increased risk.

Learn how BOK Financial created an "as-code" operating model with no prior automation or coding experience in a highly regulated industry, while strengthening security, increasing resiliency, and driving efficiencies. 

Transcript

Kris Jackson:

My name is Kris Jackson, I am the Manager of Cybersecurity, Engineering and Operations at BOK Financial. This is...

Andrew Rau:

I'm Andrew Rau, I'm manager of our cloud services, DevSecOps, and Kubernetes.

Kris:

Fantastic. I got a little picture for you guys. These are our families. I'm on the right here; I'm not the small one. I've got two beautiful children: Ellis, who's five, and Valentine, who's four. As I was leaving, they said, "Dad, don't go, don't go. We want to play with you." They didn't want to see me leave, get on an airplane. They wanted to come on the airplane with me. It's like, "No, I've got to go. I got to go do something very important." They're like, "Dad, do you got to go protect the internet?" That is exactly what I'm here to do.

Andrew:

Obviously, I'm on the other side. I have a daughter, her name's Lily, she's seven, obsessed with YouTube. I told her I was coming out here to speak and that it'd likely be recorded. She said, "You're going to be on the internet?" "Yeah. I'm going to have a picture of you." "I'm going to be on YouTube? I get to be on YouTube?" She was excited just to have the picture.

Kris:

So who is BOK Financial? We are a midsize bank. We're top 25 US-based. We have a large business in lending to energy, health care, as well as assets under management. We're one of the largest companies (US-based) with assets under management. Very diversified, highly rated, very traditional lending, so we lend to very safe customers.

Dealing with Regulations

But as a regulated industry, we find ourselves under additional obligations. And if you unpack that a little bit, what does it mean to be regulated? It means that society has decided that we have additional obligations to our customers and to the community, beyond those of a standard business. The regulations that we find ourselves under are PCI (payment card industry), GLBA (Gramm-Leach-Bliley Act), HIPAA, and others. Of these regulations, some are private or community-based, like PCI, which is an industry-based regulation. Others are government-sponsored, like GLBA.

And because everybody does not what you expect, but what you inspect, we have regulators. As a national association, we're primarily responsible to the Office of the Comptroller of the Currency. They're the ones who make sure that we do what we say we're going to do. Now, maybe you've heard the saying, "Move fast and break things." Incredible saying. But in regulated industries, it's the opposite, because if you break things, customers aren't able to access their money. Or if you're in energy, people don't have power. These are really bad outcomes. So it's much more important to be available, to have accurate information, and to maintain confidentiality. Those are much more important than innovation in our space, and that's what the regulations are a recognition of.

Andrew:

Our starting point in 2018 — this was before we started moving into public cloud — we had a datacenter, like most companies. Systems, storage, network, everything was there. Because of managing risk, we had to separate out the responsibilities based on their role. So we had developers that could do one thing. We had Network. We had Security, who could make firewall changes — nobody else.

Firewalls And Moats

When we had our most sensitive applications, things that could move money for example, we put them behind a firewall — separate rules, completely isolated, very locked down to protect them. A lot of this was manual, which meant we had to go through a change management process. It had to be reviewed by a bunch of other people and signed off on and say, "Yes, there's no risk to this type of change."

And then we started our cloud journey. The biggest challenge with public cloud is it’s really designed for a variety of use cases. It could be for a personal blog, or it could be for mission-critical applications supporting a business. So when we take these different roles that we have, the question was, “How do we effectively and securely deploy and manage our public cloud infrastructure?” What it really came down to was, we needed a brand-new operating model. We had to change the way that we worked and really start from the ground up.

Kris:

As we began this journey, one of the questions we asked ourselves is, "What could go wrong?" You're in the cloud. We have tons of experience in operating traditional datacenters. If you think about what a traditional datacenter is, it follows a “castle and moat” philosophy. You have your asset. It's very well protected by high walls — by that we mean firewalls — you put everything inside, and you have very specific gates that you allow entrance into and out of your castle. If an attacker is going to try to breach it, they have to come through a very small opening that you've well protected.

Risk Reduction with Dispersed Assets

This is the traditional model, but the cloud looks nothing like that. As you move into the cloud, you have the concept of villages that exist outside of your castle that you have to protect. And the protections aren't a giant firewall. They are configurations that have to be deployed at the edge and managed at scale. Instead of protecting one asset or one logical asset, now you're protecting assets that are spread across the environment. And the key isn't how well you can maintain firewall rules, how well you can limit access. It’s how do you manage the configuration and state of all of the resources that you're deploying out?

There were a couple opportunities on how we were going about reducing the risk. It's not necessarily that cloud represents more risk than a datacenter. It is a different collection of risks.

The first one that we looked at was role separation. And this is like the tried-and-true datacenter approach. We have different people who are responsible for different things. If you want to change the firewall, you have to go to Security. If you want to make changes to the network, you go to the Network team. If you need storage, Storage team, yada, yada, yada. There are advantages to that. It is effective at making people accountable for different things. The Network team is accountable for making sure that a packet can get from A to B. All they care about is every time you send that packet, it's able to make the connection.

For a network admin, the best thing to do is zero, zero — everything's allowed. That's perfect availability, that's what they're accountable for. Security is the opposite. We care most that the right packets are getting through, and all things being equal, it's usually better to deny a packet than to allow the wrong packets into the network. But this has a lot of drawbacks to it — biggest one is the isolation and the delays as it passes from team to team to get a process done.

So we moved a little bit right of that and we said, "Okay, how can we offset some of these risks?" The next option was detective controls. This is scanning your environment, looking for misconfigurations, and saying, "Okay, this is bad. Let's go fix that." And this is a better solution in many ways, because it allows you to distribute the ability to change your environment while managing risks, at least to some level. But it's not ideal.
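As an illustration of a detective control, here is a minimal sketch in Python of a scan over an inventory of deployed resources. It is not BOK Financial's tooling: the `Resource` shape, the setting names (`public_access`, `ssh_sources`), and the two rules are assumptions chosen to mirror the misconfigurations mentioned later in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical snapshot of one deployed cloud resource."""
    name: str
    kind: str
    settings: dict


def find_misconfigurations(resources):
    """Return (resource name, finding) pairs for known-bad settings."""
    findings = []
    for res in resources:
        # Rule 1 (assumed): storage accounts must not be public.
        if res.kind == "storage_account" and res.settings.get("public_access"):
            findings.append((res.name, "storage account is publicly accessible"))
        # Rule 2 (assumed): SSH must not be reachable from any address.
        if res.kind == "security_group" and "0.0.0.0/0" in res.settings.get("ssh_sources", []):
            findings.append((res.name, "SSH is open to the internet"))
    return findings


# Example inventory; in practice this would come from a cloud provider's API.
inventory = [
    Resource("logs", "storage_account", {"public_access": True}),
    Resource("web-sg", "security_group", {"ssh_sources": ["10.0.0.0/8"]}),
]
for name, finding in find_misconfigurations(inventory):
    print(f"{name}: {finding}")
```

A scan like this only reports what is already deployed, which is why the talk treats it as better than pure role separation but still "not ideal".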

The next step is preventative control. How do we prevent misconfigurations from getting into the environment entirely? If we can codify what good looks like, then we can begin distributing access outside of a small group of role-separated people into “anybody can push changes as long as it passes the rules.” We settled really on the last two, as the majority of our control is detective and preventative.

Everything Is Code

Role separation, we minimize. We'll talk about how we do that in just a little bit, but it's definitely minimized compared to a traditional deployment. We've called this approach GitOps. It is the implementation of DevSecOps into the infrastructure and security components. And this is an “everything is code” model, meaning when users need access to do something, they don't have access to click around in the console; they have access to make merge requests into a Git repo. Those merge requests have to go through an approval process. We’ll talk about it a little bit more later.

The big drawback to going this route is that if I click around in the console, I can get it done a lot faster. That's just the reality. This is the constant temptation that is before us as security and IT professionals — it’s like, I need to deploy a storage account. If I just go in there, click a couple times, it's done. I win. If I have to do it as code, I have to understand what I'm doing as code, and it's a more formal process and approvals and it's going to take me longer.

And they're right! It does take longer to do things in an “as code” manner, at least the first time. But if you're in this room, chances are you are in, or work with, a regulated industry, and you're not going to deploy it once: you're going to deploy it to dev, then to test, and then to prod. And each time, if you click through the console, you have to go through the same work. Not only the same work, but as you have more environments (two, three, four, five), keeping them all identical becomes a nightmare. You have to make sure that one small change you forgot about two months ago, when you were building out your prod environment, gets replicated across all of them. It’s very tedious, difficult, and error-prone.

If you go with the code model, it takes longer to get the first environment built, but the second, third, fourth are reusing the same code. It becomes very, very easy and quick, and you're sipping Mai Tais on the beach while the yahoo that went the other way got dev done real quick but is struggling to make sure that test and prod perform exactly the same way. We found this a very favorable approach.

Andrew:

So, it’s 2020, we're about a year and a half or two years into our public cloud journey. We're working with one cloud provider. We're using a lot of their native tooling. We're actually pretty happy. We like it. We weren't using HashiCorp at the time. And then we got this project that came along. My friend Kris here was the one that gave it to me and said, "We need a second cloud provider for this one capability." And basically our world went like this [slide shows single cloud relationship separating into two]. What happened was, some of the native tooling in cloud providers is phenomenal, and some of it will make you lose your hair. Which one of us do you think lost our hair? It was me.

Kris:

Role separation.

Andrew:

Role separation, yeah. I dealt with the frustration. Had a beautiful head of hair like his right up until this point.

So, we are working here and we have two cloud providers. We don't like the tooling of one of the cloud providers. The other thing that we thought of was, "I really don't like doing things differently for different cloud providers. I want to do it the same way. I don't want to have a developer know one language for one provider and then a different language for another provider, a different method." We said, "How do we make this common?" And that's where HashiCorp Terraform came in.

Terraform Talks To Multiple Providers

We introduced Terraform in 2020, and that became that stretch across our multiple cloud providers. We built internal Terraform modules to conform to our standards for resiliency and security and things like that and said everything had to go through that. That really allowed us to go into this one way of working. We're using one language across multiple cloud providers.

To give you an idea of how we're using it, we have some metrics. As of June 30th, we have 16 different providers that we're connected into; those aren't necessarily cloud providers, and I'll explain that a little later. We've had a total of 33 contributors between 2020 and the first half of 2022. We've developed 76 distinct Terraform modules with around 800 different versions. Like most code, not all of those versions got used, because we'd introduce one and then it didn't work right. Of those 33 contributors, 13 contributed to the Terraform modules, so we had a pretty small user base contributing to these modules. Using the modules, we generated almost 8,000 different resources within Terraform. That does not include things like VMs that may get spun up through scaling, of which we had about 10,000 last year with an average lifespan of one and a half days.

Kris:

The approach that we went with as we implemented GitOps was a guardrails-based approach. The goal was that we wanted to democratize the ability to deploy resources into our cloud environments by anybody in our IT organization. If you're on an application team and you own an application, and you need to make a change to a security group and to identity and access management roles and permissions — to anything — you could independently make that change and own the configurations that are powering your environment. But being regulated, we have obligations on how we go about that. The solution was merge policies.

Building Guardrails

The first regulatory obligation we had to meet was that a user or developer cannot push their own code into production independently. This was because of that classic, "Hey, I'm going to make a change to this code and I'm going to get one fraction of a penny off of every transaction." You've heard that before. We had to address that. This was really easy. We created a merge policy that said that if you're going to merge into a protected branch, you have to get one non-author approval on the merge. Easy-peasy. Now we have dual-control capabilities on all changes that go into any protected environment.
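The dual-control rule can be sketched in a few lines of Python. This is only a model of the logic, assuming a hypothetical `can_merge` helper and a protected branch named `main`; in practice the rule would be configured in the Git platform's merge settings rather than written by hand.

```python
def can_merge(author, approvers, target_branch, protected_branches=("main",)):
    """Allow a merge into a protected branch only with a non-author approval.

    The function name, arguments, and the default branch list are
    hypothetical; they only model the rule described in the talk.
    """
    if target_branch not in protected_branches:
        return True  # unprotected branches are not subject to dual control
    # An author approving their own change does not count.
    non_author_approvals = {a for a in approvers if a != author}
    return len(non_author_approvals) >= 1


print(can_merge("kris", ["kris"], "main"))    # self-approval only
print(can_merge("kris", ["andrew"], "main"))  # one non-author approval
```

The first call is rejected and the second is allowed, which is the "one non-author approval" behavior.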

The second one was HashiCorp Sentinel policies. We said, “There are certain configurations that nobody should ever push out,” for example, a public storage account. “We're not going to allow public read/write on a storage account. We're not going to allow SSH from the internet to a security group.” Very basic adverse states no one should ever deploy. Now the reality is, somebody eventually will want to do one of those things, and already has, and we make specific exceptions for that. But you can't do that by yourself in our process without us coding in some very specific exceptions. Sentinel policies allowed us to trust our users to deploy out things, knowing the worst things can't happen.
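To make the "deny by default, with coded exceptions" idea concrete, here is a minimal Python sketch. It is not Sentinel code and not the speakers' policy set: the `PlannedResource` shape, the rule names, and the `EXCEPTIONS` entry are all hypothetical stand-ins for a real plan and policy.

```python
from typing import NamedTuple


class PlannedResource(NamedTuple):
    """Hypothetical view of one resource in a proposed change."""
    address: str
    kind: str
    settings: dict


# Hypothetical, explicitly coded exceptions: (rule name, resource address).
EXCEPTIONS = {("public_storage", "storage_account.public_website")}


def violations(plan, exceptions=EXCEPTIONS):
    """Return (rule, address) pairs that break a rule and have no exception."""
    found = []
    for res in plan:
        if res.kind == "storage_account" and res.settings.get("public_access"):
            rule = "public_storage"
        elif res.kind == "security_group" and "0.0.0.0/0" in res.settings.get("ssh_sources", ()):
            rule = "ssh_from_internet"
        else:
            continue  # nothing adverse about this resource
        if (rule, res.address) not in exceptions:
            found.append((rule, res.address))
    return found


def policy_passes(plan, exceptions=EXCEPTIONS):
    """The change may proceed only when no unexcepted violation remains."""
    return not violations(plan, exceptions)
```

Because the exception list lives in code, adding an entry would itself go through review, which matches the point that nobody can grant themselves an exception alone.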

Then we added code owners, and this is the separation of duties that we talked about earlier. There are specific changes that our regulators say have to be done by certain individuals. If you make a change to a network security policy, a security engineer has to evaluate the risk of that change and approve it. We use code owners to do that. If you make a change to a security group, a security engineer is listed as a code owner on that type of resource. We'll go in, we'll review the change, click Approve, go. We can approve almost 98% of changes in less than five minutes. It's a very low barrier compared to the classic “throw it into a ticketing system and wait for somebody to pick it up two or three days later.” Much, much quicker. And the application teams own their code. They really do own that code soup to nuts.
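The code-owner rule can be modeled as a lookup from the kind of resource changed to the group that must approve it. The sketch below is an assumption-laden illustration in Python: the `CODE_OWNERS` mapping, the group name `security-engineering`, and both function names are hypothetical, and a real setup would use the Git platform's own code-owners feature.

```python
# Hypothetical mapping from resource kind to the owning approver group.
CODE_OWNERS = {
    "security_group": "security-engineering",
    "network_policy": "security-engineering",
}


def required_owner_groups(changed_kinds):
    """Groups that must approve, given the resource kinds a change touches."""
    return {CODE_OWNERS[kind] for kind in changed_kinds if kind in CODE_OWNERS}


def owners_satisfied(changed_kinds, approvals_by_group):
    """True when every required group has at least one approver.

    approvals_by_group maps a group name to the set of its members
    who approved the change.
    """
    return all(approvals_by_group.get(group) for group in required_owner_groups(changed_kinds))
```

Changes that touch no owned resource kind need no extra sign-off, so application teams keep ownership of everything else.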

Addressing Vulnerabilities

If we talk about the entire stack of pipeline security — we'll re-go over it just so you can see it all in one picture. You have the merge controls at the far left of the screen, you have Sentinel policies that prevent adverse configurations. Then we were really happy with things and we operated just like this for a while. But one of the risks we identified was the persistent access tokens or API keys that existed in these infrastructure as code workspaces. We were very unhappy with the fact that if one of these were to get mishandled or disclosed or something, it's out there for a long time.

One of the more recent changes that we made to our environment was we added HashiCorp Vault to issue STS tokens or short-lived credentials that last less than one hour. When the pipeline kicks off, the credentials are created, they're handed to the pipeline, it uses them. If they were to get lost or disclosed or written to a log or taken, any of that, we at least know that the window that they can be used is extremely short. This has been one of our favorite recent changes.
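The value of short-lived credentials comes down to a bounded expiry time. The following Python sketch models only that idea with the standard library; it does not call Vault or any cloud API. The `Credential` class, `issue_credential`, and the 30-minute default are assumptions, and the one-hour ceiling is taken from the talk.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_TTL = timedelta(hours=1)  # the talk says credentials last less than one hour


@dataclass(frozen=True)
class Credential:
    """Hypothetical pipeline credential with a hard expiry."""
    token: str
    expires_at: datetime

    def is_valid(self, now):
        return now < self.expires_at


def issue_credential(now, ttl=timedelta(minutes=30)):
    """Mint a random token that expires ttl after now (ttl must be under MAX_TTL)."""
    if ttl >= MAX_TTL:
        raise ValueError("TTL must be under one hour")
    return Credential(secrets.token_urlsafe(32), now + ttl)


start = datetime.now(timezone.utc)
cred = issue_credential(start)
print(cred.is_valid(start + timedelta(minutes=5)))  # still inside the window
print(cred.is_valid(start + timedelta(hours=2)))    # a leaked token is useless by now
```

Even if the token ends up in a log, the exposure window is capped by the TTL rather than by how quickly someone notices and rotates a key.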

Then also on — this is our beta side. Everything so far we do very well, but we've got a little beta that we're running right now with Terraform's new Run Tasks feature. This issues a webhook out. It goes into our chat platform and says, "This workspace is doing this change that's outside of what we expect this workspace to do. Do you want to approve it or deny it?" We have separate workspaces that are owned by different organizations, so we can use that as a role-based access control, and who clicks the button has to come from specific security workspaces. Now we've got this really cool gate that's dynamic in which we can trust people to do things that they usually do. But if they do something unusual, we have a gate there where we can have additional approvals.
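The dynamic gate can be thought of as comparing a change against what a workspace normally does, and escalating only on a mismatch. The Python sketch below models that decision under stated assumptions: `EXPECTED_KINDS`, the workspace names, and the return values are hypothetical, and it does not use the actual Run Tasks API or any chat platform.

```python
# Hypothetical baseline of resource kinds each workspace usually changes.
EXPECTED_KINDS = {
    "payments-app": {"virtual_machine", "storage_account"},
}

# Hypothetical set of workspaces whose members may approve unusual changes.
SECURITY_WORKSPACES = frozenset({"security"})


def unexpected_changes(workspace, changed_kinds, expected=EXPECTED_KINDS):
    """Resource kinds in this change that the workspace does not usually touch."""
    return sorted(set(changed_kinds) - expected.get(workspace, set()))


def gate_decision(workspace, changed_kinds, approver_workspace=None):
    """Return 'allow' or 'needs_approval' for a proposed run."""
    if not unexpected_changes(workspace, changed_kinds):
        return "allow"  # routine change, no extra gate
    if approver_workspace in SECURITY_WORKSPACES:
        return "allow"  # unusual change, approved by a security workspace
    return "needs_approval"
```

In this model a routine change never waits on a person; only the unusual one produces the approve-or-deny prompt.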

This is still in our beta. We love it right now, but we may end up hating it. I don't know if we'll keep it or not, but it's a cool concept and so far I'm really happy with it.

Andrew:

We talked a lot about configuration, deploying the resources into the cloud. When we talk about the compute, there's a big question you have to ask when you're pursuing cloud: “Do I want it to be immutable or not? Am I patching it in place?” That was a question that we asked ourselves, and to start with, we said, "You know what? We don't really want to set up everything to do patching in place." There's a lot of benefits to being immutable. When it's unchangeable it means that chances are I'm deploying it in an automated repeatable manner, which means I can use automatic scaling, I can set max time to live. So from a security perspective, if something gets compromised, it’s maybe seven days max somebody's on that compute.
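The "max time to live" idea reduces to a simple age check on each instance. Here is a minimal Python sketch, assuming a hypothetical `instances_to_replace` helper and a mapping of instance names to launch times; the seven-day limit is the figure mentioned in the talk, and a real deployment would rely on the cloud provider's scaling features rather than hand-written code.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)  # "seven days max" from the talk


def instances_to_replace(launch_times, now, max_age=MAX_AGE):
    """Names of instances that have reached their maximum lifetime.

    launch_times maps an instance name to its launch datetime.
    """
    return sorted(name for name, launched in launch_times.items() if now - launched >= max_age)


now = datetime(2022, 6, 30, tzinfo=timezone.utc)
fleet = {
    "web-1": now - timedelta(days=8),
    "web-2": now - timedelta(days=2),
}
print(instances_to_replace(fleet, now))
```

Here only `web-1` is past the limit, so it would be replaced with a fresh instance built from the current image.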

Immutable vs. Patching In Place

We decided that we wanted to go immutable with our cloud infrastructure or compute. We ended up going with HashiCorp Packer. We deployed that within our cloud environment. We set up all the configurations so that we, especially in a multi-cloud setup, could use the same template to build the same OS with the same settings. We install our base software that is required for endpoint protection and vulnerability management, things like that, as well as really secure it to our hardening standards. Then we publish it and we say, "You have to use these. This is what you can use." So the application teams use these golden images that we publish for them.

Then last we put it on a schedule. We say, "Hey, every week we're going to publish a new golden image. That's your latest. And then every month we're going to update it to the stable." (So you have to test it.) Now we have this automatic life cycle instead of patching and trying to track down, “Why didn't it get patched? We got to figure out what's going on.”
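One way to picture the weekly "latest" and monthly "stable" channels is as two pointers into the same list of builds. The Python sketch below is an assumption about how such a scheme could work, not the speakers' pipeline: `resolve_channels` and the idea of a recorded promotion date are hypothetical.

```python
from datetime import date


def resolve_channels(build_dates, last_promotion):
    """Pick the image behind each channel.

    latest: the newest weekly build.
    stable: the newest build published on or before the last monthly
            promotion date (assumed promotion rule).
    """
    builds = sorted(build_dates)
    if not builds:
        return {"latest": None, "stable": None}
    eligible = [b for b in builds if b <= last_promotion]
    return {"latest": builds[-1], "stable": eligible[-1] if eligible else None}


weekly = [date(2022, 6, 6), date(2022, 6, 13), date(2022, 6, 20), date(2022, 6, 27)]
channels = resolve_channels(weekly, last_promotion=date(2022, 6, 15))
print(channels["latest"], channels["stable"])
```

Teams that track "latest" get each week's image immediately, while "stable" lags until the next promotion, leaving time to test.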

We've got this great setup, all these controls with the merge approvals, code owners, things like that. We're really happy with our multi-cloud environment. And just like a couple years ago when you started seeing avocado in a lot of different things, you're like, "Why would I ever want to eat that? Why would I want to try that?" Then you try it and you think, "Wonder what else it can go in." You're like, "Oh man, I want to put it in this. I want to put it in that." And that's kind of what we were doing. We wondered, "What else can we do with this?" 

Extending The Model

We said, "All right, let's take cloud out. What else can we do? How about we manage our Git repo system?" So we did that. Today, it's all done through code. We have a Git repo that goes through Terraform to manage itself. We love it. It's that circular reference. There's no owner of our Git environment. Nobody has write access.

This is also a security control because we can ensure that the settings, the merge request approvals and things like that, can't be overwritten by anybody. It has to go through this other process, which is going to ensure that it's going through there. In the same vein, we said, "Let's manage Terraform Cloud the exact same way." So we have Git, Terraform going to Terraform. It's a beautiful setup. And that's where we started getting our 16 providers. We started expanding it to some of our software-as-a-service and our deployment pipeline products.

The Future

We've talked a lot about what we've done. I want to spend just a little bit of time talking about where we're going with HashiCorp, and this will be wrapping it up. First thing is, we talked about our Terraform modules. They're great. There's some nuances with it. Sometimes it’s difficult to get something in an advanced concept to work inside of a Terraform module for our developers. What we want to do is start building our Sentinel policies as part of our Terraform modules themselves. It helps solve two things. First is, it gives us our test cases. We can test our Terraform modules — make sure that if somebody tries to change a module and allow public access on a storage account, the Sentinel policy will deny it.

The other thing is, though, we can basically aggregate those Sentinel policies. Now we can say, “You don't have to use modules. They're there to make it easier for you, but you can deploy just the base resources yourselves, and you're still enforced with the same rules essentially.”

Kris mentioned the Run Task features — it's in beta, but we're liking it, so we're thinking about what else we can do with this. What else can we expand upon or build into this custom process that we have? How do we enhance it?

Last, and probably the biggest of everything, is going to be our contributors. It's just a small portion of our technology workforce, and we really want to expand to the rest of them. We want to get more people involved. And really what we want to do is say, “The systems, the storage, the network, and the security components that are within our datacenter, we want to manage it the exact same way so that we can have multi-cloud as well as within our datacenter, a single working model.”
