Case Study

Terraforming NBIM’s Tenant Self-Serving Apps as Code

Published 11:00 AM UTC Aug 10, 2022

Learn how Norges Bank Investment Management migrated its brownfield deployments to use Terraform and custom modules to self-service Azure AD applications.

Inconsistent change control processes, manual lifecycle management, and not meeting the principle of least privilege — these were some of the problems found that led Norges Bank Investment Management to change the way it deploys and manages its 600+ applications.

The answer: Build a new operating model based on infrastructure as code, GitOps, and self-service.

In this case study presentation, you’ll learn how NBIM halved its support tickets, reduced its time to deployment, improved its security posture, and built reusable code foundations. You’ll also see the effort and trade-offs that were required to get there.

»Transcript

Hi, everyone. Very good morning to you all. I'd like to start by saying how very excited I am to be here today. This marks quite the difference from my last year's talk at HashiConf Europe, where I had the privilege of presenting to a smartphone on a tripod in the comfort of my one-bedroom flat in London.

Today, I'll be using our time together to talk about Terraforming our tenant and how we've delivered a solution to self-service apps as code. To kick things off, I'd like to ask everyone in the audience a question. By show of hands, who of you have previously thought to yourselves, "If I could start fresh on designing my system, I'd be using an infrastructure as code approach?"

There we go. Couple hands. Great. The good news is I'm not alone. Not crazy. But more importantly, unfortunately, we have to face reality. Like in many other types of development, we simply don't have the luxury of starting fresh whenever we want to. Especially if our systems are already depended on by a number of users or stakeholders.

My goal here today is to answer the following question: "How do I migrate to using infrastructure as code whilst maintaining the user experience?" Well, again, in short, with great difficulty. But don't worry. We've got about another 28 minutes for us to go through everything I wish I knew from the beginning and everything that I've learned throughout that journey.

»Presentation Overview

Today my talk will be in three parts. I'll begin by introducing our Azure Active Directory tenant. I'll tell you where we started and why we felt ready to begin. Next, I'll go through what we've actually built, the principles, and what we've done to deliver Terraform for this purpose. Of course, it wouldn't be fair to you if I didn't level the playing field and do a reality check. I'll tell you what it cost us to do this, as well as what we've gotten out of it.

So, quick round of introductions. My name is Peter Barta, and thankfully, as of yesterday, I've got good news, and I'm now a Senior Cloud Security Engineer at Norges Bank Investment Management. Thank you. Thank you. Very chuffed as well. I've been designing and building security solutions within the IAM space for a number of years. And I'm a huge fan of self-service and automating things away.

As for NBIM, also known as The Oil Fund, we are Norway's sovereign wealth fund and the largest in the world at that. Valued at over a whopping $1.2 trillion, we have a critical mission of ensuring that we safeguard and build financial wealth for the future generations of the Norwegian people.

If you've ever heard the expression, "A finger in every pie," well, I think NBIM is probably one of the best examples of that, with certain ethical exceptions. We are invested in over 9,000 companies worldwide and own 1.4% of all publicly listed companies. Next time you think about pretty much any company, you can think future prosperity of the Norwegian people.

»Why Are We a Good Use Case?

Well, with just over 500 employees spread across Oslo, London, New York, and Singapore, we're a diverse but incredibly lean organization compared to the size of the fund. And as I'm sure many of you know, in such environments, you have to build incredibly efficient solutions that are easy to use and scale with the business.

Next, we support the fund with a combination of internally developed and industry-leading tools, and we do this in a fully cloud environment. So, like many of you, we're looking to use the latest and greatest and ensure we're at the forefront. We do all this to enable the optionality and pace of development needed to remain a world-class fund. But it's clear those very same qualities and attributes help you succeed within your own respective industries as well.

»Azure Active Directory

Let's talk about our tenant. Where did we start, and why did we start? The focus of today is on our Azure Active Directory tenant. It acts as our primary identity provider and serves us in three key areas. First, it houses all of our application and service identity definitions. We're not the biggest tenant by any means, but we still have well over 600 applications already defined that need managing.

Next, it also houses the authentication configuration for all these applications. And many of these apps have additional complex configuration set up to enable single sign-on or whatever else we might need for the time being. A significant portion of the authorization is also controlled through our tenant. We have thousands of AppRoles and assignments to users or groups to ensure that we have the right permissions and privileges in place for all of our applications.

»Overdue for an Overhaul?

Now, as our tenant has continued to grow over time, we've increasingly felt the friction of not abiding by our own principles. Let's take a quick look at what those are. Firstly, like many of us here in the audience, we strive to deliver everything as code. It's an industry best practice and reduces manual, hands-on work that we want to move away from under any circumstances possible.

Next, we have to ensure we have a strict and efficient change control process. We need to be able to protect our production services across the fund at any time. And lastly, we have to ensure we're following security best practices. We need to be following least privilege at all times. That's always clear.

Then, when we took a hard look at where we really were, we realized that we weren't quite hitting the benchmark on all of those points. The reality is that for all of those 600 applications, from onboarding through to managing the entire lifecycle of that application, we were doing it manually and hands-on through the user interface. Not great.

As part of that, we had some applications that were managed by their owners and others that were managed on behalf of the owners by more experienced teams. So, we had a very inconsistent change control process where some users were involved and others weren't. Not great again.

These two points were symptomatic that we had a wider security problem. We had a huge number of application developers and over ten tenant-wide application admins to actually support doing all of this work manually and hands-on. So, not really following least privilege there either.

»When Is the Right Time for Change?

I mentioned at the beginning of the presentation that we don't all have the luxury of starting fresh. We all like to think we can, but well, business says no. So, what happens when you look at this brownfield problem?

We had to decide when the right time for us to begin was — and with that, we started out our maturity review. The cycle begins. We take a look at ourselves and say, "We are not quite at the benchmark that we need to be at." We say, "Well, that's right. Infrastructure as code would be the best practice. It would make us industry-leading, and we should be doing it. So, let's go for it." Great, we are now in the midst of it. We begin our market research and see what's out there. In our case, we began looking at the Terraform provider.

Inevitably, we don't quite hit all of the use cases that we need to cater to already. Our environment has a number of demanding applications, and we should be able to manage all of them through Terraform. But if we can't, we have to wait. We say we're not ready, and we should wait for the provider to release more features, mature some more, and be ready for when we do need it. The next step is that we scrap the project. We said, "We're not ready yet. It's not there. It's not mature enough. Let's wait."

Of course, we've come back full circle. And this cycle will repeat over and over and over again until we eventually decide that something has to change — spoiler alert, something does change. I'm here talking to you today. So that's good news.

But what was it? For us, it was the introduction of an engineering threshold. We had to critically evaluate how much work we felt comfortable investing in to reach the success criteria and the value that we needed to achieve to make this project successful. Thankfully, we actually found that we were ready. We felt like we had the engineering talent, we had the time, and the need was great enough to embark on this project. Here we go. We get started.

»The Problems With No Easy Solution

Now, the biggest point of pain typically comes right when you begin the project, and you've got this golden mist over your face, and you're thinking, "We're ready to go." But then you have to wake up and face some very hard truths. It's not that easy to build the solution that you need that's going to be perfect.

»Implementation

We looked at the first phase, which was going to be the design and implementation. We realized that, because we were building to support existing applications, we were going to have to build for the past. And that meant recognizing that, well, it's quite literally impossible for us to build an application today the way it was built three years ago. There were some very real limitations that we had to overcome with what the provider could do for us.

»Operating Model

Next, we looked at the operating model and how people were going to use this. We are moving away from a user interface, and to ask that of end users is not a small thing. People like their UIs, and they're very accustomed to them. You have to give them a dang good incentive to go down this route. The same applies to the admins. When you take away admins’ permissions, they're not happy. How do you appease them?

»Migration

When it comes time to use it for real, we have to think about what it means to migrate. We have a huge number of applications that are already there. We have to import them. That, in itself, is a challenge. But even more so, we have to make sure that we do it gradually. When you work in a lean environment, it's not enough to do a big bang change. So, how do we do this gradually?

»Terraforming for Today and Tomorrow

Well, I've given you a lot of information and a lot of challenges, so I'll take a quick check. Hands up if you're still with me. Wonderful. Most hands. Good. Let's talk about Terraform and what we've built.

»Design Principles

We started with our design principles, of which there were four. For each one of these design principles, I'll go through and explain what we did. First of all, we had to be human-friendly. It had to be easy to use and intuitive for users to get started. They had to follow patterns that they were already familiar with. As I mentioned, moving away from a UI is no small change.

Next, we had to give back control to our users. It's not enough for them to be able to do some things themselves but have to rely on another team to do others for them. This notion of control needs to be back in the hands of the owner of the application.

Next, we have to fail safely. It's not good enough to say, "Sorry. It broke," when an application owner tries to do something themselves. They have to feel confident and empowered to do things on their own, knowing that they won't break everything at the same time.

And lastly, the change has to be easy. It has to be an easy migration to make sure that we get the highest rate of adoption and the most people that are satisfied and happy to use our solution.

»Human Friendliness

Taking each one of these in turn, what did we do? On the note of human friendliness, we had to design for familiarity. I think the first thing that comes up on this slide is something that's quite eye-catching because I think it's almost the exact opposite of human-friendly. The first thing that we did is we said no to IDs.

I'm sure I'm going to anger some developers, and some hardcore advocates of this whilst I do this. But we had to say, "If you want people to use a codified approach and take them away from a user interface, you have to find something in the middle."

And for us, that was to make everything human-readable. We did the same when it came to the configuration language. If you're expecting users to change, you can't say, "Oh, do you mind learning this new proprietary configuration language for me?" You know what they'll tell you to do? Yeah!

We opted to use YAML. It's already well established within our business, and our developers are very familiar with it. That meant we could do it a lot more easily. And I think you'll all agree with me; what you see on the left is a bit of a pain compared to what you see on the right.

I think it's like if you had a friend or any number of colleagues and you tried to remember their name using their phone number instead of their real name, it's not great. It's not going to work.
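As a rough illustration of the "no IDs" idea, the sketch below lets a definition refer to groups by display name and translates those names into object IDs at the last moment. The group names and GUIDs are made-up placeholders, and `resolve_group_ids` is a hypothetical helper, not NBIM's implementation; in practice the lookup would come from the directory itself rather than a hard-coded table.

```python
# Hypothetical directory snapshot: display name -> object ID (placeholder GUIDs).
GROUPS = {
    "Trading Platform Admins": "11111111-1111-1111-1111-111111111111",
    "Trading Platform Users": "22222222-2222-2222-2222-222222222222",
}


def resolve_group_ids(names, directory=GROUPS):
    """Translate human-readable group names into object IDs.

    Raises ValueError naming every unknown group, so a typo is reported
    before anything is applied.
    """
    unknown = [name for name in names if name not in directory]
    if unknown:
        raise ValueError(f"unknown group name(s): {', '.join(unknown)}")
    return [directory[name] for name in names]
```

The person writing the definition only ever sees the names; the IDs stay an implementation detail.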

Next, I mentioned the user interface. In Azure, you would normally click through a number of different panels to configure different elements of the application. How did this translate to code?

Well, we took an approach where we had to map what they were already familiar with into a simple and intuitive file structure. We started with a base.yml, which contains the very core elements of an application that are needed to make it work.

We then progressively added more and more files to cover all of the remaining areas. Not too many, of course — just five. But more importantly, we made sure that everything was optional. You only needed a base.yml to get started, and when you were comfortable and you wanted more configuration, you knew where to look, and you knew how to find it. Nice and simple.
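A minimal sketch of the "only base.yml is required, everything else is optional" rule might look like the following. Only `base.yml` is named in the talk; the optional file names here are invented for illustration, and `select_config_files` is a hypothetical helper.

```python
REQUIRED_FILE = "base.yml"
# Hypothetical names: the talk only names base.yml, not the optional files.
OPTIONAL_FILES = ["authentication.yml", "roles.yml", "permissions.yml", "owners.yml"]


def select_config_files(filenames):
    """Return the recognised config files for one application, base first.

    Unrecognised files are ignored; a missing base.yml is an error.
    """
    present = set(filenames)
    if REQUIRED_FILE not in present:
        raise ValueError(f"missing required file: {REQUIRED_FILE}")
    return [REQUIRED_FILE] + [name for name in OPTIONAL_FILES if name in present]
```

A caller could pass the names found in an application's directory, for example `select_config_files(p.name for p in Path("apps/my-app").iterdir())`, where the `apps/my-app` layout is again an assumption.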

»Ownership

On the point of ownership, I mentioned that there are things that people depend on the IAM team to do on their behalf, but we need to make sure that they can do all of it themselves. Otherwise, they'll never become independent.

So, we opted for a GitOps workflow. We made it self-serviceable, so anybody could write into the repository, and we used tools that were available to us within GitHub — like code owners, so people could define who would be responsible for an application. This way, we had a much better change control process as well because team members and owners of applications had to approve every change. They were involved, and they owned the process.
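To make the code owners idea concrete, here is a small sketch that renders one GitHub CODEOWNERS entry per application directory. The `apps/<name>/` repository layout and the team handles are assumptions made for the example; the talk does not describe how NBIM's repository is organised.

```python
def codeowners_lines(owners_by_app, root="apps"):
    """Render CODEOWNERS entries, one per application directory.

    owners_by_app maps an application folder name to a list of GitHub
    handles such as "@example-org/payroll-team" (placeholder values).
    """
    lines = []
    for app in sorted(owners_by_app):
        owners = owners_by_app[app]
        if not owners:
            raise ValueError(f"application {app!r} has no owners")
        lines.append(f"/{root}/{app}/ {' '.join(owners)}")
    return lines
```

With branch protection requiring code owner review, a change under an application's directory would then need approval from that application's owners.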

»Fail Safely

On the point of failing safely, we had to ensure that there was going to be minimum risk of destroying or breaking our production systems that people were already dependent on. We did this in a combination of ways. We started with our Terraform-based controls. Those of you familiar with Terraform will understand the idea of a state file. We separated our state files out so that each application has its own; we isolated them.

Now, some of you would definitely, and rightfully so, complain about what that means. But I would like to remind you — or encourage you to think — if anybody here has done state file surgery previously, you'll understand how painful that is. I would liken it to doing CPR — it's great that you can do it, but you hope you never have to.
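One way to picture per-application state isolation is a function that derives a distinct state location from each application's name. The key layout below is purely an assumption for illustration; the talk does not say which backend or naming scheme is used.

```python
import re


def state_key(app_name):
    """Derive a per-application state file key from the application name.

    The "apps/<slug>/terraform.tfstate" layout is a hypothetical convention.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", app_name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a state key from {app_name!r}")
    return f"apps/{slug}/terraform.tfstate"
```

Because each application gets its own key, a broken state only ever affects that one application. A real implementation would also want to reject two different names that normalise to the same slug.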

As part of building a custom module, we were able to ensure that certain configuration elements were always going to be set to what we wanted them to be. People couldn't make changes that might potentially put us at risk, and we could enforce certain security controls this way.
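The effect of a module that pins certain settings can be sketched as an overlay applied on top of whatever the application owner supplies. Which settings NBIM actually enforces is not stated in the talk, so the single enforced entry below is only an example, and `effective_config` is a hypothetical helper.

```python
# Illustrative enforced setting; the real list of enforced values is not given.
ENFORCED = {"sign_in_audience": "AzureADMyOrg"}


def effective_config(user_config, enforced=ENFORCED):
    """Apply enforced settings on top of user-supplied configuration.

    Returns the merged configuration and the sorted list of keys where the
    user asked for something different from the enforced value.
    """
    overridden = sorted(
        key for key, value in enforced.items()
        if key in user_config and user_config[key] != value
    )
    merged = {**user_config, **enforced}
    return merged, overridden
```

Reporting the overridden keys, rather than silently replacing them, gives the owner a chance to see why their requested value did not take effect.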

Of course, I did mention GitOps. We have options here for us to add additional checks to the way we normally handle pull requests in GitHub. We had linting and validation on every pull request, ensuring that people were referencing correctly created applications, the right group names, the right usernames. It makes it safe: like on a bowling alley, you've got those nice little rails to help you play safely.
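A pull request check of this kind could be sketched as a function that collects every problem in a definition instead of stopping at the first. The field names (`display_name`, `groups`, `owners`) are assumptions about what a definition might contain, not NBIM's schema.

```python
def validate_definition(definition, known_groups, known_users):
    """Return a list of human-readable problems; an empty list means it passes."""
    errors = []
    if not definition.get("display_name"):
        errors.append("display_name is required")
    for group in definition.get("groups", []):
        if group not in known_groups:
            errors.append(f"unknown group: {group}")
    for user in definition.get("owners", []):
        if user not in known_users:
            errors.append(f"unknown user: {user}")
    return errors
```

A pipeline step could fail the pull request whenever the returned list is non-empty and print the messages as review feedback.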

We found we were improving our disaster recovery posture, as well, as part of this. We've codified all of our apps. And every time changes are made, we know exactly what it was that was going to break an application and how to get it back to a working state.

»Easy Migration

It came time to use our solution. How did we do that? I mentioned we had to import all our existing applications, and that can be a challenge. We had to find a reusable and repeatable way of doing this to ensure that whenever people were ready, they could begin using our solution.

The answer for us was to create a reusable process that could generate the code definition, as well as import the application, on a regular basis. So when owners were making changes from one day to the next, the repository would still be kept up to date, and when they were ready to switch over, they would already have the latest definition of their configuration in code.
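The "regenerate on a regular basis" step relies on the generated definition being deterministic, so that a rerun only produces a change when the application itself changed. The sketch below shows that idea; it renders JSON rather than YAML purely to stay within the Python standard library, and `refresh_definition` is a hypothetical helper.

```python
import json


def refresh_definition(current_text, exported):
    """Render an exported application as stable text and report drift.

    Keys are sorted so identical settings always render identically.
    Returns the new text and whether it differs from what is stored.
    """
    new_text = json.dumps(exported, indent=2, sort_keys=True) + "\n"
    return new_text, new_text != current_text
```

A scheduled job could call this per application and commit the file only when the second return value is true.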

This also applies to new applications: people can have templates readily available for them to use super-easily. Because everything is code (copy and paste, change a couple of fields), it's much more friendly than going into a UI and trying to configure dozens of different fields and click through 15 different pages.

»Coming Together — A Workflow

All in all, what does this workflow look like from the end user's perspective? If we look at an existing application, what happens in the first two steps is essentially invisible to them. We generate a codified definition for them, as well as import the application. And we can do this repeatedly, as I've mentioned. We have separate state files for every application, so we simply recreate it whenever they need to. And whenever the user is ready, they submit their very first pull request.

As for creating a new application, copy a template. You have an app that you know works, so just copy the existing app. Simple as. You make a few changes, and again, you're back to the very first pull request. And the only time we need IAM approval for any of this is at that first pull request, where you define the code owners and the people that are going to be responsible for managing that application from that point forward.

»How Did We Solve Our Problems?

Now, I have mentioned that we've covered a lot of different problems. In the first instance, I said we had to build and extend Terraform. How did we do that? Well, quite literally, we extended Terraform. We've added additional scripts and data sources, and we ran additional things that used the API alongside Terraform. As long as Terraform wasn't upset, we weren't upset either. But it gave us the same control we needed ultimately.

For our operating model, we solved a lot of the issues we had by ensuring we used an easy-to-own and familiar structure for users. They were accustomed to what they had seen, and we did this very gradually. And on the note of migration, a safe and easy migration was part of it.

But something that I've failed to highlight is doing it incredibly gradually. We had a very long pilot period, six months, in fact. Maybe that's not too long, depending on who you ask, but for us, it was quite long. During this process, we invested in spending time with each of the teams, as well as technical and non-technical representatives across the business, to see how they felt using this solution.

We had very short feedback cycles and made sure that when we got feedback, we listened to it. This, in turn, transformed the teams. They were now supportive. They were champions for the solution, and they understood the benefit that it gave to them, the business, and our security posture as a whole.

»Time for a Reality Check

I've talked you through a lot of different bits and bobs and different implementation options, but we all understand at the end of the day, you have to build something, and well, you're going to have to crack some eggs.

For us, what cost did this come at? The first is that there are some very real consequences of our design choices. I'll be the first to admit, Terraform has some very cool features, such as the dependency mapping between plans and understanding what order to execute things in. That's great. It is. And we have ultimately removed that because applications are now completely separate.

We have to have ordered changes. But we found, thankfully, we don't have a great deal of application changes that are completely dependent on each other, so we can stomach that for the time being.
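With every application in its own state, any ordering between applications has to be worked out outside Terraform. A minimal sketch of doing that by hand uses a topological sort; the application names are placeholders, and nothing in the talk says NBIM automates this step.

```python
from graphlib import TopologicalSorter


def apply_order(depends_on):
    """Return application names ordered so dependencies come first.

    depends_on maps each application to the set of applications that must
    be applied before it. A dependency cycle raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

For example, `apply_order({"web-frontend": {"core-api"}, "core-api": set()})` places `core-api` ahead of `web-frontend`.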

Next, there are some very real limitations of the tools and strategies that we've used today. And we've added a lot of complexity because of the fact that we've had to engineer additional elements to what we've built.

So, we make API calls, for example, and we are now dependent on managing whenever the Azure API changes. If any of you have worked with the Graph API before, you'll know how frequently it changes and how unreliable it can be at times. There are a lot of breaking changes, and we have to be ready for those. The same applies for our module. We have to be aware of whenever our provider changes and ensure that we cap our version, limit it to where we are, and vet every possible change.

Of course, I wouldn't be here today if it was all that easy to do. There are actual engineering challenges here. It took a lot of time to build, and testing and repeatedly trying all of this — it's been quite cumbersome. There are certainly costs to be aware of. But I don't think that it's fair to say that it's all negative bits and bobs. Rather, we've actually got some positives as well.

»The Pay-Off

What were the tangible things that we got out of this? Well, I mentioned at the beginning, there's an incredibly high overhead for managing 600 applications manually. You can only imagine the number of support tickets that we were getting every week for changes of all sorts. Could you rotate a credential? Update a login URL?

»Halved Support Ticket Numbers

Users simply didn't feel empowered to own their own applications. And as such, they were requesting us to do it on their behalf. We've actually delivered a halving in the number of support tickets we're getting every week. That means we have much more time to build these sorts of solutions and ensure we're delivering value in more sustainable, long-term ways.

»Built Foundations

Next, because we've built all of this, we've actually built our foundations as well. This code isn't going anywhere, and it most certainly is reusable and repurposable. We've built pipelines to deploy Terraform, managed state files, and all sorts of other things.

This isn't going to change overnight, and we can continue to reuse it and make the most of it when the next project comes around — and we go through that cycle of deciding what the next thing is we're going to migrate to infrastructure as code. We'll be ready and prepared when that time comes.

»Security and Velocity Gains

Of course, as we set out to accomplish in the first place, we needed to make sure that the security and velocity gain was very real. We matured our security posture, and we are moving much more quickly. When there's a user that wants an application, they copy and paste something, submit a pull request, and they're ready to go. No more waiting days on a support ticket. And the same applies for every single change that you can imagine. It's now in the hands of the user, and they are empowered to do it when they feel ready.

With all that said, I think I've taken you on a little bit of a journey, and you can see that it's not that easy to do all of this on your own. I hope that you've learned something today, and if ever you can apply it in your own environment, please do. My name is Peter Barta, I'm from NBIM, and you can find me outside to talk to me about the company or what I've done here today.

Thank you so much.
