Case Study

IAM and Compliance Automation at Red Ventures with Terraform

Learn how Red Ventures built their own Terraform provider to automate access management, meeting the requirements of various compliance frameworks while enabling self-service and solving disparate usage patterns across several different industry groups.

Red Ventures has also written a broader case study on its Terraform usage, which you can find on our case studies homepage.

Transcript

Mike Saraf:

Hi, my name is Mike Saraf. I'm a Senior Platform Engineer at Red Ventures. You may not have heard of Red Ventures, but you probably know some of our brands: Bankrate, The Points Guy, GameSpot, CNET, Healthline, Lonely Planet, and a ton more. I tell you that not just to introduce you to Red Ventures but to give you some context for this talk.

Red Ventures started off with a simple business model. We partnered with companies and helped them market their products online. But as Red Ventures grew, we knew we wanted to start doing this for our own brands. Some brands we would build in-house and some we would acquire. 

Before we knew it, over a few years, we had a large portfolio of companies, nearly all of which used Amazon Web Services (AWS). Consequently, we ended up with just as many ways of accessing AWS. This left us with a number of questions and a ton of challenges.

First, this was becoming a terrible experience for our engineers. When a new engineer onboarded at one of our companies, or when an engineer moved between business groups, there would be lots of questions. It was also a security nightmare: there were now multiple ways to access AWS, and so multiple endpoints for the security team to audit.

It also created compliance issues. When an engineer would come to us and say, "Hey, what accounts do I have access to?" it was always a well-it-depends type of answer: "Are you working on the health vertical or the education vertical? Well, you probably need the AWS tile. Oh, it's on the health vertical. Well, you need the Healthline AWS tile. Oh, you're an admin. Well, now you need the admin tile."

That was just figuring out how to get into an account. Then people would ask, "How do I get access to the developer role?" And I would say, "Well, that's easy. Go into the HashiCorp Terraform for your account. Add yourself to the developer role. Oh, you need extra access. Oh, you just have to go in. You don't do Terraform for that. You go into SailPoint and request access there. Oh, you're an admin for that team? Oh, you need a different tile, and you need to have somebody add you by hand to that role."

You can imagine it was confusing, and this wasn't anyone's fault. This was a natural result of rapid growth. But we came to a point where we needed to step back and say, "Hey, we need to redesign this system and solve not just for existing problems but for the challenges we're going to face in the future as we continue to grow."

Our Goals

Three of us from our central corporate support group formed a team. We researched options. We talked to a ton of engineers in different vertical groups and came up with some product requirements. Overall, we wanted a better user experience for our engineers.

We also wanted a stronger and more concise security posture. We wanted improved visibility for compliance, and we wanted access to remain fast and easy, so we knew we wanted self-service for people to request access. We also knew we needed to integrate with our array of third-party tooling, some of it for IAM and some of it ServiceNow.

High-Level Design

At a high level, we have a system that has a custom Terraform Provider and an API that it calls — and that API then orchestrates the integration, not just with the configuration but with all our third-party services. From here, I'm going to hand it over to Austin to dive into the details.

Austin Burdine:

Like Mike said, my name is Austin Burdine. I am a staff engineer on our corporate technology team, and I'm going to walk through how we approached the problem of integrating this solution and how we decided on our final implementation. 

Concepts and Terminology

Before I get into the technical details, I need to go through a few concepts so everybody's familiar with the terminology I use later on. IAM (identity and access management) roles in AWS live in each account and dictate who can take which actions on which resources. On top of that, you have permission sets. This is a concept in AWS Single Sign-On, the service that forms the backbone of our implementation. You can think of permission sets as blueprints for IAM roles: when you assign a permission set to an account, it creates an IAM role in that account using the specification in that permission set.

We also have AWS accounts and AWS Organizations. AWS accounts are the top-level organizational group for the resources of a particular project. Organizations is a layer on top of that, which organizes accounts into groups based on how your organization as a whole is structured.

We also have Active Directory groups. They're pretty self-explanatory: they're groups of users that we assign to other things to dictate who can access what. We also have ServiceNow, which, as Mike mentioned, is what we use for our self-service platform. It covers how users submit tickets, how those tickets get approved and by whom, and how the final actions to add people to the right groups are taken.

Steps to Set Up a New Role

Here's our flow for setting up a new role with all of these concepts. Don't worry about remembering all of this because there's a lot. First, say we want to create an admin role that people can access in an account. We have to first create the permission set for it. Then we have to create an IAM policy for admin and then attach that to the permission set.

Then we have to assign the permission set to an AWS account to make sure the role gets created. Then we have to create an Active Directory (AD) group for that permission set in that account to make sure people getting added to that AD group get added to the role in the account. Then we have to create an access profile and identity for that group so people can get assigned to that group. Then we have to take the access profile and all of the other information and put it into ServiceNow for approvals. Then finally an end-user can go and request it in ServiceNow. If that seems like a lot, you are not wrong. 
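The ordering of those steps matters, since each one depends on the one before it. As a rough illustration only (not the team's actual code), here is a minimal Go sketch of running the automated steps in sequence and stopping at the first failure. The `Step` type, `RunSteps`, and the `noop` placeholder are hypothetical names, and the real integration calls are left out.

```go
package main

import "fmt"

// Step is one stage of the role-setup flow. In a real pipeline, Run would
// call the relevant system (AWS, Active Directory, ServiceNow, and so on).
type Step struct {
	Name string
	Run  func() error
}

// RunSteps executes the steps in order and stops at the first failure.
// It returns the names of the steps that completed.
func RunSteps(steps []Step) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := s.Run(); err != nil {
			return done, fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		done = append(done, s.Name)
	}
	return done, nil
}

// noop is a placeholder for a real integration call.
func noop() error { return nil }

// RoleSetupSteps lists the automated stages described in the talk, in order.
// The final stage, an end user requesting access, is a human action.
func RoleSetupSteps() []Step {
	return []Step{
		{"create permission set", noop},
		{"create IAM policy and attach it to the permission set", noop},
		{"assign permission set to the AWS account", noop},
		{"create AD group for the permission set and account", noop},
		{"create access profile for the AD group", noop},
		{"register access profile in ServiceNow for approvals", noop},
	}
}

func main() {
	done, err := RunSteps(RoleSetupSteps())
	for i, name := range done {
		fmt.Printf("%d. %s\n", i+1, name)
	}
	if err != nil {
		fmt.Println("stopped:", err)
	}
}
```

Even in this toy form, it is easy to see why doing six dependent steps by hand for every role in every account would not scale.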

Sometimes it feels a bit crazy trying to connect all these things together in a sane fashion. And as such, our initial approach was, “OK there are a lot of things together — what if we made a human do all this?” No. We ultimately decided early on that was not going to work at the scale that we have at Red Ventures. Manual work was not going to cut it, so we needed some automation. 

We turned to our preferred infrastructure as code provider, Terraform. This seemed like a problem that Terraform could solve for us. We thought, “How would we design this using Terraform?”

Pros and Cons of Stock Terraform Providers

Our first approach looked at the stock Terraform providers that you can download from HashiCorp without writing any custom code. Those have a lot of benefits, but they also have a number of drawbacks. One benefit is that a lot of them work right out of the box. You don't have to do anything other than some configuration and some credentials, and you're good to go.

They integrate directly with the various resources. The AWS Terraform provider talks directly to the AWS accounts and provisions resources directly. There's also an Active Directory provider that talks directly to Active Directory and a number of other providers. 

All of the providers also have validation. So you can say these are the values you need for these inputs and ensure, with a reasonable guarantee, that those values will be accepted by the various integrations like AWS, ServiceNow, etc.

However, the drawbacks were that the AWS provider — and a number of the other providers — are a little bit too granular in how they're constructed. To be fair, this is also on AWS' end in that you can't create resources for more than one account — or region — at a time easily.

Also, it requires lots of credentials. The single sign-on service is integrated directly into the organization management account in AWS. This means the credentials need a lot of extra access, and Terraform itself needs direct access to that account to do all of those things. We would need to set up additional credential management, and that was going to be a lot of pain.

There also are not necessarily enough stock providers to do what we needed to do. There might be one for IdentityNow. I didn't check. There is a community one for ServiceNow. But our ServiceNow setup is a bit custom, so it may not have worked right out of the box.

We also took this approach on a previous project that did some private networking connectivity between accounts using the stock AWS provider. It very quickly became untenable. Plans would take 30 minutes to run, if they planned successfully at all. It very frequently ran into rate limits. Ultimately we didn't want a repeat of that because it quickly became a pain for most people that worked with it.

Have We Already Invented This Wheel?

Given that the stock providers aren't going to work, we looked at other products at Red Ventures that used Terraform that solved similar problems or had to address similar things. We came upon a project that automates a lot of our transit gateway connectivity between AWS accounts. 

At a high level, the way that flow works is you have a GitHub repo and you make a pull request to it. That pull request uses the stock AWS Terraform provider to upload a JSON specification to an S3 bucket, which acts as a pseudo-API. Then an AWS Lambda responds to that upload and does things behind the scenes.

We thought this approach might work well because it is a lot faster, and it has benefits that counteract some of the drawbacks of the previous implementation. It was a lot more self-service in that it didn't require a ton of Terraform knowledge or a lot of interconnecting services. It could connect things across accounts easily. Terraform never talked directly to multiple AWS accounts; it only talked to the one S3 bucket.

But the drawbacks were that validation was tricky. Since it was a JSON file, it became a lot harder to validate certain inputs. Also, approvals were tricky because everything was in one repo, and all of that had to be approved by a single team, which led to bottlenecks when people needed changes. 

Also, the feedback loop wasn't great because all of this work was happening behind the scenes. Very infrequently would developers know if their changes had been applied successfully internally because all of it happened after the S3 object got uploaded.
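To make the validation drawback concrete: with a raw JSON file, every check has to be written by hand, and it only runs once something actually processes the file. The Go sketch below shows what that hand-written validation might look like. The `AttachmentSpec` type and its field names are invented for illustration and are not taken from the Red Ventures project; the only outside fact relied on is that AWS account IDs are 12 digits.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"regexp"
)

// AttachmentSpec is a hypothetical JSON document of the kind a pull request
// might upload to the S3 bucket. The field names are illustrative only.
type AttachmentSpec struct {
	AccountID string `json:"account_id"`
	Region    string `json:"region"`
}

// AWS account IDs are 12-digit numbers.
var accountIDPattern = regexp.MustCompile(`^[0-9]{12}$`)

// ParseSpec decodes the JSON strictly, so misspelled keys are rejected,
// and then applies checks a Terraform schema would otherwise provide.
func ParseSpec(raw []byte) (AttachmentSpec, error) {
	var spec AttachmentSpec
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&spec); err != nil {
		return spec, fmt.Errorf("invalid JSON spec: %w", err)
	}
	if !accountIDPattern.MatchString(spec.AccountID) {
		return spec, fmt.Errorf("account_id %q must be 12 digits", spec.AccountID)
	}
	if spec.Region == "" {
		return spec, fmt.Errorf("region is required")
	}
	return spec, nil
}

func main() {
	good := []byte(`{"account_id":"123456789012","region":"us-east-1"}`)
	typo := []byte(`{"acount_id":"123456789012","region":"us-east-1"}`)

	if _, err := ParseSpec(good); err == nil {
		fmt.Println("good spec accepted")
	}
	if _, err := ParseSpec(typo); err != nil {
		fmt.Println("typo rejected:", err)
	}
}
```

In a flow like the one described, checks of this kind would run in the Lambda after the upload, which is part of why errors surfaced late or not at all.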

Let’s Write Our Own Terraform Provider

After looking at both of these solutions, we decided, well, what if we could get the best of both worlds? That led us to, "Why don't we write our own Terraform provider?" I'd done that before. I had done work with the AWS Terraform provider, and we are a Go shop normally, so Terraform providers seemed like a natural thing to write.

Our final design, like Mike mentioned before, is a GitHub pull request that uses Terraform Cloud with our custom provider to talk to our custom API. The API orchestrates a Step Functions workflow behind the scenes to talk to all of the different third parties that we needed to talk to.

I mentioned before that the stock providers are too granular and only operate on specific accounts. Let's talk about that for a minute. Red Ventures has more than 400 AWS accounts, and that number continues to grow, which is insane. 

We were trying to figure out the issue of assigning permission sets to specific accounts because that has to be set somewhere. We didn't want to have to maintain a hardcoded list of 400 account IDs because that's a ton — and would require manual adding and removal when accounts get destroyed or created. We came up with a solution of dynamic account filters. This allowed us to set ranges of selection for multiple accounts — and it made it a lot easier. 

To give a little demo of what that looks like, say we have a handful of accounts and a filter. Each filter has a field and then a number of values. Those use boolean logic behind the scenes to determine which accounts get included or not. You can filter by account ID, which makes a lot of sense; that's the basics.

You can do multiple account IDs. That works well. You can also do account name, which supports pattern matching. You can do things like get all production accounts based on the account name. That works well. You can do it based on the account tags — we use those to note things like the business vertical that they're under. You can also use the organizational unit IDs. They’re useful for, one, applying things across the whole organization, but two, doing things more selectively. 

If we have an org unit for a specific compliance framework like HIPAA, then we can select all of the accounts within that particular unit. There are also exclude filters. You can then get more fancy with your include and exclude logic to include a range of accounts except for a specific one. 
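The filter behavior described above can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the provider's real implementation: the `Account`, `Filter`, and `Select` names, the `tag:<key>` field syntax, the glob-style name patterns, and the rule that an account is selected when it matches any include filter and no exclude filter are all guesses made for the example.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Account holds the attributes the talk says filters can match on.
type Account struct {
	ID   string
	Name string
	Tags map[string]string
	OU   string // organizational unit ID
}

// Filter matches accounts whose Field equals any of Values.
// Field is "id", "name", "ou", or "tag:<key>" (hypothetical syntax).
type Filter struct {
	Field  string
	Values []string
}

// Matches reports whether the account satisfies the filter. Names are
// compared as glob patterns; all other fields are compared exactly.
func (f Filter) Matches(a Account) bool {
	for _, v := range f.Values {
		switch {
		case f.Field == "id" && a.ID == v:
			return true
		case f.Field == "ou" && a.OU == v:
			return true
		case f.Field == "name":
			if ok, _ := path.Match(v, a.Name); ok {
				return true
			}
		case strings.HasPrefix(f.Field, "tag:"):
			if a.Tags[strings.TrimPrefix(f.Field, "tag:")] == v {
				return true
			}
		}
	}
	return false
}

func anyMatch(filters []Filter, a Account) bool {
	for _, f := range filters {
		if f.Matches(a) {
			return true
		}
	}
	return false
}

// Select returns the IDs of accounts matched by at least one include
// filter and by no exclude filter (an assumed combination rule).
func Select(accounts []Account, include, exclude []Filter) []string {
	var ids []string
	for _, a := range accounts {
		if anyMatch(include, a) && !anyMatch(exclude, a) {
			ids = append(ids, a.ID)
		}
	}
	return ids
}

// SampleAccounts returns made-up accounts for the demo.
func SampleAccounts() []Account {
	return []Account{
		{"111111111111", "health-prod", map[string]string{"vertical": "health"}, "ou-hipaa"},
		{"222222222222", "health-dev", map[string]string{"vertical": "health"}, "ou-hipaa"},
		{"333333333333", "finance-prod", map[string]string{"vertical": "finance"}, "ou-general"},
	}
}

func main() {
	// Everything in the HIPAA org unit except one specific account.
	include := []Filter{{"ou", []string{"ou-hipaa"}}}
	exclude := []Filter{{"id", []string{"222222222222"}}}
	fmt.Println(Select(SampleAccounts(), include, exclude))
}
```

Because the filters are evaluated against the current account list each time, newly created accounts that match a pattern, tag, or org unit are picked up without anyone editing a hardcoded list of IDs.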

We figured that part out, and then we started looking at how do we make this self-service. Because the problem with the other project that we mentioned before is all requests had to go through a central team, which led to a lot of bottlenecks.

We figured out how to set up delegated management of resources in Terraform. This uses, at its core, the GitHub feature called CODEOWNERS, which allows you to require approvals for certain files to go to a certain team. 

We were able to use Terraform provider configurations, and — in the provider config — set things like: Here's the account filters that are predefined in this provider configuration. Here's a prefix, so that names in one configuration don't conflict with names in another. 

And then, we were able to pass that provider config to a sub-folder, which we then set up in CODEOWNERS to go to a separate team for approval. That way, teams can somewhat choose their own destiny with their specific range of accounts. They aren't blocked by us having to approve all their stuff. That's worked well so far.
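Two ideas carry the delegation model here: path-based ownership and per-team name prefixes. The Go sketch below imitates both in a simplified form. The team handles, folder names, and the `OwnerRule` and `ProviderConfig` types are hypothetical. Real CODEOWNERS files use gitignore-style patterns rather than plain prefixes, but they share the property shown here that the last matching rule wins.

```go
package main

import (
	"fmt"
	"strings"
)

// OwnerRule maps a directory prefix to the team that must approve changes,
// loosely mirroring one line of a CODEOWNERS file.
type OwnerRule struct {
	PathPrefix string
	Team       string
}

// OwnerFor returns the approving team for a changed file. The last matching
// rule wins, so specific folders are listed after general ones.
func OwnerFor(rules []OwnerRule, file string) string {
	owner := ""
	for _, r := range rules {
		if strings.HasPrefix(file, r.PathPrefix) {
			owner = r.Team
		}
	}
	return owner
}

// ProviderConfig is a hypothetical per-team provider configuration.
type ProviderConfig struct {
	NamePrefix string
}

// QualifiedName applies the team's prefix so that names defined in one
// configuration cannot collide with names defined in another.
func (c ProviderConfig) QualifiedName(name string) string {
	return c.NamePrefix + "-" + name
}

func main() {
	rules := []OwnerRule{
		{"", "@example/platform"},                       // default owner
		{"teams/data-science/", "@example/data-science"}, // delegated folder
	}
	fmt.Println(OwnerFor(rules, "standard/roles.tf"))
	fmt.Println(OwnerFor(rules, "teams/data-science/roles.tf"))

	cfg := ProviderConfig{NamePrefix: "ds"}
	fmt.Println(cfg.QualifiedName("admin"))
}
```

The effect is that a change inside a team's folder only needs that team's approval, while the prefix keeps their role names from clashing with anyone else's.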

Lastly, I'm going to turn it over to Yates to talk about how we solved some of the security and compliance issues with this approach as well.

Yates Spearman:

Thank you, Austin. As he said, my name is Yates Spearman. I am a platform engineer who's been working on this project with Austin and Mike. Apologies in advance if I doze off during my own talk. I've got a newborn at home, and sleep deprivation is a very real thing in my life right now. 

Security Challenges

I'm going to talk through some of the security and compliance hurdles we faced when building out this solution for Red Ventures. Because what we were building was a one-size-fits-all solution that had the ability to grant access to the keys to the kingdom, there were a couple of big questions we had to wrestle with early on. The primary one was: how can we even delegate access at the scale that we're at?

With us being a team of three people working across several hundred AWS accounts, there's no easy way for us to know who should have access to a particular account. Or even for us to know who should know who should have access to a particular account.

In trying to answer this question, among others, we started to meet with the different teams that would be customers of this product — different platform engineering teams, business verticals, individuals responsible for several AWS accounts — and started to talk through what this solution needed to do. 

One benefit that came early on was realizing that, because Red Ventures maintains a concept of account ownership (essentially the individual who is responsible for an AWS account from a security perspective), we could lean on an existing table of these mappings as part of our approval process for the roles that we defined. In meeting with the teams, we made sure that the data we had was fresh, since it would essentially be what gate-kept access.

We were also able to define a standard set of roles that would solve 95% of our use cases at Red Ventures and an approval flow for those roles. We created a few roles, ranging from read-only and support access all the way up to administrator access — and an approval flow for those standard roles that started with an individual's manager. This made sure they're not clicking buttons trying to get access to random things, which happens more often than we'd like to admit. 

Then it went to that account owner because they were an individual who should have the context to know who should have what level of access in an account. And for elevated roles — such as the administrator role — it also went to the tech VP aligned to that account, adding another barrier of access to those elevated roles and ensuring we maintain a minimally permissive model wherever possible.
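The standard approval flow just described is simple enough to express directly. Here is a minimal Go sketch, assuming hypothetical role names and a hypothetical `Request` type; the only behavior taken from the talk is the ordering (manager, then account owner) and the extra tech VP approval for elevated roles such as administrator.

```go
package main

import "fmt"

// Request describes an access request for a standard role. The field
// names are illustrative; the people would come from existing records.
type Request struct {
	Role         string
	Manager      string
	AccountOwner string
	TechVP       string
}

// elevatedRoles need the additional tech VP approval. The talk names
// administrator as one such role; the key spelling here is an assumption.
var elevatedRoles = map[string]bool{"administrator": true}

// ApprovalChain returns the approvers in the order they must sign off:
// the requester's manager, then the account owner, then the tech VP
// when the role is elevated.
func ApprovalChain(r Request) []string {
	chain := []string{r.Manager, r.AccountOwner}
	if elevatedRoles[r.Role] {
		chain = append(chain, r.TechVP)
	}
	return chain
}

func main() {
	req := Request{
		Role:         "administrator",
		Manager:      "manager@example.com",
		AccountOwner: "owner@example.com",
		TechVP:       "vp@example.com",
	}
	fmt.Println(ApprovalChain(req))

	req.Role = "read-only"
	fmt.Println(ApprovalChain(req))
}
```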

In these conversations we also realized that while we could solve most people's use cases with these standard roles, we needed our solution to be extensible for different teams that these standard roles didn't work for. Things like our data science team that do things across the org, our security ops teams, and different compliance-related teams that these standard roles just weren't appropriate for. 

Like Austin touched on, we were able to delegate these people their own folder in our automation code base and allow them to define their own custom roles to be applied in their accounts and a new custom approval workflow using any combination of individual AD users, AD groups, an individual's manager, account owners, and tech VPs. This gives them the flexibility and granularity to define whatever they need for their environment.

Through the use of CODEOWNERS, we could also make sure we maintained separation of concerns: they could do what they needed for their environment, while we could work on maintaining those standard roles to solve most people's use cases.

Compliance Challenges

There were also a few compliance challenges we ran into when building out this solution. With Red Ventures as large as it is, we operate across several compliance frameworks already, such as PCI, HIPAA, and LGPD in Brazil. And the company is very growth-oriented, so anytime we come into the office, there's a very distinct possibility that we'll be told, "Hey, we work in this highly regulated industry now." We wanted to be as forward-facing as possible in designing the solution, so we didn't have to make any large adjustments later on.

We started meeting with the developers on these teams and their compliance teams as well, gathering requirements and making sure that, in our initial phase of building out the solution, we met all the needs they had. We built out different logging solutions, maintained separation of concerns, and designed this with auditability in mind. We built in real-time audits for user and account access and made them available via Slack.

As a group of individuals who have been through audits before, we know there is a lot of data to gather, and the more easily and readily available that data is, the happier your compliance team is.

These teams also had a lot of existing workflows for getting access to their regulated environments. We needed to make sure our solution integrated with those out of the box so that it would be a drop-in replacement for them — and when it came time to roll this out across the organization, there was as little resistance as possible. 

Lessons Learned 

There were a few things we learned along the way from taking this approach. One is that because we engaged with our customers early and often, we were able to maintain a very consistent vision for what this product would look like throughout the whole implementation phase.

Obviously, a few things changed here and there, but from the beginning, we knew what this product needed to look like and what it needed to do. It made it easy to estimate how much time it would take us to build it out.

We learned that by solving our edge cases first and working with those compliance teams, when it came time to solve the 95% of use cases, a lot of the functionality we needed was already there. By solving those edge cases first, we were able to build out a more well-rounded solution.

We learned that while people say “don't reinvent the wheel,” sometimes there's a good reason to do so. In this case, we thought we had a good reason, so that's why we built our own Terraform provider for this. 

If you'd like to contact any of us after the conference to chat about this, we love to talk about what we do from an engineering perspective. Enjoy the rest of your HashiConf. Thank you.
