Case Study

Cloud acceleration using Terraform at Zurich Insurance

See how Zurich Insurance automated its cloud landing zone setup with Terraform Cloud, saving at least 3,000 days of work throughout the program.

»Transcript

Tony Hughes:

Hello again everyone. I'm Tony Hughes, a resident architect with HashiCorp. I've been partnering with Zurich for the last nine months, and I have the pleasure of introducing Eamonn Carey.

Eamonn Carey:

Thanks, Tony. I'm Eamonn Carey. I'm head of Cloud Engineering at Zurich Insurance. We've been working with Tony, or Tony's been working with us, helping us with our Terraform Cloud adoption — and helping us use it to accelerate everything we're doing. I'll run through the first 20 minutes, and I think Tony will go through the more technical bit at the end.

You're going to hear a few keywords throughout this. I know it's a deck about acceleration, but you'll also hear me talk about collaboration, partnership, and empowerment, and they'll run all the way through the deck. That's quite deliberate, even though it's a deck about acceleration.

»The opportunity

I didn't want to start with a problem statement because I see this as an opportunity rather than a problem. First, we have to solve how we automate everything. We have a host of legacy systems we need to migrate into the cloud. So, we have a very ambitious program at the moment to migrate a bunch of applications — hundreds of applications — into the public cloud.

»Automation

We have to automate everything that we bring in. A lot of those are legacy systems. There are skill gaps in the teams managing those applications at the moment. They're not necessarily familiar with infrastructure as code or Ansible or Terraform, or any of those tools. We have to address that as we go along as well.

»Security 

Then there's the fear of the unknown among these teams as well, and that's both normal and perfectly understandable. We need to be concerned about security as we automate everything — how we make it secure and compliant. We are an insurance company at the end of the day. Everything we do needs to be secure, needs to be compliant, and we need to protect customer data. There is a cost factor as well: it can be expensive to automate on the way in, but the benefit far outweighs the cost.

»Consistency

We want to build a globally consistent landing zone. That's key for us because for us to do this at scale and do it fast and accelerate, everything has to be consistent. We have to build uniformity into everything we do. 

»Scalability 

Then there is that scale problem. We are trying to do it at scale. We are going to create thousands of AWS accounts and Azure subscriptions. We can only do this at scale by collaboration with our internal developer teams through inner sourcing.

»Reusability

Then we have the reusability piece. By collaborating, we want to create reusable modules that our developer community can use and leverage throughout. Then everything has to be secure by design. That's one of the key things. We are an insurance company at the end of the day. Security by design is not just for application code. It's for infrastructure as code as well. All this should come together to build transparent compliance — or compliance transparency — for our auditors, regulators, and everybody else.

»Introducing the cloud platform team 

A bit about me and my team: I represent the cloud platform team. We are, as Dave was talking about earlier on, a platform team. But we're not just a platform team. We're a product team. We treat the cloud landing zone as a product and want to run it as a product.

We are part of what we call the Cloud Center of Enablement. We used to call ourselves the Cloud Center of Excellence. We abandoned that. We don't claim to be excellent anymore. We're here to empower and enable everybody else. We're still the CCE, but we are the Cloud Center of Enablement — and as I say, we are a product team. We want to run as a product team. We might be a bit conceited in thinking we are the glue that holds everything together, but there is a little truth in that.

If you want to build anything on public cloud, you have to have a symbiotic relationship with us. But we're not alone. Even within the CCE, we have a couple of other product teams built around database and storage, and data and analytics — and they work very closely with us, helping us build the curated services and patterns that I'll talk about later.

Outside that, we have the other Zurich teams — our security, network, and legacy teams. Last but not least, we also have the customer developer teams, who are important to us. We work very closely with them and want them to work even more closely with us.

»Partnership and collaboration

I won't labor the slide; there's a lot on it, but we are very much aware of the task ahead of us to build at this scale. It's a scale that Zurich hasn't encountered before. We need strategic partners. We've partnered with AWS and with Microsoft through Azure, but we've also partnered with HashiCorp — which is why we're all here today.

We're keen on Terraform because it's platform agnostic and because we want a globally consistent landing zone. Having that platform-agnostic tool was really important to us. But we also have our internal partners, customers, and teams, such as the security team and the network team I was talking about.

»How do we accelerate our customer and application teams? 

Well, collaboration and knowledge sharing — I think I've covered that in the partnership bit. But coming out of that, we want to create reusable modules and create efficiency through them. We want to create modules that we will publish to the Terraform Cloud private registry, where our customers can come and pick them up. I'll talk more about the curated services and patterns in a bit.

We want to enable increased innovation and experimentation. If we can create reusable modules, it frees up time for increased innovation. If we do everything right and we build everything once — measure twice, cut once, I think the expression is — we want to do it right the first time and free everybody up to innovate.

We also need to build quality into everything we do because we want these to be widely deployed. We want our customers to come along and adopt these and deploy them widely. So, we have to get the quality of this right. We are an insurance company at the end of the day. Everything needs to be secure, everything needs to be compliant, and we have to be able to prove that quality as well. All this feeds into the transparency and visibility of everything we want to do. We are heavily audited. We're heavily regulated. We really want to do everything right and be seen to do everything right. 

Last but not least, we want to build a community. We want to leverage the Zurich developer community. If we can build a community and we do everything in full view, our customers are less likely to be impatient with us and are more likely to work with us. We want to adopt and foster a GitOps way of doing things.

»Curated services and patterns

I've mentioned curated services and patterns a couple of times, and I want to talk about what we're building, why we're building it, and how it's accelerating us. We're all familiar with how infrastructure as code orchestrates the underlying infrastructure, and that's fine. We've tried to pull some Terraform modules together into what we would call a curated service. The curated service is very much an opinionated version of what a cloud service would look like.

If it was EC2, it's a very opinionated version of what EC2 is. It encapsulates our internal controls, guardrails, and the things that matter to Zurich. We pull that into a curated service which we publish into a version-controlled Terraform Cloud private registry, which our customers can then use.
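
As a rough sketch of what consuming such a module could look like — the organization, module name, version, and inputs below are hypothetical placeholders, not Zurich's actual module interface:

```hcl
# Hypothetical consumption of a curated service from the Terraform Cloud
# private registry; all names and inputs are placeholders.
module "ec2_curated" {
  source  = "app.terraform.io/zurich-example/ec2-curated/aws"
  version = "~> 1.2"

  # The module exposes only the inputs the platform team chose to expose;
  # guardrails and controls are baked in behind this interface.
  instance_type = "t3.medium"
  ami_id        = "ami-0123456789abcdef0" # consumer still supplies the AMI
  subnet_id     = "subnet-0123456789abcdef0"
}
```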

But we abstract that even further again, and we have curated patterns, which pull together one or more curated services. These are all modules, so they can be consumed from the Terraform Cloud private module registry. A pattern will pull together, say, an EC2 with a load balancer — or an EC2, a load balancer, and a database in the background. Whatever, really — it pulls them all together and abstracts the complexity even more for our customers.

So, when they come and consume the module, the higher up that stack you go, the less they need to know. On a curated service for EC2, you'd have to pass in the AMI ID. Now, we bake our own AMIs, but you still need to know the AMI ID to pass into that module. We've created patterns that call that module. You only need to tell it you want Linux or Windows, and we take care of everything else in the background.
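
A hypothetical pattern-level call, to illustrate how much less the consumer needs to know (names and inputs are invented here):

```hcl
# Hypothetical curated pattern: the consumer states intent ("linux"), and
# the pattern resolves the AMI ID, load balancer, and database internally
# by composing the curated services.
module "web_app" {
  source  = "app.terraform.io/zurich-example/web-app-pattern/aws"
  version = "~> 2.0"

  operating_system = "linux" # the AMI lookup happens inside the pattern
  environment      = "dev"
}
```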

This all drives to reduce time and cost for our customers. It also reduces barriers to entry and reduces the time to innovation — how we deliver features to our ultimate customers in the market. But it also increases team empowerment, and it raises compliance at the birth of the resources themselves. If we build it right and get the quality right, the overall quality goes up, and that will ultimately raise customer sentiment as well.

»A familiar story 

I'll start with a very familiar story. This is how we were doing things maybe a year ago, before we started using Terraform Cloud. It starts with an account request form — a typical way to start. You want to get an AWS account. We have a Jira request form where you would fill in the business unit that wants the account, the cost center or the business unit ID — a whole host of things. A very basic form, but it's a request form.

It then goes for a manual review. There's very little automation in that form, so it has to go for manual review to ensure the data is valid and the business unit is correct — that the business unit does own that cost center. There's a two-day SLA on that. Once that's done, a pipeline task kicks off and creates the AD groups that will manage access to this AWS account. Then another pipeline task kicks off and creates the AWS account itself.

At that point, control is passed back to the customer, and they're free to request access to the AD groups through an internal tool. All the while, we're waiting for those groups to sync to other tools like Okta in our toolchain. Once they've requested access to the AD group, it goes for manager approval, which is another required step. To be honest, I can't ever see us replacing that, and it's very much up to the manager how long that can take — it could take minutes, it could take days. Once that's done, the user gets added to the AD group. But we also pass control back to the customer to fill in another form to request a VPC in the account they just requested, which goes for another manual review, another two-day SLA — and so on. After that, a pipeline task kicks off and they get their VPC.

This, in the best possible scenario — assuming everything kicks off immediately and everybody does what they should — takes about 36 hours. In a normal scenario, it probably takes about three days. And this just gets us one environment. If you want a dev environment, a test environment, and a prod environment, you have to repeat this multiple times. So, definitely not efficient.

»How did we accelerate that with Terraform? 

»Account request form

First, we adopted Account Factory for Terraform (AFT), which is a collaboration between HashiCorp and AWS and is ultimately a replacement for Terraform Landing Zones. Account Factory for Terraform gives us the ability to parallelize everything we do. So, when we request an account, you can request your dev, test, and prod accounts in one request, and you can baseline them in one request.

We changed the request form to be somewhat data-driven. It now calls out to a set of data sources to get valid data, and customers can go in and pick from pick lists of their business unit IDs, cost centers, and everything else. That eliminates a lot of the manual reviews that had to happen at that point.
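
For context, AFT account requests are themselves expressed as Terraform. A minimal sketch following the shape of AWS's public aft-account-request module — all values are placeholders, and one such block per dev/test/prod account lets the requests be handled together:

```hcl
# Sketch of an AFT account request; emails, OU, and tags are placeholders.
module "dev_account" {
  source = "./modules/aft-account-request"

  control_tower_parameters = {
    AccountEmail              = "team-dev@example.com"
    AccountName               = "team-dev"
    ManagedOrganizationalUnit = "Workloads"
    SSOUserEmail              = "owner@example.com"
    SSOUserFirstName          = "Team"
    SSOUserLastName           = "Owner"
  }

  account_tags = {
    "business-unit" = "bu-1234"
    "cost-center"   = "cc-5678"
  }

  change_management_parameters = {
    change_requested_by = "account request form"
    change_reason       = "new landing zone"
  }

  # Points at the customization (baselining) to run after account creation
  account_customizations_name = "baseline"
}
```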

»Provision AWS accounts

It will then kick off a pipeline task that provisions the AWS accounts. So, not just one account now — it'll provision 2, 3, 4, however many accounts the customer requested. It will then kick off another task to baseline those accounts. This is where we add our guardrails, our controls, and everything else into those accounts. When we were doing it the old way using Control Tower, it was subject to a lot of failures, and it would take about six hours to run. This is way more efficient.

»Map AD roles 

We would then map the AD roles the same as before. But at this point, we start using Terraform Cloud and provision Azure DevOps resources. This is something we never did before: we now build our customer teams an Azure DevOps project with a repo inside, so that once we're finished, they're ready to start building their code. And we put infrastructure as code into that repo for them.
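
A minimal sketch of that step with the microsoft/azuredevops provider, assuming provider authentication is already configured; the project and repo names are placeholders:

```hcl
terraform {
  required_providers {
    azuredevops = {
      source = "microsoft/azuredevops"
    }
  }
}

resource "azuredevops_project" "team" {
  name        = "customer-team-landing-zone" # placeholder
  description = "Infrastructure as code for the team's landing zone"
}

resource "azuredevops_git_repository" "iac" {
  project_id = azuredevops_project.team.id
  name       = "infrastructure"

  initialization {
    init_type = "Clean" # start with an empty, initialized repo
  }
}
```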

»Provision Terraform Cloud teams and workspaces 

At this point, we'd also provision the Terraform Cloud teams. We would then provision the Terraform Cloud workspaces, the VPC, and identity and access management into that account and VPC. So, we've gone from three days to get one account to 45 minutes to get as many accounts as you need. That's a huge improvement.
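
A sketch of the team and workspace step with the hashicorp/tfe provider — the organization, project, and team names are placeholders for whatever the pipeline renders per customer team:

```hcl
resource "tfe_project" "team" {
  organization = "zurich-example" # placeholder org
  name         = "customer-team"
}

resource "tfe_team" "developers" {
  organization = "zurich-example"
  name         = "customer-team-developers"
}

resource "tfe_workspace" "dev" {
  organization = "zurich-example"
  name         = "customer-team-dev"
  project_id   = tfe_project.team.id
}

# Grant the team write access to its own workspace
resource "tfe_team_access" "developers_dev" {
  access       = "write"
  team_id      = tfe_team.developers.id
  workspace_id = tfe_workspace.dev.id
}
```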

So, over the program, that will save us about 3,000 days of elapsed time because, by the time we do this for thousands of accounts, that's what it will save — and that's just based on a three-day best-case scenario.

»Compliance and governance 

How are we using it to accelerate our compliance and governance? As the platform product team, we aspire to be a case study in what is right and correct. We want to be seen to be doing the right thing. To be successful, we have to be the center of excellence. There's that excellence word again.

How are we doing this? We use Checkov, an open source tool for scanning your code to make sure it's compliant with a certain standard — we use the CIS benchmark. So, we have Checkov in our IDEs to scan the code as we're writing it and make sure the infrastructure it's going to build is compliant.
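
To make that concrete: Checkov scans plain HCL and flags resources missing CIS-aligned controls. A generic illustration (not Zurich's code) — an S3 bucket with no encryption gets flagged, and adding the encryption configuration satisfies the corresponding check:

```hcl
# Checkov would flag this bucket: no server-side encryption configured.
resource "aws_s3_bucket" "logs" {
  bucket = "example-logs" # placeholder name
}

# Adding default encryption satisfies the CIS-aligned check.
resource "aws_s3_bucket_server_side_encryption_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```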

Then we have a run task as well that integrates with Bridgecrew from Palo Alto, which also uses Checkov in the background. When a customer does a Terraform apply, a run task kicks off in the background and scans that code again using Bridgecrew and Checkov to make sure it's going to be compliant.

Why do we do this twice? Well, this is putting the verify into trust but verify. The Checkov in the IDE depends on developer behavior — on them actually using it and reacting to what it's telling them. But you cannot apply code into production without going through that run task, which will actually scan that code against the CIS benchmark to make sure it is correct.
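
The wiring for a mandatory run task can itself be managed with the tfe provider. A sketch — the endpoint URL and names are placeholders, and the HMAC key is omitted:

```hcl
data "tfe_workspace" "dev" {
  organization = "zurich-example" # placeholder
  name         = "customer-team-dev"
}

resource "tfe_organization_run_task" "bridgecrew" {
  organization = "zurich-example"
  name         = "bridgecrew-scan"
  url          = "https://example.invalid/bridgecrew-run-task" # placeholder
  enabled      = true
}

resource "tfe_workspace_run_task" "enforce_scan" {
  workspace_id = data.tfe_workspace.dev.id
  task_id      = tfe_organization_run_task.bridgecrew.id

  # "mandatory" blocks the apply when the scan fails
  enforcement_level = "mandatory"
}
```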

We're also starting to use Sentinel policies. We're only starting on that journey — it's one of the things Tony's helping us with, and one of the reasons we're keen on Terraform Cloud. We want to leverage policy as code to build everything we want to build. That's key for us as well.
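
Attaching a VCS-backed Sentinel policy set is also expressible in HCL. A sketch, with placeholder repo and organization names:

```hcl
variable "oauth_token_id" {
  type        = string
  description = "OAuth token ID for the VCS connection"
}

resource "tfe_policy_set" "guardrails" {
  organization = "zurich-example" # placeholder
  name         = "security-guardrails"
  kind         = "sentinel"
  global       = true # apply to every workspace in the organization

  vcs_repo {
    identifier     = "zurich-example/sentinel-policies" # placeholder repo
    branch         = "main"
    oauth_token_id = var.oauth_token_id
  }
}
```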

We're also using dynamic provider credentials — or just-in-time credentials — which we were talking about this morning. At this point, I will pass over to Tony, who's going to get into the weeds a little bit on the technical side of things. Tony, I will pass to you.

Tony Hughes:

Thank you, Eamonn. Hello again. Everyone awake? Excellent. Well, I don't want you to fall asleep, because you may have a nightmare of a developer committing credentials to source code or something like that. But fret not: dynamic provider credentials may be able to help with that.

»Dynamic provider credentials

In the context of Zurich and AWS — I know we have a talk on this later, and you may have heard a lot about dynamic credentials already — I'll keep this brief. We had hardcoded, long-lived credentials — IAM key pairs in this case — sitting in the workspaces, maybe passed off by security teams, maybe deployed by the developers themselves. As you can see, that's high risk. You have high rotation requirements, and you have a lot of oversight to do there as a security team or a cloud team member.

With dynamic credentials, we're able to deploy in AWS an OIDC provider, an IAM role, and the policy. The trust between them can be scoped to the particular project you want to allow to assume the role, to the workspace itself, and even down to the plan and apply phases of that workspace's run.
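
A sketch of the AWS side, based on the documented trust setup for Terraform Cloud dynamic credentials: the audience claim is aws.workload.identity, and the subject claim encodes organization, project, workspace, and run phase. Names here are placeholders, and the thumbprint must be the current one for app.terraform.io:

```hcl
variable "tfc_thumbprint" {
  type        = string
  description = "TLS certificate thumbprint for app.terraform.io"
}

resource "aws_iam_openid_connect_provider" "tfc" {
  url             = "https://app.terraform.io"
  client_id_list  = ["aws.workload.identity"]
  thumbprint_list = [var.tfc_thumbprint]
}

resource "aws_iam_role" "tfc_workspace" {
  name = "tfc-customer-team-dev" # placeholder

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.tfc.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "app.terraform.io:aud" = "aws.workload.identity"
        }
        StringLike = {
          # Scoped to org, project, and workspace; run_phase can pin the
          # trust further to just "plan" or just "apply".
          "app.terraform.io:sub" = "organization:zurich-example:project:landing-zone:workspace:customer-team-dev:run_phase:*"
        }
      }
    }]
  })
}
```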

I won't belabor this too much, but it's helped a lot with that risk. We don't have any rotation needs anymore. We don't have to worry about developers asking for these credentials as they get their accounts. Once they're vended an account through that pipeline, they have automatic access to both Terraform Cloud and AWS — their cloud account.

»Day 2 management

I know that, paradoxically, in a slide about acceleration, I'm going to slow down and think about Day 2 management. When I onboarded with Zurich, I found they had a platform team established, empowered, and able to make the decisions needed — which is generally an uphill battle, so I was ecstatic about that.

However, other teams outside of that weren't considering how to scale Day 2 management of Terraform Cloud, and how to scale to multi-cloud, when using this product. I proposed — and you'll see here a piece of the pipeline that Eamonn was showing earlier — a GitOps flow that used VCS-driven Terraform Cloud workspaces to bootstrap the Terraform Cloud platform itself. This may be familiar to some of you.

The flow I would recommend: their platform team was already familiar with AWS Account Factory for Terraform and its use of Jinja templates, so it was an easy walk to apply those same flows to Terraform Cloud — have the pipeline render Terraform HCL itself, and empower the developers to manage these workspaces on Day 2, going in and updating and configuring as they go through source control. That shifts security left to the VCS provider and alleviates the burden on the platform team, which is my goal: make it scale and keep it from being an uphill battle for them.

»What it takes to build out these accounts

You build out your Azure DevOps resources — the repos, the projects — and, for Terraform Cloud, the teams and the workspaces for them to use. Not pictured: the AD groups and such, as well as the VPC and IAM. And in this case, we added in dynamic credentials, so they can instantly authenticate and have that access I mentioned before. Each of these BUs we're talking about has up to thousands of applications at this point. Each one would need all of these for each account — multiple dev, test, and prod, like you'd expect.

Let's dig into what that looks like in the GitOps flow in the pipeline here. We start with the pipeline task that Eamonn mentioned — whether it's triggered from a Jira ticket, ServiceNow, or various other inputs. That pipes into a Python script, which kicks off and populates the Jinja templates, which render the HCL files I'm mentioning here. The beauty of this is that these are spat out into configuration repos for each landing zone — so as a developer, you can go and update the Terraform version yourself.

Moving on, you'll see the config repo here. Once these files are generated and placed in these repos, they trigger the VCS-driven Terraform Cloud workspaces, which gives you drift detection and all the other features you'd like to see from Terraform Cloud — for your actual Terraform Cloud configuration.
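
A sketch of the kind of file the pipeline might render into a landing zone's config repo — a VCS-driven workspace definition applied by the bootstrap workspace. Names, the version, and the token variable are placeholders:

```hcl
variable "oauth_token_id" {
  type        = string
  description = "OAuth token ID for the VCS connection"
}

# A workspace-that-manages-workspaces applies files like this one. Because
# the file lives in a VCS-backed repo, developers update it via pull request.
resource "tfe_workspace" "app_dev" {
  organization      = "zurich-example" # placeholder
  name              = "app-1234-dev"
  terraform_version = "1.5.0"

  vcs_repo {
    identifier     = "zurich-example/app-1234-infra" # placeholder repo
    branch         = "main"
    oauth_token_id = var.oauth_token_id
  }
}
```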

Then the pipeline task waits for that to complete and moves on. Let's dig into what the file structure looks like. Coming from your pipeline, you spit out these Jinja templates, mimicking what you'd see in account creation in AWS Account Factory for Terraform — which the platform engineers liked.

We generate these Terraform HCL files, which then go into your hub repos for each landing zone, separated by environment type. You have security, isolation, and RBAC taken care of at the VCS level. Great for the security team — they really celebrate that.

Let's dig into my favorite, a little bit of Terraform code itself — a little HCL — to show you what will inevitably change here. Ultimately, I would say the Terraform version is the most important, now being 1.5. You may want to upgrade your Terraform version to take advantage of the import feature.
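
For reference, Terraform 1.5's import feature lets you adopt existing resources declaratively instead of via the terraform import CLI command. A generic sketch with a placeholder bucket:

```hcl
# Config-driven import (Terraform >= 1.5): the next plan/apply adopts the
# existing bucket into state instead of creating a new one.
import {
  to = aws_s3_bucket.logs
  id = "example-existing-bucket" # placeholder bucket name
}

resource "aws_s3_bucket" "logs" {
  bucket = "example-existing-bucket"
}
```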

One other piece here that really shines: say I'm an app developer trying to understand — well, hey, where do I have to go? Which VCS repo do I need to configure this workspace? I can make these changes programmatically through a pull request, satisfy the platform team by having it filtered to them, and not be blocked until that pull request has merged into the source repo.

In the top right of your Terraform Enterprise workspace, you can press through to that VCS repo and have your arguments there to configure and change. You may have a developer who needs to change the branch or the VCS repo it's sourced from. Then, as we mentioned, Terraform dynamic provider credentials here are as simple as adding three variables in your Terraform Cloud workspace for the particular role you're going to assume.
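
The talk doesn't list the three variables, but one reading — an assumption on my part, following the documented names for AWS dynamic credentials — is the auth flag plus separate plan and apply role ARNs, which can themselves be set through the tfe provider:

```hcl
data "tfe_workspace" "dev" {
  organization = "zurich-example" # placeholder
  name         = "customer-team-dev"
}

# Documented environment variables for AWS dynamic provider credentials.
resource "tfe_variable" "auth" {
  workspace_id = data.tfe_workspace.dev.id
  category     = "env"
  key          = "TFC_AWS_PROVIDER_AUTH"
  value        = "true"
}

resource "tfe_variable" "plan_role" {
  workspace_id = data.tfe_workspace.dev.id
  category     = "env"
  key          = "TFC_AWS_PLAN_ROLE_ARN"
  value        = "arn:aws:iam::123456789012:role/tfc-plan" # placeholder
}

resource "tfe_variable" "apply_role" {
  workspace_id = data.tfe_workspace.dev.id
  category     = "env"
  key          = "TFC_AWS_APPLY_ROLE_ARN"
  value        = "arn:aws:iam::123456789012:role/tfc-apply" # placeholder
}
```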

This is where it really shines: when you need to make these updates at the scale of upwards of 10,000 workspaces, it becomes a great, scalable solution. I'm excited that we were able to build this out, that Eamonn's team was able to take it on, and that they enjoyed it. And on that, let me bring back Eamonn Carey to close this out.

Eamonn Carey:

Thanks, Tony. That's gone down really well with our developer community — really well.

»Conclusions

I'm going to close on this slide, and I'm not going to labor it too much, but it pulls together a lot of what we've already discussed: how we're leveraging Terraform Cloud to accelerate our cloud adoption, and how it's bringing consistency to everything we do — through how we use versioning, state management, policy enforcement, config as code, and many other things.

But it's also bringing together the abstraction we're trying to do by making it easier for our developer community to start building applications — and how we're abstracting that through reusable modules and our curated services and patterns.

At the moment, they can get an account within 45 minutes. If they're using the curated services and patterns, they can have the code in their project repo and start building modules or services through the curated services — and have something up and running within two hours. From nothing to something running in two hours, which is pretty good.

But we're also doing the security and compliance piece, leveraging Terraform Cloud to enhance our overall security posture through policy as code, run tasks, and everything else. That's the only way we can actually accelerate and get this to work — tools are one thing, but we have to leverage the developer community within Zurich.

So partnership and collaboration are the foundation of everything we do here. Partnering with our CSPs, partnering with HashiCorp — it's the way it works. I believe the word "Hashi" means bridge in Japanese. I'd like to think Terraform Cloud is acting as a bridge between all this and helping pull it all together.

That's it, I think. Thank you very much. If you see either of us around, feel free to ask questions. We'll be around for the afternoon. Thank you very much.
