Case Study

Automating PKI: Paddy Power Betfair's Journey with Vault

Paddy Power Betfair has a hybrid cloud environment with 21,000 VMs on-prem, a growing public cloud estate, and mostly Java microservices that require TLS cert provisioning. See how HashiCorp Vault became the core for their PKI solution.

Paddy Power Betfair operates in a hybrid cloud environment. They have around 21,000 VMs in their on-premises environment, and their public cloud environment is growing. Each of their mostly Java microservices requires TLS certificate provisioning. As they transitioned to this hybrid cloud state, their challenge was to find a way to issue a massive number of TLS certificates and apply them to such a large estate.

What You'll Learn

In this talk, Cristian Iaroi from Paddy Power Betfair will show how their journey to PKI progressed and share a broad process that you can apply to your own PKI journey. This model will help you determine how much you're paying for a certificate and how you can move PKI from an infeasible manual process to a scalable automated one with tools like HashiCorp Vault. You'll also see how much money they saved and how much money you might save.

Further Reading

For a complete guide to achieving your own modern PKI transformation, read our solutions engineering guide: Modern-Day PKI Management with HashiCorp Vault

Transcript

Hi everyone. Hope you're enjoying the conference so far. I know I am. It's amazing to be live and in person at a conference again. And let me quickly get some info about myself out of the way.

 

My name is Cristian Iaroi. As you know, I'm a Principal DevOps Engineer with Paddy Power Betfair, and my team and I take care of the CI/CD tooling that deploys the developers' code, all the way from version control to our on-prem or AWS estate.

Flutter, the company I work for, is a global sports betting and gaming provider for over 18 million customers, and the parent company for all these brands that you can see on screen — Betfair, Paddy Power, Sky Betting and Gaming. 

With the boring stuff out of the way, I want to talk today about our journey to PKI automation and the costs associated with this migration in particular — but it can relate to any migration.

 

While on Twitter, I noticed this little gem. If you're considering replacing one tool with a different tool that mostly does the same thing, you need to take account of the productivity gains that you get from switching to that tool.

 

Don't do it unless you expect a 10x productivity improvement. And it makes a lot of sense to me, because if you factor in the replacement cost, the learning curve, onboarding applications to your new tool, and getting everyone on board, you need to be able to pay for that in productivity improvements.

Some Context on Our Deployments 

We have a big on-prem estate that spans two data centers. It's roughly 20,000 VMs and 1,000 hypervisors, deploying around 1,200 applications or microservices. And this isn't even taking into account the AWS side of things, where we have mostly containerized workloads.

But all these services require TLS certificates, as you can imagine. They're mostly Java-based applications, so they require keystores, truststores and all of the good stuff. It's around 4,000 certificates that we need to manage, and for this, our old solution was bad, I would say.

 

It employed a manual process where the user had to log into this particular tool, issue their certificate, download it, and manually apply it wherever it was needed: committing it to version control systems, then changing keystores, truststores, all of that. And as you can imagine, it's a pretty error-prone way of doing things. There are a lot of risks when it comes to manually handling certificates. There are also security concerns around storing certificates in version control, so just don't do it.

A Time-Intensive Process 

But probably the biggest cost that we faced here was the time cost. It took a DevOps engineer roughly a day to take that certificate and make the necessary changes to the keystores, truststores, and various repositories where it was needed. Then you needed to talk to our networks team, who had to apply the given leaf certificate onto the load balancers, because only they have access there.

 

You need to raise a ticket with them, all of that. Then you have to wait to test this particular certificate in multiple environments, and validate that it's the right thing to do and that it was done correctly.

 

This requires other teams as well to cross-validate the changing of the certificate. It takes roughly, I don't know, a week for a certificate to be fully deployed onto every single VM or container where it's needed. That's five man-days for each certificate.

We ended up with DevOps teams being stuck in this constant changing of certificates for two months, which is bad. But luckily, it's only done once a year. The main feedback we got from the teams doing it when we started this process was that everyone hated the certificate management part.

 

Well, everyone hates it except you and me, because we're probably nerds, but that's beside the point. They hate it because it's complex, it's risky, and it takes a lot of time. And don't tell security about the TTLs: they're over a year.

Desired PKI Flow

So, say hello to my little friend, Vault. I know it's a shocker at HashiConf. We made use of Vault's PKI backend. What does the whole process look like for the tier-two certificates?

 

These are the non-customer-facing certificates that you use internally. You have one root and one ICA (intermediate CA) for each availability zone. You have one set for development and one set for production, and each ICA has roles defined for every single application that we have.

These roles are defined in version control, and security approves them. They contain the certificate details like CNs, SANs, and TTLs: every bit of information that pertains to a certificate. All the user has to do is trigger a pipeline. This pipeline connects to the various APIs that we use, like load balancer APIs, OpenStack APIs for on-prem and, among other things, Vault, which delivers a leaf certificate based on the role that security has accepted as a merge request.

 

This particular certificate is automatically generated, and each time you redeploy your pipeline it gets auto-renewed. It's also app-specific: there are no wildcards involved. We used to have those.
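As a rough illustration of what "a role in version control plus a pipeline call to Vault" can look like, here is a minimal sketch that builds a request to Vault's PKI issue endpoint. It is not the pipeline code from the talk. The role name, domain, mount path, and TTL values are invented for the example; the field names (allowed_domains, allow_wildcard_certificates, max_ttl, common_name, alt_names, ttl) are the ones Vault's PKI secrets engine documents.

```python
import json
import urllib.request

# Hypothetical role definition, as it might be stored in version control for
# security to review. All values here are invented for illustration.
ROLE = {
    "name": "example-service",
    "allowed_domains": ["example-service.internal.example.com"],
    "allow_bare_domains": True,
    "allow_subdomains": False,
    "allow_wildcard_certificates": False,  # app-specific, no wildcards
    "max_ttl": "720h",
}


def build_issue_request(vault_addr, token, mount, role, common_name, alt_names, ttl):
    """Build (but do not send) a POST to <mount>/issue/<role> on Vault's HTTP API."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/issue/{role['name']}"
    payload = {
        "common_name": common_name,
        "alt_names": ",".join(alt_names),  # Vault expects a comma-separated string
        "ttl": ttl,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def parse_issue_response(body):
    """Pull the PEM-encoded leaf certificate, key and chain out of Vault's JSON reply."""
    data = json.loads(body)["data"]
    return {
        "certificate": data["certificate"],
        "private_key": data["private_key"],
        "ca_chain": data.get("ca_chain") or [data["issuing_ca"]],
    }
```

A pipeline step would send the request with urllib.request.urlopen (or any HTTP client) and pass the response body to parse_issue_response. Because the role caps the TTL and the allowed names, the pipeline can only ever ask for what security has already approved.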

 

It then gets uploaded to our load balancers via this framework, where it's bound to the service that you're using, and auto-injected into your VM, where you can use it directly as is, as a PEM file.

 

We've created this automation to set up your truststore and your keystore, injecting third-party certificates as well into these various sources of trust. And we support multiple certificates for various business flows. There are Java applications that use multiple keystores depending on the business flow.

 

This is the automation we've built around keystore management. This is an Ansible module and a Chef provider (a Chef custom resource) for keystore management. Please ignore the bugs. There's a truststore version of these as well, so we handle truststore management too.
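The Ansible module and Chef resource shown on the slides are not reproduced in this transcript. As a stand-in, the sketch below shows one common way to turn PEM files into a Java-readable keystore and truststore using the standard openssl and keytool command-line tools. The function names, file paths, aliases, and the choice of PKCS#12 for both stores are assumptions made for the example, not details from the talk.

```python
def keystore_command(cert_pem, key_pem, chain_pem, alias, keystore_path, password_env):
    """Return an openssl command that bundles cert, key and chain into a PKCS#12 keystore.

    The password is read from an environment variable so it never appears in
    the argument list.
    """
    return [
        "openssl", "pkcs12", "-export",
        "-in", cert_pem,
        "-inkey", key_pem,
        "-certfile", chain_pem,
        "-name", alias,
        "-out", keystore_path,
        "-passout", f"env:{password_env}",
    ]


def truststore_commands(ca_pems, truststore_path, password_env):
    """Return one keytool command per CA certificate to import into the truststore.

    ca_pems maps an alias to a PEM file path. This covers both internal CAs
    and any third-party certificates the application needs to trust.
    """
    commands = []
    for alias, pem_path in sorted(ca_pems.items()):
        commands.append([
            "keytool", "-importcert", "-noprompt", "-trustcacerts",
            "-alias", alias,
            "-file", pem_path,
            "-keystore", truststore_path,
            "-storetype", "PKCS12",
            "-storepass:env", password_env,
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) on a host that has openssl and a JDK installed. Building the truststore from an explicit alias-to-file map is one way to avoid the "huge unmanaged truststore" problem, since nothing gets trusted unless it is listed.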

But I hear what you're saying: Where are the tier-one certificates? Where are the customer-facing certificates? The ones that are supposed to be publicly trusted? We needed to cater for that aspect too, because the old solution did that as well.

 

For that, we've built a Vault plugin that connects to our CA provider, which is Hydrant ID. I don't know if you've used it. This issues, manages, or revokes the public certificates the way you would expect.

This is just the interface, as you can clearly see. We got it open sourced, so pull requests are welcome if you're using the same CA provider. Hopefully, soon, it'll be made available on the HashiCorp plugin portal. If you see the HashiCorp people, bother them to approve the merge request; it's already up on their site.

Summarizing the Process

To recap, you get a new certificate every time you deploy your service. All you have to do as a developer is raise this merge request in a given repo with the certificate details — so domain CN, SANs, all of that.

You deploy a pipeline, click a button, you automatically get a certificate or multiple certificates from Vault. This ends up on your VMs, bound to your load balancer config. You get the truststore and the keystore as well. How long does it take? I don't know, maybe an hour, and this is just the first time you do it because afterwards, it's just a button click. The first time you configure the certificate details, that's when the work is being done. 

This is also done for AWS. We have a cron job that runs every week, regenerating your certificates and adding them to AWS Certificate Manager. From there, we use CDK pipelines to automatically bind the certificate that is already present in Certificate Manager to your load balancer, or inject it into your VM, or whatever deployment you have going on there. A week is obviously far better than a year without changing certificates.
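The weekly AWS job itself is not shown in the talk. A minimal sketch of its core step, under stated assumptions, might look like the following: take a freshly issued certificate and import it into AWS Certificate Manager, reusing the existing ARN when there is one. The function name and the shape of the issued dictionary (certificate, private_key, ca_chain as PEM strings) are hypothetical. The import_certificate call and its Certificate, PrivateKey, CertificateChain, and CertificateArn parameters are those of the ACM ImportCertificate API as exposed by boto3.

```python
def import_into_acm(acm_client, issued, existing_arn=None):
    """Import a PEM certificate into ACM and return its ARN.

    acm_client is expected to behave like boto3.client("acm"). Passing
    existing_arn re-imports over the same ARN, so resources already bound to
    that certificate pick up the renewed one.
    """
    params = {
        "Certificate": issued["certificate"].encode("utf-8"),
        "PrivateKey": issued["private_key"].encode("utf-8"),
        "CertificateChain": "\n".join(issued["ca_chain"]).encode("utf-8"),
    }
    if existing_arn:
        params["CertificateArn"] = existing_arn
    response = acm_client.import_certificate(**params)
    return response["CertificateArn"]
```

Taking the client as a parameter keeps the function testable without AWS credentials. In a real job it would be called with boto3.client("acm") and the ARN recorded from the previous run.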

Resolving the Time Cost Issue 

Going back to the original time cost, we went from having multiple teams involved for around a week to change a single certificate to about an hour the first time you do it. And all the developer has to do is kick off a pipeline, and that's it: they're done.

So, going back to the initial tweet, which said you need a 10x productivity improvement to justify replacing one tool with another that does the same thing: I think we managed to do that. In the process, we also eliminated various security risks: long TTLs, certificates stored in version control, and huge unmanaged truststores that probably contained certificates that didn't belong there. So it's been really good.
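To make the 10x comparison concrete, the figures quoted in the talk (about five man-days per certificate before, about an hour after) can be put into a back-of-the-envelope calculation. The eight-hour working day is an assumption added here and is not stated in the talk.

```python
HOURS_PER_WORKING_DAY = 8  # assumption, not a figure from the talk


def speedup(before_person_days, after_hours, hours_per_day=HOURS_PER_WORKING_DAY):
    """Ratio of effort before the migration to effort after it."""
    return (before_person_days * hours_per_day) / after_hours


ratio = speedup(before_person_days=5, after_hours=1)
print(f"{ratio:.0f}x")  # prints "40x"
```

Under that assumption the first-time setup alone is roughly a 40x improvement, comfortably past the 10x threshold, and later renewals cost only a button click.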
