Hear how YNAP used AWS Lambda functions to reduce the disaster recovery time for HashiCorp Vault to mere seconds.
The Yoox Net-A-Porter group solved their challenges around hosting HashiCorp Vault cluster operations on AWS Lambda and share their solutions in this talk.
Learn about the processes Yoox Net-A-Porter group created for demoting/promoting Vault disaster recovery clusters, generating DR operations batch tokens, switching the weights of a DNS weighted-policy between DR clusters' endpoints and more. With these processes being run as AWS Lambda functions, they managed to have a Mean Time To Recovery (MTTR) of just a few seconds.
Disaster recovery is always a debated task and a hot topic in companies — and there are different approaches that one could take regarding this topic. In particular, one could say, well, we'll never have a disaster, so why should we care about that? Or we could even plan a bigger disaster to let our own disaster become irrelevant.
But talking more seriously, when we speak about Vault, it is a central piece in our IT infrastructure. So, it's nice to have a disaster recovery plan in case of a disaster.
We'll quickly go through our basic architecture. Here we are leveraging the HashiCorp Vault Enterprise license, so we have access to Vault’s disaster recovery. We start here with a Vault cluster, which is five EC2 instances behind an autoscaling group — and everything is behind a network load balancer.
It makes sense to create a Route 53 endpoint for this network load balancer, so that if the load balancer is destroyed we still have a stable name, which we call, for example, vault-primary.company.abc. This constitutes our primary cluster. We'll then also have a higher-level endpoint, another Route 53 CNAME or alias, which we call vault.company.abc, which will be used by users to interact with our primary Vault cluster.
Leveraging the enterprise license, we have another cluster — a complete replication of the primary. Its endpoint will be defined as, for example, vault-dr.company.abc. And the data will be clearly in sync between those two clusters. If the primary goes down, alerts will come in, and everything will be a disaster indeed.
We need to see now how we could recover in case of a disaster. The procedure in case of a disaster for a Vault cluster is to, first of all, promote the secondary to a new primary. This process requires either the recovery keys or a DR operation token.
Clearly, this token needs to be created beforehand. After that, considering the architecture that we saw before, we also need to change the Route 53 records to point to this new primary. Done by hand, this is slow and error-prone. Also, when changing Route 53 records, we can run into caching problems.
First, we are going to go through the workflow that we designed at YNAP to overcome these issues using AWS Lambda functions. We'll go into more details regarding the design of these functions. And finally, we'll inspect the repos that we use to construct all of this infrastructure.
Hello, I'm Kevin. I'm a DevOps engineer at YOOX Net-a-Porter Group. And this is how you can find some of my online presence. If you want to contact me, please be my guest and drop a message.
We'll start with the high-level introduction and architecture. Then we'll go through the design of the AWS Lambda functions. Afterward, we'll go into a more in-depth analysis of the solution, and then we'll draw some conclusions.
This is the picture that we saw before of the architecture. To overcome the DNS issues — the caching problems — we created a weighted policy. In Route 53, you have the possibility to create DNS routing policies.
For our high-level domain, vault.company.abc, we pointed it at both clusters, and we set the weight for the primary cluster to the maximum, 255, and the weight for the secondary cluster to the minimum, zero. In this way, all the traffic will be directed to the primary and not the secondary.
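As a rough illustration of the weighted policy just described, the sketch below builds the pair of weighted records as plain Python dictionaries. It assumes CNAME records (the talk says "CNAME or alias"), and the set identifiers `primary` and `secondary`, the 60-second TTL, and the function names are all made up for the example.

```python
MAX_WEIGHT = 255  # Route 53 weights range from 0 to 255
MIN_WEIGHT = 0


def weighted_cname_change(record_name, target, set_identifier, weight, ttl=60):
    """Build one UPSERT change for a weighted CNAME record (illustrative)."""
    if not MIN_WEIGHT <= weight <= MAX_WEIGHT:
        raise ValueError(f"weight must be between 0 and 255, got {weight}")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": set_identifier,  # hypothetical identifier
            "Weight": weight,
            "TTL": ttl,  # assumed value; a short TTL limits stale answers
            "ResourceRecords": [{"Value": target}],
        },
    }


def initial_change_batch(record_name, primary, secondary):
    """All traffic to the primary endpoint, none to the secondary."""
    return {
        "Comment": "Vault weighted policy: primary=255, secondary=0",
        "Changes": [
            weighted_cname_change(record_name, primary, "primary", MAX_WEIGHT),
            weighted_cname_change(record_name, secondary, "secondary", MIN_WEIGHT),
        ],
    }


batch = initial_change_batch(
    "vault.company.abc",
    "vault-primary.company.abc",
    "vault-dr.company.abc",
)
```

A dictionary of this shape is what boto3's Route 53 client accepts as the `ChangeBatch` argument of `change_resource_record_sets`, together with the `HostedZoneId` of the zone.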
Now, let's see, in case of a disaster, what is our workflow? Disaster happens, an IT human is triggered. We'll start with the first Lambda in our picture, which we call the DR Operations Lambda. The IT human will invoke this Lambda, and this one will retrieve the DR Operation token. This is the token that is needed to interact with the secondary cluster — to promote it, and to eventually demote the primary cluster.
This DR Operation token will be put in secrets manager by another Lambda, which we call the DR Operation token gen Lambda. And finally, once the DR Operations Lambda has taken this DR Operation token, it will be able to promote the secondary cluster to a new primary.
After that, we will invoke another Lambda, which we'll call the DNS switch. It will swap the weights of the Route 53 DNS records: it will set the weight of the old primary to zero and the weight of the new primary cluster to the maximum value.
I'd like to demonstrate these concepts in a quick demo, and we'll see how these two scripts will be used and will make all these changes in the primary cluster and the DNS. We can see here that we have our primary cluster, vault-primary.hashiconf.demo. And there is the login screen.
Then we have our secondary cluster, which is in replication mode, vault-dr.hashiconf.demo, and then we have our higher-level endpoint, which will point to the primary cluster. Now we can go to our bash terminals. We have two terminals here, the first for the promotion and demotion and the second for the DNS switch.
In the first terminal, we can launch the promotion and demotion script, and we can start the timer here. It will check the primary endpoint and the secondary. It will demote the primary, promote the secondary — and it will link them together.
Now we can go back to the Chrome tabs. You can see here that the old primary is now the secondary, and the old secondary is now a primary. We can go to the DNS switch part and deploy it. This will change the DNS weights and allow the higher-level endpoint to point to the right cluster.
Here we have some caching from the browser. We clear Chrome's DNS cache. We refresh, and here we have the new primary. Our endpoint points to the new primary, which was the old secondary. And all of that took less than a minute.
Let's dive into the design of these AWS Lambda functions. First, let's look at the general architecture that we have. We have a central Lambda, which we'll call Auth Lambda. This Lambda will be in charge of authenticating to HashiCorp Vault and returning a token to the other Lambdas.
The step zero for the other Lambdas is to invoke this Auth function. This will authenticate to Vault using the roles passed as payload. It will get the token, it will return the token to the Lambdas, and then all these Lambdas will be able to interact with Vault.
In our story here for the disaster recovery plan, we have four Lambdas. The AWS Auth Lambda, the DR Operation token gen Lambda, the DR Operations, and the DNS switch. Let's inspect all of these one by one.
Let's start with the first. This Lambda, first of all, takes as an input the AWS Auth method role from the calling Lambda. Then it will sign the payload using the AWS STS service. It will make a request to Vault to authenticate, and then it will return the token, so the calling Lambda will be able to interact with HashiCorp Vault.
Now, let's see a graph. We have our Auth Lambda, which will be invoked by another function, passing the role as a payload. This Lambda will sign the payload using the AWS STS service. It will make a POST request to the /auth/aws/login endpoint of our Vault cluster, and then Vault will return a token. After that, the Auth Lambda will be able to return the payload to the calling Lambda.
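To make the login step a little more concrete, here is a minimal Python sketch of the request body that Vault's AWS auth method expects for an IAM login. It is not the Auth Lambda's actual code (that Lambda runs on Node.js). It assumes the STS `GetCallerIdentity` request has already been signed elsewhere, for instance with botocore's SigV4 signer, and that the signed headers are passed in. The function names are hypothetical.

```python
import base64
import json

# The STS request that Vault replays to verify the caller's identity.
STS_URL = "https://sts.amazonaws.com/"
STS_BODY = "Action=GetCallerIdentity&Version=2011-06-15"


def _b64(text):
    """Base64-encode a string, as the login endpoint requires."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_login_payload(role, signed_headers):
    """Body for POST /v1/auth/aws/login (IAM auth type).

    `signed_headers` is assumed to hold the SigV4-signed headers of the
    GetCallerIdentity request; the signing itself is out of scope here.
    """
    return {
        "role": role,
        "iam_http_request_method": "POST",
        "iam_request_url": _b64(STS_URL),
        "iam_request_body": _b64(STS_BODY),
        "iam_request_headers": _b64(json.dumps(signed_headers)),
    }


def extract_client_token(login_response):
    """Pull the Vault token out of a successful login response."""
    return login_response["auth"]["client_token"]
```

The token extracted at the end is what the Auth Lambda would hand back to whichever Lambda invoked it.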
This Lambda is in charge of generating the DR Operation token and uploading it to secrets manager. First of all, it is scheduled every eight hours using CloudWatch events, and the DR Operation token itself also has a TTL of eight hours.
Then it will invoke the AWS Auth Lambda — the previous Lambda — to get a Vault token since it needs to interact with Vault. After that, it will make the request against Vault to get this DR Operation token. Finally, it will upload this token to secrets manager. Once the token is there, it can be used by other Lambdas or by anyone else. It will be available in case of a disaster.
Let's see a picture of it. We have our DR Operation token gen Lambda, which is scheduled every eight hours by using a CloudWatch event. This Lambda calls the Auth Lambda, which will authenticate to Vault, get a token, and return that token to our initial Lambda.
In this way, it can request a DR Operation token from Vault since it is authenticated, and then Vault will return this token. Afterward, our DR Operation token gen Lambda will be able to upload and/or refresh the DR Operation token in secrets manager.
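The refresh cycle above could be sketched roughly as follows. This is an illustration under assumptions, not the repo's code: it assumes the DR Operation token is minted through a Vault token role (via the `auth/token/create/<role>` endpoint), that `send` is a hypothetical callable that performs the HTTP request and returns the parsed JSON response, and that `secrets_client` behaves like boto3's Secrets Manager client.

```python
import json
import urllib.request


def build_token_request(vault_addr, role_name, vault_token):
    """Build (but do not send) the request that asks Vault for a new token.

    `role_name` is a hypothetical token role configured to issue batch tokens.
    """
    url = f"{vault_addr.rstrip('/')}/v1/auth/token/create/{role_name}"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),
        method="POST",
        headers={"X-Vault-Token": vault_token, "Content-Type": "application/json"},
    )


def refresh_dr_token(send, secrets_client, secret_id, request):
    """Fetch a fresh DR Operation token and store it in Secrets Manager.

    `send` is an injected callable (assumption) so the sketch stays testable.
    """
    response = send(request)
    token = response["auth"]["client_token"]
    # put_secret_value adds a new version of an existing secret.
    secrets_client.put_secret_value(SecretId=secret_id, SecretString=token)
    return token
```

Injecting `send` and `secrets_client` keeps the function free of network calls, so the same logic can be exercised with fakes before it is wired to a real schedule.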
This Lambda is in charge of promotion, demotion, and linking the Vault clusters. First, it will infer which endpoint is the real primary and which is the secondary. As we saw before, we have the second-layer endpoints, vault-primary and vault-dr, but the name does not always reflect which cluster is currently the primary and which is the secondary. Then it will retrieve the DR Operation token from secrets manager.
After that, it will try to demote the primary. Clearly, in case of a disaster, it will not be able to do that, but it will go through anyway — and then it will promote the secondary. The important thing is to promote the secondary in case of a disaster. After that, it will try to link the two clusters together. Clearly, in case of a disaster, this will not go through, but it's OK.
Let's see a picture of that. We have our DR Operations Lambda, which will first retrieve the DR token. Then it will check the endpoint to see which one is the real primary. Then it will demote the primary, promote the secondary, and then try to link these two together.
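The key design point here is that only the promotion is mandatory, while demotion and linking are best effort. A minimal Python sketch of that control flow is shown below. The three callables are placeholders for the actual Vault calls (presumably requests against the DR replication endpoints such as `sys/replication/dr/primary/demote` and `sys/replication/dr/secondary/promote`), and the function name and result format are invented for the example.

```python
def fail_over(demote, promote, link):
    """Run the failover steps; only the promotion is allowed to abort the run.

    demote, promote and link are injected callables (hypothetical) that wrap
    the corresponding Vault API requests.
    """
    results = {}

    # Best effort: in a real disaster the old primary is unreachable.
    try:
        demote()
        results["demote"] = "ok"
    except Exception as exc:
        results["demote"] = f"failed: {exc}"

    # Mandatory: let any error propagate to the caller.
    promote()
    results["promote"] = "ok"

    # Best effort: linking needs both clusters to be reachable.
    try:
        link()
        results["link"] = "ok"
    except Exception as exc:
        results["link"] = f"failed: {exc}"

    return results
```

In a planned swap all three steps would report `ok`; in a real outage the demote and link entries would carry the error while the promotion still goes through.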
Now, we have our final Lambda. This one too will, first of all, check which endpoint is the primary and which is the secondary. Then it will set the weight for the real secondary to zero, and it will set the weight for the real primary to the maximum value.
We can take a look at the picture. We have the DNS switch Lambda function. We check, first of all, which cluster is the primary one. Then we set the weight to zero for the real secondary; after the primary and secondary have been swapped, that is the old primary. And then we raise the weight for the new primary cluster to 255, the maximum.
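The decision logic of the DNS switch can be sketched in a few lines of Python. The sketch assumes that each endpoint's DR replication mode has already been fetched (for example from Vault's DR replication status endpoint) and is given as a string such as `primary` or `secondary`; an unreachable endpoint can carry any other placeholder value. The function names are hypothetical.

```python
MAX_WEIGHT = 255
MIN_WEIGHT = 0


def find_primary(modes):
    """Return the one endpoint whose DR replication mode is 'primary'.

    `modes` maps endpoint name -> mode string (assumed to be fetched already).
    """
    primaries = [endpoint for endpoint, mode in modes.items() if mode == "primary"]
    if len(primaries) != 1:
        raise RuntimeError(f"expected exactly one primary, found {len(primaries)}")
    return primaries[0]


def ordered_updates(modes):
    """Weight updates to apply, lowest weight first.

    Zeroing the secondary before raising the primary follows the order
    described in the talk.
    """
    primary = find_primary(modes)
    weights = {
        endpoint: (MAX_WEIGHT if endpoint == primary else MIN_WEIGHT)
        for endpoint in modes
    }
    return sorted(weights.items(), key=lambda item: item[1])
```

Because the weights are derived from the observed replication mode rather than from the endpoint names, the same function works in both directions of a swap.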
Now that we have analyzed the Lambda functions from a higher level, we can go into more details. We can analyze the repos that we developed for our solution. We have two repos: one with the live environments and one with the modules. The first one is the IAC-Vault-DR, and this is the repo containing the live environment.
What is deployed will be in there. We have three environments here, as we can see: the dev, the stage, and the prod clusters. This repo will reference the module repo, which is the other one, the Terraform-Vault-DR. And there, we will define all the code: the Terraform code for the Lambda functions and for the Vault policies, and the source code for the functions.
All of this you can find in my GitHub. If you want to take a look at these repos (iac-vault-dr and terraform-vault-dr), please go ahead. If you find some bugs or errors, please contact me, and I'll be more than happy to fix everything and to talk about that.
Here we find three environments, the dev, the prod, and the stage environments. We have four Terraform files, the backend.tf, the main.tf, the locals.tf, and the providers.tf.
If we take a look at the locals.tf, we are defining the local variables. The only one that will be different between each environment is the environment one. In this case, we are inspecting the dev one.
Then we have the hosted zone name, in this case, hashiconf.demo. Then, we have the three endpoints, the higher-level endpoint and the two lower-level endpoints. And finally, the Vault address, which will be used for the Vault provider.
Then we can look at the main.tf. Here, as you can see, we reference the Terraform-Vault-DR module. The cool thing about that is we can use a specific version. For the dev cluster, at YNAP we usually take the module from the master branch, so we can develop our module in the master branch. Then we can deploy immediately in the dev cluster, so we can see the changes.
If everything is fine, we can target this specific commit with a tag. Then, we can use the tag to deploy, for example, the stage environment. Now, the stage environment is used for more in-depth tests, and then we can propagate the tag also to the prod environment.
In this case, we can see we pass all of these variables to the module: the environment, all the endpoints, and an identifier, which in this case is just a random ID. Now, the combination of the environment and the identifier, as we will see shortly, will be unique for each cluster and for each deployment. So, we'll not have any clash between deployment names.
Then we can see the providers.tf. Here we have just two providers, the AWS provider and the Vault provider.
And finally, we have our backend.tf file. Here we are defining where we store the state file — in this case, on an S3 bucket. This will also differ between each environment. This is the dev case — we have a specific dev.tfstate key for that cluster.
In this repo, we can also find the Makefile and the Jenkinsfile. In the Makefile, we define the targets that will be used in the Jenkinsfile to install dependencies, make some tests, and eventually deploy the cluster.
We can take a look at the Makefile here. At the top, we have some variables which define the versions of the binary we are using. Then we have some targets — if you want to take a look, you can go to the repos on my GitHub and you can see the code.
In particular, here, we have some targets for installing the dependencies, Terraform, Checkov, and another thing. You can just see that. Finally, we have the test target, `test/checkov`. Checkov is a static analysis tool that checks the Terraform code for compliance with some best practices. Here we also run this test.
And finally, we have the deployment targets. We have all the terraform init, plan, apply and eventually destroy. And we have the environment variable, which will define where we are deploying: dev, stage, or prod.
The Jenkinsfile here is an example. Here we use a declarative pipeline. We define everything to be run inside a Docker container, and the image is just a custom image. It's centos:my-centos. This is also one that I created for the purpose of this demo, and you can also find this on my GitHub.
There are some stages and a post action. The first two stages are common. Here we are using a GitOps approach, so the first two stages will be run on commits, on pull requests, and on merges. The first is to gather the Vault token, so you manually enter a Vault token to interact with the cluster.
Actually, at YNAP we avoid entering the token manually since we leverage the AWS Auth method. But for the purpose of this demo, I placed this stage to insert the token manually to deploy resources into the Vault clusters.
Then we have a stage for installing the dependencies. Then we have the stage that runs the test. This stage, which is the third one, will be run when we are committing on a branch that is not master.
Then we have the `terraform plan`, and this plan will be run when we are making a pull request. Finally, the `terraform apply` will be run only when we are merging to master.
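The gating rules of the pipeline can be summed up as a small decision function. The real pipeline is a declarative Jenkinsfile; the Python below only models which stages run for which event, and the event names (`push`, `pull_request`, `merge`) and stage labels are made up for the illustration.

```python
COMMON_STAGES = ["gather vault token", "install dependencies"]


def stages_for(event, branch):
    """Return the stages that run for a given Git event (illustrative model).

    event:  'push', 'pull_request' or 'merge' (hypothetical labels)
    branch: the branch being pushed to, or the target branch of the merge
    """
    stages = list(COMMON_STAGES)  # always run, whatever the event
    if event == "push" and branch != "master":
        stages.append("test")
    elif event == "pull_request":
        stages.append("terraform plan")
    elif event == "merge" and branch == "master":
        stages.append("terraform apply")
    return stages
```

For example, `stages_for("pull_request", "feature-x")` yields the two common stages followed by the plan, and nothing is ever applied outside a merge to master.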
Finally, we have a post action that will always be run, which is the removal of some Terraform resources. This is needed as we are running the container using root privileges, which is not good practice, but for the demo it's OK. We need to remove some Terraform resources that are created as root. If we do not delete these resources here, Jenkins will not be able to delete them, since Jenkins runs as the Jenkins user and the resources are created as root. So, we need that stage.
We can see the pipeline here. When we push to a branch, we gather the tokens, we install dependencies, and we run some tests. Right now, that is only the Checkov test. We can see 44 sources passed. Then we make a pull request. The plan is triggered, and we see here 44 resources to be added. Then on merge to master, only the deployment part is run. In this case, 44 resources added.
Here we can find Makefile, Jenkinsfile and a terraform folder. In this folder, we can find all the definitions for the Lambda functions, the Vault methods, policies. And also the source folder, the SRC folder, which will contain the code for the Lambdas themselves.
As we can see here, we have three Lambdas written in Node.js and one written in Python. That's the good thing about Lambdas: we can write the code in whatever language Lambda supports without any problems. And the Makefile will contain the installation of Golang, which we will use to run some Terratest tests, and the installation of Terraform and Node.js.
Then we have a target to run the tests, which are contained in the test folder. The test folder, as we can see, has an example folder and a main_test.go file which will contain the Terratest code. The example is a mock of the environment, so it will make use of the Terraform module here. And the Go file will run a `terraform init`, `plan`, and `apply` of the cluster, then destroy it.
We can see here, the Jenkins pipeline will gather the Vault token, install dependencies, and run the test. In the test here, we can see we have two passed tests: one for the creation of all the resources and one for the destruction. Clearly, one should put more tests in the Go file, but for the purpose of this demo, this is OK.
If we go back to our Terraform resources, we can inspect our module. Let's see, first of all, the inputs of our module, which are the environment, the Vault endpoint, the endpoint of the primary, and the endpoint of the secondary cluster. Then we have the hosted zone name, and then we have the identifier.
Now, this module will produce the four Lambda functions that we saw before, along with all the resources around them. It will also deploy the Vault policies needed and the Auth method roles on both. Every resource will have its own unique name, of the form name-environment-identifier.
Here, we have three variables, the unique ID, which is the combination of the environment and the identifier. Then we have the name of the operation token and the role name.
Let's look at the code for the AWS Auth Lambda. Here, we have our module. We make use of the serverless.tf Lambda module, which is a cool module to deploy Lambdas. And we have defined a function name like that. And the runtime is Node.js.
We make use here of the source path, and we point to the source that we saw before. And as an environment variable, we just have the Vault endpoint. For the other Lambdas, we'll not go through the code as there is no time for that, but you can look at the GitHub.
Here are the resources that are needed. For example, for the DNS switch. And here are the environment variables that it needs to deploy this switching mechanism correctly. Then we have our DR Operation token gen Lambda that will need the secrets manager access to put the new secret value and then invoke the Auth Lambda.
It will also contain the schedule, the CloudWatch trigger, and the creation of the secret in AWS secrets manager. These are all the environment variables that this Lambda needs.
Finally, we have the DR Operations Lambda. This Lambda will need access to the secret, so it needs to have these permissions for secrets manager, and it will need these environment variables to make the promotion and demotion operations.
In the Terraform-Vault-DR, we can see here that there is a Vault_policies.tf. And first of all, we have the policies taken by the DR Operation token itself. And it will have access to the `/dr/*` path to make all the promotion, demotion, linking, etc. with the capabilities to do everything. That's our token.
Then we have the token that will be given to the Auth Lambda when it authenticates to Vault. This token will just need to be able to create the DR Operation token. This is the path that it needs, with the update and create capabilities.
Now the authentication method. As we can see in the module, there is Vault_Auth_backends.tf, and it will contain the roles for the backends. The first is for the DR Operation token, which will be a batch token. A batch token is always orphan and not renewable, so maybe these two attributes are not necessary, but they are put in there anyway. And we reference the policy that we created before.
Then we have the AWS Auth method role, for which we make use of a data source to make sure that the backend is mounted. Then, we define the token TTL to be five minutes, with a maximum of 10 minutes. And we bind it to the IAM role of the AWS Auth Lambda that we created before.
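For readers who think in API calls rather than Terraform, the role described above corresponds roughly to the following request body for Vault's AWS auth role endpoint. The module itself defines this in Terraform; the Python below is only an equivalent sketch, and the ARN, policy name, and function name are invented for the example.

```python
def aws_auth_role_payload(lambda_role_arn, policy_name):
    """Approximate body for creating an IAM-type role on the AWS auth method.

    Both arguments are placeholders supplied by the caller.
    """
    return {
        "auth_type": "iam",
        "bound_iam_principal_arn": [lambda_role_arn],  # only this IAM role may log in
        "token_policies": [policy_name],
        "token_ttl": 5 * 60,       # five minutes, in seconds
        "token_max_ttl": 10 * 60,  # ten minutes at most
    }


payload = aws_auth_role_payload(
    "arn:aws:iam::123456789012:role/example-auth-lambda",  # hypothetical ARN
    "example-dr-token-creator",                            # hypothetical policy
)
```

Keeping the TTL this short fits the usage pattern: the token only has to live long enough for one Lambda invocation to request the DR Operation token.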
I do not have time to go into more details on that, but you can check the code — please, you should do that. And you can message me too if there are any problems, I will be more than happy to go through it.
We have defined four Lambda functions and two repos in a GitOps approach and three environments. All of this allows us to come to a sub-minute recovery time of the Vault cluster in case of a disaster.
Also, in terms of scalability, we can reuse this AWS Auth Lambda function for other Lambdas. For example, we can use it for a crawler. If we want to crawl Vault, we can use it to create a specific token for that purpose. Or we can create a Lambda to move secrets around, even between clusters, or whatever else.
These are some use-cases. We can also use this promotion and demotion procedure to upgrade the clusters. We can upgrade the secondary cluster. Then we can swap primary with secondary. We upgrade the old primary, which is now the secondary one. Then we can swap again. In this way, we have minimized the downtime, and if some errors occur, they are always on the secondary side.