Learn how Wix automatically upgrades their multi-regional, replicated Vault clusters running on Kubernetes at scale.
Wix uses multi-regional, replicated Vault clusters running on Kubernetes. Upgrading these at scale can be a chore, so they took the documented upgrade procedure and automated it using a CD pipeline.
In this talk, you will learn how they went from manual upgrades to fully automated upgrades to Vault clusters at scale. They’ll take you through the step-by-step process and how it makes it simple to update Vault throughout their architecture throughout the year.
My name is Moran Frimer. I'm a security platform team leader at Wix Infrastructure. Before I arrived at Wix, I spent four and a half years working at Check Point as a DevOps team leader, and all around, I've been, in total, in the DevOps world for more than 10 years.
When I first started working at Wix, my first assignment was to upgrade all of our Vault clusters. Although I had no past experience working with Vault, I thought it was a straightforward task. Man, I was wrong.
Let's start. As you know, every few weeks, HashiCorp releases a new Vault version, usually with cool new features and sometimes also important security fixes. So, I guess now, most of you, like me, need to manually upgrade all of your Vault clusters.
Why must we upgrade our Vault frequently? Every year, on Vault alone, several CVEs are fixed, along with many new features and bug fixes. You also don't want to wait too long after a new version arrives. The longer you wait, the bigger the gap between your current version and the latest one, and the more difficult the upgrade becomes.
Although it can be time-consuming, upgrading your Vault to the latest version is just something we can't avoid. At Wix, we decided to set up a CD pipeline that automatically upgrades all of our Vault clusters. Today, I will discuss our experience with upgrading Vault frequently, and together, we will save you and your company a lot of valuable time and money.
This is our problem: making safe changes to all of our Vault clusters without any stressful manual action. This is what we will focus on. Just for you to get a better understanding of what we need to deal with, it's important for us to talk about the Wix way of making changes in production. We're going to use the same capability that already exists in Wix to do safe changes on our Vault clusters.
Let's talk about the Wix way. We are all about our users. This is our focus. In our production environment, that means we won't tolerate any downtime. And as you guys know, secrets usually are the basic configuration for every application. An application without its secrets will just not work.
So Wix made sure our production environment is fully highly available across multiple regions, and we can shift our user traffic between regions without any user impact — not even in performance or latency. This setup allows us the ability to conduct infrastructure changes in a very safe way on our production environment. Let's talk about how we are doing it.
First, we have the ability to completely drain a production datacenter (DC) of its user traffic. That means all of our thousands of applications running on top of Kubernetes will automatically scale down.
The DC will be completely drained.
Then, we can safely perform our infrastructure change we want. For example, we can upgrade our Kubernetes clusters or make configuration changes to our Nginx. We will also run a synthetic test on our change to test specifically the functionality we want.
Now, the next part is cool. Full test is the step where we're starting up all the Wix applications.
By starting up all Wix applications, the thousands of applications will immediately scale-out, and they all together will start consuming secrets from our Vault clusters.
We call this step a pre-warm DC. We're getting the DC ready to resume live traffic. And the last thing we have to do is return the production traffic to the DC. So now we will see how we will use the same capability already in the existing Wix ecosystem to safely conduct Vault upgrades on our Vault clusters.
As you may have guessed, our Vault setup is completely identical to our production one. Multi-regional DC, full high availability. We use the HashiCorp official Helm chart to deploy it. We have our primary cluster in one DC, and we have performance secondary clusters on other DCs — and they all get DR replication. You cannot get more bulletproof architecture than that.
It’s important to know that it is not safe to replicate from a newer version of Vault to an older one. When upgrading replicated clusters, we need to ensure the upstream clusters are always on an older version than the downstream ones.
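A minimal sketch of how a pipeline could guard this rule before touching a cluster. The function names are hypothetical, and treating equal versions as safe (the steady state between upgrades) is an assumption of this sketch, not something stated in the talk.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.15.2' or '1.15.2+ent' into (1, 15, 2) so versions compare numerically."""
    core = version.split("+", 1)[0]  # drop a build suffix such as '+ent'
    return tuple(int(part) for part in core.split("."))


def replication_is_safe(upstream: str, downstream: str) -> bool:
    """True when the upstream cluster is not newer than the downstream cluster."""
    return parse_version(upstream) <= parse_version(downstream)
```

A pipeline step could call `replication_is_safe(primary_version, secondary_version)` and refuse to continue if it returns `False`.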
Because of that, we will start first with the DR secondary clusters, move on to the performance secondary ones, and finally, on the last step, we will upgrade the primary cluster. Also, important to mention: all of our Vault clusters are deployed and running on top of Kubernetes.
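The ordering described here can be sketched as a simple sort by replication role. The role strings and cluster names below are illustrative assumptions, not Wix's actual naming.

```python
# Lower number = upgraded earlier. Role names are hypothetical labels.
ROLE_ORDER = {"dr_secondary": 0, "performance_secondary": 1, "primary": 2}


def upgrade_order(clusters: dict[str, str]) -> list[str]:
    """Given cluster name -> role, return names in the order they should be upgraded."""
    return sorted(clusters, key=lambda name: (ROLE_ORDER[clusters[name]], name))


if __name__ == "__main__":
    example = {
        "us-primary": "primary",
        "eu-perf": "performance_secondary",
        "us-dr": "dr_secondary",
    }
    print(upgrade_order(example))
```

Sorting by name within a role only makes the plan deterministic; it does not imply any required order among clusters of the same role.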
We will start with the DR secondary clusters. It should be simple. All you need to do is to upgrade the Vault Helm chart and apply it. Well, by default, when you apply it, nothing happens. No new pods spin up; the existing pods don't get terminated. This is because the Helm chart configures the StatefulSet with the OnDelete update strategy.
OnDelete prevents the controller from automatically updating its pods. You will need to manually delete the existing pods, one by one. Only after that will a new pod with the new version spin up.
We will first need to upgrade all of the standby pods, one by one — and only at the last step will you need to upgrade the leader pod. HashiCorp has great documentation about it — and trust me, it is all there.
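As a rough illustration of the standby-first rule, the sketch below plans the order in which pods would be deleted. It assumes the caller already knows which pod is the leader (how that is discovered is outside this sketch), and the pod names are hypothetical.

```python
def pod_delete_order(pods: dict[str, bool]) -> list[str]:
    """Given pod name -> is_leader, return standbys first and the leader last."""
    leaders = [name for name, is_leader in pods.items() if is_leader]
    if len(leaders) != 1:
        # Refuse to plan if leadership is unclear; a human should look first.
        raise ValueError(f"expected exactly one leader, found {len(leaders)}")
    standbys = sorted(name for name, is_leader in pods.items() if not is_leader)
    return standbys + leaders
```

In a real pipeline each deletion would be followed by a wait until the replacement pod is ready and passes its functional check before the next pod is touched.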
We need to make sure the new pod is functioning right. You can ask at this point, isn't the built-in readiness endpoint of Vault good enough for functional testing? It will get you part of the way. You will know Vault is fully initialized, unsealed, and ready to receive connections. But what about consuming a secret, one you know exists? For that, we decided to configure a tiny sidecar container in our Vault pod.
Using some basic Helm configuration like an extra container, we decided to set up a tiny container. All it does is perform a simple GET API request, which consumes a secret we know exists. If we have a valid response, we are all good. If not, we will immediately roll back. For the DR secondary clusters, you can do the change and the synthetic test immediately. They are there for backup; they are not serving any client requests.
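A hedged sketch of what such a check might look like. It assumes the known secret lives in a KV version 2 engine (whose read response nests the values under `data.data`) and that a token is passed in the `X-Vault-Token` header; the URL, token, and key name are placeholders, and the talk does not say how Wix's sidecar is actually written.

```python
import json
import urllib.request
from typing import Callable


def http_get(url: str, token: str) -> tuple[int, bytes]:
    """Perform the GET request against Vault's HTTP API."""
    request = urllib.request.Request(url, headers={"X-Vault-Token": token})
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status, response.read()


def secret_is_readable(
    url: str,
    token: str,
    expected_key: str,
    fetch: Callable[[str, str], tuple[int, bytes]] = http_get,
) -> bool:
    """Return True only if the known secret comes back and contains expected_key."""
    try:
        status, body = fetch(url, token)
        payload = json.loads(body)
    except (OSError, ValueError):
        # Connection errors, HTTP errors, and invalid JSON all count as failure.
        return False
    if status != 200:
        return False
    data = payload.get("data") if isinstance(payload, dict) else None
    inner = data.get("data") if isinstance(data, dict) else None
    return isinstance(inner, dict) and expected_key in inner
```

The `fetch` parameter exists only so the check can be exercised without a live Vault; a pipeline would treat a `False` result as the signal to roll back.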
Now that we are done with upgrading the DR secondary clusters, we can tackle our performance secondary ones. Any Wix application on any DC can connect and consume secrets from any Vault cluster. We achieve that by shifting traffic based on DNS. For local optimization, every Wix DC is configured to connect to the nearest available Vault cluster.
We also have a DNS-based fallback to the next nearest Vault cluster in case of a failure. This DNS failover capability enables us to control traffic to our Vault cluster. So, now we gain two additional pipeline capabilities. We can remove traffic from each of our Vault clusters, and we can also return traffic to our Vault.
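The selection logic behind this kind of failover can be modeled in a few lines. This is only a conceptual model: the real mechanism is DNS, and the cluster names and the idea of an "in rotation" set are assumptions made for illustration.

```python
from typing import Optional


def pick_cluster(preference: list[str], in_rotation: set[str]) -> Optional[str]:
    """Return the nearest cluster that is currently receiving traffic, if any.

    preference lists clusters from nearest to farthest for one DC.
    in_rotation holds the clusters that have not had their traffic removed.
    """
    for cluster in preference:
        if cluster in in_rotation:
            return cluster
    return None
```

In this model, "removing traffic" from a cluster means taking it out of `in_rotation`, after which clients fall through to the next nearest cluster; "returning traffic" means adding it back.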
Combining Wix traffic capabilities alongside our Vault traffic capabilities, we can complete a pipeline for creating a safe and controlled change to our Vault. Let's go through the pipeline together.
First, we're going to remove the Vault traffic from the cluster we want to upgrade. We want to do it safely. The second part is to do the change and the synthetic testing on that cluster.
After that, we want to proceed to doing a full test. But to proceed to doing a full test, we first need to remove Wix production traffic from the DC. Then, we can safely return the Vault traffic to the cluster. After that, we can pre-warm the production DC, so all the applications will start up and start consuming their secrets. Only after that we will be able to return production traffic to the Wix DC.
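The sequence above can be sketched as an ordered list of steps that stops at the first failure. The step names and the runner are hypothetical; each action stands in for whatever automation actually performs that step, and what happens after a failure (rollback, alerting) is left to the caller.

```python
from typing import Callable

# Step names follow the order described in the talk; the identifiers are made up.
PERF_SECONDARY_STEPS = [
    "remove_vault_traffic",
    "upgrade_and_synthetic_test",
    "remove_wix_traffic_from_dc",
    "return_vault_traffic",
    "prewarm_dc_full_test",
    "return_wix_traffic_to_dc",
]


def run_pipeline(
    steps: list[str], actions: dict[str, Callable[[], bool]]
) -> tuple[bool, list[str]]:
    """Run steps in order; stop at the first one that reports failure.

    Returns (succeeded, completed_steps) so the caller can decide how to recover.
    """
    completed: list[str] = []
    for step in steps:
        if not actions[step]():
            return False, completed
        completed.append(step)
    return True, completed
```

Because production traffic is only returned in the final step, a failure anywhere earlier leaves users on the other DCs untouched.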
We have a pipeline for conducting changes to all of our Vault clusters except the performance primary one. The way Vault works, to maintain persistence, all write operations are routed to the performance primary cluster.
When you want to create a secret, update a secret, or delete one, all those actions go straight to the primary cluster. You can't control this part of the traffic. Based on the HashiCorp documentation, at this last stage, you will need to perform an in-place upgrade on the performance primary cluster.
The only choice we have left is to schedule a write maintenance window. We are collaborating with HashiCorp on finding an alternative way, something that can help us avoid that. Here is a full pipeline under the write maintenance window. Because we automated all these components and all this control — and built a complete full CD pipeline — the write maintenance window will be short and acceptable.
Thank God we did it. We have pipelines for all of our Vault clusters. But what if you want to do something else besides upgrading the Vault version? Let's say, for example, you want to migrate from Consul storage to Raft integrated storage. For that, you will have to utilize the blue-green deployment methodology.
The blue-green deployment process means spinning up a completely new entire cluster. And then, after doing the change on the new cluster, you're going to migrate all the traffic from the original cluster to the new one.
Spinning up a new cluster has its own complexity. You will need to automate the initialization part and also the replication setup. Follow us as we are working on this pipeline as well, and we'll release it soon.
After going through all the details of our pipeline, let's do a quick recap. First, we talked about Wix traffic control. We use the existing capabilities that we already have in the Wix ecosystem. We can remove Wix traffic from our production DC, we can do a full test, and we can return the Wix traffic.
For DR secondary clusters, we can do the change and synthetic test immediately. They are there for backup, as we mentioned. For performance secondary clusters, we have the full pipeline: remove the Vault traffic from the cluster, do the change, and remove the Wix traffic from the corresponding Wix DC. Then, we can safely return Vault traffic and conduct the full test, pre-warming the DC and consuming all of our secrets.
At the final stage, we will return traffic. And that is it: we conducted our change in production in a safe way. For the performance primary, as I mentioned, we put the whole pipeline under a write maintenance window.
Before I say goodbye, I will be available for questions, I think, at the booth, after this presentation. I can give an example of a question I received while dry-running it inside Wix. I was asked which tool we used to construct our CD pipeline. In our case, we use GitHub Actions, but you can go and use whatever workflow engine you prefer.
One last thing I was asked to say, Wix is opening a new site right here in Amsterdam. Great news, right? Thank you so much for listening to that talk.
Hopefully, it was interesting for the last talk. Thank you, guys.