Hear how Sky Betting & Gaming built its secrets management backup strategy to recover the unrecoverable, without breaking production Vault.
Sky Betting & Gaming has been using Vault for several years, so they have a mixture of older KVv1 and newer KVv2 secrets. In this talk, Lucy Davinhart explains how their architecture and backup strategy were built to recover the unrecoverable, without breaking production Vault. She covers how they would do it better if they ever needed to do it again, and offers advice so you can learn from their mistakes and hopefully avoid getting into these situations in the first place.
Hello, HashiConf. I'm going to tell you a fun story about something which happened to us a couple of months ago that we've not had to deal with before. We received a support request from one of our Vault users that looked a little bit like this: “Help. I've tried to add a password to Vault, and somehow it's ended up removing all the other passwords. Can you restore this for us, please?”
I was on support duty that day; my first reaction was that's fine. They've accidentally deleted something—that's not the end of the world. Vault has this helpful Key Value undelete feature. Let me remind myself how that works. At this point, I realized that the secret they wanted us to restore was Key Value Version 1. This is the older version of KV, which doesn't have a helpful undelete feature.
So my next reaction is best summed up as a combination of panic and oh dear. We've been using Vault for quite a while now. We've got a lot of secrets in it, and a lot of those are still using the old KV Version 1. As much as we've been evangelizing KV Version 2, we hadn't migrated everything before something like this happened. Initially, I let the user know: sorry, there isn't an undelete feature for this—this is KV Version 1—but depending on how bad it was going to be, we'd at least see what we could do.
So hello. My name is Lucy Davinhart. I'm a Senior Automation Engineer at Sky Betting and Gaming. I've been here for about four years—about as long as we've been using Vault. My team looks after various things—but pertinent to this talk, we manage our Vault clusters, handle all the tooling and integrations, and some supporting services around Vault. And we handle all of our internal users who need help and support using the product.
Writing to Vault and having it delete stuff instead sounds a bit odd. But I had a theory. If you've been using Vault for a while, you might already know what's going on here. Normal Vault behavior looks a bit like this. I'm using the UI here, and I have a secret that has the one key in it.
I'm going to add a new key to my secret. I'm going to save that, and we're all good. As an experienced Vault user, I know I've updated my secret. Rather than having the one key, my secret now has two. But if you're unfamiliar with Vault, you could be forgiven for not quite understanding what happened here. As a new user, you might think I'm in the secret/test directory, and I've created a new secret. While it's fine to hold that misconception whilst you're using the UI, if we switch to the CLI, suddenly we're going to have issues.
Let's see what secrets we have in this secret/test directory. We've got these two secrets, foo and bar. Let me add a new secret to this directory, and—oh—where have my secrets gone? You see, it's an easy mistake to make as a new Vault user if you don't know the difference between a key and a secret. Or if you don't realize that when you're writing a secret to Vault—if you're doing something like kv put—you still have to specify all of the other keys, even if you only want to edit one or add a new one.
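As a rough CLI sketch of that overwrite behavior (the path, key names, and values here are invented, and this assumes a KVv1 mount on a running Vault you're logged in to):

```shell
# A KVv1 secret with two keys (illustrative path and values).
vault kv put secret/test/foo username=app password=hunter2

# Trying to "add" a third key with another put...
vault kv put secret/test/foo api_key=abc123

# ...replaces the ENTIRE secret: username and password are now gone.
vault kv get secret/test/foo

# On KVv1, the correct invocation has to repeat every existing key:
vault kv put secret/test/foo username=app password=hunter2 api_key=abc123
```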
There's nothing wrong with what the CLI is doing here—this is correct behavior. You need to specify all the keys. Otherwise, deleting specific keys later could be messy. While we waited for confirmation from my user that this is indeed what had happened, we were able to confirm it for ourselves reasonably quickly by checking our audit logs—which in our case—we shipped to Elastic. Now, there weren't any deletions on this particular secret, but we can see that there's been an update.
This is a good example of one of the reasons you should be using Vault's audit logs. It's not just this feature you enable because your compliance tells you to. You can do all sorts of useful investigation and even alerting based on these logs.
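As a sketch of the kind of search this enables—the entries below are heavily simplified, invented examples (real Vault audit entries are much richer JSON, and we query ours through Elastic rather than grep):

```shell
# Simplified, invented audit log entries for illustration.
cat > audit.log <<'EOF'
{"type":"response","request":{"operation":"read","path":"secret/test/foo"}}
{"type":"response","request":{"operation":"update","path":"secret/test/foo"}}
{"type":"response","request":{"operation":"read","path":"secret/other"}}
EOF

# Was the secret deleted, or merely updated? Filter by path, show operations.
grep '"path":"secret/test/foo"' audit.log | grep -o '"operation":"[a-z]*"'
# Prints:
#   "operation":"read"
#   "operation":"update"
```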
Initially, nothing too bad happened as a result of this. Services that relied on this secret either had it cached or didn't need to access it that often. Then a few non-critical services started failing. The keys they were expecting to exist simply didn't anymore. And that on its own was going to be a problem. But worse, if we didn't fix this issue soon, business-critical services would start failing that evening when they were scheduled to run—and that would not have been fun.
What are we going to do about it? Now, I'd initially told the user, sorry, Key Value Version 1—there's not a lot we can do about this. But if business-critical stuff is going to break, we at least need to consider all of our options.
The first—probably the most obvious—answer is: can they recreate the secret from scratch? Could they regenerate all those keys and then write them back to Vault?
Yes, they could do that, but they wouldn't have been able to do it fast enough to prevent things from breaking. This particular secret had quite a few keys in it—upwards of 50—and it was used in quite a lot of places. And it turns out the only central record of what keys the secret contained was in Vault itself. Regenerating all those individual keys was always going to be time-consuming, but not knowing what keys there were to begin with meant it would take far too long.
Can we reset Vault to an earlier state? We've got backups, right? That should be possible. For some context, each of our Vault clusters is set up similarly to the HashiCorp reference architecture. We've got some Vault servers, and we're using Consul as our storage backend. We also have Enterprise replication set up in both DR and performance mode, replicating to our Vault clusters in other sites. But that replication can't help us here because—ironically—it's too good. The update to this particular secret will have propagated to those other sites almost immediately.
If you read through the reference architecture page—which you should, by the way; there's a lot of good stuff in there, and some of it we're not even doing yet—you'll find this bit: to protect against corruption of data, you should back up your Vault storage backend. This is supported through the Consul snapshot feature, which you can automate for regular archival backups.
That's what we've been doing. In our case, we're running these snapshots hourly on one of our Consul servers. Then we're storing several days worth of these snapshots on an NFS mount. We're going a step further, and we're shipping all of those snapshots to a disaster recovery site. Then we also have regular snapshots of the NFS mount itself at the storage layer. We're good for backups—that's not a problem!
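Boiled down, the whole arrangement is something like this hourly cron script—every path, retention window, and host name here is invented for illustration, and it needs a reachable Consul agent:

```shell
#!/bin/sh
# Hourly Consul snapshot with simple retention — an illustrative sketch.
set -eu

BACKUP_DIR=/mnt/nfs/consul-snapshots          # NFS mount (made-up path)
DR_TARGET=dr-site:/srv/consul-snapshots       # DR site (made-up host)

# Take the snapshot on one of the Consul servers.
consul snapshot save "$BACKUP_DIR/vault-$(date +%Y%m%dT%H%M).snap"

# Keep several days' worth locally; prune anything older.
find "$BACKUP_DIR" -name '*.snap' -mtime +7 -delete

# Ship everything to the disaster recovery site as well.
rsync -a "$BACKUP_DIR/" "$DR_TARGET/"
```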
Can we take one of these earlier snapshots and restore Vault with it? Technically yes, but we didn't want to for a couple of reasons. Firstly, we'd lose everything that had been written to Vault since the snapshot had taken place—and we didn't know how much that would be or, more importantly, how much of that was important. This was the middle of a workday, and we get approximately 1,000 updates to our Vault state every minute. Probably quite a few of those in the middle of a workday are going to be important.
There would also be some downtime. To restore a snapshot, we'd first need to seal Vault, then restore the Consul snapshot, then manually unseal using our Shamir unseal shards—downtime is inevitable. It would also mean replication to our DR and performance sites would break; we'd have to set that up again from scratch.
Restoring Vault to an earlier state is definitely an option. But the impact of losing these particular keys would have to be bad to justify us doing that. We didn't think it was.
We needed the ability to travel back in time to before the secret was updated, look at what Vault was like at the time, read those secrets, and then bring that knowledge back to the future. Obviously, we don't have a time machine, but we have these snapshots—and they contain the earlier version of the secret.
Instead of restoring those to Production Vault, what if we restore one of those snapshots to an entirely new cluster? In our case, our Vault clusters run on VMs. I manage those with Chef, and it's all on-prem—it's not like we're having to build this from scratch by hand. We also don't need a full cluster; we can probably make do with a single Vault and a single Consul. But even so, spinning that up would require building new VMs, configuring Chef, possibly setting up some firewall rules. It's not quite a simple copy-paste from Production. We did build out a new cluster fairly recently, and getting everything set up there took us a couple of days.
Fortunately though, we do have a much quicker way of spinning up a cluster that vaguely resembles Production—and that's that we have a Docker Compose stack. We put this together to test newer versions of the Vault Open Source and Enterprise binaries in a semi-representative set of local clusters. We found it particularly useful when raising support tickets. We can say, "Here, if you run this thing, you should be able to replicate the issue that we're seeing—sometimes."
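To give a flavor, a deliberately minimal docker-compose sketch in the same spirit might look like this—a single Consul and a single Vault, with image tags, ports, and config all picked arbitrarily for illustration, not a copy of our actual stack:

```yaml
# Minimal single-Vault, single-Consul stack — illustrative only.
services:
  consul:
    image: hashicorp/consul:1.8.0
    command: agent -server -bootstrap-expect=1 -client=0.0.0.0
  vault:
    image: vault:1.4.0        # swap in the Enterprise binary when needed
    command: server
    environment:
      VAULT_LOCAL_CONFIG: >-
        {"storage": {"consul": {"address": "consul:8500"}},
         "listener": {"tcp": {"address": "0.0.0.0:8200", "tls_disable": true}}}
    cap_add:
      - IPC_LOCK
    ports:
      - "8200:8200"
    depends_on:
      - consul
```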
We can spin up a cluster. Now we need to figure out where we're going to run it. Spinning up a new cluster on our Dev laptops is one thing, but we don't want to put Production data onto our local machines. And similarly, we don't want to put it in test.
Even if it's encrypted, it's still Production data, so we needed somewhere in our Production network to spin this up. As it happens, we do run some supporting services for Vault. We run those in the same network as the Vault cluster—and all of that gets run in Docker. That was a good candidate for somewhere we could run it. This would mean the users whose secret needed restoring wouldn't have access to this new temporary Vault, but we could work around that.
Let's see that in action. I've got a local version of Vault running here, which I'm using to re-enact what happened when we did this for real. As you can see, it's so brand new I've not even bothered to initialize it yet.
Taking a look at the Consul logs, I'm going to restore one of these snapshots. I have one Consul snapshot here; it's from quite a while ago. If I do a consul snapshot restore on that, we should see from the Consul logs that it has indeed restored successfully.
Let's switch over to Vault. I've got Vault running here with the open source binary, because the secret that we care about is in the root namespace. We try to avoid using the Enterprise binary if we don't need to—and we figured we probably don't need it for this; it's fine. Let's unseal this. Two of my teammates have already provided their unseal shards; I'll provide mine with the vault operator unseal command.
No, that doesn't look good. When we were doing this for real, this was the point where we worried that this whole plan wasn't going to work. Maybe you can't take a snapshot from one Vault cluster and restore it to an entirely different one. Looking more closely at these logs, the error we're seeing is that it can't cope with namespaces in the mount table—which makes sense; that's an Enterprise feature, and we were trying to get away with using an open source binary.
Oh well—guess we do need an Enterprise binary after all. Let's try this again. As you can see, this time I'm running with the Enterprise version. I'll provide an unseal shard—and that is looking a lot better.
First we noticed that a bunch of leases have been revoked. In this case, these are the tokens. When we did this in Production, we saw hundreds—possibly thousands of these. These are all of the leases that have expired between the snapshot being taken and us restoring it. I scrolled up, and we can see that it's also successfully mounted all of the Auth and secret backends in all of the namespaces.
But has it worked? Well, I certainly seem to be able to log in—and there's a secret here which definitely wasn't there before. If you remember, this was an empty, uninitialized Vault cluster earlier. Unfortunately, it wasn't quite this simple when we did it for real. You see, the secret that we wanted to restore wasn't one of ours—it was some other team's secret, so we didn't have permission to view it.
Even as Vault operators, we give ourselves access to the bare minimum that we need day-to-day—that means we can't access all the secrets. It also means we can't modify our own permissions in our policies to give ourselves access to this. All changes to policies, Auth methods, and so on have to go through Terraform. And we couldn't point Terraform at this temporary Vault for the same reason the users who can access the secret couldn't access it either: time pressure—we didn't have enough time for that.
So to work around this, we decided we were going to generate a root token. If you've not generated one of these before: you essentially provide Vault with some of your Shamir unseal shards, and you get back an all-powerful root token with which you can do whatever you like.
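The flow itself uses standard vault CLI commands and looks roughly like this (the encoded token, OTP, and token value are placeholders):

```shell
# Start a root token generation attempt; Vault returns a one-time password.
vault operator generate-root -init

# Each key holder contributes their unseal shard (repeat until quorum).
vault operator generate-root
# ...prompts for an unseal key shard; once enough shards have been
# provided, the response includes an encoded token.

# Decode the encoded token with the OTP from the -init step (placeholders).
vault operator generate-root -decode=ENCODED_TOKEN -otp=OTP_VALUE

# When finished, revoke the root token immediately (placeholder token).
VAULT_TOKEN=s.xxxxxxxx vault token revoke -self
```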
To make that as safe as possible, we decided we were going to have a second person watching at all times from just before we generated this root token until right after the cluster itself has been destroyed. We were only interacting with a temporary Vault cluster—we weren't technically touching Production, but it still contains production data. And the root token will give us the ability to read everything. We could potentially read a secret we didn't need to or even generate some dynamic credentials. We trust each other on the team, but we shouldn't have to rely on trust.
We've done this now once before. It worked, and we documented it in case we need to do it again. But there were a couple of things that we would do differently next time. Mostly, that comes down to trying to avoid using a root token. There are several ways we could go about that that we've considered.
Firstly, we'd like to make it possible for the users whose secret needs restoring to access this temporary cluster, so they could get the secret out instead of us. We couldn't do that this time because we didn't have enough time, but one of the first things we tried after restoring a Consul snapshot was logging in with our own user accounts—and that worked. If we had a temporary cluster set up in advance with all the access control, we'd rather do that.
We could also theoretically avoid using the root token by making our admin policy a bit more permissive, giving us access to read all secrets, for example. But like the rest of our Vault users, we only grant ourselves access to whatever we need day-to-day—and day-to-day, we do not need access to all the secrets. On balance, we would much prefer to keep it that way—great power, great responsibility—and all that.
But there are ways we could elevate our permissions if we needed to. Giving ourselves permission to modify policies by hand wouldn't be ideal—if we could do that, we could do anything, and we may as well be using root tokens. But we could make use of an Enterprise feature called Control Groups. This would allow us to give ourselves temporary access to secrets we wouldn't normally have access to—just so long as a few people on the team have provided their approval.
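In policy terms, a Control Group looks something like this (Vault Enterprise only; the path, factor name, and identity group here are made up for illustration):

```hcl
# Read access to these secrets is only released once two members of the
# (made-up) vault-operators identity group approve the request.
path "secret/*" {
  capabilities = ["read"]

  control_group = {
    factor "team_approval" {
      identity {
        group_names = ["vault-operators"]
        approvals   = 2
      }
    }
  }
}
```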
Also, we didn't consider network access. Inbound traffic to this Vault would have been blocked by default, because this VM doesn't normally run Vault. We weren't too worried about that—but what about outbound traffic, say to a cloud provider like AWS? This is something we didn't check, and it could potentially have caused us issues with dynamic credentials.
Let's say—for example—that someone has requested a short-lived IAM user from just before our Consul snapshot had taken place, and then they were renewing that lease with Vault. That IAM user could still exist hours or even days later.
But if our temporary cluster had access to AWS—because it was restored from this earlier snapshot, as far as it's concerned, nobody has been renewing that lease. When that lease expires, it will attempt to delete that IAM user, and then those credentials are no longer valid. We didn't even consider that this could be an issue until quite a while after we'd done this. Thankfully, we didn't have any issues like this as far as we're aware—but it's definitely something we need to pay attention to next time.
This was the first time we'd done this. It was an untested process; we were making it up as we went along. But now we've done it once, we know how it works, and we know the mistakes we made along the way. We can refer to that if we ever need to do it again, and it should go a lot more smoothly.
Even though this is now a process that we've done before and documented, we'd rather never have to do it again, please. There are a couple of things that we can do—and you can do—to hopefully never find yourself in this situation in the first place.
Firstly—and I've hinted at it already—Key Value Version 2. I still think of this as a relatively new feature, but it's been around in Vault since version 0.10—that's over two years now. It has a number of useful features, and I'm not going to go into them all here, but two main ones. Firstly, secrets are now versioned: you can update your secret, and the previous version still exists in Vault if you need to refer back to it or roll back. Secondly, there's now a concept of a soft delete versus a hard delete, or destroy—a secret, or a version of a secret, that has merely been deleted can be restored if necessary.
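With a KVv2 mount, the recovery we wished we'd had is built into the CLI (illustrative path and version numbers):

```shell
# A soft-deleted version of a KVv2 secret can simply be undeleted...
vault kv undelete -versions=2 secret/test/foo

# ...or the secret can be rolled back to an earlier version entirely,
# which writes that older data back as a new version.
vault kv rollback -version=1 secret/test/foo
```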
In this case, the issue was that the secret was huge—it had 50+ keys in it. If you're storing passwords for a bunch of related service accounts, it makes sense to keep them in roughly the same place in Vault. But if it's unlikely a single Vault client is going to need to access many of those at once, do they need to be in the same secret?
With smaller secrets, you prevent issues like what we demonstrated today—but you can also be more granular in your policies. Maybe you don't want to give out the admin service account password to anybody that just happens to be able to read the rest of them.
Also, one of the reasons our users couldn't just recreate their secret by hand is that, in addition to having a lot of keys, they didn't have a central record outside of Vault of what keys they had. If you don't know what secrets or keys you have, where they're used, and how you're generating them, you're not making it easy for yourself. We don't generally consider the existence of a particular secret to itself be a secret—you could even link directly to them in your documentation.
It's also worth taking a look at what your policies grant access to. You should already be limiting access to who can read and write to secrets, but Vault's permission model is more granular than that. For example, you could allow most people to create new secrets, but then only allow a smaller group of people to update existing ones. This would naturally discourage things like having larger secrets with lots of unrelated keys that build up over time.
Similarly, you can grant delete permissions to only a smaller set of people. Personally, I think separating update and create like this is a bit overkill for most situations, and it will certainly make your secret rotation a bit more awkward if fewer people can do it—but there are use cases where it makes sense. My general point is that it's worth considering what your policies grant access to and figuring out a comfortable middle ground between the absolute bare minimum and the flexibility you need day-to-day.
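As an HCL sketch of that idea—the path and the two-policy split are invented for illustration:

```hcl
# most-users.hcl — may create brand-new secrets, read, and list,
# but NOT overwrite ones that already exist ("create" only applies
# when the path does not exist yet).
path "secret/team-a/*" {
  capabilities = ["create", "read", "list"]
}

# rotators.hcl — the smaller group that may also update and delete.
path "secret/team-a/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
```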
The next thing is considering your tools. If you're using the CLI to edit your secret, then even if all you want to do is edit one key, with a kv put you still need to specify the current value of all the other keys in that secret. Compare this to the UI: in the UI, you can click, edit the key you care about, and then save.
I know as techies, we feel like gods when we're using command-line tools, especially if we get the command right the first time without looking at the documentation—but the Vault UI is good, and you all should be using it if you can.
For the cases where you need to use the CLI, if you have a big secret that you need to edit, you can use the CLI's JSON mode. You could read a secret in JSON format, save it to a file, edit that and then write it back to Vault. That way, you don't need to worry about copy-pasting all of those keys.
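That workflow looks roughly like this for a KVv1 secret (illustrative path; jq here is just one convenient way to keep only the key/value data, and don't forget to clean the file up afterwards):

```shell
# Read the whole secret as JSON and keep just the key/value data.
vault kv get -format=json secret/test/foo | jq '.data' > secret.json

# Edit the one key you care about.
$EDITOR secret.json

# Write the full set of keys back in one go; @file reads JSON from disk.
vault kv put secret/test/foo @secret.json

# Don't leave secrets lying around on disk.
rm -f secret.json
```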
Or if you're using KV Version 2, there's a handy kv patch feature that will handle all of that for you—it's magic! With kv patch, you specify only the keys you care about and their values, and the CLI will automatically merge that in with the existing content of the secret. That way, you never have to keep the secret in a file on disk.
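So the same one-key edit collapses to a single command on a KVv2 mount (illustrative path and values):

```shell
# KVv2 only: the CLI does the read-modify-write for you;
# every other key in the secret is preserved.
vault kv patch secret/test/foo api_key=new-value
```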
We can have all the technical solutions we like, but maybe this is more an issue of education. Quite often, we find that our users are confused by the difference between keys and secrets. That usually results in a pull request for a policy that grants access to a path like secret/foo/*. Such a policy would grant access to all the secrets that are prefixed secret/foo, but what the user really wanted was access to all of the keys in the single secret secret/foo—and such a policy wouldn't do that. This doesn't happen quite as often anymore, but it still happens every now and then. Besides encouraging wider use of things like KVv2, maybe the larger issue is the need to better educate our users. Our users will see our documentation before they ever go on to read the HashiCorp docs—our docs need to be good, they need to be up-to-date—and where appropriate, we need to signpost to the HashiCorp docs.
It's also worth spending some time with your users—learning what they do, finding out how they're making use of Vault. As Vault operators, we know the product well. But watching other users, you might be surprised how many issues or frustrations you didn't know about or how many things you've taken for granted.
So before I wrap up, I'd like to take a moment to mention some other storage backends that Vault has. Thinking back to the reference architecture, it was using Consul as the storage backend, and it's pointing in the direction of Consul Snapshot as your backup tool.
But just because the reference architecture uses Consul doesn't mean you're out of luck if you're using something else. Many of the available storage backends will have their own snapshotting tools—even if that means doing a disk-level snapshot. But I wanted to call out the new raft storage backend specifically, otherwise known as Vault Integrated Storage. That's now out of beta as of Vault 1.4.
With raft, you can save a snapshot similarly to how you might do it in Consul. But unlike with Consul, when you restore a snapshot, Vault knows that it's happening. That means—at least from my testing—when you restore it to the same cluster the snapshot came from, Vault never gets sealed: there's no downtime to Vault while the snapshot is restored. All sorts of things become possible when Vault knows a snapshot is being restored.
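The raft snapshot operations live in Vault's own CLI (standard commands; they need an unsealed raft-backed Vault and a suitably privileged token, and the snapshot filename here is arbitrary):

```shell
# Save a snapshot of the raft (Integrated Storage) backend.
vault operator raft snapshot save backup.snap

# Restore it later — Vault participates in the restore itself,
# so there's no manual seal/unseal dance like the Consul flow above.
vault operator raft snapshot restore backup.snap
```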
So if you gain nothing else from this talk, I want you to take away a couple of things. Firstly, help your users learn how Vault works—be that through your internal docs, any training courses you run, or just helping people out. User error is inevitable, but you can try to minimize the likelihood of it happening.
But because it's inevitable, make sure you have backups—that way, you're covered in case you need to roll back or extract a single secret from an earlier version. Make sure you're also testing your backups and give KVv2 a try if you're not using it already. Yes, the API for it isn't quite as simple as KVv1, but many of the useful features that it brings more than make up for that.
The link to my slides is on-screen now. If anything from this talk has been useful, I'm Lucy Davinhart, and my Twitter (and Sky Betting & Gaming's Twitter) is on screen. If not, I'm somebody else, and you can pretend you've not seen me. Thank you all for listening, and I hope you enjoy the rest of the conference.