Despite hours of usability testing and every effort to make a piece of software intuitive and easy to use, user errors still — and will always — happen.
Research suggests that software user error costs companies nearly $2 billion a year in the U.S. alone. HashiCorp Vault users aren’t immune to small errors that could, if handled incorrectly, snowball into larger problems that ultimately affect business-critical applications.
Fortunately, Vault offers a number of features to help manage the risk of user error. At HashiConf Digital 2020, Vault customer Sky Betting & Gaming (SBG), a U.K.-based betting and gaming company, shared the story of costly user errors they’ve experienced in the past and showed how they’re using some of Vault’s lesser-known capabilities to prevent errors in the future.
Midway through an otherwise normal day at SBG’s help desk, a message arrived: “Help! I tried to add a password to Vault, and it removed all the passwords!”
SBG has been using Vault for about four years, so it has a lot of secrets stored in Vault’s Key/Value (KV) secrets engine. The newer version of that engine, KV Version 2, comes with a useful undelete feature, but it isn’t backward compatible with Version 1.
“I realized the secret they wanted us to restore was KV Version 1, which doesn’t have an undelete feature,” says Lucy Davinhart, senior automation engineer at Sky Betting. “It was a matter of time before something like this happened.”
Using a workaround, the initial restoration efforts seemed to be working, but soon a few non-critical services failed. The team knew that failing to correct the issue could cause business-critical services to fail as well, a potential catastrophe for the business.
The Vault user interface (UI) lets users add and save new keys in a secret intuitively. Adding or editing a key from the command-line interface (CLI), however, requires specifying every other key in the secret as well; any key left out of the write is dropped from the secret.
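The failure mode described above can be illustrated with a toy model. The sketch below is plain Python operating on dictionaries, not Vault’s API; the function names overwrite_secret and merge_into_secret and the sample key names are hypothetical, chosen only for illustration. It assumes that a write replaces the whole secret, which is consistent with the incident described here.

```python
from typing import Dict

Secret = Dict[str, str]


def overwrite_secret(existing: Secret, submitted: Secret) -> Secret:
    """Model a write that replaces the whole secret with the submitted keys."""
    # The existing keys are ignored entirely; only what was submitted survives.
    return dict(submitted)


def merge_into_secret(existing: Secret, submitted: Secret) -> Secret:
    """Model a read-modify-write: keep the existing keys, then apply the changes."""
    merged = dict(existing)
    merged.update(submitted)
    return merged


if __name__ == "__main__":
    # Hypothetical secret with two keys, and a user adding a third.
    stored = {"db_password": "old-db", "api_token": "old-api"}
    change = {"smtp_password": "new-smtp"}

    # Replacing write: the two original keys are gone.
    print(sorted(overwrite_secret(stored, change)))
    # Read-modify-write: the original keys are kept alongside the new one.
    print(sorted(merge_into_secret(stored, change)))
```

The point of the second function is that the caller must read the current secret first and send back the full set of keys, which is exactly the step that is easy to forget at the command line.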
SBG considered regenerating those keys from scratch, but they wouldn’t have been able to do that fast enough to prevent services from failing. They also looked at restoring Vault using backups, but found that it couldn’t work because SBG replicates to backup sites almost immediately, overwriting many of the previous versions.
After more closely examining HashiCorp’s extensive reference architecture documentation, they realized they’d been snapshotting Vault’s storage backend hourly, which they could use to restore Vault. Doing that, however, would also require unacceptable downtime and maintenance.
Not satisfied with taking their systems offline to restore a backup, the team conjured a creative workaround using other HashiCorp tools. Specifically, the team determined they could restore Vault to a new cluster from an earlier Consul snapshot. They first tested the idea in Docker, where the workaround in Vault Enterprise worked as expected: the deleted secret appeared in the new Vault cluster.
Despite the seemingly successful trial run, executing the backup plan in real life wasn’t simple. “Even as Vault operators, we didn’t have permission to view the secret we wanted to restore,” says Davinhart. “All policy changes go through Terraform, and we didn’t have time for that, so we generated a root token that allows the user to read all the secrets on the system and gives us a second pair of eyes on the process.”
As is often the case when a novel software issue arises, the SBG team was essentially making up a process as they went along. Despite a few false starts, the team ultimately achieved its Vault backup objective and documented each step to create some best practices to follow if your team ever finds itself in a similar situation.
Use Vault’s audit logs
Audit logs let you investigate user errors after the fact, and you can also build alerts on top of them.
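As a rough illustration of investigating an incident from audit logs, the sketch below scans JSON-lines log entries for writes and deletes under a given path. It assumes a file-style audit device that emits one JSON object per line with time, type, auth.display_name, request.operation, and request.path fields. The function name find_risky_requests, the set of operations treated as risky, and the sample entries are all hypothetical and heavily simplified.

```python
import json
from typing import Dict, Iterable, List

# Operations that change or remove data and may be worth alerting on (an assumption).
RISKY_OPERATIONS = {"create", "update", "delete"}


def find_risky_requests(lines: Iterable[str], path_prefix: str) -> List[Dict[str, str]]:
    """Return who wrote to or deleted something under path_prefix, and when."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if entry.get("type") != "request":
            continue  # only look at request entries so each call is counted once
        request = entry.get("request") or {}
        auth = entry.get("auth") or {}
        path = request.get("path", "")
        operation = request.get("operation", "")
        if operation in RISKY_OPERATIONS and path.startswith(path_prefix):
            hits.append({
                "time": entry.get("time", ""),
                "user": auth.get("display_name", ""),
                "operation": operation,
                "path": path,
            })
    return hits


if __name__ == "__main__":
    # Made-up sample entries; real audit entries contain many more fields.
    sample = [
        '{"time": "t1", "type": "request", "auth": {"display_name": "alice"}, '
        '"request": {"operation": "read", "path": "secret/team/app"}}',
        '{"time": "t2", "type": "request", "auth": {"display_name": "bob"}, '
        '"request": {"operation": "update", "path": "secret/team/app"}}',
    ]
    for hit in find_risky_requests(sample, "secret/team/"):
        print(hit)
```

The same filter could feed an alerting pipeline, for example by flagging any write to a path that normally changes rarely.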
Use KV Version 2
In KV Version 2, you can update a secret while keeping the previous version in Vault. There’s also a soft delete feature, so you can restore a secret if necessary.
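To make the versioning and soft-delete behavior concrete, here is a small in-memory model. It is not Vault’s implementation or API: the Version and VersionedSecret classes are hypothetical, and details of the real engine such as version limits and permanent destruction are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Version:
    data: Dict[str, str]
    deleted: bool = False  # soft-delete marker; the data itself is kept


@dataclass
class VersionedSecret:
    versions: List[Version] = field(default_factory=list)

    def put(self, data: Dict[str, str]) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.append(Version(dict(data)))
        return len(self.versions)

    def read(self, version: Optional[int] = None) -> Optional[Dict[str, str]]:
        """Return the requested (default: latest) version, or None if unavailable."""
        number = version if version is not None else len(self.versions)
        if not 1 <= number <= len(self.versions):
            return None
        entry = self.versions[number - 1]
        return None if entry.deleted else dict(entry.data)

    def delete(self, version: int) -> None:
        """Soft-delete: hide the version from reads without discarding its data."""
        self.versions[version - 1].deleted = True

    def undelete(self, version: int) -> None:
        """Make a soft-deleted version readable again."""
        self.versions[version - 1].deleted = False


if __name__ == "__main__":
    secret = VersionedSecret()
    secret.put({"db_password": "a", "api_token": "b"})  # version 1
    secret.put({"smtp_password": "c"})                  # version 2, an accidental overwrite
    print(secret.read(1))  # the earlier version is still readable
    secret.delete(2)
    print(secret.read())   # latest version is hidden after the soft delete
    secret.undelete(2)
    print(secret.read())   # and readable again after the undelete
```

In this model, an accidental overwrite like the one in SBG’s incident would be recoverable by reading the previous version and writing it back.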
Use smaller secrets
The secret in SBG’s case had 50+ keys in it. Use smaller secrets to prevent issues created by dependencies while also enabling more granularity in your policies.
Document your secrets
If you don’t consider the existence of a secret to be a secret itself, document it. Always keep a record of each secret and the keys it contains so users can recreate the secret without issue.
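One possible way to put such documentation to work is a manifest that lists key names only, never values, and a check that compares it with what a secret currently contains. The sketch below assumes this approach; the MANIFEST structure, the missing_keys function, and the example path and key names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical manifest: secret path -> key names expected to exist.
# It records names only, never secret values.
MANIFEST: Dict[str, Set[str]] = {
    "secret/team/app": {"db_password", "api_token", "smtp_password"},
}


def missing_keys(path: str, present_keys: Iterable[str],
                 manifest: Dict[str, Set[str]]) -> List[str]:
    """Return the documented keys that are absent from the secret, sorted by name."""
    expected = manifest.get(path, set())
    return sorted(expected - set(present_keys))


if __name__ == "__main__":
    # After an accidental overwrite, only one key is left in the secret.
    print(missing_keys("secret/team/app", ["smtp_password"], MANIFEST))
```

Run after each change, a check like this would reveal an accidental overwrite as a list of missing keys rather than as failing services.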
Think through policy changes
Vault’s permission model is granular: for example, it can allow users to create new secrets while restricting update capabilities to a smaller group. It’s worth finding a middle ground between the bare minimum and the flexibility you need day to day.
Use Vault UI and KV patch
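Developers like to use the CLI, but its process for editing keys is error-prone. The Vault UI is simpler: you just click, edit a key, then save. On KV Version 2, the CLI’s vault kv patch command is another option, since it updates only the keys you specify and leaves the rest of the secret in place.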
Educate the team
Spend time with users to help them learn more intimately how Vault works, either through thorough internal documentation or by direct training.
User error is inevitable. No matter how much training, onboarding, and user experience testing you offer, there’s virtually no way around experiencing the occasional user error. But while you can’t eliminate user errors completely, you can minimize the likelihood that they happen and mitigate how severely they affect your business.
Naturally, having backups of your secrets is always a good place to start, so you’re at least covered in case you need to roll back or just extract a single secret from an earlier version. But it’s also imperative to test your backups regularly to identify any potential issues you can address proactively. That includes testing backward compatibility between secrets engine versions, like KV Versions 1 and 2, so you won’t have to rely on helpful Vault features like undelete, even though they’re there in a pinch.
We invite you to join us at our next HashiConf Digital, October 12-15 (PDT timezone). Registration is free to attend. Real-time product workshops are also available, and will require a nominal fee to reserve your seat. Register here.