How HashiCorp Vault Helps Minimize the Impact of User Error

Aug 28, 2020 | Kara Spinelli

Despite hours of usability testing and every effort to make a piece of software intuitive and easy to use, user errors still happen, and always will.

Research suggests that software user error costs companies nearly $2 billion a year in the U.S. alone. HashiCorp Vault users aren’t immune to small errors that could, if handled incorrectly, snowball into larger problems that ultimately impact business-critical applications.

Fortunately, Vault offers a number of features to help manage the risk of user error. At HashiConf Digital 2020, Vault customer Sky Betting & Gaming (SBG), a U.K.-based betting and gaming company, shared the costly user errors it has experienced in the past and showed how it is using some of Vault’s lesser-known capabilities to prevent errors in the future.

»“Oops, We Deleted All the Passwords in Vault!”

Midway through an otherwise normal day at SBG’s help desk, a message arrived: “Help! I tried to add a password to Vault, and it removed all the passwords!”

SBG has been using Vault for about four years, so it has a lot of secrets stored in Vault’s Key/Value (KV) secrets engine. The newer version, KV Version 2, comes with a useful undelete feature, but it isn’t backward compatible with Version 1.

“I realized the secret they wanted us to restore was KV Version 1, which doesn’t have an undelete feature,” says Lucy Davinhart, senior automation engineer at Sky Betting. “It was a matter of time before something like this happened.”
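
To picture what the undelete feature in KV Version 2 provides, and what a Version 1 secret lacks, here is a toy model in Python. It is a sketch only, not Vault's implementation: the `Version` and `VersionedSecret` classes and their method names are hypothetical, and the model assumes only what this article describes (earlier versions are kept, and a delete can be reversed).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Version:
    data: Dict[str, str]
    deleted: bool = False  # soft-delete marker; the data itself is kept


@dataclass
class VersionedSecret:
    """Toy stand-in for a versioned secret (hypothetical, not Vault's code)."""
    versions: List[Version] = field(default_factory=list)

    def put(self, data: Dict[str, str]) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.append(Version(dict(data)))
        return len(self.versions)

    def read(self, version: Optional[int] = None) -> Optional[Dict[str, str]]:
        """Return the requested version (default: latest), or None if unavailable."""
        if not self.versions:
            return None
        number = version if version is not None else len(self.versions)
        if not 1 <= number <= len(self.versions):
            return None
        entry = self.versions[number - 1]
        return None if entry.deleted else dict(entry.data)

    def soft_delete(self, version: int) -> None:
        """Hide a version without discarding its data."""
        self.versions[version - 1].deleted = True

    def undelete(self, version: int) -> None:
        """Make a soft-deleted version readable again."""
        self.versions[version - 1].deleted = False


secret = VersionedSecret()
v1 = secret.put({"db_password": "old-db", "api_key": "old-api"})
v2 = secret.put({"smtp_password": "new-smtp"})  # the accidental overwrite

previous = secret.read(v1)      # the earlier version is still readable
secret.soft_delete(v2)
while_deleted = secret.read()   # latest version is hidden while soft-deleted
secret.undelete(v2)
restored = secret.read()        # and readable again after undelete
```

In this model an overwrite or a delete is recoverable because nothing is thrown away. A store that keeps only the latest value, which is how this article describes KV Version 1, has nothing to fall back on.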

The initial restoration effort, which relied on a workaround, seemed to be working, but soon a few non-critical services failed. The team knew that leaving the issue uncorrected could cause business-critical services to fail, a potential catastrophe for the business.

»Initial Triage Attempts Didn’t Work

The Vault user interface (UI) lets users add and save new keys in a secret intuitively. But adding or editing a key through the command-line interface (CLI) requires respecifying every other key in the secret. Otherwise, the write replaces the whole secret and the omitted keys are lost.
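
The CLI behavior described above can be illustrated with a small Python sketch. It is a toy model, assuming that a write replaces the entire secret, as the incident suggests. Plain dictionaries stand in for the secret store, and `replace_write`, `merge_write`, the path, and the key names are all hypothetical; none of them are part of Vault's API.

```python
# Toy model only: plain dicts stand in for a KV Version 1 secret store.
# None of these names are part of Vault's API.

def replace_write(store, path, new_keys):
    """Model a write that replaces the whole secret at `path`."""
    store[path] = dict(new_keys)


def merge_write(store, path, new_keys):
    """Read the current secret, merge in the new keys, then write it back."""
    merged = dict(store.get(path, {}))
    merged.update(new_keys)
    store[path] = merged


store = {"secret/app": {"db_password": "old-db", "api_key": "old-api"}}

# Adding one key with a replacing write drops every key not respecified.
replace_write(store, "secret/app", {"smtp_password": "new-smtp"})
after_replace = dict(store["secret/app"])

# Reset, then add the same key with a read-merge-write instead.
store["secret/app"] = {"db_password": "old-db", "api_key": "old-api"}
merge_write(store, "secret/app", {"smtp_password": "new-smtp"})
after_merge = dict(store["secret/app"])

print(sorted(after_replace))  # only the new key survives
print(sorted(after_merge))    # all three keys are present
```

A read-merge-write like `merge_write` is not atomic, so two people editing the same secret at the same time could still overwrite each other. That is one more reason to prefer smaller secrets and a versioned store.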

SBG considered regenerating those keys from scratch, but they wouldn’t have been able to do that fast enough to prevent services from failing. They also looked at restoring Vault using backups, but found that it couldn’t work because SBG replicates to backup sites almost immediately, overwriting many of the previous versions.

After examining HashiCorp’s extensive reference architecture documentation more closely, they realized they’d been snapshotting Vault’s storage backend hourly, and that they could use those snapshots to restore Vault. But doing that would also require unacceptable downtime and maintenance.

»Other Tools Create Other Options

Not satisfied with taking their systems offline to restore a backup, the team devised a creative workaround using other HashiCorp tools. Specifically, they determined they could restore Vault to a new cluster from an earlier Consul snapshot. They first tested the idea in Docker, where the workaround worked as expected with Vault Enterprise and the deleted secret appeared in the new Vault cluster.

Despite the seemingly successful trial run, executing the backup plan in real life wasn’t simple. “Even as Vault operators, we didn’t have permission to view the secret we wanted to restore,” says Davinhart. “All policy changes go through Terraform, and we didn’t have time for that, so we generated a root token that allows the user to read all the secrets on the system and gives us a second pair of eyes on the process.”

»Lessons Learned and Best Practices for the Future

As is often the case when a novel software issue arises, the SBG team was essentially making up a process as they went along. Despite a few false starts, the team ultimately achieved its Vault backup objective and documented each step to create some best practices to follow if your team ever finds itself in a similar situation.

Use Vault’s audit logs

Audit logs let you investigate user errors, and you can even build alerts on top of them.

Use KV Version 2

In KV Version 2, you can update a secret while keeping the previous version in Vault. There’s also a soft delete feature, so you can restore a secret if necessary.

Use smaller secrets

The secret in SBG’s case had 50+ keys in it. Use smaller secrets to limit problems caused by dependencies between keys, and to allow more granularity in your policies.

Document your secrets

If you don’t consider the existence of a secret to be a secret itself, document it. Keep a record of each secret and the keys it contains so users can recreate a secret without issue.

Think through policy changes

Vault’s permission model is granular: you can, for example, allow users to create new secrets while restricting updates to a smaller group. It’s worth finding a middle ground between the bare minimum and the flexibility you need day to day.

Use Vault UI and KV patch

Developers like to use the CLI, but its process for editing keys is error-prone. The Vault UI is simpler: you click, edit a key, then save. On KV Version 2, the vault kv patch command updates only the keys you specify and leaves the rest untouched.

Educate the team

Spend time with users to help them learn how Vault works in more depth, through thorough internal documentation or direct training.
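
As one way to act on the audit-log recommendation above, the following Python sketch scans audit log lines for writes and deletes under a path prefix. It assumes a file-style audit device that emits one JSON object per line with `type`, `time`, `auth.display_name`, `request.operation`, and `request.path` fields; check those names against your own logs before relying on them. The function name, the sample entries, and the choice of "risky" operations are hypothetical.

```python
import json
from typing import Dict, Iterable, List

# Operations that change or remove data; treated as worth a second look here.
RISKY_OPERATIONS = {"update", "delete"}


def find_risky_requests(lines: Iterable[str], path_prefix: str) -> List[Dict[str, str]]:
    """Return request entries under `path_prefix` that modify or delete data."""
    findings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict) or entry.get("type") != "request":
            continue  # only look at request entries, not responses
        request = entry.get("request") or {}
        path = request.get("path", "")
        operation = request.get("operation", "")
        if path.startswith(path_prefix) and operation in RISKY_OPERATIONS:
            findings.append({
                "time": entry.get("time", ""),
                "user": (entry.get("auth") or {}).get("display_name", ""),
                "operation": operation,
                "path": path,
            })
    return findings


# Hypothetical sample entries, trimmed to the fields this sketch reads.
sample_log = [
    '{"time": "2020-08-28T10:00:00Z", "type": "request", "auth": {"display_name": "ldap-alice"}, "request": {"operation": "read", "path": "secret/app"}}',
    '{"time": "2020-08-28T10:05:00Z", "type": "request", "auth": {"display_name": "ldap-bob"}, "request": {"operation": "update", "path": "secret/app"}}',
    '{"time": "2020-08-28T10:05:00Z", "type": "response", "auth": {"display_name": "ldap-bob"}, "request": {"operation": "update", "path": "secret/app"}}',
    'not json',
]

alerts = find_risky_requests(sample_log, "secret/")
for alert in alerts:
    print(f'{alert["time"]} {alert["user"]} {alert["operation"]} {alert["path"]}')
```

In practice the findings would feed whatever alerting system you already use. The point of the sketch is only that each audit entry records who did what to which path, which is enough to reconstruct an incident like SBG's.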

»Hedging Your Bets

User error is inevitable. No matter how much training, onboarding, and user experience testing you offer, there’s virtually no way to avoid the occasional user error. But while you can’t eliminate user errors completely, you can reduce how often they happen and limit how severely they impact your business.

Naturally, having backups of your secrets is always a good place to start, so you’re at least covered in case you need to roll back or just extract a single secret from an earlier version. But it’s also imperative to test your backups regularly so you can identify potential issues and address them proactively. That includes testing backward compatibility between secrets engine versions, like KV Versions 1 and 2, so you won’t have to rely on helpful Vault features like undelete, even though they’re there in a pinch.

See Lucy’s full HashiConf Digital presentation or click here for more in-depth Vault information.

»Join HashiCorp in October

We invite you to join us at our next HashiConf Digital, October 12-15 (PDT). Attendance is free with registration. Real-time product workshops are also available for a nominal fee to reserve your seat. Register here.
