Running Vault on Nomad, Part 3

Learn how to automate the unsealing and snapshotting of HashiCorp Vault using HashiCorp Nomad and Vault Unsealer.

Rob Barnes

May 14, 2024

This post is the third and final part of a series on how to deploy the underlying HashiCorp Nomad infrastructure and configuration to run HashiCorp Vault as a Nomad job. Part 1 focused on deploying the infrastructure to run Vault on Nomad, while part 2 took a deep dive into running Vault as a Nomad job. This installment looks at automating operational tasks for Vault using Nomad jobs.

Specifically, it covers how to automate:

Unsealing Vault
Taking snapshots of Vault

»What is unsealing Vault?

As a secrets management platform, Vault stores sensitive and often mission-critical data in its storage backend. This data is encrypted using an encryption key, which is required to decrypt the data in the storage backend. Vault stores this key with the encrypted data and further encrypts it using another key known as the root key.

Unsealing is the process of decrypting the encryption key using the root key and then decrypting the data using the encryption key. Until this process has been completed, you can perform only two operations on the Vault server:

Checking the seal status of Vault
Unsealing Vault

The root key is normally split into a configurable number of shards, known as unseal keys, using Shamir's Secret Sharing algorithm, and these are distributed to engineers responsible for unsealing Vault. The unseal process consists of a pre-specified threshold of unseal keys being entered by the key holders.
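
To make the threshold idea concrete, here is a toy sketch of Shamir's Secret Sharing over a prime field. It is illustrative only and is not Vault's implementation, which differs in its details. The names `split_secret`, `combine_shares`, and `PRIME` are assumptions made for this example.

```python
import secrets
from typing import List, Tuple

# A Mersenne prime used as the field modulus (an assumption for this sketch).
PRIME = 2**127 - 1


def split_secret(secret: int, shares: int, threshold: int) -> List[Tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("threshold must be between 1 and shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule
            result = (result * x + c) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def combine_shares(points: List[Tuple[int, int]]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With five shares and a threshold of three, any three shares passed to `combine_shares` reconstruct the original value, while fewer than three do not carry enough information to do so.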

»Auto unsealing Vault

When a Vault server is started or restarted, it comes up in a sealed state, which means that only the two operations mentioned above can be performed on it. There are many reasons why Vault might need to be restarted, from OS patching to resource consumption issues. Whatever the reason, manually unsealing every server after each restart can add significant management overhead for the team operating Vault.

To address this, Vault’s auto-unseal feature delegates the responsibility of unsealing Vault to a service like a cloud key management service (KMS) or a device like a hardware security module (HSM). As the name suggests, using auto-unseal means that the Vault servers will be automatically unsealed when they are started or restarted.

»Vault Unsealer

There are many reasons why some organizations cannot use auto-unseal. Their security policies may not allow cloud services, or they may not want to pay the high procurement and operational costs of HSMs. Vault Unsealer is a proof-of-concept tool designed to automate the process of unsealing Vault using the unseal keys in situations like these.

»How Vault Unsealer works

Vault Unsealer checks the seal status of each Vault server in the cluster and unseals any servers reporting their status as sealed. Under the hood, Vault Unsealer uses the Vault API to perform these tasks.
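
The check-then-unseal pass described above can be sketched as follows. This is not the Vault Unsealer source code; it is a minimal illustration that assumes Vault's `sys/seal-status` and `sys/unseal` HTTP endpoints, and the function names (`get_seal_status`, `submit_unseal_key`, `probe_once`) are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Iterable, List


def get_seal_status(node: str) -> bool:
    """Return True if the Vault server at `node` reports itself as sealed."""
    with urllib.request.urlopen(f"{node}/v1/sys/seal-status", timeout=5) as resp:
        return bool(json.load(resp)["sealed"])


def submit_unseal_key(node: str, key: str) -> bool:
    """Submit one unseal key; return True if the server is still sealed afterwards."""
    req = urllib.request.Request(
        f"{node}/v1/sys/unseal",
        data=json.dumps({"key": key}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return bool(json.load(resp)["sealed"])


def probe_once(
    nodes: Iterable[str],
    unseal_keys: Iterable[str],
    is_sealed: Callable[[str], bool] = get_seal_status,
    submit: Callable[[str, str], bool] = submit_unseal_key,
) -> List[str]:
    """Check every node once and unseal the sealed ones.

    Returns the nodes that went from sealed to unsealed during this pass.
    """
    keys = list(unseal_keys)
    unsealed = []
    for node in nodes:
        if not is_sealed(node):
            continue  # already unsealed, nothing to do
        for key in keys:
            if not submit(node, key):  # threshold reached, server is unsealed
                unsealed.append(node)
                break
    return unsealed
```

A long-running process would call `probe_once` in a loop and sleep between passes. The status and submit functions are passed in as parameters so the logic can be exercised without a live Vault cluster; a real tool would also need to handle unreachable nodes.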

»Configuring Vault Unsealer

To use Vault Unsealer, you configure a JSON file that tells it which servers to manage, which unseal keys to use, how often to check the seal status, and which log level to write to stdout. Here is an example configuration file:

{
 "log_level": "debug",
 "probe_interval": 10,
 "nodes": [
   "http://192.168.1.141:8200",
   "http://192.168.1.142:8200",
   "http://192.168.1.143:8200"
 ],
 "unseal_keys": [
   "aa109356340az6f2916894c2e538f7450412056cea4c45b3dd4ae1f9c840befc1a",
   "4948bcfe36834c8e6861f8144672cb804610967c7afb0588cfd03217b4354a8c35",
   "7b5802f21b19s522444e2723a31cb07d5a3de60fbc37d21f918f998018b6e7ce8b"
 ]
}

NOTE: The unseal keys are sensitive pieces of data, so we recommend that the config file is rendered with the unseal keys’ values coming from an encrypted store that you trust.

»Deploying Vault Unsealer as a Nomad job

For this post, the code is located within the 2-nomad-configuration directory.

Writing a Nomad jobspec for Vault Unsealer is similar to the process in part 2 of the blog series, with some subtle differences because this job has fewer requirements than the Vault cluster. Here is the vault-unsealer.nomad file:

job "vault-unsealer" {
 namespace   = "vault-cluster"
 datacenters = ["dc1"]
 type        = "service"
 node_pool   = "vault-servers"
 
 group "vault-unsealer" {
   count = 1
 
   constraint {
     attribute = "${node.class}"
     value     = "vault-servers"
   }
 
   task "vault-unsealer" {
     driver = "docker"
 
     config {
       image      = "devopsrob/vault-unsealer:0.2"
 
       command = "./vault-unsealer"
       volumes = [
         "local/config:/app/config"
       ]
     }
 
     template {
       data = <<EOH
 
{
 "log_level": "debug",
 "probe_interval": 10,
 "nodes": [
{{- $nodes := nomadService "vault" }}
{{- range $i, $e := $nodes }}
   {{- if $i }},{{ end }}
   "http://{{ .Address }}:{{ .Port }}"
{{- end }}
 ],
 "unseal_keys": [
   {{- with nomadVar "nomad/jobs/vault-unsealer" }}
   "{{ .key1 }}"
   , "{{ .key2 }}"
   , "{{ .key3 }}"
   {{- end }}
 ]
}
EOH
 
       destination = "local/config/config.json"
     }
 
     resources {
       cpu    = 100
       memory = 512
 
     }
 
     affinity {
       attribute = "${meta.node_id}"
       value     = "${NOMAD_ALLOC_ID}"
       weight    = 100
     }
   }
 }
}

Key points to note about this jobspec include:

Vault Unsealer is deployed as a Docker job
It runs on the same node pool as the Vault servers
Only one instance is running
It renders the configuration file using Nomad's templating engine.
- The unseal keys are stored in Nomad variables as seen in part 2 of this blog series. The template renders the values in the configuration file.
- The list of Vault servers whose seal status it manages is populated from Nomad's built-in service registry.

Vault Unsealer can be deployed using Terraform, similar to how the Vault cluster was deployed. Here is the code used to deploy the job to Nomad:

resource "nomad_job" "vault-unsealer" {
 jobspec = file("vault-unsealer.nomad")
 depends_on = [
   nomad_namespace.vault,
   nomad_variable.unseal,
   nomad_job.vault
 ]
}

This Terraform code specifies some explicit dependencies, all of which were explained in part 2 of this blog series. This job ensures the Vault servers are all unsealed and ready to accept requests.

»Automated snapshots of Vault

The storage backend of the Vault cluster (Raft integrated storage) replicates its data across the Vault servers to create a highly available cluster. This improves availability and redundancy; however, it does not provide disaster recovery if the storage for all three Vault servers is lost irreparably.

This is where snapshots come into play. A snapshot is a point-in-time backup of Vault's data. In the event of a total loss, you can provision a new cluster and restore Vault from the snapshot. Vault provides an API endpoint to take snapshots.
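
As a rough illustration of calling the snapshot endpoint directly, the sketch below issues a GET request to `sys/storage/raft/snapshot` with a Vault token and streams the response to a file. The function names and the example address are assumptions for this sketch; the backup job later in this post uses the Vault CLI instead.

```python
import shutil
import urllib.request


def build_snapshot_request(vault_addr: str, token: str) -> urllib.request.Request:
    """Build the GET request for Vault's Raft snapshot endpoint."""
    return urllib.request.Request(
        f"{vault_addr.rstrip('/')}/v1/sys/storage/raft/snapshot",
        headers={"X-Vault-Token": token},
        method="GET",
    )


def save_snapshot(vault_addr: str, token: str, dest_path: str) -> None:
    """Stream the snapshot to `dest_path` without holding it all in memory."""
    req = build_snapshot_request(vault_addr, token)
    with urllib.request.urlopen(req, timeout=60) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
```

For example, `save_snapshot("http://127.0.0.1:8200", token, "/vault/file/backup.snap")` would write the snapshot to the given path, assuming the token carries a policy that can read this endpoint.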

Best practice dictates that snapshots should be stored away from the things they are there to protect. Part 1 of this blog series showed how to provision a Vault backup server to accommodate this best practice.

You can create a Nomad job to take regular backups of the Vault cluster. The first step is to create a Vault policy that allows the Nomad job to perform this task. This code snippet writes a policy to Vault named snapshot_policy:

resource "terracurl_request" "snapshot_policy" {
 method         = "POST"
 name           = "snapshot_policy"
 response_codes = [201, 204]
 url            = "http://${data.terraform_remote_state.tfc.outputs.nomad_clients_public_ips[0]}:8200/v1/sys/policy/snapshot_policy"
 
 headers = {
   X-Vault-Token = jsondecode(terracurl_request.init.response).root_token
 }
 
 request_body = <<EOF
{
 "policy": "path \"sys/storage/raft/snapshot\" {capabilities = [\"read\"]}"
}
EOF
 
 destroy_method = "DELETE"
 destroy_url = "http://${data.terraform_remote_state.tfc.outputs.nomad_clients_public_ips[0]}:8200/v1/sys/policy/snapshot_policy"
 
 destroy_headers = {
   X-Vault-Token = jsondecode(terracurl_request.init.response).root_token
 }
 
 destroy_response_codes = [200, 204]
 
 depends_on = [
   nomad_job.vault-unsealer
 ]
}

The contents of the policy file are written as the value of a JSON key/value pair, so the quotation marks have been escaped. Here is the resulting policy in Vault:

path "sys/storage/raft/snapshot" {capabilities = ["read"]}

Now that the policy has been written to Vault, the next step is to create a role under the JWT auth method that allows the Nomad job to take snapshots. A Vault role is a set of parameters that defines which actions specific entities are authorized to perform. The role specifies the claims required within a JWT and the Vault policies to assign to the resulting Vault token as part of the authentication process. In this case, if the JWT has the following claims, Vault will issue a token with snapshot_policy assigned to it:

nomad_job_id must be vault-backup: This will prevent other jobs from obtaining a Vault token via this role.
nomad_namespace must be vault-cluster: This will prevent a job running in the wrong namespace from obtaining a Vault token via this role.
nomad_task must be vault-backup: This will prevent any other tasks within the job group from obtaining a Vault token via this role.
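
Conceptually, the role acts as a filter over the claims in the JWT. The sketch below is a simplified illustration of that matching logic, not Vault's actual implementation, which supports more matching options. The function name `claims_satisfy_role` is hypothetical, and the audience value is taken from the role definition used in this post.

```python
from typing import Any, Dict

# Bound claims and audience as described for the snapshot role in this post.
BOUND_CLAIMS = {
    "nomad_job_id": "vault-backup",
    "nomad_namespace": "vault-cluster",
    "nomad_task": "vault-backup",
}
BOUND_AUDIENCE = "nomadproject.io"


def claims_satisfy_role(
    jwt_claims: Dict[str, Any],
    bound_claims: Dict[str, str] = BOUND_CLAIMS,
    bound_audience: str = BOUND_AUDIENCE,
) -> bool:
    """Return True only if the audience and every bound claim match exactly."""
    aud = jwt_claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if bound_audience not in audiences:
        return False
    return all(jwt_claims.get(name) == value for name, value in bound_claims.items())
```

A JWT issued to a different task in the same job, or to the same job name in another namespace, fails the check because every bound claim must match.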

This code snippet uses TerraCurl to create a snapshot role in Vault that will be used by the Nomad job:

resource "terracurl_request" "snapshot_role" {
 method = "POST"
 name   = "snapshot_role"
 
 response_codes = [
   204
 ]
 
 url = "http://${data.terraform_remote_state.tfc.outputs.nomad_clients_public_ips[0]}:8200/v1/auth/jwt/role/snapshot"
 
 headers = {
   X-Vault-Token = jsondecode(terracurl_request.init.response).root_token
 }
 
 request_body = <<EOF
{
 "bound_audiences": "nomadproject.io",
 "bound_claims": {
   "nomad_job_id": "vault-backup",
   "nomad_namespace": "vault-cluster",
   "nomad_task": "vault-backup"
   },
 "role_type": "jwt",
 "token_policies": "snapshot_policy",
 "user_claim": "sub"
}
EOF
 
 destroy_url    = "http://${data.terraform_remote_state.tfc.outputs.nomad_clients_public_ips[0]}:8200/v1/auth/jwt/role/snapshot"
 destroy_method = "DELETE"
 
 destroy_headers = {
   X-Vault-Token = jsondecode(terracurl_request.init.response).root_token
 }
 
 destroy_response_codes = [
   200,
   201,
   204,
 ]
 
 depends_on = [
   nomad_job.vault-unsealer,
   terracurl_request.snapshot_policy
 ]
 
}

The final piece of the puzzle is to write and deploy the vault-backup job. Here is the Nomad jobspec written for vault-backup:

job "vault-backup" {
 namespace   = "vault-cluster"
 datacenters = ["dc1"]
 type        = "batch"
 node_pool   = "vault-backup"
 
 periodic {
 
   crons = [
     "@daily"
   ]
 
   prohibit_overlap = true
 }
 
 group "vault-backup" {
   count = 1
 
   constraint {
     attribute = "${node.class}"
     value     = "vault-backup"
   }
 
   volume "vault_data" {
     type      = "host"
     source    = "vault_vol"
     read_only = false
   }
 
   task "vault-backup" {
     driver = "docker"
 
     volume_mount {
       volume      = "vault_data"
       destination = "/vault/file"
       read_only   = false
     }
 
     config {
       image   = "shipyardrun/tools"
       command = "./scripts/backup.sh"
       volumes = [
         "local/scripts:/scripts"
       ]
     }
 
     template {
       data = <<EOH
#!/usr/bin/env sh
 
{{- range nomadService "vault" }}
 vault_addr="http://{{ .Address }}:{{ .Port }}"
 {{- break }}
{{- end }}
 
# Authenticate to Vault using JWT
vault_token=$(vault write \
 -format json \
 -address $vault_addr \
 auth/jwt/login role="snapshot" jwt="${JWT}" | \
 jq -r '.auth.client_token')
 
vault login \
 -address=$vault_addr \
 $vault_token > /dev/null
 
# Find the cluster leader
leader_address=$(curl \
   ${vault_addr}/v1/sys/leader | \
   jq -r '.leader_address')
 
# Take snapshot
date=$(date -I)
 
vault operator raft snapshot save \
 -address $leader_address \
 "/vault/file/${date}.snap"
 
EOH
 
       destination = "local/scripts/backup.sh"
       change_mode = "noop"
       perms       = "777"
     }
 
     resources {
       cpu    = 100
       memory = 512
 
     }
 
     affinity {
       attribute = "${meta.node_id}"
       value     = "${NOMAD_ALLOC_ID}"
       weight    = 100
     }
 
     identity {
       env         = true
     }
 
     env {
       JWT = "${NOMAD_TOKEN}"
     }
   }
 }
}

Note these key points:

This is a periodic job, a batch job that runs on a predefined schedule. This job is scheduled to run daily at midnight.
The job will be run within the vault-backup node pool to ensure the backup is stored away from the Vault cluster.
Host volumes are used here to allow the job to store the snapshot.
This job uses the Docker task driver.
The container image is shipyardrun/tools, which is a community image containing the binaries for most HashiCorp tools as well as other useful packages, such as jq.
The identity block exposes the workload identity JWT to the task as the NOMAD_TOKEN environment variable.
The env block passes that token into the container as the JWT environment variable.
The template within the task renders a shell script that:
- Reads the workload identity environment variable and uses this to login and obtain a Vault token.
- Checks the cluster leader address.
- Takes a snapshot of the cluster via the cluster leader.
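
If the login step in the script fails, it can help to inspect which claims the workload identity JWT actually carries and compare them with the role's bound claims. The sketch below decodes the payload of a JWT without verifying its signature, so it is suitable for debugging only. The function name `decode_jwt_claims` is hypothetical.

```python
import base64
import json
from typing import Any, Dict


def decode_jwt_claims(token: str) -> Dict[str, Any]:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a token of the form header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))
```

Inside the task, the token could be read from the `JWT` environment variable and passed to this function to print claims such as `nomad_job_id` and `nomad_task`.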

This jobspec is then deployed using Terraform with this code snippet:

resource "nomad_job" "snapshot" {
 jobspec          = file("nomad-jobs/snapshot.nomad")
 purge_on_destroy = true
 
 depends_on = [
   terracurl_request.snapshot_role
 ]
}

This shows how to automate the process of taking snapshots of Vault using workload identity to authenticate to Vault. Any Nomad job that needs a secret from Vault can use a similar process for Vault-aware workloads. For example, if a job needs to use the transit secrets engine, it will need to make a call to Vault within the application code. In these cases, authentication via workload identity is a good pattern to implement.

»Summary

Part 1 of this series explored the infrastructure needed to deploy Nomad servers and clients, how to configure them for workload identity, how to spin up host volumes for stateful workloads, and how to configure the Docker plugin to permit the required Linux capabilities. It also covered how to enable and bootstrap the ACL system to secure the Nomad deployment.

Part 2 took a deep dive into jobspecs and constructing a job to run Vault, as well as the Nomad templating engine and built-in service registry. It also looked at the initialization process and how Vault's seal mechanism works.

This third and final installment of the blog series showed how to use Vault Unsealer to provide auto-unseal capabilities without external dependencies using Nomad variables. It also looked at the process of automating Vault backups using periodic jobs and how to authenticate to Vault using workload identity.

In summary, running Vault on Nomad has a lot of operational benefits that can reduce management overhead. This is a good approach for smaller organizations with limited resources. The alternative approach is to use the Vault integration within Nomad, which is helpful for authenticating to Vault and obtaining secrets for workloads without them being aware of Vault. This could be a better fit for organizations with dedicated teams managing Vault.
