Consul Auto-Join with Cloud Metadata

We work in a world of distributed systems that operate in rapidly changing environments. Servers come and go, they move across regions and distribution groups, and somehow they need to communicate and connect to one another. To solve this problem, HashiCorp created Consul, which among many other things enables service registry and service discovery. Application instances register themselves with Consul, and dependent instances query Consul to discover each other. Since Consul itself is a distributed system, this creates a chicken-and-egg problem: how do you bootstrap your service discovery?

Nic Jackson

Consul

Mar 28, 2017

»Automation Challenges

How do you discover your service discovery? Traditionally this has been a challenge for distributed systems. The technique often involves spinning up a cluster in one operation and then performing a second operation once the IP addresses are known to join the nodes together. This two-step approach not only makes automation challenging, but also raises questions about the behavior of the system when losing a node. Autoscaling could bring another node online, but an operator would still need to manually join the node to the cluster.

»Consul Auto-Join for EC2

Consul 0.7.1 introduced new functionality that allows an agent to discover other agents using cloud metadata. This blog post explores leveraging AWS metadata to auto-join and auto-scale a Consul cluster.

The latest documentation for Consul shows new options we can specify in the Consul configuration file or startup parameters.

-retry-join-ec2-tag-key - The Amazon EC2 instance tag key to filter on. When used with -retry-join-ec2-tag-value, Consul will attempt to join EC2 instances with the given tag key and value on startup.
-retry-join-ec2-tag-value - The Amazon EC2 instance tag value to filter on.
-retry-join-ec2-region - (Optional) The Amazon EC2 region to use. If not specified, Consul will use the local instance's EC2 metadata endpoint to discover the region.
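
As a rough illustration of how these options fit together, the Python sketch below assembles a consul agent command line from the three flags listed above. The helper name build_agent_args and the tag key and value used here (consul_join, blog-cluster) are hypothetical, and a real agent needs further flags (such as a data directory) that are left out.

```python
import shlex


def build_agent_args(tag_key, tag_value, region=None):
    """Return an argv list for `consul agent` using EC2 tag auto-join.

    Only the three auto-join flags described above are included; every
    other agent flag is intentionally omitted from this sketch.
    """
    if not tag_key or not tag_value:
        raise ValueError("both the tag key and the tag value are required")
    args = [
        "consul",
        "agent",
        f"-retry-join-ec2-tag-key={tag_key}",
        f"-retry-join-ec2-tag-value={tag_value}",
    ]
    # The region is optional: when omitted, Consul looks it up from the
    # local instance's EC2 metadata endpoint.
    if region:
        args.append(f"-retry-join-ec2-region={region}")
    return args


if __name__ == "__main__":
    # Hypothetical tag key/value; substitute the tag your instances carry.
    print(shlex.join(build_agent_args("consul_join", "blog-cluster", "eu-west-1")))
```

Leaving out the region argument drops the last flag entirely, which matches the optional behavior described above.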

The new feature requires permission to read the AWS instance state, and there are a variety of options available to grant these permissions.

Static credentials (from the config file)
Environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)
Shared credentials file (~/.aws/credentials or the path specified by AWS_SHARED_CREDENTIALS_FILE)
ECS task role metadata (container-specific)
EC2 instance role metadata

The startup process for the AWS instance is as follows:

The instance bootstraps and installs Consul
The init system starts Consul with the configuration to join via EC2 metadata
On start, Consul queries the EC2 API with ec2:DescribeInstances to list instances and their tags
Consul extracts the private IP addresses of the other EC2 instances that have the configured tag key and tag value from the response
Consul runs consul join on those private IP addresses
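
The tag-matching step can be sketched in a few lines of Python. This is only an illustration of the idea, not Consul's actual implementation (Consul is written in Go). It assumes a dictionary shaped like an ec2:DescribeInstances response; the function name, the tag key and value, and the third sample instance are hypothetical.

```python
def private_ips_with_tag(response, tag_key, tag_value):
    """Collect private IPs of instances carrying the given tag key/value.

    `response` is assumed to follow the DescribeInstances layout:
    Reservations -> Instances -> Tags / PrivateIpAddress.
    """
    ips = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            ip = instance.get("PrivateIpAddress")
            # Skip instances without a matching tag or without a private IP.
            if tags.get(tag_key) == tag_value and ip:
                ips.append(ip)
    return ips


# Hand-written sample data for illustration; not real API output.
SAMPLE_RESPONSE = {
    "Reservations": [
        {"Instances": [
            {"PrivateIpAddress": "10.1.1.241",
             "Tags": [{"Key": "consul_join", "Value": "blog-cluster"}]},
            {"PrivateIpAddress": "10.1.2.24",
             "Tags": [{"Key": "consul_join", "Value": "blog-cluster"}]},
        ]},
        {"Instances": [
            {"PrivateIpAddress": "10.1.9.9",
             "Tags": [{"Key": "consul_join", "Value": "another-cluster"}]},
        ]},
    ]
}

if __name__ == "__main__":
    print(private_ips_with_tag(SAMPLE_RESPONSE, "consul_join", "blog-cluster"))
```

The resulting list of addresses is what would then be handed to the join step.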

The method we are using in this example is the EC2 instance role metadata. By assigning the ec2:DescribeInstances permission to the instance's IAM role, we can give Consul this permission without leaking any other control over your AWS account.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
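
One way to sanity-check that a policy like this grants nothing beyond the intended action is to list the actions its Allow statements contain. The Python sketch below does that for the policy shown above; the helper name allowed_actions is hypothetical, and the check only reads the JSON document, it does not call AWS.

```python
import json

# The policy document from this post, embedded as a string.
POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
"""


def allowed_actions(policy_json):
    """Return the set of actions granted by Allow statements in a policy."""
    document = json.loads(policy_json)
    statements = document["Statement"]
    # IAM accepts either a single statement object or a list of them.
    if isinstance(statements, dict):
        statements = [statements]
    actions = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        # "Action" may be a single string or a list of strings.
        actions.update([action] if isinstance(action, str) else action)
    return actions


if __name__ == "__main__":
    print(allowed_actions(POLICY))
```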

»Auto-Joining in Action

The repository at https://github.com/hashicorp/consul-ec2-auto-join-example includes a Terraform configuration to demonstrate this functionality. To start and bootstrap the cluster, modify the file terraform.tfvars to add your AWS credentials and default region, and then run terraform plan and terraform apply to create the cluster.

aws_region = "eu-west-1"

aws_access_key = "[AWS_ACCESS_KEY]"

aws_secret_key = "[AWS_SECRET]"

Once this is all up and running, you will see some output from Terraform showing the IP addresses of the created agents and servers.

Outputs:

clients = [
    34.253.136.132,
    34.252.238.49
]
servers = [
    34.251.206.78,
    34.249.242.227,
    34.253.133.165
]

After provisioning, it is possible to log in to one of the nodes via SSH using an IP address from the Terraform output (a server address in this example).

$ ssh ubuntu@34.251.206.78

The cluster should be auto-joined, since the instances share the same auto-join tag value.

Running the consul members command will show all members of the cluster and their status (both clients and servers).

$ consul members
Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1

This cluster automatically bootstrapped with no human intervention, but what about failure scenarios?

Without the auto-join functionality, scaling Consul servers can be challenging and often involves operator participation. With the new auto-join functionality, scaling (up or down) requires no manual join step at all. To demonstrate this, edit the terraform.tfvars file, increase the number of servers to 5, and re-run terraform plan and terraform apply.

$ terraform plan 
Plan: 2 to add, 0 to change, 0 to destroy.

$ terraform apply
Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

The state of your infrastructure has been saved to the path
below. This state is required to modify and destroy your
infrastructure, so keep it safe. To inspect the complete state
use the `terraform show` command.

State path: terraform.tfstate

Outputs:

clients = [
    34.253.136.132,
    34.252.238.49
]
servers = [
    34.251.206.78,
    34.249.242.227,
    34.253.133.165,
    34.252.132.0,
    34.253.148.148
]

Run consul members again after the new servers have finished provisioning. It might take a few seconds for the new servers to join the cluster, but they will then appear in the members list:

Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1
consul-blog-server-3  10.1.2.44:8301   alive   server  0.7.5  2         dc1
consul-blog-server-4  10.1.1.75:8301   alive   server  0.7.5  2         dc1

The same applies when scaling down: there is no need to manually remove nodes, so long as we do not drop below the originally configured minimum number of servers (3 in this example). To demonstrate this functionality, decrease the number of servers in the terraform.tfvars file and run terraform plan and terraform apply again. The deprovisioned server nodes will show in the members list as failed, but the cluster will be fully operational.

Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1
consul-blog-server-3  10.1.2.44:8301   failed  server  0.7.5  2         dc1
consul-blog-server-4  10.1.1.75:8301   failed  server  0.7.5  2         dc1
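
To make the "failed but still operational" state concrete, the Python sketch below parses the members output shown above and compares the number of alive servers against a simple majority. The function names are hypothetical, and the check assumes that every server in the members list counts toward the majority; the real consensus state lives in Consul itself, so treat this as a reading aid rather than a health check.

```python
MEMBERS_OUTPUT = """\
Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1
consul-blog-server-3  10.1.2.44:8301   failed  server  0.7.5  2         dc1
consul-blog-server-4  10.1.1.75:8301   failed  server  0.7.5  2         dc1
"""


def parse_members(output):
    """Turn whitespace-separated `consul members` output into dicts."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def server_summary(members):
    """Count servers and compare alive servers to a simple majority."""
    servers = [m for m in members if m["Type"] == "server"]
    alive = sum(1 for m in servers if m["Status"] == "alive")
    # Majority of all listed servers (an assumption of this sketch).
    majority = len(servers) // 2 + 1
    return {
        "servers": len(servers),
        "alive": alive,
        "majority": majority,
        "has_majority": alive >= majority,
    }


if __name__ == "__main__":
    print(server_summary(parse_members(MEMBERS_OUTPUT)))
```

For the output above, three of the five listed servers are alive, which still meets the majority of three.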

»Summary

The Consul EC2 auto-join functionality enables seamless bootstrapping and auto-scaling of Consul clusters by leveraging cloud metadata. This post demonstrates it on AWS EC2, but the same capability is also available for Google Cloud, and Consul's roadmap includes adding support for additional cloud providers in the future. We hope you enjoy this new functionality and look forward to future improvements.