Consul Auto-Join with Cloud Metadata

Mar 28 2017 Nic Jackson

We work in a world of distributed systems that operate in rapidly changing environments. Servers come and go, they move across regions and distribution groups, and somehow they need to communicate and connect to one another.

To solve this problem, HashiCorp created Consul, which, among many other things, enables service registration and service discovery. Application instances register themselves with Consul, and dependent instances query Consul to discover each other. Since Consul itself is a distributed system, this creates a chicken-and-egg problem: how do you bootstrap your service discovery?

Automation Challenges

How do you discover your service discovery? Traditionally this has been a challenge for distributed systems. The technique often involves spinning up a cluster in one operation and then performing a second operation once the IP addresses are known to join the nodes together. This two-step approach not only makes automation challenging, but also raises questions about the behavior of the system when losing a node. Autoscaling could bring another node online, but an operator would still need to manually join the node to the cluster.

Consul Auto-Join for EC2

Consul 0.7.1 introduced new functionality that allows an agent to discover other agents using cloud metadata. This blog post explores using AWS metadata to auto-join and auto-scale a Consul cluster.

The latest documentation for Consul shows new options we can specify in the Consul configuration file or startup parameters.

  • -retry-join-ec2-tag-key - The Amazon EC2 instance tag key to filter on. When used with -retry-join-ec2-tag-value, Consul will attempt to join EC2 instances with the given tag key and value on startup.

  • -retry-join-ec2-tag-value - The Amazon EC2 instance tag value to filter on.

  • -retry-join-ec2-region - (Optional) The Amazon EC2 region to use. If not specified, Consul will use the local instance's EC2 metadata endpoint to discover the region.
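To see how these flags fit together, here is a minimal sketch that prints the command line for a server agent that joins its peers by tag. The tag key consul_join, the tag value blog, the data directory, and the expected server count are assumptions made for illustration; they are not taken from the Terraform configuration used in this post. The region flag is left out, so Consul would fall back to the instance's EC2 metadata endpoint as described above.

```shell
#!/usr/bin/env bash
# Sketch: print the command for a Consul 0.7.x server that auto-joins by EC2 tag.
# The tag key/value, data directory, and server count are illustrative assumptions.
consul_autojoin_cmd() {
  local tag_key="$1" tag_value="$2" expect="$3"
  echo "consul agent -server -bootstrap-expect=${expect}" \
       "-data-dir=/opt/consul/data" \
       "-retry-join-ec2-tag-key=${tag_key}" \
       "-retry-join-ec2-tag-value=${tag_value}"
}

consul_autojoin_cmd consul_join blog 3
```

Printing the command rather than running it makes it easy to review what an instance's startup script would execute before baking it into an image or user-data script.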

The new feature requires permission to read the AWS instance state, and there are a variety of options available to grant these permissions.

  • Static credentials (from the config file)

  • Environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)

  • Shared credentials file (~/.aws/credentials or the path specified by AWS_SHARED_CREDENTIALS_FILE)

  • ECS task role metadata (container-specific)

  • EC2 instance role metadata
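Whichever credential source is used, the identity behind it needs to be allowed to list EC2 instances. As a sketch, assuming ec2:DescribeInstances is the only action the lookup needs, the function below prints a minimal IAM policy document that could be attached to an instance role. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: emit a minimal IAM policy allowing an instance role to list EC2 instances.
# Assumption: ec2:DescribeInstances is the only action the tag lookup requires.
consul_autojoin_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
EOF
}

consul_autojoin_policy
```

The wildcard resource is deliberate: DescribeInstances does not support resource-level restrictions, so the policy cannot be narrowed to specific instances.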

The startup process for the AWS instance is as follows:

aws_access_key = "[AWS_ACCESS_KEY]"

aws_secret_key = "[AWS_SECRET]"

Once this is all up and running, you will see some output from Terraform showing the IP addresses of the created agents and servers.

clients = [
    34.253.136.132,
    34.252.238.49
]
servers = [
    34.251.206.78,
    34.249.242.227,
    34.253.133.165
]

After provisioning, you can log in to one of the nodes via SSH using an IP address from the Terraform output.

$ ssh ubuntu@34.251.206.78

The cluster should be auto-joined, since the instances share the same auto-join tag value.

[Figure: Consul Auto-Join AWS list of tags]
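To check from the command line which instances carry the tag, a query along the following lines can help. It is only an approximation of the lookup Consul performs, and it assumes the AWS CLI is installed with credentials that allow ec2:DescribeInstances. The function name and the tag key and value in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list private IPs of running instances that carry a given tag.
# Requires the AWS CLI; the tag key and value are supplied by the caller.
tagged_private_ips() {
  local tag_key="$1" tag_value="$2"
  aws ec2 describe-instances \
    --filters "Name=tag:${tag_key},Values=${tag_value}" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].PrivateIpAddress' \
    --output text
}
```

Calling tagged_private_ips consul_join blog should print the private addresses of the tagged instances, which can then be compared with the addresses reported by consul members.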

Running the consul members command will show all members of the cluster and their status (both clients and servers).

$ consul members
Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1
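For larger clusters it can be handy to condense this listing. The following sketch, which assumes the column layout shown in this post (Node, Address, Status, Type, ...), counts members per type and status. The function name members_summary is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: summarize `consul members` output (read from stdin) by type and status.
# Assumes the third column is Status and the fourth is Type; skips the header row.
members_summary() {
  awk 'NR > 1 { count[$4 " " $3]++ }
       END { for (key in count) print key, count[key] }' | sort
}
```

Piping the command into it, as in consul members | members_summary, yields one line per combination, for example one line for alive servers and another for alive clients.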

This cluster automatically bootstrapped with no human intervention, but what about failure scenarios?

Without the auto-join functionality, scaling Consul servers can be challenging and often involves operator participation. With the new auto-join functionality, scaling up or down requires no manual join step at all. To demonstrate this, edit the terraform.tfvars file to increase the number of servers from 3 to 5, then run terraform plan and terraform apply.

The state of your infrastructure has been saved to the path below. This state is required to modify and destroy your infrastructure, so keep it safe. To inspect the complete state use the terraform show command.

State path: terraform.tfstate

Outputs:

clients = [
    34.253.136.132,
    34.252.238.49
]
servers = [
    34.251.206.78,
    34.249.242.227,
    34.253.133.165,
    34.252.132.0,
    34.253.148.148
]

Run consul members again after the new servers have finished provisioning. It might take a few seconds for the new servers to join the cluster, but they will be available in the memberlist:

Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1
consul-blog-server-3  10.1.2.44:8301   alive   server  0.7.5  2         dc1
consul-blog-server-4  10.1.1.75:8301   alive   server  0.7.5  2         dc1
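Because the join takes a few seconds, an automated pipeline may want to wait for it rather than check once. Here is a minimal polling sketch, assuming the consul binary is on the PATH and the column layout shown above. The function name, the default retry count, and the two-second interval are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll `consul members` until at least N servers report "alive".
# Usage: wait_for_servers <wanted> [tries]; returns 1 if the tries run out.
wait_for_servers() {
  local want="$1" tries="${2:-30}" alive
  while [ "$tries" -gt 0 ]; do
    # Count rows whose Status column is "alive" and Type column is "server".
    alive="$(consul members | awk 'NR > 1 && $3 == "alive" && $4 == "server" { n++ }
                                   END { print n + 0 }')"
    [ "$alive" -ge "$want" ] && return 0
    tries=$((tries - 1))
    sleep 2
  done
  return 1
}
```

For the scale-up in this post, wait_for_servers 5 would return once all five servers appear as alive, or fail after roughly a minute.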

The same applies when scaling down - there is no need to manually remove nodes, so long as we stay above the originally-configured minimum number of servers (3 in this example). To demonstrate this functionality, decrease the number of servers in the terraform.tfvars file and run terraform plan and terraform apply again. The deprovisioned server nodes will show in the members list as failed, but the cluster will be fully operational.

Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1
consul-blog-server-3  10.1.2.44:8301   failed  server  0.7.5  2         dc1
consul-blog-server-4  10.1.1.75:8301   failed  server  0.7.5  2         dc1
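One way to reason about why the cluster stays operational is the Raft majority rule: Consul servers need a strict majority of the server set to agree. The arithmetic below is a sketch of that rule only, with hypothetical function names; it does not query a running cluster.

```shell
#!/usr/bin/env bash
# Sketch: majority (quorum) arithmetic for a cluster of N servers.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of servers that can fail while a majority still remains.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

quorum 5              # prints 3
tolerated_failures 5  # prints 2
```

With five servers the quorum is three, so the three servers still marked alive in the listing above are exactly enough to keep electing a leader and accepting writes.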

Summary
