Case Study

HashiCorp Consul for multi-cloud

At Thomson Reuters, operators needed a tool that could discover newly introduced services while also being able to configure all of those loosely-coupled services in a single place. Here's how HashiCorp Consul met those needs.

Infrastructure in cloud environments is very dynamic, as we periodically shut down specific areas and introduce new infrastructure on the fly. This becomes more complex to manage in a modern multi-cloud environment, as each tool needs to be aware of every cloud. Keeping track of running services and infrastructure configurations poses a further challenge. We also need an automated mechanism to raise alerts and stop bad nodes from serving requests.

When hosting a web application on cloud, we may be setting up a load balancer like HAProxy with a pool of front-end and back-end servers for load balancing and high availability. We need to scale servers up and down based on traffic load. Yet, when a new server comes up, the infrastructure does not automatically know it exists or which service it provides. We also need to remove entries of the servers which are no longer needed from HAProxy, so that traffic does not get directed to them.

Imagine this happening in real time with hundreds of servers in multi-cloud environments. The job of manually managing and controlling all of these moving parts suddenly becomes extremely complex for administrators.

This calls for a service discovery system where we can query newly introduced nodes and services, and also get notified when a node or service becomes unavailable. We also need a configuration management system to update and validate the configurations of the servers and the infrastructure in multi-cloud.

HashiCorp Consul is a tool that addresses the above use case by providing easy service discovery and configuration management for IT infrastructure. Consul is distributed, highly available, and data center aware. Consul runs as a lightweight agent on servers (an agent can be either a server or a client). Consul supports DNS query lookup in addition to providing system-level and service-level health checks, which provide fine-grained monitoring and control of nodes and services. As cloud infrastructure is scattered across various zones, we can create Consul data centers distributed across these zones. Consul supports multiple data centers, which makes it possible to aggregate different zones or multiple cloud providers and supervise them together, while also creating a highly available and responsive cloud infrastructure.


  • Kameshwara Rao Marthy, Lead engineer (cloud & DevOps), Thomson Reuters


Hello everyone, my name is Kamesh, and I work for Thomson Reuters India. I love to learn new technologies in the cloud and DevOps space. Today I'm going to talk about the features of Consul and how they help us with the problems in multi-cloud environments. So, let's talk.

Multi-cloud environments

How many people here are working with Consul in production? Oh, that's like 30 percent. Okay, so we are running Consul in a multi-cloud environment, and it's a big cluster. Basically, what multi-cloud means is that we are using storage and compute from multiple cloud environments and putting them into a single architecture. Nowadays, every enterprise typically does this to get its applications into the cloud. Multi-cloud, or maybe hybrid cloud, is going to be the future of applications, because you want to keep some of your stuff secured while you put some of your stuff in public clouds.

The advantages of being in multi-cloud are that you can avoid vendor lock-in and choose any cloud at any point in time. You can be cloud agnostic in your approach. Obviously, multi-cloud comes with its own set of challenges as well. We can't completely rely on the traditional DNS service to let us know about changes in fast-moving environments, and we will not know where our services are running or have a comprehensive list of all the services running in our environments. There is also the question of how consumers bind to services, whether at the host level or the port level. Basically, all the DNS lookup challenges come into the picture when you're running your loads on multi-cloud.

Microservice environment challenges

In microservices environments, your applications are not guaranteed to live a hundred percent of the time on the same host or IP address, because there are dynamic changes happening all the time. You need to map the services so they can talk to each other, so you need to locate them and bind them together every time. When it is a small environment it's easy, because you can hard-code stuff into the config files.

But when the environment grows, you can't really rely on DNS lookups to do this. So what do you do to manage failovers? You put a load balancer in front and use it to locate the services. The problem is that it's a single point of failure: if your load balancer fails, then what happens? You are unable to route your traffic to the correct node, because you don't know which node has failed. Even though your services are running, since your load balancer has failed you cannot route traffic to the nodes. Even though your services are up, you're still down. That's a typical challenge with this approach. The point is that your service lookup has to be decentralized; it should not be a single point of failure like this.

But even with that, how does your load balancer, or whatever service registry you're using, keep track of all the changes that are happening dynamically at runtime? You need to keep track of the containers which are running across the data centers, stop traffic to the unhealthy nodes, and route it back to the healthy nodes. How do you minimize the time needed to do all these mappings behind the scenes? We need to minimize that part. Those are the typical challenges of microservices approaches.

Challenges of multi-cloud containers

So how are these changes tracked? How are these versioned artifacts tracked in a single place, and how are the monitoring and service discovery challenges managed? Containers need to be tracked as well, so that's one more problem area.

When you're in a single data center, these problems sit within a single operation. But when you span data centers or clouds, getting this information synchronized between the clouds is also a big problem, because of the obvious latency between clouds: the traffic has to go from one data center to another, and it has to happen in real time, or at least near real time.

Those are the challenges you need to keep in mind when you are architecting solutions. We need to start using the smart tools in the market. That's where tools like Zookeeper, etcd, or Consul come in. But tools like Zookeeper or etcd do only one thing: they do only service discovery. We need monitoring along with that, so that's why we are looking at Consul. That's what everyone is here for today.

To solve these problems, we have to build robust systems which can find services, stop traffic to the unhealthy nodes, and be configured externally. They should also be free and open source, because you don't really want to spend money to buy tools.

We have to live with the fact that there are definitely latencies associated with having all these sets of tools in place. So you just need one tool to do all the stuff for you, and you need to use that tool correctly.

The features of Consul

So obviously we need a consistent and highly available service discovery platform, and a mechanism to register services and monitor service health. We also need a mechanism to look up and connect to those services. You have to be able to do that very simply; you do not want to do a lot of complex things. It should require little effort, and the cost factor has to be taken care of, along with the learning curve of the tools and technologies that you are going to put in.

Consul brings all of them together. It's like your DNS, Zookeeper, and Nagios, everything bundled into one solution. That's what gives it an edge when you're trying to address these sorts of problems in your cloud environment. This is one tool that can comprehensively answer all the challenges we have discussed. The standard features Consul is known for are service discovery, the KV store, health checks, and multi-data center support.

Discovery and health checks

The first one is discovery. It's obviously your source of truth, your bible for keeping your infrastructure up and running. There is no single point of failure, and stale reads are better than no reads. That's how you try to handle these discovery problems.

Consul uniquely blends service discovery with health checking, which is absolutely crucial because you only want healthy nodes to be discovered. You don't want everything in one place without knowing which node is running and which one is not. Registering services with Consul doesn't need any code, because it's zero touch. You just need a JSON file in which you describe the services and the checks you want to put in.
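As a rough sketch of the kind of JSON file described above, the following Python builds a service definition with an HTTP health check and serializes it. The service name, address, port, and check URL are hypothetical placeholders, not values from the talk.

```python
import json

# Hypothetical service: name, address, port and check URL are placeholders.
service_definition = {
    "service": {
        "name": "web",
        "address": "10.0.0.12",
        "port": 8080,
        "check": {
            # The agent calls this URL on the given interval to judge health.
            "http": "http://10.0.0.12:8080/health",
            "interval": "10s",
            "timeout": "2s",
        },
    }
}


def render_service_file(definition: dict) -> str:
    """Serialize a service definition to the JSON text an agent could load."""
    return json.dumps(definition, indent=2, sort_keys=True)


rendered = render_service_file(service_definition)
print(rendered)
```

The printed text would typically be saved as a file in the agent's configuration directory, so the application itself needs no code changes.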

Here's a simple example [9:26] of putting in a service check. You just give an IP address and a port number. Consul's DNS interface is zero touch, so there is no need for configuration changes at any level of application code. All you need to do is put in your checks, which are provided when you register services, and discovery is done at the DNS level. When you query, the results are round robin, so you will get them in random order at any given point in time. Entries for services which are failing are automatically removed, so you are querying only the healthy nodes at any given point in time.
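The lookup behavior described above (failing entries dropped, remaining entries returned in random order) can be illustrated with a small in-memory model. This is only a simulation of the idea, not Consul's DNS implementation; the `Instance` type and `resolve` function are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    address: str
    port: int
    healthy: bool


def resolve(instances: List[Instance],
            rng: Optional[random.Random] = None) -> List[Instance]:
    """Return only healthy instances, in random order, like a DNS answer."""
    rng = rng or random.Random()
    healthy = [inst for inst in instances if inst.healthy]
    rng.shuffle(healthy)  # a random order spreads load across healthy nodes
    return healthy


pool = [
    Instance("10.0.0.11", 8080, healthy=True),
    Instance("10.0.0.12", 8080, healthy=False),  # failing check: never returned
    Instance("10.0.0.13", 8080, healthy=True),
]
answer = resolve(pool)
print([inst.address for inst in answer])
```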

That's a sample UI [10:30] showing health checks for services and hosts, and you can track the stuff there. The other good part with Consul is that it's a push-based mechanism, in which the agents send state information and health information to the servers. Because of this, only the updates from the agents get reflected, so it saves a lot of bandwidth and other resources. You're saving on compute, on memory, and on network traffic in the cloud environment. Each resource is money in the cloud, so it's always good to save, and you do because fewer requests pass between the nodes.

This approach provides multiple orders of magnitude reduction in the number of writes, the write volume, and bandwidth, as I said. Assume a scenario where you poll between data centers every five minutes, pulling information with a system like Nagios. A lot of requests flow to the servers, and the servers' resources are consumed just answering those requests, which is unnecessary. Edge triggering is something which can help here, and Consul is definitely a solution that addresses that problem.
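To make the difference concrete, here is a back-of-the-envelope comparison of message counts for periodic polling versus edge-triggered updates. The numbers (1,000 checks, one day, a five-minute poll interval) are illustrative assumptions, and the model ignores any periodic background sync an agent might also perform.

```python
from typing import Sequence


def polling_messages(num_checks: int, duration_s: int, poll_interval_s: int) -> int:
    """Every check is fetched once per poll interval, whether it changed or not."""
    return num_checks * (duration_s // poll_interval_s)


def edge_triggered_messages(state_timeline: Sequence[str]) -> int:
    """Only state transitions (e.g. passing -> critical) produce a message."""
    return sum(1 for prev, curr in zip(state_timeline, state_timeline[1:])
               if prev != curr)


ONE_DAY_S = 24 * 60 * 60
polled = polling_messages(num_checks=1000, duration_s=ONE_DAY_S, poll_interval_s=300)

# One check that fails once and recovers once during the day: two transitions.
timeline = ["passing"] * 100 + ["critical"] * 5 + ["passing"] * 183
pushed = edge_triggered_messages(timeline)

print(f"polling: {polled} messages per day for 1000 checks")
print(f"edge-triggered: {pushed} messages per day for this one check")
```

Even if every one of the 1,000 checks flapped a few times a day, the edge-triggered count would stay far below the polling count in this model.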

The KV store

The next one is the KV store. Consul gives you a KV store to store your configuration files, your artifact versions, and your network details. It's one data store which you can query from any data center to get your details. You will get base64 values when you query results from the KV store. That's in order to provide ASCII-safe output for whatever you are saving in the KV store.
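Because values come back base64-encoded, a client has to decode them before use. The sketch below decodes a sample response body shaped like a KV read result; the key name and value are made up for illustration, and the response is built locally rather than fetched from a running agent.

```python
import base64
import json
from typing import Dict, Optional

# A locally built sample body; the key and value are hypothetical.
sample_response = json.dumps([
    {
        "LockIndex": 0,
        "Key": "config/web/max_conns",
        "Flags": 0,
        "Value": base64.b64encode(b"256").decode("ascii"),
        "CreateIndex": 100,
        "ModifyIndex": 200,
    }
])


def decode_kv(body: str) -> Dict[str, Optional[bytes]]:
    """Map each key to its decoded raw bytes (None when no value is stored)."""
    decoded = {}
    for entry in json.loads(body):
        raw = entry.get("Value")
        decoded[entry["Key"]] = base64.b64decode(raw) if raw is not None else None
    return decoded


values = decode_kv(sample_response)
print(values)
```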

With Consul there is a limit: you can store only 512 KB of data in a value. If you have a requirement for more, say you want to store documents and keys or anything else, then it's better to go for Vault with Consul as the backend. So a typical architecture uses Consul as a backend. Consul is tightly integrated with Vault, so together they give you the KV store and secret management at the same time, as well as health checking. That's one more reason to combine Vault and Consul.
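A client can guard against the size limit before attempting a write. This minimal sketch takes the 512 KB figure from the talk and assumes it means 512 × 1024 bytes with an inclusive boundary; the function name is hypothetical.

```python
# 512 KB as stated in the talk; treating it as 512 * 1024 bytes is an assumption.
MAX_VALUE_BYTES = 512 * 1024


def fits_in_kv(value: bytes, limit: int = MAX_VALUE_BYTES) -> bool:
    """Return True when the raw value is within the per-value size limit."""
    return len(value) <= limit


small_config = b'{"max_conns": 256}'
large_document = b"x" * (MAX_VALUE_BYTES + 1)

print(fits_in_kv(small_config))
print(fits_in_kv(large_document))
```

Values that fail this check are the ones the talk suggests moving to a different store.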

Consul supports multiple data centers out of the box. This means users of Consul do not have to worry about building additional layers to grow to multiple data centers. All they need is for the WAN ports to be open, and these data centers participate in the gossip protocol. There is a gossip pool for each data center, and all the nodes in it participate in that particular gossip pool. There is no need to configure clients for this; everything is taken care of by the participating servers between the data centers over the WAN.

In the architecture for a single data center, your clients communicate with each other using the gossip protocol, and your clients communicate with the servers using RPCs. Follower servers forward requests to the leader, and the leader replicates its data to the followers.

In a multi-data center setup it's the same thing; the only difference is that the servers from each data center participate with each other in order to replicate state. In a modern setup, a typical use case has multiple data centers where each can connect to everything, but there is one data center which you want to keep in a secure zone and can't just open to the outside. Even with that approach, it is connected to only one data center that hosts the services, and it can still replicate data with the others, so you can keep the state.
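The forward-to-leader and replicate-to-followers flow can be pictured with a toy in-memory model. This is a deliberately simplified sketch: there is no Raft log, quorum, or leader election here, and the `Server` and `Cluster` classes are hypothetical names.

```python
from typing import Dict, List, Optional


class Server:
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.cluster: Optional["Cluster"] = None

    def write(self, key: str, value: str) -> str:
        """Apply a write; returns the name of the server that handled it."""
        leader = self.cluster.leader
        if self is not leader:
            return leader.write(key, value)  # followers forward to the leader
        self.data[key] = value
        for follower in self.cluster.followers():
            follower.data[key] = value       # leader replicates to followers
        return self.name


class Cluster:
    def __init__(self, names: List[str]) -> None:
        self.servers = [Server(name) for name in names]
        for server in self.servers:
            server.cluster = self
        self.leader = self.servers[0]        # fixed leader: no election modeled

    def followers(self) -> List[Server]:
        return [s for s in self.servers if s is not self.leader]


cluster = Cluster(["server-1", "server-2", "server-3"])
handled_by = cluster.servers[2].write("config/web/max_conns", "256")
print(handled_by, [s.data for s in cluster.servers])
```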

You have HTTP APIs for everything: APIs for the agent, the catalog, health, the KV store, and services. When you're integrating with the HTTP APIs, always try to talk to localhost, not the servers, because you don't want to increase the load on the servers by querying them directly.

These endpoints are properly defined in the documentation, so all you need to do is curl and you will get the required response. As I said, the values in the KV store cannot be longer than [inaudible], so if you want something more than that, then obviously it's better to go with Vault.
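As an illustration of pointing requests at the local agent rather than at the servers, the helpers below only build request URLs; they do not send anything. The sketch assumes the agent's HTTP API listens on the default local address 127.0.0.1:8500, and the key, service, and data center names are placeholders.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Assumption: the local agent listens on the default HTTP address and port.
LOCAL_AGENT = "http://127.0.0.1:8500"


def kv_url(key: str, dc: Optional[str] = None, base: str = LOCAL_AGENT) -> str:
    """URL for reading a key, optionally from another data center."""
    url = f"{base}/v1/kv/{quote(key)}"
    return f"{url}?{urlencode({'dc': dc})}" if dc else url


def healthy_service_url(service: str, base: str = LOCAL_AGENT) -> str:
    """URL listing only instances of a service whose checks are passing."""
    return f"{base}/v1/health/service/{quote(service)}?passing"


print(kv_url("config/web/max_conns"))
print(kv_url("config/web/max_conns", dc="dc2"))
print(healthy_service_url("web"))
```

Any of these URLs can be passed to curl on the host where the agent runs.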

Performance factors

Let's talk about a few performance factors in the context of multiple data centers. Consul servers take more load than the agents, so they need more resources. From an architecture standpoint, you need to give them more RAM and CPU, and it's better to give them good disk access because there will be more writes. Consul servers participate in the consensus algorithm, Raft, so they need more resources because they write a lot of data to disk as part of it.

Coming to the memory requirements: once again it's the servers, because they participate in more stuff than the agents and carry more load. It's always good to allocate four times the RAM of the working set. The working set includes the KV store, the services, and the requests which are coming in, so you need more RAM on the servers to address these requirements.
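The sizing rule above is simple arithmetic, sketched here with made-up component sizes. The factor of four comes from the talk; the breakdown of the working set in MiB is purely illustrative.

```python
from typing import Dict

RAM_FACTOR = 4  # "four times the RAM of the working set", per the talk


def recommended_ram_mib(working_set_mib: Dict[str, int],
                        factor: int = RAM_FACTOR) -> int:
    """Sum the working-set components and multiply by the sizing factor."""
    return sum(working_set_mib.values()) * factor


# Hypothetical working set for one server, in MiB.
working_set = {"kv_store": 512, "services": 256, "in_flight_requests": 256}
print(recommended_ram_mib(working_set), "MiB")
```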

The Consul documentation also has guidance on resources for the servers. If there are scenarios where spurious leader elections happen because of network failures or network latencies, it's better to tune the performance factors, or maybe use a bigger instance if you are in a cloud environment, so that the need for more resources is met.

It is also good to set a reasonable DNS TTL value instead of leaving it at zero, because we certainly need some caching when dealing with bigger environments with more than a dozen servers. Where applications perform high volumes of reads against Consul, consider using the stale consistency mode so that reads can scale out. In Consul's latest versions, I think there is a new option called limits configuration, which constrains the requests made from the clients to the servers. These are some of the general recommendations which are good when you're setting up Consul in big clusters, in big data centers, or maybe in multiple data centers.

So that's all I wanted to talk about today. Thank you.
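The recommendations above might translate into an agent configuration fragment along the following lines. The option names (`dns_config`, `node_ttl`, `service_ttl`, `allow_stale`, `limits`, `http_max_conns_per_client`) are Consul agent settings as far as I know, but every value here is a placeholder to be tuned per environment, and the `stale_read_url` helper is a hypothetical name.

```python
import json

# All values are placeholders; tune them for your own environment.
agent_config = {
    "dns_config": {
        "allow_stale": True,             # let any server answer DNS queries
        "node_ttl": "10s",               # non-zero TTLs allow resolver caching
        "service_ttl": {"*": "10s"},
    },
    "limits": {
        "http_max_conns_per_client": 200,  # cap connections per client address
    },
}


def stale_read_url(service: str, base: str = "http://127.0.0.1:8500") -> str:
    """URL for a catalog read that any server may answer (stale mode)."""
    return f"{base}/v1/catalog/service/{service}?stale"


config_text = json.dumps(agent_config, indent=2)
print(config_text)
print(stale_read_url("web"))
```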
