Infrastructure changes at lightning speed today. Large companies move services from machines to virtual machines to containers. They use multiple clusters and cloud providers and add numerous microservices every year. And one question central to any infrastructure change is how it will affect humans—the end-users who rely every day on services and applications to do their jobs.
But when infrastructure changes are made, users are often totally dependent on DevOps for support, resulting in confusion about the level of service available to them and lost efficiency. To be able to make necessary infrastructure changes today without impacting operational efficiency, DevOps teams need to put more capability and control in the hands of users.
HashiCorp customer Criteo—an advertising platform on the open Internet—recently spoke at HashiConf Digital 2020 where they discussed how the company has moved from service-on-machine to pure service architectures using HashiCorp Consul.
Just as importantly, the company highlighted how they’ve been able to give their people the tools to discover, learn about, manage, and investigate the services they use. Here’s a closer look at how Criteo uses Consul to manage networked services and bring more autonomy to service users.
Criteo employs more than 2,500 people with around 500 on the R&D team using their services directly. R&D team members often have questions about their applications, including how to call a particular microservice API or how to create a new service, and the development team is often tasked with answering those questions — frequently a time-consuming process.
Criteo was already using Consul for service discovery and networking, but the DevOps team wanted a better way for employees to see what was happening on their systems, so that the team could spend less time fielding questions. They created a custom UI, called the Consul TemplateRB, as an open-source application that shows live information about Consul, served by a simple static web server. The JavaScript-based Consul TemplateRB can be plugged into any organization, is easily customizable, and scales to accommodate thousands of concurrent users.
The Consul TemplateRB also shows service metadata, which is especially useful because other systems rely on the same data, including to make decisions about how to provision infrastructure. The Consul TemplateRB reflects exactly what’s provisioned with Consul and is updated live.
With the Consul TemplateRB, users can:
The data these features produce also makes it possible for the DevOps team to proactively address additional user challenges by:
To prevent users from modifying other teams’ services, Criteo’s DevOps team makes sure all services are standardized and that the data can only be changed at startup. Because the data cannot be changed later, they avoid introducing entropy into the system.
Similarly, the data registered in Consul can only be changed by the machine itself, preventing others from modifying the service on a given machine and changing what is published into Consul. In Criteo’s world of thousands of services, identifying the stakeholders — service “owners” — is important when there’s an emergency. Identifying service owners is also key to leveraging information on usage, such as consumption of resources, to further refine and enhance the application’s functionality and value.
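As a rough illustration of the idea, the sketch below builds the kind of payload a machine could send once, at startup, to its local Consul agent's service registration endpoint (`PUT /v1/agent/service/register`). The `Name`, `Port`, and `Meta` fields belong to Consul's service definition; the `owner` metadata key, the helper name, and the sample service are hypothetical and not taken from Criteo's setup.

```python
import json


def build_service_registration(name, port, owner, extra_meta=None):
    """Build a Consul service definition carrying ownership metadata.

    The "owner" key is a hypothetical convention for this sketch; Consul
    itself treats Meta as free-form string key/value pairs.
    """
    meta = {"owner": owner}
    meta.update(extra_meta or {})
    return {"Name": name, "Port": port, "Meta": meta}


# Built once at startup and sent by the machine that runs the service.
payload = build_service_registration("billing-api", 8080, "team-billing")
print(json.dumps(payload, indent=2, sort_keys=True))
```

Keeping ownership in the registration itself means a UI that reads from Consul can show who to contact without consulting a separate inventory.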
Enabling self-service through an intuitive UI makes it easy for users to investigate services on their own, which greatly reduces the need for operators to answer questions. Criteo’s small, four-person DevOps team went from dealing with dozens of questions per day to around one question per week.
To help users help themselves even more, the DevOps team put in place some additional policies and mechanisms, including:
Criteo has more than 4,000 applications enterprise-wide, with some of the larger services seeing upwards of 2,000 instances on a single server. To create more control around their SDK, the DevOps team implemented default queries to simplify search and discovery while improving load management.
By default, all requests end up on a single Consul server, even if multiple servers are provisioned. Using a common query, the team instead set up Consul to answer calls from any server, which makes it easier to scale across infrastructure (horizontal scalability). They also added a feature for external applications that lets every Consul cluster serve them the data, instead of just one machine in one place.
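One Consul feature that fits this description is the `stale` consistency mode of the HTTP API, which lets any server, not only the leader, answer a read. Whether Criteo's common query relies on exactly this mode is an assumption; the sketch below only shows how such a request URL could be built for the health endpoint. The helper name and the agent address are hypothetical.

```python
from urllib.parse import quote


def health_query_url(base_url, service, allow_stale=True, passing_only=False):
    """Build a URL for Consul's /v1/health/service/<name> endpoint.

    allow_stale adds the `stale` flag so any server may answer the read.
    passing_only adds the `passing` flag to drop non-passing instances.
    """
    flags = []
    if allow_stale:
        flags.append("stale")
    if passing_only:
        flags.append("passing")
    url = f"{base_url}/v1/health/service/{quote(service, safe='')}"
    return f"{url}?{'&'.join(flags)}" if flags else url


# Assumed local agent address, for illustration only.
url = health_query_url("http://127.0.0.1:8500", "billing-api")
print(url)
```

Stale reads trade a small amount of freshness for the ability to spread read load over every server, which is usually acceptable for service discovery.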
Now, when an application is overloaded, Consul’s health check statuses (Passing, Warning, and Critical) provide on-demand updates about server load and help identify potential points of failure or degradation early. Warning and Passing statuses are also weighted. When an overloaded application tells Consul it is in a Warning state, all the applications targeting that specific instance reduce the traffic they send to it. This gives the instance time to recover and helps prevent other instances from going down in succession when there is too much traffic.
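The effect of weighting can be sketched as client-side weighted selection. The weights below (10 for Passing, 1 for Warning, 0 for Critical) are assumed values chosen for illustration, not Criteo's actual configuration, and the instance records are hypothetical.

```python
import random

# Assumed weights for this sketch: a Warning instance gets a tenth of the
# traffic of a Passing one, and a Critical instance gets none.
STATUS_WEIGHTS = {"passing": 10, "warning": 1, "critical": 0}


def traffic_share(instances):
    """Return the expected fraction of traffic each instance receives."""
    weights = {i["id"]: STATUS_WEIGHTS.get(i["status"], 0) for i in instances}
    total = sum(weights.values())
    if total == 0:
        raise LookupError("no instance can receive traffic")
    return {instance_id: w / total for instance_id, w in weights.items()}


def pick_instance(instances, rng=random):
    """Pick one instance at random, proportionally to its status weight."""
    shares = traffic_share(instances)
    ids = list(shares)
    return rng.choices(ids, weights=[shares[i] for i in ids], k=1)[0]


instances = [
    {"id": "a", "status": "passing"},
    {"id": "b", "status": "passing"},
    {"id": "c", "status": "warning"},
    {"id": "d", "status": "critical"},
]
shares = traffic_share(instances)
print(shares)
print(pick_instance(instances))
```

With these assumed weights, the instance in Warning still receives a trickle of requests, so callers can observe its recovery without pushing it back into overload.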
At the same time, Criteo uses Consul for automated:
Consul registers new applications and any modifications to instances, acting as a single repository for all of Criteo's infrastructure and creating a clear and unified view of all infrastructure on all applications.
Essentially, users can publish their service on the infrastructure and change the metadata in their service on their own. Because all services use the same business semantics, there is no need to feed a Git repository. Instead, the infrastructure takes care of everything automatically, freeing the DevOps team of administrative burdens and enabling them to provide sufficient support for each of the company’s 4,000+ services.
For the Criteo DevOps team, defining for all stakeholders what matters to Consul also defines a clear SLA. Instead of using technical metrics, the team opts for business metrics, such as:
The team measures these outside of Consul with other applications and uses a dashboard to show users when they’re under the SLA, when they’re close, and when they’ve exceeded it. By giving users a clear set of expectations around Consul and the SLA, the DevOps team can further reduce the number of questions coming in to them and ensure users know exactly what has been agreed with them.
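A dashboard of this kind needs a rule that turns a measurement into one of the three states. The sketch below is one possible rule, assuming a metric where lower is better and a 10% margin for "close"; the function name, the margin, and the sample values are hypothetical and do not describe Criteo's dashboard.

```python
def sla_status(measured, limit, margin=0.1):
    """Classify a measurement against an SLA limit (lower is better).

    Returns "exceeded" above the limit, "close" within `margin` of it,
    and "under" otherwise.
    """
    if measured > limit:
        return "exceeded"
    if measured > limit * (1 - margin):
        return "close"
    return "under"


# Hypothetical measurements against a limit of 100 units.
for value in (50, 95, 120):
    print(value, sla_status(value, limit=100))
```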
As infrastructure changes continue apace, Criteo is making sure no human user is left behind. The company believes it’s important to boost operational efficiency and user autonomy through intuitive self-service capabilities, supported by the watchful eye of its DevOps experts. With Consul and Consul TemplateRB, Criteo users can take charge of their services and applications within defined parameters and without depending on DevOps.
To learn more about how Criteo uses Consul, view their HashiConf Digital presentation and read their case study.
We invite you to join us at our next HashiConf Digital, October 12-15 (PDT). Registration is free. Real-time product workshops are also available for a nominal fee to reserve your seat. Register here.