HashiCorp Vault and Consul power many production deployments, but how do you monitor them? Learn how to configure Prometheus for Consul and Vault in this talk.
Consul and Vault power many production deployments. Therefore, monitoring them and understanding their behavior is critical to ensure business continuity.
This talk will go over how to use Prometheus to monitor those services and will provide useful patterns you can apply to your local stack to do so. Lessons learned from the Nomad production stack will be highlighted as well as experience gathered at different customers.
Today, Julien and I are going to talk about observing the HashiCorp ecosystem from Prometheus.
Let's talk about who we are first. My name is Kris Buytaert. I used to be a developer, and then I became an ops person. For the past two decades, I've spent my life helping people to deploy software. Yes, that's painful. I do that in multiple roles. I basically started Inuits about 15 years ago — Inuits is an open source consultancy. About four months ago, together with Julien, we started a spinoff which is called o11y, which is focusing on Prometheus.
Next to helping customers, I also tend to speak at a couple of conferences. Apologies to everybody who has to call themselves a DevOps engineer; it just shows that your employer doesn't understand what DevOps means. That was not the idea when Patrick Debois and I started the conference.
Next to DevOps Days, I also started Config Management Camp, LOADays, and all the other things. Most of you, when you run into a DNS problem, you curse at me on Twitter. Please keep doing that, so I know that everything is still a freaking DNS problem.
My name is Julien Pivotto. I am a maintainer of Prometheus. I have been active in the open source monitoring ecosystem for a long time now, and I've been in the open source ecosystem for like 10 years. I am now also working at o11y, where we mainly do support and services on open source observability. I also believe in a diverse ecosystem and environment, and you can find me on GitHub and Twitter as @roidelapluie.
o11y, Inuits — confusing, I guess. With Inuits, we built over 15 years of open source experience, helping customers to deploy platforms. We've got multiple Nomad deployments, which we're actively building and operating, so we learn from our mistakes. Part of the things we were seeing is that there's a lot of demand to do open source monitoring and metrics the right way. Julien being the maintainer of Prometheus, we basically figured out we're going to do a spinoff. We're going to start a new entity really focused on professional services and support around Prometheus. Next month, as you might have seen on the Prometheus list, we are actually going to release the long-term support version of Prometheus, and eventually a distribution.
So, monitoring. Who likes monitoring? Has that always been that way? I said 15 years ago, which kind of means I'm old — or as I like to say, experienced — I together with Tom De Cooman did a research paper for the Ottawa Linux Symposium, back when you still had to write papers to present at a conference, and we were comparing all the open source tools that were available back then in the ecosystem.
Who still remembers Nagios? What we figured out back then was that there were a set of really bloated Java tools that basically you needed more compute power to run than the actual platform you were monitoring, so those didn't work. You had a bunch of Open Core platforms that really forced you to — once you wanted to really do monitoring — do something you didn't like: go to a proprietary solution. And then we also found a bunch of tools like Zenoss — back in the day, once you had more than 20 nodes, you actually needed the DBA to do performance tuning of the application. Your monitoring was already your pain point. Really back then Nagios was the king in the open source world.
Fast forward a couple of years, 2011. We had started DevOps Days, and John Vincent (@lusis on Twitter) tweeted out, "Monitoring sucks!" It became a sub-movement of the whole DevOps movement. We had a good repository where we were evaluating what tools were around. There were a lot of new things popping up, but the frustration really was that most of what people were doing was manual configuration, which was absolutely not in sync with reality. We were doing some monitoring on hosts, services maybe sometimes. But application monitoring 10 years ago? The exceptions were doing that.
But it was a fast-changing point, because even six months later, at DevOps Day in Rome in 2011, Ulf Mansson gave an Ignite talk about his newfound love for monitoring. He had his kids draw his slides with these hearts, and he was talking about why he started liking monitoring again. The reason why was not only the new era of tools — in his case, it was Sensu — but it was mostly because he managed to automate everything. He managed to get to a point where there were no manual changes in his ecosystem. When he spun up new resources, they were monitored.
And that newfound love — not that much later, we see in the ecosystem a new tool popped up, Prometheus. Going from “monitoring sucks” to monitoring love, new tools started popping up. It's improvement. It's how our community works. It's how we as a group improve software.
But what really is monitoring? Well, the idea is that you have a high-level overview of the state of every component in your infrastructure, whether it's an application or an underlying functionality, and that you look at if it is still available. If your customer calls you that your application is down, it's too late.
You focus mostly on the technical components. Sometimes you're like, the customer calls because “It's slow.” Maybe it's slow for them, not for you, but you kind of get a view on what's going on.
And in that traditional view of monitoring, even though we've been preaching automation for ages, we still see that a lot of people are doing their monitoring manually. The monitoring platform has totally drifted from reality. They have things partially automated, and they have a lot of work to actually keep things in sync. When they decommission an instance, the monitoring isn't reconfigured automatically.
It's also not really a good view. It's On or Off. It's not like “it works for 75% of the cases.” All of these typical problems with monitoring basically result in alert fatigue. Who has alert fatigue here? The rest are just too lazy to raise their hands because they've already been doing it all day. Those pitfalls are things we want to fix. We need to improve these things.
What is observability? Well, observability is the idea that we look at it differently. We're going to look at how these services actually behave. We're going to look at it as if we were in the place of the applications, and not from the outside without actually instrumenting them. We're trying to figure out why something is happening rather than, "Oops, it happened."
And they're both required. Monitoring is required, but if you're lucky, it's enough. Julien is the one who would claim that observability is removing that luck. That's how monitoring and observability fit together.
In practice, there's three pillars in the observability ecosystem. There's metrics like my random graphs, like this one, where you use Grafana or Perses. There's logs, where you have clear views on what the applications have been telling you. And there's traces, where you can really follow what a request actually did. With those three components, you can do a lot, including using Prometheus.
Let's go with Prometheus itself. Prometheus is there for the metrics part. Prometheus is an open source project, which is part of the CNCF. The CNCF is the Cloud Native Computing Foundation, and it is the foundation that also hosts projects like Kubernetes and Jaeger. There is not a single company behind Prometheus, but rather it's a collaboration between multiple companies.
What Prometheus is doing is that it will collect and store your metrics in order for you to understand your infrastructure and your applications. It is pull-based, which means that Prometheus itself is at the center of your infrastructure and it will go query every application to get the metrics that they have to offer. Unlike other monitoring systems, Prometheus will pull quite frequently, for example every 15 or 30 seconds, so you get the latest metrics, and then you can base your alerting on top of that.
There is service discovery, which is really important, to make sure that what Prometheus is monitoring is the current state of your infrastructure. That means that if you are removing a service, or if you are adding a service or a node, Prometheus will notice it directly and will start monitoring that node or that new service.
We have been fully compatible with Consul for a very, very long time. There is a very long relationship between Prometheus and Consul, and it has just worked really well. We are looking forward to maybe adding native Nomad support in the future as well.
Prometheus is also an alerting solution, which means that not only can you take your metrics, collect them, store them, but you can also alert based on Prometheus. So you can replace your existing monitoring solution with Prometheus end to end.
The Prometheus ecosystem is really a big, big, big ecosystem. There are thousands of exporters, open source and closed source. There are also many, many applications that directly output Prometheus metrics. We are working on the long-term support release to help enterprises adopt Prometheus even more. We already see that many, many companies are using Prometheus nowadays. We want to help them further.
The Prometheus data model is quite simple. It's all based on metrics, and for each metric you basically have a name, a set of labels, and a value. For example, you can collect the number of HTTP requests, and then a label could be the HTTP status code or error code. But you can also have labels for your datacenter name or your Consul cluster name. All those kinds of things are also labels that you can alert on or that you can see in your metrics.
When you get your alerts, you get the full context of your alerts, you get the full context of your metrics. And then you can easily query them thanks to a language called PromQL, which is the Prometheus query language, and you can just type your queries. If you are using Grafana 9, you now have a new PromQL editor to help you write those queries, but basically it's very powerful.
This is a very simple example to see: can I get the rate of HTTP requests? But the language itself is quite powerful, and you can use it in many ways to really understand your application and your metrics.
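As a rough sketch of that kind of query, the rule file below records the per-second rate of HTTP requests over the last five minutes. The metric name http_requests_total, the group name, and the recorded series name are assumptions for illustration and are not taken from the talk.

```yaml
# Prometheus rule file (sketch).
# http_requests_total is a hypothetical counter exposed by your application;
# swap in a counter you actually have.
groups:
  - name: http-example
    rules:
      # Per-second request rate over the last 5 minutes, summed per job.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

The expression under expr can also be typed as-is into the Prometheus UI or a Grafana panel; the recording rule is only one way to keep the result around.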
Now let's see the relationship between Prometheus and Consul from the Prometheus standpoint. As I said, Prometheus can do service discovery, and it can integrate with Consul, which means that you can use Prometheus to directly discover and scrape your targets, which are all your Consul services. For that, there is a configuration item called consul_sd_configs as the means of discovery. It will just stream the list of Consul services to Prometheus, which means that we are using the Consul API with watches, so we are not constantly polling your Consul server. Even if you have a busy Consul server, this is still very optimized.
And then you get an up-to-date services list. You can filter it before or after you have got the list. You have a lot of flexibility with the labels that you want, so you can really adapt the labels you need for your infrastructure, for your alerting, based on everything that you have in your Consul service discovery information.
When I say that you have labels, the way that this works in Prometheus is that when you are using service discovery you will get a number of meta labels that you can use. This is the Consul example, but you have the service tags, the node name, you have all the metadata that you are putting in your service. You can find them back in Prometheus. You have the datacenter. If you look at Kubernetes, you will have the pod name, the annotations, the labels, and all of that will also be available directly to be consumed by Prometheus.
With those labels, you can decide to keep them, like, “I want to see the datacenter in all my metrics.” Or you can just decide, "Okay, I will take the service name and I will only monitor, for example, the Traefik nodes." So you can really do the filtering. You can change the addresses. You can change a lot of things based on those labels, which enables you to monitor the infrastructure exactly the way you want it, and to organize your metrics the way you want.
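A minimal sketch of what such a scrape job could look like, assuming a Consul agent reachable on localhost:8500 and a Consul service registered under the name traefik. The job name, the address, the service name, and the target label names are assumptions; the __meta_consul_* labels are the ones Prometheus attaches during Consul service discovery.

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: consul-services          # hypothetical job name
    consul_sd_configs:
      - server: "localhost:8500"       # assumed address of a Consul agent
    relabel_configs:
      # Filter: keep only targets whose Consul service name is "traefik".
      - source_labels: [__meta_consul_service]
        regex: traefik
        action: keep
      # Keep the Consul datacenter as a regular label on every scraped metric.
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
      # Keep the Consul node name as well.
      - source_labels: [__meta_consul_node]
        target_label: node
```

Meta labels that are not copied to a target label are dropped after relabeling, so only the labels you explicitly keep end up on your metrics.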
Before we go deeper, I want to highlight the alerting philosophy that I like, the idea that you should only page someone if there is a critical failure which is actionable. If you receive an alert and you are like, "Okay, it's fine," and then you close your phone, then you should not have been paged by this alert.
Another thing is that you need to know what you are alerted on. For example, in Nomad clusters, we initially set up an alert when a Consul check is failing, but when you have thousands of different services and you see, okay, “Consul check is failing,” what does it actually mean? This is not really actionable. This is not a useful alert, so you should not be paged on that kind of alert.
However, I think it's still fine to keep some alerts. I call them “ambiance” alerts, which means that when you are actually paged, you can see, "In my datacenter, I also have this, this and this and this alert, so I think that the issue is that specific alert," and then you can actually start and fix the issue more quickly. But really, alerting on everything from the start is not a good idea, and that's actually how you get alert fatigue.
Later in the talk, you will see that we will not recommend a lot of different alerts, because that's how you get alert fatigue. Alerting on everything (on every metric, on every latency spike) is not something that is recommended, especially with services like Consul and Vault, which are mostly backend services. You want to alert more on your own services when they are slow, and then you can go and look, "Okay, I see that this is caused by Consul being slow or Vault being slow."
Let's look at Consul itself and the Consul telemetry. First, in Consul, you will find two different ways to monitor it. You have the Consul Exporter. The Consul Exporter is an official Prometheus exporter which has been there for quite a long time. What it will do is that it will expose the Consul cluster health that it gets from the Consul API.
You can also expose the key values that are in the KV store, so that if you want, for example, to set a threshold and alert on that threshold, you can store it in the Consul KV store, and then you can use it for graphing or for alerting. If you say, "I want to alert on that specific metric," you can just put it in Consul, and then you can reuse that in your alerting queries. The Consul Exporter is connected to a single Consul instance.
But Consul also has telemetry. This was added in Consul three or four years ago. It's built in, and you get metrics that you don't have with the Consul Exporter. For example, you get all the runtime metrics like the memory and the CPU usage, the Consul autopilot metrics, raft metrics, and all the calls that are made to Consul, like key-value store calls, but also service API calls. All of that, you can see inside the Consul telemetry.
The Consul telemetry, it is not Prometheus-specific. You can also configure it with Datadog, and other monitoring solutions.
This is actually how you will configure it for Prometheus. You want to disable the host name in the telemetry (the disable_hostname option) because by default, Consul will put the host name in the metric names, which in Prometheus will just cause a duplicate. It will not be useful, because when Prometheus scrapes the metrics from Consul, it will already add the host name itself.
And then you add the retention time, because some of the metrics will have some historical data like summaries, I think, and then you want to have a small buffer. I have seen people that have a retention time for the metrics of 24 hours, so that will depend on your use case. But this is what you need to enable Prometheus. The point is that the retention time is actually mandatory, so you need to set it in order to enable the Prometheus exporter inside Consul.
In Prometheus, this is pretty basic except the two last lines. You will just say, "Okay, these are my Consul servers." If you have quite static infrastructure, you can just point to the Consul servers, but you can also discover them using Consul itself, your Kubernetes API, or any service discovery that Prometheus has. And then you need to specify a metrics path, because Prometheus expects by default that it will call /metrics, but in Consul, you need to use /v1/agent/metrics. And then there is an HTTP parameter called format that you need to use, and that needs to be set to prometheus.
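A sketch of the scrape job described here, assuming three statically listed Consul servers on the default HTTP port 8500. The job name and host names are placeholders, not values from the talk.

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: consul                   # hypothetical job name
    # Consul serves its telemetry here rather than on the default /metrics.
    metrics_path: /v1/agent/metrics
    # Ask Consul for the Prometheus text format.
    params:
      format: ["prometheus"]
    static_configs:
      - targets:                       # placeholder host names
          - "consul-1.example.internal:8500"
          - "consul-2.example.internal:8500"
          - "consul-3.example.internal:8500"
```

The static_configs block can be swapped for consul_sd_configs or another discovery mechanism without touching the metrics path or the format parameter.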
I think that's about it for this slide. If you look into the alerts that you want to have for Consul, you first want to make sure that you can get the metrics from the Consul Exporter. You have two metrics there. You have the up metric, which means, “Okay, the Consul Exporter is running,” and consul_up, which means that it can query Consul. You need to check both of those metrics. Otherwise, you don't know whether you actually have the metrics, or whether you have another issue in your infrastructure.
Then you have, is there a leader? There is a metric in the Consul Exporter that tells you if there is actually a leader in Consul, so Consul can make decisions. And then you can check the number of peers that are in the raft, in the cluster of Consul, and you can see that it matches your expectations, that you don't have a Consul that's gone.
Then you have the Consul telemetry, and there is a very handy metric there. You can check that you can scrape Consul with the up metric, which is added automatically by Prometheus. When it's 0, then your target is down. And then you can check the autopilot, which is a Consul feature. If the autopilot says that the cluster is not healthy, then you need to react and you need to see what's going on in Consul.
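The checks listed above could be sketched as alerting rules like the following. The job labels, the alert names, the durations, and the expected peer count of 3 are assumptions. The consul_up, consul_raft_leader and consul_raft_peers metrics come from the Consul Exporter, and consul_autopilot_healthy is, to the best of my knowledge, the Prometheus name of the built-in autopilot health gauge; verify the names against what your versions actually expose.

```yaml
# Prometheus rule file (sketch)
groups:
  - name: consul
    rules:
      # The exporter itself cannot be scraped.
      - alert: ConsulExporterDown
        expr: up{job="consul-exporter"} == 0    # hypothetical job label
        for: 5m
      # The exporter runs but cannot query Consul.
      - alert: ConsulUnreachable
        expr: consul_up == 0
        for: 5m
      # No raft leader, so Consul cannot make decisions.
      - alert: ConsulNoLeader
        expr: consul_raft_leader == 0
        for: 1m
      # The number of raft peers differs from what you expect (3 assumed here).
      - alert: ConsulUnexpectedPeerCount
        expr: consul_raft_peers != 3
        for: 5m
      # Built-in telemetry: the Consul agent cannot be scraped.
      - alert: ConsulDown
        expr: up{job="consul"} == 0              # hypothetical job label
        for: 5m
      # Built-in telemetry: autopilot reports an unhealthy cluster.
      - alert: ConsulAutopilotUnhealthy
        expr: consul_autopilot_healthy == 0
        for: 5m
```

In line with the alerting philosophy above, not all of these need to page someone; some can stay as context alerts.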
I also want to add that, of course, in the Prometheus configuration itself, you can authenticate to Consul. That's not an issue. If you are using ACLs, you can totally pass the token to Prometheus and it should just work. You will be able to connect to your Consul environment with ACLs, of course.
A few words about Vault. Vault uses the same telemetry configuration as Consul. This is an HCL configuration. Again, it's quite simple. The Prometheus configuration is also very similar. The difference that you have there is that the metrics path for Vault itself is not /v1/agent/metrics. You need to use /v1/sys/metrics. Again, I've not added that there, but you can also give Prometheus a token. It can read the token from a file, or you can put the token in the Prometheus configuration file. What we have done in Prometheus is we have added the ability to read tokens from files, which means that if you need Vault to connect to your applications, for example, then you can use the Vault agent to write the token to a sink file. Prometheus will read that file on every request, and it will use the token that it reads from the file to connect to your targets.
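A sketch of the Vault scrape job with a token read from a file, assuming Vault listens with TLS on its default port 8200 and that something like a Vault agent sink keeps the token file up to date. The job name, host name, and file path are placeholders.

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: vault                    # hypothetical job name
    scheme: https
    metrics_path: /v1/sys/metrics
    params:
      format: ["prometheus"]
    authorization:
      # Sent as a Bearer token; the file is re-read by Prometheus,
      # so a rotated token is picked up without a reload.
      credentials_file: /etc/prometheus/vault-token   # assumed path
    static_configs:
      - targets:
          - "vault-1.example.internal:8200"           # placeholder host name
```

The same authorization block can be used on a Consul scrape job when ACLs are enabled, with a Consul token in the file instead.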
In Vault, you want to check that Vault is up, and then it's also very important to check that Vault is unsealed, because if it's sealed, you can wake up a few people to actually start looking into it and enter their keys. And another thing, there is the audit log of Vault, and basically the Vault audit log must work. If your audit log starts showing failures, then you know that Vault is not working. That's a security feature of Vault: the log request and the log response (the audit logs) must succeed, so you need to check them. That's another alert, because then you are 100% sure that Vault is not working.
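These three checks could be sketched as alerting rules like the ones below. The job label, the durations, and the alert names other than VaultIsSealed are assumptions. The metric names are my best understanding of how Vault's vault.core.unsealed and vault.audit.log_request_failure / vault.audit.log_response_failure telemetry appears in Prometheus; check them against your Vault version before relying on them.

```yaml
# Prometheus rule file (sketch)
groups:
  - name: vault
    rules:
      # Vault cannot be scraped at all.
      - alert: VaultDown
        expr: up{job="vault"} == 0               # hypothetical job label
        for: 1m
      # Vault answers but reports itself as sealed.
      - alert: VaultIsSealed
        expr: vault_core_unsealed == 0
        for: 1m
      # Any audit log write failure in the last 5 minutes.
      - alert: VaultAuditLogFailure
        expr: |
          increase(vault_audit_log_request_failure[5m]) > 0
          or
          increase(vault_audit_log_response_failure[5m]) > 0
```

Depending on your setup, a sealed Vault may not serve its metrics endpoint at all, in which case the sealed state shows up through VaultDown rather than VaultIsSealed.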
Another thing is that in the Alertmanager, it's also nice to use alert inhibition, because you will say, “Yes, but if Vault is sealed, I will be alerted by my application because it's no longer working.” Yes, but there is also alert inhibition. Alert inhibition is a concept in the Prometheus Alertmanager, which means that if you have a certain alert which is firing, then you can decide to silence other alerts. For example, if Vault is sealed, you know your application will fail. Then you can start inhibiting the other alerts, so the only page that you get will be that Vault is sealed.
Prometheus will still collect the metrics for the application, so you will still be able to see what's going on with your application. But the only page that you should get is that Vault is sealed, because that's the actionable alert that you get. All the other alerts, like the application returning 500s because Vault is down, yeah, you probably already know that, and they're not actionable. What's actionable is Vault being down.
So this is an example of inhibition. You can see that we have source_match, which matches on the alert name: alertname is set to VaultIsSealed. And then if you have an alert VaultIsSealed firing, you don't want to receive the alerts that the error rate is too high. You can also add equal at the end, which means that the inhibition only applies when that label has the same value on both alerts, here the same datacenter. So if you have another application in another cluster that has an error rate too high and that does not use that Vault, then you will still get those alerts. The datacenter label has to be the same.
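A sketch of such an inhibition rule in the Alertmanager configuration. VaultIsSealed is the alert name used in the talk; the target alert name ErrorRateTooHigh and the datacenter label name are assumptions for illustration.

```yaml
# alertmanager.yml (sketch)
inhibit_rules:
  # While VaultIsSealed is firing...
  - source_match:
      alertname: VaultIsSealed
    # ...mute the application error-rate alerts...
    target_match:
      alertname: ErrorRateTooHigh      # hypothetical application alert
    # ...but only when both alerts carry the same datacenter label value.
    equal: [datacenter]
```

Recent Alertmanager versions deprecate source_match and target_match in favor of source_matchers and target_matchers; the older form is kept here because it is the one named in the talk.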
In the end, unless you are really running Consul as a service, you should not really try to have 10,000 alerts on your Consul cluster. I would even say that the customers that I have seen with nice Consul dashboards and Vault dashboards just never use them, because they directly see at the application level, in the traces, that the issue is coming from Vault or from Consul. When you want to monitor Consul and Vault, I would not jump directly into making the maximum number of dashboards and the maximum number of alerts that you can set up, because you will just lose a lot of time on things that you will barely use. However, the very few Consul and Vault alerts that we have presented here can really help you pinpoint the cause when you have failures in your application that might be caused by Consul or Vault.
Once again, thank you for being at this talk.