CrowdStrike will discuss the challenges and opportunities presented by introducing service discovery to existing cloud infrastructure via HashiCorp Consul.
CrowdStrike will discuss the challenges of introducing Consul to massive-scale brownfield hybrid cloud infrastructure. They will focus on Consul's service discovery capabilities for two primary use cases: providing an alternative to load balancers, and generating scrape targets for Prometheus.
In this talk, you'll learn how to integrate Vault, deal with large gossip pools and various disaster recovery scenarios, and manage bootstrapping and configuration with Chef and Terraform.
Hi everybody, and thanks for coming today. I hope you've had as much fun at this conference as I have, and a big shout-out to everybody who's working behind the scenes to make this happen for us.
My name is Jaysen. I'm a senior infrastructure engineer on the platform team at CrowdStrike, and I'm here to talk to you today about our journey deploying HashiCorp Consul at scale. A little bit about me: I have some prior experience deploying Consul at a smaller scale in a previous company and was actually at this conference in 2018 as part of that project. So I'm really excited to be back as a speaker.
A little bit of background about CrowdStrike: our company was founded in 2011. We are a cybersecurity company that provides a cloud-native platform for securing endpoints, cloud workloads, identity, and data. We have over 3,000 engineers on staff; we have over 95,000 compute nodes that are split across EC2 and on-prem networks. They're serviced by 40 separate Consul clusters and seven HashiCorp Vault clusters.
A little bit of background about this project. A fleet of this size doesn't happen overnight. We were a startup once. We had a small customer base, and as that grew, the size and complexity of our network infrastructure grew along with it. Problems that had once been easy to solve out of the box became difficult to manage or expensive as we grew. And now, as you've probably seen in your own journeys, there's not always an obvious point at which it becomes worth the time and the investment to adopt some of these transformative technologies like service catalog, discovery, network infrastructure automation, secrets management, even CI/CD (continuous integration/continuous delivery), right? This is the challenge for DevOps in general, I think.
If you're out there right now and you're seeing all of these cool technologies and you're trying to figure out how you can get buy-in from the people that you work with — from your bosses, from the people that make the money decisions — then I would say to find a specific use case that has an obvious value to the business that you can demonstrate, and hyper-focus on that. Set some guardrails. It's really easy to get excited about all of the things that these technologies can do — especially Consul, anymore, can do a lot — so hyper-focus on that, set yourself some guardrails, and sell that to your partners.
For us, that use case had to do with load balancers. They're that classic example of an out-of-the-box solution that is easy and reliable right out of the gate. But as your infrastructure grows, they can become expensive, and there are some technological limitations that you can run into, like a hard limit on the number of instances you can have in a target group, things like that.
We're a really big AWS customer. We traditionally spend millions of dollars a year on simple east-west load balancing. So for us, there was a lot of obvious value in finding a tool that could help us provide an alternative way for just connecting clients to endpoints —simple service discovery and registration.
We came up with a list of requirements for the tool in this space. I'm a HashiCorp fanboy. I knew Consul was the tool. I'd been pushing it since I started here. But I still wanted to demonstrate to the business that this was the tool for the job. And so we came up with a list of requirements: it needed to be fault tolerant, highly available; it had to support our network infrastructure; we wanted robust health checks as good as or better than what we were getting out of the box with our load-balancing solutions. And it needed to easily integrate with DNS.
So we analyzed all the tools that were in this space, put them through this requirement matrix. The few that came out at the other end we built proof of concepts for. We built the clusters, put them through their paces, kicked the tires, and at the end of that process, surprise surprise, Consul came out on top as the best fit for our use case.
A little bit about our network infrastructure. We are hybrid multi-cloud. What does that mean? There's some different definitions for that. We have multiple private clouds — each one is comprised of one or more AWS VPCs (Virtual Private Clouds) and datacenter on-prem regions that are all connected in a flat network. But each cloud is an island. We don't replicate data between them. Each one exists unto itself. Our largest production cloud has around 60,000 compute nodes in it, and the next largest is pretty close behind, coming up on about 30,000.
We knew when we first sat down to build this architecture that gossip was going to be one of our biggest challenges. If you're an enterprise customer, you've got a simpler solution here. You've got an easy way to separate your gossip layer into network boundaries and have a single cluster backend for that. But if you're using open source, then it takes a little bit of creativity. We drew our inspiration from Bloomberg's fantastic case study about 20,000 Nodes and Beyond. Google it, watch it. It's a great talk and it will help set the stage for Consul if you're dealing with it at scale.
In our implementation, we decided to divide our gossip layers by availability zone — AZ1, AZ2, AZ3. Clients in those availability zones have a dedicated Consul cluster that they connect to. The clusters themselves still span AZs for resiliency, but the clients for each of those AZs, both on-prem and in public cloud connect to a dedicated cluster. Not only does this create a logical boundary that's really easy for everybody to grok, but it makes your configuration management dead simple. You don't have to come up with some kind of complicated hashing algorithm or ordinals to decide what clients are going to go to what clusters. Every client knows what availability zone it's in, and it can connect to that easily.
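A minimal sketch of that per-AZ mapping, in Python. The cluster hostnames and the `CLUSTER_BY_AZ` table are hypothetical; the talk does not say how the join addresses are named or stored. The point is only that the lookup is a plain table keyed on something every client already knows.

```python
# Hypothetical mapping from availability zone to the dedicated Consul
# cluster serving clients in that zone. Hostnames are illustrative only.
CLUSTER_BY_AZ = {
    "az1": "consul-az1.example.internal",
    "az2": "consul-az2.example.internal",
    "az3": "consul-az3.example.internal",
}


def join_address(availability_zone: str) -> str:
    """Return the join address of the cluster dedicated to this AZ."""
    az = availability_zone.strip().lower()
    if az not in CLUSTER_BY_AZ:
        raise ValueError(f"no Consul cluster defined for AZ {availability_zone!r}")
    return CLUSTER_BY_AZ[az]
```

No hashing or ordinals are involved: a client in AZ2, on-prem or in EC2, always resolves to the AZ2 cluster.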
This created for us a pretty simple federation model. Essentially, we have six clusters per cloud; we don't have to worry about replicating data between them. And now with some of the new announcements around Consul's new capabilities with peering, we may even take advantage of some of that to extend this further. But for right now, it's a pretty simple federation model.
Within each of these clouds, we make use of prepared queries for a localized form of geo-failover. This just says that if the service in my local availability zone is down, give me the results from the next two closest.
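As a rough illustration, a prepared query of this kind can be described as a small JSON document. The Python sketch below only builds that document; the service name `web` and the choice to set `OnlyPassing` are assumptions, not details from the talk. In Consul's terms each federated cluster is a "datacenter", and `NearestN` asks for failover to that many of the nearest ones.

```python
import json


def failover_query(service: str, nearest_n: int = 2) -> dict:
    """Build a prepared query definition that falls back to the nearest
    federated datacenters when the local one has no healthy instances."""
    return {
        "Name": service,
        "Service": {
            "Service": service,
            "OnlyPassing": True,  # assumption: ignore instances with failing checks
            "Failover": {"NearestN": nearest_n},
        },
    }


if __name__ == "__main__":
    # This JSON body is what would be sent to Consul's /v1/query endpoint.
    print(json.dumps(failover_query("web"), indent=2))
```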
So we've decided how we're going to build this thing, right? We know we're going to need a lot of clusters, and they're going to be spread across EC2 and on-prem networks. Our next challenge was how are we going to provision this: definitely not by hand. It's a lot of work and it's too much. We already have a pretty strong HashiCorp Terraform workflow in our platform team. We use it for deploying applications already, and we also use it for building out new clouds, deploying the VPC infrastructure and all the requirements for that. So that was a pretty obvious choice for us. We built a couple of Terraform modules to make the process repeatable and took advantage of some of the cool stuff like the for loops that you can do to really DRY up our code.
I want to talk a minute about why we didn't put this in Kubernetes. Why are we building this on virtual machines in EC2 and on-prem? At CrowdStrike, we do use Kubernetes extensively, but we build it the hard way, from scratch. There are provisioning requirements there for just maintaining the ongoing health of the clusters and for building new ones. We wanted to avoid a lot of complicated interdependencies between Consul and Kubernetes that could make it difficult to either stand up new clusters or to resolve problems with existing clusters.
We know how we're going to build it, we know where — now we have to decide how big it's going to be. There's a lot of great information online about sizing. In fact, this graphic was cribbed from HashiCorp's own website. The thing that you want to keep in mind is that your size needs to be tailored to your use case. This goes back to what I was saying earlier. Set the guardrails on what your service is going to do and hyper-focus on that. That will help you in this garden of forking paths of configuration options and sizing. It's going to help you focus in on — what are the most important elements to maintain the stability and performance of your cluster.
The things to keep in mind — you've probably heard this in some other talks — Consul is write-limited by disk I/O, it's read-limited by CPU, but memory is a factor either way. You need to allocate two to four times the amount of memory as your working set of data within Consul. If Consul's internal data set is a GB, you need at least 4 GB of memory. And remember that the catalog, service information — all that's included. Even if you're not using KV or some of these other capabilities, there is still a data set that you need to consider. Give yourself the ability to scale vertically as needed, and let your metrics be your guide on that.
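The memory rule of thumb is simple arithmetic, sketched here in Python. The function name is hypothetical; the multipliers come straight from the two-to-four-times guidance above, and the 1 GB example in the talk corresponds to the upper end of the range.

```python
from typing import Tuple


def memory_range_gb(working_set_gb: float) -> Tuple[float, float]:
    """Return (lower, upper) memory in GB using the 2x to 4x rule of thumb."""
    if working_set_gb <= 0:
        raise ValueError("working set must be positive")
    return (2 * working_set_gb, 4 * working_set_gb)


if __name__ == "__main__":
    low, high = memory_range_gb(1.0)
    print(f"1 GB working set: plan for {low:.0f} to {high:.0f} GB of memory")
```

The working set passed in should include the catalog and service data, not just KV.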
Once we decided where to build Consul, how big it's going to be, we had to come up with a security plan. As you've heard elsewhere, Consul has three pillars to its security framework:
Mutual TLS (mTLS) certificates
Serf gossip encryption
Access control lists (ACLs)
As you've probably seen with other applications, you need to strike a balance between security, usability, and functionality. These three elements exist as separate axes on a pyramid, and the closer you get to one, the further away you get from another. With Consul, you're going to need to strike that balance for each of these three components.
For us, even though this is all internal east-west traffic, we're a cybersecurity company. It was super important for us to enable these methods and enforce them — like we heard earlier, assume breach. We knew, though, that we needed a secure and reliable way to manage and deliver these secrets across all of these sprawling clusters.
This is where HashiCorp Vault comes in, and I get really excited because I love Vault, and I hope you guys do too. It's one of the most amazing tools in this space, in my opinion, and I just personally enjoy working with it.
The first problem we had to tackle was Consul server certificates. In theory, I guess you could use your existing public key infrastructure (PKI). We obviously have our own preexisting PKI at CrowdStrike, but since these Consul server certificates are being used for mTLS (mutual transport layer security), they're part of your authentication mechanism. And so you want them to be signed by a dedicated certificate authority (CA). You need to have certain things in the SAN (subject alternative name) field to help with the functionality of the cluster, such as localhost and the domain names for each of your federated clusters, and you don't want the certificates to be long-lived.
I'm not going to cover too much of this. I think we've all heard enough about static and dynamic secrets. Dynamic secrets, good; static secrets, less good, right? But it's part of the crawl, walk, run model, and you're going to need to incorporate them at some point. As you'll see here in a little bit, we do have one static secret that we're using for this model.
For our dynamic secrets, we're taking advantage of Vault's PKI secrets engine to create our dedicated CA and to generate and deliver the certs for our Consul servers on demand. We use the Vault agent running on each Consul server. It authenticates with whatever mechanism is appropriate for where that cluster lives. For example, in EC2 we take advantage of IAM authentication. The agent will authenticate, fetch your cert, write it to disk, and reload Consul. When the cert reaches two-thirds of the way to expiration, the agent automatically requests a new one, writes it to disk, and reloads Consul.
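To make the two-thirds timing concrete, here is a small Python sketch that computes the renewal point from a certificate's validity window. It only illustrates the arithmetic and is not the Vault agent's implementation; the function name and the 72-hour lifetime in the example are assumptions.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, not_after: datetime) -> datetime:
    """Return the moment two-thirds of the way through the validity window."""
    if not_after <= not_before:
        raise ValueError("certificate validity window is empty")
    lifetime = not_after - not_before
    return not_before + (lifetime * 2) / 3


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(hours=72)  # hypothetical 72-hour certificate
    print("renew at:", renewal_time(issued, expires).isoformat())
```

With short lifetimes, renewing well before expiry leaves time to notice and fix a failed renewal.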
Not only does this make our life easier as operators so that we don't have to deal with the delivery and rotation of these certificates when they expire, but it makes the provisioning for these new clusters a snap. You get Vault up, you configure it correctly to deliver its PKI, and the very first Consul node that comes up in your primary datacenter has certificates right from the start. You don't have to go through some complicated provisioning process where it comes up disabled, you add the certs, and then you enable it. Right from the start, you've got mTLS enabled.
The Consul client certificates — I guess in theory, you could probably use the Vault agent for those as well. But you don't need to because Consul Connect acts as an internal CA for Consul and allows your clients to present a CSR (certificate signing request) for a leaf certificate to be signed by the Consul backend automatically. You don't have to deliver any certificates except for the public CA certificate. (You don't technically have to if you disabled verification, but you should because you want to make sure your clients are connecting to the correct backend.)
Something to keep in mind when we're talking about doing this at scale is that there is a CPU cost to this operation. So it's a good idea to limit the number of cores that are available for signing so that you don't overwhelm your backend when you're initially provisioning all of your clients or maybe when your root CA expires.
The next challenge with security is the access control list. If any of you have dealt with Consul's ACLs, you know that they can be a little bit tricky. You can run into some chicken-and-egg scenarios. I would recommend that if you can and if it works in your workflow, take advantage of Vault's dynamic Consul backend. That will give you the same benefits that we talked about with the certificates, but for your Consul ACL tokens.
We decided not to go that route for a couple of different reasons, one of which is that we already have an existing secrets management system that's tied into everybody's workflow. They're already familiar with how to use it, and it has its own way of dealing with permissions on the backend. For us, it made sense to generate these tokens and store them statically. This goes back to that balancing act. We get some additional usability at the sacrifice of some security that we would get if these were shorter-lived credentials. It also keeps Vault out of the hot path and means we don't have to put agents on a hundred thousand nodes. But we do use Vault to manage tokens for human users and for CI/CD.
Gossip encryption, the next element of this, is a lot simpler. It's a string. As far as I know, there's no way to manage it dynamically, but you do want to rotate your gossip encryption keys, tokens. So you want to give yourself the ability in whatever your configuration management or operations tools are to disable the verification — the incoming and outgoing verification for gossip — so that you can rotate that key at some point in the future. This gossip encryption key is our one static secret. We actually store that in Vault rather than in our traditional secrets management system, for reasons that I'll talk about here in a minute.
I want to talk a little bit about Vault post-provisioning. These things that we've talked about — these dynamic secrets backends and these authentication engines — they have a lot of moving parts. There's rules, there's policies, the engines themselves, there's a lot of different tuning options. When we did our initial proof-of-value deployments in our lower environments, we basically just worked through the process of enabling all this by hand and translated that into shell scripts. But that quickly became unsustainable. As you learn, as you grow, as you deploy these clusters, you're going to want to come back and tune things. You're going to want to maybe shorten some TTLs, maybe lengthen some other ones, change some token types from service to batch. You're going to want to remain adaptable with that. And if you've got shell scripts that you're using to manage this, it's going to be a real pain.
We took advantage of the Vault provider for Terraform to do a complete end-to-end configuration of Vault. In our provisioning process now, we bring up that first Vault node in the cluster, we initialize it so that we can get our root token, and then we run our Terraform, and everything else is done. From that point, we can stand up Consul, and Vault's there, ready to handle all of the dynamic secrets. And as operators going forward, when we want to tune these things, when we want to add policies, change things, we've got a really simple way to do it. And it's infrastructure as code, right? That should be our goal for everything.
As I've touched on, we use Chef — we are a Chef shop for our traditional VM-based infrastructure. We wrote cookbooks for Vault and Consul from the ground up for this project. We don't use a golden image model, so our last mile with Chef is long. But that also gives us a lot of flexibility in exposing some of these different configuration endpoints and such, to give us the ability to tune the configuration of these clusters as we need.
If you're going this route and you're working with Chef and cookbooks, I highly recommend you set up a testing pipeline, use Test Kitchen, make sure that you're actually testing these things. If you can, use ‘kitchen-nodes’ to build a real cluster with Test Kitchen that connects everything together. That's going to give you a lot of confidence when you're making changes before they even hit your development environment.
We also use Chef as our mechanism for fetching these two initial secrets. As I mentioned, the Consul client certificates — we need to provide the public CA cert, and we need to provide a gossip encryption key. We didn't want to have to put the Vault agent on everything, but we need everything to be able to fetch these secrets. Our existing secrets management tool is not currently consumed by every single node. We've got NTP (Network Time Protocol) servers — certain things that don't need secrets — and we didn't want to make that a requirement for registering with Vault. We want every compute node on our network registered with Consul, even if it's just sitting there as part of the catalog.
So we came up with a method that allowed us to do a semi-anonymous authentication using AppRole combined with some information that's already available on any node that's being provisioned with our normal provisioning process. We wrote a custom library in Chef that will do the authentication with that semi-anonymous authentication method. We use batch tokens so that they're just a throwaway. We only need them for a few minutes for the duration of that Chef run, and then we throw them away. If you haven't used batch tokens or looked into that, they're a fantastic use case for this model, when you don't need to be tracking leases. They're super short-lived; these batch tokens have very little performance impact on Vault. And when you're dealing with tens of thousands of nodes, that's really important, because as we'll touch on later, your lease count in Vault is one of your biggest indicators of how it's going to perform. The more leases you get, the more CPU and the more disk you're going to consume on Vault.
We know how we're going to build everything, how we're going to secure it — how are we actually going to connect things together? This ended up being a little bit more of a complicated problem than we originally imagined. Right out of the gate, I was super excited about cloud auto-join. We're going to use tags and we're going to hook everything up together that way. And that was how I did the server clusters. It was great.
Then when I went to start connecting clients to this, I realized that we were going to have to grant EC2 DescribeInstances permissions (ec2:DescribeInstances) to everything in our EC2 fleet. That's not a great security model for us. There's a lot of additional things you get out of that besides just the ability to read tags. And we also needed a way to grant our on-prem hosts either the same permissions through some wacky way that they can read tags, or else give them an alternate method.
In the end, we ended up settling on load balancers as our model. But I do want to call out that there is an issue if you are using cloud auto-join where you cannot pass a custom EC2 endpoint into the join. Apparently, this is a problem with the underlying ‘go-discover’ library. It became an issue for us because we're in the process of restricting some egress, and we don't want to be going to these public endpoints. If any of the maintainers are out there, if you're listening and you want to solve issue 104 on the GitHub page, I will buy you a pizza.
Once our clusters were built and we actually began operating Consul as a service and onboarding our partners, we began to realize that dealing with Consul’s policies and tokens manually was going to be a problem. In the beginning we were just creating the policies and tokens for our servers, and that was all that we really needed. But as we began to onboard stuff — it was the same problem with Vault, right? You've got this sprawling infrastructure. Even though all your policies and tokens are managed on your primary datacenter — or they should be, you should replicate to make your life easier — we've still got all these different clouds that we have to deal with, and clicky-clicking through the UI, typey-typing, is just not a sustainable activity. We're a really small team, and we don't want to spend all day doing that.
Again, we leveraged a Terraform provider. The Consul provider for Terraform will let you manage all of your objects with code. And not only did this make our lives easier, but it enabled us to provide a self-service model for our partners, our internal customers. As I mentioned, we're using open source; we don't have the ability to do any multi-tenancy. This is a service that we're providing for our other customers within CrowdStrike.
This model, using Terraform to define these policies and provision these tokens, gives you a way of providing self-service. When one of our customers has a new service that they want to onboard into Consul, they just need to make a PR (pull request). That gives us a chance to review it; we can make sure there are no conflicts with other services, that the name's going to be resolvable through DNS, that it doesn't exceed a character limit, all of that good stuff. We can do some validation, and then when we merge that PR, we can have a CI/CD process that comes along, authenticates with Vault, gets a token to Consul, and applies the changes for us. That takes a lot of the operational burden off of us during onboarding and really helps us accelerate that process.
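The review checks described here (name conflicts, DNS resolvability, character limit) are the kind of thing a CI step can automate. The Python sketch below is an assumption about what such a check might look like, not CrowdStrike's actual validation: it treats a service name as a single DNS label, so it enforces the 63-character label limit and a lowercase letters, digits, and hyphens character set.

```python
import re
from typing import Iterable, List

MAX_LABEL_LENGTH = 63  # maximum length of a single DNS label
_DNS_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def validate_service_name(name: str, existing: Iterable[str]) -> List[str]:
    """Return a list of problems with a proposed service name (empty if none)."""
    problems = []
    if len(name) > MAX_LABEL_LENGTH:
        problems.append(f"name exceeds {MAX_LABEL_LENGTH} characters")
    if not _DNS_LABEL.fullmatch(name):
        problems.append("name is not a valid lowercase DNS label")
    if name in set(existing):
        problems.append("name conflicts with an existing service")
    return problems


if __name__ == "__main__":
    registered = {"web", "billing-api"}  # hypothetical existing services
    for candidate in ("search-api", "Bad_Name", "web"):
        print(candidate, "->", validate_service_name(candidate, registered) or "ok")
```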
One thing to mention in this model: why limit user access? One of the things is that the ACL system doesn't let you scope your permissions on these objects very well. So if you have permission to create policies, to edit policies, you have permission to edit all the policies — your own policy, anonymous policy, whatever. That was another reason that in this model we wanted to not provide a whole lot of direct access. But we do update the anonymous policy to provide greater visibility into Consul without having to provide a token. We want everything to be able to read from the catalog.
Also on the topic of operations, you want to automate your upgrades. Coordinating upgrades and restarts with this many clusters is already a challenge, but there are some prescriptive practices when it comes to upgrading Consul and Vault. In the case of federated Consul clusters, you want to upgrade the primary datacenter first, the secondary datacenters last. And on a per-cluster level, you want to upgrade the leader last — all the followers first, followed by the leader.
You want to do the same thing really, if you've got to restart for some reason. Let's say you've made a configuration change that can't be applied with just a simple reload. You don't just want to fire off restarts all at once, randomly do it. You want to be in control of that process. And furthermore, you want to do it in a way where you maintain the health and stability of the cluster. This means that after each node, you should wait until Autopilot reports a healthy status for the previous node before you move on to the next one. If you're doing this by hand and you've got 40 clusters times five nodes each cluster, you could spend all day in front of your terminal just staring at output and hitting a button when things are ready to go. So it will really help you to come up with some automation — whether it's through Ansible or some Python tooling, whatever your choice is — to make your life easier as an operator, because you want to have a low bar for upgrades.
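A minimal Python sketch of that ordering and wait loop. It assumes the input looks like the response from Consul's Autopilot health endpoint (a `Servers` list whose entries carry `Name` and `Leader`); the `restart` and `is_healthy` callables are hypothetical hooks you would wire to your own tooling, such as Ansible or SSH.

```python
import time
from typing import Callable, List


def restart_order(autopilot_health: dict) -> List[str]:
    """Order servers so that followers come first and the leader comes last."""
    servers = autopilot_health["Servers"]
    followers = [s["Name"] for s in servers if not s["Leader"]]
    leaders = [s["Name"] for s in servers if s["Leader"]]
    return followers + leaders


def rolling_restart(
    order: List[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 300.0,
) -> None:
    """Restart one node at a time, waiting for each to report healthy."""
    for name in order:
        restart(name)
        deadline = time.monotonic() + timeout_seconds
        while not is_healthy(name):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{name} did not become healthy in time")
            time.sleep(poll_seconds)
```

Stopping on the first timeout is deliberate: if a node does not come back healthy, the remaining nodes are left untouched so the cluster keeps quorum.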
As others have mentioned, you don't want to get more than two versions behind with Consul — I'm guessing with Vault, too — because you could end up with some backwards compatibility issues. Even if you're not interested in a new feature that's coming out, you're not interested in some security fix that's there, you want to stay on top of these upgrades so that when something does happen, when there is something critical that comes out, you don't have a complicated upgrade path to get there.
Along with that, you want to bring your own binaries. You want to roll your own packages. I'm not saying you have to compile everything from source (maybe you could). But you don't want to rely heavily on community vendor packages, as convenient as they are, because you're really buying into whatever their prescriptive model of how Consul or Vault needs to be installed — what the data directories are, how the service file is created. You might want to have some more control over that. I highly recommend that you take control of that packaging process, and also just to avoid some of these supply chain attacks that are becoming really common.
So it's a good idea, package your binaries. If you don't have a model for that already and it seems intimidating, take a look at fpm-cookery. It's a fantastic project. We make really heavy use of it at CrowdStrike. Again, now you're talking about all of your package definitions existing as code. A simple PR when a version changes with the new md5sum — you're good to go, it builds your package. Publish it in your artifact management system, and you can avoid these supply chain issues and some of the unpredictability that comes with using other people's packages.
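The checksum in a package definition is what ties the build to a specific upstream artifact. The Python sketch below shows that verification step in isolation. It is not fpm-cookery's implementation, and it defaults to SHA-256 rather than the md5sum mentioned above, which is a substitution made here for illustration; the function name is hypothetical.

```python
import hashlib
import hmac


def checksum_matches(payload: bytes, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Return True if the payload's digest equals the pinned checksum."""
    digest = hashlib.new(algorithm, payload).hexdigest()
    # compare_digest avoids leaking where the two strings first differ
    return hmac.compare_digest(digest, expected_hex.strip().lower())


if __name__ == "__main__":
    artifact = b"pretend this is a downloaded release archive"
    pinned = hashlib.sha256(artifact).hexdigest()  # would live in the package definition
    print("verified:", checksum_matches(artifact, pinned))
```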
Now let's talk about the most exciting part, which is monitoring, right? Even when you're operating at a small scale, your monitoring dashboard is your cluster. It is the x-ray view into the body of the beast that you are dealing with. It's going to tell you information about the health and the activity of the backend that you're not going to get from looking at the UI. It's going to tell you about the gossip network, API latency, leader elections, Autopilot health, all of these really valuable indicators of problems you want to be highly visible and readily available to you.
Same with Vault, right? You're going to get all of these system performance and cluster activity metrics, lease counts like I talked about, and of course disk space, memory, CPU… Disk and memory are particularly important. You do not want to run out of disk, you do not want to run out of memory, or you could get yourself into a situation where it's really difficult to recover from that because you've affected your data integrity. Hopefully you're taking backups, but you still don't want to get into a place where you have to restore from backup.
If you're using a tool like Grafana for your visualizations that lets you annotate, do that. Because when you first go to implement all this, you're going to do (hopefully) a whole lot of research on what are the valuable metrics: what am I looking for, what's normal, what's not? And then six months later when you have a problem in the middle of the night, you're going to go look at this dashboard, and you're going to not remember what this is supposed to indicate. Is this bad? Is this good? What does it mean that my oldest Raft log is seven days old? Is it supposed to be shorter than that? You want to annotate your panels to give yourself a little bit of a clue about what you're looking for, not only for yourself, but for other people on the team and for the newbies that you bring on to help you support this thing. You want to give them some guidance.
Another thing I want to point out is that when you're dealing with these clusters in the default configurations, especially for open source, there's one node in each cluster that's really doing the heavy lifting. That changes a little bit with performance standbys, things like that. If you're an enterprise customer or if in Consul you're doing stale reads, that changes a little bit. But just assume that there's one node in your cluster that's really doing the heavy lifting. So you don't want to rely too heavily on averages because something critical could get lost in the mix there. You want to have somewhere really visible on your dashboard, the max indicators, the worst indicators. I want to know the least amount of disk space that I have. I want to know the most amount of memory utilization that I have because that's going to focus you in on where your problem is.
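A small Python example of why averages can mislead here. The node names and memory figures are made up for illustration: one node (the one doing the heavy lifting) is near its limit while the cluster average still looks comfortable.

```python
from statistics import mean
from typing import Dict


def summarize(values_by_node: Dict[str, float]) -> dict:
    """Return the cluster average alongside the worst single node."""
    worst_node = max(values_by_node, key=values_by_node.get)
    return {
        "average": mean(values_by_node.values()),
        "max": values_by_node[worst_node],
        "worst_node": worst_node,
    }


if __name__ == "__main__":
    # Hypothetical memory utilization (percent) for a five-node cluster.
    memory_pct = {
        "consul-1": 92.0,  # the busy node
        "consul-2": 35.0,
        "consul-3": 38.0,
        "consul-4": 33.0,
        "consul-5": 37.0,
    }
    print(summarize(memory_pct))
```

In this made-up data the average is 47 percent while the busiest node sits at 92 percent, which is why the max panel belongs at the top of the dashboard.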
As my demo picture here, I've got an example of what I mean with these dashboards. As you can see, we've got some of the most important gauges at the top. We've got those max indicators that I talked about, and we also have TLS certificate expiry. We're going to trust that Vault's doing its job, absolutely, but we're going to verify as well. Because if for some reason you have a problem with your agent or you have a problem with the Vault backend and those certificates don't renew, when they expire you're going to have a bad time. And so you want to stay ahead of that. You want to make sure that that's highly visible in your monitoring.
I also like to keep an eye on the registration, that line bar that you see there. That's going to help you identify if you've got flapping in your gossip layer. You want a fairly flat line there that doesn't have a whole lot of ups and downs — that could indicate that you've got a problem.
That's the stuff that we keep at the top. Then as you scroll down through these, you're going to have some additional information about the application metrics. These might not be things you need at the top, they might not need to be as readily available. But when you spot a problem, you want all of these different metric points available to help you correlate, to help you understand what might be happening. You want to know if you've got leader elections happening all the time, you want to know if your Autopilot health is flapping up and down. And so as you go further down in your dashboard, you can put that additional information. Then of course, somewhere in there, you should have the actual individual system metrics that have the CPU, the memory, the disk that we talk about. You don't want to run out of memory, you don't want to run out of disk space.
Vault's very similar. We've got the gauges at the top that give us those key indicators of our performance and the health of our cluster. Like I mentioned, leases — that's right there, front and center. That's one of the most important things that you need to be looking for. And again, as you scroll down, you can have your additional information that tells you about some of the more specific performance indicators you can have with the Vault monitoring. We're using Prometheus metrics here, by the way, but they're the same endpoints, whether it's the built-in or Prometheus. And we try and identify all of the different backends, the PKI, Consul, whatever we're using. We want metrics from each of those.
You're not going to catch everything with metrics alone. As valuable as they are, they won't tell you the whole story. You need to aggregate the logs from your servers and your clients. This is going to help you spot problems that might go unnoticed in your metrics, especially at the client layer. It might not be practical for you to gather Consul metrics from every single agent in your network, but you still want to have an indicator of problems — things like certificates that may have expired, problems getting the leaf certificate on the clients, problems with ACL tokens, that kind of thing. Especially in our big cloud, 60,000 nodes, if a hundred nodes are having a problem, that's really easy to escape our notice. But if we've got good log aggregation in place, and especially if we have some kind of dashboard for that same visibility, then it's going to help us get ahead of problems that may not actually be impacting the cluster but could be impacting a customer experience or preventing us from having visibility in something that we need to.
Don't forget about alerts. I know you won't, but don't. You're not going to catch everything staring at dashboards every day, scrolling through, clicky-clicking. You need reliable alerting. You need to get ahead of problems. A lot like monitoring, there is a delicate art to identifying the right metrics and setting the right thresholds in a way that is going to allow you to spot a problem before it becomes an emergency but isn't going to wake you up every night with false positives. There are some really great resources online for helping you identify some of these key metrics that you should alert on and even giving you some thresholds to get you started. I've linked those on this slide here. But really, what I've called out here is the internal Consul and Vault documentation about this.
Also, I don't know how often people think of it, but Datadog's documentation for monitoring is fantastic. Even if you don't use Datadog, they have really detailed information about these different metric endpoints and some of the similar stuff that HashiCorp is telling you about thresholds that you need to set. Just be aware if you're looking at those documents that some of those endpoints are exclusive to the Datadog agents, so your mileage may vary there.
Something else to call out with alerts is that if you've got the ability, you want to create multiple tiers of alerts. Not everything is an emergency. There's only really certain things that you're necessarily going to want to be woken up about in the middle of the night. As you can guess, those are going to be things like lease counts, memory, disk — don't run out of memory, don't run out of disk. You want to be woken up about those things so that you can get ahead of those problems. Otherwise, when you come in the morning, you might not have an easy time recovering from that.
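A minimal sketch of the tiering idea, in Python. The metric names and every threshold below are hypothetical placeholders, not values from the talk; the point is only that each metric has a "wake someone up" level and a lower "look at it in the morning" level.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Tier(Enum):
    PAGE = "page"      # wake someone up
    TICKET = "ticket"  # handle during working hours
    NONE = "none"      # no alert


@dataclass(frozen=True)
class Rule:
    metric: str
    ticket_at: float
    page_at: float


# Hypothetical metric names and thresholds, for illustration only.
RULES = (
    Rule("vault_lease_count", ticket_at=20_000, page_at=50_000),
    Rule("memory_used_pct", ticket_at=80, page_at=90),
    Rule("disk_used_pct", ticket_at=75, page_at=90),
)


def classify(metric: str, value: float, rules: Sequence[Rule] = RULES) -> Tier:
    """Return the alert tier for one metric reading."""
    for rule in rules:
        if rule.metric == metric:
            if value >= rule.page_at:
                return Tier.PAGE
            if value >= rule.ticket_at:
                return Tier.TICKET
            return Tier.NONE
    # Metrics without a rule never alert in this sketch.
    return Tier.NONE
```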
Metrics — they're important for catching problems. They're critical for that, as we're saying. But something that you may not think about is that the metrics also help you learn what normal looks like for your cluster. If you're new to Consul, or if you're deploying Consul at a scale you've never done before (as I am), you don't really know what normal looks like. So it's a really good idea to get your metrics in place right from the start. I know sometimes, especially for me, monitoring becomes the afterthought. “Now it's GA and now I’ll set up some monitoring and alerts for it.” In this case, I did it right from the ground up. The very first dev cluster that went up, before I did anything in prod, we had monitoring in place because I was able to use that to establish a baseline for what healthy looks like at these scales.
I know I said you shouldn't stare at your dashboards, but I have been staring at these dashboards for the past few months, every morning when I come in, because I like to know what normal is. That way when I have a problem, I can go and I can look in there and I can say, "Okay, yeah, these response times look normal" or "these response times don't." So the monitoring is great, not just for catching problems, but for helping you to understand the health of your cluster.
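The "know what normal looks like" idea can be sketched as a simple baseline check: record a metric while the cluster is healthy, then compare new readings against that history. This Python sketch assumes a mean and standard deviation baseline with a three-sigma tolerance; both choices are assumptions for illustration, not the method used in the talk.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def build_baseline(history: Sequence[float]) -> Tuple[float, float]:
    """Summarize healthy-period samples as (average, standard deviation).

    Needs at least two samples, because stdev() is undefined for one.
    """
    return mean(history), stdev(history)


def looks_normal(value: float, baseline: Tuple[float, float], tolerance: float = 3.0) -> bool:
    """Return True when value is within tolerance deviations of the average."""
    avg, spread = baseline
    if spread == 0:
        # A perfectly flat history: only the same value counts as normal.
        return value == avg
    return abs(value - avg) <= tolerance * spread
```

For example, with response times collected during the bake-in period as `history`, `looks_normal(latest, build_baseline(history))` gives a quick answer to whether today's response times look like the ones you have been seeing every morning.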
On that topic, I want to highly recommend that you do your deployments and your rollouts in a way where you give yourself a buffer. The way that we did it, we built these clusters months before we ever actually onboarded anybody. We registered everything, all of the clients, and then we just let it bake just to see what the health of the gossip layer was going to be like. It gave us a chance to slowly onboard some friendly services, make sure that there's no hiccups in our provisioning process, so that by the time we actually began onboarding partner services — our internal customers — we already had a really high confidence that this thing was going to be stable and it was going to work for us.
In the remaining time I want to talk about some of the problems that we ran into during this process. I've mentioned this a few times now: Vault lease counts. High lease counts hurt Vault performance. It's as simple as that. There are some different scenarios that you can run into that lead to lease counts blowing up. There are some improvements in the Vault agent in 1.12 that are really going to help with this. But up until this point, a good example of a way that you could wake up in the morning with a hundred thousand leases has to do with the way that the Vault agent works.
As we mentioned, the Vault agent authenticates to Vault, and then he gets the certificate that he needs, he writes it to disk, and he reloads the Consul agent. Let's pretend that the Consul agent isn't configured with the correct gossip encryption key or doesn't have one at all, and it can't start. The Vault agent is going to authenticate with Vault. He's going to fetch that cert, he's going to try to reload, and then he is going to fail, and he is going to do that same thing over and over and over and over again, endlessly — out-of-the-box behavior. A collection of misbehaving nodes could cause tens of thousands of leases to accumulate. The problem with this is that when your lease count gets that high, Vault starts having trouble responding to normal API calls, so you're going to end up in a situation where you're not able to recover because you can't issue the calls that you need to recover from this.
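To get a feel for how quickly that retry loop adds up, here is a back-of-the-envelope calculation in Python. It assumes each failed attempt leaves exactly one new lease behind and that every broken node retries at a fixed interval; the node count, interval, and duration are made-up numbers, not measurements from the talk.

```python
def leases_accumulated(nodes: int, retry_interval_s: float, hours: float) -> int:
    """Estimate leases created by nodes stuck in a fetch-and-fail loop.

    Assumes one new lease per retry per node, with no leases expiring
    or being revoked during the period.
    """
    attempts_per_node = int((hours * 3600) // retry_interval_s)
    return nodes * attempts_per_node


# Hypothetical scenario: 50 broken nodes retrying every 15 seconds overnight.
overnight = leases_accumulated(nodes=50, retry_interval_s=15, hours=8)
print(overnight)  # 96000
```

Under those assumptions, a few dozen misconfigured nodes are enough to approach a hundred thousand leases by morning.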
So far, all of the high lease-count issues that we've seen have been related to the Vault agent. Like I said, there's some new behavior in there. The 'exit_on_retry_failure' setting is not super new, but we did just recently implement it, and that's going to help with this. The new certificate revocation method is really going to make a big difference here. And there is a newish template function called pkiCert (it apparently has some bug fixes in 1.12 that are going to allow us to implement it) that is an alternative to the secret function. It changes the behavior so that it doesn't just automatically ask for a new cert every single time. It has some idempotency in there, where it can verify whether the existing cert has expired and save you some lease counts. And then also, if worse comes to worst, you do have the ability to not create leases for these certs, but we didn't want to go with that hammer. We do actually want the ability to revoke these if we need to.
All right, mass certificate revocations: that goes hand in hand with this. Again, that's something that's going to be resolved with these new certificate revocation methods that are in the new version of Vault. But under the current model, up until this point, Vault has a hard limit on the certificate revocation list. It's stored in the KV, and so it has a limit on its size. If you exceed that, which I have done a handful of times now, you will not be able to fix it. If you exceed that limit, you won't be able to queue up any more revocations. They won't just expire on their own when they reach the end. And as far as I can tell — and from what I've been able to gather from talking with the Vault folks — your only option is to recover from backup or to rebuild your cluster.
Gossip network issues. This is another really interesting one. You've probably heard — I hope you've heard — every node in your gossip pool has to be able to communicate with every other node. This is serious, it's not just a suggestion like a Stop sign. You have to do this. If you have persistent communication issues in your gossip layer, one network segment that will not be able to communicate with another, it's going to cause you problems. We had an issue when we first deployed our dev cluster where we had a VPC (virtual private cloud) that by design could not communicate with another VPC, and there was a lot of flapping in the gossip network. I didn't care, because it was a dev cluster, but at some point we had to migrate the Consul backend to new subnets, which required destroying those instances and rebuilding them.
We ran into an incredibly strange issue where those server nodes that we deleted would start showing back up in the gossip layer as healthy initially. Their first hit back in the gossip layer was with a new Lamport time, and they were healthy, and that was causing them to show up as potential voters in the Raft list. I didn't notice this until we had cycled all five servers, and then we ended up in a scenario where 10 servers were showing up in the Raft list, and Operator was getting really unhappy: nodes were coming up, coming down, leader elections... Again, this is not a suggestion. Your nodes absolutely all have to be able to communicate with each other, at least in a persistent way. If you have an outage or something, this isn't going to cause this. But the long-term fix for this was to stop Consul everywhere and delete the Consul data directory and start everything up. You don't want to get into a scenario like this.
Again, don't run out of disk space. You could corrupt your Raft DB, you could end up having a really bad time. Storage is generally cheap. Overprovision.
Practice your recovery procedures. I can't stress this enough. You don't want to wake up in the middle of the night and be fumbling around on HashiCorp's excellent documentation about recovering quorum with peers.json, or trying to figure out where your backups are stored and how to copy them down to restore. There's a certain process involved in all of those recovery options, so you want to practice them. And you want to give yourself some chaos engineering. Cause problems in your dev and staging clusters, and give not only yourself but the other people on your team — new people that you bring on, whatever — give them the chance to work through these things.
A quick note about snapshot backups: if you're an open source user, you're very likely going to need to write your own scripts. If you've enabled ACLs on Consul, you're also going to need to factor that in. That action requires a token, so you're going to need to cook up a way to do that. Same with Vault — Vault snapshots require authentication. Make sure to factor that into how you write your policies.
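As a starting point for such a script, here is a minimal Python sketch that builds authenticated snapshot requests. It uses Consul's `/v1/snapshot` endpoint with the `X-Consul-Token` header and, assuming Vault runs on integrated (Raft) storage, Vault's `/v1/sys/storage/raft/snapshot` endpoint with the `X-Vault-Token` header. The function names, addresses, and timeout are hypothetical, and the tokens need policies that allow taking snapshots.

```python
import shutil
import urllib.request


def consul_snapshot_request(addr: str, token: str) -> urllib.request.Request:
    """Build a GET request for a Consul snapshot, authenticated with an ACL token."""
    return urllib.request.Request(
        f"{addr}/v1/snapshot",
        headers={"X-Consul-Token": token},
        method="GET",
    )


def vault_snapshot_request(addr: str, token: str) -> urllib.request.Request:
    """Build a GET request for a Vault Raft snapshot.

    Assumes Vault uses integrated (Raft) storage.
    """
    return urllib.request.Request(
        f"{addr}/v1/sys/storage/raft/snapshot",
        headers={"X-Vault-Token": token},
        method="GET",
    )


def save_snapshot(request: urllib.request.Request, dest_path: str) -> None:
    """Stream the snapshot response body to a local file."""
    with urllib.request.urlopen(request, timeout=60) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
```

A scheduled job could call `save_snapshot(consul_snapshot_request(addr, token), path)` and then copy the file to wherever your backups live. Alerting when that job fails is what protects you from the broken-backup-script scenario.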
I don't hear this a lot, but I want to tell you that if you build your Vault cluster properly, you can just build it from scratch. If you really end up in a scenario where you can't recover — "Oh crap, my backup script was broken and I didn't notice and I don't have backups" — Vault should be mostly ephemeral.
This is why you want to avoid static secrets as much as possible. We have the one static secret, super easy to put that back. But if you've built Vault right, it's ephemeral, you can delete it, bring it back up, and everything will just work. Those Vault agents don't need some fancy reconfiguration or anything. As soon as that API's back up and you've run your Terraform — another reason to use Terraform: in case you can't recover from backup, all you got to do is run that Terraform again, you got all those objects back. Everything just works again. I can tell you that from experience, because I've had to do this several times now as well.
Another thing to know, especially in the beginning: for us, we're talking about reducing our reliance on load balancers, but as we do these migrations, we don't just immediately go and delete the load balancer. Leave it up for a little bit and give yourself and your customers a chance to gain some confidence in Consul's capabilities. If, God forbid, you run into a problem or Consul goes down and one thing can't talk to another, you should be able to just update your DNS to point to the load balancer rather than to Consul. And then you give yourself a nice way to have a workaround for those connectivity issues.
In the last little bit here, just some future plans. We want to do a lot more automation. We want to continue to make our jobs easier as operators. We're very interested in ChatOps. Our data infrastructure team at CrowdStrike has done some really outstanding work with ChatOps that we want to incorporate into our workflows.
We're interested in service mesh. As I mentioned before, we're restricting egress and dealing with some of these security improvements. So service mesh is a compelling option for us as long as we can integrate it in a way that doesn't affect our core use case that we're hyper-focused on, which is service registration and catalog.
Also, now that we've got these Vault clusters up and in place, we're going to continue looking at additional use cases for them. Our implementation right now is a very light touch on Vault. We don't have a whole lot of performance impact with what we're doing. So we have some additional opportunities there to improve our internal processes with some of these additional dynamic secrets, and we may even find some ways to augment our existing traditional secrets management tool with Vault.
With that, I really appreciate everybody's time. If you have any questions about this, if you want to talk about it, I've been eating, drinking, sleeping, and breathing it for the last six months, and I'm super excited about it. So I'll be happy to talk to you afterwards. Thank you, everybody.