Case Study

Health checking as a service, with Serf

This talk explains how Fastly used Serf to build a distributed health-checking system. The design borrows techniques from machine learning, signal processing, and control theory to drive stable traffic allocation while quickly and accurately identifying failures.

Health-checking is the process by which the health state of each component of a distributed system is monitored, for the purpose of distributing traffic across service instances.

While conceptually simple, health-checking can often become as complex as the services it monitors. The operational readiness of a single instance might be defined by multiple metrics, all of which must be efficiently processed and reduced into a single binary outcome. To complicate things further, the definition of operational readiness may vary according to the state of the overall service cluster: if too many instances are degraded, it's often beneficial to relax service levels.

Building a scalable health-checking system which addresses all of these concerns while remaining reactive to failures can be extremely challenging.


Transcript

The talk I am going to give you today is about health checking. In particular, I will talk about the distributed health-checking system that we developed and are using in production at Fastly. Before I jump into describing the implementation and design of the system, I would like to give you a bit of background to show you the context in which we developed and operate this health-checking system.

As I said, I work at Fastly. Fastly is an edge cloud, which is a fancy name for a content delivery network that also provides additional services such as load balancing, OAuth, image optimization, and on-the-fly video packaging. You may not have heard of Fastly before, but your browser certainly has.

Fastly provides service to a number of high-traffic websites across several industries and, in fact, if you're browsing the conference website right now, you're unknowingly using Fastly.

The way Fastly operates is by providing a globally distributed network of points of presence, also called PoPs. Whenever a user makes a request for one of the websites that uses our services, we use a combination of IP anycast and DNS to redirect the request to the closest available PoP. For example, in this case, it might be our PoP in Dallas.

Inside our PoPs we've got one or more racks. In each rack, we've got one or more stateless load balancers and a bunch of servers, which all run a reverse HTTP proxy. So, whenever a request comes in, it first hits our load balancer, which then hashes the request to one of the healthy hosts in the system. A request comes in, we hash it to a healthy host. Another request comes in, and we hash it to another available healthy host.
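To make that routing step concrete, here is a minimal sketch of the kind of hashing a stateless load balancer can do over the currently healthy hosts. The host names, the hash function, and the choice of hash key are illustrative assumptions, not Fastly's actual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickHost hashes a request key (for example the client address or URL)
// onto the list of currently healthy hosts. Stateless: the load balancer
// keeps no per-connection state.
func pickHost(key string, healthy []string) (string, bool) {
	if len(healthy) == 0 {
		return "", false
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return healthy[h.Sum32()%uint32(len(healthy))], true
}

func main() {
	healthy := []string{"cache-01", "cache-02", "cache-03"} // hypothetical hosts
	host, ok := pickHost("198.51.100.7:443", healthy)
	fmt.Println(host, ok)
}
```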

So, our PoPs have some peculiar characteristics and constraints that drive the design decisions within them. First of all, in our PoPs, space and power come at a premium. This is not something that only happens to us; all CDNs have the same problem, so we always need to be very careful in the way we select our hardware and design our PoPs to make sure they make the best of the available power and space.

Another particular feature of our PoPs is that we've got limited scalability, so we don't have the privilege of spinning up a new machine whenever we want to scale a PoP. Scaling a PoP for us means an engineer going on site and physically racking a new server in the data center. So, whenever a PoP runs hot on traffic, the only thing we can do in the short term is shift load to other data centers.

Given these considerations, one very important component of our PoPs is the health-checking system. It's very important for us that whenever we check the health of the servers in our data centers, we are accurate and timely. We don't want to leave an unhealthy server in production, of course, because that will cause a bad user experience as requests fail. But at the same time, we want to be careful not to remove from production servers which are actually healthy, especially when we are experiencing high load on a PoP, because that could create a problematic situation which would require us to shift traffic to the neighboring PoPs.

So, let's actually see how health checks normally work. We've got our PoP and we've got a number of servers. On each server, we run an application. In our case it's a reverse proxy, but in other cases it could be, for example, a web server, a database, and so on. Then we need a component which runs checks on this application. For example, it can check whether the application is responding, whether the process is running, what the CPU utilization is, and so on. Based on these measurements, it determines whether the application is healthy or not.

Finally, we also need a component which takes the results of these checks and does something with them. For example, it can report this information to the load balancer to make sure that unhealthy hosts are removed from production, or report it to a monitoring system, or page someone to take a look at what's going on.
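As a rough illustration of that split, here is a minimal sketch of a per-host checker loop, assuming a hypothetical /health endpoint, an arbitrary check period, and a stand-in report callback; the real components and thresholds will differ.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// checkOnce runs one health probe against the local application.
// The /health endpoint and 1s timeout are illustrative assumptions.
func checkOnce() bool {
	client := http.Client{Timeout: time.Second}
	resp, err := client.Get("http://127.0.0.1:8080/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// report is a stand-in for whatever consumes the result:
	// a load balancer API, a monitoring system, a pager, etc.
	report := func(healthy bool) { log.Printf("healthy=%v", healthy) }

	for range time.Tick(5 * time.Second) { // check period is arbitrary here
		report(checkOnce())
	}
}
```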

There are already a number of tools which take care of this last part. For example, Consul is an excellent tool for it: you can use Consul to collect the results of the checks and propagate them to your infrastructure. Consul is composed of two parts: agents running in client mode on each machine, and a logically centralized server which collects the check results from these clients.

So, if one machine starts failing, this is picked up by the Consul agent, which propagates it to the Consul server. The server makes this information available to the systems that consume it and ensures that traffic is no longer sent to this machine.

So, the whole problem of propagating the results of a check is very well understood, and there are tools like Consul which solve it very, very well. However, there's a component in this architecture which has historically received not much attention: how do we actually measure the health of a server? The component that performs checks on the service instances running in our PoP has normally received little attention, and it's actually a crucial part of the architecture, because we need to be able to measure accurately and make timely decisions on the state of each server.

Traditionally, there's always been the assumption that to understand the health of a service instance, we only need to measure metrics from that specific instance. If we want to know whether one instance is healthy, we just check that instance. However, I'm going to argue in this talk that this often isn't enough. To understand whether an instance is healthy, we need not only metrics coming from the specific instance we want to measure, but also the context in which this instance operates and how the other instances in the whole server pool are doing.

Let me now show you a set of examples to clarify my argument. First, a simple case where we can accurately identify the state of a service instance by measuring only that instance itself. Let's assume that we've got one machine that is misbehaving, for example it is becoming slow or starts failing. Well, it starts failing health checks and we remove the machine from production. This is a very common case that can happen all the time. Let me show you some graphs from our production environment that clarify this example.

Now, in this graph I plotted the response time of our HTTP reverse proxies over a certain period of time. As you can see, most of the instances are behaving pretty well, with a response time of nearly zero. But there is one specific instance which is misbehaving: at a certain point, it starts responding much, much more slowly than all the other instances. This specific case was caused by a node that was reacting poorly to increasing load.

Identifying this misbehavior in a timely and accurate way is very easy. For example, we can make our health-checking system check the response time, and if the response time exceeds a threshold, we remove the machine from production. This is a very clean thing to do: we can easily identify this misbehaving machine and remove it from production.

However, let's look at another case. Let's assume that the machines are running hot on load; they are pretty much overloaded, and they start becoming slower and slower and stop responding to incoming requests. Since there will always be some degree of load imbalance, and some machines may be running background processes, there are always some machines that are a bit more loaded than all the others. The machine that is a bit more loaded than the others will suddenly start failing checks. This causes that machine to no longer receive traffic, and its traffic fails over onto the other instances. But all these other instances, already quite heavily loaded, get even more load, which causes more machines to fail, which causes yet more load to be failed over onto other machines, which again start failing. That causes a cascading failure, and we've taken our PoP entirely out of production.

So, let me show you a graph which illustrates this example. In this case, this is the response time of all the machines in the PoP. As you can see, the response time is pretty much all over the place. If we used the same heuristic that I showed earlier, removing from production any machine whose response time is above a certain threshold, that would cause a cascading failure, because it would remove all the machines in the PoP from production, which is something that we definitely want to avoid. So, we want to make sure that if only one machine is misbehaving, we remove it. But if all the machines are pretty much in the same situation, we want to alert our monitoring system, page people, or trigger some traffic engineering to shed load elsewhere, but we don't want to remove machines from production immediately. It's better to have service that is degraded than to cause an entire cascading failure.

And if you put the two graphs side by side, you can see that by looking only at one specific service instance, there's no pattern we can use to tell whether that's the only machine that's misbehaving or whether the other machines are misbehaving as well.

So, hopefully this example convinced you of the importance of having health checks which make decisions based on the global state of the service, rather than the state of a single server.

So, let me present an architecture that we can adopt to solve this problem. This is our PoP in production; let's see how we can build this health-checking system. Let's take a single machine, and inside the machine let's put a component which periodically collects some metrics from various sources. For example, it can check the error rate of some service, the current CPU load, the available memory, the rate of requests served, and so on. This component makes periodic checks and collects a sample each time. By periodically collecting these samples, we basically generate a signal: a series of measurements over time from various sources of information.

These measurements are then taken by each of the machines and streamed to a centralized location, which takes these measurements, performs some mathematical operations over these signals, and determines the health state of each machine in our cluster.
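As a minimal sketch of that sampling component, the snippet below collects two illustrative metrics on a timer and emits them as a time-stamped sample. The metric probes and the emit function are assumptions made for the example, not Fastly's actual collectors.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is one point of the per-host signal: a timestamp plus the
// metrics we care about (here just response time and error rate).
type Sample struct {
	At           time.Time
	ResponseTime float64 // seconds
	ErrorRate    float64 // fraction of failed probe requests
}

func main() {
	// emit is a stand-in for streaming the sample to wherever the
	// filter runs (later in the talk: gossiped over Serf).
	emit := func(s Sample) { fmt.Printf("%+v\n", s) }

	for range time.Tick(time.Second) { // sampling period is arbitrary here
		emit(Sample{
			At:           time.Now(),
			ResponseTime: measureResponseTime(), // hypothetical probe
			ErrorRate:    measureErrorRate(),    // hypothetical probe
		})
	}
}

// The two probes below are placeholders so the sketch compiles.
func measureResponseTime() float64 { return 0.00025 }
func measureErrorRate() float64    { return 0.0 }
```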

For those of you who are familiar with signal processing, this slide should ring a bell. In practice, this block is something that takes a signal as input and converts it into a different signal. In signal processing terminology, this is just a filter.

As you can see, we've basically abstracted this whole health-checking problem into a signal processing problem. Now we can use signal processing techniques and tools to solve it.

However, there's still a question: okay, this is a filter, but how are we going to implement this filter in practice? What tools and techniques are we going to use to determine which machines in our cluster are healthy and which ones are not?

Let me present an example of a filter design that you could use to solve this problem. The example I'm going to show you is a simplified version of the filter we use in production at Fastly. It might not be directly applicable to your infrastructure, because your infrastructure might have different constraints and requirements. However, I hope that this example will give you some inspiration on how to decompose the problem and what tools are available to solve it.

So, let me start describing the problem. The service we run on our network is an HTTP reverse proxy. From this proxy we collect two metrics. The first is the request success rate: we periodically send requests to a test endpoint and record the fraction of requests that succeed. For each of the succeeding requests, we measure the response time. So, we have all the machines in our cluster, and each machine emits two signals: response time and error rate. Each of these signals then goes through three consecutive stages. The first stage is a denoising stage, whose objective is to remove the features of the signal that we don't need and keep only the features that we need for our calculation.

The second stage is the anomaly detection stage. We take all the signals, put them together, and try to identify whether there is some signal which is anomalous compared to the others, so we can tell whether some machine is behaving differently from the rest.

Finally, we take the result from this anomaly detection stage and pass it through a hysteresis filter. The objective of this hysteresis filter is to stabilize the output and make sure there is no fluctuation: whenever we make the decision to remove a machine from production, or put it back in, the decision is stable and does not cause the machine to flap in and out.

In this slide, I show only a single instance of the anomaly detection and hysteresis filter for clarity. However, each host needs its own instance of the anomaly detection and hysteresis filter to make its decision.

Now, let's look in detail at the stages of this filter, starting with the denoising stage. In this slide, I'm showing some samples of the response time of our HTTP reverse proxies collected over a certain period of time. As you can see, there are some interesting features in this graph. For the first part of the graph, the response time was on average around 0.25 milliseconds, but there is some event toward the middle of the graph which causes a persistent bump in latency, from around 0.25 to around 0.5 milliseconds.

But there are also some outliers, cases in which there are spikes of latency. These may be caused by transient behaviors, for example a page fault, a background task running, or thread scheduling. We are not interested in these outliers because they are transient; we are only interested in the persistent, long-term behavior, such as that persistent bump in latency in the middle. So, we basically need to take this signal and filter it to remove the outliers that we don't want, because if we keep these outliers in the signal, we'll make decisions that are incorrect.

Let me show an example of a filter which can achieve this objective. We can use a filter based on a moving average. A moving average filter operates as follows: we define a window over the signal, we take all the samples in this window, and we compute their arithmetic average. The output signal is a set of samples whereby each sample is the arithmetic average of all the samples within a window. As the window slides, we compute the mean over all the samples of our signal.

Now, as you can see, the output signal is much, much smoother than the input one. The size of the window is a tunable parameter which decides how smooth the output signal will be: the larger the window, the smoother the output. However, if we use a window which is too large, the output signal will not react fast enough to persistent changes in the input signal, so it's important to select an appropriate window size, neither too big nor too small.

Let me now show you, on the sample signal I showed you before, the effect of using a moving average. Here, for example, is the output signal using a window of 20 samples. As you can see, this window is enough to remove all the outliers, but the signal is still a bit too noisy. However, it reacts very fast when we have the persistent change in latency. If we increase the window, the signal gets smoother and smoother. But as you notice, whenever there is the persistent change in latency around the middle of the graph, the bigger the window, the slower it is to react to this change.
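Here is a minimal sketch of that kind of denoising stage: a simple moving average over the most recent N samples. The window size of 20 and the sample values are just the illustrative numbers mentioned above.

```go
package main

import "fmt"

// movingAverage smooths a signal by averaging the most recent `window`
// samples. Larger windows give a smoother output but react more slowly
// to persistent changes.
type movingAverage struct {
	window  int
	samples []float64
}

// Push adds a raw sample and returns the current smoothed value.
func (m *movingAverage) Push(x float64) float64 {
	m.samples = append(m.samples, x)
	if len(m.samples) > m.window {
		m.samples = m.samples[1:] // drop the oldest sample
	}
	sum := 0.0
	for _, s := range m.samples {
		sum += s
	}
	return sum / float64(len(m.samples))
}

func main() {
	ma := &movingAverage{window: 20} // the window size from the example above
	raw := []float64{0.25, 0.26, 9.0 /* outlier */, 0.24, 0.25}
	for _, x := range raw {
		fmt.Printf("raw=%.2f smoothed=%.2f\n", x, ma.Push(x))
	}
}
```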

Now, let's take this signal and go through the anomaly detection stage. As I mentioned before, each host can be represented by two values: its error rate and its response time. So we can plot each host as a dot in this graph, where the X-axis represents the host's response time and the Y-axis represents its error rate. Let's plot all the hosts in our PoP on this graph. Now, our objective is to understand whether the orange dot over here, which represents one host, is behaving anomalously compared to all the other hosts in the PoP.

Let me explain a very simple algorithm that we can use to detect this. We take all the gray dots, which are all the other hosts in our PoP, and compute the average value across these hosts. This black dot represents an ideal host whose error rate and response time are equal to the average across all the hosts in the PoP. Let's now compute the standard deviation along both axes. For those of you who are not familiar with the term, the standard deviation is basically a distance which is proportional to how spread out the values are around the mean: the more the hosts are spread out, the greater the standard deviation.

We can now define a threshold which is a multiple of the standard deviation along each axis. By defining this threshold, we can decide that a host is healthy, that it is not anomalous compared to the other hosts, if it falls within this threshold; but if a host is outside this threshold, we can consider it anomalous and remove it from production.
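A minimal sketch of this kind of anomaly detector is below. It computes the mean and standard deviation of each metric across the other hosts and flags a host whose smoothed metrics sit more than k standard deviations from the mean on either axis. The multiplier k = 3 and the per-axis check are illustrative choices, not the exact rule used in production.

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and standard deviation of a set of values.
func meanStd(values []float64) (mean, std float64) {
	for _, v := range values {
		mean += v
	}
	mean /= float64(len(values))
	for _, v := range values {
		std += (v - mean) * (v - mean)
	}
	return mean, math.Sqrt(std / float64(len(values)))
}

// anomalous flags a host whose smoothed response time or error rate sits
// more than k standard deviations from the mean of the other hosts.
func anomalous(selfRT, selfER float64, othersRT, othersER []float64, k float64) bool {
	rtMean, rtStd := meanStd(othersRT)
	erMean, erStd := meanStd(othersER)
	return math.Abs(selfRT-rtMean) > k*rtStd || math.Abs(selfER-erMean) > k*erStd
}

func main() {
	othersRT := []float64{0.25, 0.26, 0.24, 0.27} // response times of the other hosts
	othersER := []float64{0.00, 0.01, 0.00, 0.01} // error rates of the other hosts
	fmt.Println(anomalous(0.80, 0.05, othersRT, othersER, 3)) // k=3 is an arbitrary choice
}
```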

Let's now talk about the hysteresis filter. If you look at the previous graph, there was actually a problem that I didn't mention: what happens if a host fluctuates across the boundary of our region? Well, we've got this host which is out of production. If it fluctuates just a bit and comes back inside the threshold, it will be put back into production. But then, with another little fluctuation, it will go out of production again. This will cause our hosts to flap in and out of production frequently, which is something we want to avoid. To prevent this issue, we can define a second, inner threshold, and then configure our filter to ensure that a machine is removed from production only if it crosses the outer threshold and is put back into production only if it crosses the inner threshold.

So, in this case, if this host crosses the outer threshold inwards, that's not sufficient to bring the machine back into production: it also needs to cross the inner threshold inwards. Likewise, to remove the machine from production, it's not sufficient to cross the inner threshold outwards; it also needs to cross the outer threshold.

Now, the distance between these two thresholds is a tunable parameter that trades off stability against reactivity. The larger the distance between the thresholds, the more stable the output. However, if this distance is too large, our filter may be too slow to react to changes, so tuning it will require some trial and error or some measurements based on real data.
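Here is a minimal sketch of that hysteresis stage. It assumes the anomaly stage has been reduced to a single scalar distance from the "ideal" host (for example, a normalized distance from the pool average); the two threshold values are purely illustrative.

```go
package main

import "fmt"

// hysteresis keeps a host's in/out-of-production state stable by using
// two thresholds: a host is pulled only after crossing the outer
// threshold, and readmitted only after coming back inside the inner one.
type hysteresis struct {
	inner, outer float64 // inner < outer
	inProduction bool
}

// Update takes the host's current distance from the pool average and
// returns whether the host should be in production.
func (h *hysteresis) Update(distance float64) bool {
	switch {
	case h.inProduction && distance > h.outer:
		h.inProduction = false // crossed the outer threshold outwards
	case !h.inProduction && distance < h.inner:
		h.inProduction = true // came back inside the inner threshold
	}
	return h.inProduction
}

func main() {
	h := &hysteresis{inner: 2, outer: 3, inProduction: true} // illustrative thresholds
	for _, d := range []float64{1.0, 3.5, 2.5, 1.5} {
		fmt.Printf("distance=%.1f inProduction=%v\n", d, h.Update(d))
	}
}
```

Note how, in the usage above, the value 2.5 does not bring the host back: it is inside the outer threshold but still outside the inner one, which is exactly the flap suppression described in the talk.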

So, hopefully this filter gives you a good idea of how we can decompose the problem and how we can stably and accurately detect machines that are misbehaving.

Let's now see how we can actually implement this in practice. As I said before, we've got a bunch of hosts in our PoP. Each host emits two signals, which get fed into a filter which then computes the decision for all the hosts and decides which host should be up and which should be down. If we implement this as a single filter, deployed only once, it will be a single point of failure. So we would probably need to replicate this filter and maybe have all the replicas coordinate through consensus. That seems a bit difficult. Maybe there's a simpler way to do it.

So, we can make an interesting observation now. The internal state of the filter is determined entirely by its input. So, if we replicate the filter several times and each replica gets exactly the same input, then each replica is going to produce exactly the same results: same implementation, same input, so the results are going to be the same, right?

So, we can replicate the filter as many times as there are hosts in our PoP, and then each filter can operate independently. But now, as you can see, each filter still makes decisions for all the hosts. We don't need that: we can drop some of the decisions and make each filter decide only for one specific host. After making this decomposition of our problem, we can co-locate each filter physically on the host to which it refers.

This decomposition has some very interesting properties, because each instance of the filter is responsible for making the decision for only one host. There's no coordination needed across these instances, because each filter is responsible for only one specific partition of the decision space. Similarly, if one filter fails, that does not cause any cascading failure: the most that can happen is that the host the filter is associated with is considered unhealthy and removed from production. So, this is a very resilient and robust way to implement it.

However, there's still one piece that we need to decide how to implement. If you remember, each filter needs a copy of all the signals emitted by all the hosts, so we need a robust and scalable way to distribute signals among all the hosts, so that each filter instance has a copy of every signal emitted by every host.

One way we could have implemented this is using Consul. Consul provides all the primitives to implement this feature: we can emit the signals over Consul using the event primitive, and the filters can subscribe to these events using the watch primitive. Similarly, we can store the results locally and Consul can collect them using the check primitive.

However, there are some peculiarities of our infrastructure that prevented us from using Consul. As you know, to run Consul, we need each host in our infrastructure to run a client, but we also need an odd number of hosts, normally between three and seven, to run the Consul servers. We need to ensure that at any point in time, at least one instance of the Consul server is up and running in production.

However, our PoPs have very strong survivability requirements, because we designed them to be able to survive and operate correctly even if only one machine is operational. But if we deployed Consul servers on our infrastructure and all the hosts running a Consul server went out of production, that would take the whole PoP out of production. There could have been an alternative implementation: we could have added some separate control machines which don't take user-facing traffic and deployed the servers there.

But as I mentioned before, we are very constrained in our PoPs on power and space, so adding machines dedicated to control, just to run a Consul server, was not a solution that suited us. However, there's another solution that we could use to implement this, and that's using Serf.

Serf, as I'm sure you've heard already in several talks earlier today, is a gossip protocol that can be used to robustly and scalably distribute messages among the members of a cluster.

Serf operates as follows. Let's suppose that we want to broadcast a message from one node of our Serf cluster to the other nodes. The node will start gossiping the message to other nodes in the cluster: first, it distributes the message to a fixed number of nodes within the cluster. The nodes that receive the message will, in turn, distribute it to other nodes in the cluster, and the nodes that receive it will do the same, until all the hosts in the cluster have received a copy of the message.
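To build intuition for how a gossiped broadcast reaches every node, here is a toy simulation of the dissemination pattern described above. It is not Serf's actual implementation, just an illustration of how forwarding to a fixed number of random peers per round floods the cluster in a number of rounds that grows roughly with the logarithm of the cluster size.

```go
package main

import (
	"fmt"
	"math/rand"
)

// rounds simulates naive gossip: every node that already has the message
// forwards it to `fanout` random peers each round, until all n nodes have it.
func rounds(n, fanout int) int {
	has := make([]bool, n)
	has[0] = true // node 0 originates the broadcast
	count, r := 1, 0
	for count < n {
		r++
		snapshot := make([]bool, n) // only nodes infected before this round forward
		copy(snapshot, has)
		for i := 0; i < n; i++ {
			if !snapshot[i] {
				continue
			}
			for j := 0; j < fanout; j++ {
				peer := rand.Intn(n)
				if !has[peer] {
					has[peer] = true
					count++
				}
			}
		}
	}
	return r
}

func main() {
	for _, n := range []int{8, 64, 512} {
		fmt.Printf("n=%d rounds=%d\n", n, rounds(n, 3))
	}
}
```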

This is pretty much how we built our system. We've got our servers, each server runs a Serf instance, these Serf instances all connect to each other and exchange messages, and they also exchange messages with the local health-check instance running on the same host. Then, we have another component in our system which consumes the changes in health state and propagates these changes to our load balancers.

This component is, in practice, a BGP speaker which advertises routes toward the load balancers in our infrastructure. And this is pretty much how we deploy Serf: we don't use Serf as a standalone process; instead, we embed the Serf library in a single binary.
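To give an idea of what embedding Serf can look like, here is a minimal sketch using the github.com/hashicorp/serf/serf package: it joins a cluster, broadcasts health samples as user events, and receives the samples gossiped by other nodes. The event name, payload encoding, node name, and peer address are assumptions made for the example, not Fastly's actual design.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/serf/serf"
)

func main() {
	eventCh := make(chan serf.Event, 256)

	cfg := serf.DefaultConfig()
	cfg.NodeName = "cache-01" // hypothetical node name
	cfg.EventCh = eventCh

	s, err := serf.Create(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Join the existing cluster via any known member.
	if _, err := s.Join([]string{"10.0.0.2"}, true); err != nil {
		log.Fatal(err)
	}

	// Periodically broadcast this host's health sample as a user event.
	go func() {
		for range time.Tick(time.Second) {
			payload := []byte("rt=0.00025,er=0.0") // illustrative encoding
			if err := s.UserEvent("health-sample", payload, false); err != nil {
				log.Printf("broadcast failed: %v", err)
			}
		}
	}()

	// Consume samples gossiped by every node and feed them to the local filter.
	for e := range eventCh {
		if ue, ok := e.(serf.UserEvent); ok && ue.Name == "health-sample" {
			log.Printf("sample from the cluster: %s", ue.Payload)
		}
	}
}
```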

The signal collectors and filters don't communicate with each other directly; they all exchange messages over Serf. Serf is a very good piece of software. It was very easy to use, very scalable, and very robust. I'm a big fan of it; it was a pleasure to use. By implementing our health-checking system on top of Serf, we learned some interesting lessons. In particular, there's one lesson that I think is useful to share with the community, and I would like to share it with you.

At a certain point, playing with Serf, we found that by increasing the rate of messages we sent, or by adding hosts, we started to experience some dropped messages. That's because Serf has an interesting property: by design, Serf caps the maximum number of packets it sends per unit of time. In the vast majority of use cases this is a very desirable property, because it means that an increase in gossiping activity does not cause a burst of network activity; it only causes messages to be temporarily queued up and convergence to be slower. This, as I said, is normally very desirable, because it ensures that we can control the bandwidth utilization of our Serf instances.

However, there's a little caveat. If you are going to use Serf, as in our case, to distribute a constant stream of data, not event-based messages that happen once in a while but something constant, you need to ensure that Serf is properly configured so that the maximum bandwidth each node uses to send messages is compatible with the amount of messages you want to send. That is, we want to make sure that the number of messages Serf wants to emit is not more than the number of messages Serf can actually emit. So, we need to do some configuration.

Let's now see how we should configure Serf to achieve this objective. As I showed you in the previous slide, every time a Serf instance wants to emit a message, it sends it to a fixed number of neighbors. The number of transmissions per message is equal to λ · log(n + 1). I'm sorry for the math, but I promise I'll keep it simple. Here, n is the number of nodes and λ is a customizable parameter in the Serf configuration: it's basically a retransmission multiplier, and its default value in both the WAN and LAN configurations is 4. That means that if a Serf node needs to send r messages per unit of time, it needs to transmit r · λ · log(n + 1) messages over the network: each message it wants to broadcast has to be replicated and sent to λ · log(n + 1) neighbors.

But that's only one node. There are also n - 1 other nodes in our cluster, and all of them emit messages as well. When our host receives messages from another host, it also needs to contribute to broadcasting them. So we don't only emit the messages that we produce ourselves; we also need to rebroadcast the messages we receive from other nodes. That means we need to send our own messages, which is r · λ · log(n + 1), plus the messages from all the other n - 1 nodes, which is (n - 1) · r · λ · log(n + 1). If we simplify, it becomes r · n · λ · log(n + 1). This is the number of messages that Serf wants to send over the network. So, now I've shown you the number of messages Serf needs to send to operate correctly.

Now, let's see how many messages each Serf instance can send over the network, and how Serf actually takes messages, packs them into packets, and emits them onto the network. We've got, again, a Serf instance, and this Serf instance has a bunch of messages it wants to transmit. The Serf instance will first try to pack as many messages as it can into each packet; the number of messages that fit in a single packet depends on the average size of your messages and the available MTU on your network. Let's call this msgs-per-packet.

Then we select a fixed number of neighbors to which we want to send packets. This is called the fanout, and it is a configurable parameter in Serf. At a predefined interval, which in Serf defaults to 200 milliseconds, we take a packet and send it to each of the selected neighbors. So, remember: at each interval we transmit fanout packets, and each packet contains msgs-per-packet messages.

By doing a very simple back-of-the-envelope calculation, we can compute the maximum number of messages that Serf can send over the wire given its configuration. This is simply the number of messages we can fit in a packet, times the fanout (the number of neighbors to which we send packets), divided by the interval: msgs-per-packet · fanout / interval.

So, I've shown you the number of messages Serf wants to send to operate correctly, and how many messages Serf can send given its configuration.

To ensure queue stationarity, so that we don't overflow the queue at any point in time, we need to make sure that, on average, the number of messages Serf wants to send is strictly less than the number of messages Serf can send on the network.

So, pulling together the equations I showed before, this is the constraint that our configuration needs to satisfy: r · n · λ · log(n + 1) < msgs-per-packet · fanout / interval. We can rearrange it to better understand some of its properties. Now, let's break this equation down and look at those properties.

As you can see, the rate of messages that Serf can sustain, that is, the maximum number of messages we can send without building up queues, decreases as the number of nodes increases: the more nodes there are, the more gossiping traffic there is, okay?

Now, msgs-per-packet is something we cannot do anything about: it depends on the size of your messages and on the MTU, so it's not a tunable parameter. The only tunable parameters we have are λ, the fanout, and the interval. However, λ is a parameter that's probably not safe to touch, because changing λ affects the rate of convergence of Serf, and it also affects the probability of eventual convergence. So, it turns out that the only parameters we can tune are the fanout and the interval.
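In the Go libraries, those knobs live in the memberlist configuration embedded inside the Serf configuration. A hedged sketch of tuning them might look like the following, with the concrete values chosen purely for illustration; as I understand the configuration, GossipNodes corresponds to the fanout, GossipInterval to the interval, and RetransmitMult to λ.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/memberlist"
	"github.com/hashicorp/serf/serf"
)

func main() {
	// Serf's gossip knobs live in the embedded memberlist configuration.
	ml := memberlist.DefaultLANConfig()
	ml.GossipNodes = 4                         // fanout: neighbors gossiped to per interval (illustrative)
	ml.GossipInterval = 100 * time.Millisecond // interval: smaller means more packets per second (illustrative)
	ml.RetransmitMult = 4                      // λ: retransmission multiplier (default; risky to lower)

	cfg := serf.DefaultConfig()
	cfg.MemberlistConfig = ml

	if _, err := serf.Create(cfg); err != nil {
		log.Fatal(err)
	}
}
```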

So, I think this equation can be quite important for those of you who use Serf to send messages at a sustained rate. If you want to operate Serf with n nodes, each sending r messages per second, you should make sure that your configuration satisfies this constraint, because if the constraint is violated you'll end up with a queue buildup, and that will cause messages to be dropped.
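As a quick sanity check, the constraint can be evaluated directly from your parameters. Here is a small sketch following the back-of-the-envelope model from the talk; the base of the logarithm and every input value are rough assumptions, so treat the result as an estimate rather than a guarantee.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// sustainable reports whether a cluster of n nodes, each emitting r
// messages per second, stays under Serf's outgoing gossip capacity.
// The log base is an assumption; check Serf's internals for the exact formula.
func sustainable(n int, r float64, lambda, fanout, msgsPerPacket int, interval time.Duration) bool {
	// Messages Serf wants to send per second: r · n · λ · log(n + 1).
	demand := r * float64(n) * float64(lambda) * math.Log(float64(n)+1)

	// Messages Serf can send per second: msgs-per-packet · fanout / interval.
	capacity := float64(msgsPerPacket*fanout) / interval.Seconds()

	return demand < capacity
}

func main() {
	// Illustrative numbers only: 64 nodes, 1 sample/s each, λ = 4,
	// fanout = 4, ~20 messages per packet, 200 ms gossip interval.
	ok := sustainable(64, 1.0, 4, 4, 20, 200*time.Millisecond)
	fmt.Println("configuration sustainable:", ok)
}
```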

So, this pretty much concludes my presentation. In summary, I would like to recap the three main points that I discussed in this talk. First of all, hopefully I was able to convince you that there are failure modes that cannot be accurately identified if you make health-check decisions based on a single instance only; we need to make health-check decisions based on all the instances considered together.

A system like that can be implemented with Consul or Serf, and I think Consul and Serf cover two different regions of the design space, so depending on your problem you can use one or the other. The other thing I discussed in this talk is how to properly tune the Serf configuration to ensure that the number of messages your Serf instances want to send does not cause a queue buildup and, therefore, dropped messages.

Finally, I put together some references that could help you if you want to investigate further and design a filter yourself. These are pretty much handbook references. The first books are about filter design: the first one is an O'Reilly book and is a very good bridge between practice and theory, while the others are academic, very theoretical books. The Mitra one is probably more approachable than the Proakis and Manolakis one, but they're both pretty heavy on math. The anomaly detection book is very good too; again, I think it strikes a very good balance between practice and theory. Finally, on stability and control: the first book is a reference book, while the second one, again, provides a very good bridge between theory and practice.

Finally, if you want to learn more about the Serf internals, there are two very interesting papers. The first one is the SWIM paper, the original paper which describes the distributed failure detection and gossip protocol used by Serf. The second is the Lifeguard paper, which describes some very clever optimizations that improve the performance of Serf.

So, with this I conclude my talk. Thank you very much for your attention. Come talk to me if you want to learn more. Thank you, again.
