Hear about Stripe's service networking team talks about their multi-region service networking tech stack built on Consul.
First, let's talk about Stripe. Stripe is a technology company that uses economic infrastructure for the internet. Businesses of every size, from small startups to public companies, use our software to accept payments and manage their businesses online. Hundreds of billions of dollars are processed by Stripe a year.
The core products Stripe offers enable merchants to accept payments from around the world. Stripe Connect powers businesses that provide online marketplaces, letting them accept customer payments on one end and paying out to service providers on the other.
We are the service networking team. The team owns the core infrastructure and other services that Stripe depends on to discover, route, and communicate with one another.
Functionalities in our scope are:
Service networking is built on top of open source technologies, including Envoy for request routing and load balancing and Consul for service registration and discovery.
Before getting into technical details, here are some context and a bit of history of the tech stack. Broadly speaking, Stripe started as a monolithic Ruby app, and we are now running microservices, mostly written in Ruby, Java, and Go.
Back then, NGINX and HAproxy were used to route traffic to underlying services. Puppet was used for managing service catalog—those are config files that feed into HAproxy for service discovery—until 2014, when Stripe adopted Consul. Envoy was adopted around 2017 for service routing and TLS processing. Around the same time, we upgraded to Consul Enterprise edition for the better support for end-to-end service discovery features, and the multi-region federation.
From the date when we first had data available, we have grown the number of Consul agents by 30x.
Within one Consul datacenter, we're using the common architecture. We have 5 server nodes to form the core. This setup allows us to tolerate 2 server node failures. There are up to tens of thousands of client agents gossiping with each other and sending their updates back to the servers, using RPC.
We deploy over 30 Consul clusters overall using the above architecture. The datacenter size ranges from hundreds to tens of thousands of Consul agents. We have separated our cloud into multiple virtual domains, where those domains have different security requirements, as you expect for a payment infrastructure company. Those different Consul datacenters are identified by domain, region, and environment.
All those Consul datacenters are federated using a topology specific to our use case. We use WAN to federate the datacenters in the same domain and environment. For example, 2 datacenters, DC1 and DC2, are both in the same domain and environment. But different regions are connected via WAN. The servers communicate using both gossip and RPCs. Note that the data in general is not replicated between datacenters. When a request is made for a resource in another datacenter, an RPC request is forwarded.
There is one exception: ACL (access control list) is replicated across WAN members. The benefit of that is the same ACL token, granted to a client could be used for cross-PC requests in WAN-federated datacenters.
We also use network area to help connect 2 arbitrary pairs of datacenters. Unlike Consul WAN federation, network areas just use RPC for communication, and it's secured with TLS.
The relationships can be made between any independent pair of datacenters. That gives us flexibility, and it's easier to set up. This allows for complex topologies among Consul datacenters like hub [and spoke] to be created, and we have multiple hubs in our topology that are connected to almost every other Consul datacenter. The combination of WAN and network area powers our multi-region Consul deployment.
In my opinion, WAN is a bit more complicated to set up but comparatively easier to operate and maintain afterwards. Network area is the opposite.
Now Mark will talk about our use cases.
The first use case for Consul is powering service discovery and registration, which Consul does very well out of the box.
The biggest user of Consul is our Envoy control plane and Envoy proxies that power Stripe's service mesh. The Consul catalog for each datacenter is collected and aggregated by the control plane and then sent to the Envoy proxies.
In addition, we also provide the DNS and HTTP interfaces to a few of our users who choose to opt out of our service mesh. Those users will do their own client-side load balancing using service catalog data from Consul. Like many other companies, we also have a layer of DNS servers that forward requests to Consul, and here we implement TTL caching to reduce load on our Consul servers.
As you know, Consul at its core is a consensus-based key-value store, and there are many features built on top of it at Stripe—for example, percentage migrations [canary deploys], blue-green deploys, basic dynamic service configs, and distributed locking and leader election. Our Consul clients for services at Stripe have built-in support for synchronization and leader election primitives. These libraries are commonly used for services that need leader-based routing.
Now, Ruoran will talk more about our service networking stack.
At the beginning of time, there was no such thing as service mesh. Once a request was routed into Stripe's datacenter, the internal frontend proxy would route the request to the backend host. It was roughly speaking managed by HAproxy and Puppet, where Puppet is responsible for sending the updated data service catalog as a file to hosts, which means whenever there is a change, you need a Puppet run to distribute the update to each host for HAproxy to pick up. As more host services are added, this becomes challenging.
Consul was adopted to solve the delay and overhead of those Puppet runs. And the service catalog updates for HAproxy are propagated in a timely manner. And the operation cost is reduced. Then Envoy was adopted for 2 reasons: security and usability.
The most important reason is we can check the identity of who is making the request. With built-in mutual TLS support, the client making the request needs to provide a certificate proving who they are, and the server can decide whether to accept or deny the request. Of course, the requests are encrypted. In addition, it allows us to support standard timeout, retries, and more for different languages.
It also supports sophisticated load balancing and traffic-shifting features.
In our first iteration, Envoy loads its configuration from Consul and a few other data sources directly from each individual host. In the second iteration, an internal control plane is built that is responsible for sourcing data and compiling them into Envoy configurations. The control plane enabled us to have more control and ensures those Envoy configurations are consistent. On screen is a 10,000-foot view of the first part of the stack, where a Consul client agent runs health checks and sends updates to Consul server.
Now on screen is a similar diagram, but the data flows in the opposite direction. Our control plane consumes the service catalog data and sends configurations to Envoy running each host.
One important part of service registration and discovery is health checks. There are health checks for nodes and services. Node-level health checks are used by our compute layer. It is an integral part of the node bootstrap process. Several standard node-level health checks are executed on all nodes. This also puts Consul at a lower level of the stack.
On the other hand, service health checks are configured by service owners and are registered when the services are started. There is a Stripe-specific interface for service owners to configure those health checks. We use almost all of the kinds of health checks supported by Consul, including HTTP, GRPC, custom script, and TTL checks.
Health checks can be categorized into active and passive. If we look at when it's executed or categorized into client and server side, we have active server-side checks performed by Consul, plus active client-side checks performed by Envoy. We're considering enabling client-side passive checks. The goal is to make sure the actual state of the world is detected in a timely, accurate, and efficient manner.
Mark will now talk about how it works with Kubernetes.
Stripe is migrating more and more services over to Kubernetes due to the benefits it provides as an orchestration platform. In the Kubernetes world, we choose to continue to use Consul for service discovery.
How does this work? Each host has a single Consul agent but now runs multiple pods and therefore multiple service instances per host. On each host, we run a proxy which listens to the Unix domain socket.
Then pods run a process which will mount the socket and forward pod requests to the host Consul agent. We run an additional proxy here for security purposes, so pods do not have direct access to the host Consul agent.
For service registration when bootstrapping, each pod registers itself as an instance of a service by setting the service address field to be the IP address of the pod. It also provides a health check that is called at that IP address.
For general Consul usage, a service can send requests as it normally would through localhost 8500. These requests get forwarded through the in-pod process and host proxy to the host Consul agent.
Why do we take this route? It provides us with an easier migration for existing services, since service owners do not need to change existing call sites and interfaces for service registration and discovery.
It also allows for Kubernetes services and host space services to exist in a registry without additional work. The alternative would've been to use Kubernetes service discovery out of the box and merging service catalog snapshots in our Envoy control plane. However, this would involve building a new unified service discovery API and a difficult migration, so we decided not to take this route.
Ruoran will now talk a bit about challenges that we faced when deploying Consul.
The first challenge was maintainability. We manage 30-plus datacenters, and the number is still growing. The number of nodes and service checks registered within Consul is also growing. On the other hand, we found an emerging need for observability.
Consul provides a great number of metrics by agent. However, we were looking for fine-grain, client-side metrics that could give us insights into how our users are using Consul on different hosts.
Another challenge is that the Consul metrics should not overwhelm our observability stack.
Third, we're low in the stack and we have a high bar for availability so Stripe can provide reliable service to customers.
Last but not least, scalability. The speed of the growth forces us to think about scalability up front. We're dealing with gossip limit. There is a soft limit that we're about to hit and we need to be prepared and to have planned for it.
For observability, we're leveraging a feature of our stack where we aggregate client-side metrics for host types, using AWS terminology, auto-scaling groups, which dramatically reduced the metrics load. In addition, we have set up an observability tool to monitor across datacenter reachability. Since we have multi-region deployment.
We do aggregation, and we lose the per-host metrics. That's why we are also testing the Consul audit-logging feature, which will give us insight into what exact APIs our users are using on each host.
For maintainability, the spinning up of a single Consul cluster is to a large extent automated and managed by code using Puppet and a Terraform. However, we need additional configurations, for example, when we federate multiple clusters in a specific topology for the whole system to be functional from Stripe's perspective.
Currently there is a lot of manual work involved in this process and it's error-prone. That's why we're working on declarative configuration tooling that will allow us to define the expected state of our Consul clusters. We've already built actions on top of Consul CLI that will apply the update to the cluster, for example, to set up network areas in an idempotent way.
The last piece is we need to have continuous monitoring so we know whenever there is drift from the expected state and we can apply actions.
Now Mark will talk about availability.
We run game days with various failure modes in Consul. Some game days we run include network partitions and maintaining quorum but having individual node failures and complete loss of quorum. Through these game days, we've helped services and users reduce Consul as a single point of failure.
Some best practices we've rolled out to various clients include stale reads when weak consistency is acceptable, which helps decrease failure rate in the case of loss of quorum. Another is retries with an exponential back-off when latency is not an issue, which helps decrease client observe failure rate during transient outages or leadership transitions.
Additionally, we've built an integration test suite for testing Consul binaries. This test suite spins up Consul clusters in Docker containers in configurations similar to our production environment setup and then runs tests that we've written.
This is especially helpful when doing upgrades and testing new features. We used to just roll out new Consul versions gradually in our QA environment. However, recent bugs and specific features that we used have caused turbulence in our QA environments, creating a lot of KTLO (keeping the lights on) work and impacting dev productivity at Stripe.
There have also been very specific bugs that we did not catch in QA that have slipped into production. And our integration test allows us to define and test for those specific edge cases going forward.
We also replace our Consul server nodes on a regular cadence. This is done to ensure node freshness and reduce the likelihood of nodes running into unexpected EC2 failures. To do this process, we use a custom leave script. A Consul server node will:
a) acquire a lock to ensure no other node is currently leaving
b) wait for the Consul cluster to be fully healthy
c) wait for there to be a sufficient number of voters before proceeding with termination.
Finally, we also utilize the Consul snapshot agent to periodically take snapshots and save them to S3. This allows us to restore our cluster quickly in the event of a full outage. We also cache the snapshot in our service mesh control plane, and will continue serving the stale Consul snapshot to our service mesh in the event that our control plane cannot reach Consul.
We are also building out a way to allow for our service mesh proxies to directly read S3 Consul snapshots as a last line of defense in the event of a full Consul and full control plane outage.
Overall, our Consul clusters have scaled fairly well, and we've really only needed to scale Consul server nodes vertically, i.e., bumping the EC2 instance size. Recently, however, we've been planning and rolling out additional changes to improve performance in our largest cluster. That is tens of thousands of nodes. As you know, Consul relies on Serf gossip protocol for checking node health and having information converge.
Recently, we've noticed issues with excessive gossip traffic and believe there is a soft limit for total amount of gossip traffic that we are approaching. To continue scaling our gossip pool, we are rolling out the network segments enterprise feature, which splits the Consul gossip pool into an arbitrary number of segments.
Finally, to better scale reads and writes in our Consul servers, we are exploring another feature: redundancy zones with clusters that have 3 voters and 3 non-voters. The benefits here are twofold. First, the non-voters handle stale reads, so we improve our stale read RPS, and also it prevents requests from overwhelming the leader.
Second, reducing the number of voter nodes reduces the number of Raft operations needed per write, thereby increasing write latency. Note that this does reduce our fault tolerance, since we normally run 5 voters. However, the 3 voters are spread across different availability zones, and we've really only seen nodes failing one at a time, a scenario in which a non-voter will be promoted to a voter.
Lastly, we are also considering spinning up different Consul clusters for the separate use cases we mentioned earlier, such as KV and leader election. This is to prevent all loads from the combined use cases from having to fall on a single Consul cluster.
Thanks so much for listening.