Case Study

Workday's multi-cloud network fabric with Consul & Vault

Published 7:00 AM UTC Nov 04, 2021

See how Workday built a global service mesh of multiple customer datacenters using HashiCorp Consul and Vault.

»Transcript

Hi, everyone. My name is Daniele Vazzola. I'm a principal infrastructure engineer at Workday. My team, in the infrastructure organization, takes care of running the Workday application in our datacenters and on cloud providers.

Workday is a leading provider of enterprise applications for finance and human resources. We serve more than 55 million workers and over 9,000 customers across the globe. The Workday application is deployed in a number of datacenters, a few on-premises and a few on public cloud.

My team takes care of connecting all those datacenters, making sure that our developers have a seamless experience when they try to connect to their upstream dependencies or move data between different locations.

As you know, networking is hard. No one size fits all requirements. Different applications need different kinds of connections, and it's not easy to ensure proper, secure connectivity between different datacenters. We need to have proper encryption, and we need to install firewalls and different controls, from the physical layer up to the application layer, to make sure we can control and secure traffic that is handling customer data.

»Multi-Cloud Challenges

When you move to multi-cloud, the challenges get harder. There are different technologies available in different cloud providers. They all offer products to connect your datacenter to their network, but when you want to connect multiple datacenters to multiple networks across different cloud providers, there's no solution that works across the board.

The IPv4 private address space is just not enough. When you look at it, 18 million addresses seems like a lot, but it's not. You start fragmenting the network, and before you know it, you end up with a lot of address ranges that cannot be routed across cloud providers.
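As a rough check on that figure, here is a minimal Python sketch that adds up the three RFC 1918 private ranges and flags overlapping allocations. The site names and CIDR blocks in `ALLOCATIONS` are hypothetical, picked only to illustrate how quickly independently chosen ranges collide.

```python
import ipaddress

# The three private IPv4 ranges reserved by RFC 1918.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def total_private_addresses() -> int:
    """Count every address in the RFC 1918 private space."""
    return sum(net.num_addresses for net in PRIVATE_RANGES)


def find_overlaps(allocations: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of site names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in allocations.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical allocations made independently by different teams.
ALLOCATIONS = {
    "onprem-dc1": "10.0.0.0/12",
    "cloud-a-vpc": "10.8.0.0/16",
    "cloud-b-vnet": "172.16.0.0/16",
}

if __name__ == "__main__":
    print(total_private_addresses())   # a little under 18 million
    print(find_overlaps(ALLOCATIONS))  # sites that cannot be routed to each other directly
```

Two sites with overlapping blocks cannot be joined into one flat routed network without renumbering or address translation, which is the fragmentation problem described above.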

And IPv6 is just not an option. The support in public clouds is not yet at the level where you can use IPv6-only networks. So the sooner you realize that you cannot have a flat network, the better.

We tried to see if there was a common layer where we could hide all the differences between cloud providers and datacenters and give our service teams that nice flat network that they like to use.

We started from low-level network protocols, the techniques that we use to connect our own datacenters. And that was good, because it requires the least involvement from the service teams. They don't need to know what's going on. When we connect different datacenters with dark fiber or MPLS, it's completely invisible to them.

But we cannot use that when we connect to cloud providers. Those really low-level technologies are not available. So if we move up the stack and go all the way up to Layer 7, we reach the application itself. And in there it's easy to have a service mesh that will work with any cloud provider and any datacenter, because at that level of the stack, you don't see the underlying network devices anymore.

But that requires a lot of changes to the applications. If you're lucky enough that everything is running on the same platform, you might be able to inject your own libraries or network configuration in every application.

But if you have a legacy application running on another stack, or there's an acquisition and the software that you just bought is running on a completely different design, this cannot be done. It takes too long for all the different applications to onboard the same platform.

In the meantime, we need to have something that just works, maybe fewer features, but enough to have a nice experience for our developers and make sure that they can focus on adding features and fixing bugs for our customers instead of spending time trying to figure out how to send packets in our own network.

»The Ideal Connection

We went around to different service teams, stakeholders, asking what they would like to have. And at the end, everyone came back with this: They really don't care. They just want to send packets and have some network magic take care of them and make sure they reach the other side in a safe way.

The fact that they don't care doesn't mean that it's not important. That means that they're not interested in which technology is used, as long as the packets reach the other side in a reasonable amount of time.

So we thought: We cannot do magic, but we can build something that, from a distance, at least from the service owner's point of view, looks like magic. And this was Consul connected with a series of custom gateways. As you know, Consul has had this concept of multiple datacenters since the beginning. They started with RPC forwarding across datacenters and now continue with Consul Connect and the service mesh with mesh gateways.

So we thought that building a mesh of Consul datacenters was relatively easy. It's well documented, and it works at Layer 7, so cloud providers had no problem giving us enough primitives to do this.

But once we got this, we needed a way to have the service teams and our applications send traffic inside this mesh without being part of the mesh itself. And this is what the ingress gateways are for. The ingress gateway is an Envoy proxy that is connected to the rest of the mesh and is able to take traffic from outside the mesh, wrap it in proper mutual TLS, and send it over so that the rest of Consul Connect is able to route it to the destination.

On the other side, the gateway unwraps the mutual TLS and connects to the endpoint as a standard client. Source and destination don't see the mesh, which is great, but in between, the communication is secured with proper encryption, which also allows us to decide who can talk to what.

Consul provides a reference implementation of ingress gateways. It's controlled via config entries, which you can manage through the Consul API. It uses Envoy on the data plane, which is great as well, because it has good performance and is easy to deploy.

The reference ingress gateway comes with a few constraints. You can have only 1 TCP service for each TCP port. Multiplexing or wildcards are supported only for HTTP, using Host headers, and TLS can be enabled but uses the same certificate used for the Consul mesh itself. The certificate is valid, a proper subject alternative name is added, but still it's the same signing CA. For our use case, we needed something different.
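To make the first two constraints concrete, here is a small Python sketch that checks an ingress-gateway config entry, written as the JSON-style document Consul accepts, against them. The gateway and service names are hypothetical, and `validate_ingress_entry` is an illustrative helper, not part of Consul or of Workday's tooling.

```python
def validate_ingress_entry(entry: dict) -> list[str]:
    """Return human-readable problems for the constraints described above."""
    problems = []
    seen_ports = set()
    for listener in entry.get("Listeners", []):
        port = listener["Port"]
        protocol = listener.get("Protocol", "tcp")  # Consul defaults listeners to tcp
        services = listener.get("Services", [])
        if port in seen_ports:
            problems.append(f"port {port}: used by more than one listener")
        seen_ports.add(port)
        if protocol == "tcp":
            if len(services) != 1:
                problems.append(
                    f"port {port}: a tcp listener takes exactly one service, got {len(services)}"
                )
            if any(svc["Name"] == "*" for svc in services):
                problems.append(f"port {port}: wildcard services need an http listener")
    return problems


# Hypothetical entry that respects the constraints.
VALID_ENTRY = {
    "Kind": "ingress-gateway",
    "Name": "edge-gateway",
    "Listeners": [
        {"Port": 5432, "Protocol": "tcp", "Services": [{"Name": "billing-db"}]},
        {"Port": 8080, "Protocol": "http", "Services": [{"Name": "*"}]},
    ],
}

# Hypothetical entry that tries to put two TCP services behind one port.
CROWDED_ENTRY = {
    "Kind": "ingress-gateway",
    "Name": "edge-gateway",
    "Listeners": [
        {
            "Port": 636,
            "Protocol": "tcp",
            "Services": [{"Name": "ldap-eu"}, {"Name": "ldap-us"}],
        },
    ],
}

if __name__ == "__main__":
    print(validate_ingress_entry(VALID_ENTRY))
    print(validate_ingress_entry(CROWDED_ENTRY))
```

The second entry is the shape a team would want for many TCP services on a shared port, and it is exactly what the reference implementation rules out.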

Thanks to Consul being modular, we just replaced that. We built our own controller that renders configuration for HAProxy and Envoy on the data plane. The main difference is SNI multiplexing. We expose multiple TCP services on the same TCP port and sniff the SNI value in the TLS handshake. We use custom TLS certificates to terminate the client TLS connection.
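The idea behind SNI multiplexing can be sketched in a few lines of Python: read the server name out of the first TLS record (the ClientHello) and use it to pick a backend. This is only an illustration of the technique, not the HAProxy or Envoy configuration that the controller renders. The `ROUTES` table and its hostnames and addresses are hypothetical, and the parser assumes the whole ClientHello arrives in a single TLS record.

```python
import ssl
import struct
from typing import Optional


def extract_sni(client_hello: bytes) -> Optional[str]:
    """Return the SNI hostname from a raw TLS ClientHello record, if present."""
    try:
        # Record type 0x16 = handshake, handshake type 0x01 = ClientHello.
        if client_hello[0] != 0x16 or client_hello[5] != 0x01:
            return None
        pos = 5 + 4                       # record header + handshake header
        pos += 2 + 32                     # client version + random
        pos += 1 + client_hello[pos]      # session id
        (suites_len,) = struct.unpack_from("!H", client_hello, pos)
        pos += 2 + suites_len             # cipher suites
        pos += 1 + client_hello[pos]      # compression methods
        (ext_total,) = struct.unpack_from("!H", client_hello, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", client_hello, pos)
            pos += 4
            if ext_type == 0:             # server_name extension
                # Layout: list length (2), name type (1), name length (2), name.
                name_type = client_hello[pos + 2]
                (name_len,) = struct.unpack_from("!H", client_hello, pos + 3)
                if name_type == 0:        # host_name
                    return client_hello[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None
    return None


# Hypothetical routing table: many TCP services share one listening port.
ROUTES = {
    "ldap.internal.example": ("10.0.1.10", 636),
    "billing-db.internal.example": ("10.0.2.20", 5432),
}


def pick_backend(client_hello: bytes) -> Optional[tuple[str, int]]:
    """Choose the upstream for a new connection based on its SNI."""
    name = extract_sni(client_hello)
    return ROUTES.get(name) if name else None


def make_client_hello(server_name: str) -> bytes:
    """Produce a real ClientHello in memory, for trying the parser out."""
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ssl.create_default_context().wrap_bio(
        incoming, outgoing, server_hostname=server_name
    )
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: no server answered, but the ClientHello was written
    return outgoing.read()


if __name__ == "__main__":
    hello = make_client_hello("ldap.internal.example")
    print(extract_sni(hello), pick_backend(hello))
```

Because the server name travels in clear text at the start of the handshake, the proxy can route on it before deciding which certificate to present.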

»Vault's Role

We looked at our existing Vault deployment. With it, we can use different certificate authorities and different certificate formats to expose these ingress gateways and services to our internal clients. We also added support for HTTP headers, so we can set Host headers, rewrite them, whatever is needed by our applications.

We are working now on enabling passthrough TLS to allow our client applications to have a TLS connection that is terminated all the way at the destination, by double-wrapping the connection, with Consul mutual TLS in between. That will make our network completely invisible, but it will add a bit of latency for sure. A few of our clients want to do this with their own mutual TLS connection. This is something that we are working on at the moment.

»Workload Identity

Another thing that was really important for us was to make sure that we are out of the picture. A service team shouldn't need to interact with the infrastructure team at all when setting up and deploying their application. So we created an API that can be used to request and create and destroy ingress gateways or register services with the mesh and set up intentions and all the other business rules that we needed.

The problem there was that we needed strong identity to identify the different actors and make sure that there was no way to misconfigure the mesh or for a service to take over traffic for another service. Vault was already there. We used the pluggable authentication backends in Vault, and we enabled our platforms to log in to Vault and exchange some proof of identity for a stronger certificate that we could then use to interact with the API.

We have workloads in AWS, whether EC2 instances or serverless workloads, authenticating to Vault.

We have Kubernetes tasks. We have humans using things like LDAP or single sign-on, and also other internal authentication backends that we just plug into Vault. Using these, the different service teams can authenticate, get access to our API, and request the network to configure itself in the shape that they need for their deployment.
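As a rough illustration of that exchange, the Python sketch below builds (but does not send) the two HTTP requests a Kubernetes workload might make: a login against Vault's Kubernetes auth method, then a certificate request using the returned token. The talk does not say which auth methods or secrets engines sit behind each platform, so the PKI secrets engine, the default mount paths `kubernetes` and `pki`, the Vault address, and the role names are all assumptions.

```python
import json
import urllib.request

# Hypothetical Vault address; mount paths below are Vault's defaults.
VAULT_ADDR = "https://vault.internal.example:8200"


def kubernetes_login_request(role: str, service_account_jwt: str) -> urllib.request.Request:
    """Build the login call: the pod's service account token is the proof of identity."""
    body = json.dumps({"role": role, "jwt": service_account_jwt}).encode()
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/auth/kubernetes/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def issue_certificate_request(
    client_token: str, pki_role: str, common_name: str
) -> urllib.request.Request:
    """Build the call that trades the Vault token for a short-lived certificate."""
    body = json.dumps({"common_name": common_name}).encode()
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/pki/issue/{pki_role}",
        data=body,
        method="POST",
        headers={"X-Vault-Token": client_token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    login = kubernetes_login_request("mesh-client", "<service-account-jwt>")
    print(login.get_method(), login.full_url)
    # In a real flow the token comes from the login response ("auth.client_token").
    cert = issue_certificate_request("<client-token>", "mesh-api", "billing.internal.example")
    print(cert.get_method(), cert.full_url)
```

Other platforms would swap only the first request for their own auth method, while the second step stays the same, which is what makes the pluggable backends convenient.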

We reached this point in which we have Consul taking care of connecting different applications across clouds and datacenters, and service clients that can send traffic to their closest dedicated ingress gateway. We take care of moving the packets around without them knowing. And on the other side, a service will just receive traffic from its closest terminating gateway.

Everything is encrypted, and the connection is stable and has decent performance, because the proxies in between usually keep the TCP connection open. So you save a lot of handshakes and reuse existing tunnels to send the packets through.

»Getting a Jump on the First Hop

There's still one bit missing: the first hop. We still need the client to force the traffic through the ingress gateway. Setting custom SNIs is something that is not that common yet on a lot of clients. SNI has been around for 15 years, but most of the time the configuration is implicit. Libraries just set the SNI to whatever FQDN or hostname was requested by the client. More modern clients allow you to set that explicitly. This is the standard way to connect to a service: your client usually uses DNS to resolve a hostname to an IP, opens a port, and connects directly. The SNI is set to service.com in this case, because the library does that, and this works 99% of the time. This is what you want to do.

In our case, it's different. We need our service to get a different response. Instead of the IP and port of the service, it needs to get a hostname or IP and port for the ingress gateway, and then an SNI to reach the service that it wants. This is not easy to do with all libraries.

Modern clients allow you to do that easily. For example, the Vault CLI has an explicit parameter for it. You can set that in config as well, and that allows this specific override between the address where you are connecting and the SNI that you want to use. You can do that with curl, which offers the same features. curl will also set the proper Host header, which is what your destination service wants to see, so it has no idea that you're using an ingress gateway to reach it.
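In a client that exposes the TLS layer, the override comes down to connecting the TCP socket to one address while naming another in the handshake. Here is a minimal sketch with Python's standard `ssl` module. The function name and the example hostnames are hypothetical, and it assumes the gateway presents a certificate that is valid for the service name.

```python
import socket
import ssl
from typing import Optional


def connect_via_gateway(
    gateway_host: str,
    gateway_port: int,
    service_name: str,
    context: Optional[ssl.SSLContext] = None,
    timeout: float = 5.0,
) -> ssl.SSLSocket:
    """Open TCP to the ingress gateway, but ask for service_name in the SNI.

    The certificate is also verified against service_name, not the gateway
    address, so the client still checks that it reached the right service.
    """
    ctx = context or ssl.create_default_context()
    raw = socket.create_connection((gateway_host, gateway_port), timeout=timeout)
    try:
        return ctx.wrap_socket(raw, server_hostname=service_name)
    except Exception:
        raw.close()
        raise


if __name__ == "__main__":
    # Hypothetical names: the gateway is what we dial, the service is what we ask for.
    try:
        with connect_via_gateway(
            "ingress-gw.internal.example", 8443, "billing.internal.example"
        ) as tls:
            print(tls.version())
    except OSError as exc:
        print(f"connection failed: {exc}")
```

The same split between dial address and server name is what curl's `--connect-to` option and the Vault CLI's `VAULT_TLS_SERVER_NAME` setting provide.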

Legacy clients have more problems. The only way, unless you start building libraries and software, is to override DNS. You need to intercept DNS requests and respond with crafted responses that will return the IP address and port of the ingress gateway that you want that client to use. That requires a lot of involvement from the service team.

We started with the idea that we wanted them to not even know that we existed. And now we cannot go back to them and ask them to override DNS, because it's not easy to do, and depending on the technology they're using, it might not even be possible.

»Work with the Platform Team

That's where involving the platform teams came in. Our applications, most of them, run on an internal platform that is managed by a dedicated team. They are in a really good position because they're close to the application. They know what the application needs and how the application runs, but they are also really used to interacting with us and the rest of the infrastructure to plug their platform into the underlying infrastructure.

We did a good thing there. We engaged with them early in the project. We didn't even have a working proof of concept, but we were already chatting with the different teams to see if there was a way to connect what we were building with what they had.

This was really useful and worked really well because, for the service team and the platform team, it was relatively easy to adapt their platform to hook inside our mesh.

An example was the team responsible for our Kubernetes platform. They did a great job and created a Kubernetes operator. It takes care of parsing the upstream requirements from the deployment configurations provided by the service teams. It then interacts with our API and sets up all the different configs, between ingress gateways, service registrations, and intentions, to make sure that our mesh is ready for that kind of traffic.

And then they override and inject specific records in their own CoreDNS in Kubernetes to make sure that when the application actually tries to reach a service that is supposed to go through an ingress gateway and our mesh, Kubernetes will return the IP address and port of the ingress gateway instead of the real IP address, if the URL is something that is resolvable outside of Kubernetes.
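One way such an override could be rendered is as a stanza for CoreDNS's `hosts` plugin, sketched here in Python. The talk does not say which CoreDNS mechanism the operator uses, so the plugin choice, the function name, and the example names and addresses are all assumptions. The sketch also covers only the address part of the answer, since a plain A record carries no port.

```python
import ipaddress


def render_hosts_block(overrides: dict[str, str]) -> str:
    """Render a CoreDNS `hosts` stanza mapping service names to gateway IPs.

    `fallthrough` lets every name that is not overridden continue to the
    next plugin, so ordinary resolution keeps working.
    """
    lines = ["hosts {"]
    for fqdn in sorted(overrides):
        gateway_ip = ipaddress.ip_address(overrides[fqdn])  # raises on a bad address
        lines.append(f"    {gateway_ip} {fqdn}")
    lines.append("    fallthrough")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical upstreams that should be reached through the local ingress gateway.
OVERRIDES = {
    "ldap.internal.example": "10.20.0.15",
    "billing.internal.example": "10.20.0.15",
}

if __name__ == "__main__":
    print(render_hosts_block(OVERRIDES))
```

With a stanza like this in place, the application keeps using the service's usual name and the resolver quietly points it at the gateway.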

This is nice because the applications don't really see what's going on. They don't know and they don't care. That's what we were aiming for.

We started deploying this in a few dev environments, and it was working fine. A lot of times they didn't even know that their traffic was going through this mesh and it just worked. That was what we were aiming for, having this network that you don't see, you don't know you're using. And that means that it's perfect for what you are trying to do.

We went from a proof of concept to handling hundreds of services in a matter of weeks, but probably too fast. We discovered really early a lot of scaling problems and config issues. But it was great because we had quick feedback and short feedback loops, and we are still improving. But it's already handling a lot of traffic and we are rolling this out for production workloads as well.

One thing that wasn't covered by the platform teams is the long tail of services that are not part of our platforms, or that run on a lower level.

One example was LDAP connections for Linux systems, like virtual machines that you want to connect to LDAP to have your Unix users and groups populated from there. That doesn't work. The platform can't help you there because you are on a lower level.

And as we found out, LDAP libraries in common Linux distributions still don't support SNI, probably because no one needed it. They're a bit behind. The only solution was to offer a lot of different options and configuration snippets and different ways to do the same thing, so we could give all the service teams a list of different ways to solve that problem. Hopefully one of those recommendations will match what they're doing.

For virtual machines, it was simple. We can use stunnel, which runs everywhere, and you can use it to add TLS and SNI to any connection.

It listens on localhost and then takes care of connecting as a client to the upstream. Modern stunnel versions also have a nice explicit SNI setting, which makes the configuration really easy.

For others, say a workload running in AWS, we offer people snippets in Terraform that show them how to override Route53 records. That works great for serverless workloads like Lambdas. And that was probably the best thing that we could do. There's no way to create something that works for everything, but if you offer the service team enough options that eventually will make their life easier, they will try to help you in return.

The advantage of being part of this mesh is so great that even when some work is required on their side, it is not felt as an imposition from the infrastructure team. It's, "Let's meet halfway and we can get something great going on in here by doing a bit of work on both sides."

The success of this project was based mostly on the communication and cooperation between the different teams. The technology itself keeps changing. The requirements we had 6 months ago are different from what we have now. And in 6 months' time they will be different again.

We are using HAProxy. We'll probably move to Envoy for the ingresses, and I'm not sure what we're going to do with the rest of Consul as we move forward.

The important thing is that we can fix the technology. We have a lot of smart engineers working on the problems and they will find a solution and something that works.

But you need to get the people on board and make sure that we are all working in the same direction and trying to have a clear goal of what needs to happen. And then everything else will fall into place.

Thank you for listening to my presentation. I hope you have a great conference.
