Making Multi-Environment Service Networking on Microsoft Azure Easy With Consul
Oct 07, 2019
This talk is a deep dive into the features of HashiCorp Consul Service on Azure (HCS).
Do you use Microsoft Azure? Do you run a Kubernetes cluster? Do you run... MORE than ONE Kubernetes cluster? Do your Kubernetes applications need to talk to applications running on Azure VMs, a database on-premises, or to a stack of toasters running NetBSD in your closet?
Consul can help you adopt a service mesh, enabling a flat, fully encrypted network that spans multiple underlying networking environments. Sound intimidating? It's not! Learn how this can be made as easy as possible on Microsoft Azure using HashiCorp Consul Service on Azure.
- Alex Dadgar, Software Developer, HashiCorp
- Brandon Chavis, Solutions architect, AWS
Brandon Chavis: Hello and welcome. Thanks for coming. The title to this talk is Easy Multi-Environment Service Networking on Microsoft Azure. You can safely disregard this title. We are going to talk about some of that, but it is a stealth-mode title for us, so we didn't accidentally give away HCS before the conference.
I'm Brandon Chavis, I'm the product manager for HCS. We'll bring up Alex Dadgar about halfway through to talk about some of the technical deep-dive aspects. But let's go ahead and get started.
This morning we announced a new initiative. This is a beta launch of HashiCorp Consul Service. It's a private beta. And what is this? This is a managed Consul service. This is something we worked with Microsoft on, built in partnership with them.
Our goal with HCS is just to make it as easy as possible to get a fully managed Consul cluster running in Microsoft Azure, so that you can take advantage of service discovery and service mesh functionality to connect all of your different compute environments together.
We are going to deep-dive into HCS in this talk, but first I wanted to set some context for why we decided to build a managed service for Consul. Consul serves as the center of your networking architecture, because it can connect together so many different compute environments, whether that's Kubernetes, virtual machines, or physical servers on-prem.
Anything that tends to be responsible for so many environments also tends to be a pretty fundamental component of your stack. We wanted to make it easy to take advantage of Consul, but also dramatically reduce the day-to-day operations that your team has to undertake and leave that to the experts at HashiCorp.
Consul’s role in your infrastructure
Let's talk about a couple of scenarios where Consul is really useful, and why it serves such an important role in your infrastructure.
We'll review a world, a very ancient world, before service discovery. And traditional networking is where you view things at an IP level, and services connect directly to another IP. This assumes a world where IP addresses don't change frequently, maybe like on-prem servers.
Something this static we know really isn't realistic, especially in a world in the cloud with autoscaling virtual machines, maybe with containers that are ephemeral and rescheduled constantly. Maybe they live only a few hours. We know that you cannot rely on an IP address to be consistent and use that to reach these services.
As things get more dynamic, as you move to the cloud, there are ways to cope. Assuming your clients and your servers are in the same network, you could use something like an internal load balancer.
But you need to make sure that your downstream service of that load balancer knows about that endpoint. You have to figure out some way of propagating that information to your other services, and you also have to figure out how to update it if it changes.
Sometimes what this looks like is your downstream service hard-codes a DNS endpoint. Maybe it just looks directly at the CNAME assigned to your load balancer. And this is generally fine if you're running 1 or just a couple of services, but as you scale up this gets pretty complex. Changing just 1 DNS record could require you to do things like update dozens of downstream consumer services.
This is where Consul service networking comes in and can dramatically simplify your application's view of the world.
There are 2 sub-patterns here within service networking that we'd like to talk about. The first is service discovery, and this is what gives us name-based resolution, or essentially letting you not care about the IP that your service is running on.
Consul knows where instances of a service, the API service for example, live. It keeps up with any changes to their location or health, and then it propagates those changes across the cluster.
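As a rough illustration of name-based resolution, here is a minimal in-memory sketch in Python. It is not Consul's implementation or API: the `Registry` class, its methods, and the addresses are hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy service registry (hypothetical, not Consul's API)."""
    instances: dict = field(default_factory=dict)  # name -> {address: healthy?}

    def register(self, name, address, healthy=True):
        self.instances.setdefault(name, {})[address] = healthy

    def set_health(self, name, address, healthy):
        self.instances[name][address] = healthy

    def resolve(self, name):
        """Return only the healthy addresses registered under a service name."""
        return sorted(a for a, ok in self.instances.get(name, {}).items() if ok)


registry = Registry()
registry.register("api", "10.0.1.5:8080")
registry.register("api", "10.0.2.9:8080")
registry.set_health("api", "10.0.1.5:8080", False)  # instance fails its health check
print(registry.resolve("api"))  # callers ask for "api" and never hard-code an IP
```

The caller only ever names the service; which addresses come back depends on what is currently registered and healthy.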
Service mesh gives us the ability to specify connectivity patterns between our services and then connect them using Mutual TLS. TLS gives us the authentication solution, so it's identity via certificates. And then authorization is provided by rules in Consul.
This lets us say that our API can reach our database, but our web service can't reach out directly to the database. You still have IPs in the mix, and they're still a core networking primitive, but they're no longer a security mechanism for you.
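The allow/deny rules described here can be pictured as a lookup keyed on source and destination service names. The sketch below is a simplification under assumptions: the `RULES` table, the default-deny choice, and `is_allowed` are illustrative and are not Consul's actual rule engine.

```python
# Hypothetical rule table: (source service, destination service) -> allowed?
RULES = {
    ("api", "db"): True,
    ("web", "api"): True,
    ("web", "db"): False,
}


def is_allowed(source: str, destination: str, default: bool = False) -> bool:
    """Authorize by service identity (taken from the certificate), not by IP address."""
    return RULES.get((source, destination), default)


print(is_allowed("api", "db"))
print(is_allowed("web", "db"))
```

Because the decision is keyed on service names, it stays valid when an instance is rescheduled onto a different IP.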
Some systems, like Kubernetes, already solve service discovery pretty well. For example, here our web service can reach out to our API service within the same cluster, no matter what instance in the cluster it's running on. You can do that by name; it can do that anytime.
The cross-cluster problem: Kubernetes vs Consul
But there's a major caveat here in Kubernetes, and that's that this service discovery only works within the same cluster. If API wants to reach out to the DB service running in a different cluster, it wouldn't automatically know how to resolve that service endpoint.
There are ways to solve this. You could use a load balancer, but you're still responsible for propagation of that load balancer information across clusters. Of course, there are ways to do this, but you're down the path of building service discovery for your service discovery.
Again, Consul can help out here by being a central authority for all of the services that live across Kubernetes clusters.
The service registry ensures that everything that's running across both clusters is dynamically registered and monitored and health-checked, and is up to date. Either cluster can use that Consul service registry to discover and access a service on another cluster entirely.
This is all implemented behind the scenes using really simple mechanisms. At the end of the day, it's just the Consul agent running on each node, and it does DNS resolution for services that are in the registry. Kubernetes just talks to Consul over localhost, so there's no additional latency of another load balancer hop.
Problem: Consul challenges across disconnected networks
But even with Consul, there are still some common networking challenges that we need to solve. Even if I have Consul, if my applications are on different disconnected networks, for example, I can't easily make them available to each other.
It's really common for us to see different Azure VNets for different teams. Maybe each team has their own VNet or their own Kubernetes cluster. It's also really common practice to put a Kubernetes cluster into its own VNet, and so the connectivity between these clusters is something we have to solve for.
You could again use an external load balancer in this case, but as we've seen, this doesn't really make your life that much easier.
If your clusters are in separate networks, we could use VNet Peering as well. We could solve this connectivity challenge that way, so long as we've planned our IP address space in advance.
This is the same problem as setting up a VPN to connect back on-prem. You need to make sure you've planned ahead, that you've got no IP address space overlap. If you do, you might need NAT and the complexity grows.
VNet Peering would then enable you to use an internal load balancer in this case, like we described in a previous slide, but then you have the challenge of managing the load balancer configuration as well as the underlying network connectivity.
Mesh gateways close the gap
A graceful way that Consul can help you solve this is with mesh gateways. We announced this as part of Consul 1.6. Mesh gateways are effectively just Envoy proxies, and they enable you to connect compute environments that run in different clouds, or in different networks.
Mesh gateways receive their configuration from Consul, and they just proxy fully encrypted traffic across network links. Super-cool because they can't decrypt that traffic. It's fully encrypted. The only plaintext that they see is the SNI header specifying where to forward traffic.
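To make the forwarding idea concrete, the following sketch routes on the SNI value alone and never inspects the payload. The SNI names, the addresses, and the `pick_upstream` and `forward` functions are assumptions for illustration; the real gateways are Envoy proxies configured by Consul.

```python
# Hypothetical routing table: SNI server name -> address in the gateway's local network.
ROUTES = {
    "db.dc2.example.internal": "10.2.0.7:5432",
    "api.dc2.example.internal": "10.2.0.8:8080",
}


def pick_upstream(sni: str) -> str:
    """Choose where to forward a connection using only the plaintext SNI value."""
    try:
        return ROUTES[sni]
    except KeyError:
        raise LookupError(f"no route for SNI {sni!r}") from None


def forward(sni: str, encrypted_payload: bytes) -> tuple[str, bytes]:
    # The payload passes through untouched; this function holds no key to decrypt it.
    return pick_upstream(sni), encrypted_payload


print(forward("db.dc2.example.internal", b"opaque-ciphertext"))
```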
At the end of the day, it looks like this. Mesh gateways can help you connect two Kubernetes clusters that live in different networks. This can help you span Kubernetes service discovery across multiple clusters without needing to manage load balancers, and without needing to manage peering connections.
And because mesh gateways work anywhere, not just in Kubernetes, they're super-helpful for migration or hybrid use cases.
For example, you can add legacy applications to the service mesh even if they run in different networks. As a migration example, as you move pieces of functionality from a monolith running in a VM to Kubernetes, if they're all participating in the service mesh, all you need to do is re-register that service and its new location is automatically propagated across your infrastructure.
These are some of the numerous networking scenarios that Consul can help with. Especially if you're using Kubernetes and you need that cluster to communicate with other compute environments or other clusters, especially those living in different networks.
Service mesh and service discovery in conjunction can save you from managing load balancers and peering connections, and can absolutely help you with migration use cases as you move from those monoliths to microservices.
With HCS you get those benefits, but we also further reduce your operational burden by managing those Consul servers on your behalf. And you can just take advantage of service mesh and service discovery and leave the day-to-day stuff to us.
Benefits of HCS
Let's walk through some of the benefits of HCS.
The first benefit is provisioning. We provision the Consul cluster for you and we streamline the setup to make it much easier to get going.
There is some configuration data you give us when you spin up the cluster. We take all those inputs and we provision the cluster as you've requested.
The second benefit, after provisioning, is operational work. Operational work is ongoing, but we have a team of SREs that are there to manage and support your clusters, and we also provide features to dramatically reduce the work that your team has to do when you're running Consul.
This includes things like automatic upgrades, automatic backups, and clusters that are configured for high availability by default.
The last point is security. This is a huge focus of ours, and we'll be enabling and configuring all the security features that are available in Consul by default. This includes things like ACLs and certificate management that we take off your plate.
HCS on Azure
Let's also talk a little bit about the Microsoft integration, because we worked closely with the folks at Azure to get this built.
Some of the most important things are that you can provision and manage your clusters directly through the Azure portal, and we saw a little bit of that in the keynote. We'll talk about that more later.
But HCS basically makes it so you don't ever have to leave the Azure environment to get started. You can also control access through Azure Active Directory. You don't need to sign up anywhere else; your identity in Azure is automatically taken in and we use that to create the cluster.
Then finally, you'll also get granular metering and usage data on your Azure bill, and you'll pay through Azure.
These are some of the integrations that exist today on top of all the architectural best practice for Azure and Consul that we've put into the design of this service.
This slide shows the 30,000-foot view of HCS.
You'll find it eventually available in the Azure Marketplace. It's not publicly there quite yet. And you'll be able to subscribe from the Marketplace.
Once you've subscribed to the service, you'll be able to create a new Consul cluster, and we have a control plane here on the right-hand side that is built to enable the provisioning and the management of the clusters on your behalf.
These resources are provisioned into a resource group that lives in your account and it's locked down. And you don't have access to those resources, but these are the resources that we manage for you. This contains things like a VNet dedicated for your Consul cluster and the servers for Consul to run on.
I've described a little bit about why we decided to build HCS and what problems it can help you solve. So I'll bring Alex Dadgar up on the stage to dive into details of how it all works behind the scenes.
How to use HCS
Alex Dadgar: Thank you, Brandon. We spent a lot of time making the user experience of using HCS really easy. I want to show you how you use HCS.
Accessing the Consul UI couldn't be easier. As you've seen, after you provision the Consul cluster, you can access the Consul UI directly in the Azure portal.
The way we make this work is, during cluster creation we create DNS entries and provision publicly trusted Let's Encrypt certificates for you. This gives you a resolvable DNS domain name that lets you access your Consul cluster UI fully end-to-end encrypted.
Gone are the days of having to upload self-signed certificates into your browser in order to look at the Consul UI. You have a fully publicly trusted certificate chain.
Connecting the Consul clients is similarly easy. During cluster creation, you give us a CIDR block, and we use that while we're creating the virtual network that we launch Consul clusters into.
After the cluster is provisioned, you can peer your virtual network to our virtual network. And even better, we generate default client configurations for you. These use the same best security practices as we use on the servers.
Since we have a custom cloud auto-join provider for HCS, this default configuration will have a retry_join string that enables your Consul clients to automatically determine the IPs of the Consul servers and join automatically, so you don't have to do anything.
After you connect your Consul clients, you're ready to use Consul. There are no real extra steps because you're using HCS.
But what's really important to note is we take care of all sorts of operational tasks in the background for you, and we'll walk through those now.
For anyone who has set up a Consul cluster, securing it is always difficult. We fully take care of this for you. We generate client and server certificates for Consul using short TTLs, utilizing Vault in our control plane, and we automatically rotate those for you as they near their expiration date.
We also enable ACLs during the Consul cluster's creation. Before anyone has access to the cluster, it's already bootstrapped with an ACL system. And of course we also enable gossip encryption.
HCS for disaster recovery
We also help you with disaster recovery.
We will automatically take state backups for you, and the operator can decide the snapshot rate and the number of snapshots to keep. This could look like, "Take a snapshot every hour and retain the last 8 snapshots."
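A policy like "every hour, keep the last 8" comes down to two small decisions: whether a snapshot is due, and which old ones to drop. The sketch below shows one way to express that; `prune` and `is_due` are hypothetical helpers, not part of any HCS API.

```python
from datetime import datetime, timedelta


def prune(snapshots: list[datetime], keep: int) -> list[datetime]:
    """Return the `keep` most recent snapshot timestamps, oldest first."""
    return sorted(snapshots)[-keep:] if keep > 0 else []


def is_due(last: datetime, now: datetime, interval: timedelta) -> bool:
    """True when at least one full interval has passed since the last snapshot."""
    return now - last >= interval


start = datetime(2019, 10, 7, 0, 0)
hourly = [start + timedelta(hours=h) for h in range(12)]  # 12 hourly snapshots
kept = prune(hourly, keep=8)
print(kept[0], kept[-1])  # the oldest and newest snapshots that are retained
```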
We also have a force-snapshot API, so if you're ever about to do something dangerous on your cluster or you want to be extra safe, you can take a snapshot, do your cluster modifications, and then if anything goes badly, you can restore with a fresh state snapshot.
We also automatically upgrade the cluster for you, so you as an operator can decide whether you want us to apply only point releases or whether you want us to apply major releases as well.
We also support a maintenance config so you can tell us to only upgrade your cluster, let's say, on Saturday and Sunday between midnight and 6 AM, and then we try to respect that.
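The window in that example (Saturday and Sunday, midnight to 6 AM) can be checked with a few lines. This is only a sketch: the constant names and `in_maintenance_window` are made up, and it assumes the timestamp is already in whatever time zone the window is defined in.

```python
from datetime import datetime

ALLOWED_WEEKDAYS = {5, 6}    # Monday is 0, so 5 and 6 are Saturday and Sunday
START_HOUR, END_HOUR = 0, 6  # midnight (inclusive) to 6 AM (exclusive)


def in_maintenance_window(now: datetime) -> bool:
    """True when an upgrade would be allowed to start at `now`."""
    return now.weekday() in ALLOWED_WEEKDAYS and START_HOUR <= now.hour < END_HOUR


print(in_maintenance_window(datetime(2019, 10, 12, 3, 30)))  # a Saturday at 3:30 AM
```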
That's some of the benefits you get from a fully managed HCS service.
How it all works
Next we'll walk through how we make it all work.
To do that, we will walk through the high-level Consul create workflow. As Brandon mentioned, the first step is finding the HCS service in the Azure Marketplace. This looks like this: You find Consul, you select it, and then you give us some details like, "Where do you want that Consul cluster to live? Which cloud region? What resource group do you want us to launch it in?"
After you do that, you can specify the Consul configuration itself. That might be the datacenter name, how many servers you want us to run, what the snapshot and maintenance configs are, and so on.
After you do that, Azure will create 2 resource groups in the background. One is for the customer, and one is for HashiCorp. We'll go into more details on how this is used later on in the talk.
After Azure creates these 2 resource groups, we as a control plane receive a notification, and in the background the control plane starts provisioning the resources that comprise your Consul cluster.
There's a lot to unpack here, and we're going to start by looking at the architecture of the control plane.
Before we talk about the architecture of the control plane, let's define what its responsibilities are. These largely break down into 2 classes of operations. We have cluster lifecycle operations such as creating the Consul cluster, updating the Consul cluster, and deleting the Consul cluster. Then we have APIs for operators such as setting up the maintenance config, reading the cluster configs, etc.
Architecture of the control plane
Let's talk about how we architected the control plane to do this. My slides are blank because we had a fully clean-room design. We didn't have any baggage when we started this project, so we really architected it from the ground up.
We only had one major requirement: The control plane had to be highly reliable.
How did we come up with our architecture given we had no baggage and no real requirements? Well we worked backward from the fact that requests can take minutes to complete.
If we go back to this slide that defines the responsibilities, we can see that the cluster lifecycle tasks can take minutes, whereas the API tier takes only milliseconds. When you think about it, it makes sense.
If you think about creating a Consul cluster, what is it doing? It's spinning up VMs, creating virtual networks, provisioning DNS records, etc. All of this can take time.
But the fact that it can take time poses a real challenge from a design perspective, because a lot can go wrong in a minute.
Your client's TCP connections can close. The machines that are fulfilling those requests can die. Your external dependencies can fail on that time scale. The Blob Stores can go down. Let’s Encrypt can go down, and we're not perfect either. Our internal upstreams could also be failing.
When we set out to design our internal architecture, we had to come up with a series of requirements for our long running tasks. They're as follows:
We cannot run the requests on a single machine because they can fail, so we have to be able to distribute the tasks among the pool of workers.
Those tasks must be retriable, because upstreams can go down and we can't predict how long they'll go down for.
We have to be resilient to those failures. Whatever we do has to be very consistent even through retries, because, after all, we are bringing up infrastructure, so we don't want to bring up 2 VMs when we meant to bring up 1.
Lastly, it must be highly observable, because we need to enable our SREs to make sure our customers' clusters are running without a hitch.
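The "2 VMs when we meant 1" concern is the classic argument for idempotent operations. One common approach, shown here as a sketch rather than as what the HCS control plane does, is to key every create call on a stable request ID so a retry finds the earlier result. `FakeCloud` and `ensure_vm` are hypothetical stand-ins for a cloud API.

```python
class FakeCloud:
    """Stand-in for a cloud API (hypothetical); tracks the VMs that really exist."""

    def __init__(self):
        self.vms = {}

    def ensure_vm(self, request_id: str, size: str) -> str:
        """Create the VM for this request once; repeated calls return the same VM."""
        if request_id not in self.vms:
            self.vms[request_id] = f"vm-{len(self.vms) + 1}-{size}"
        return self.vms[request_id]


cloud = FakeCloud()
first = cloud.ensure_vm("cluster-a/server-0", "small")
retry = cloud.ensure_vm("cluster-a/server-0", "small")  # e.g. the worker died and the task reran
print(first == retry, len(cloud.vms))
```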
How a workflow engine helps
We looked out into the ecosystem to decide what tool to use, and we decided to use an open-source tool by Uber called Cadence. Cadence is a workflow engine. Many of us aren't familiar with what a workflow engine is, so I'll try to give a high-level description of what they are.
When you're programming in a workflow engine, you really structure your program into 2 things: workflows and activities.
Workflows describe the high-level flow of your program, and they execute activities, which implement the business logic. It looks like normal code, but there are a few requirements. Your workflows must be deterministic, and ideally your activities should be idempotent because they can be retried.
I'll walk through an example to try to make it more clear how a workflow looks.
Here is an example workflow for creating a Consul cluster. It has a few steps. First we generate certificates, then we upload those certificates to a Blob Store, and lastly we execute Terraform to bring up the Consul cluster.
What's interesting about workflow engines is that the machine running this workflow might be different from the machine executing those functions. We'll walk through this.
First the workflow starts, and it has to generate certificates. When we call this, it runs an activity. This activity might be running on a machine different than the machine running the workflow.
When this activity runs, it's just normal Go code, and it has a return value. When that value is returned, it doesn't get sent directly to the machine running the workflow. Instead, its results are uploaded to Cadence.
Next we move on to the next activity, upload certificates, and we can see here we can reference the output of the previous step. It looks like we have a shared-state system, even though the work is being distributed across machines.
Let's say we're running this upload activity and something fails. Let's say the Blob Store is down. Obviously we need to retry this. Ideally we wouldn't have to rerun the full workflow, and with Cadence we don't have to.
When the workflow is retried, it reaches the generate-certificate activity again, but instead of rerunning it, we reuse the value previously stored in Cadence. We then run the upload-certificate activity using the old results, and once that works through our series of retries, we can finally move on to the next step.
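The replay behavior can be imitated in a few lines. This is a toy model written in Python for brevity, not Cadence itself and not its client API: `History` stands in for the engine's durable record of completed activities, and every name here is hypothetical.

```python
class History:
    """Toy stand-in for the engine's durable store of completed activity results."""

    def __init__(self):
        self.results = {}
        self.calls = []  # records which activities actually executed

    def run(self, name, fn, *args):
        if name in self.results:        # completed earlier: reuse, do not rerun
            return self.results[name]
        self.calls.append(name)
        self.results[name] = fn(*args)  # stored only if fn returns successfully
        return self.results[name]


blob_store_up = False


def generate_certs():
    return "cert-123"


def upload_certs(cert):
    if not blob_store_up:
        raise ConnectionError("blob store is down")
    return f"blob://certs/{cert}"


def create_cluster(history):
    cert = history.run("generate_certs", generate_certs)
    return history.run("upload_certs", upload_certs, cert)


history = History()
try:
    create_cluster(history)      # first attempt fails at the upload step
except ConnectionError:
    pass
blob_store_up = True
print(create_cluster(history))   # replay: the certificate is not regenerated
print(history.calls)
```

The workflow function is simply called again from the top, which is why it has to be deterministic: on replay it must ask for the same activities in the same order so that the stored results line up.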
This gives you an idea of some of the benefits of Cadence. The main benefit is that it gives you a consistent paradigm to write workflows that execute across fleets of machines in a deterministic and consistent fashion.
Activities are highly configurable when it comes to retries. So we can retry for hours or fail quickly, depending on the scenario. And Cadence itself is highly scalable. Uber, for example, runs a workflow for every single driver, and there are a lot of drivers.
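"Retry for hours or fail quickly" is a matter of retry policy parameters. The helper below is a generic sketch with exponential backoff; the `retry` function and its parameter names are illustrative and are not Cadence's retry policy options.

```python
def retry(fn, max_attempts, initial_delay=1.0, backoff=2.0, sleep=lambda s: None):
    """Call fn until it succeeds or max_attempts is reached, growing the delay each time.

    `sleep` defaults to a no-op so the example runs instantly; pass time.sleep for real waits.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise              # out of attempts: surface the last error
            sleep(delay)
            delay *= backoff


attempts = []
waits = []


def flaky_upload():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("upstream unavailable")
    return "uploaded"


result = retry(flaky_upload, max_attempts=5, sleep=waits.append)
print(result, len(attempts), waits)
```

Setting `max_attempts=1` gives the fail-fast behavior, while a large attempt count with a long delay approximates retrying for hours.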
Back to the various things the control plane has to accomplish. We solved the cluster lifecycle tasks by using Cadence. And for the rest we decided to use a microservice architecture.
We use protocol buffers and gRPC to define the RPC endpoints. This lets us tap into the super-rich gRPC ecosystem. We use a technology called gRPC Gateway, which allows us to generate RESTful APIs based on our protocol buffers. We use this to integrate directly with Azure, which exposes and expects a RESTful API.
We also from the get-go have instrumented and traced all our requests, so that we have a highly observable system.
I wasn't kidding when I said we designed backwards from the fact that certain operations can take a long time. Our tech stack starts with Cadence. We then model all our long running operations as workflows.
We have a workflow for creating a cluster, deleting one, potentially doing a rolling upgrade.
Then on top of that we layer our microservices. These microservices receive traffic from Azure via the gRPC Gateway RESTful API tier.
What's really awesome about this is each of these components breaks down to a Nomad job file. Internally our full control plane is running on Nomad. Each of those boxes you can see is a container running on the fleet of Nomad clients.
And since this control plane is used to orchestrate Consul, we would be remiss not to use Consul. So we also run Consul internally, and use the Nomad-Consul Connect integration to secure traffic internally between all our microservices.
You can probably guess by now, we also use Vault internally. We use Vault to generate short-lived certificates to access Azure resources. We also use it to create certificates for the Consul clusters for each client, and to transit-encrypt secrets that are in our system. What this means is internally we never store any secret in plaintext.
Takeaways of the control plane
We built a highly reliable system leveraging Cadence and Nomad, and we secure it using Vault, so we have minimal secrets, and the secrets we do have are short-lived. We also encrypt all traffic in our cluster using Consul Connect.
Lastly, it's highly observable. This might not seem like it matters, but we need to enable our SREs to debug any problems to make sure the Consul clusters are operating without a hitch.
Architecture of the data plane
This is where infrastructure is running. As mentioned, when you create an HCS cluster from the Azure Marketplace, Azure will create 2 resource groups.
One of them is for the customer and this contains the interface to use Consul. This is where you would see the Consul UI; this is where you could hit those APIs, and more.
The other resource group is for HashiCorp. We deploy infrastructure into that resource group. What's important about this architecture is customers do not have access to that resource group. So you can't accidentally go in there and delete a piece of infrastructure that's used to run your Consul cluster.
For the cloud resources we do bring up, we bring them up using Terraform. Each Consul server is running on its own VM, and we use Virtual Machine Scale Sets (VMSS) in Azure to make them reliable.
What's important about a VMSS is that if the underlying hardware that's running the VM in Azure is facing problems, or dies, running it in a Virtual Machine Scale Set means Azure automatically brings it back up. It's resilient against hardware failure.
We then spread those virtual machines across availability zones, and they each share a virtual network that you can then peer into.
Let's see how this works. If you select 3 Consul servers, for example, we'll put 1 Consul server inside each AZ, and if you scale up to, say, 5, we then start round-robining across AZs. The reason we do this is to give you the highest fault tolerance in the case of an AZ failure.
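The placement rule is plain round-robin over the zones. A minimal sketch, with `place_servers` and the zone names as made-up placeholders:

```python
from collections import Counter


def place_servers(server_count: int, zones: list[str]) -> list[str]:
    """Assign each server to a zone in round-robin order."""
    return [zones[i % len(zones)] for i in range(server_count)]


zones = ["zone-1", "zone-2", "zone-3"]
print(place_servers(3, zones))           # one server per zone
print(Counter(place_servers(5, zones)))  # no zone holds more than two of the five
```

With 5 servers spread this way, losing any single zone removes at most 2 of them, so 3 of 5 remain, which is still a majority for Consul's Raft quorum.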
That's the infrastructure we'd bring up, but how do we configure the Consul cluster itself? Consul requires certificates and its configuration. We treat those as secrets. We don't want to expose those, in case the machine gets compromised.
The way we do that is we upload them to a Blob Store and lock down access to just the user on the VM running Consul and that VM itself. We then use a helper binary called the Host Manager that's running on the VMs, and inject a configuration via cloud-init as plaintext. All that configuration contains is the paths in the Blob Store to go download the secrets from.
Building in reliability
We can walk through this. When the control plane comes up, before it brings up any Consul VMs, it uploads the configuration to the Blob Store. It then will bring up the Consul VMs, and the Host Manager will read its configuration from cloud-init, so it'll know the paths to go read the certificates and the Consul config from the Blob Store.
It then writes those to disk and signals to the Consul binary that it should start up.
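The boot sequence can be sketched as follows. Everything here is a stand-in under stated assumptions: `BLOB_STORE` is a dictionary playing the role of the locked-down Blob Store, `BOOT_CONFIG` plays the role of the plaintext cloud-init payload, and `bootstrap` is a hypothetical name for what the Host Manager does. The paths and file contents are invented.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the locked-down Blob Store (hypothetical paths and contents).
BLOB_STORE = {
    "clusters/demo/consul.json": '{"datacenter": "demo"}',
    "clusters/demo/server.pem": "-----BEGIN CERTIFICATE-----...",
}

# What arrives in plaintext: only paths, never the secrets themselves.
BOOT_CONFIG = json.dumps({"blob_paths": list(BLOB_STORE)})


def bootstrap(boot_config: str, dest: Path) -> list[Path]:
    """Download each referenced blob, write it to disk, and return the written files."""
    written = []
    for blob_path in json.loads(boot_config)["blob_paths"]:
        target = dest / Path(blob_path).name
        target.write_text(BLOB_STORE[blob_path])
        written.append(target)
    return written  # at this point the real helper would signal Consul to start


workdir = Path(tempfile.mkdtemp())
files = bootstrap(BOOT_CONFIG, workdir)
print(sorted(f.name for f in files))
```

Note that nothing in this flow calls back to a control plane: once the blobs exist, the VM can fetch them on its own.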
What's awesome about this architecture is, after we provision the Consul cluster, if our control plane goes down, we don't bring down the data plane, because the Consul cluster has all the configuration and secrets it needs to run in the Blob Store, and it knows how to get them.
When we designed the data plane, we took extra care to make sure it was a highly resilient architecture. With what I showed you, we can handle machine failures and single AZ failures without taking down your Consul cluster.
We're also highly secure because no secret is ever passed via plaintext, and we lock down even the access to the Blob Stores to just those VMs. Lastly, your uptime is fully decoupled from ours. In the case of a disaster, your Consul clusters still stay up.
Why did we spend so long talking about architecture? Well, the architecture of a system really defines its high-level properties, and we wanted you to know that we spent a lot of time and care designing our control plane and data plane architecture to be highly secure and highly reliable so that you can trust us to run your critical infrastructure.
HashiCorp Consul Service is now in private beta, and we're super-excited about it. If you're interested in this at all, please register for more information at hashicorp.com/hcs.
We really think that we have built the most secure, reliable, and easy-to-use Consul experience, so we couldn't be more excited.
Thank you so much, and I hope you enjoy the rest of HashiConf.