Presentation

The What, Why, and How of Zero Trust Networking

This talk explains why earlier approaches like software-defined networking (SDN) and software-defined firewalls (SDF) fall short, and how HashiCorp Consul can help you build a zero trust network that works for you.

Update 2022: For a detailed, technical, written overview of service mesh, visit our newer page: What is a service mesh? To learn more about the challenges, business impact, and four pillars of zero trust security, visit the new HashiCorp Zero Trust Explained page.

Speakers

  • Armon Dadgar
    Co-founder & CTO, HashiCorp

Transcript

Thank you, guys; thanks for making the time. For those of you who don’t know me, my name’s Armon. I’m one of the founders and CTO of HashiCorp. You’ll find me all around the internet as just Armon, and I’ll post the slides to Twitter after this. If you have any questions, feel free to follow up with me.

For those of you who are less familiar with HashiCorp, or maybe know only some of the tools we work on: As a company, what we try to focus on is the broad swath of problems we face as we adopt cloud-based infrastructure. In our view there are four different layers, four different challenges. One is, How do we provision infrastructure? How do we think about setting up and managing infrastructure on day one, day two as we evolve it over time, and day N as we decommission that infrastructure through its lifecycle? Our big focus there is our Terraform tool.

The second is, How do we think about the challenges of securing cloud-based infrastructure—things like secret management, key management, data protection, and application identity? Our focus there is our Vault tool. Then there’s a question around, How do we connect all these services together? We’re going to mix our VM-based applications, our container-based applications, and our serverless applications, and we want all of these to operate as a single, logical app. How do we bring all these pieces together and network them? That’s the connectivity and service networking challenge, and that’s our Consul tool, which we’ll talk more about today.

And lastly, How do we deploy our applications? How do we give developers a simple interface so that they can deploy an app, scale up and down, change configs, and manage the app lifecycle independent of the machine lifecycle? That’s our Nomad tool for application deployment. That gives a little bit of context.

Beyond traditional networking

The talk is really about zero-trust networking, but before we can get there, it’s useful to have a little context on traditional networking: How do we think about operating and securing our networks today, and why does zero trust even make sense? This diagram is what you would consider a classic, standard networking topology. You have the four walls of the data center and a single ingress/egress point, where all of your traffic comes in and out over a single pipe.

In this approach, what we would do is deploy all of our fun network middleware appliances right at that front door. We’d put our firewalls at Layer 4, our WAF (web application firewall) at Layer 7, our IDS/IPS, our SIEM systems—all of these network middleware appliances—to filter the traffic that’s coming in and out. Our goal is to create a zone distinction. We want a red and a green zone, where we say everything on the outside of our network is bad—it’s red, it’s untrusted. We don’t know what that traffic is. Everything on the inside is green. It’s been vetted, it’s made it through the gauntlet of our checks at the front door, and we’ve said, “This traffic is good. We’re now inside of a trusted zone.”

This is the classic approach to network security. And within the four walls of our network, if we have Service A and Service B, and they want to talk to each other, they can just do that. It’s an open network, so Service A just connects and speaks HTTP or RPC or whatever the protocol is. But generally that traffic isn’t encrypted, and it isn’t authenticated. Just by virtue of being on my private network, you’re allowed to talk to the service. Or we might be filtering on a CIDR block and saying, “Anything inside the CIDR block is allowed to talk to me”—anything within the four walls.

What’s happening is, this network perimeter is allowing us to assert trust. We’re asserting that any traffic that’s from within the four walls, we trust it implicitly. If we start to say, “Well, the inside of our network’s getting a little too big for comfort. We don’t necessarily trust everybody inside of that network,” we might start to divide it. We’ll start to put zones within it. And this might be things like VLANs, so a virtual LAN kind of approach, or we say, “We’ll tag the traffic and then they can’t co-mingle together.”

Some approaches are more sophisticated than others, but what we’re doing is segmenting this network. And this is the buzzword du jour in this category. You’ll hear everything from macro segmentation to micro, nano—there’s now pico segmentation. God knows what it all is, right? But the core logical idea is the same. I have one shared, flat physical network; how do I split it into smaller chunks, smaller segments? Whether we call them femto segments or macro segments, it’s logically all the same idea, right?

And the goal of having these segments is to restrict the traffic flowing between them, to reduce our blast radius. But historically, these segments tend to be pretty large—pretty coarse-grained. When we talk about a VLAN, you might still have hundreds or thousands of services on each VLAN.

The problems with the traditional model

So what’s the problem? This seems great, seems simple enough. Why do we need to change anything? The first problem comes from the threat model that we use when we think about this network security posture. The assumption is that this is the situation we’re preventing: We have our castle that we’ve built our four walls around, everything’s coming over our drawbridge, and we kind of know what’s coming in and out. We want to prevent a bad guy, who’s on the outside of our network, from getting access to our network, right? That attacker should not be able to come in and access, let’s say, Service A.

The problem is, in the majority of cases, the attacker is not on the outside. The attacker is an employee. They’re a contractor. They’re a subcontractor. They’re someone we’ve given access credentials to, and they have VPN credentials, and it turns out they’re on the inside of the network. It’s very rarely someone who’s totally, utterly on the outside of the network. Maybe it’s someone who’s not explicitly an attacker. They’re an employee who has good intent but bad judgment on what attachments to open in their email. They’re not directly necessarily the attacker, but effectively, they’ve allowed an attacker to sit now on the private network, by opening something they shouldn’t have.

Part of the challenge of this assumption of, “Let’s build this really tall castle wall and then perfectly filter everything,” is the moment an attacker’s on the inside, it’s bust. The whole system is bust, right? Because the assumption was based on, “I trust traffic on the inside.” So this is a serious problem.

The second serious problem is that, logically, there’s one front door, but practically, there’s very rarely just one. Yes, there’s the true ingress/egress where our customers come in and out and reach our web servers, but there’s also the little side VPN connecting the corporate network, and our office, and one of our partners.

In reality, there’s never one front door; there are many front doors. And this starts to get us into the situation where we’re not securing a single castle. We’re starting to think about securing many different interconnected castles. Each of these might be a data center, it might be a physical office, it might be a partner’s network, that we don’t necessarily have insight into what’s running there. Could be a mobile device that’s VPN’d in. So you start to get this sprawl, where all of a sudden you’re getting into this mode where you’re like, “There are a lot of different front doors that we need to make sure are locked all the time.”

These things might be peered through different arrangements. This could be VPN, this could be dark fiber, this could be MPLS, right? There are many different approaches. But effectively, what we are saying is we have this split between trusted interior network and untrusted exterior network, and we’re going to stitch all of our trusted zones together.

So what happens if I’m able to get onto any of it? As an attacker, my goal now becomes, realistically, to get to any of these front doors. It doesn’t matter, right? And practically speaking, I don’t necessarily need to go through and attack the firewall, or find a way to circumvent the rules. All I need to find is anything that’s web-facing that has a zero-day vulnerability, or anything with an out-of-date library. Or a dependency that hasn’t been patched. Some way that I can pivot onto this network. Or I send you a phishing email and get someone on the inside to open it for me.

The moment I can get onto that network, now it’s sort of a free-for-all. I can pivot from there and get to everything. And if this all sounds relatively hypothetical, an example I really like to use is the anecdote of the Target breach. This is only maybe 18, 24 months ago. And it was widely reported that Target’s customers, their credit card data was stolen through an HVAC system, right? And when you see that headline, you’re like, “How is that possible? How could an HVAC system lead to a compromise of customer credit card data?”

But it all comes back to the assumption and the architecture of these networks. And what you see is that the HVAC system is connected with Wi-Fi to the store’s network. And the store’s network is connected with a VPN backhaul to the corporate network, and the corporate network is connected to the production database. As an attacker, what they were able to do is wardrive against a weak Wi-Fi encryption protocol from the parking lot. They didn’t have to walk into the store.

From the parking lot, they could break the Wi-Fi that connected the HVAC, and once they were able to get onto that Wi-Fi network, they were able to pivot—multiple hops. If we imagine, in some sense, the store might be the left, corporate network might be the middle, production might be the right. They’re able to pivot network to network—because these things are all trusted zones that talk to one another—until they could get to the database. And from there, they were able to just exfil all the data, sitting in a parking lot, thousands of miles away from the production database.

This is a super, super relevant type of attack—it’s the same thing that happened to Neiman Marcus, and it’s the Google Aurora attack, right? The list goes on, and it all comes back to this assumption that our network is secure—and it turns out that’s a bad assumption.

What does this all point to? There’s a fundamental weakness in this security model, right? The perimeter-oriented approach has a few weaknesses. The first big one is that the insider threat is a total omission from the security model. It’s just not considered at all. The model does not consider what happens if you give credentials to bad guys who now work for you—they’re inside your network.

The second problem is you end up with so many entry points that it’s hard to secure. You have lots of firewall rules. There’s a lot of process around managing all this stuff. It becomes difficult, over time, to administer all of this. And then this is all getting more challenging as we talk about cloud. Everything is an API away from being an entry point into our network. And I think the biggest indictment of it is this all-or-nothing model. Unlike everything else in security, where it’s a defense in depth, the perimeter becomes all or nothing. Once I’m in, I’m in. Game over.

New approaches have problems too

I know that sounds like a lot of doom and gloom. Everything is sad and hopeless, right? How do we move forward from here? How do we learn, in some sense, to trust again? I think that’s the motivating question. If we look inside a data center, it looks something like this: You have dozens, hundreds, thousands of unique services, all talking to one another in a complex east-west traffic flow.

The problem we have today is that this zone is a flat network. As an attacker, if I can get to any of these services, I can very likely pivot to any of the other services. That’s the crux of the problem. If we stepped back and said, “What’s the platonic ideal? What would be the perfect world?” Because in some sense, there’s no world in which none of our applications will ever have a zero-day or an out-of-date library or whatnot, right? That’s just reality.

If we accept that and say, “What’s the best thing we could have from a network perspective?” we’d want it to be so fine-grained that only the services that need to talk to each other can talk to each other. If Service A needs to talk to Service B, great, it should be allowed to. If C and D need to talk to each other, they should be allowed to. And we restrict the zone, the box, as small as possible. If A is compromised, the only thing an attacker could get to is B.

The problem in reality is that very rarely are there these clean cuts. A also needs to talk to C, B also needs to talk to D. So what do we do about this? In practice, what we tend to do is zoom out. We create a larger zone that includes A, B, C, and D, and that creates the problem. Because now, all of a sudden, if I break B, I can talk to C, even though there’s no real reason B and C should be allowed to talk to each other. It’s because I’ve zoomed out and created this coarser-grained network.

How do we stick to the platonic ideal of keeping these segments as fine-grained as possible, without zooming out and creating these large, coarse-grained networks once again?

When we talk about reasserting this trust in our network, I think there are three philosophical camps. There are many different technical approaches to this problem, but I think the common ones are software-defined networking, the software-defined firewall, and the BeyondCorp zero-trust model. If you hear “BeyondCorp,” it’s synonymous in some sense with zero trust—a logically similar approach.

The SDN approach

When we talk about software-defined networks, the approach says, “Let’s start with a base physical network that we don’t trust.” We have this giant flat network, we’ve interconnected all of our data centers, but we don’t trust it anymore. So what are we going to do instead? We’ll create these much smaller networks that are virtualized, that we do trust.

We won’t put everyone on the virtual network. We’ll just put A, B, C, and D on our virtual network because they need to talk to each other. But our other 50 services don’t get to participate. Now, if you break one of those services in this virtual network, you can only get to the other ones inside the same virtual network. The challenge in this approach—and I think you’ll naturally get some bias from my view; I think one of these approaches makes the most sense, hence the title—is that SDNs are exceptionally hard to deploy, exceptionally hard to operate, and exceptionally hard to debug.

They tend to require packet encapsulation, there’s generally a performance penalty associated with them, their administration is complicated, and they invert, in some sense, what you want from your network. A typical network’s promise is that the bits might get there—that’s what made it possible to build networks. Now you instead need a highly reliable, highly scalable control plane; otherwise the bits definitely don’t get there. It brings a hard operational challenge with it. Not to say it’s impossible. But it’s one approach.

The SDF approach

A different approach is a software-defined firewall, which says, “OK, let’s keep our flat, physical, untrusted network. We don’t trust it anymore.” But instead of saying, “Just because you’re on the inside of this thing, all traffic is implicitly allowed,” we’re going to push the firewall to the edges. Every node is a firewall. Every node runs its own firewall.

And this way we can run rules that say, “Our web server is allowed to talk to our database.” So not only do you have to be on the internal network, but only the specific machines that are web servers are allowed to reach the database. In some sense this is a much nicer approach than SDN, because we keep our flat network: It’s easier to operate, it’s easier to debug, our normal tools work, and we’re still talking about the same physical IPs. It’s just pushing a little bit of firewall logic to the edge. Most of this turns into a central set of rules that says, “Web server can talk to database,” which is templated into iptables at the edge. A new web server shows up, you re-render the iptables rules at the edge and say, “OK, that IP is also allowed to talk to it.”
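As a rough sketch, the rendered rules on a database host might look something like this (the addresses and port here are hypothetical):

```shell
# Rendered from the central rule "web servers may talk to the database."
# One ACCEPT line per web server IP; everything else is dropped.
iptables -A INPUT -p tcp --dport 5432 -s 10.0.1.11 -j ACCEPT   # web-1
iptables -A INPUT -p tcp --dport 5432 -s 10.0.1.12 -j ACCEPT   # web-2
iptables -A INPUT -p tcp --dport 5432 -j DROP                  # everyone else
```

Add a 51st web server, and this template has to be re-rendered with one more ACCEPT line on every database host.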

The problem with this is that what we have done is tie identity to the IP. We’ve said, “OK, web server can talk to database. Therefore, if you are a web server, this IP can talk to this IP.” So we’re doing a translation from logical service, to IP, to rule. Now the problem becomes, What if there are multiple things running at the same IP? And this can show up in two different ways.

One way it can show up is when you’re using a scheduler: Nomad, Kubernetes, Mesos, Docker, ECS—pick your favorite. They’re going to put multiple applications on the same machine. Now, all of a sudden, for any traffic from that machine, you don’t know which service it’s originating from. Is it the web server talking to you, the API, or some other service that’s running on that machine? All of them have the same source IP when you see the traffic over the network. So you have this problem of attribution.

The other challenge is, What if there’s anything in the middle of the network that’s rewriting IPs? It could be a VPN, a load balancer, a NAT, a gateway. Any of these devices rewrites the IP on the way out, and now how do you think about what your source is? If I have two data centers talking through a VPN, all my traffic looks like it’s coming from the VPN. So do I just blanket-allow anything coming from the VPN? In which case I have just created a giant flat network once again.

Implementing zero-trust

That’s the challenge: An IP is not a very stable unit of identity. Zero trust is the logical extension of this, all the way to the edge. It says, “Well, how do we get a stable notion of identity?” Again, we start with an untrusted physical network. We don’t trust it; we assume an attacker is going to get on it. Then we say, “How do we impose identity at the edge?” We still want to do filtering at the edge, much like we did with the firewall, but we want a real cryptographic identity, not just an IP, which is this fungible thing.

The challenge here—and there is no panacea with any of these solutions—is that we have to have some way of assigning identity. There has to be some way we distribute identity to our applications. Typically, this turns into a certificate distribution problem. And then we need some way of enforcing those access controls—who’s allowed to talk to whom—in a way that is not a firewall. Because a firewall thinks in IPs and ports; here we’re starting to talk about a much more logical service identity.

This gets into the question of, “OK, how do we make this a little more concrete? What does it look like when we implement zero trust?” Let’s go all the way down and make this very, very simple, with two services: a web service and a database. It might be that they’re in the same physical data center, it might be that they’re both in the cloud, it might be that our web server is on-premises and our database is in the cloud. Who knows? It doesn’t matter for the purposes of this example. These things exist somewhere on a network and they can talk to each other.

We distribute to both of them a pair of certificates: one that says web.foo.com, one that says database.foo.com. What we’ve encoded in the certificate is a notion of service identity—the identity of the web service is encoded as the prefix, same with the database. Now, when these two systems want to communicate, they create a mutual TLS connection. Anytime you go to your bank’s website, bankofamerica.com, you’re doing a one-way TLS connection. You’re verifying the identity of the bank. You’re saying, “Give me a certificate that says bankofamerica.com. I don’t care which IP I’ve reached.” Surely they have thousands of servers and thousands of different IPs. All I care about is that the certificate says bankofamerica.

In their case, they don’t care what your identity is. You could be on your phone, on your laptop, or an ATM connecting back to their server. They’re doing one-way verification. The difference with mutual TLS is that the other side verifies your identity too. In this case, the web server connects to the database and does the one-way verification of, “Are you a database?” But the database also verifies, “Are you in fact a web server? You must give me a certificate that proves you are a web server.” That’s what makes this mutual, as opposed to standard one-directional TLS.

Once they’ve done this verification and said, “We both have valid certificates, both signed by a certificate authority,” what they’ve done is establish an encrypted channel. Anything going over the network between them is now encrypted; we’ve established a shared encryption key. But that’s only half the problem. What we’ve done so far is assert identity. We know one side is a web server, and we know the other side is a database. What we don’t know is, Are web servers allowed to talk to databases? We need to be able to look aside at some set of rules, some policy, that says, “OK, great, I’m a database, the other side is a web server. Are these things allowed to talk to each other, yes or no?” This is the difference between authentication—“Do we know the identity of our caller?”—and authorization—“I know your identity, but are you allowed to do this action?” We need some way to authorize that.
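To make that concrete, here’s a minimal Go sketch of the database side of such a mutual TLS setup. The file names and port are hypothetical; the key line is ClientAuth: tls.RequireAndVerifyClientCert, which is what forces the caller to present a certificate signed by the shared CA:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
)

func main() {
	// The database's own identity certificate (database.foo.com).
	cert, err := tls.LoadX509KeyPair("database.crt", "database.key")
	if err != nil {
		log.Fatal(err)
	}

	// The CA that signed every service certificate.
	caPEM, err := os.ReadFile("ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cfg := &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		// This is what makes the TLS *mutual*: the caller must present
		// a certificate signed by our CA, or the handshake fails.
		ClientAuth: tls.RequireAndVerifyClientCert,
	}

	ln, err := tls.Listen("tcp", ":5432", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// Authentication succeeded: the caller's identity comes from its
		// verified certificate. Authorization (may a web server talk to a
		// database?) is a separate policy check against that identity.
		conn.Close()
	}
}
```

The authorization half—“is web.foo.com allowed to talk to me?”—would inspect the verified peer certificate and consult a central set of rules, which is exactly the piece discussed next.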

Logically, this is the idea behind zero trust: How do we encode and provide identity in some stable way—in the form of something like a certificate—and push it to the edge for enforcement? What we get out of this is that it doesn’t matter what’s in the middle of that network. It doesn’t matter if an attacker is sitting on that network and can listen to the traffic, or if some random app has gotten onto my private network. Unless you have a valid identity and you are authorized to talk to the database, you can’t get to the database. It’s very unlikely that the HVAC system is authorized to talk to the database. I would hope not, at least.

Zero-trust with Consul

How do we take this from slides into actual implementation? This, for us, is the focus of our Consul tool, and what we’ve been working on more recently is turning Consul into a service mesh. When we talk about the challenge of operating a service-oriented architecture, or a microservices architecture, I think there are three big challenges. One of them is, How do we wire all these things together? I’m going to have a mishmash of technology, right? And the joke I like to use is that any old enough company is like a giant tree: Cut open any bank and you have the mainframe sitting at the very middle, and then you’ve got the bare metal, the VMs.

You have containers. You are going to have serverless at the edge. Every ring of technology still exists. How do you glue all of these things together and allow them to operate? Forever and ever, the common denominator has been the network. What lets the serverless function talk to the mainframe? They both speak TCP/IP.

I think the challenge is, How do we get all of these things to talk to each other, discover each other, and compose their functionality? That’s one challenge. Another challenge is, How do you configure all of these systems? We’ve gone from a relatively monolithic world, where we had a handful of large monolithic services, each with its properties in an XML file.

So now we have a distributed configuration mess: 500 services with 500 different configuration files. How do you configure things in a logical way? How do we invert that and have central configuration again? And then the third problem is the segmentation challenge. How do we go from these giant flat networks, where everyone implicitly trusts everyone, to a zero-trust network where implicitly no one trusts anyone? Everything is explicitly authorized.

For those who aren’t familiar with Consul, it’s one of our older tools. It launched in 2014, and it does over a million downloads a month. We have individual customers running north of 50,000 agents, and we just passed a fun benchmark internally, with a single customer running more than 35,000 agents of Consul in a single data center. And for the folks who’ve talked about it publicly, what I want to highlight is that it’s the full gamut of people. Don’t think about this as, “This is a problem I have if I’m born cloud-native,” or, “This is a problem I have if I’m born in the private data center and have no cloud infrastructure.” It touches everyone who has a network—private, public, or a hybrid of the two—because the moment you have multiple things sitting on a network and you trust that network, you have this problem.

The first big challenge in wiring together all these services is, How do they just talk to each other? How do they find each other? Our view is that you need a central registry. Anytime a service or a machine comes up, it gets populated into that registry. The registry knows: What are all the machines? What are all the services that are running? What’s their current health status? And it exposes that through a number of different APIs so that you can consume it. Maybe DNS—what’s nice about DNS is that pretty much every application speaks it. Instead of putting in an IP address, you put in a DNS name and it magically resolves to an IP. The HTTP interfaces let you build richer automation and tooling. And web UIs are nice for human operators. If I’m a human operator, I might log in and see something like this: a snapshot of, What are the nodes in my cluster? What are their services? What’s their current health status?
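As a sketch, assuming a local Consul agent on its default ports and a registered service named web, those two interfaces look something like this:

```shell
# DNS interface: resolve healthy instances of the "web" service.
dig @127.0.0.1 -p 8600 web.service.consul

# HTTP interface: list healthy instances, with node and port details.
curl http://127.0.0.1:8500/v1/health/service/web?passing
```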

We’re not going to spend a lot of time on service discovery.

Service configuration

The idea is basically: Instead of having the source of truth be the configuration files at the edge—where every app has its own config file and I have 50 or 500 different sources of truth—how do I invert that? I want a central source of truth, and the config files at the edge are just fragments of it.

The 50 keys that are necessary for my app get pulled from a central repository and fed into the application, but the source of truth is a centralized system, not a distributed one, which makes these systems a little easier to manage. It also makes it easier when you want to do things like, “I’m going to flip an app into maintenance mode.” I don’t have to modify 50 configs and redeploy 50 apps. I just change a single central value, push it out in real time, and reconfigure the systems at the edge. That helps as we build more dynamic systems with feature flags, or maintenance modes, or other things we want to tune dynamically at runtime.
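With Consul’s key/value store, for example, that maintenance-mode flip could look something like this (the key name is hypothetical; assumes a local agent on the default port):

```shell
# Flip the central flag once...
consul kv put app/maintenance true

# ...and every edge node can read (or watch) it in real time.
curl http://127.0.0.1:8500/v1/kv/app/maintenance?raw
```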

The fun part of this, where it gets into service segmentation, is Consul Connect. Consul Connect was our approach to, “How do we make zero trust practical?” Because there are a few key challenges when we talk about service trust. Like I said, each of these systems—SDN, software-defined firewall, zero trust—has its own challenges, and for zero trust I think there are these three problems.

One is, How do we manage a set of rules around who can talk to what—that authorization problem we talked about? Let’s say a web server connects to me as a database: Is that allowed, yes or no? We need some rule language for that. The second problem is, How do we think about certificate distribution? This is a notoriously terrible problem. And the third one is, How do you integrate this into the actual data path? We can’t rely on something like iptables, because it’s looking at the IP and port layer.

When we talk about the service access graph, it’s about having a manageable set of rules around who can talk to whom. In Consul’s view, this is logical service to logical service: Web server can talk to database. What’s super important about that unit of management is that it’s scale-independent, unlike an IP. What I mean is, if I say my web servers can talk to my databases, but what I’m actually enforcing is a set of IP controls, then with 50 web servers and 5 databases I have 250 firewall rules—you’re just multiplying them out. And when I add my 51st web server, I need five more firewall rules, right? There’s a scale-dependence to the IP. It’s not a great unit as a result.

If my web server moves around, I’m changing firewall rules—it’s a bad unit. Versus: If I can say my logical rule is, “Web server can reach the database,” it doesn’t really matter whether I have 1, 10, or 1,000 web servers. The rule is the same rule. There are many ways you might want to manage this: through a CLI, an API, a UI, or infrastructure as code with something like Terraform. It doesn’t really matter; it depends on your preference. Here’s an example of managing it with the CLI (see the sketch below). For Consul, you might have a relatively low-precedence rule that says, “My web server to anything is deny”—a blanket deny rule—but a higher-priority rule that says “web server to database” is allowed. You can create a hierarchy of rules, just like you would with a firewall.
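A minimal sketch of that hierarchy using Consul intentions (the service names here are illustrative):

```shell
# Blanket, low-precedence rule: deny web talking to anything...
consul intention create -deny web '*'

# ...and a more specific, higher-precedence rule: allow web -> db.
consul intention create -allow web db
```

The exact-match rule takes precedence over the wildcard, giving the deny-by-default, allow-by-exception behavior described above.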

You might manage the exact same set of rules with a web UI. This tends to be more comfortable for firewall administrators. It all works out to the same thing, which is, Do I have a central set of rules that’s governing who can do what? Then you get to the real fun part, the real meat of the challenge, which is, How do we do certificate distribution? And in practice, most organizations have some ways of signing certs, but I think the joke is the average expiration time of a certificate is always longer than the average tenure of an employee. And that’s for a very specific reason, which is, you don’t want to be there when it expires. I’m sure all of us have seen the 10-year-long-lived cert, right?

This points to the challenge: Why are we doing that? It’s not because it makes sense to have 10-year certs, right? Nobody can revoke them or manage them. We’re doing it because distribution and generation are a nightmare, so we minimize how often we have to do it. How do we solve that problem so that we don’t need these 10-year certificates? This is where we pivot the focus for Consul. What we need to do is issue a set of TLS certificates, and these are doing two things for us: They give the applications a strong cryptographic identity, and they allow them to encrypt traffic. The hard part is that middle zone—how do you get those certs there? That’s our focus.

The moment a web server shows up at the edge—it gets scheduled on a machine, or you deploy it—the Consul clients generate a new public-private key pair for it. They generate the key pair in memory, just random, high-entropy keys, but it’s not signed yet. What they then do is generate a certificate-signing request, which basically ships the public-key portion off of the machine and says, “Here’s the public key that says, ‘I’m a web server.’ Please have the certificate authority sign this.”

That flows to the Consul servers, which are then responsible for saying, “Are you a web server? Should you be allowed to have this certificate, yes or no?” And if so, they flow it out to be signed. They can sign it themselves, if you want Consul to be your certificate authority, or they can flow it out to a hardware device, or Venafi, or an external certificate authority like Vault. It goes to whatever your authority is, gets signed there, and then flows back to the client.

The key is that we’re not issuing certificates that are ever valid for more than, let’s call it, 72 hours. We’re not generating 10-year certificates; we’re generating three-day certificates. And the clients then become responsible for owning that lifecycle. As we get close to expiration, we generate a new certificate-signing request, float it out, get it re-signed, and bring it back to the client—constantly managing that process of generation, signing, and rotation. So we have short-lived keys, not these decade-long-lived keys that are painful to manage.

As for the format, the certificates are basically standard X.509—what you’d get if you went to bankofamerica.com. There is a slight twist to them, which is what’s called SPIFFE. For folks who aren’t aware of it, that’s a project through the CNCF to have a common way of talking about the identity of services. If you’re using another system that’s SPIFFE-aware—an API gateway, Istio, Conduit, whatever—these things can interoperate and talk about identity in a consistent way. That’s the goal. But effectively, the format is not so different from what your browser already uses.
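Here’s a rough Go sketch of the client-side step described above: generate a key pair in memory and a CSR that carries the service identity as a SPIFFE URI. The exact SPIFFE path is illustrative, not Consul’s literal format:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"encoding/pem"
	"log"
	"net/url"
	"os"
)

func main() {
	// The private key is generated in memory and never leaves the node;
	// only the CSR is shipped off to be signed.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}

	// Service identity encoded as a SPIFFE URI SAN (illustrative path).
	spiffeID, err := url.Parse("spiffe://example.consul/ns/default/dc/dc1/svc/web")
	if err != nil {
		log.Fatal(err)
	}

	csrDER, err := x509.CreateCertificateRequest(rand.Reader, &x509.CertificateRequest{
		URIs: []*url.URL{spiffeID},
	}, key)
	if err != nil {
		log.Fatal(err)
	}

	// PEM-encoded CSR, ready to send to the certificate authority.
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE REQUEST", Bytes: csrDER})
}
```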

Enforcing the rules

Then the final challenge is: Great, we have a central set of rules about who can talk to whom, and we’ve distributed certificates with identity to the edge. How do you enforce any of it? This becomes the application-integration question, and there are only two answers. Either the app is aware of what’s happening, or the app is unaware of what’s happening. If the app is unaware, we use transparent proxies: We run something alongside the application that uses the certificates, ensures the rules are enforced, and proxies traffic back to the application.

This has minimal performance overhead, gives us operational flexibility, and doesn’t require changing the applications. It would look something like this: You push the control—the metadata—to these proxies, which are configured with the certificates and rules and all that good stuff. The applications talk to the proxy, which transparently talks over the network to the other side. The apps are unmodified; they’re not aware of what’s happening.

So the app talks to its proxy, the proxies talk to each other, and the far proxy terminates on behalf of the other side. This could be whatever proxy you want, as long as it handles certificates and rule authorization correctly. It could be a system like Envoy; we ship a built-in proxy; it could be HAProxy or NGINX—pick your favorite proxy.

This is an example of configuring a proxy. Here, what we’re doing is telling the proxy, “When the local application talks to port 1234, the upstream is Redis.” From the perspective of the app, all it’s doing is connecting to localhost:1234. The proxy then connects upstream to Redis, wraps that in TLS, and applies all the rules and authorizations, and the app just thinks it’s talking to Redis like normal. What’s nice is that this gives us a development workflow as well—I can treat it almost like SSH, a reverse-tunnel kind of thing. I can run a local proxy that’s just doing a reverse tunnel back to an end application, and then use the normal tools, like psql, that I’m used to.
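That developer workflow might look something like this (the service names and port are hypothetical):

```shell
# Run a local proxy, authenticated as "operator-mike", that exposes the
# remote "postgres" service on localhost:5432 over mutual TLS.
consul connect proxy -service operator-mike -upstream postgres:5432

# In another terminal, use normal tooling against the tunnel.
psql -h 127.0.0.1 -p 5432 mydb
```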

The only other approach is native integration—the app can be aware of what’s happening. The nice thing about zero trust is that it’s basically standard TLS. We’re not doing a whole lot on top of plain mutual TLS. The SDKs are basically just imposing mutual TLS as normal and then calling back and saying, “I’m a database; a web server wants to talk to me. Allow this, yes or no?” We need a callback against that central rule set, and that’s about it. Here’s an example of doing that integration in Go.

It effectively adds a four-line prefix, where we set up the SDK to talk to Consul, and then we make the single-line change (shown in red on the slide) to source our TLS config. What this does is read the public-private key pair dynamically and do a callback when a connection comes in to verify, “Yes or no—should this be allowed?” But otherwise, it’s standard TLS.
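As a rough reconstruction of what that looks like with Consul’s Go SDK (the service name and port here are illustrative):

```go
package main

import (
	"log"
	"net/http"

	"github.com/hashicorp/consul/api"
	"github.com/hashicorp/consul/connect"
)

func main() {
	// The "four-line prefix": set up a Consul client and a Connect service.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	svc, err := connect.NewService("database", client)
	if err != nil {
		log.Fatal(err)
	}
	defer svc.Close()

	// The one-line change: source the TLS config from Consul. It serves the
	// short-lived, auto-rotated certificates and authorizes each caller
	// against the central intentions on every incoming connection.
	server := &http.Server{
		Addr:      ":8443",
		TLSConfig: svc.ServerTLSConfig(),
	}
	log.Fatal(server.ListenAndServeTLS("", ""))
}
```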

Stepping back, I think what we try to focus on with Connect is, What are the three big workflow challenges if you’re trying to adopt a zero-trust network? You need some way of talking about rules of who can talk to who; you need to solve the certificate last-mile distribution problem; and then you need to sit in the data path—you have to enforce who can talk to who. Otherwise, all the rules in the world don’t mean anything.

Wrapping up, briefly, when we go back and talk about the challenges of traditional networking, there are three that are super key. One is, we don’t consider inside threats typically. This is a glaring omission. Whether it’s Snowden or pick your favorite, the insider threat is real. The second big problem is, there are too many entry points to secure realistically, or at least too many to say we’re going to perfectly, 100% of the time, secure. This gets exacerbated even as we think about cloud.

The last part is, it runs against everything else in security, which is layered defense. It’s fundamentally an all-or-nothing approach; once you’re inside the perimeter, kiss it goodbye. So the zero-trust view of this is to say, “Being on my network or having a specific IP doesn’t grant anything.” We, again, have zero trust of the network itself.

And instead push the controls to be identity-based. That identity could be a user—the Google-BeyondCorp model—it could be a service. But what we’re saying is, “The identity is what matters, because it’s a more stable unit, not the IP.” And an approach that uses mutual TLS is very much like what we do on the public internet. Our goal with Consul is very much to try and solve that as part of the integrated lens of thinking about, “What are all the problems as we operate service-oriented infrastructure?” But there are many solutions to this problem.

That’s all I had. If you guys have any questions, I’ll post the slides later. Feel free to reach out to me. Thanks so much. Thank you.
