Whiteboard: Network Automation w/ Consul and Terraform

Armon Dadgar outlines how you can extend infrastructure as code automation to network as code by having Consul trigger Terraform runs that update networking middleware whenever services change.

»Transcript

Hello, and welcome. Today, I want to spend a little time talking about network automation using HashiCorp Consul and Terraform. But to start, I think it's useful to talk about what a traditional environment looks like — in terms of that end-to-end application workflow — and how that typically requires manual intervention at the network layer.

Usually, it starts with an application. That application is interfacing with and using, let's say, a database. But of course, there's an underlying network that needs to support this.

Typically, in between our application and our database, we might put a firewall, and that's being used to govern access. This might impose a rule that says IP1 is allowed to talk to IP2. And then, similarly, in front of our application, we might have a load balancer that's allowing us to spread traffic across multiple instances, manage failovers, etc. This might be a relatively classic pattern. 

»Still Relying On Manual Network Interfaces?

The challenge we typically see is that as this application team continues to iterate — push a new version, scale up and down — they have to interface with the network in a relatively manual way.

They might deploy a new instance of it. But now, typically, they have to file a ticket against the networking teams to update the load balancer and add this new instance to the backend, as well as to update the firewall and add a new rule saying, for example, that IP3 can also talk to IP2.

In this classic networking scenario, the application team is not empowered to self-serve, deploying their application and managing that lifecycle. Yes, they might be able to deploy it onto a platform such as Kubernetes or a cloud provider in an automated way. But then they're stuck filing a ticket and waiting potentially days or weeks for the underlying network to be updated.

»Where Does Consul Fit Into This Story?

Well, at its core, Consul acts as a service discovery mechanism: when we deploy Consul, all our different applications register with it. This gives us a bird's eye view, or a universal catalog, of all of the services running in our environment.

At the most basic layer, instead of hardcoding the IP address of the database, the application might query Consul to ask, "Where's my database running?" It then uses the answer to communicate with the database.
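As a rough illustration of that lookup, here is a minimal Python sketch that asks a local Consul agent for healthy instances of a service through Consul's `/v1/health/service/<name>` HTTP endpoint. It assumes an agent listening on the default `127.0.0.1:8500` address; the function names and the `database` service name used below are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote


def healthy_addresses(entries):
    """Extract (address, port) pairs from a /v1/health/service response.

    When a service registers without its own address, Consul returns an empty
    Service.Address, and the node's address should be used instead.
    """
    instances = []
    for entry in entries:
        service = entry["Service"]
        address = service.get("Address") or entry["Node"]["Address"]
        instances.append((address, service["Port"]))
    return instances


def lookup(service_name, consul_url="http://127.0.0.1:8500"):
    """Ask Consul for instances of service_name whose health checks all pass."""
    url = f"{consul_url}/v1/health/service/{quote(service_name)}?passing"
    with urllib.request.urlopen(url) as response:
        return healthy_addresses(json.load(response))
```

Against a running agent, `lookup("database")` returns a list of `(address, port)` tuples, and the application connects to one of them instead of a hardcoded IP.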

This starts to avoid the fragility of hardcoded IP addresses. It allows us to scale things up and down and manage failures more reliably. Overall, it enables a microservice pattern where we don't have a bunch of hardcoded IPs and load balancers everywhere in the environment.

But in this case, we still have a firewall and a load balancer in between. These might be hardware devices or software appliances, and they have a set of rules that need to be managed. The load balancer needs to know about the backends, and the firewall needs to know which IPs to grant access. So there's a bit of a disconnect here.

»How Can Terraform Help?

This is where Terraform comes into the picture. The question becomes: "Should we be managing the underlying set of IP addresses for the firewall, or the underlying set of IP addresses for the load balancer? Or is what we care about a higher-level policy that says this application is allowed to talk to this database?"

That's what we care about. That's what we should manage. The IPs are a detail. Consul knows what the IPs are, and they're coming and going; we don't want to manage them by hand all the time. This is where we use Terraform to author an infrastructure as code definition of how to configure this underlying device.
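To make the distinction concrete, the Python sketch below keeps the policy at the service level ("app may talk to db") and derives the IP-level firewall rules from a catalog of current addresses. The service names, the rule dictionary shape, and the `expand_policy` helper are hypothetical; a real firewall provider defines its own rule schema.

```python
from itertools import product


def expand_policy(policies, catalog):
    """Expand service-level allow policies into IP-level firewall rules.

    policies: list of (source_service, destination_service) pairs.
    catalog:  dict mapping service name -> list of current IP addresses,
              as a service catalog such as Consul would report them.
    """
    rules = []
    for source, destination in policies:
        pairs = product(catalog.get(source, []), catalog.get(destination, []))
        for src_ip, dst_ip in pairs:
            rules.append({"action": "allow", "source": src_ip, "destination": dst_ip})
    return rules


# The policy never mentions IPs; only the catalog does.
policies = [("app", "db")]
catalog = {"app": ["10.0.0.1", "10.0.0.3"], "db": ["10.0.0.2"]}
print(expand_policy(policies, catalog))
```

When another app instance registers, only the catalog changes; the policy stays the same, and re-expanding it yields the extra rule.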

This might be a Palo Alto firewall. Over here, for the load balancer, we might have a different configuration, again infrastructure as code with Terraform, where I'm managing my F5 BIG-IP device.

We've partnered with basically everyone in the networking universe, from Cisco to Juniper, Palo Alto, F5, Check Point, and more; we list everybody on our website. All of the core networking vendors have done the work of standardizing by building a Terraform provider that allows you to manage their underlying hardware device, software appliance, or SDN-type network fabric, so we can define the rules in Terraform for how to manage those.

What we don't want to do with Terraform is move the hardcoded IPs from the firewall's configuration into Terraform's configuration, with a whole list of hardcoded IPs in Terraform, so that every time we update the app, we have to modify the infrastructure code and manually do another apply.

Instead, those become variable inputs to our configuration. For the firewall, the inputs are the set of IPs for my app and the set of IPs for my database. For the load balancer, it might be only the set of IPs for the app. Either way, they are variables fed into this Terraform configuration. We're not going to hardcode them, because Consul has a real-time, dynamic view of what all those values are.
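One way to picture that variable input is to render the catalog into a JSON variables file that Terraform loads automatically (Terraform reads `terraform.tfvars.json` and `*.auto.tfvars.json` files in the working directory). This Python sketch assumes the Terraform configuration declares matching `list(string)` variables named `app_ips` and `db_ips`; those names and the helper functions are hypothetical.

```python
import json


def render_tfvars(catalog, services):
    """Build Terraform variable values from a service -> addresses catalog.

    Each service becomes a "<service>_ips" variable. Addresses are sorted so
    the rendered file stays identical when the set of addresses is unchanged.
    """
    return {f"{name}_ips": sorted(catalog.get(name, [])) for name in services}


def write_tfvars(path, variables):
    """Write the variables as JSON that Terraform can load as a var file."""
    with open(path, "w") as handle:
        json.dump(variables, handle, indent=2, sort_keys=True)
        handle.write("\n")


catalog = {"app": ["10.0.0.3", "10.0.0.1"], "db": ["10.0.0.2"]}
firewall_vars = render_tfvars(catalog, ["app", "db"])  # firewall needs both sets
lb_vars = render_tfvars(catalog, ["app"])              # load balancer needs only the app
```

Calling `write_tfvars("terraform.auto.tfvars.json", firewall_vars)` in the firewall configuration's directory means the next run picks up the new addresses without any edit to the `.tf` files.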

»Using Consul and Terraform Together

We enable this end-to-end by marrying Consul and Terraform together, so that when this application gets deployed, it registers with Consul. Consul then feeds its details to Terraform as a set of variables and automatically executes a Terraform run.

Now we don't need to manually manage this static set of IPs. Instead, we say, great, I've defined the policy that says my application is allowed to talk to my database. I don't care what their IPs are; when they change, the Palo Alto configuration is re-rendered.
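Here is a toy Python stand-in for that trigger, not a description of how Consul itself implements it. It fingerprints the rendered variables and starts a Terraform run only when the fingerprint changes. `terraform apply -auto-approve -var-file=...` is a real Terraform command line. The `sync` and `terraform_apply` helpers are hypothetical, and the apply step is passed in as a callback only to keep the sketch testable.

```python
import hashlib
import json
import subprocess


def fingerprint(variables):
    """Hash the variables in a canonical form so equal inputs hash equally."""
    canonical = json.dumps(variables, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def terraform_apply(workdir, var_file):
    """Run a non-interactive apply in workdir using the given var file."""
    subprocess.run(
        ["terraform", "apply", "-auto-approve", f"-var-file={var_file}"],
        cwd=workdir,
        check=True,
    )


def sync(variables, last_fingerprint, apply):
    """Call apply(variables) only if they differ from the last applied set.

    Returns the fingerprint of the variables now in effect. If apply raises,
    the exception propagates and the caller keeps its old fingerprint.
    """
    current = fingerprint(variables)
    if current != last_fingerprint:
        apply(variables)
    return current
```

In a loop driven by catalog changes, `apply` would write the var file and call `terraform_apply`. Repeated notifications carrying the same addresses then do not start a new run.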

Similarly, this load balancer backend should route traffic to this application. I don't care if there are 1, 10, or 50 copies of it, or what their IPs are; just add and remove the IPs as they come and go. Again, we can look at that and say, "That's a variable input." As the application changes, Consul will feed that in and execute Terraform automatically.
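For the load balancer, "add and remove the IPs as they come and go" boils down to a set difference between the current pool and the catalog. Terraform's plan computes this for you from the desired state. The Python helper below, whose name is hypothetical, only illustrates what that plan amounts to.

```python
def pool_changes(current_members, desired_members):
    """Return (to_add, to_remove) that make a pool match the catalog.

    Members are (ip, port) tuples; results are sorted for readable output.
    """
    current, desired = set(current_members), set(desired_members)
    return sorted(desired - current), sorted(current - desired)


pool = [("10.0.0.1", 8080), ("10.0.0.2", 8080)]
catalog = [("10.0.0.2", 8080), ("10.0.0.3", 8080)]
to_add, to_remove = pool_changes(pool, catalog)
```

Whether there is 1 instance or 50, the same comparison applies, which is why the pool membership can be a variable rather than a hand-maintained list.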

At its heart, this workflow enables us to achieve end-to-end network automation without the developers having to care. They can autoscale their application, deploy new versions, and do what they need to do. As long as the app registers with Consul, the downstream automation will trigger to update the load balancers, the firewalls, the API gateways, and the underlying network fabrics, without them filing a ticket or having to interface with any of it.

»Safe Automation

I think the challenging question then becomes: how do you do all of this safely? Obviously, I want to have a separation of concerns. I don't necessarily want my development teams to be able to update the configuration for my firewalls or my front-door load balancers.

I want to be able to have visibility on which application changes are taking place. And I want to be able to impose policy controls on what those changes can be, so it doesn't become the wild west. 

Historically, you solved this with a separation between your networking teams, who owned these devices, and your application teams, and they interfaced through tickets.

»Leveraging Terraform Enterprise or Terraform Cloud

This is where commercial Terraform, meaning Terraform Enterprise or Terraform Cloud, becomes valuable. Instead of each of these being a free-floating Terraform definition, Consul can integrate directly with Terraform Cloud or Terraform Enterprise.

For each of these definitions, one per device, we can now define a Terraform workspace. Each of these becomes a workspace within Terraform Cloud or Terraform Enterprise, and this lets us do a number of important things.

»Creating a Role-Based Access Control Scheme

One is that we can tie this into a role-based access control scheme. We might say, "Great, only my actual NetOps team is allowed to come in and modify the Terraform definition."

I want my networking team or my platform team to own the codified definition of this. My developers shouldn't be able to modify it. Maybe they have permission to come in and read it, so they can see how it's configured to debug something or understand the system, but they don't need permission to modify anything. Now we can put each of these into its own workspace and apply role-based access control around it.
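The separation described above can be modeled as a per-workspace permission table. This Python sketch is hypothetical: the team names, workspace names, and two-level READ/WRITE scheme are illustrative only. Terraform Cloud and Terraform Enterprise enforce their own team permissions on the server side.

```python
from enum import IntEnum


class Access(IntEnum):
    """Ordered access levels; a higher level includes the lower ones."""
    NONE = 0
    READ = 1
    WRITE = 2


# Hypothetical assignments: NetOps owns the network workspaces,
# application teams may only read them for debugging.
WORKSPACE_ACCESS = {
    "palo-alto-firewall": {"netops": Access.WRITE, "app-team": Access.READ},
    "f5-load-balancer": {"netops": Access.WRITE, "app-team": Access.READ},
}


def is_allowed(team, workspace, needed):
    """True if team holds at least the `needed` level on workspace."""
    granted = WORKSPACE_ACCESS.get(workspace, {}).get(team, Access.NONE)
    return granted >= needed
```

Ordering the levels means a WRITE grant also satisfies a READ check, so read access for developers and write access for NetOps fall out of one comparison.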

»Creating an Audit Trail 

The next important piece is that we now have an audit trail, as well as full visibility into all of these networking changes. Every time an app comes or goes, it registers with Consul. Consul detects the change, and if there's an appropriate downstream workspace that needs to be updated, Consul interfaces with Terraform and triggers a run of that workspace.

But now we can come in and see the history of those runs. What were the variables that were fed into it? When did a change take place? Maybe that correlates to an incident or a degradation, or maybe a change in network performance.
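Terraform Cloud and Terraform Enterprise keep this run history for you. The hypothetical Python sketch below only shows the kind of record involved and how it helps correlate changes with an incident; the `RunRecord` fields and the 30-minute default window are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class RunRecord:
    workspace: str           # which device's workspace ran
    started_at: datetime     # when the run began
    trigger: str             # e.g. the service whose catalog entry changed
    variables: dict = field(default_factory=dict)  # inputs fed to the run


def runs_before(history, incident_time, window=timedelta(minutes=30)):
    """Return runs that started within `window` before incident_time."""
    earliest = incident_time - window
    return [run for run in history if earliest <= run.started_at <= incident_time]
```

Filtering on start time is the simplest correlation; a real investigation would also compare the `variables` of consecutive runs to see exactly which addresses changed.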

We want visibility into what changes took place, when they took place, and what triggered them. By using Terraform Cloud or Terraform Enterprise, we get that history: the provenance of what took place and how.

»Adding Additional Governance 

Next, we can use the same policy as code frameworks to put additional governance on top of this. We might want an additional layer of policy checks that asks whether a change is valid before it goes through.

Much like we would apply policy as code to our normal infrastructure as code pipeline, in terms of how we're making changes to our infrastructure, we can impose those same policies on how we're applying changes to the network. This becomes key.
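In Terraform Cloud and Terraform Enterprise, such checks are written in a policy as code language rather than Python. The sketch below is only a stand-in showing the shape of two plausible network rules. Both rules and the 500-rule limit are hypothetical examples, not recommended policy.

```python
OPEN_SOURCES = {"0.0.0.0/0", "any"}


def check_firewall_rules(rules, max_rules=500):
    """Return policy violations for a proposed rule set (empty list = pass)."""
    violations = []
    for rule in rules:
        if rule["source"] in OPEN_SOURCES:
            violations.append(f"rule to {rule['destination']} allows any source")
    if len(rules) > max_rules:
        violations.append(f"{len(rules)} rules exceeds the limit of {max_rules}")
    return violations


def check_pool(members):
    """Reject a change that would leave a load balancer pool empty."""
    return [] if members else ["load balancer pool would have no members"]
```

Returning a list of violations rather than a bare boolean lets the pipeline block the run and also explain why.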

»Building on Top of the Ecosystem

And then lastly, we can build on top of this ecosystem. If we need to invoke hooks into other systems, we can use run tasks or webhooks, so that as different orchestration activities take place, we can notify other downstream systems: "Great, we did a change with Terraform. We updated this load balancer. We updated this firewall." We might send a webhook to our CMDB to notify it, or leave a notification for people in Slack. We can build on the ecosystem and API capabilities that Terraform Cloud and Terraform Enterprise support to hook this orchestration into a broader environment.
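As a minimal sketch of the notification side, the Python below formats a change summary and posts it to a Slack incoming webhook, which accepts a JSON body with a `text` field. You would supply the webhook URL, and the function names are hypothetical. In practice, Terraform Cloud and Terraform Enterprise can also send run notifications themselves.

```python
import json
import urllib.request


def change_summary(workspace, added, removed):
    """Describe a pool or rule change in one line of text."""
    parts = [f"Terraform updated {workspace}"]
    if added:
        parts.append("added " + ", ".join(added))
    if removed:
        parts.append("removed " + ", ".join(removed))
    return "; ".join(parts)


def post_to_slack(webhook_url, text):
    """POST {"text": ...} to a Slack incoming webhook; returns the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

The same `change_summary` text could be sent to a CMDB's API instead; only the transport function would change.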

This is how we start marrying these two together. Consul, at the base, provides that catalog — that real-time view — of which IPs belong to which services and how those are changing as applications come and go.

These applications might span many different environments. This could be a Kubernetes-based app or a VM-based app. Maybe it's running in an ECS Fargate type of environment, or on bare metal, etc.

We don't care where the applications are coming from. It's about creating that single global consistent catalog of all of it. That then enables us to connect it into things like Terraform to do this automation of the network layer. And then, of course, bringing in things like Terraform Cloud — Terraform Enterprise — allows us to do it in a safe way at an enterprise scale.
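In a real deployment, each workload registers with Consul directly, wherever it runs. The hypothetical Python sketch below only illustrates the end result the talk describes: registrations from several platforms collapse into one service-to-addresses view that downstream automation can consume without caring where each instance lives.

```python
from collections import defaultdict


def merge_catalogs(sources):
    """Merge per-platform registrations into one service -> addresses view.

    sources: dict mapping a platform name (e.g. "kubernetes", "vm") to a
    list of (service, address) registrations from that platform.
    """
    merged = defaultdict(set)
    for registrations in sources.values():
        for service, address in registrations:
            merged[service].add(address)
    return {service: sorted(addresses) for service, addresses in merged.items()}
```

Because the merged view is keyed only by service name, the firewall and load balancer configurations see one list of app addresses, whether those instances run on Kubernetes, VMs, or Fargate.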

Role-based access control, visibility, policy governance, and module reuse all matter here. Module reuse becomes another key piece: how do we reuse modules and definitions across multiple different devices?

If I have multiple F5s, I don't want each of them configured in a different way. I can leverage the private registry, reuse my Terraform modules, and do it consistently across the environment.

It's about doing this safely at scale but ultimately enabling that end-to-end automation of the network. Hopefully, this was helpful to learn a little bit more about how we think about end-to-end network automation and the power of bringing Terraform and Consul together. 

Thanks so much.
