Presentation

Divide and conquer: Splitting the challenges of migrating to cloud operating models

The HashiCorp toolchain can divide and conquer the challenges in deployment, provisioning, scaling, security, and networking that come with cloud operations and microservices.

For our updated view on the cloud operating model, read our latest white paper: Unlocking the Cloud Operating Model

Dave McJannet, the CEO of HashiCorp, is constantly exploring the challenges of software production and infrastructure with Global 2000 customers. Their core challenges are rarely unique: They want the ability to migrate applications to multiple cloud infrastructures, both public clouds and private, on-premises environments. They also want to break those applications into loosely coupled components, bringing them in line with the microservices trends they keep hearing about.

Gaining this architectural and infrastructural agility comes at a significant cost: more operational complexity. But there are new tools and paradigms that can help minimize that cost.

McJannet explains these relatively new approaches to cloud operations by categorizing the various challenges into four software engineering and operations domains:

  • Operations
  • Security
  • Networking
  • Development

One by one, McJannet explains how HashiCorp's tooling stack addresses each domain's challenges with an independent, modular solution.

For operations: Terraform lets you build complete or partial models of various infrastructure environments in a domain-specific configuration language. This results in templates that a team can vet, modify, automate, and reuse for faster, safer provisioning.

For security: Vault is a standalone engine that centralizes the management and distribution of API keys, passwords, certificates, and other secrets across your internal and external applications. Not only does Vault ease the administration of these secrets throughout your architecture with automated key generation, rotation, and encryption, it also brokers identity-based access by integrating with the identity providers you already trust.

For networking: Consul automatically connects, monitors, and configures all of the individual services in your application architecture. It also provides secure service-to-service communication and network segmentation.

For development: Nomad is a cluster scheduler that makes it simple for developers to deploy their polyglot applications as containers, virtual machines, or standalone binaries. Using Nomad, developers don't need to know how to efficiently deploy and manage their applications in the cloud. The scheduler takes care of efficient deployments and handles node failures in a way that makes the system self-healing.

By understanding and addressing the specific technical challenges of each engineering discipline in your organization, you can alleviate much of the new complexity that comes with cloud migration and operations.

Speakers

  • Dave McJannet
    Chief Executive Officer, HashiCorp

Transcript

My name's Dave McJannet, HashiCorp CEO. I get to spend a lot of my time with some of the largest companies in the world as they think through this infrastructure transition to cloud that they're all going through. Here I wanted to step back a minute and just share how we think about the market transition that's underway and how that then informs our product evolution.

I think the big picture of what we're seeing is that, at the infrastructure layer, the world is going through a transition of the kind we go through once every 20 years. We're going from a world where infrastructure predominantly runs on premises in a static environment, typically infrastructure that you own or are long-term leasing, to a world of running infrastructure that is much more dynamic in nature. And that's the fundamental shift that's happening. Today you might have some workloads running on your private cloud, but then you will have some running on Amazon, you might have some running on Azure, you may have some running on GCP, etc.

The shift from static to dynamic infrastructure

The core distinction is pretty profound as it relates to the operating model for infrastructure. Simply stated, the new world is extraordinarily dynamic, and therefore every organization we engage with has to think about how they are navigating this transition from running infrastructure the old way [static] to the new way [dynamic]. The easiest way to think about it is to decompose it into the core pieces. So let's just talk about how infrastructure is provisioned in these two different worlds.

The way I provision infrastructure here [in the static realm] is really predicated on a static set of servers that I own; it's basically a static model. In this world here [the dynamic realm], I don't just stand up 100,000 servers and leave them running on Amazon; it's much more on-demand. So I move into a world of provisioning infrastructure on demand.

At the security layer, the implications are pretty different. You're going from a world that is basically a high-trust environment with a clear network perimeter, where I can use the IP address as the basis for security, to a dynamic world where you really have to think about, 'wait a second, this is fundamentally a low-trust environment.' Therefore, I have to think about what else I can use, rather than IP, as the basis of security. So I move to a world of identity as the basis of security.

For the core networking, the challenge is again slightly different. In this world [static world], everything has a physical server host, so I can then have an IP address that's based on that specific server host. Fundamentally, it's host-based connectivity. Everything's predicated on a physical machine that's running over there. Well, in this new world here [dynamic world], there's no notion of a physical machine, and therefore I have to think of the world in services. So the world moves to one of service-based connectivity. So a database or an app server—where is it in this new world? And then use that as a basis for connectivity, recognizing that it's gonna move around.

And then for the application developer, well, I'm no longer deploying an application to a physical location, I'm deploying an application perhaps using something that's running across a distributed fleet. So basically I'm deploying an application through a fleet. At its core, this represents a completely different model for how to think about infrastructure relative to the world that we're all familiar with.

The challenges of dynamic, multi-cloud infrastructure

The way most of our customers think about it is to decompose the problem into four core personas in the IT organization: there are ops people, there are security people, there is the development function, and then there's essentially a networking group.

All four people have to figure out how to navigate this transition and that is the core challenge of cloud adoption at scale—recognizing that all four of them need to understand the implications of this new model. So the way that we then think about it is to say, 'well, wait a second, let's talk about the ops person. How has their world changed?' Their world has changed in three ways:

  • One is fundamentally in terms of scale. The scale of the infrastructure that I'm provisioning here may be 50 VMs, while the scale of the infrastructure I'm provisioning here may be 50 machines. So the scale challenge is just a different one as it relates to the provisioning exercise.

  • The second challenge is one of variety. Most of our large Global 2,000 customers are running not just on the private data center, but also some workloads in Amazon, some on Azure, some on Google, some on Alibaba. The challenge for the ops team is, 'how do I provision infrastructure across this variety of target platforms?'

  • Lastly, the third challenge is one of managing dependencies. As I'm provisioning infrastructure in this new world, I also need to provision the monitoring agents and the connectivity pieces that are perhaps specific to my environment. How do I include those components in that new world as well?

Meeting new ops challenges with Terraform

So for us, Terraform plays that role, and Terraform's extraordinarily popular. It is not our most popular product, but it's certainly up there. It's used to provide a consistent provisioning experience for this new cloud operating model, one that allows people to leverage all the innovation coming out of these different platforms without reducing them to a lowest common denominator.

The way Terraform actually works is relatively simple. Terraform really has two parts: there's Terraform Core, and then there's a provider for every environment you wanna interface with. There's an Amazon provider, there's an Azure provider, there's a GCP provider, there's a vSphere provider, etc. By decoupling these, much like a middleware broker and adapter would do, Terraform allows me to expose all 220 services on Amazon that I wanna invoke, and also to provision on Azure. Obviously Azure doesn't have the same 220 services; they have different services, maybe they only have 150. On GCP maybe there are 120 services, etc.

The idea here is that these cloud providers are gonna continue investing for the next 20, 30, 40, 50 years, and you wanna be able to expose the core services that those providers are gonna make available over that period of time. And so by adopting this core-plus-provider model, what you allow for is: every time Amazon introduces a new capability, it's made available in the Amazon provider for Terraform for everybody to consume.

There are two categories of providers for Terraform. There are the core infrastructure platforms, but invariably you're not just provisioning compute capacity when you're provisioning infrastructure; you need to configure something on top of that as well. So as a result, there are about 150 providers today, and that number grows every week. Whether that's for Palo Alto Networks or F5 or Kubernetes or Datadog, take your pick. These are things that are part of the provisioning process that wanna be deployed on top of the core compute.

What you can do is create a Terraform template which includes, for example, the configuration of maybe the three Amazon services that you are interested in, plus the configuration from Datadog and Kubernetes and F5 and Palo Alto Networks—that becomes a reusable template that anybody can provision. So what you now have is the ability to deploy—in a codified manner—an infinite amount of infrastructure in a very repeatable way, and that is not just including the core cloud services, but also the other aspects you wanna configure as well. That is why Terraform is widely used by operators in this new model.
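
To make that idea concrete, here is a minimal sketch of what such a reusable Terraform template might look like. It is illustrative only; the provider choices, resource names, region, AMI, and monitor settings are hypothetical placeholders rather than anything taken from the talk.

```hcl
# Illustrative only: a reusable Terraform template combining a core
# cloud resource (AWS) with a monitoring resource (Datadog).
terraform {
  required_providers {
    aws     = { source = "hashicorp/aws" }
    datadog = { source = "DataDog/datadog" }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Datadog credentials are read from environment variables
# (DD_API_KEY / DD_APP_KEY) in this sketch.
provider "datadog" {}

resource "aws_instance" "app_server" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.medium"

  tags = {
    Name = "app-server"
  }
}

resource "datadog_monitor" "app_server_cpu" {
  name    = "High CPU on app-server"
  type    = "metric alert"
  query   = "avg(last_5m):avg:system.cpu.user{host:app-server} > 80"
  message = "CPU usage is high on the app server."
}
```

Anyone on the team can review a template like this, version it, and run it repeatedly to get the same infrastructure, including the layered-on services, every time.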

Meeting new security challenges with Vault

At the security layer, the problem is also pretty profound, and we talk about it here. You're basically going from a high-trust network to a low-trust network. Therefore, you need to assume that your network is not secure. The network may in fact be secure, but it's now outside of your control, so the safe assumption is that it's not. The problem here becomes, number one, how do I protect secrets? How do I do secrets management? Secrets management being things like database usernames and passwords that previously, in this world [the static world], I could just leave unprotected. In this world [the dynamic world], that's not a solid approach.

You need to centralize your approach to how you're managing secrets. Secrets can be of all types, whether that's a database username and password or a log in to a system. You fundamentally need to assume that those things should be protected in a different way than they were previously. The second challenge is one of encryption. I should also assume that if I can encrypt everything both in flight and at rest across this distributed fleet, then it's actually okay that the network is low trust.

Therefore, I need some way of addressing the encryption challenge in addition to the secrets management challenge, and then lastly I need to be able to use identity as the basis of access rather than physical IP addresses, recognizing that each of these environments has a different identity model. For example, I might use Active Directory on premises, the Amazon IAM model on Amazon, Azure Active Directory on Azure, and the Google IAM model on Google. You start to get a sense for the challenge. So that's how Terraform works, as I described earlier. The way that Vault, which is widely used to address the challenge of cloud security, works is really to think about how things used to work and then how things work today.

In the old world, what we would do is have a client connect to a backend system like a database, pass the username and credentials, and get back information for that user. In a high-trust network, that's a valid approach. In this new world you need to insert a secondary step, and that's where Vault comes in. Vault inserts itself in the middle of this particular flow and says, 'let me authenticate against some trusted source of identity.' And there aren't that many of those in a typical organization. As I said, for your on-prem it's probably Active Directory or LDAP, or it could be Amazon IAM or the IAM model of the cloud provider. It could be OAuth from GitHub, it could be Okta, whatever system of record you trust.

The way that Vault then enforces it is this: when the client makes the request, rather than going directly to the database or the backend system (a database is one example, but it could be any kind of system or application), the request is made to Vault. Vault authenticates that request, asking 'who are you, do you exist in my records?' It then grants access to the backend system and returns a token, and that token is what's given back to the requesting client. The policy associated with that token is defined by the security team. And that is where the hand-off, and the recognition that there are multiple parties involved here, comes in. I can now give my developers and my ops team a single place to go to access a system or an application, and the policy associated with that credential, how long it lasts for example, is defined by the security team.

I, as the security team, might say, "Every time this client makes a request, give it a token that lasts for one second, or 30 seconds, or one day," or whatever the condition might be. That's the basic idea of how Vault helps people address the reality that they are now operating in a low-trust network.
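
As a rough illustration of that hand-off, here is a minimal sketch of a Vault policy written in HCL. The mount path and role name ("database/creds/app-readonly") are hypothetical; the point is simply that the security team defines, in one place, which paths a token can reach, while the credential lifetime is configured on the role itself.

```hcl
# Illustrative Vault policy, as a security team might define it.
# It lets a client read short-lived database credentials from a
# database secrets engine mounted at "database/".
path "database/creds/app-readonly" {
  capabilities = ["read"]
}

# Vault policies are deny-by-default, so only the paths listed here
# are reachable by a token carrying this policy. The TTL of the
# generated credential is set on the role, not in the policy.
```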

We have this idea of auth backends and secret backends; the same way that Terraform has the idea of providers, Vault has this idea of backends. For auth backends, there's a limited number of authentication mechanisms that you might have, and you can be fairly confident those are all supported in Vault. But then there's essentially an infinite number of secret backends, maybe for an Oracle database or an SAP HANA database or an application system.

There's also a transit backend that provides encryption as a service, and a PKI backend that lets Vault act as a certificate authority, so you can encrypt data across your fleet. So without changing the workflow of the application, you can now start to encrypt, by policy, everything in that flow. And that's how Vault addresses the challenges of running in a low-trust network.
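
One way to stand up that transit backend is through Terraform's Vault provider. This is a minimal sketch under that assumption; the mount path and key name are hypothetical.

```hcl
# Illustrative sketch: mount the transit secrets engine and create a
# named encryption key, so applications can ask Vault to encrypt and
# decrypt data without ever handling the key material themselves.
resource "vault_mount" "transit" {
  path = "transit"
  type = "transit"
}

resource "vault_transit_secret_backend_key" "app" {
  backend = vault_mount.transit.path
  name    = "app-encryption-key" # hypothetical key name
}
```

Applications then call the encrypt and decrypt endpoints under that mount rather than managing key material locally.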

Meeting new deployment challenges with Nomad

The third layer is that your developers, in this new world, have to figure out how to navigate this new way of deploying applications, and frankly their job is much easier than everybody else's. The core challenge here is, 'hey, you're gonna have heterogeneity at this layer: you're gonna have some Java apps, you're gonna have some C# apps, you're gonna have some .NET apps, you're gonna have some Hadoop apps, you're gonna have some container-based apps, you're gonna have some VMs.' And the challenge really breaks down into two parts.

One is, how do I separate concerns so my developers don't have to know where everything is gonna run? It shouldn't matter; they just get to say, "I'm gonna deploy this application, and this is what this application needs." Number two, the challenge is one of bin packing. If I'm spinning up a hundred thousand servers for my applications, I'm paying for them all, and therefore how can I schedule the core functions of these apps to use the infrastructure efficiently? And Nomad is widely used to do that.
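
To illustrate that separation of concerns, here is a minimal sketch of a Nomad job spec. The job name, Docker image, and resource figures are hypothetical; the developer declares what the application needs, and the scheduler decides where it runs and packs it efficiently onto the fleet.

```hcl
# Illustrative Nomad job: the developer describes the workload;
# Nomad places it on available machines and bin-packs it alongside
# other workloads, restarting it elsewhere if a node fails.
job "web" {
  datacenters = ["dc1"]

  group "frontend" {
    count = 3

    task "server" {
      driver = "docker"

      config {
        image = "example/web:1.0" # hypothetical image
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```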

If you have containerized applications, you may be using something like Kubernetes for them; it's very, very common for people to use Kubernetes for some of those pieces, Nomad for some of the other pieces, Cloud Foundry for other pieces, and a Java application server for others. The point is there will be heterogeneity at this layer.

Meeting new networking challenges with Consul

The last challenge, and the fundamental challenge of this model, is actually knowing where everything is. You're now spinning things up across a distributed environment, and so our most widely used product is, in fact, Consul.

Consul is deployed across really every machine that runs in these distributed environments as the system of record that tells you where everything is. It serves a few different purposes. The first of those: it acts as a dynamic service registry that tells you where everything is in this distributed world, so when an application gets deployed, Consul knows where the latest version of that service is. For the ops teams, it acts as a dynamic infrastructure registry. Where is everything? How many app servers do I have running? How many containers are running in my environment? Where is the database?

Previously [in the static world], when I deployed, as an example, a database into my environment, it would have an IP address, 1.1.1.1. In this new world, because everything's moving around, people use Consul to say, "I'm deploying this database, here it is, it's DB1." And then that becomes the basis of how you communicate with it, rather than the IP address. So it acts as this common registry and backbone that allows you to establish a mesh of where all the services are in your environment. As I said, Consul is by far our most widely used product, and it's used not just in the container landscape; it's used for fronting mainframes and basically everything in between.
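
Here is a minimal sketch of what that registration might look like as a Consul service definition. The service name, port, and health check are hypothetical; the idea is that consumers look up "db1" in the registry instead of hard-coding 1.1.1.1.

```hcl
# Illustrative Consul service definition: the database registers
# itself under a logical name, and a health check keeps the registry
# accurate as instances come, go, and move around.
service {
  name = "db1"
  port = 5432

  check {
    tcp      = "localhost:5432"
    interval = "10s"
  }
}
```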

The last use case of Consul is this: once all the services are registered and discovered, and I know I have a hundred thousand databases in my environment, you can then use Consul to enforce connectivity in terms of what service can connect to what service. So for the security teams, Consul plays a role in service connectivity. That matters because the core challenge of how I actually enable applications to see the light of day in this new world is fundamentally a security one. The Consul Connect capability within Consul, which is what allows that connectivity, is what really makes Consul so powerful as a common service mesh to underpin this distributed fabric of compute.
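
Those connectivity rules are expressed as intentions. Here is a minimal sketch of a service-intentions config entry; the service names are hypothetical, and the security team would apply something like this rather than managing firewall rules per IP address.

```hcl
# Illustrative Consul service-intentions config entry: allow the
# "web" service to reach "db1" over the mesh, and deny everything
# else by default.
Kind = "service-intentions"
Name = "db1"

Sources = [
  {
    Name   = "web"
    Action = "allow"
  },
  {
    Name   = "*"
    Action = "deny"
  }
]
```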

Final takeaways

That's how we think about the pieces. I think the seminal takeaway for me is that we are going through this transition to a different operating model for infrastructure, from static to dynamic; we think about it in terms of the cloud operating model. We actually very often see that cloud operating model adopted on premises, but it's a different mindset. It's about ephemeral infrastructure, it's about assuming low trust, it's about dynamic IP addresses, it's about moving to software-based everything. And while it came to bear as a result of this transition to cloud, it is just as relevant on premises. For example, there's a vSphere provider for Terraform, and that's how people do that. People use Vault on premises very often because assuming a low-trust network is actually a good idea.

So, number one, we're going through this shift as an industry. Number two, the best way we've found to describe it is to decompose it into the problems faced by each person, or certainly each practitioner type, inside the IT organization, and in so doing establish a common operating model that allows our customers to adopt cloud.
