As we move from static, on-prem infrastructure to dynamic cloud services and hybrid infrastructures, there are several challenges for IT operators, networking teams, security teams, and developers that need to be addressed.
Hi, my name is Armon, and what we’re going to talk about today is: As we make that journey from private data center to operating in a multi-cloud world where we have our on-premises environment plus multiple cloud providers, what changes in terms of how we deliver our infrastructure for all the different people involved, whether that’s IT operators, networking teams, security teams, or developers?
When we talk about multi-cloud adoption, we often see that the starting point is still the private data center. Most organizations have an existing infrastructure running within their own 4 walls, and so as they transition to multi-cloud, in the short term that’s not going anywhere—we still have our private data center. But then in addition to that, we’re going to start layering in some of our preferred cloud partners. We might add in AWS and Azure and GCP, and in China we might have AliCloud. So we’re going to start to add these additional targets on top of what we already have.
I think one challenge about going through this journey is technology re-platforming. In our traditional data center, we were largely homogeneous—likely heavy open source with OpenStack, or more of a proprietary platform with VMware—but largely we see a lot of standardization around 1 platform within the private data center.
But as we now talk about being in this multi-cloud reality, each of these platforms is API-driven, but with a different set of APIs. We can’t really dictate to our cloud vendors, and there’s really no standardization. So one challenge as we do this re-platforming is the diversity of the platforms themselves and the fact that all of them are different APIs. I think that’s one piece of it.
The other piece, in some sense more challenging, is the process shift. When we talk about the private data center, traditionally it’s very much an ITIL-driven model. And so really it’s an organization of our teams around specific technologies. We might have our VMware team and our firewall team and our load balancer team.
And the experience of interacting with this is we file tickets across each of them. As an example, let’s say we only have these 3 key teams, our VMware team, let’s say our F5 team for load balancing, and our Palo Alto networks team for firewalls. We’d first file a ticket against the VMware team and say, “I’d like a VM.” And then wait some number of weeks until that’s provisioned. Then followed by creating a new ticket against our load balancer team, and waiting a few weeks until the load balancer is updated. And then lastly filing a ticket against our firewall team, and waiting a few weeks until that gets provisioned.
So the experience as a consumer of this is: we wait weeks to months until we get all the way across this process. And this is really our time-to-value, because until we get through this entire process, it’s not useful that I have a VM if no traffic is going to it from the load balancer and none of the firewall rules are open. I can’t really do anything with it. I have to progress through the entire pipeline before this is actually useful.
As we talk about transitioning in a multi-cloud world to embracing these other platforms, very few organizations are saying, “I want to keep the existing process, but I just want to support multiple platforms.” In practice, what we see is that most organizations want something much closer to a self-service experience.
You might call this a number of different things. It might be self-service, it might be DevOps, it might be CI/CD. But really it’s this notion of, over here, our groups are not actually empowered, you file a ticket, and you wait, and you have to orchestrate across many different groups, versus, over here, how do we empower the end development teams to deliver their applications without necessarily waiting weeks to months?
In our view, the combination of these 2 things in some sense breaks everything about how we do traditional IT. So for us, it’s really about, How do we then think about the key personas involved in this transition, and what changes for each of them? At the base of this is thinking about our IT operations teams and their challenges as they provision infrastructure.
So when we talk about provisioning infrastructure, it’s easy to only think about it in that Day 1 context. I don’t have anything running; I want to provision and have my initial set of VMs or servers or containers running. When we talk about provisioning, we really need the full lifecycle. So it’s not just the Day 1 provisioning; it’s the Day 2 patching, it’s the Day 2 upgrade, it’s the scale up, scale down, deploy new versions. And then finally, as you get to Day N, the decommissioning.
So it’s the full lifecycle when we talk about provisioning.
The next challenge is for our security teams, and increasingly they work with our ops teams in terms of: How do we secure all of our infrastructure?
And when we are talking about securing infrastructure, it’s really a few different layers. Yes, it’s the underlying infrastructure: access to our VMs, access to our databases. But it’s also the higher-level application. How do we provide credentials like database usernames and passwords, API tokens, and certificates to the apps themselves? And it goes all the way up to data protection. If our application needs to encrypt data or it needs certificates or it needs any way of managing data at rest or data in transit, there’s a data security aspect in here as well. So how does our security team plug into that, especially as we span multiple environments?
The next challenge is: What about networking teams? Historically, when everything was on-prem, we owned our own networking appliances. We bought the Cisco gear and the F5 gear and the Palo Alto gear, and we had strong levels of control around all the physical infrastructure. Increasingly, though, if we’re in these cloud environments, we don’t. We can’t buy our Cisco device and ship it to AWS. We get whatever network is provided by the clouds. So how do we solve a lot of the same networking challenges without having the same level of control over the actual hardware that we’re using?
Again, this used to be only our networking team’s concern. But it increasingly overlaps with operations, and it’s really around the connectivity challenge of our different applications and infrastructure.
The final layer impacts our developers. If I’m a developer, what I really want to know about is the runtime that I use to deploy and manage my application. That’s the final layer of this. And I think within each of these groups, there’s a transition that needs to take place as we go from our traditional private data center, which was largely ITIL, to this more self-service, multi-cloud platform.
Just to highlight briefly, when we look at the IT Ops layer, the key shift for them is that they used to get a ticket and then point and click in some console, let’s say VMware, in the case of my compute team.
You could replace this with the same thing for the F5 team or the Palo Alto team. It’s still a ticket; it’s just a different system that we’re in. So the transition really needs to be: We’re no longer going to do a ticket-oriented approach because this won’t scale. If my goal is self-service, I don’t want my application team to file a ticket. They need to be enabled to do the provisioning themselves.
And similarly, our platform is no longer homogeneously VMware or OpenStack. We need to embrace a wider variety of platforms. So the approach we like to take is to say that what you have to move to is: Capture everything as code—so, an infrastructure as code approach.
What this lets you do is take that specialized knowledge that your administrator had in terms of, “What buttons do I point and click to bring up a VM?” You capture that knowledge now as code, and now you can put this in an automated pipeline.
Anyone who wants a VM can hit that pipeline and provision an identical copy that follows all the best practices without needing to file a ticket and having that expert person do it manually. It’s about getting that knowledge out of people’s heads, documenting it in a way that’s versioned, and putting it into something that can be automated like a CI/CD pipeline.
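To make the "capture it as code" idea a little more concrete, here is a minimal sketch in Python. It is not how any particular tool works internally; the `VM` type, the `plan` function, and the machine names are all hypothetical. The point is only that once the desired state is written down as data in version control, a pipeline can compute what needs to change instead of a person clicking through a console.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical description of one virtual machine."""
    name: str
    cpus: int
    memory_gb: int


def plan(desired: dict[str, VM], current: dict[str, VM]) -> dict[str, list[str]]:
    """Compare desired state (the code) with current state and list the actions."""
    return {
        "create": sorted(n for n in desired if n not in current),
        "update": sorted(n for n in desired if n in current and desired[n] != current[n]),
        "delete": sorted(n for n in current if n not in desired),
    }


# Desired state lives in version control; the pipeline computes the difference.
desired = {
    "web-1": VM("web-1", cpus=2, memory_gb=4),
    "web-2": VM("web-2", cpus=2, memory_gb=4),
}
# Current state is whatever is actually running today (made up for this sketch).
current = {
    "web-1": VM("web-1", cpus=1, memory_gb=4),
    "old-batch": VM("old-batch", cpus=8, memory_gb=32),
}
actions = plan(desired, current)
print(actions)
```

Because the plan is computed from the written-down state, two people running the same pipeline against the same inputs get the same list of actions.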
The other side of this is we want a platform that’s extensible. As we embrace other technologies and other platforms, we don’t want a different automation tool for every single one of our technologies. We might end up with 5 different sets of tools for 5 different platforms, and this just becomes a challenge in terms of maintenance and learning and upkeep with it.
Rather, we’d like to have a single platform that’s extensible to different types of technologies. The approach Terraform takes with this is: There’s a consistent way with HCL, the HashiCorp config language, to specify the as-code configuration. You provide that to Terraform, which has a common core, and then on the backend, Terraform has what we call providers. Providers might be something like AWS or Azure or, if we’re on premises, it might be VMware and F5.
What providers do is act as the glue between Terraform and the outside world. This is the key extension point for Terraform as we embrace new technologies. If tomorrow we say, “Now we’re going to embrace AliCloud,” we can just start using the AliCloud provider for Terraform, and it doesn’t change anything about our actual workflow.
We just specify it in the same config language and use the same workflow (Day 1 creating infrastructure, Day 2 managing it, Day N decommissioning it), but now we can extend it to support other technologies.
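The common-core-plus-providers shape can be sketched roughly like this. This is an illustration in Python, not Terraform's actual plugin interface; `Provider`, `FakeCloudProvider`, and `Core` are hypothetical names, and the providers here are in-memory stand-ins rather than real cloud APIs. What it tries to show is that adding a platform means registering one more provider while the workflow stays the same.

```python
from typing import Protocol


class Provider(Protocol):
    """Hypothetical plug-in interface: one implementation per platform."""

    def create(self, name: str) -> str: ...

    def destroy(self, resource_id: str) -> None: ...


class FakeCloudProvider:
    """In-memory stand-in for a real platform API (assumption for this sketch)."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.resources: set[str] = set()

    def create(self, name: str) -> str:
        resource_id = f"{self.prefix}:{name}"
        self.resources.add(resource_id)
        return resource_id

    def destroy(self, resource_id: str) -> None:
        self.resources.discard(resource_id)


class Core:
    """Common workflow that does not change when a new provider is added."""

    def __init__(self) -> None:
        self.providers: dict[str, Provider] = {}
        self.state: dict[str, tuple[str, str]] = {}  # name -> (provider, resource id)

    def register(self, name: str, provider: Provider) -> None:
        self.providers[name] = provider

    def apply(self, provider: str, name: str) -> str:
        resource_id = self.providers[provider].create(name)
        self.state[name] = (provider, resource_id)
        return resource_id

    def destroy(self, name: str) -> None:
        provider, resource_id = self.state.pop(name)
        self.providers[provider].destroy(resource_id)


core = Core()
core.register("aws", FakeCloudProvider("aws"))
core.register("alicloud", FakeCloudProvider("alicloud"))  # new platform, same workflow
first = core.apply("aws", "web")            # Day 1: create
second = core.apply("alicloud", "web-cn")
core.destroy("web")                         # Day N: decommission
```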
The security layer is the hardest part, and there’s a challenge around the mental model that we’re using for this infrastructure.
With the traditional approach, it’s very much what we like to call “castle and moat.” We wrapped the 4 walls of our data center in an impenetrable perimeter, and then over the front door where we bring all of our traffic in, we deploy all of our middleware. We have our firewalls and our WAFs and our SIEMs and all of our fancy middleware. And what we’re asserting is: Outside bad, inside good.
If we have a large enough network what we might do is segment this into a series of VLANs and split the traffic out.
This was the historical approach. The challenge now is we’re going to take these fluffy clouds and basically connect them all together in a super-network. We might use different technologies for this. It might be an Express Connect or a VPN overlay or some other form of networking technology, but effectively we’re going to merge all of these into 1 large network.
Now the challenge becomes: Here we had a perimeter that we trusted to be 100% effective, whereas here we no longer do. On a good day, this perimeter might be 80% effective. So the challenge becomes: We moved from a model where we assume our attacker is on the outside of our network and they’re thwarted by our front door to a model where we assume the attacker is on network because our perimeter is not perfectly effective.
As we go through this shift, it’s a huge change to how we think about network security, and I think it brings a few key things into focus. One of these is secrets management. How do we move away from basically having credentials sprawled throughout our estate? Previously, if we had a web app and it wanted to talk to the database, the credentials would likely just be hardcoded in plaintext in the source or a config file or something like that.
Instead, we need to move to a model where we have a newer system like Vault, which the application explicitly authenticates itself against, it must be authorized, there’s an audit trail of who did what, and if it matches all of those constraints, then it can get a credential that lets it go talk to the database. It’s really about applying some level of rigor and cleaning up these secret credentials from being strewn about the environment.
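The sequence described here (authenticate, authorize, audit, then release the credential) can be sketched as a toy in Python. This is not Vault's API; `SecretBroker`, its methods, the secret paths, and the token values are all made up for illustration, and a real system would authenticate with something much stronger than a shared token.

```python
import hmac


class SecretBroker:
    """Toy stand-in for a secrets manager: authenticate, authorize, audit, release."""

    def __init__(self) -> None:
        self._tokens: dict[str, str] = {}         # app identity -> login token
        self._policies: dict[str, set[str]] = {}  # app identity -> allowed secret paths
        self._secrets: dict[str, str] = {}        # secret path -> value
        self.audit_log: list[tuple[str, str, str]] = []  # (identity, path, outcome)

    def enroll(self, identity: str, token: str, allowed: set[str]) -> None:
        self._tokens[identity] = token
        self._policies[identity] = allowed

    def store(self, path: str, value: str) -> None:
        self._secrets[path] = value

    def read(self, identity: str, token: str, path: str) -> str:
        expected = self._tokens.get(identity, "")
        # Step 1: the application must prove who it is.
        if not expected or not hmac.compare_digest(expected.encode(), token.encode()):
            self.audit_log.append((identity, path, "denied: authentication"))
            raise PermissionError("authentication failed")
        # Step 2: it must be authorized for this specific secret.
        if path not in self._policies[identity]:
            self.audit_log.append((identity, path, "denied: authorization"))
            raise PermissionError("not authorized for this path")
        # Step 3: every decision leaves an audit record.
        self.audit_log.append((identity, path, "granted"))
        return self._secrets[path]


broker = SecretBroker()
broker.store("database/web", "example-password")
broker.store("database/billing", "other-password")
broker.enroll("web", "web-login-token", {"database/web"})

password = broker.read("web", "web-login-token", "database/web")
try:
    broker.read("web", "web-login-token", "database/billing")
    denied = False
except PermissionError:
    denied = True
```

The credential no longer sits in source or a config file; the application has to ask for it each time, and both the grant and the denial show up in the audit log.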
The next big challenge is: How do we think about data protection? Historically we relied on the 4 walls. The customer provided us, let’s say, some credit card data, and we just wrote that to our database and we said, “We’re safe because the database is within the 4 walls.” Now, in practice, this was never a good idea, but our security model assumed that the attacker was stopped on the outside, and so it was allowable by the security model.
Now, as we say, the attackers are inside of the 4 walls. This is a really bad place to be. And again, maybe classically what we would’ve done is something like transparent disk encryption so that the database would encrypt the data on the way out to disk. But in practice, that doesn’t protect us against attackers on network. Because if I’m on network and I can say “select * from customers”, the database will transparently decrypt the data on the way back from disk.
Instead what you need to do is think about data protection as something that’s not invisible to the application. When the application gets the credit card data or Social Security data, it interacts with Vault as a piece of middleware and says, “Please encrypt this data.” And then that encrypted data is what gets stored back in the database. In this sense, the application is aware of Vault and is using it as part of the request flow to encrypt and decrypt and manage data at rest.
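Here is a rough Python sketch of that request flow. The `ToyEncryptionService` class is a hypothetical stand-in for the encryption middleware, and its cipher is deliberately simplistic: it is NOT real cryptography and must not be used to protect anything. The dictionary standing in for the database and the sample card number are also assumptions. The only thing the sketch is meant to show is that the database never sees the plaintext.

```python
import hashlib
import secrets


class ToyEncryptionService:
    """Stand-in for an encryption service. NOT real cryptography; illustration only."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)  # the key never leaves the service

    def _keystream(self, nonce: bytes, length: int) -> bytes:
        out = b""
        counter = 0
        while len(out) < length:
            out += hashlib.sha256(self._key + nonce + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:length]

    def encrypt(self, plaintext: str) -> str:
        data = plaintext.encode()
        nonce = secrets.token_bytes(16)
        body = bytes(a ^ b for a, b in zip(data, self._keystream(nonce, len(data))))
        return "toy:v1:" + (nonce + body).hex()

    def decrypt(self, ciphertext: str) -> str:
        raw = bytes.fromhex(ciphertext.removeprefix("toy:v1:"))
        nonce, body = raw[:16], raw[16:]
        return bytes(a ^ b for a, b in zip(body, self._keystream(nonce, len(body)))).decode()


database: dict[str, str] = {}  # stands in for the customers table
service = ToyEncryptionService()

card = "4111 1111 1111 1111"                      # sample value, not a real card
database["customer-1"] = service.encrypt(card)    # the app encrypts before writing

stored = database["customer-1"]       # what an attacker's "select *" would return
round_trip = service.decrypt(stored)  # only a caller allowed to use the service gets this
```

An attacker who can query the database directly gets back the opaque `toy:v1:...` string, so reading the plaintext also requires being authenticated and authorized against the encryption service.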
There’s a lot more in here, but it’s really around thinking about: What are the primitives we have to provide our developers to be able to protect this data at rest?
The other side of this is data in transit, and the gold standard there is TLS. We want to use TLS to secure our traffic between our different applications and services on the inside. The challenge of getting good at TLS, though, is that we have to get good at managing an internal PKI infrastructure. And what we see with most organizations is that they’re not very good with PKI. What we see is you might generate a certificate that’s valid for 5 or 10 years and check that into a system like Vault and treat it like a secret that should be managed for many years at a time.
And in practice, what always ends up happening is you’ll generate a cert that’s valid for 10 years, it’s running in production for some period of time, and eventually that cert expires and the service goes down in production. You’ve got a Sev-1 outage because the certificate expired.
Our view is it’s really hard to get good at a thing that you do once every 10 years. So if you’re going to rotate certificates on a 5- or 10-year basis, it’s almost inevitable that you’re going to have these types of outages, because it’s just not something that you’re exercising frequently. So the way we tend to try and solve this is we think about it more like a logrotate. You don’t take Sev-1 outages because of logrotate; it just happens every night. So how do you make that possible?
The key is you start to think about Vault as not merely holding on to a set of these certificates that live for a really long time, and instead think about Vault as a programmatic certificate authority. So what we’re saying is that at any time the web server is allowed to come in and request www.hashicorp.com. What we’re not specifying is the exact certificate that the web server’s going to get, the particular set of random bits. We don’t really care.
What we’re saying is the web service is authorized, so we worry about the identity of this service as well as what it’s authorized to do, and we don’t really care about the specifics of what the value is. So what this enables Vault to do is programmatically generate that certificate. So when a web server comes in and says, “Give me www.hashicorp.com,” Vault can generate and sign a brand-new random certificate. And by the way, that certificate is only valid for 24 hours, or 7 days, or maybe 30 days.
This starts to flip the model where, instead of having these certificates that you create and manage for years at a time, what you’re managing is this authorization. You manage the authorization that the web server is allowed to request a certificate, and then we treat it like logrotate. Every 24 hours it fetches a new certificate and it rotates nightly to the new certificate. But it’s no longer, “Check out the very long-lived cert and use it until it expires.”
So you start to flip the paradigm a little and use automation to handle this rather than having a manual remediation process when a certificate expires. This flow of taking the identity of an application and mapping it to a set of authorizations, ignoring the actual value, is what Vault refers to as a dynamic secret.
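The "treat it like logrotate" idea can be sketched as follows. This is a toy in Python, not Vault's PKI interface: `ToyCertificateAuthority`, `Certificate`, `rotate_if_due`, and the two-hour renewal window are hypothetical, and the "certificate" is just a record with a random serial rather than a real X.509 certificate. Time is passed in explicitly so the behavior is easy to follow.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    common_name: str
    serial: str       # random on every issue; nobody manages this value by hand
    not_after: float  # expiry, in seconds on the caller's clock


class ToyCertificateAuthority:
    """Stand-in for a programmatic CA: it manages who may ask, not the cert itself."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._allowed: dict[str, set[str]] = {}  # service identity -> names it may request

    def authorize(self, identity: str, common_name: str) -> None:
        self._allowed.setdefault(identity, set()).add(common_name)

    def issue(self, identity: str, common_name: str, now: float) -> Certificate:
        if common_name not in self._allowed.get(identity, set()):
            raise PermissionError(f"{identity} may not request {common_name}")
        return Certificate(common_name, secrets.token_hex(16), now + self.ttl_seconds)


def rotate_if_due(ca: ToyCertificateAuthority, identity: str, cert: Certificate,
                  now: float, renew_before: float) -> Certificate:
    """Fetch a fresh certificate once the current one is within renew_before of expiry."""
    if cert.not_after - now <= renew_before:
        return ca.issue(identity, cert.common_name, now)
    return cert


HOUR = 3600.0
ca = ToyCertificateAuthority(ttl_seconds=24 * HOUR)
ca.authorize("web", "www.hashicorp.com")

cert = ca.issue("web", "www.hashicorp.com", now=0.0)
same = rotate_if_due(ca, "web", cert, now=1 * HOUR, renew_before=2 * HOUR)
fresh = rotate_if_due(ca, "web", cert, now=23 * HOUR, renew_before=2 * HOUR)
```

What gets managed over the long term is the `authorize` call. The certificates themselves come and go on a schedule, so rotation is exercised every day instead of once a decade.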
This pattern can be applied to not just certificates. Another analogy would be: The web server at any time is allowed to ask for a database credential, or at any time can come in and ask for, let’s say, an AWS S3 token that allows it to go read and write from AWS S3. Once we start thinking about it in this way, we don’t know what the database credentials are, and we don’t know what the AWS S3 IAM token is.
Instead we’re managing that the web server is authorized to request these credentials and, on demand when that request is made, Vault will generate a new dynamic credential that’s short-lived. It’s only valid, let’s say, 24 hours for S3 or 30 days for the database. So we start to think about the credentials not as a thing that we manually manage, but as a thing that Vault creates dynamically and is ephemeral in the environment. It exists for some period of time, and when it doesn’t need to exist anymore, Vault shreds it and moves on to some new set of dynamic credentials. But we don’t necessarily know what those credentials are.
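As a small illustration of credentials that are created on demand and then shredded, here is a Python sketch. The `ToyDynamicSecrets` class and its role names are hypothetical and this is not how Vault implements leases; the authorization check on who may ask is left out here to keep the focus on the lifetime of the credential.

```python
import secrets


class ToyDynamicSecrets:
    """Toy lease table: credentials are generated per request and expire on their own."""

    def __init__(self) -> None:
        self._leases: dict[str, tuple[str, float]] = {}  # username -> (password, expires_at)

    def issue(self, role: str, ttl_seconds: float, now: float) -> tuple[str, str]:
        # Nobody chooses or records these values ahead of time.
        username = f"{role}-{secrets.token_hex(4)}"
        password = secrets.token_urlsafe(24)
        self._leases[username] = (password, now + ttl_seconds)
        return username, password

    def is_valid(self, username: str, password: str, now: float) -> bool:
        lease = self._leases.get(username)
        return lease is not None and lease[0] == password and now < lease[1]

    def shred_expired(self, now: float) -> int:
        expired = [user for user, (_, expires_at) in self._leases.items() if now >= expires_at]
        for user in expired:
            del self._leases[user]
        return len(expired)


DAY = 86400.0
engine = ToyDynamicSecrets()
s3_user, s3_pass = engine.issue("s3-readwrite", ttl_seconds=1 * DAY, now=0.0)
db_user, db_pass = engine.issue("database", ttl_seconds=30 * DAY, now=0.0)

valid_early = engine.is_valid(s3_user, s3_pass, now=0.5 * DAY)
shredded = engine.shred_expired(now=2 * DAY)  # the 24-hour S3 credential is gone
s3_still_valid = engine.is_valid(s3_user, s3_pass, now=2 * DAY)
db_still_valid = engine.is_valid(db_user, db_pass, now=2 * DAY)
```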
And so you can almost then extend this pattern to this notion of what we call “identity brokering.” The challenge that we’re trying to solve is: As we go to the multi-cloud world, we have different notions of identity in each of these environments. So at our private data center we might use Active Directory to provide a sense of identity. Whereas in AWS we have IAM, whereas in Azure we have AAD credentials, etc.
Each of these environments has a different notion of what identity means, and so if we have an application that needs to work across this—for example, maybe I have an application that runs in the private data center, but it needs to read and write data from S3, or I have an application in AWS and it needs to read and write from Azure Blob Storage—how do we broker those identities? I want to trade in my AAD identity and get a new IAM identity.
And that’s where this mechanism comes into play. It might be that we use Active Directory to authenticate an application as being a web server and then authorize it to follow this path to request an Amazon S3 credential. And so in this way, Vault now acted as an identity broker. It accepted a client that authenticated against Active Directory, and then it was authorized to request an S3 credential allowing us to do this brokering between different platforms.
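The brokering step can be sketched like this in Python. Everything here is hypothetical: `ToyIdentityBroker`, the login string, and the platform names are made up, the directory check is reduced to a dictionary lookup, and the returned token is a random string rather than a real cloud credential. The sketch assumes the login has already been verified by the directory.

```python
import secrets


class ToyIdentityBroker:
    """Toy broker: accept an identity from one system, hand out a credential for another."""

    def __init__(self) -> None:
        self._directory: dict[str, str] = {}    # directory login -> role, e.g. "web"
        self._grants: dict[str, set[str]] = {}  # role -> target platforms it may use

    def register_login(self, login: str, role: str) -> None:
        self._directory[login] = role

    def grant(self, role: str, platform: str) -> None:
        self._grants.setdefault(role, set()).add(platform)

    def exchange(self, login: str, platform: str) -> dict[str, str]:
        # Authenticate: map the incoming identity to a role we know about.
        role = self._directory.get(login)
        if role is None:
            raise PermissionError("unknown identity")
        # Authorize: is this role allowed to ask for credentials on that platform?
        if platform not in self._grants.get(role, set()):
            raise PermissionError(f"{role} may not request {platform} credentials")
        # Broker: hand back a fresh credential for the target platform.
        return {"platform": platform, "role": role, "token": secrets.token_urlsafe(24)}


broker = ToyIdentityBroker()
broker.register_login("svc-web@corp.example", "web")  # identity from the on-prem directory
broker.grant("web", "s3")

credential = broker.exchange("svc-web@corp.example", "s3")
try:
    broker.exchange("svc-web@corp.example", "azure-blob")
    blocked = False
except PermissionError:
    blocked = True
```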
These become some of the challenges security teams have to think about, where before it was, How do you lock down the 4 walls and deploy a bunch of network controls centrally, to make sure the traffic coming in is filtered and vetted and trusted? And basically we asserted that this perimeter was the point at which we stopped our threats, and the inside was soft.
Where now we accept, “You know what? Perimeter is only 80% effective. The attacker is now on the inside,” as part of our assumption, these are the pieces we have to start thinking about as being in scope. We don’t trust that the credentials can be in plaintext everywhere, so how do we apply secrets management? We don’t trust that the database being behind the 4 walls is sufficient, so how do we encrypt data at rest, or encrypt data in transit? And how do we move toward this notion of ephemeral credentials?
Because if our application logs it to disk or it gets leaked through an environment variable or it gets leaked through an exception traceback or a monitoring system etc., these should not be credentials that are valid for days, weeks, months, or years. Instead, they become these dynamic things that are constantly being shredded and rotated and are ephemeral in this environment.
So as we move up a layer and talk about the connectivity challenge, the challenge for our networking teams is a) They don’t control the network anymore; in these environments, the network is defined by the cloud providers, and it is what it is. b) They need to work more closely with operations teams because we’re trying to go a lot faster.
It used to be that it was OK that it took weeks or months to update the load balancers and firewalls. That’s not OK if we’re trying to get to a place where we can deploy 5 times a day. So if we look at the classic network, you might say, “I have a Service A and it wants to talk to a Service B, but to do so it’s going to transit past a firewall, and it’s going to hardcode a load balancer.” So we’re going to transit past the firewall, talk to the load balancer, and the load balancer will bring us back to B.
What we have to do if we are, let’s say, deploying a new instance of B, is file a ticket against the load balancer team, file a ticket against the firewall team, and ask that the network path be updated to allow traffic to flow correctly. This tends to be manual; it tends to take time. So the approach that we need to move toward is, first of all, how do we automate these updates? And second of all, how do we do some of this function, which is authorization, in the case of the firewall?
What the firewall is really saying is that A is authorized to talk to B, and the load balancer is handling the routing. How do we solve some of these problems without depending on hardware? The first-level problem here is, How do we stand up a central registry?
And this is the goal with Consul, such that when an application boots, when a new instance of B comes up, it gets populated in this registry, and we have a bird’s-eye view of everything that’s running in our infrastructure.
What this lets us do is start to drive downstream automation. We can use the registry to run updates against our firewall and our load balancers and even to inform our clients. So we don’t have to do manual updates. When a new instance of B comes up, we’re not manually updating the load balancer. The load balancer is simply being notified that, “Hey, there’s a new instance of B. Add that to the backend and start routing traffic to it.” Same with the firewall; same with our downstream service.
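A registry that drives downstream automation might look roughly like this. It is a Python toy, not Consul's API; `ToyRegistry`, `ToyLoadBalancer`, the callback shape, and the addresses are all assumptions for illustration. The load balancer subscribes once and is then notified on every change.

```python
from collections import defaultdict
from typing import Callable

Watcher = Callable[[str, list[str]], None]


class ToyRegistry:
    """Central view of which instances of each service are running."""

    def __init__(self) -> None:
        self._instances: dict[str, list[str]] = defaultdict(list)
        self._watchers: list[Watcher] = []

    def watch(self, callback: Watcher) -> None:
        self._watchers.append(callback)

    def register(self, service: str, address: str) -> None:
        self._instances[service].append(address)
        self._notify(service)

    def deregister(self, service: str, address: str) -> None:
        self._instances[service].remove(address)
        self._notify(service)

    def _notify(self, service: str) -> None:
        # Every watcher gets its own copy of the current instance list.
        for callback in self._watchers:
            callback(service, list(self._instances[service]))


class ToyLoadBalancer:
    """Downstream consumer: its backend pool follows the registry, no ticket needed."""

    def __init__(self) -> None:
        self.backends: dict[str, list[str]] = {}

    def on_change(self, service: str, addresses: list[str]) -> None:
        self.backends[service] = addresses


registry = ToyRegistry()
load_balancer = ToyLoadBalancer()
registry.watch(load_balancer.on_change)

registry.register("b", "10.0.0.5:8080")
registry.register("b", "10.0.0.6:8080")   # a new instance of B comes up
after_scale_up = list(load_balancer.backends["b"])

registry.deregister("b", "10.0.0.5:8080")  # an instance goes away
after_scale_down = list(load_balancer.backends["b"])
```

The same `watch` hook could feed a firewall rule generator or a client-side cache; the registry does not need to know who is listening.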
The next piece of this is looking at it and saying, “Can we take this whole middle layer and shift it out of the network entirely?” This becomes the classic service mesh approach. So to do that, you are shifting the networking challenge out of the network.
If I have Application A and it wants to talk to Application B, then we deploy a series of intermediate proxies. This might be something like Envoy running on the machine. A is talking out through the proxy. The proxy is the one talking to another proxy, and it brings us back.
Now the outgoing proxy is responsible for figuring out which instance of B we should actually route traffic to. So our routing and load balancing decisions get shifted over here to the outgoing proxy. Again, we’re moving this out of the network to the edge, it’s now running on node with a proxy, but serving that same routing function.
Similarly on the other side, what we’re doing is filtering who’s allowed to talk to us. So instead of depending on a firewall running in the middle of the network, what we’re doing is, when traffic comes in, we’re making a decision, yes or no: “Are you allowed to communicate with me?” And in effect, we’ve moved the authorization decision out of the network and onto the edge.
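The two proxy roles can be sketched in a few lines of Python. This is not Envoy or any real mesh; `OutboundProxy`, `InboundProxy`, round-robin as the routing policy, and the service names are assumptions, and the caller's identity is simply taken as already verified. The sketch only shows where the routing decision and the authorization decision now live.

```python
import itertools


class OutboundProxy:
    """Runs next to the caller and picks which instance of the destination to use."""

    def __init__(self, instances: dict[str, list[str]]) -> None:
        self._cycles = {svc: itertools.cycle(addrs) for svc, addrs in instances.items()}

    def pick(self, service: str) -> str:
        return next(self._cycles[service])  # simple round-robin load balancing


class InboundProxy:
    """Runs next to the destination and decides whether the caller may connect."""

    def __init__(self, service: str, allowed_callers: set[str]) -> None:
        self.service = service
        self.allowed_callers = allowed_callers

    def accept(self, caller_identity: str) -> bool:
        return caller_identity in self.allowed_callers


# Routing moves from the central load balancer to the caller's side.
outbound = OutboundProxy({"b": ["10.0.0.5:8080", "10.0.0.6:8080"]})
picks = [outbound.pick("b") for _ in range(3)]

# Authorization moves from the central firewall to the destination's side.
inbound = InboundProxy("b", allowed_callers={"a"})
a_allowed = inbound.accept("a")
c_allowed = inbound.accept("c")
```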
This is done by asserting that in the middle we’re going to speak Mutual TLS. So we force an upgrade to Mutual TLS, and this gives us 2 nice properties: 1. We have a strong notion of who’s on both sides. We know this is A, and this is B. 2. It means that we’re encrypting all the data over the wire.
We get that for free with TLS. I think the big shift over here at the connectivity layer is really that we’re moving up from thinking about layer 3 and layer 4, about IPs, to thinking about services. Service A wants to talk to Service B or wants to talk to the database. But we also need to be much more reactive to the application. It can’t take days or weeks or months for network automation to kick in.
The way we solve that is by treating something like a registry as a central automation point. So when applications get deployed, they can publish to it and we can consume from that registry and do things like network automation.
The final layer of this is: What’s the developer experience at runtime? Here there’s a huge amount of diversity depending on the problem we’re solving. We might be using Spark or a Hadoop data platform for some of our big data. We might use Kubernetes for our long-running microservices. We might use our Nomad scheduler. So there’s a variety of tools here, and I think what you find in the runtime layer is: Pick the right tool for the right job. If I have a Spark-shaped problem, that’s what I should use.
For developers, the challenges are: How do we learn these new platforms, the state of the art, whether it’s Spark or Kubernetes or Nomad? But I think ultimately their focus on writing and delivering the applications is the same. What we’re trying to get to is, How do we expose these functions in a more self-service way to the developers? This becomes the core challenge of this multi-cloud journey: How do we move from an ITIL-driven world, largely on-prem, to a self-service world that operates across these environments? I think there are some key changes for all of them: IT operators, security teams, networking teams, and development teams.
I hope you found this video useful. What I’d recommend is to check out hashicorp.com, and particularly our cloud operating model white paper, which covers these 4 layers and why they’re in transition as we go through the multi-cloud adoption journey.
If any of the sub-problems sounded relevant or interesting, please feel free to reach out and engage with us on how we might be able to help.