Watch Armon Dadgar explain platform teams in this whiteboard session, and learn how they fit into a wider maturity model of cloud adoption and IT process modernization.
Hey, everybody. Super excited we're going to spend a little bit of time today talking about platform teams. This is a topic that's been coming up quite a bit as I talk to people.
As people are maturing their cloud programs, as they're thinking about how to really do cloud at scale, this idea of platform teams, SRE (site reliability engineering) teams, or cloud teams keeps coming up. So I wanted to share a little bit about our view, and how we think about the role of a platform team. What's the problem they're solving for, and what's the kind of scope of those teams?
Taking a quick step back, I think it's helpful to start with the question: “How do we think about the evolution of how people adopt cloud?” What we often see is this phase one approach to cloud, where you have multiple application teams and all of them are starting their journey to building cloud applications.
Typically, an organization starts by signing a contract for cloud, maybe with one or multiple vendors, and then lets its application developers run free. You might have application teams one, two, and three, all of them building cloud applications. And maybe I have multiple different providers: I'm working with Amazon, Azure, and Google, so I have multiple potential cloud providers.
In phase one, it's a very, I'll call it ad hoc or random approach. Meaning app team one, they might be all in on Amazon. App team two, they're picking and choosing. App team three, they're building something themselves. So it's this chaotic approach where each team is picking their own tooling, approaching it their own way, building their own pipelines, building their own process for how they're adopting cloud. And we see this very, very often. We see it so often that we've called this Phase One or Cloud 1.0.
I think the challenge in this model, almost inevitably, is that 12, 18, 24 months in, what we end up seeing is everything you'd expect:
You have cost overruns because every team's kind of doing it their own way. Nobody's paying attention to spinning down dev instances. Things are oversized.
You have security vulnerabilities all over the place because the application teams are really focused on their applications. They're not really thinking about some of the Day 2 concerns around, "Did I actually define my security groups correctly? Did I set the S3 bucket to private only?"
Then you have all sorts of compliance challenges. As a security and compliance organization, there's no easy way to partner to solve those problems because every team is solving it their own way.
So you have a hundred app teams doing it a hundred different ways, and you have one security team, one compliance team trying to go herd the cats, if you will.
So I think quickly people start to realize, okay, this is not a super scalable approach to cloud because we haven't really standardized. We haven't really industrialized the process.
This gets you into phase two, where you still have all the different application teams that are doing their thing. But now we say application team one, application team two, and application team three should not all have their own unique approach to doing cloud adoption. Instead, we should create a central platform team. Different organizations might call this a different thing. We call it a platform team. Some people call this the cloud team. It might be the DevOps team or the SRE team.
I think it comes by different names, Cloud Center of Excellence in some organizations. Conceptually, what's important is the notion that I have many different application teams (they're the customer) and one platform team (they're the provider internally), and their job is to standardize the approach to cloud. They will be the central aperture to say, "This is how we build and deliver an application on any of these cloud environments."
Now I have one team that's responsible for defining, "How do I do a landing zone? How do I build the patterns for these applications? How do I make sure the right security and compliance and cost guardrails are in place?" Which, organizationally, is important because now you have an interface for those other teams. So if I'm the security team or I'm the compliance team, I have a group to go work with. And that group is then standardizing across many different application teams, how this actually works.
I think from an organizational design perspective, you can kind of see the difference here in terms of the approach. It's about creating that central group, regardless of what they're called, making them responsible for that, and then enabling these customers. But then the question is, what's the scope of that team? What should they actually deliver?
This then gets into a little bit of a philosophy question: "How much do you want to standardize? What should be the scope of those teams?" The way I look at it is from the perspective of an application: what are its critical, non-functional requirements? For every single app you're building (if we start by thinking about the pre-production pipeline), there's a common set of things.
Every single application is going to need a version control system. So you might say, "GitHub is my VCS of choice."
I'm going to want some CI (continuous integration) system. I'm doing continuous integration and testing, and that's where I'm doing my app build. Do I want to have 10 different solutions, or do I standardize on, say, CircleCI or Jenkins?
Then you think about things like artifact management. I'm going to build my Java application, or I'm going to build my Docker container. Do I have 20 different registries, or is there a consistent Artifactory deployment, for example?
Then you get into things like static code analysis, and this goes back to things like security and compliance. I might say, "All my apps have to go through a certain amount of static code analysis to look for different vulnerabilities or license issues."
Every single one of these applications that I'm building, they have all these same requirements. It doesn't matter what kind of app it is or what the problem is, these are all consistent requirements.
When we think about the production side of it, you similarly have a consistent set of things. How do I think about provisioning my application? I need to define the infrastructure it runs on. How much capacity do I need? Do I need to version or upgrade resources? I want a consistent way to define and manage the provisioning of them.
Then the next layer up is how do I think about the security of those applications? So great. How does the app get usernames and passwords for databases? How does it get certificates for TLS traffic? How do I get encryption keys to secure my data? How do I do API keys for cloud services, or for Twilio, or for sending email? So you have a secrets management, key management, and certificate management set of challenges that every app has. Almost every app is going to need a database credential, an API token, or something similar.
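To make that secrets layer concrete, a tool like Vault scopes access with policies. Here's a minimal sketch of a Vault policy, assuming a hypothetical app called `app-one` that stores its database credential at a hypothetical KV v2 path:

```hcl
# Hypothetical policy: lets app-one read only its own database credential.
# The "secret/data/" prefix is how Vault's KV v2 engine exposes data reads.
path "secret/data/app-one/db" {
  capabilities = ["read"]
}
```

The platform team would attach a policy like this to an auth role for the app, so each team can fetch its own credentials without ever seeing anyone else's.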
Then you have a set of networking or connectivity challenges. When my application gets deployed, how do I update the load balancer, the firewall, the API gateway to make sure that the application actually gets traffic? If application A needs to talk to application B, do I need to get firewall changes automated, or do I have something like a service mesh that's going to enable app A and app B to communicate? There's some networking challenge for almost any app you're going to deploy unless it never talks to any other service.
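With a service mesh like Consul, that app-A-to-app-B rule can be expressed declaratively instead of as a firewall change ticket. A minimal sketch of a service-intentions config entry, using hypothetical service names `app-a` and `app-b`:

```hcl
# Hypothetical Consul config entry: allow service app-a to call service app-b
Kind = "service-intentions"
Name = "app-b"
Sources = [
  {
    Name   = "app-a"
    Action = "allow"
  }
]
```

The point is that connectivity becomes a piece of versioned configuration the platform team can review and automate, rather than a manual network change.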
Then you have a runtime. Where's the application actually running? It might be on Kubernetes, it might be a [AWS] Lambda function, it might be on top of [HashiCorp] Nomad. It doesn't really matter. The app has to run somewhere.
And lastly — it spans all of these layers — an observability challenge, meaning my apps are going to emit logging data. They're going to emit telemetry, they're going to emit tracing. How do I observe all that to understand if the app is healthy? How do I debug if there are problems?
Ultimately the applications then sit and run on top of this whole stack. For almost any app, they have all of these problems. That's a production grade app. Maybe you can say, "I'm skipping observability, or I'm not doing CI, I'm not testing the app." You could skip those things, but I think functionally, any mature application's going to need all of these pieces.
So then the question becomes, "What's in scope for these platform teams?" Ultimately, our view is: all of this, because the goal is to deliver consistency to these application teams. I don't want app 1 through 100 to do it in a different way. So if every app has these problems, I don't want to have 100 different CI solutions and 100 different pipelines that I have to manage.
The second side of that is the leverage piece: how am I delivering leverage to all these groups? As an app team, if I can onboard and get this whole thing delivered as a service, I can focus on what I actually care about, which is my application. Most of these app teams functionally are not here because they care about the provisioning details or the runtime or the observability. Those are details to support the application as opposed to the end goal.
Now that said, this is a very large scope. So as you think about the sequencing of the platform teams, where do you start? It's very hard to go from zero to providing all of this as a shared service. What are the important checkpoints along the way? And how do you deliver value incrementally as a platform team, rather than saying, "We're going to disappear into a cave for two years to go build this"?
I think the first piece of it starts with the pre-production pipeline, because this is the obvious starting point in the lifecycle of any application. I'm building a net-new app; it needs the pre-production pipeline.
As a platform team, can I standardize and provide GitHub and Jenkins as a service, and run Artifactory centrally for the shared artifacts? I'm delivering that as a shared service that anyone can come in and build on top of. That covers at least pre-production.
I think the next step after that, I call this the L shape, is to add provisioning into that. If we're standardizing on something like Terraform, for example, I can actually build a full infrastructure as code pipeline. I can commit my configurations to GitHub. I'm validating in my CI, and then I can apply my change through Terraform Cloud, for example.
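As a rough sketch of what that looks like on the Terraform side (the organization and workspace names here are hypothetical), a configuration committed to GitHub might point at a Terraform Cloud workspace so that applies run centrally:

```hcl
terraform {
  # Runs and state live in Terraform Cloud, so the platform team
  # has one place for visibility, governance, and collaboration.
  cloud {
    organization = "example-org"    # hypothetical org name

    workspaces {
      name = "app-one-production"   # hypothetical workspace name
    }
  }
}
```

In that flow, CI validates the change (for example with `terraform fmt -check` and `terraform validate`), and the apply itself happens through the shared workspace rather than on a developer's laptop.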
And now I have that visibility as a platform team of all the infrastructure that's being provisioned. I can manage it in a central place. I can put the right collaboration and governance protection around that.
But importantly, because I'm doing it at this base level, getting the infrastructure as code done with Terraform, I can effectively support these other layers. I can now create a Terraform template for, let's say, my Java-based app running on Amazon ECS. I can create a module for that, and that module is standardizing things like: how is that thing running? When the app gets deployed, is it connected to Datadog? Is the observability baked in? Am I connecting to [HashiCorp] Vault to do my secrets management?
The advantage of this base layer of infrastructure as code is it's very flexible. Anything that I can express as infrastructure as code, I can begin standardizing it. I can have Terraform modules that provide a blueprint that enable my platform team to start scaling and have a consistent pattern. So if 10 different app teams are all doing a Java-based app on ECS, I have one module that defines how that works, and I'm defining and managing it in a consistent way.
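From the app team's side, that blueprint might look like a single small module call, with all the ECS, Datadog, and Vault wiring hidden inside the platform team's module. Every name and input below is hypothetical; the real module interface is whatever the platform team designs:

```hcl
# Hypothetical: an app team consumes the platform team's standard
# "Java on ECS" module from a private module registry.
module "orders_service" {
  source  = "app.terraform.io/example-org/java-ecs-service/aws"
  version = "~> 1.0"

  name            = "orders"
  container_image = "registry.example.com/orders:1.4.2"
  cpu             = 256
  memory          = 512

  # Guardrails the platform team bakes in: observability and secrets
  enable_datadog = true
  vault_role     = "orders-service"
}
```

The design point is that the app team supplies only app-specific inputs, while the module pins down the pattern once for all ten teams.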
That then allows me to start marching up the stack to increasingly provide these other services as a managed service of the platform team. So you can imagine one world where you say, "Hey, every one of my app teams is running their own Vault cluster, their own secret management solution." That's not super high leverage. Every one of these teams is managing it in a different way and dealing with upgrades and versions and backups. As a platform team, can I offer a shared Vault service, or provide a shared Datadog service where I have one contract with Datadog, one account, I'm managing that, and any one of these application teams can be a user/customer of it? And so on and so forth.
Ultimately, our view is you want to march up to then this full stack being provided by the platform teams. If we decompose this, there's a layer cake of platform teams. The base layer of my cake is providing an infrastructure as code baseline that I'm standardizing on and going to provide as a shared service. As a developer team, you can come to me and provide a set of Terraform configurations. I've standardized a pipeline that is ultimately very flexible and it supports pretty much anything that I can specify in infrastructure as code with Terraform. But now I have visibility and I can do it in a consistent way. I can get modules, I can get re-usability. I have that central aperture for how I'm doing all of these things.
Then over time, as I add more and more of these capabilities, what I'm really stepping up to is a platform-as-a-service (PaaS) abstraction. If you think about what all of these pieces amount to, it's a full-blown platform as a service, where now I'm telling my developers, "What you're really giving me is the source code of your application. Your source code lives in version control, and as a platform team, we can automatically build it, generate the artifact, and deploy it onto ECS or Kubernetes. The monitoring is there; the security and networking are all built in. It's a full platform."
The advantage is as you move up to that platform layer, that's how you develop and deliver leverage to the broader organization. Now all of my application teams aren't bogged down in the details of the infrastructure. They focus on what they actually care about, which is their application. And then, if that platform layer is too limited for them and it doesn't solve their problem, there are escape hatches.
I have a full blown IaC (infrastructure as code) pipeline that lets me manage anything in a very flexible way. This way I get the flexibility I want, I get the leverage I want, and I have the consistency of having a platform team that's delivering it across all these application teams rather than an ad hoc approach like in phase one.
Now, the final piece of this: as we get mature at doing this, we're delivering a consistent platform and a consistent set of pipelines across all of these application teams in cloud. Then we get to phase three, and I think the extension from phase two to phase three is we look at this and say, "Well, what's so different about the private datacenter? Why can't I take this exact same approach and extend it? My private datacenter looks the same as a cloud; it's still API driven. It has all of these same fundamental problems."
If I can just extend what I'm doing in cloud back to my datacenter, great. I can apply Terraform, Vault, Consul. I can deploy apps on top of VMware or OpenShift. I can extend Datadog to operate on-premises or use a different solution. Maybe it's Prometheus, maybe it's AppDynamics. But fundamentally, I can apply this exact same picture to my private datacenter. So this platform team then becomes a consistent way of delivering infrastructure across everything, not just cloud, but private datacenter and multi-cloud.
This is the maturity curve we see people go through. Most often people start in phase one. It's this chaotic approach to cloud, very ad hoc, every team kind of doing whatever they want. Very quickly people realize that it's going to be hard to control cost, security, compliance in a sensible way.
So that gets you to phase two, where the platform team is there to standardize the approach and industrialize it at scale. Their scope might start more narrowly on pre-production, grow to an infrastructure as code pipeline with something like Terraform, and then get to the point where you're delivering a platform-as-a-service capability, which is this broader picture. You need all of these pieces for your apps, and that completes the layer cake.
Infrastructure as code standardization, platform standardization, and once you have that, then you can get into phase three, which is really expansion into the private datacenter, and that gives you a consistent approach to doing all of this.
Hopefully that was helpful. That just shares a little bit of our view of how platform teams have evolved in the role they play as we industrialize our approach to cloud.