Unlocking Cloud with GitOps with Terraform and Sentinel
Jul 26, 2020
Watch Lachlan White demo Terraform Cloud & Enterprise and Sentinel policy as code in a GitOps workflow.
- Lachlan White, DevOps Architect, AGL Energy
Many organizations are not pure cloud-native or still totally on-prem ITIL. They're often somewhere in between. Lachlan White created a visualization for what this looks like. Typically it means power struggles over who has authority for what (often stemming from the cross-cutting "DevOps engineer" role) and disorganized communication from shifts in roles and responsibilities. Armon Dadgar illustrates this "contain and drain" migration well in his whiteboard presentation: Balancing Centralized & Federated IT in a DevOps Transformation.
What if there was a better way...
Lachlan brought his vision of an operating model to life. We don't want developers to have to talk to the person in the service desk, and then the manager, for approval. That is all wasted time. Instead, he's used a GitOps automated workflow hundreds of times, turning this once-manual process into an automated one, moving the software lifecycle along as quickly as possible and getting the benefits of economies of scale. His vision included a focus on:
GitOps: A deployment and infrastructure as code (IaC) process that treats Git version control as the central source of truth.
InnerSource: Running internal development and operations the same way you would run an open source project and community.
Re-usable, collaborative IaC: Using Terraform modules to scale expertise and keep control of reusable components in the hands of specialized teams.
Policy as code: Using the Sentinel policy as code framework in HashiCorp products to automate compliance checks, allowing for quick feedback rather than weeks-long ticket-based reviews.
Watch this VirtualDays APAC session to understand the details of AGL Energy's migration to a cloud operating model and see how the company used HashiCorp's Terraform Enterprise to achieve its goals and a host of benefits.
For another demo and presentation on this Terraform Cloud & Enterprise use case in Lachlan's previous position, watch Unlocking Cloud with GitOps with Terraform and Sentinel at AGL Energy.
Welcome, everyone, to HashiConf Digital 2020, "Unlocking Cloud with GitOps." I'm Lachlan White, and I'm currently working as a principal technologist at Lab3 in Melbourne, Australia. I'm lucky enough to be a HashiCorp ambassador, 1 of 3 in Australia, in the open-source community, and looking forward to seeing what we can do for the first year of that program.
My experience with Terraform and my GitOps journey is that I was lucky enough to be part of the first production implementation of Terraform Enterprise in the Australian Securities Exchange's 50 biggest companies.
Getting to GitOps
How do we get to GitOps? The questions I see more often than not are, "Why do we need all these tools and technologies?" and "Why is cloud so hard?"
There's a Dilbert comic that hits close to home for me. The manager says, "I need to know why moving our app to the cloud didn't fix all our problems?" Dilbert says, "You wouldn't let me re-architect the app to be cloud-native."
The boss, trying to sound like he knows what's going on, responds, "Just put it in containers." Dilbert says, "You can't solve a problem just by saying techy things." And the boss responds, "Kubernetes." That hurts.
There's this preconceived notion that going to these technologies solves all of our problems, but the main 3 challenges we face around cloud really revolve around traditional IT versus cloud-native IT.
A big challenge is the change of culture, the transition from those models, and how we progress past those. And there's a triangle of discussion that's constant in cloud adoption, which is the "speed versus risk versus cost" discussion. We'll cover a bit of that in a responsibility model that I've drawn up.
But how do we solve these problems? As you may have guessed, GitOps is how I answer that question. I have my own definition of "GitOps." Weaveworks coined the term and is credited with coming up with a way to manage Kubernetes clusters and application deployments to those clusters.
But based on the maturity we now have around cloud and some of the HashiCorp stack, we can now move past that and say that GitOps is using Git as the source of truth for the management and the deployment of our cloud-native technologies. That is really exciting.
But before we dive into some of the Terraform aspects of it, what is GitOps? You know, it's another "ops" word. And it really comes from having come from a traditional IT background where we have ITIL as our main system of operating, which is really people managing systems.
It's a very spoken way of operating, and it's very risk-averse, for a good reason. We have a central datacenter. We don't want things to go wrong there. It's very expensive when things go wrong. And we're trying to move to this cloud-native world where we have GitOps: people managing code, and code managing systems. This new way of working can be really scary.
But we don't need to necessarily throw away all of those ITIL practices. We need to put them into code so that we get the best of both worlds. And what we start to see as we go through this kind of transition into this GitOps model is that we move away from the traditional, heavy, monolithic-based application and shared service, and we start to move toward the loosely coupled modules.
That's something that Terraform can really help us with.
And that ends up changing our attitude from a cost-driven enterprise, in terms of traditional IT structures, toward something that's more inexpensive to change, open to innovation, and we've got scalability on our side in this cloud environment.
"InnerSource" isn't a new term. Tim O'Reilly coined it in 2000. It's defined as using open-source software development practices, which hopefully we're all pretty aware of, and the establishment of those practices within organizations, specifically the culture within organizations.
Armon Dadgar, who had an amazing keynote the other day, has a really great article and video on the HashiCorp resources page, which everyone should definitely go and watch. He tries to break down decentralizing the trunk of enterprise technology, which is super important.
Essentially, what we want to do is to feed off the middle branch. We want to enable teams to grow from a shared baseline, rather than trying to plant a forest, to stick with the analogy. And we want to do that through community-driven, collaborative infrastructure as code, to start off with.
And we need to get everyone involved. We want pull requests from security, we want pull requests from infrastructure, and we want pull requests from people managing cost reports, if we can get them to do that.
Once we start on that journey, we start to see huge benefits in an organization with reusability, flexibility, and even a faster time to market for the delivery of some of our applications.
A Responsibility Model
I drew up this visualization when I couldn't find a really clear way on the internet of illustrating the change from a traditional ITIL-driven datacenter approach to a cloud-native capability, and the culture change that enterprises go through.
On the left side, we can see your traditional IT. Applications looked after by app teams, middleware looked after by the middleware team, infrastructure by infrastructure.
It's very clear what everyone does, and everyone has a color that they align to.
If we look to build an application in that traditional IT system, what we end up doing is asking networks for a network range. We ask security for clearance. We ask the data team for access to the store. We end up going through huge amounts of steps just to get to the deployment.
That can take weeks, maybe months. And that's traditionally done in an on-premises environment.
What we've started to see recently is the transition to hybrid IT. We start to see this emergence of DevOps teams, and the infrastructure team starts to bleed into some of the other areas as well. We've got the data team conflicting with the app team because our app and data are tied together in cloud.
It's a real battle for control in terms of how we're operating. We start to see the colors blur. When we're grabbing for those territories, what that encourages is a deep coupling of responsibility, which isn't what we want.
We want everyone to be open and sharing and following those InnerSource models in terms of our GitOps approach.
The challenge is that when people move to this hybrid model, although we see a decrease in time because of the ability to provision in cloud, it's an illusion that normally comes at the cost of risk and security posture of an organization, which can be extremely costly in the long run.
What we want to get to is the cloud-native IT world. Get everyone working in the same way, through the same tool sets.
For me, that really talks to the workflows and how we get everyone operating on the same page.
That's how I define "GitOps" and what we have.
Terraform to the Rescue
How does this get solved? Terraform has a few offerings: Terraform Enterprise, Terraform Open Source, and Terraform Cloud. I'm going to jump between Terraform and Terraform Enterprise in this talk, but essentially Terraform is an infrastructure-as-code language: driven by open source, declarative, with state management. It has over 200 providers now, which is crazy. The Domino's pizza provider has already gotten a callout, with its Sentinel policies for no pineapple.
But why do we use Terraform to solve some of these problems? Well, it's modular and reusable, which is an extremely beneficial thing in an organization.
We want to scale, define once, and use many. It's a very powerful thing that enables us to speed through our development and our deployment activities surrounding infrastructure as code, and even using some of those other providers to extend past what we would call traditional infrastructure.
It's extremely human-readable. It's amazing to see security engineers sit down and break down Terraform and understand what's going on compared to some of the other services that are available to do similar things.
It's extendable. We've touched on how many providers there are, how many other things we can use with it. For me, that's the biggest selling point. To be able to go and provision a full CI/CD pipeline, as well as my cloud deployments, with the same language is extremely powerful.
The Power of Terraform for GitOps
How does Terraform Enterprise fit into that picture? The collaborative infrastructure as code at scale is really where that comes in. The ability to start to use workspaces, get RBAC around those things, and have people collaborate, not just on local machines, is amazing.
Terraform Cloud does a really good job of introducing those as well.
Compliance: Sentinel gets introduced here. A policy-as-code framework, where we can start to apply policy as code to our workspace and our deployments and even have things like budgeting in there, is amazing.
In Terraform Enterprise we see the emergence of some of the security features of the platform: SSL enabled on the web GUI, some audit logs, and things like that. Also, scale. We see some amazing control around the Terraform worker node and how many runs we can do in a collaborative way, in terms of how many parallel workloads we can run.
If you're running at a huge scale, that is extremely important.
So we touched back on the ways of working, and Terraform really unlocks these workflows. For me, it's really about the abstractions. How do we work with getting those teams that are stuck in this trend within this model of trying to deploy their application? We need them to operate on the idea rather than individual events.
For a GitHub repository, as an example, we want someone to deploy a repository. We don't want them to have to talk to the person in the service desk, and then the manager, for approval. That all is delay. We've done this hundreds of times. We want to make this as quick as possible and get the benefit of that, which is the economy of scale.
We want to make sure we're reaping the cost benefits by doing these processes over and over again. And by defining these as code, hopefully we can start to get a 1-to-many benefit to using these modules through Terraforming and how we go about it.
These 3 points, for me, are how GitOps starts to work unlocking the value of cloud. We start to see people contribute centrally to the benefit of each other in the InnerSource model.
Responsibility Meets Policy
What you have to be careful about, though, is you have to think about how you're going to enable people to do certain things. We can't simply swing open the doors.
Sentinel's a really good way to manage policy around that, which we will touch on in a little bit, but what we look to do is to map the roles of that responsibility model across some of the resources we might have.
In Azure, for example, the main example here is a subnet. In this slide, we have 3 teams that care about subnet resources. We have security, who care whether that's public-facing. We have networks, who care whether there's an overlapping range with another subnet or network address space. And the application team care about just trying to get their app to talk to what it needs to and configure that.
So when we move to cloud and we have this flexibility around resources and roles that people are trying to do, how do we look to solve those problems? Trying to define that is quite complex.
Terraform can help us do that through the creation of modules and getting everyone's input into them. Again, it comes back to the workflows, not the technology piece of the puzzle.
In this example, we'll talk about containers, because everyone loves containers. This is the standard set of resources we would need to deploy an application in Azure, running on Azure Kubernetes Service. This isn't everything, but it's a nice list to go through what we need.
At the top, we can see we've got all of the different teams and resources that are needed to configure all of this, and the IP (intellectual property) that those teams have collected over their experience. What we can start to do with Terraform is look at breaking all of these into specific chunks and reusing those as much as possible.
For example, on the left, we need to start putting our code somewhere. We've got a DevOps team that knows what a GitHub repo looks like for our organization.
Then let's work with maybe a cloud operations team, if we have one of those, to look at the default management group and subscription constructs that we need.
The list keeps going. But as you can see, if we had to define this for a developer every time, there's a lot of reinvention that happens. And that costs us time and effort and, from the business' point of view, money.
What we want to try to do is make sure we're reusing as many pieces as we can to gain Terraform's main benefit in terms of reusability and modularity. We want to reuse those things to enable our end user to say, "Let me grab all of those things and reap all of your hard work and get as far as I can."
As an example of this, when we look at a GitHub repository as code, what does that look like? As a new developer, I need to get a repository to store my business logic. So down at the bottom, you can see, we've got a module that's asking for a new repository for our application.
Within 4-5 lines of code, we're able to pull that module, in this case from our Terraform Enterprise private registry, and then we're able to take the construct of that module, which in this case has a default templated module with branch protection, default security teams added, so we can scan against that code after it's built.
But the developer doesn't get exposed to any of this. He's able to just reap the benefit of how many times we've done that before. We're able to move much faster than previously.
When we look at the classical infrastructure as code, as a dev or an engineer I want to deploy the infrastructure I need. How do we go through that?
Again, we want to reuse a module that in this case is an application with an app service plan, VNet integration, logging enabled, all of those kinds of things, so that we can look to enable the developer as fast as possible.
I have a whole other talk on Terraform at scale, which talks a lot around some of those conversations that happened at HashiDays Sydney 2017. It's a bit of a throwback, around Terraliths and Terramods and Terra services, that I encourage you to check out. Some of the resources in there are really amazing.
A lot of people seem to steer away from networks as code. Especially in the DevOps world, people seem to not want to touch networks as much. But I think there's a real advantage to it, and Terraform can play a massive part in going through this.
In this example, a previous module that I've worked on, we had an IPAM (IP address management) service integrated into our module, which would just call an API to get the next available subnet range that was available to us through the IPAM.
That meant the network engineers didn't have to check their spreadsheet of free IP ranges and report back to us. Instead, we're able to empower the developers to raise pull requests against that, and just have security and networks potentially approve them.
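The lookup described above, asking an IP address management service for the next free range, can be illustrated with a small Python sketch. This is not the module from the talk, and it does not call any real IPAM API. The function name, the address pool, and the list of allocated ranges are all hypothetical; the sketch only shows the idea of picking the first range that overlaps nothing already handed out.

```python
import ipaddress
from typing import Sequence


def next_available_subnet(supernet: str, allocated: Sequence[str], new_prefix: int) -> str:
    """Return the first /new_prefix block in `supernet` that overlaps no allocated range.

    Hypothetical stand-in for the IPAM lookup; a real IPAM would also
    record the reservation so two callers never receive the same range.
    """
    pool = ipaddress.ip_network(supernet)
    taken = [ipaddress.ip_network(cidr) for cidr in allocated]
    # Walk the pool in address order, one candidate block at a time.
    for candidate in pool.subnets(new_prefix=new_prefix):
        if not any(candidate.overlaps(used) for used in taken):
            return str(candidate)
    raise LookupError(f"no free /{new_prefix} left in {supernet}")


# Example values are made up for illustration.
print(next_available_subnet("10.10.0.0/16", ["10.10.0.0/24", "10.10.1.0/25"], 24))
# prints 10.10.2.0/24
```

The overlap check is the same question the networks team cares about in the responsibility model: does the new range collide with an existing address space?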
But there are more smarts in here than just getting ranges. We're able to say, "What subnets do you want within that range?"
Or say you want web, app, and database subnets. Let's take advantage of some of the built-in Terraform functionality, like the cidrsubnet function for breaking down a range within our module, to automatically create those for you and loop through the subnet names that you've given us, so that you, as a developer, don't need to have a huge background in networking.
That can really accelerate how we go from having absolutely nothing to getting through to the end of it.
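To make the subnet breakdown concrete, here is a Python sketch that mimics Terraform's built-in cidrsubnet(prefix, newbits, netnum) function, which is probably the breakdown function meant above, and then loops over a list of subnet names. The helper subnets_by_name, the address range, and the subnet names are assumptions for illustration; the module from the talk is not shown in the source.

```python
import ipaddress
from typing import Dict, Sequence


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Python approximation of Terraform's cidrsubnet(prefix, newbits, netnum)."""
    base = ipaddress.ip_network(prefix)
    new_prefix = base.prefixlen + newbits
    if new_prefix > base.max_prefixlen:
        raise ValueError("not enough address bits left for newbits")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    # Each child network spans this many addresses.
    step = 2 ** (base.max_prefixlen - new_prefix)
    start = base.network_address + netnum * step
    return str(ipaddress.ip_network(f"{start}/{new_prefix}"))


def subnets_by_name(range_cidr: str, names: Sequence[str], newbits: int) -> Dict[str, str]:
    """Hypothetical helper: give each requested subnet name the next child network."""
    return {name: cidrsubnet(range_cidr, newbits, i) for i, name in enumerate(names)}


print(subnets_by_name("10.0.0.0/16", ["web", "app", "database"], 8))
# prints {'web': '10.0.0.0/24', 'app': '10.0.1.0/24', 'database': '10.0.2.0/24'}
```

The developer only supplies names; the arithmetic that would otherwise need a networking background lives in one place.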
What do these benefits look like? The problem statement that we had at the start was that we've got all these things that we expect developers to do in the current world. We've got infrastructure, security, risk and compliance, networking, but we've also got operational requirements and cost management concerns.
What I see happening over and over again is we start to push all these pressures down onto the developer. And what they're trying to do is just develop and deploy their application. Hopefully they're getting involved in some of our DevOps processes as well, but mainly we want them to really contribute to their business logic.
So how does something like Terraform Enterprise solve that? When we look at something like this, we've got, all of a sudden, with all of the modules that we've created and defined, different roles and different people contributing to the area that they have the IP in.
All of a sudden we have the DevOps guys contributing to infrastructure. We have security engineers potentially writing Azure policy, or maybe it's Sentinel policy, to fix our security needs. We have network operation engineers helping us with our connectivity issues, and potentially even risk people getting involved with that process.
And by pulling all of those modules into something like Terraform Enterprise, Terraform Cloud, or even baking that through Open Source, we're able to get these benefits shown on the left of the screen. We're able to get that reusability, cost visibility. We increase our governance and compliance because we're getting more people involved.
But the productivity that we get is much increased because of these activities. And this developer can now just focus on the development and the deployment of their application, because they're feeding off of the definitions that we've created as a collective through our InnerSource model and our GitOps approach.
That really does lead to faster delivery times, quicker innovation times, and even unlocking the ability to leave your datacenter and go to the cloud faster, which has been my experience.
Terraform's Other Features
There are some other features within Terraform Enterprise that are really important to touch on. As a matter of fact, Kyle Ruddy blew me away with his presentation earlier today.
Sentinel really is a policy-as-code framework. And it works with our Enterprise offerings for some of the products and Terraform Cloud as well, but it can be used for everything from CIS benchmarks to security controls, tagging, even identity management, if you really want to push it that far.
One of the great resources that Kyle talks about is the Terraform Foundational Policies Library. That's a GitHub repo and an amazing resource to get started quickly.
Just have a look at how we can use Sentinel in a best practice way. Tagging, especially in Azure, is critical. You're crazy if you don't tag things, and it's crucial for things such as cost code and cost management, even relating things back to CMDB IDs, if that's what we need to do for certain organizations.
But if we look at doing those natively within Azure, we end up defining or auditing or even denying those after the deployment has attempted to happen. For some things, if you deny, that might stop something scaling out.
If we have a virtual machine that needs to scale, but it's denied because it doesn't have tags, then the scale-out is going to fail, and that's a big problem. Terraform will stop those issues happening by having the detection happen before the deployment.
It also enabled us to continue thinking about that in a 1-to-many way. Originally, we were tempted to try to put all of these validation rules inside each module. We quickly realized that if we could use Sentinel to have a single point of management for all of the deployments that we do, it was all managed from one place.
It released us from the operational overhead that we were getting ourselves trapped into before. So it's a really powerful tool.
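The "detect before deployment, manage from one place" idea can be sketched in Python. This is not Sentinel code; Sentinel has its own policy language. The sketch assumes the plan has been exported to JSON (for example with `terraform show -json` on a saved plan file) and reads its `resource_changes` list. The required tag names are hypothetical, and the handling of resource types without a `tags` attribute is an assumption.

```python
REQUIRED_TAGS = frozenset({"cost_code", "owner"})  # hypothetical organization policy


def missing_tags(plan: dict, required=REQUIRED_TAGS) -> dict:
    """Map each planned resource address to the required tags it lacks."""
    violations = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        # Only resources being created or updated can introduce a violation.
        if "create" not in actions and "update" not in actions:
            continue
        after = change.get("after") or {}
        # Assumption: a resource type without a "tags" attribute is not taggable.
        if "tags" not in after:
            continue
        tags = after.get("tags") or {}
        missing = sorted(set(required) - set(tags))
        if missing:
            violations[rc["address"]] = missing
    return violations


# A tiny hand-written plan fragment, not real Terraform output.
example_plan = {
    "resource_changes": [
        {
            "address": "azurerm_virtual_machine.app",
            "change": {"actions": ["create"], "after": {"tags": {"owner": "team-a"}}},
        },
    ]
}
print(missing_tags(example_plan))
# prints {'azurerm_virtual_machine.app': ['cost_code']}
```

Because the check runs against the plan rather than inside each module, one rule covers every deployment, and an untagged resource is rejected before anything reaches the cloud provider.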
One of the examples that I've got, we used it to break down an access problem that we had. Root access issues have been around for a really long time, just called different things. This diagram applies to that as well.
What we want to do is enable developers to touch certain things, but not others. How do we manage that? Well, we could use Sentinel to whitelist Terraform modules, for example, that have very specific instructions on how we want to deploy those.
That can be a really powerful tool because in something like Azure, where we don't have the ability to use Active Directory forests and organizational units like we do in a traditional on-premises environment, we're able to separate the identity provider and the RBAC control from each other.
That's a really powerful concept that we can start to leverage in amazing ways to develop productivity but also increase our security posture.
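As a rough illustration of whitelisting modules, the Python sketch below inspects the `configuration` section of a JSON plan and flags any module call whose source is outside an approved registry namespace, plus any managed resource declared directly in the root module. Again, this is a stand-in for a Sentinel policy, not Sentinel itself. The registry hostname `tfe.example.com`, the organization name, and the rule that root-level resources are forbidden are all assumptions.

```python
ALLOWED_SOURCE_PREFIX = "tfe.example.com/example-org/"  # hypothetical private registry


def _root_module(plan: dict) -> dict:
    return plan.get("configuration", {}).get("root_module", {})


def disallowed_module_calls(plan: dict, allowed_prefix: str = ALLOWED_SOURCE_PREFIX) -> list:
    """Names of module calls whose source is not in the approved registry."""
    calls = _root_module(plan).get("module_calls", {})
    return sorted(
        name for name, call in calls.items()
        if not call.get("source", "").startswith(allowed_prefix)
    )


def root_level_resources(plan: dict) -> list:
    """Managed resources written directly in the root module, bypassing modules."""
    return sorted(
        res["address"] for res in _root_module(plan).get("resources", [])
        if res.get("mode") == "managed"
    )


def policy_passes(plan: dict) -> bool:
    # Pass only when everything is built from approved modules.
    return not disallowed_module_calls(plan) and not root_level_resources(plan)
```

With a rule like this, what a developer may deploy is governed by which modules are approved, independently of the cloud provider's own identity and RBAC structure.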
Sentinel is the glue that holds some of these things together. To be honest, some of the deployments I've done for Terraform Enterprise couldn't have been done without Sentinel and using that policy engine to get us over the line for some of the controls we had to implement.
To touch on the responsibility model again, hopefully through this, you can see that we've moved through traditional IT and hybrid IT to cloud-native IT.
Now that we've got all these modules, this intellectual property, we've got collaborative work, and we're using Terraform Enterprise and Terraform to reuse our IP, we can get everyone on the same page in the way of working and get everyone collaborating in a specific workflow.
That enables us to use more cloud-native technologies like functions as a service and containers and all the things we want to play with, but it also removes the barrier to entry for people to start having conversations about improvements. We're enabling innovation.
And it reduces the cost of changing anything, because everyone's on the same page. It's a powerful motivator to try to unlock the value of cloud.
Thanks very much for having me.