Advanced Terraform users working with large organizations hoping to make their infrastructure more manageable should learn about Xebia Lab's layering concept to separate bootstrapping, infra foundation, and services for greater scalability.
Armin Coralic, a solution architect at Xebia, shares some hard-learned lessons about Terraform from multiple customer consulting use cases. If you've already started using Terraform and are interested in taking it to the next level, this talk is for you.
Large organizations come with their own unique set of challenges. The focus of this talk is on learning why and how you should layer your infrastructure to improve workflows and separation of responsibilities, while removing barriers to scaling.
Thank you for having me here. My name is Armin. I work for Xebia as a consultant, and my job is basically helping other companies become better at what they do themselves. That can be by helping them with their IT vision, software architecture, creating code, teaching their teams or mentoring them, or doing infrastructure as code—what this talk is all about, of course, but enough about me.
I'm here to talk about Terraform, but before I do that, I wanna mention that all my examples in this presentation are AWS based. Everything I say today, though, is applicable to any other cloud provider or on-prem or anything like that, because it's about the ideas behind these experiences rather than the actual implementation of how you do these things.
Back to Terraform itself, from the user perspective, Terraform at its core is quite simple. If you look at the documentation online, you will probably be set up in a couple of hours and you will be creating your first resources and you will be very happy, but eventually you will realize that Terraform itself, although very awesome, is just a tool. It's about the mechanics and there's much more to it than only Terraform. We'll talk about that later in the presentation itself.
When you work with Terraform and infrastructure as code, you will go through a couple of stages. I want to go through them with you right now.
So the first stage is the basic one. It is where you create a couple of resources, you run your plan and apply locally, and you have a local state file. Everything is awesome, because it's small and very simple. But soon you realize, "I need to create many more resources." This is probably where Terraform modules will come into play. You will either start creating your own modules, which is completely fine and something you should definitely do, or start using existing ones. I wanna point out that there are some awesome verified and unverified modules being created by the awesome community (sitting, of course, here as well), which you should try and start using, especially because they can speed up your process while you start.
But in the beginning where you are learning Terraform, you might get scared off by these modules. Why? Because they do a lot of complex things in the background. So you might be tempted actually to start to do things from scratch. My tip to you would be: don't do that, go and use those modules and learn from the modules how Terraform actually works, because there's some awesome Terraform coding out there, there are awesome tips and tricks in there that you can use, instead of just going and doing everything and learning everything from scratch.
The next stage after you have created a couple of resources, you have created your own modules, you will realize that you want to speed up and improve your workflow. That's why you're probably gonna start using the remote state file. You figure out that you need some secrets that you need to decrypt and encrypt and feed into Terraform, or maybe you want to do some pre-checks, or you can enforce two-factor authentication for AWS, or anything like that—things that are not explicitly created inside Terraform.
You can do that by creating batch files, makefiles, golang applications, whatever you like. It depends on you, but you wanna try to figure out how you can improve your own workflow in here. But you will also realize that you wanna automate things, so you will think about—how can I make sure that my workflow is fast enough, but I can also do it by using Jenkins or GitLab or maybe even Terraform Enterprise, depending on what you're doing.
My tip to you is, you should definitely automate everything, but you should watch out and not automate everything in the beginning, because if you are in a development stage where you do a lot of changes on your infrastructure, if you make a lot of changes, things will break, and if things break, it's much easier to actually fix them, by using a local plan and apply instead of actually going and having a build server in there.
There's a discussion here on the HashiDays Slack about this. Maybe some of you can drop in and give your opinion on how to do these things, and we can learn from each other.
What I forgot to tell you, by the way, in the first slide, because I skipped over it: it's not only about the resources that you create, it's also about the environments that you need to create. In most cases, you can do one of two things. You can either decide to use Terraform workspaces, which is, of course, awesome, or you can use something like a blueprint module, which acts like an environment template that you can keep applying all over again.
So back to workflows. After you have done these three stages, you might be thinking, Oh, this is cool. I am really working, and everything is awesome and stuff like that. But this is only a first phase and it is also the phase where most of you and most companies are actually in, and it's basically the tooling phase. Everything is about creating a resource with a tool, and trying to configure something, but it's not the most important phase. It's actually the second phase that is much more interesting. It's also what this talk is all about—the second phase.
In the first phase, you have conferences, YouTube movies, books and stuff like that, so you don't need me to tell you how to do these things because you probably also like to work with Terraform, so you can figure these things out yourself. So what's the next phase?
The first stage in this phase, also the fourth stage, if you follow this diagram, is the teams.
Let's make an example, then. Let's assume I'm in a modern environment and I have three teams, which do DevOps. In this case, I've drawn them as dev and ops, and no, I'm not saying those should be different roles. I'm just visualizing it this way so it's much easier to talk about. But what about them? What's wrong with having these teams and having infrastructure as code with Terraform? It's that you need to decide how to separate these teams from each other at the infrastructure level. So you can think about saying, "Okay, I'll just create one AWS account and I'll put all the teams in there and I'll have acceptance and production together."
But in my opinion, it's much better to maybe think about saying, "Okay, I will have an account for acceptance and production," or, even better, what I really like to do is have a domain account for a specific domain team, which has acceptance and production in it. Or it can even be scaled a little bit further depending on how complex it is. But this allows you to actually separate things between each other, making it much easier to work between team members and having separated responsibilities.
Depending on what you choose here and how you separate things, you will probably figure out that you need to start sharing some information on the infrastructural level between these teams. And that's also where remote state sharing comes into play.
Of course, Terraform has an awesome way to do that, basically by sharing your remote state, but there are some problems with that. The first one is that you don't actually know who uses your state file. And even if you figure that out, what you don't know is which resources they are actually using. Worst case, if you have cyclic dependencies between these teams, you have automated everything but you still can't build everything from scratch, because each team's infrastructure needs the other's to exist first. The other scary thing is that secrets end up in plain text inside the state, so giving somebody access to your full remote state file means they can potentially see all your secrets in Terraform. But, of course, there are fixes for this.
One of them is to specify a dedicated output file from a team to the organization, or from a team to other teams. Those outputs can actually become contracts between you and a third party by saying, "This is what I expose from my infrastructure, this is what you can get from me, and I'll make sure this doesn't break and that these things stay the same."
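A minimal sketch of what such an output contract could look like, assuming an S3 backend. The bucket name, state key, region, and output names are hypothetical and do not come from the talk, and the producing side assumes that `aws_vpc.main` and a counted `aws_subnet.private` are declared elsewhere in that configuration. Note that reading outputs through the `terraform_remote_state` data source still requires read access to the whole state object, so this narrows the contract but not the access.

```hcl
# --- Producing team: outputs.tf (all names are illustrative assumptions) ---
# Only values declared as root-level outputs are visible to consumers.
output "vpc_id" {
  description = "VPC that other teams may attach to"
  value       = aws_vpc.main.id # assumes aws_vpc.main exists in this configuration
}

output "private_subnet_ids" {
  description = "Private subnets exposed to other teams"
  value       = aws_subnet.private[*].id # assumes aws_subnet.private uses count
}

# --- Consuming team: main.tf ---
# Reads only the published outputs of the producing team's state.
data "terraform_remote_state" "platform" {
  backend = "s3"

  config = {
    bucket = "example-org-terraform-state"     # hypothetical bucket
    key    = "platform/prod/terraform.tfstate" # hypothetical key
    region = "eu-west-1"                       # hypothetical region
  }
}

resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = data.terraform_remote_state.platform.outputs.vpc_id
}
```

Because the consumer only refers to `outputs.vpc_id`, the producing team can refactor its internal resources freely as long as the published output names and meanings stay the same.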
The other way to do this is using data source lookups in Terraform, which is also fine, but you have to watch out, of course, for cyclic dependencies and stuff like that. The other thing is that with data sources, although you don't have to share any state, you still have to agree on naming conventions and stuff like that, so you can actually figure out where to find these resources. Another thing is that not everything can currently be found through data sources. A lot of things can, but Terraform is an open source product, so just contribute to it so we can all benefit, and then we'll finally have everything in there.
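A data source lookup along these lines might look like the sketch below. The tag keys and the `<domain>-<environment>` naming convention are assumptions made for illustration, and the `aws_subnets` data source assumes a reasonably recent AWS provider.

```hcl
# Find the VPC through an agreed naming convention instead of
# reading another team's state file.
data "aws_vpc" "shared" {
  tags = {
    Name = "domain-a-accept" # hypothetical convention: <domain>-<environment>
  }
}

# Find the subnets inside that VPC that carry an agreed tag.
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }

  tags = {
    Tier = "private" # hypothetical tag agreed between the teams
  }
}

output "private_subnet_ids" {
  value = data.aws_subnets.private.ids
}
```

No state is shared here, but the lookup only works as long as both teams keep to the same tag names, which is the naming-convention cost mentioned above.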
After looking into the teams, you might say, Yes, I am done, everything is cool. I've done the tooling, I have fixed the teams, everything is awesome. Let's party. But not all organizations are as simple as the one I just showed you in the beginning. Most of them are really complex. Maybe not as complex as this picture here, but they definitely have more than a couple of teams, they have different departments, they have HR, security officers, and a whole bunch of other different things, with all these people talking to each other, so it's a lot of complex stuff. That also means that the last and most important stage that you will encounter in your journey with infrastructure as code, and with Terraform, is the organizational stage.
This is the stage where you will find out that it's not about the code itself, it's about the fact that these organizations change a lot. They decide one thing today and they move to another the next day and they have two teams and tomorrow they have three teams and they go back to one and they have four and it goes all the way around and you have to manage that, with your infrastructure and, of course, with your Terraform code.
So, how can you fix these things? Before we talk about the actual fixes, let's look at what kind of challenges there are. The first one we found is: how do you separate the work and responsibilities between teams and disciplines inside Terraform code? Because you have different people with different responsibilities, how do you handle that inside one Terraform codebase?
The second one is, how do you handle different lifecycles inside Terraform code? There are some resources in AWS, which will change a lot and other ones that should not change a lot, so do you even care about that? Does it matter? How do you fix those things? And the third one is, how do you not end up with a monolith, like in a software project?
We learned in the last ten years that having big monoliths is not the way to go, but putting everything in microservices, making everything small, is also maybe not the best idea. Somewhere in the middle is the way to go. And if we have learned that from software, why should it not be applicable to our infrastructure as code? It already says so in the name: software code, infrastructure code, it's all code, it doesn't matter what it actually does.
So how can we fix this? It's basically by layering our Terraform code. Don't worry, you don't have to read this. I will go through all these layers one by one. But before I do that, let's look at some of the organizations that we can use as an example to see what actually happens when you change these things.
For example, we have an awesome new organization. It does DevOps, so there's one or multiple teams, and they have full responsibility. Or we might have an older organization that has separated divisions or departments like platform, security, and domain teams. The domain team might even do DevOps, but they are not allowed to do everything, so they kinda have different responsibilities in here. So, having said that, what about these layers?
The bootstrap layer
The first layer is basically the bootstrap layer. This is the layer which prepares the team to actually start creating infrastructure as code. So this bootstrap layer should never ever depend on the actual cloud that you're working on, because otherwise you will get a chicken/egg problem, because you need something to store your state file or something like that, so you don't want to do that.
But what is actually in a bootstrap layer? That is basically the layer where you actually create your organization or your account, you create some default roles with IAM, you create an S3 bucket that will later be used as the remote state location, or you maybe even lock down some security parts, like enabling CloudTrail. So what does this bootstrap layer look like from my perspective?
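As a rough sketch of the kind of resources such a bootstrap layer might contain, assuming version 4 or later of the AWS provider: the bucket name, role name, and account ID below are placeholders, not values from the talk. CloudTrail is left out because it also needs a log bucket and a bucket policy, which would make the example much longer.

```hcl
# Bucket that the later layers will use as their remote state location.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "example-domain-a-terraform-state" # hypothetical name
}

# Keep old state versions so a bad apply can be recovered from.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# State can contain secrets, so block every form of public access.
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Trust policy for a default role that the domain team can assume.
data "aws_iam_policy_document" "assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::111111111111:root"] # placeholder account ID
    }
  }
}

resource "aws_iam_role" "domain_team" {
  name               = "domain-a-team" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.assume.json
}
```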
You don't have to copy this exactly, because, as I said before, it's about the idea behind it, but let me explain how we do this. In our case, we have one Git repository that all accounts are in, so you can see domainA and domainB on the screen. Inside the domain, we have a blueprint module called 'account,' which is just a module that combines all the other smaller modules so that I can easily create different accounts without copying the code all over. You might be thinking right now, 'Why are you not using Terraform workspaces in this case?' Because I believe that this part is so crucial that it should be as specific and explicit as possible and, in my case, Terraform workspaces are not the right fit here, so I think it's better to use modules themselves.
As I was saying, we have put everything in here, in one Git repository. We also are not using any remote state file, so we are using a local state file, which is then being encrypted, and then pushed to Git before it actually goes online. So we don't have any dependencies to our cloud provider in this case.
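A root configuration for one domain in such a repository could be sketched as follows. The file path, the module location, and the module's input variables are assumptions; the talk only names the 'account' blueprint module and the domainA and domainB folders. Encrypting the state file before committing it happens outside Terraform and is not shown.

```hcl
# bootstrap/domainA/main.tf (hypothetical path)
terraform {
  # Local state: the bootstrap layer does not depend on the cloud
  # account it is about to create, which avoids the chicken/egg problem.
  backend "local" {
    path = "terraform.tfstate"
  }
}

# The blueprint module bundles the smaller modules, so each new
# account is one explicit module call instead of copied code.
module "account" {
  source = "../modules/account" # hypothetical location of the blueprint module

  # Hypothetical input variables of the blueprint module.
  account_name      = "domainA"
  state_bucket_name = "example-domain-a-terraform-state"
}
```

A `bootstrap/domainB/main.tf` would then contain the same module call with different values, which keeps each account explicit and easy to review.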
So, how does this apply to an organization, for example, the first one, the DevOps organization? You might be thinking, Let's put it in Team A. There's only one case where you should put it in a DevOps team: if your organization only has one DevOps team. If it has more than one, it should definitely not be part of a team. It should be outside of the teams.
As you can see, I put a question mark here. And the reason behind that is, I don't actually care who does it, as long as the team doesn't do it. And the reasoning behind that is because of the fact that inside the bootstrap layer, there might be things that are outside of the scope of the actual team, so they should not be allowed to actually do those changes or create those things. So somebody else outside of that team should be the one to do that. And in a small organization you can choose whoever you want, as long as it's not the team. Of course, in an older organization, it's probably going to be the security team who handles this, or maybe some other department, but it's definitely not gonna be the domain team.
The infrastructure foundation layer
The second layer in this stage is basically the foundation layer of the infrastructure, and depending on how you decided to structure your teams, having one or multiple accounts, you might have one or more layers here, as you can see with the global and infrastructure ones. So let's start with the global one.
In my example, I am using one AWS account which has one domain team in it with full responsibility, which has acceptance and production environment in it, but this account also has resources inside that are not environment-specific. For example, I might create CloudWatch log groups or I might create SNS alerting topics which are account-specific but not environment-specific, and I only want those things to be created once and not multiple times. So instead of trying to deduplicate everything from an environment, it makes much more sense to actually have a global layer where these things can actually be applied from.
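A global layer in this sense can stay very small. The sketch below shows the two kinds of account-wide resources mentioned above; the names and the retention period are illustrative assumptions.

```hcl
# Account-wide log group, created once rather than per environment.
resource "aws_cloudwatch_log_group" "application" {
  name              = "/domain-a/application" # hypothetical name
  retention_in_days = 30                      # hypothetical retention
}

# Account-wide alerting topic shared by acceptance and production.
resource "aws_sns_topic" "alerts" {
  name = "domain-a-alerts" # hypothetical name
}

# Published so that the environment layers can subscribe to the topic.
output "alerts_topic_arn" {
  value = aws_sns_topic.alerts.arn
}
```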
Now we come to one of the more important layers: the actual infrastructure foundation layer. This is the layer where you create things like subnets, routes, maybe even databases, depending on who's actually gonna be responsible for them, but I will touch on that subject a little bit later in a couple of the next slides.
The name of this layer is the foundation layer, and if you go back to it and you think about, "Okay, how is a house being built?" If you build a foundation of a house, you don't actually have a house yet. You have something to build a house on top. So the foundation layer of the infrastructure is kinda the same. You are creating infrastructure that is a foundation for your domain team to actually provide any services on top of it to provide any features to the other teams or to other people or businesses depending on what your company actually does.
So what do this layer and the global layer look like together in the folder structure? Again, this is just an example, you don't have to do it this way. We chose, in this particular example, to still have one big gigantic Git repository, like a monorepo, but every team and every environment, like accept, global, and prod, has its own remote state file, so everything is separated from each other. But the cool thing about it is that because all the teams and all of the domains are in there, we can see each other's code and we can see who is doing what, and then copy things from each other and learn things, instead of separating everything.
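One way to give every environment its own remote state file inside a single repository is a separate backend key per folder. The layout, bucket name, and region below are assumptions for illustration, not the structure shown on the slide.

```hcl
# Hypothetical layout inside the monorepo:
#   domainA/foundation/global/main.tf
#   domainA/foundation/accept/main.tf
#   domainA/foundation/prod/main.tf

# domainA/foundation/accept/main.tf
terraform {
  backend "s3" {
    bucket  = "example-domain-a-terraform-state"    # hypothetical bucket from the bootstrap layer
    key     = "foundation/accept/terraform.tfstate" # unique per folder
    region  = "eu-west-1"                           # hypothetical region
    encrypt = true                                  # server-side encryption of the state object
  }
}
```

The `global` and `prod` folders would use the same block with only the `key` changed, so a plan or apply in one folder never touches the state of another.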
Of course, you can decide to cut this up, but my tip would be: don't cut it up too granularly, choose something like one account and put everything that's in that account in one Git repository. Don't start chopping it like acceptance, global, and prod, because you'll have Git repositories all over and you will not know what belongs to what.
Another interesting part here is that I believe global is such a simple concept that you should just put simple resources in there. In my example here, I have acceptance and prod set up as an environment plus a blueprint module, but I think this is actually the best place to use Terraform workspaces, because acceptance and production are 99.99% the same, or maybe a little bit less. So it kind of makes sense to start using Terraform workspaces here instead of having a blueprint module like I'm doing here.
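With workspaces, the small differences between acceptance and production can be expressed as a lookup keyed on the workspace name. This is a sketch under the assumption that the workspaces are called `accept` and `prod`; the CIDR ranges and tag values are made up.

```hcl
locals {
  # terraform.workspace holds the name of the currently selected workspace.
  environment = terraform.workspace

  # Hypothetical per-environment differences; everything else is shared.
  vpc_cidr = {
    accept = "10.10.0.0/16"
    prod   = "10.20.0.0/16"
  }
}

resource "aws_vpc" "main" {
  # Fails at plan time for any workspace without an entry in the map,
  # which guards against applying in the "default" workspace by accident.
  cidr_block = local.vpc_cidr[local.environment]

  tags = {
    Name        = "domain-a-${local.environment}"
    Environment = local.environment
  }
}
```

The environments are then created with `terraform workspace new accept` and `terraform workspace new prod`, and switched with `terraform workspace select`.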
So, how does this apply to the organization? Of course, in a DevOps environment, it is the team that does this. In a more traditional organization, it's probably the platform team that does this. And as I said before, in a slightly more traditional organization, it makes sense to also put the databases in here, because if the platform team is responsible for the foundation, the database is kind of like a foundation layer that everything is built on top of, so maybe they can provide it as a service so it's much easier for the domain teams to build the rest on top of that.
The service layer
So the last layer that we actually have here is the service layer. You can think of this layer as everything the domain team needs to put in place to actually provide any useful features, business value, anything that you need for your own customers. They may be your own colleagues, other companies, or actual clients, no matter what. So the things you can put in here are, if really necessary, EC2 instances, maybe some Fargate in AWS, or some containers that you spin up, maybe an S3 bucket for your website, opening ports, and load balancers to expose some things. All these kinds of things that are necessary to provide services.
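A service layer built on top of an existing foundation might look like the sketch below. It assumes the foundation layer tagged its VPC as `domain-a-accept`; the security group, port, and bucket name are illustrative only.

```hcl
# Look up the VPC that the foundation layer created (hypothetical tag).
data "aws_vpc" "foundation" {
  tags = {
    Name = "domain-a-accept"
  }
}

# Open a port so that the service can be reached.
resource "aws_security_group" "web" {
  name   = "web"
  vpc_id = data.aws_vpc.foundation.id

  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Bucket for the website content (hypothetical name).
resource "aws_s3_bucket" "website" {
  bucket = "example-domain-a-website"
}
```

Nothing in this layer creates networking of its own, so it can be planned, applied, and destroyed many times a day without touching the foundation.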
Looking back, how should services look in a folder structure? They are basically exactly the same as with the foundation layer. The only difference is they don't have the global layer, because that layer should only exist in a foundation layer because that's something that's account and not services specific.
So how does that apply to all these organizations? In a DevOps organization, of course, it's the team. You might be thinking right now, 'But if it's the one team, why can't I just bundle these?' I'll touch on that subject later, but just remember here that those two layers have different lifecycles. Things in the services layer might change much faster than the ones in the foundation, so that's the reason why you should still keep them separated. In a more traditional organization, of course, it's probably gonna be the domain team that handles this, so it kinda stacks up from here.
And here you're thinking, Wait, why in the world are you creating all this complexity, all these layers? Can't I just put all these things together and then make it simple and it's gonna be awesome. The answer is: yes, you can. You can do that, but eventually when you start to scale, you will find out that you will hit some boundaries that you cannot fix anymore by having everything together. So what I'm trying to say today is that if you layer your infrastructure as code, your Terraform code, it'll help you fix the problems that you come across when you work in the organization.
So going back to the organizational challenges, let's look at whether we have actually solved any of these three problems. So looking at the first one, how do you separate the work and responsibilities between teams and disciplines inside Terraform? Kinda makes sense, right? We already have different layers, and I already showed you that you can have different departments or even different people inside the same team being responsible for different things inside your infrastructure. Or they don't even have to have different responsibilities, but they can work separately without interfering with each other.
The second one is, how do you handle different lifecycles inside Terraform? If you look at the services layer, as I said before, it's probably gonna be changing daily, because those are the things that you are adding constantly. You're trying to create new features, new business value, so that's the thing that changes a lot. The foundation and the bootstrap layers, meanwhile, are probably not gonna change that much. They're going to change in the beginning, but afterwards they're gonna lie still and they should stay that way.
But what happens if you put everything together, and you make a mistake and have to fix something in the service layer, while at the same time you have to provide a security update or anything like that in your foundation layer? You can't. While if you layer these things, you can plan and apply and do whatever you like in the other two layers without being held up by the services layer, which changes much more often than the other ones.
The third and last point is: how do you not end up with a monolith like in a software project? Because of the layering, you separate things, so that kind of takes care of itself, but the interesting side effect of this layering is that you can also easily test your infrastructure as code. Meaning, for example, you can quickly bootstrap a new account, see if that works, and tear it down again. If you have a development account, you can easily create a development foundation and see if anything breaks, because it has fewer resources and takes less time, so your feedback cycle is much shorter. And once the foundation layer is in place for your services, you can do the same thing with the services. You can even change different things in each layer and see if they actually break without them depending on each other. So you can work on them completely separately.
Having said that, in the last couple of years, my colleagues and I have worked for different companies with different goals. We have created a lot of different Terraform code, and although it all looks completely different, which is probably the same for you here, it has something in common. I've tried to put this in a summary here. The thing that we learned is Terraform is awesome and it's an awesome tool, but just knowing Terraform alone is not enough. You need more around it. It's not about the tools, it's much more than that.
The second thing that we learned is that although we can mess up our infrastructure as code and we can make that complex, we can fix those problems. It's actually the organizational dependencies and organizational changes that kill our infrastructure as code if we are not well-prepared for that.
The last thing that we learned, and what we found out, is although all our codes were different, they all in a sense, had these three layers: the bootstrap layer, the foundation layer, and the services layer. Meaning, layering your Terraform code allows a lot of flexibility and allows you to handle and fix those organizational dependencies and changes that will happen over time.
Thank you very much for listening.