Case Study

Scaling Innovation: ADB's Cloud Journey with Terraform

Hear how the Asian Development Bank (ADB) used Terraform Enterprise, Packer, Vault, and Sentinel to scale innovation without compromising security and resiliency.

Transcript

Hello everyone. I'm Krista Lozada, and I'm here to share Asian Development Bank’s journey, learnings, challenges, experience – everything within the realm of our move to the cloud and adopting automation. 

A little bit about me: I'm a senior IT specialist at ADB. I have the great privilege of leading the middleware, integration, mobility, and desktop engineering team. Before I start, I want to give a big shout-out to Bea Gan. Thank you for sharing your awesomeness and talent with us by doing all of the illustrations in this presentation.

Who Are We? 

We're the Asian Development Bank, and our mission is to eradicate poverty within the Asia Pacific region. We are not like commercial banks. We are a development bank, and we give grants, loans, and technical assistance to developing countries to help them tackle complex problems, such as water, transport, climate change, and even gender equality. Our headquarters are in Manila, Philippines, and we have around 40 offices within the region. 

Like all organizations, back in 2018, we planned for digital transformation. For us, it meant embracing the cloud, but unlike other organizations, it was not a cost-saving exercise. Rather, it was a move for elasticity. We were driven by the need for resiliency. You see, Manila is expecting a big earthquake sometime soon, and we need to ensure that the organization is resilient when that disaster strikes. That meant moving to the cloud. 

Building the Right Foundations

Unfortunately, moving to the cloud is not always straightforward. There are a lot of complexities in there, especially for a bank. Our team had to make sure that we had set the right foundations in place. We had to make sure that our move to the cloud would be as seamless as possible. 

Traditionally, our approach has been to find one tool to rule them all. But for this project, one of the foundations you have to put in place is infrastructure as code: modern practices and modern technologies that enable you to move fast to the cloud. 

After doing our research, it made sense for us. Instead of one tool to rule them all, what we needed were tools that do one thing very well and have them play nicely together. For us, it was very clear. For Day 0 infrastructure, it's going to be a HashiCorp Terraform job. Terraform does a very good job of managing immutable infrastructure. For Day 1 to Day N infrastructure — whenever your infrastructure needs to change — it's going to be an Ansible job. 

We were so excited. We were supposed to start this project by April 2020. March 2020, boom, we got hit by COVID. And I guess, like everybody else, we were all blindsided. That meant for us literally overnight ADB headquarters had to close one month before our project was supposed to start.

So we were thinking, are we supposed to continue this? Because we hadn't done projects 100% remotely. At the same time, the ADB never closed its doors to its clients. In fact, it tripled its COVID-response initiatives. From $6.5 billion, it increased to $20 billion to support COVID efforts. 

Unfortunately, it also meant there was some urgency to move fast. We were driven to innovate, but at the same time, we could not be careless — we couldn't throw security or resiliency out the window. We had to make sure that we moved fast but stayed secure and architecturally correct at all times.

That meant we had to continue the automation project — even if it meant doing 100% remote work on a project we had no experience with and a technology we were not super familiar with.

Completing the Project in 45 Days

We didn't want to re-engineer every single thing. We wanted to leverage others' experience. There are a lot of companies, a lot of partners that have already done this. They live and breathe DevOps and automation. We wanted to learn from them and make sure we could make this project as successful as possible. 

Fortunately, we were able to finish within 45 days. And by finish, I mean we had a platform fully working and were able to integrate and spin up servers into the cloud in Microsoft Azure — and VMware on-premises. 

Of course, like any other platform, however powerful or amazing it is, a platform by itself never solves problems; people do. If we wanted to scale our automation capabilities, we needed to make sure the foundation was ingrained in our people instead of trusting the tool to do it all by itself. Any tool can be abused or used wrongly. So we had to make sure we had the proper tenets in place and live by our principles whenever we do any project or initiative with automation. 

Rule #1: Do everything in code. What do I mean by that? Let's start from the very beginning. We only had Azure DevOps and Terraform open source in the beginning. Then, using Terraform open source, we provisioned HashiCorp Terraform Enterprise. I love it because it's such an Inception thing to do. You use Terraform to spin up Terraform. 

Once Terraform Enterprise was spun up, we used it to create our Ansible platform. This was the platform that we finished within 45 days. Some of you might find it impressive; some may not. But there were a few complications in there as well. For instance, the virtual machine scale set that Terraform Enterprise runs on — and even the virtual machines that Ansible references — had to use ADB golden images. They had to be CIS hardened, had to have the agents in place, and had to be blessed by security as production grade. That meant Azure DevOps had to spin up another pipeline, which uses HashiCorp Packer to create the image that is referenced by the Terraform scale set. 
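To make that concrete, a Packer template for such a golden image might look roughly like this in HCL2. This is a minimal sketch, not ADB's actual configuration: the image names, SKUs, resource group, and hardening script are all illustrative placeholders.

```hcl
packer {
  required_plugins {
    azure = {
      source  = "github.com/hashicorp/azure"
      version = "~> 2"
    }
  }
}

source "azure-arm" "golden" {
  use_azure_cli_auth                = true
  managed_image_name                = "adb-golden-rhel8"   # illustrative
  managed_image_resource_group_name = "rg-golden-images"   # illustrative
  os_type                           = "Linux"
  image_publisher                   = "RedHat"
  image_offer                       = "RHEL"
  image_sku                         = "8-lvm-gen2"
  location                          = "southeastasia"
  vm_size                           = "Standard_D2s_v3"
}

build {
  sources = ["source.azure-arm.golden"]

  # CIS hardening and the mandatory agents are baked in before
  # security blesses the image as production grade.
  provisioner "shell" {
    script = "./cis-harden.sh"   # hypothetical hardening script
  }
}
```

The resulting managed image is then what the Terraform-managed scale set references.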

At the same time, these workloads have to reside somewhere. And we were ballsy enough to make them the first workloads to run in our datacenter in the cloud. Imagine this: we were building Terraform Enterprise while another pipeline was building the images that Terraform Enterprise would use, and at the same time a third pipeline was spinning up the datacenter where Terraform Enterprise was going to live. Every single change to the image or the datacenter meant that we had to respin our infrastructure as code platforms.

If we had done this manually, we would probably have been faster in the first iteration, because we didn't yet have the code right — but it wouldn't have given us the flexibility to change on demand. By writing every single change as code, we have the latest source of truth — the latest configuration — as a blueprint in our code. And it allows us to move fast, because for any change in the datacenter, we just destroy our existing platform, reflect whatever changed in the code, and then spin it back up new. 

We love putting everything in as code so much that, like a boss, we've encouraged our management team to adopt a policy where no user has write access to our cloud infrastructure. As you can see, our contributors and owners are either the service principal of Azure DevOps or the managed service identity of the scale set in Terraform itself.

Any user would only have read access, meaning that if you need to change any infrastructure, you change it at the code level. Of course, we have emergency break-glass accounts that allow a person to make the necessary changes, but that's reserved for when the system fails or something has gone really wrong. The road that has to be traveled is: update everything through code. 

Rule #2: Don’t repeat yourself. There should be one piece of code that defines a piece of infrastructure, and one piece of code that defines business logic in ADB. For us, that meant breaking down the monoliths. And monoliths are not just in traditional applications; even infrastructure as code has monoliths.

Workspace and Modules

And now I'm going to tell you a lengthy story of how using workspaces and modules saved us from an architecture dilemma. Let's start with the datacenter in Azure. You have your hub virtual network. Traditionally, our hub has routing and security components in it. We're adopting a hub and spoke model in our cloud datacenter. In these spokes, we've decided to divide these subnets by data classification. 

So workloads within the same data classification — for example sensitive applications — can talk to each other. But if workloads need to talk to another subnet using a user-defined route table, we force them to go through the firewall. That allows us to monitor all ingress and egress of all traffic outside of the subnet. 

That also meant each subnet would have its own UDR attached to it, which is managed by Terraform, thank God. We also have other subnets in the hub virtual network. For example, our infrastructure as code platforms — Terraform and Ansible — live in the hub. 
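As a rough sketch of what one of those Terraform-managed UDRs looks like — the names, CIDRs, and the firewall IP below are made up for illustration, and the subnet is assumed to be defined elsewhere:

```hcl
# One user-defined route table forcing cross-subnet traffic through the firewall.
resource "azurerm_route_table" "sensitive" {
  name                = "udr-sensitive"        # illustrative name
  location            = "southeastasia"
  resource_group_name = "rg-network"           # illustrative
}

resource "azurerm_route" "to_restricted" {
  name                   = "to-restricted-subnet"
  resource_group_name    = "rg-network"
  route_table_name       = azurerm_route_table.sensitive.name
  address_prefix         = "10.20.2.0/24"      # the destination subnet's CIDR
  next_hop_type          = "VirtualAppliance"
  next_hop_in_ip_address = "10.20.0.4"         # the firewall's private IP
}

resource "azurerm_subnet_route_table_association" "sensitive" {
  subnet_id      = azurerm_subnet.sensitive.id # subnet assumed defined elsewhere
  route_table_id = azurerm_route_table.sensitive.id
}
```

A route like this is what sends all traffic leaving the subnet to the firewall for inspection instead of letting it ride the default system routes.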

Security mandated that there should be no public endpoints for PaaS resources in Azure. That means if you need Postgres, SQL, your Redis cache, your storage account — every single thing should have a private endpoint, and no public endpoint should exist. Makes sense, I guess. 

Unfortunately, we had to learn the hard way that, when you create a private endpoint, Azure automatically creates a system route from the subnet where the private endpoint is to everything it's peered to. 

What does that mean exactly? Well, that means a non-production server has a direct line of sight to any private endpoint; even if it's a production-sensitive database. Of course, that's a big loophole for us. It totally bypasses our architecture where all ingress and egress is enforced — or at least goes through the firewall where we can do inspections and enforce certain rules. 

The solution to this problem is simple. We would remove these subnets from the hub virtual network, create our own private endpoint virtual network, and peer it to the hub. Of course, we'd need to add the private endpoint virtual network's CIDR to the route table — and that would force the traffic back through the firewall. 

There's still a little bit of an issue there, because we still have the infrastructure workloads in the hub, which now have direct line of sight to all private endpoints since they're still peered. We needed to remove those too and place them in their own virtual network. That would have solved all of our problems with the system route that's created for the private endpoint.

It would've solved it, but at the cost of complexity, because now our issue is that we added two different virtual networks in our datacenter to accommodate this use case. Thankfully, by September 2, 2021, Microsoft created a public preview that allows us to create a wider subnet range for route tables. 

That means we can yank out the virtual networks for PEs and infrastructure, return them to the hub, and then we would just need to add the private endpoint subnet CIDR to the route table — and that would still force traffic through the firewall. 

Unfortunately, the public preview is only for the US. There's no love for us here in Asia Pacific. That means for the meantime — while we're waiting for this to roll out — we need to create a user route that is directly forcing all of the private endpoints to go through the firewall.

Now, our datacenter contains more than 60 route tables. That's because we have a lot of different data classifications, hence different subnets in the VNETs. That means every time a user requests a SQL database, a storage account, a Redis cache, or whatever PaaS resource, somebody from the network team would have to update all 60 route tables. It's going to be, A, time-consuming. And B, if they make a mistake, then there's a big loophole in the architecture — because suddenly there's an exploit with a direct line of sight to whatever private endpoint was created. I'm guessing the network team would not love that so much. 

When you think about it, I've been talking about breaking down monoliths, but then I've discussed an architecture problem. That's because, by using modules and workspaces, we were able to solve this problem. Thank you, Terraform. 

We've designed the modules to be like Lego blocks. You're supposed to build your workspace by adding the Lego blocks you need to build your infrastructure. As such, here we use two modules: the private endpoint module, which creates a private endpoint, and the route module, which adds one route to a route table.

We've also used workspaces — the workspaces in Terraform Enterprise, not the workspaces in Terraform open source. We have two workspaces here. One is for private endpoint routes, and the other is for private endpoints. There are a million ways we could have solved this problem. I could have created one workspace that creates one private endpoint and then propagates all of the routes across all the VNETs. 

But the problem with that is it doesn't give us flexibility. We decided to create one workspace to create a private endpoint. Then, by sharing its state and using run triggers, every time that private endpoint workspace is run, it automatically runs the other workspace, which propagates all of the routes in the different VNETs.
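A hedged sketch of that wiring in HCL. The organization, workspace, module, output, and variable names below are assumptions, not ADB's actual names: the routes workspace reads the private endpoint workspace's outputs via remote state, and a run trigger makes it re-run whenever the private endpoint workspace finishes a run.

```hcl
# In the private-endpoint-routes workspace: consume the other workspace's state.
data "terraform_remote_state" "private_endpoints" {
  backend = "remote"
  config = {
    organization = "adb-example"                  # assumed org name
    workspaces   = { name = "private-endpoints" } # assumed workspace name
  }
}

module "pe_route" {
  source  = "app.terraform.io/adb-example/route/azurerm" # hypothetical module
  version = "~> 1.0"

  route_table_id = var.route_table_id
  address_prefix = data.terraform_remote_state.private_endpoints.outputs.pe_cidr
  next_hop_ip    = var.firewall_private_ip
}

# In the workspace that manages TFE itself: chain the two workspaces together,
# so a run in "private-endpoints" queues a run in "private-endpoint-routes".
resource "tfe_run_trigger" "routes_after_pe" {
  workspace_id  = tfe_workspace.private_endpoint_routes.id
  sourceable_id = tfe_workspace.private_endpoints.id
}
```

Keeping the two workspaces separate is what lets the route logic be removed later without touching the private endpoint code.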

By using this approach — whenever Microsoft decides to extend the public preview to Southeast Asia — we can then remove all of the routes that have the private endpoints in the route table. And then, we don't need to change any code on the private endpoint workspace because they are divided. 

In our case, we wanted to make sure that each workspace would be as small and minute as possible, doing one piece of logic and one only. And by using some orchestrator — for us, sharing state and using run triggers — we were able to invoke the workspaces that depend on it. Likewise, if a subnet is created on one of the spokes or on the hub virtual network, it again invokes the other workspaces to propagate the routes.

Rule #3: Write code, but write code with quality. We are putting a lot of importance and dependency on those modules. And we have to make sure — because there's only one place to define that business logic — that it is of quality. But the big problem is: how do you even test infrastructure? It's not like application code, where you can write a unit test. We used Azure DevOps to automatically do a poor man's version of unit testing for infrastructure. 

Here, I have an Azure DevOps project called Terraform Modules and a repo specifically for that module. We said one repo equals one module in Terraform. Let's say I have my main branch, I create a Dev branch based on it, and then a Feature branch on top of that. I then make whatever changes are necessary. 

At some point, I do a pull request. Whenever you do a pull request, it goes to my team for approval. Whoever created the feature branch is supposed to test it locally and show some proof that the code works as expected. 

After commits — everything is ready, it's been approved — you then try to merge it to the Dev branch. Once it's merged, then the feature branch has lived its purpose, and gets destroyed. We're then ready to merge it to production. 

We then do a pull request to merge it to the main branch. But by doing that pull request, it will automatically trigger a pipeline. The pipeline would then have three stages — build, deploy, and test — and you can see that it would fork at some point for auto and manual destroy.

The Build stage is very simple. It does a Terraform plan and Terraform validate. Deploy is Terraform plan and apply. Test is where we do our infrastructure unit testing. And if all is well, it would go to automatically destroy because we don't want to be spinning up resources whenever a change in module has been created. 

When an error occurs in the test stage, it triggers the manual destroy, which is supposed to be looked at by my team to see why the module has failed. That test stage is very simple. I know there are other tools, like Terratest, but that would have required us to learn Go. 

We stuck to what we know best, which is the AZ CLI and Bash. And from there, we did a poor man's version of unit testing: Is the name as I expected it to be? Is the region correct? Is the address space what I expected it to be? As you can guess, this is a module for virtual networks. 
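A minimal sketch of what that Bash-based check might look like. In the real pipeline, the "actual" values would come from AZ CLI queries; here they are stubbed with literal strings so the sketch stands alone, and all resource names are illustrative.

```shell
#!/usr/bin/env bash
# Poor man's infrastructure unit test, sketched in plain Bash.
# In a pipeline, the actual_* values would come from the AZ CLI, e.g.:
#   actual_location=$(az network vnet show -g rg-test -n vnet-hub \
#                       --query location -o tsv)
set -u

failures=0

# assert_eq <description> <expected> <actual>
assert_eq() {
  if [ "$2" = "$3" ]; then
    echo "PASS: $1"
  else
    echo "FAIL: $1 (expected '$2', got '$3')"
    failures=$((failures + 1))
  fi
}

# Stubbed stand-ins for `az ... --query ... -o tsv` output
actual_name="vnet-hub"
actual_location="southeastasia"
actual_address_space="10.10.0.0/16"

assert_eq "VNet name"     "vnet-hub"      "$actual_name"
assert_eq "Region"        "southeastasia" "$actual_location"
assert_eq "Address space" "10.10.0.0/16"  "$actual_address_space"

# A real pipeline stage would end with: exit "$failures"
echo "failures=$failures"
```

A non-zero failure count is what would flip the pipeline into the manual-destroy branch for the team to inspect.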

At some point, when the pipeline is successful, it allows you to merge to the main branch. And once you merge to the main branch, you're good to go. We use Git tagging as much as possible. Of course, all of our workspaces are pinned to a specific version of each module so that we don't introduce breaking changes to workspaces that are already working.

The semantics of how we do the tagging are very simple. Any commit made on Dev updates the patch version. Any commit made on Main updates the minor version. If we have code-breaking changes, we update the major version. 

For example, if we decided to change the way we name a certain resource in a module, we would put it in a new major version. Why? Because certain attributes in Terraform require a destroy when updated. Whenever an attribute requires a destroy — such as changing the location or the name — that's going to be a major version update.
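In a consuming workspace, that pinning looks roughly like this; the registry path, version number, and input values are illustrative placeholders, not ADB's actual registry.

```hcl
module "vnet" {
  source  = "app.terraform.io/adb-example/vnet/azurerm" # hypothetical registry path
  version = "1.4.2" # major.minor.patch: Main bumps minor, Dev bumps patch

  name          = "vnet-hub"
  address_space = ["10.10.0.0/16"]
}
```

Because the version is an exact pin rather than a loose constraint, a major bump in the module registry never reaches a working workspace until someone deliberately updates this line.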

Rule #4: Prepare for failure, prepare for learning. This is a space that is constantly moving, constantly shifting, and still changing. Terraform only reached 1.0 this year, so you're not going to get everything right. And that's okay, because any failure is a chance for learning. We tried to mitigate mistakes because we had very little experience with infrastructure as code and were going all out, 100%, all of a sudden. We went from zero to a hundred within a few days. 

Mirroring Production

So we wanted to make sure that we were doing certain things in a controlled environment. What do I mean by that? I'll give you an example of how we update infrastructure, using a Terraform workspace with a private endpoint route.

Familiar with this? This is your Git branch. At some point, you're ready to merge the Feature branch into your Development branch. Once you're ready to put it in Main, you do a PR. But when you do a PR, you'll see that it triggers another pipeline. This pipeline looks a little different: it's now build, and then it forks between Sandpit and production. As you can see, Sandpit has another stage, which allows you to destroy the infrastructure.

But I've been talking a lot about Sandpit. What is Sandpit exactly? Sandpit is our mirror of production, if that makes sense. We had philosophical questions about where to put Terraform development. The initial question was: should we put Terraform development in the development subscription and have it touch development infrastructure? 

But the biggest problem with that is development is production to developers. We cannot touch that. We needed to have a bubble where we can safely introduce changes without any repercussions of touching any workloads. And thus, the Sandpit tenant was born. The Sandpit tenant is the exact mirror. It's the same datacenter as what we have in production. The only change is that the workloads are slightly different because we didn't want to be paying for multiple VMs for something that we're not using. But the whole bones of the datacenter are in there.

At some point, when the pipeline is successful, you do your build, then you do Sandpit. Sometimes you do a destroy, if this is a workload. If it's a change to the datacenter, we don't destroy, because we want to keep it there. Now we're ready to merge to the production branch. The moment we merge to the production branch, the same pipeline is triggered, but this time it forks from build to production instead of Sandpit. That's how we introduce change. 

This is only for Terraform open source, because if you're on Terraform Enterprise, the good folks at HashiCorp have already done this for us. The workspace itself allows you to define which branch you want to point to. For us, if the workspace is for Sandpit, we point it at Dev. If the workspace is for production, we point it at Main.
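Managed as code with the TFE provider, that branch mapping can be sketched like this; the workspace, organization, and repo identifiers are hypothetical.

```hcl
# The Sandpit workspace tracks the Dev branch; a production twin of this
# resource would track Main instead.
resource "tfe_workspace" "pe_routes_sandpit" {
  name         = "pe-routes-sandpit"            # illustrative name
  organization = "adb-example"                  # assumed org

  vcs_repo {
    identifier     = "adb/terraform-pe-routes"  # hypothetical repo
    branch         = "dev"
    oauth_token_id = var.oauth_token_id
  }
}
```

The same repo backs both workspaces, so promoting a change is just a merge from Dev to Main.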

You might ask me, Krista, this is all good if it's immutable infrastructure. If the infrastructure itself is not changing, this is good because you can test 100%. True. But what if the infrastructure changes? Then this doesn't help much. 

Remember when I told you about everything as code? Well, we got super crazy about it. What I've shown you is a YAML pipeline. We're using pipelines as code to ingest data into our databases. We're doing data as code — whatever we can put as code, whatever we can automate, we try to do it.

Here, by ingesting the data into the database for our data warehouse, for example, we're able to replicate a state as close to production as possible. Whenever your infrastructure can mutate, I think the question you need to ask is: is there any mechanism I can apply, using automation principles, that keeps it as close to production as possible? For us, it was leveraging all of the scripts available to us and an orchestrator like Azure DevOps.

Those are the principles — the tenets — that we keep in place and take to heart. But the reality is, even if you do everything correctly, you always have to account for change, because innovation never stops. 

Simplifying Infrastructure

Remember, we were driven to run this project because we wanted to move fast, be secure, and be resilient. That meant walking the walk for our infrastructure as well. The thing is, whatever can change or mutate will change — so prepare for that. 

On our end, one year after we built our Terraform Enterprise, we started looking at it. Is this the most secure, resilient, future-proof architecture we can have for Terraform? Or is there a better way? When we started this project, Terraform Cloud didn't yet have feature parity — it was just starting out, or at least was very new.

We did what we thought was resilient enough. We had a scale set in there; when the health check failed, the scale set would provision itself again, so at least we had some HA. All of the components were PaaS. We had geo-redundancy and everything in there to be as resilient as possible. But nothing beats SaaS. 

I'm sure that HashiCorp, via Terraform Cloud, is going to be a lot better — more resilient, more secure — than anything we could ever build ourselves. The simplified architecture was to move to Terraform Cloud, then use agents to talk to Terraform Cloud with line of sight to on-premises resources and the private endpoints, running on Azure Container Instances. And as such, the managed service identity that used to belong to the scale set would now belong to the agent itself, as container instances in Azure.
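A hedged sketch of running a Terraform Cloud agent on Azure Container Instances with the azurerm provider; the names, region, and the subnet reference are illustrative assumptions, and the agent token would come from Terraform Cloud.

```hcl
variable "tfc_agent_token" {
  type      = string
  sensitive = true
}

resource "azurerm_container_group" "tfc_agent" {
  name                = "aci-tfc-agent"            # illustrative
  resource_group_name = "rg-automation"            # illustrative
  location            = "southeastasia"
  os_type             = "Linux"
  ip_address_type     = "Private"
  subnet_ids          = [azurerm_subnet.agents.id] # subnet assumed defined elsewhere

  container {
    name   = "tfc-agent"
    image  = "hashicorp/tfc-agent:latest"
    cpu    = 1
    memory = 2

    environment_variables = {
      TFC_AGENT_NAME = "aci-agent-01"
    }
    secure_environment_variables = {
      TFC_AGENT_TOKEN = var.tfc_agent_token
    }
  }
}
```

Because the container group is injected into a private subnet, the agent has line of sight to on-premises networks and private endpoints while Terraform Cloud itself never needs inbound access.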

We also wanted to simplify integration. The way we provision servers references a workspace, which references a module. And the modules use an image that was created with Packer — and Terraform creates the VM in the datacenter. At some point, I have to provision the VM in there. 

But the reality is, one thing security requires is that whenever you provision a VM, it has to be enrolled in our privileged access manager (PAM) — which is in Manila. We wanted Terraform to integrate with it directly, but there was no out-of-the-box provider available. We would have needed to buy another component for our PAM to get integration with Terraform.

Initially, we were thinking: let's integrate it with Ansible. Terraform would pass the initial password to Ansible, and then Ansible would enroll that server in our PAM. That's all well and good. 

But remember, we had to be 100% sure, without a doubt, that this would work. If Ansible is down, that means we've just spun up a server and maybe not enrolled it in our PAM. Terraform does not know how the playbook went in Ansible. Even though we tested it and it was working, we could never be 100% sure. There's always that 0.001% of doubt that maybe it was never enrolled. And we don't want that — we cannot afford that. We wanted to make sure that whenever we spin up anything, it's as secure as possible, and this process could never fail.

Use Terraform to Manage Enrollment 

The solution was for Terraform to manage that enrollment. But we didn't have a provider. One day, we started thinking: can we build our own? So we attempted it — and we were able to. To our surprise, we built it in a day. That is a testament to how easy the documentation and the ecosystem that HashiCorp gives you are to use. 

We decided to use the plugin framework, not SDK v2, because we wanted to future-proof all of our solutions — and it looks like this is the way forward for plugins as well. 

As for me, I'd only done two or three Go projects before this. Being able to do it in one day means it's not hard. There's not a lot of documentation out there from other people — but if you follow the existing documentation to the letter, it's really super simple.
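Once such a provider exists, consuming it from a workspace might look roughly like this. Everything here is hypothetical: the provider address, the `pam_enrollment` resource type, its attributes, and the VM resource it references are illustrative stand-ins, since ADB's in-house provider isn't public.

```hcl
terraform {
  required_providers {
    pam = {
      source  = "adb-example/pam" # hypothetical in-house provider address
      version = "~> 0.1"
    }
  }
}

# Hypothetical resource type: enrolling a freshly provisioned VM in the PAM,
# so enrollment is tracked in Terraform state alongside the VM itself.
resource "pam_enrollment" "app_server" {
  hostname         = azurerm_linux_virtual_machine.app.name
  initial_password = azurerm_linux_virtual_machine.app.admin_password
}
```

Because the enrollment is a first-class resource, a failed enrollment fails the apply, which removes the doubt the Ansible hand-off left open.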

Simplifying Authentication

Lastly, we wanted to simplify authentication. What do I mean by that? Well, Ansible Tower is a big, critical piece of infrastructure for us. Right now, to talk to servers in Azure, Tower — which also runs in Azure — talks to them directly. If we need to talk to servers in Manila — which is where our main datacenter is — we use what we call an isolated node, which connects via SSH to the VM it needs to manage. 

Remember, we have forty offices, and we wanted the workloads to run locally. Meaning, if my server is in Pakistan, I shouldn't be triggering it from Manila or Singapore; I should be triggering it from Pakistan. Tower, in this case, only acts as a scheduler.

Everything was good architecturally again, but there was one big thing: the Tower servers and the isolated nodes have to share SSH keys. These SSH keys have supreme access to all of our Linux servers, so Ansible can do whatever it needs to manage them. And there's a scary part: what if one of the isolated nodes in one of the field offices is compromised? Then, in theory, the attacker has access to all of our servers. 

I don't think security was happy with that. We had discussions about how to protect the keys. Do we need to rotate them every day? Every month? How complicated is that? And we started asking ourselves — this seems to be a generic problem, or at least one a lot of people face — how does big tech handle this? We started looking at how other companies do it, and we came across something called signed SSH.

Enter Vault. Out of the box, Vault allows us to use signed SSH. The idea is that the SSH keys of Ansible Tower and the isolated nodes have no privilege on any of the servers by themselves. They have to be signed by HashiCorp Vault — whose signing CA the servers trust — with a time-to-live; for us, it was five minutes. Then, once it's signed, you can use the signed SSH key to connect to your servers. It's like having somebody validate your ID, but only for a limited period of time. 
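Configured as code with the Vault provider, the SSH certificate authority setup can be sketched roughly like this; the mount path, role name, and the `ansible` user are assumptions, while the five-minute TTL matches what's described above.

```hcl
# Mount Vault's SSH secrets engine to act as a client-key signing CA.
resource "vault_mount" "ssh" {
  path = "ssh-client-signer"   # assumed mount path
  type = "ssh"
}

# Generate the CA key pair; servers are configured to trust its public key.
resource "vault_ssh_secret_backend_ca" "ca" {
  backend              = vault_mount.ssh.path
  generate_signing_key = true
}

# Role used by Tower and the isolated nodes to get short-lived certificates.
resource "vault_ssh_secret_backend_role" "ansible" {
  backend                 = vault_mount.ssh.path
  name                    = "ansible"
  key_type                = "ca"
  allow_user_certificates = true
  allowed_users           = "ansible"  # assumed service account
  default_user            = "ansible"
  ttl                     = "5m"       # five-minute certificate lifetime
}
```

A compromised node then holds nothing of lasting value: its key is useless until Vault signs it, and any signature expires within minutes.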

Summing Up 

That's everything, all of the use cases that we have to share. 

Move Fast. 

This project was driven by three things the organization required. We wanted to move fast. And by breaking down modules and workspaces like Lego blocks, we can easily adapt. We can easily be flexible to whatever use case is thrown at us. 

One example: we had to build a data warehouse for one of our projects. It took us a day to write the modules and the workspace. Lo and behold, another project required us to build yet another data warehouse — and we were able to provision it in five minutes, because we write our modules and workspaces as thin as possible, so they're always ready for reuse.

Keep Security In Mind

The way we enforce our security policies is to embed them in the modules themselves. But you can never be 100% sure, because maybe a workspace didn't use the modules — for some reason, it bypassed them. 

Or maybe somebody developing the modules changed the policies. How can you then ensure it? We use the modules to enforce the policies, but you can never be too paranoid with security, especially at a bank.

We use HashiCorp Sentinel. Sentinel enforces the policy regardless of whether the module enforces it or not. Sentinel is the bouncer at the club that decides whether you get in. For us, that gives us 100% confidence that anything provisioned by Terraform follows our security posture. 
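To illustrate the idea, a minimal Sentinel policy might look like this. The specific rule — denying storage accounts that allow public network access — is an invented example in the spirit of the no-public-endpoints mandate described earlier, not ADB's actual policy set.

```sentinel
import "tfplan/v2" as tfplan

# Collect all managed storage accounts in the plan.
storage_accounts = filter tfplan.resource_changes as _, rc {
    rc.type is "azurerm_storage_account" and rc.mode is "managed"
}

# Illustrative rule: every storage account must disable public network access.
deny_public_access = rule {
    all storage_accounts as _, sa {
        sa.change.after.public_network_access_enabled is false
    }
}

main = rule { deny_public_access }
```

Because Sentinel evaluates the plan itself, the check holds even when a workspace bypasses the shared modules entirely.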

Designed with Resilience in Mind

By having our whole datacenter written as infrastructure as code, we have more than 2,000 configuration items that exist as a blueprint in code somewhere — with an audit trail, with version history, with all of the commits. 

If we had done it manually, it would exist in somebody's head. And that's always scary, because, A, can you remember every single thing? And B, what if that person leaves?

But what is the cherry on top of the icing on top of the ice cream on top of the cake? We kept it simple — we use one cloud provider, which is Azure, and one region, which is Singapore. So if ADB decides tomorrow to build the whole VDC in another region — if we decide, hey, let's create one in Hong Kong or in Japan — then we're ready for it. 

In a single click, we can provision our whole datacenter in any region because we've designed it with resiliency in mind. We've designed it with flexibility in mind to make sure that we are able to adapt to whatever change is necessary at any point in time.

Finally, it's been an interesting year, an interesting journey for us. We know we've barely scratched the surface. I'm excited to learn more, to improve what we have and look at how we can future-proof all of our architecture. Hopefully, you've learned from us too. I guess that's all I have to say. Thank you so much for listening.
