Hear how GoPay is transforming the lifecycle management of their stateful components toward GitOps with Terraform and Ansible.
GoTo Financial is one of the biggest payment providers in Southeast Asia. In this talk, the speakers will delve deep into how they manage 2,500+ database servers with Terraform, ranging from provisioning, updating, auditing, security hardening, to setting up database-specific observability metrics.
My name is Eko Simanjuntak, and this is my colleague, William Albertus Dembo. We are from the infrastructure engineering team at GoPay. Before we continue to our main topic, let me give you some context about our company.
We are part of GoTo Group. Under GoTo Group, we have three big companies. The first one is Gojek, an application that offers on-demand services like two-wheeler taxi rides, food delivery, logistics, and many more. On the other side, we have Tokopedia, one of the biggest eCommerce services in Indonesia.
We come from GoTo Financial. Under GoTo Financial, we have several products. We have Moka POS as our point-of-sale system. We also have Midtrans as our payment gateway. We also have GoPayLater. As its name suggests, you can purchase things and pay later.
We come from the GoPay team. GoPay is a wallet that enables you to purchase from our offline and online merchants. You can pay your electricity bill, your insurance, your internet — and you can also set up a recurring transaction.
GoPay has had the largest monthly active users in Indonesia since Q4 2017. It also has 700,000 offline and online merchants. And GoPay is also integrated with more than 28 financial institutions.
To support this business, we have GoPay Engineering:
We have 300+ engineers split into more than 35 teams.
We run more than 30 Kubernetes clusters.
We have 500 workloads like services, workers, and cron jobs.
We use multiple datacenters. On these datacenters, we have 15 VPCs.
We also run more than 2,500 databases like PostgreSQL, Redis, MongoDB, Kafka, RabbitMQ.
All these databases are self-managed. We don't use any service provided by cloud providers, like Cloud SQL, or, if you use AWS, you may know ElastiCache.
Today, we are talking about the evolution of database provisioning in our company. We have three generations. The first generation we call Proctor. The second generation we call Manual Infrastructure as Code (IaC). And the third generation, which we are working on right now, we call the Stateful Component Portal, or SCP.
Before we begin, I want to see hands. How many of you are already working with Terraform? How about Ansible? Working with Ansible? Oh, nice. Then I think this presentation is good for you.
Let's start with the first generation. Back then, we used Proctor, our in-house solution, to run predefined scripts. The script generates your Terraform manifest, then applies that manifest. Then you've got your databases, your VMs, provisioned for you.
This is how you execute it from the flow perspective. You execute the Proctor CLI. The Proctor CLI invokes the Proctor server. Then the Proctor server tells Kubernetes, "Hey, Kubernetes, please run this script as a job."
As an infrastructure team, we create the Terraform modules. And we also create the predefined script that will be executed. And as a development team, you need to execute the provided script.
This is an example of our predefined script. You can see here that we created a Ruby library that wraps Terraform functionality. The Terraform wrapper can generate the Terraform manifest, then plan and apply it. This is an example of how you execute it in Proctor.
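As an illustration of the pattern, a Proctor-style predefined script might look roughly like the sketch below. This is a hypothetical reconstruction, not GoPay's actual script: the class, resource shape, and parameter names are all invented.

```ruby
# Hypothetical sketch of a Proctor-style predefined script: a small Ruby
# wrapper that renders a Terraform manifest and shells out to Terraform.
# Names and resource shapes are invented for illustration.
class TerraformWrapper
  def initialize(name:, machine_type:, disk_gb:)
    @name         = name
    @machine_type = machine_type
    @disk_gb      = disk_gb
  end

  # Render an HCL manifest for a single database VM.
  def render
    <<~HCL
      resource "google_compute_instance" "#{@name}" {
        name         = "#{@name}"
        machine_type = "#{@machine_type}"

        boot_disk {
          initialize_params {
            size = #{@disk_gb}
          }
        }
      }
    HCL
  end

  # In Proctor, the rendered manifest was written to a temporary workspace
  # and applied immediately; the state was not persisted anywhere.
  def apply!
    File.write("main.tf", render)
    system("terraform init -input=false && terraform apply -auto-approve")
  end
end

wrapper  = TerraformWrapper.new(name: "orders_db", machine_type: "n1-standard-4", disk_gb: 100)
manifest = wrapper.render
puts manifest
```

The key property to notice is in `apply!`: the manifest lives only in the job's workspace, which is exactly why the state was lost, as described next.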
This approach had several problems. First, the generated manifest is not stored anywhere, which means you lose the main functionality of Terraform: state management. And because you don't store the Terraform manifest, you don't have any audit logs. You have no visibility into the state of your infrastructure.
Then there are cases where you want to update your database. For example, you have Postgres and you want to change the maximum number of connections. What do you do? You need to SSH into the server, change the config manually, and reload the service. That's how we did it in the first generation.
That's why we came to the second generation, which we call Manual IaC. We started to create our own Terraform modules. Instances were created by committing the Terraform manifest to the repository. Then we used Ansible as the configuration management, and it was quite stable for provisioning our databases.
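In this second generation, each database was a hand-written module block committed to Git. A hypothetical example of what one of those manifests might have looked like (the module source, variable names, and values are all invented for illustration):

```hcl
# Second-generation "manual IaC": one module block per database,
# copy-pasted and edited by hand for each new request.
module "orders_postgres" {
  source = "git::https://gitlab.example.com/infra/terraform-modules.git//postgres-vm?ref=v1.2.0"

  name         = "orders-postgres"
  environment  = "production"
  disk_size_gb = 500               # no standard sizes; every team picked its own
  machine_type = "n1-standard-8"
}
```

Because every instance was a literal copy of a block like this, a module upgrade meant editing the `ref` in every manifest by hand, which is the pain point described below.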
From the infrastructure team's point of view, we just need to create the Terraform modules and then work on tickets requested by the development team. The development team needs to create a ticket saying, "Hey, I need these databases. For example, I need Postgres version 10 with hundreds of GB of storage." Then we create it manually.
We have an issue with copy-paste infrastructure. To provision a new database, you just copy the existing IaC and modify it per the request.
We have no standardized resource size. If a developer says, I want this database to be 500 GB, then another engineer asks for 500 GB, there's no standardization.
We face difficulty updating all the manifests. For example, you already have plenty of Terraform manifests, and then you release a new version of your module. We need to update them manually: go to the Git repository and change each manifest one by one.
Other than that, we don't have any database inventory. We lack visibility of ownership: which databases exist, and which team owns any particular database.
That's why we came to the third generation, which we call SCP. As the infrastructure team, we created Terraform modules and Ansible playbooks. Then we created a template for each specific database.
Then we need to review pull requests and trigger the CI/CD pipeline to get the database provisioned. The development team just needs to create a provisioning request via our portal.
I will run through how we use SCP. I will give you some context, and I hope you get it. First, the Terraform module. Nothing fancy here, a simple Terraform module. We also have an Ansible playbook ready. Nothing fancy, just a basic Ansible playbook.
This is an example of the Ansible playbook to provision PostgreSQL. Next come the templates. We have an Ansible folder, a Terraform folder, and a `gitlab-ci.yml`. The Ansible folder contains the template for the Ansible manifests. The Terraform folder contains the template for the Terraform module. Then we have the `gitlab-ci.yml`; we use GitLab as the CI/CD system, and the `gitlab-ci.yml` is generated for you, so you get your pipeline configured.
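A minimal playbook of that shape might look like the following sketch. The role and variable names are hypothetical; the real playbook would also handle replication, WAL-G backups, and monitoring exporters.

```yaml
# Hypothetical minimal shape of an SCP PostgreSQL playbook.
- name: Provision PostgreSQL
  hosts: postgres
  become: true
  roles:
    - role: postgresql
      vars:
        postgresql_version: 10
        postgresql_max_connections: 200
```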
Here we have SCP. In SCP, we have the concept of a cloud provider. Previously, I mentioned that we use multiple datacenters. Here, we use GCP and AWS, and we also use our local datacenters. A cloud provider has clusters; you can split them into staging, UAT, production, or whatever you want.
A cloud provider also has services. A service is an offering that we give to the developer: "You can provision this kind of database on this particular datacenter." Here you register the service. After registering, you get, for example, a service named AWS PostgreSQL Master Replica.
Then we have the template URL, the Git URL that we created before. We also have the template tag, the metadata, and the service availability, which allows you to enable or disable database provisioning on a particular cluster.
We also have plans. A plan is an offering that specifies the database size. Here's a plan; you can see it has metadata. The metadata defines that the plan is medium, the boot disk size should be 50 GB, and the data disk IOPS should be 4,000. All this data will be rendered into our template later.
The plan also has a create-parameters schema. It is a JSON schema that defines the list of parameters users can input into the system later to get the database provisioned. Everything we need is ready. So, what's next? My colleague William will continue.
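Putting these pieces together, a service registration with a plan and a create-parameters JSON schema could look roughly like the following. The field names and layout are illustrative, not SCP's actual format.

```yaml
# Hypothetical SCP service registration.
service: aws-postgresql-master-replica
template_url: https://gitlab.example.com/infra/scp-templates/postgres-master-replica.git
template_tag: v3.4.0
availability:
  - cluster: aws-jakarta-production
    enabled: true
plans:
  - name: medium
    metadata:
      boot_disk_size_gb: 50
      data_disk_iops: 4000
create_parameters_schema:        # JSON schema, rendered as a form in the portal
  type: object
  required: [name, version]
  properties:
    name:          { type: string }
    version:       { type: string, enum: ["10", "11", "12"] }
    database_name: { type: string }
```

The `create_parameters_schema` is what later drives the form the developer fills in, so the infrastructure team controls exactly which knobs are exposed.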
William Albertus Dembo:
Thank you. Eko already explained how we create the template and how we register it with SCP to create an offering for the developer. Now, how does a developer create an instance?
Before that: a developer doesn't start by needing a database; they need the application itself first. Internally, we have something called gopay.sh. Gopay.sh is the point of contact between the developer and the infrastructure. When they want to create an application deployment, they just go to gopay.sh and create the deployment, and they get the repository and the pipeline ready. It's the same for databases.
In this talk we are covering database creation, but if you are interested in the application deployment part of gopay.sh, you can refer to the KubeCon talk by our team. You can learn about application deployment there. For now, let us focus on stateful component deployment.
Here's an example of the gopay.sh dashboard. It lists multiple applications and their owners. The owner is a team; we use a stream-and-pod structure for our teams. Let's pick one application; here's the application detail. You can see the application name is scp-playground. It's deployed in multiple environments: staging and production. You can also choose the cluster where you want to deploy it, for example AWS Jakarta or Singapore, GCP Jakarta, Singapore, Taiwan, etc.
Once you know what to pick when you want to create the database, you can just click the application. In this case, we are looking at the SCP test on staging; click on the "i" button. Here's the detail of the release. The type of this release can be HTTP, gRPC, a cron job, a worker, etc. There are also health checks, a port, and the Kubernetes settings. That's the detail of the release. But we need the database.
In gopay.sh, a database is an add-on. To create an add-on, we go to the add-on tab. We don't have any add-ons right now, so let's create a new one: a PostgreSQL with a master-replica setup in AWS.
We need to create the add-on, and here it shows a form to create one. In this case, we want to use the Stateful Component Portal as our provider, and we want the AWS PostgreSQL master replica. We just need to click to do that. And what size will it be? In this case, we are going to use the medium balanced plan.
You no longer need to know the many AWS instance types and their cryptic names: high-mem, high-CPU, there are a lot of them. You just need to know whether it is small, medium, or large. If you want to know the details, they will be shown to you, but most likely the developer won't care about that.
Then, after you choose what you want to create, it shows you the form you need to fill in. If you want to create a PostgreSQL master replica in AWS, you only need to fill in this. The Barito App Group is related to logging. The name, the WAL-G settings, the database name, and the version: only this form.
This form is generated from the JSON schema that Eko mentioned previously. It shows you the form for the AWS PostgreSQL master replica. As you can see, you don't need to know which team creates this one. You don't need to specify ownership because it's already in gopay.sh; gopay.sh sends that information together with your input here.
Then, once you click submit, it shows the add-on in the provisioning state. From this point, the developer doesn't need to know what's going on behind the scenes, because the process is handled by the infrastructure team, and part of it is automated.
We're still working on the fully automatic way with Consul. But for now, once the developer submits the add-on creation, it sends a Slack notification to our team. SCP fetches the template, the one from GitLab, and combines it with the input from gopay.sh and from the user. It renders the template, commits to the repository that belongs to the component, and creates an MR.
This is the result of the template. It generates all those files: Ansible, Terraform, and the pipeline, all depending on the sizing. You can see it's M for medium. Our team just needs to review this MR and check whether there are any missing values and whether the sizing is correct. If all is good, we just merge it.
And the generated GitLab CI already includes the pipeline. It's a standard pipeline: with Terraform you need to plan, then apply that plan. There is also validation work from the governance team: when you run plan, it's automatically validated against our compliance rules. For Ansible, there is a dry run to see what kind of changes will happen, and then Ansible executes them.
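A generated pipeline of that shape might look like the following sketch. The job names, stage layout, and manual gates are assumptions, not the actual SCP pipeline.

```yaml
# Hypothetical shape of the generated .gitlab-ci.yml.
stages: [validate, plan, apply, ansible-dry-run, ansible-apply]

validate:                 # compliance checks from the governance team
  stage: validate
  script:
    - terraform fmt -check
    - terraform validate

plan:
  stage: plan
  script:
    - terraform plan -out=tfplan
  artifacts:
    paths: [tfplan]       # the reviewed plan is what gets applied

apply:
  stage: apply
  when: manual            # a human starts the apply after review
  script:
    - terraform apply tfplan

ansible-dry-run:
  stage: ansible-dry-run
  script:
    - ansible-playbook site.yml --check --diff

ansible-apply:
  stage: ansible-apply
  when: manual
  script:
    - ansible-playbook site.yml
```

Applying the saved `tfplan` artifact, rather than re-planning at apply time, guarantees that what was reviewed is exactly what runs.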
This is an example of the plan. It's a standard Terraform plan, no fancy things here. This all happens in the pipeline; you just need to start it. The apply is also the same: it creates the resources. In this case, it's creating the IP.
For Ansible, you need to pass information from Terraform to Ansible. We use Vault here: everything Ansible needs is written by Terraform into Vault. Once the pipeline has executed, it's done. It only takes about 5-10 minutes to provision the PostgreSQL, and the database is ready.
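One way to implement that handoff is a sketch like the one below, using the Terraform Vault provider's `vault_generic_secret` resource. The secret path, attribute names, and resource references are illustrative assumptions, not GoPay's actual layout.

```hcl
# Hypothetical handoff: Terraform writes the provisioned host details to
# Vault, where the Ansible stage of the pipeline reads them back.
resource "vault_generic_secret" "orders_db" {
  path = "secret/scp/orders-db"    # illustrative path

  data_json = jsonencode({
    host     = aws_instance.master.private_ip
    username = "postgres"
  })
}
```

On the Ansible side, the playbook could then read the same path, for example with the community `hashi_vault` lookup plugin, so no sensitive values ever pass through CI variables.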
From the developer's point of view, they now know the database has been created. The status becomes provisioned, and they get the metadata, such as the name, the username, the IP, and so on.
But they don't just get this information. As we mentioned before, anything created with SCP already follows our compliance, so when they create the database, monitoring is already included.
They already have the graphs and the dashboard, owned by their team for that component, and alerting as well. So if anything goes wrong, alerts are already set up to be assigned to their team.
It also includes instance monitoring, for things like CPU, RAM, and disk, with alerting included as well. And, as the previous talk mentioned, monitoring also needs to include logging. Logging is also set up for the instance and owned by the team, so they can access it directly. No need to set up anything else.
It has become very easy for developers to create a database with our team: just fill in a few small fields, and everything is ready. Now let's look to the future. Later on, the database will grow in size. What happens if the disk size is not enough? We simply need to increase the disk size. Let's talk about how we update a database.
Updating the database is similar to creating it. We go back to this view; what do you think you need to do to update? Click the edit button. It's right there. Very easy.
The form is also similar to before, generated from the JSON schema. If you want to increase the disk size, you update that value. The same goes for configuration changes: in this case, we want to update max connections in PostgreSQL, so we update that value.
Then we click submit. The process is the same: it sends a Slack notification for the MR, and SCP automatically commits and creates an MR. We can then review the diff: the database size is increasing, and the max connections configuration change is there.
But there is a third change: something called the SCP Context ID. What is it? In the previous generations, we mentioned that we had no audit. We didn't know what changed, who did it, or when it happened.
In SCP, we created that audit, that paper trail. We have something called Provisioning History. Because every change is made through gopay.sh, we have a full record of what changed, when it happened, and who executed it.
In this case, both executions were requested by Eko. We know that Eko requested the change and Eko requested the creation. If you click the merge request URL, you can see what changed and when the pipeline was executed. The provisioning status is there too: done for the applied change, and submitted when the MR has been created but not yet executed. That's it. Updates are easy now, and everything is recorded in SCP.
What if we want to apply an update across all of our instances? What should we do? We work at GoTo Financial, and there are usually a lot of regulations: we need to enforce security changes, and we need to improve our access management. Those are the two use cases we're going to use to demonstrate mass change.
One day, the security team wants to run a playbook against all of our Ubuntu instances. What do they do? Do they execute it manually on each instance? There are around 2,500 of them. No, they don't need to do that. Since we already have the template, they just need to add the playbook to the template.
They add the CIS Ubuntu 20 hardening playbook to the template, commit, and create an MR, and we review it. Once we merge the template, we bump the version; you can see the versions in the plan. After bumping the template version, we do a mass change across our instances: the template is re-rendered, and the pipeline is executed. It's pretty easy to do a mass change this way.
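Conceptually, the security team's change to the template is just one more role in the playbook plus a version bump. A hypothetical sketch (the role names are invented):

```yaml
# Hypothetical template playbook after the security team's MR: adding the
# hardening role here, merging, and bumping the template version rolls the
# change out to every instance on the next render.
- name: Provision PostgreSQL
  hosts: postgres
  become: true
  roles:
    - postgresql
    - cis_ubuntu_20_hardening   # new role added by the security team
```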
The other use case is where we switched SSH access. We had an internal access management system, and we migrated from it to Teleport. This was done by the governance team, and the change is similar.
They create an MR against our template; we review it, merge it, bump the version, and do the mass change. It's pretty easy, like Thanos snapping his fingers. The other notable thing is that those two changes, security and access, were not actually requested by our team; they were done by the security and governance teams.
This templating setup enables collaborative improvements. If any other team wants a change applied across all instances, they just need to create an MR. We don't need to do anything other than review it and bump the version.
When you use these templates, there are a lot of dependencies. The template has Terraform and Ansible; Terraform has its Terraform modules, and Ansible has playbooks, which have roles. How do we bump all of those?
Do we bump them manually? Check the versions manually? No, we created something for that too, to make our lives easier. It's a bit like Dependabot, but internally made, since the template is not a standard package; it works like a vendored one. When there are version changes, there is a Slack notification, and we do the review.
The MR looks like this. When we bump a playbook or a Terraform module, there is an MR to upgrade the template. The change looks like this; if we need to apply it to all of the instances, we just do the mass change.
We mentioned that this is our third generation. In the early days, we didn't have the Terraform state, so how do we onboard existing databases? For onboarding, we created a template that does Terraform imports. It's simple, nothing fancy: just a create template with a Terraform import.
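As a sketch, with modern Terraform (1.5+) an onboarding template could use an `import` block; the same effect is achieved with the `terraform import` CLI command. The resource address, instance ID, and attributes below are placeholders, not real values.

```hcl
# Hypothetical onboarding: adopt an existing, hand-provisioned VM into
# Terraform state without recreating it.
import {
  to = aws_instance.postgres_master
  id = "i-0123456789abcdef0"       # placeholder instance ID
}

resource "aws_instance" "postgres_master" {
  # Attributes written to match the existing instance, so the first plan
  # after import shows no changes.
  instance_type = "m5.xlarge"      # placeholder
  ami           = "ami-00000000"   # placeholder
}
```

Once the import succeeds and a plan shows no diff, the instance is fully under SCP management like any newly provisioned one.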
Using this Terraform import capability, we have already migrated around 2,100 of our 2,500 databases, including PostgreSQL and Redis; the others are still ongoing. Using Terraform import is pretty quick; the former way took much longer. That's the progress of our onboarding to SCP. Next, we are going to talk about GitOps, and Eko will deliver it. Thank you.
Thank you, Dembo. Back again to our title, achieving GitOps. Looking at what we already presented to you, have we achieved GitOps? To evaluate that, let's see the GitOps principles.
The first principle is declarative. It means the expected state of your infrastructure should be expressed in your IaC. That's the first one.
The second one is versioned and immutable. Every IaC change should be versioned, and versions are immutable: you cannot update them in place; you make another change if you want to update your configuration. The third is that changes should be pulled automatically by agents, and the fourth is that they should be continuously reconciled, so the state of the IaC in your repository is reflected in your actual infrastructure.
Right now we have achieved two of the four principles: declarative, and versioned and immutable. Two principles are left, and we are working on them. Hopefully, we can achieve all of these principles in our SCP.
The first is easy provisioning. Our developers don't need to use a CLI or create a ticket and send it to us; they just use our portal.
Then it complies with company standards, as Dembo mentioned before. If there are any changes, you update your Terraform or your Ansible playbook and bump all of the versions.
Then it is automated and updatable: you can see here that we can render the template and later update it. We have a single source of truth that reflects our infrastructure state.
Then we have the paper trail to know when something changes, who changed it, and what the effect of the change is. We also enable a collaborative environment. As Dembo mentioned before, we have two use cases where we can do collaborative work with another team.
In this talk, I want to give credit to our team. This is not only our effort; it is a team effort, including our intern from last year who worked so hard to create SCP.
Yep. I think that's all. Thank you for coming to the session. Thank you.