Learn about a Malaysian energy company's DevOps journey while operating infrastructure as code in both AWS and Azure using HashiCorp Terraform.
Lisa Chan: It's so nice to be here with everybody today to share our DevOps journey. We'll tell you a bit about who we are, why we did what we did, and some of our challenges, and we'll take a deep dive into Terraforming. Looking back, of course, we've got some war stories and lessons learned to share.
I'm Lisa Chan. I'm a bit of a contradiction in the IT world. My background is not technical, yet I find myself the head of the software engineering team in PETRONAS. We build bespoke applications for the PETRONAS group of companies.
Because we do so much development, we inevitably look into automation, DevOps, Terraforming, agile implementation, and all that good stuff.
I don't do that much actual work, which is good because I'm not very technical. Most of my job entails removing bottlenecks for my team, getting funding for my team, and explaining how things work to people in our business who are also nontechnical.
But I leave all the good work that we do to Syafiq, who will introduce himself next.
Ahmad Syafiq: Thank you, Lisa. Hi, I'm Ahmad Syafiq. I'm head of DevOps engineering in PETRONAS. I'm working with Lisa on transforming the organization toward a new way of working with agile and DevOps. Back to you, Lisa.
Thank you, Syafiq. Before I get into what we did in DevOps, I want to talk a bit about why we did what we did. If I think about it, I talk more often about why we need DevOps in our organization than I do about how we do DevOps.
With senior stakeholders, I talk about minimizing waste and maximizing the acceleration toward delivering features to our users quickly. But when I talk to my own IT counterparts, I talk about the human suffering that is involved in not doing DevOps, and sometimes I'll show them this picture showing 3 people praying in front of a server rack. I think anyone who's ever been in IT can relate to this photo, when you're on holiday or it's 2:00 AM and there's an outage.
It doesn't matter where you live in the world—it's always 2:00 or 3:00 AM when there's an outage.
We talk about how important automation is, how important it is that we have control over our systems. We talk about creating safe systems of work, and that's what DevOps is all about, that's what we're trying to create.
When I do this, it helps to make an emotional connection to the people whose work I'm trying to change, whose tooling I'm trying to change, whose ways of working I'm trying to change. I feel that it's as much a technical change as it is a cultural change, as many people have said.
What exactly is DevOps for us in PETRONAS? It's a mindset and way of working that allows us to accelerate the technology value stream. That stream turns a business hypothesis into a technology-enabled service or product so that our customers can enjoy value as quickly as possible.
I took many of these words from the DevOps Handbook. But why do we need a definition when something is so easily Googled, so easily found online, and so popular in the media today, when everybody in the IT world more or less knows what DevOps is?
In the early days, we had a lot of misconceptions about what DevOps was and what it wasn't. I'll give you some examples. There were people that thought that if you put a dev guy and an ops guy in the same team, that DevOps would just magically happen. Another misconception was that DevOps and agile were the same thing. Some people even thought that if you implemented DevOps, that would guarantee you a product that users would love.
Maybe DevOps is a little of these things, but it's not the entire picture.
We had to be really clear about what DevOps would help our business to accomplish, and about what our business is.
Before I get into our journey, I want to give everybody an idea about our IT shop. It's pretty big as far as enterprises in our part of the world go. We serve the whole Petronas group of 50,000 users domestically and internationally.
We have 700 IT projects running at any one time. We have more than 2,000 applications for just 50,000 users, which is a lot. In IT, we've got about 1,800 staff and 18,000 tickets a month, and this doesn't even include some of the service request tickets. These are just tickets to do with incidents and outages and so on. With the rest included, the total is about 50,000 tickets a month.
The case for change and the implementation of DevOps is real, and there's lots of value that we stand to gain.
The screen is showing a high-level depiction of our journey to DevOps. We started in 2017 like many people do, because in IT it's always overwhelming when you're running such a legacy shop. There are always issues and things to do, always new projects coming in, and never time to pay down that technical debt or introduce initiatives that will improve developer productivity. No one ever had time to focus.
We took a team and gave them 5 applications that IT had control over—the funding, the releases, the features, and so on.
We told them, "Implement the tooling of your dreams, and if it works, we'll extend this to the rest of the organization." And that's what they did.
It didn't even take that long. I remember asking the team, “What are you going to buy, and how much is it going to cost us?" I got a response within 24 hours, so the team had obviously been thinking about this for a very long time. I think that's one of the key learnings for me: Good developers and technical guys know what they need to make their lives better, to make their work more productive, to preserve the flow of their work.
I think the best thing we can do as leaders is to just give them the space to create and to improve and to roll out the changes that they need to roll out.
We shielded this team from incoming work and let them get on with the tooling. We propagated their successes. We did lighthouse events, we did show-and-tells, and 2019 and 2020 were all about implementing DevOps at scale.
Today, we have more than 200 products on the tooling, and more teams are practicing DevOps, though, granted, with varying degrees of maturity.
We realized early on that agile was a precursor to practicing this, so we invested a lot in agile certifications and training so that people could maximize the use of the tooling. Because it wasn't just technical change; it was a people and process change.
It's exceeded all of our expectations. The question is, where does that bring us today? Just to give you an impression of the DevOps activity going on today: we went from not much at all back in 2018, when we started measuring this, to doing a whole lot of it now.
And I know a lot of these metrics, like 273 deploys a day, don't say much about how much value monetarily we're adding to the business, but it does say a lot in terms of the confidence and the maturity of the teams in terms of how they're using the tool sets, how they're introducing their changes, how they're deploying to production.
I'm infinitely proud of my teams when I look at these metrics and all that they've accomplished since 2017.
We love our tools, but I'm not going to go through our entire tool set yet. I'm going to hand over to Syafiq shortly to do a deep dive on Terraforming. I just want to level set that toward the end of 2020, we had almost 200 products onboarded to our CI/CD tooling and had achieved a lot of success in terms of our application-related automation.
But we did not make a lot of headway around infrastructure automation, except for a few isolated POCs here and there. Inevitably, this led to a lot of difficulties that we continue to have, like environment readiness. And when those environments were ready, were they consistent, were they predictable? You'd go back and forth with the operations team to get what you wanted. We knew this was almost the final piece of the puzzle that we had to implement within our DevOps tooling.
I'm so happy to pass you on to Syafiq. He'll tell you a lot more about what we've done with Terraform.
Ahmad Syafiq: All right. Thanks, Lisa.
No matter what we are trying to accomplish here in Petronas, no matter how big or small, all adventures start with 1 small step. I'm going to walk you through our journey in implementing Terraform.
In the early term, we took a step back and looked at our current processes, how we implemented the infrastructure design, the RBAC, and the policies that had been set up, so that we could do solid due diligence in our infrastructure-as-code tool selection. We also collaborated with our enterprise architecture, cybersecurity, cloud, and network teams to make sure that everything was covered from the beginning.
Because Petronas is currently using a hybrid cloud solution, on Azure and AWS, we needed to make sure that every element was covered.
And as we transitioned from the early-term to the mid-term plan, we defined our standard reference architectures for our basic web application, for microservices (for example, Kubernetes), and for function as a service for our serverless applications, which we then converted into several templates from which requesters can select for their infrastructure provisioning.
We also started to do testing in our sandpit environment and selected a few of our products for the pilot to make sure that our implementation was working as we had planned.
Moving on to the long term, we started gaining confidence. The team is now getting our new IaC Azure and AWS landing zones ready, currently at 80% readiness for our cloud migration initiative, meaning that Terraform will move from being the exception to being the standard way of automated provisioning for all of our future workloads.
This was important for us, to declare clearly to everyone that we are all aligned on the most sensible way to proceed.
We are practicing these 5 principles and best practices in our current implementation, the first of which is the plan and apply.
It was important for us to make sure the proposed changes matched what we expected before we apply the changes.
We also put in place the approval gates in our CI/CD pipeline, especially for our apply and destroy command, to control the changes and avoid any mistakes.
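As a rough illustration of those first two practices together, an Azure Pipelines definition can run `terraform plan` in one stage and gate `terraform apply` behind a manual approval. The stage names and environment name below are hypothetical, not our actual pipeline; in Azure DevOps, the approval check itself is configured on the environment, not in the YAML:

```yaml
# Hypothetical two-stage pipeline: plan first, gated apply second.
stages:
  - stage: Plan
    jobs:
      - job: terraform_plan
        steps:
          - script: |
              terraform init
              terraform plan -out=tfplan   # reviewers check this output before approving
            # In practice the tfplan file is published as a pipeline artifact
            # so the Apply stage consumes the exact plan that was reviewed.
  - stage: Apply
    dependsOn: Plan
    jobs:
      - deployment: terraform_apply
        environment: infra-prod   # manual approval check is configured on this environment
        strategy:
          runOnce:
            deploy:
              steps:
                - script: terraform apply tfplan
```

Applying a saved plan file, rather than re-planning at apply time, guarantees that what was approved is exactly what gets executed.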
The third one is the code structure. Basically, there is no right or wrong in choosing a code structure. Rather, it is highly dependent on the project type or size, the complexity of the project, how often the infrastructure changes, and which deployment platform and deployment services we are using, for example, Kubernetes or OpenShift. Both might require a slightly different approach.
Lastly, how are the environments grouped? For example, is it grouped by environment? Is it grouped by the region or by the project?
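As one illustration of grouping by environment (the paths and module here are made up, not our actual layout), each environment can be a small root module with its own state that calls shared modules:

```hcl
# environments/dev/main.tf (hypothetical layout, grouped by environment)
module "webapp" {
  source      = "../../modules/webapp"  # shared module reused across all environments
  environment = "dev"
  location    = "southeastasia"
}
```

A `prod` directory would call the same module with different inputs, so environments stay consistent while their states remain isolated.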
For the naming convention, we use underscores in place of dashes in our Terraform names, but beware that cloud resources have many hidden restrictions in their naming conventions. Some cannot use dashes, for example, and some must be camel case. And of course, do not repeat the resource type in the resource name, to avoid confusion.
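For example (all names here are illustrative), the Terraform label uses underscores, while the cloud-side `name` must follow that resource's own rules, such as an Azure storage account allowing only lowercase letters and digits:

```hcl
# Terraform label: underscores, and no repetition of the resource type in the name
resource "azurerm_storage_account" "app_logs" {
  # Azure storage account names: 3-24 lowercase letters and digits, no dashes
  name                     = "stapplogsdev01"
  resource_group_name      = "rg-app-dev"   # many other resource types do allow dashes
  location                 = "southeastasia"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
```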
Last but not least is remote state, which refers to storing the Terraform state file in a shared location, for example, in an AWS S3 bucket or an Azure Blob Storage container.
This allows team collaboration and easier provisioning, and lets you control and audit your infrastructure changes when you move everything to production.
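A minimal sketch of an `azurerm` remote state backend (the resource group, storage account, and container names are placeholders):

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"       # placeholder names throughout
    storage_account_name = "sttfstatedemo"
    container_name       = "tfstate"
    key                  = "webapp/prod.terraform.tfstate"
  }
}
```

The `azurerm` backend also locks the state automatically using blob leases, which prevents two runs from writing to the same state at once.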
We are currently getting our new IaC landing zones for Azure and AWS ready for our cloud migration initiative. As we all know, a landing zone is the composition of multiple resources, for example, modules, blueprints, or services that deliver a full application environment. In Petronas, we are adopting the Cloud Adoption Framework (CAF) from Microsoft for our Azure landing zone, which helps to maintain a set of curated modules.
We mainly use modules to enforce a consistent set of configuration settings and requirements. We implemented 5 structures, from Level 0 to Level 4, with each of these layers serving a different purpose.
For example, Level 0 is the launchpad, which at the subscription level gives privileged access to the workstations or service principals created at that layer. Level 1 is where the security and compliance configuration takes place, for example, the RBAC, the policies, the OMS monitoring, and shared security services.
Level 2 is for the hub and spokes, as well as the shared services of each environment. It also refers back to our design document for the backup, disaster recovery, the Azure monitor, patch management, and so on.
Level 3 is where we manage the application landing zone in a spoke environment, for example, the ETS, the WAF, VM, as well as app service plan.
Level 4 is where we manage the deployment of the application, for example, Spring Boot, .NET Core, the microservices, and so on.
As you can see, each of these levels has its own agent and its own Terraform state. And we have set the MSI in each layer to have read and write access to its own layer, but only read-only access to the previous layer.
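One common way to give a layer read-only visibility into the layer below it is the `terraform_remote_state` data source. The backend configuration and output name below are placeholders, and the CAF tooling has its own conventions for wiring this up; the snippet only illustrates the general pattern:

```hcl
# Level 2 reading outputs published by Level 1 (placeholder backend config)
data "terraform_remote_state" "level1" {
  backend = "azurerm"
  config = {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "sttfstatedemo"
    container_name       = "tfstate"
    key                  = "level1.terraform.tfstate"
  }
}

# Consume a Level 1 output without any write access to Level 1's state
# (assumes Level 1 defined an output named "log_analytics_workspace_id")
locals {
  workspace_id = data.terraform_remote_state.level1.outputs.log_analytics_workspace_id
}
```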
The goal is to have standardized deployment, accelerate the setup and configurations, set common practices, and set simple and reasonable configurations.
We are also following, as much as possible, the same structure from our Azure landing zone in our AWS landing zone, which is currently at 80% readiness.
Of course, some tricks are needed to adjust for the differences in those 2 platforms.
This is a super-high-level design for our IaC implementation in the Azure landing zone.
As you can see here, all of our Terraform scripts and configurations are stored in our Azure DevOps repo, which then integrates with our Azure pipeline for our CI/CD implementation. We also integrated Azure Key Vault for the resource credentials and configured it into the pipeline to comply with our security standards.
Since we need to make sure that everything that we are going to push to production is properly tested and safe, we also enforce and manage our source code with a pull request for a proper review and also the approval gates by the relevant parties before pushing to production.
Of course, once approved, the pipeline will then be executed by self-hosted agents on Azure VMs using the Rover image from the Docker registry, and it will call the CAF-based modules in the public hub, which reference back to the original script.
Once everything has been executed and provisioned, it will store the Terraform state file in blob storage with state locking enabled. The reason we enable state locking is that we want to prevent concurrent runs of the Terraform script against the same state.
We had this issue previously, and I think that's why we implemented this in our current design.
This is our super-high-level design for our AWS platform. It's not much different from Azure: the same tools, with the Terraform scripts stored in Azure Repos and executed from the Azure pipeline as well.
And we retrieve secrets from AWS Secrets Manager in the configured pipeline as a security measure, and the Terraform state file is then stored in an S3 bucket.
For this, we are using DynamoDB for the state locking mechanism. And then the same process as well:
Once the pipeline is approved, the Terraform scripts will be applied in order to provision the required resources.
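A minimal sketch of that S3 backend with DynamoDB locking (the bucket, table, and region are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tfstate"            # placeholder bucket name
    key            = "landing-zone/terraform.tfstate"
    region         = "ap-southeast-1"
    encrypt        = true                          # encrypt state at rest
    dynamodb_table = "tf-state-lock"               # table used for state locking
  }
}
```

The DynamoDB table needs a string partition key named `LockID`; Terraform writes a lock item there so that concurrent runs against the same state fail fast instead of corrupting it.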
I've talked about the reference architecture and the Terraform templates we created based on it. How should we utilize those templates as a catalog?
We have come up with a self-service portal where users can submit cloud infrastructure provisioning requests via a single-pane view. This initiative indirectly showed the successful collaboration with our cloud team, especially in coming up with this MVP1 of the self-service portal over the last 2 months.
Users just need to go to the portal, select the predefined infrastructure from the catalog, and then submit the request via the portal. They can also monitor the status of their request from that portal. Of course, there is still an approval gate to validate the request, as every provisioning has a cost incurred.
Once approved, the underlying Terraform pipeline will be triggered, and by having this, we have been able to remove the approval gates that we had set in our Azure pipeline.
Some people say it is never too late to be what we might have been. We have come from a manual request form, where we submit the request via email and everything goes back and forth through email. You might get a reply 3 days later saying more details are required, or maybe the infrastructure you requested does not follow the standard, so you fix that.
You resend and you get a reply in another 3 days, and then the same loop over and over. In the end, it takes 7 to 10 days, or maybe more than 2 weeks, to get all the environments ready.
By having this self-service portal, everything will be guided and predefined, and based on our pilot rollout, it takes about 4 to 8 hours for the entire process to be completed. This will be a continuous improvement, and we will add features to allow additional options to be selected by the requester in our upcoming MVPs.
At the end of the day, what we want to have is full automation end to end in our infrastructure provisioning, to align with our infrastructure as code roadmap.
Now I will pass back to Lisa to conclude our presentation. Back to you, Lisa.
Lisa Chan: Wow. I get so excited whenever I see Syafiq present what we've done with Terraforming. I mean, look at the manual form; it's the stuff of nightmares. It's like a horror movie that never ends, and you never get your environments. You never get the infrastructure provisioning and it just goes on and on in some bureaucracy of form filling.
I want to talk now a bit about the lessons learned that we wish we had known when we started this journey.
I've mentioned that agile was almost a prerequisite, but it's like a chicken-and-egg situation: you can't do agile without DevOps, and you can't do DevOps without agile. I wish that we had embarked on both initiatives at the same time, and started earlier on teaching people, and getting them certified, about what agile was and wasn't.
We even had situations where people just reorganized their project plans into sprints, but none of the underlying engineering changed, but we were saying, "We're agile, we're so freaking agile." I think it would've been good to invest earlier in the capability development, which is something that we're doing quite heavily now.
Even though I keep talking about DevOps, dev and ops are still very much different departments in Petronas. Our work around the self-service portal and Terraform brought the departments closer together. To be honest, our early initiatives were very much about dev trying to get away from ops and the dependencies and the processes, trying to automate away the work that ops did for us. But infrastructure provisioning was true collaboration, and we now have our ops counterparts working with us almost every day.
I remember a time when I spoke to the head of ops maybe once a month or once a quarter. Now, I speak to him almost every day. And it's true that, for every level of the organization, meaningful empowerment has developed.
Because we're starting up new foundations and we're building everything from scratch, we have new landing zones, we have the cloud migration for projects, we have DevOps, we have new tooling, new everything.
This tends to make governance and cybersecurity pretty nervous. So there is some level of persuasion that I think we have to engage in to come to a middle ground.
I think that giving too much empowerment, too much control, does create chaos, and I think people have to be comfortable with the changes that we're implementing. That's why we have manual gates for approval, but, you know, one day it might be fully automated.
It might be totally no humans checking, and that's the dream, right?
But also I think whenever we make a mistake, the team has to be given some support and some guidance and flexibility to work through those mistakes. Very often, when we make a mistake, we end up introducing a new form, a new gate, a new level of approval, and so on. It never ends.
I think psychological safety, and also not having a knee-jerk reaction to introduce more processes when mistakes happen, is something that we still struggle with and something that leaders need to push back on.
That's it for our presentation. Please feel free to add us on LinkedIn and tell us what you thought. I'd be so happy to connect with any of you to see what we're doing wrong, what we could be doing better, how you approach DevOps and Terraforming in your own organization.
Thank you very much for having us.