PMI's Journey With Terraform
Jun 30, 2020
Learn how PMI researched, adopted, grew, and created training for HashiCorp Terraform in their fast-scaling, new internal IT organization.
Hello everybody. My name is Nikola Stjelja, and I'm an engineer at Philip Morris International. This talk is about how we started with Terraform, and it's very tightly aligned with how we started doing our internal transformation of the company.
Before I go into all of that, I want to do a bit of an introduction. I'm from Croatia. I've been working for 13 years, and I've been doing a lot of different stuff. I've worked at Philip Morris for the last three years.
From Outsourced Engineering to Building Their Own Teams
In the beginning, around five years ago, Philip Morris was a very traditional enterprise corporation. We had a strong IT, but we didn't have a lot of internal engineering experience. We mostly outsourced that, and we relied on different vendors to sell us solutions.
At that time we started to change how we work. We started to create new products, and we realized that IT is a critical component to our success in the market and in achieving our mission.
We wanted to bring engineering back into the company. I was one of the first engineers to be hired in the company in a long time. And as Philip Morris does—we do things at scale. We're a big company—we operate worldwide, and we have a lot of people working internally.
Starting the Terraform Journey
We started to do this transformation—making the company more agile, flattening the hierarchical structure and bringing expertise back into the company. The project that I started in was a big commercial solution, which we are using to sell our new products worldwide.
It was a big project—it had more than 150 people. At times it was bigger; at times it was smaller. People working in different countries—different continents, actually. The system was composed of 40 different subsystems, and it was a huge solution.
We changed the way we built systems internally. Previously, we did stuff in our data centers, in-house on the physical servers that we owned. We decided to go to the cloud, to leverage the power of AWS and different cloud solutions, to accelerate our development and reduce the amount of non-differentiating solutions we have to develop.
We wanted to create infrastructure at scale. We wanted to operate with a lot of subsystems. We wanted to create that infrastructure repeatedly: different environments, including ad hoc ones for experimentation.
We wanted to operate across multiple clouds because, for example, AWS is not present in Russia and we cannot operate with AWS in China. Different regulations dictate which cloud solutions we can use. We wanted to leverage the power of infrastructure as code to accelerate how we do all this development. As a company, we also have our auditing elements: internal controls that check what we are changing and what we are doing.
And infrastructure as code, making changes through code, was something that we thought would help us have a controlled environment: an environment where we know that the change we want to make was made for a specific business reason, and that it was controlled by a person other than the person who made the change, so that we have the right element built into the default IT processes.
The problem at the time was that we did not know Terraform at all. Our team was composed of multiple people. We had software engineers. We had DevOps engineers. We had people doing traditional system engineering. We had Salesforce developers. We had developers of different types in the team. None of us knew a lot about DevOps and applying development practices to operations. None of us had a clue how to do Terraform. The project was big; it was distributed across multiple countries, across multiple teams.
We had to learn how to do it. Around three and a half years ago, there was not a lot of documentation on how to do Terraform. There was one book and one blog, and the practices that we found there helped us get started, but they did not work for us.
We had to learn how to use Terraform correctly and with that how to do infrastructure as code in an enterprise at scale, in a distributed environment. Then we had to figure out what actually does not work. One of the first things we tried was a monorepo. We're going to have one team—one repo—we're going to put our code there. We figured out that that did not work for us.
Why? The reason was that we were not working on one problem. We were not working on one single monolithic solution. Our solution, as architected, was composed of 40 separate projects managed by separate teams. In those projects, we had multiple moving parts, and we were working at different speeds. Secondly, we did not know how we wanted to use Terraform. For some cases, having a monorepo worked. For some cases, it did not.
So what we did—and what worked for us—we did it in a way that was not mandated by the architect team. It was not mandated by management. Terraform was a technology that was adopted by engineers.
We did not look to somebody to tell us how to do it. We had to figure it out ourselves. We had a lot of conversations. We did a lot of internal meetings, checks, discussions, arguments, and it was like a grassroots approach. We figured out that separating our code into small isolated repositories, each of which is changed for only one reason, worked for us.
Then, in this experimentation, this trial and error, we could figure out what works for us and what does not work for us. We enjoyed the fact that Terraform as a technology was very well documented. You may not have known the specific problem you were trying to solve, but you could easily learn it.
Although we cannot use the same Terraform code between different cloud providers, because the APIs and the providers change, the language and the ways of working are the same. That brought us a certain level of confidence and quality in the solution.
Within this core DevOps group, we figured out how Terraform should work. The problem was that a lot of the people in the organization did not have that high a level of enthusiasm for the technology. People liked it, people wanted to learn it, but adopting it at work was hard for them. Especially for the people who had never done anything related to infrastructure, or people who, for example, had spent most of their career working as Salesforce developers or in Salesforce operations.
We had to figure out how to scale it. And again, it was not something which was mandated by management. This was not something that was mandated by some architects. Although we had architects working in our group, and it was a very heterogeneous team with different people with different experiences.
We said, "Let's do some training." We did internal training on Terraform—a one-day workshop. We did it several times with different people. And in that workshop, we covered all the basic elements, all the principles and practices you need to know to make sure that you can solve your problems. We had people with zero experience in technology and practices attend the workshop—and the next day, they were productive.
It wasn't good code from the beginning, and it was not something that, for example, I would have written. But they were able to move forward. And this is one of the reasons that the engineers at Philip Morris love HashiCorp products. We find that it's easy to solve problems consistently.
Growth in Terraform Adoption
We started seeing Terraform being adopted by different teams across the organization, independently of us. We started with this big project around three or four years ago. New projects started to fill different enterprise needs, which were required by the new market, the new environment we found ourselves in. New teams were started from the ground up. We had a lot of people coming to the company.
At that time, we did not have a single channel where all of us dealing with the same type of problem could communicate. People started doing the DevOps activities in islands. Over some time, we started having different enterprise sessions. We started talking to each other. We started figuring out what works, what doesn't work, what people are doing.
Everything was fine. We started getting the benefits of Terraform, of infrastructure as code. We talked to other teams; they started doing the same things. But then we all started to figure out that we were hitting the same type of problems. Our initial ideas as a group did not work all the time.
Password and credentials management
We figured out that managing passwords and credentials was not a good scenario for Terraform (see Vault).
Virtual machine configuration management
We figured out that using the default provisioners to manage the configuration of a virtual machine on AWS was not something the tool was good at.
Refactoring code without destroying production
How do you refactor the code? When the reason for changing your solution and the way you change it both shift, how do you refactor your Terraform code so you don't destroy production? Terraform destroying a resource, or making a change that drops the resource and creates a new one, is not something you can sell to management or to the business. "Sorry, I dropped production because, well, I had to refactor my Terraform code."
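One way to catch this kind of accident before it reaches production is to inspect the plan in machine-readable form. The sketch below is not something described in the talk; it assumes a plan has been exported with `terraform show -json`, whose output lists each resource under `resource_changes` with a `change.actions` array. The function name and the sample resource addresses are hypothetical.

```python
import json
from typing import List, Tuple


def destructive_changes(plan: dict) -> List[Tuple[str, str]]:
    """Return (address, kind) for every resource the plan would delete.

    kind is "replace" when the resource is deleted and re-created,
    and "destroy" when it is only deleted.
    """
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" not in actions:
            continue  # create, update, read and no-op are not destructive
        kind = "replace" if "create" in actions else "destroy"
        findings.append((rc["address"], kind))
    return findings


# Hand-written sample in the shape of `terraform show -json` output.
# The addresses are made up for illustration.
sample_plan = json.loads("""
{
  "format_version": "1.0",
  "resource_changes": [
    {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
    {"address": "aws_instance.web", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["delete"]}}
  ]
}
""")

for address, kind in destructive_changes(sample_plan):
    print(f"{kind}: {address}")
```

A pipeline could fail the build when this list is non-empty for a production workspace. When a pure rename shows up as a replace, moving the existing state entry to the new address (for example with `terraform state mv`) before applying is one common way to avoid the drop.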
How to Test Infrastructure Code
We had to start looking at these specific problems. One of the things that we started to look at was how you test your code. How do you make sure, instead of manually testing and manually verifying infrastructure, that what you're doing works well?
It took some time. It wasn't an easy process. We don't have an answer still for some things, but we managed to figure out one practice—not a standard—a practice, which worked for us in multiple cases.
For managing VMs, we said, "We're not going to use Terraform for that. That's not what the tool was built for. We're going to use Ansible." And it worked. We had teams who used Ansible to manage their accounts, and they said, "Nah, that's not working for us. Let's switch to Terraform."
A GitOps Approach
We created a GitOps-based approach, which was really simple. We use Gitflow as a branching model. We use the Terraform plan and apply cycle to check the Terraform code changes. We attach those plans to our pull requests, which we do using the Bitbucket and Jira integration.
You have the Jira ticket and a Git branch for it. From that branch, you do a Terraform change. You do a plan; you attach it; somebody else reviews it. You merge it. The code gets applied automatically through Jenkins. It took some time to build it. The biggest challenge was adopting CI as a standard, with Terraform of course, but this started working. With that, we started making changes very, very quickly.
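The rules in that workflow (a ticket behind every branch, a plan attached to the pull request, a reviewer who is not the author) can be written down as a small pre-merge check. This is only a sketch under stated assumptions: the `ChangeRequest` type, its fields, the branch naming scheme, and the ticket-key pattern are all hypothetical, and nothing here calls the Bitbucket, Jira, or Jenkins APIs.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumed ticket-key shape such as "INFRA-42"; adjust to the real project keys.
JIRA_KEY = re.compile(r"[A-Z][A-Z0-9]+-\d+")


@dataclass
class ChangeRequest:
    """Hypothetical summary of a pull request carrying a Terraform change."""
    branch: str
    author: str
    approvers: List[str] = field(default_factory=list)
    plan_attached: bool = False


def blocking_issues(cr: ChangeRequest) -> List[str]:
    """List the reasons this change must not be merged and applied yet."""
    issues = []
    if not JIRA_KEY.search(cr.branch):
        issues.append("branch name has no Jira ticket key")
    if not cr.plan_attached:
        issues.append("no terraform plan output attached")
    # The approval must come from a person other than the author.
    if not any(approver != cr.author for approver in cr.approvers):
        issues.append("no approval from someone other than the author")
    return issues


ok = ChangeRequest("feature/INFRA-42-add-subnet", "alice", ["bob"], True)
bad = ChangeRequest("hotfix/quick-fix", "alice", ["alice"], False)

print(blocking_issues(ok))
print(blocking_issues(bad))
```

Returning a list of reasons rather than a single boolean lets the CI job report everything that is missing in one pass, so the author does not have to fix one problem per pipeline run.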
As the business grew—as our products went live on production and we started serving multiple markets and each market has a different need—we had to make infrastructure changes very quickly. We did not want to slow the business—particularly our development teams—with the way we do infrastructure.
Applying these practices, applying this tool, applying this way of sharing the knowledge of how to use the tool inside the organization helped us accelerate the business. It helped us make changes quickly, consistently, and with good quality.
HashiCorp Product Suite Rolled Out at PMI
Now, we know that this is not the end goal, and this is not the only solution that we use. In PMI, different teams at different levels (product teams, infrastructure teams) use a lot of HashiCorp products. We really like the open source elements of it.
I've seen Packer used. We use Vault. We use Consul. And the consistency of the documentation and the quality of the solutions helped us adopt them very quickly. We also looked at writing our own Terraform providers for cloud solutions that have a REST API but don't have a quality provider.
This is where we hit a snag, organizationally. For example, we learned that if we want to use Terraform for some solutions, we have to have a very good and regularly updated Terraform provider. We don't have, across the organization, the Golang skills that are required for writing one. We have some Golang skills; I spent one month writing a Terraform provider to see how it goes.
It's not hard, but it's not easy either, and it's not as easy as writing an Ansible module, which is written in Python and has very good documentation. This was a part we were struggling with.
We said we don't want to write providers. Right now, we don't see a business case for us to invest in people who are going to write them. And that also drives how we adopt Terraform across the entire landscape of the cloud solutions that we are doing. If there is a good provider, we usually adopt Terraform. If there is not a good provider, we start looking at something else. The adoption of the solutions is organic. It was never driven by enterprise architecture.
We're now coming to a point where we're consolidating across the organization. People at different levels are having discussions. Terraform is part of our enterprise toolbox. At this moment, we are a hybrid cloud company.
We operate a very large public cloud footprint, where we provide a standardized engineering platform for teams so that they don't focus on operations. Instead they focus on development and create solutions, which differentiate our products on the market.
We also provide our private cloud solution called SDI that I'm working on. I moved away from the public cloud space for the commercial platform into the private cloud team. We operate this solution across multiple regions. We have our datacenter. We have a lot of factories where we're deploying this solution, which is serving as the bedrock for our IoT product.
The solution is based on VMware—and we see Terraform being adopted there. We see Terraform being adopted by the teams that are building solutions on our private cloud for the factory setting. They’re using the VMware module for VRA, where people can codify their infrastructure—their machines—in the same way they would codify with AWS.
Multi-Cloud Deployments at PMI
We see teams which are doing two clouds: our private cloud, SDI, and AWS in the same configuration. They're connecting the public cloud and private cloud resources, which is cool to see emerging naturally in the organization. Inside the private cloud team, we're also using Terraform to codify our infrastructure.
Right now, we're deploying OpenShift clusters across the organization, across the different regions that we operate in. We are using Terraform to codify our base VMware infrastructure. The technology is very present in the organization. We are using Packer for creating and managing virtual machine images across the public and the private cloud. We use Vault for secret management, and we are starting to integrate with it very heavily and to leverage its solutions.
The biggest driver—at least how I see it—is the ease of adopting the technology. I would say that Terraform—but I think it is also true for all the HashiCorp solutions—it’s like PHP. It's very easy to learn, but it is hard to master.
You can start with it very quickly, and the documentation is so good. The tools are consistent and of good quality, so you can solve your problems very quickly. But you can also write really crap code and make a lot of mistakes, and those mistakes are sometimes hard to fix. You have to spend time learning how to refactor the code, how to work with it, how to scale it. This is a pitfall of the technology, but the thing is, it's inherent in tools that are this easy to pick up.
Well, thank you very much for listening to this talk. If you have any questions, please ask them. Here is my LinkedIn profile. I'm always able to respond to messages, and I hope you enjoyed the presentation.