What Samsung Learned From Using Terraform to Build Bixby 2.0, Knox, and RCS
Jul 23, 2020
Learn how the globally distributed teams at Samsung optimized Terraform workflows and workspace design to manage multi-cloud infrastructure across AWS and GCP.
Samsung manages many global services with HashiCorp Terraform. From Bixby to the Samsung AI Service, Knox, and RCS, Terraform is a crucial underpinning technology in deploying and managing the multi-cloud infrastructure across AWS and GCP. This talk dives into the optimizations the team implemented in workflows and workspace design to cater to the needs of DevOps teams around the world, including South Korea, the United States, China, Europe, and India using Terraform Enterprise.
Good day. I'm Joonsik from Seoul, South Korea. I'm working as a DevOps engineer at Samsung Electronics. We all do lots of things, but these days I'm working on managing services efficiently using infrastructure as code with Terraform. My colleagues and I have been using Terraform for a couple of years, and we've got some tips on what to think about when applying it.
All of this is from our own perspective. I'll share these basics for anyone who's having trouble using Terraform for their projects. Using cloud providers only for VMs is an old story. Today, even one big project is intertwined with various cloud services and architectures.
Let me introduce the projects that our team has worked on. The Bixby customer AI assistant service is used all around the world through various Samsung devices, and a B2B service that we call Knox serves a variety of corporations. RCS is a large-scale service that delivers rich content to customers through messages with the carriers.
These services have over 60 million users in over ten environments, and they all use Terraform. There are over ten integrated development and operations teams working on various services and architectures: VM only, container, and serverless. We also run multi-cloud, with infrastructure as code by Terraform.
Why Did We Choose Terraform?
The technology reduces people's mistakes. Infrastructure written in code becomes, in a sense, part of the infrastructure build process. There are several benefits to using infrastructure as code: it helps reduce human error, it lets us collaborate smoothly with various people, and it provides a clear understanding of the current infrastructure.
That clearly described infrastructure is abstracted into objects closely linked to services, which means we can take the same approach to infrastructure and service. Even when the person in charge of infrastructure as code changes, infrastructure maintenance becomes easier than before. A design can also be reused for future projects, so we can call it reusable.
We must consider not only infrastructure as code itself, but also how we can work together to maintain integrity between the infrastructure and the code. One of the additional issues that comes to mind when you do a real project is how to work on your infrastructure as code with other teams.
There were times when we had to build a new, unexpected environment in a short time. Sometimes we had to design a new service on top of an existing legacy Terraform network design, such as creating a new service infrastructure on an existing network composed of complex Terraform code linking multiple private networks. We can handle these cases easily.
When we design a project's infrastructure, we need to consider various cloud providers. Each has different accounts, regions, and environments. In addition, the shape will vary depending on whether the service architecture is VM only, serverless, or container-based.
In more detail, it may even be necessary to distinguish, according to the structure of the working departments, between areas where the code changes frequently and areas where it does not. I think those are problems that need to be solved through the design of the Terraform workspaces and modules.
I think workspaces should be properly divided, which we can call well-grained. The important things are who is working on Terraform, what for, and the dependencies between states. Because Terraform puts all the resources in a directory into a single workspace state, modifying a large workspace is expensive: you have to browse through large folders to find the file you want. A small, clear workspace is concise, because there are only a few files when you open the folder, and that folder represents exactly the resources written in those files. Let's look at the yellow part.
Suppose we modify something using Terraform. With a monolithic workspace design, the scope of your Terraform workflow, the yellow part, grows without bound over one big state. With a well-grained workspace design, the workflow is condensed to a bounded yellow part, the same size as one small state. It also results in a well-organized, loosely coupled workspace design.
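As a rough illustration of a small workspace that stays loosely coupled to a larger one, the sketch below reads outputs from a separate network workspace instead of owning the network itself. It assumes Terraform 0.12 or later, the AWS provider, and an S3 state backend; the bucket, key, region, and the `private_subnet_id` output are hypothetical names, not taken from the talk.

```hcl
# Hypothetical "service-a" workspace: its state only contains service resources.
# The network lives in its own workspace; this workspace reads its outputs
# and never modifies the network resources.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-states"       # hypothetical bucket name
    key    = "network/prod/terraform.tfstate" # hypothetical state key
    region = "ap-northeast-2"                 # assumed region
  }
}

variable "ami_id" {
  type        = string
  description = "AMI for service A, supplied per environment (hypothetical input)"
}

resource "aws_instance" "service_a" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Assumes the network workspace declares an output named "private_subnet_id".
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```

With a layout like this, a plan in the service workspace only walks the handful of resources in its own state, which is the bounded scope described above.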
Who Is Working on Terraform?
It's important to know who is responsible for changing the infrastructure when designing Terraform, in order to avoid changing resources that are unrelated to the intended change. I think the real-world organizational structure is the most important factor. This is because changes to the development team's organization will result in changes in the services, and changes in the services may cause changes in the infrastructure. Thinking this through in advance reduces the amount of work to be done later on.
If only one DevOps team exists in one office, that can be solved through conversation. However, it gets harder when teams in other offices, who are harder to talk to, have to work together. Considering the roles of the people who have to use the infrastructure, it is necessary to think about how to work with the various teams.
One option is a nested design: the deeper a folder is, the more it depends on the plan for its parent folder. It is designed so that frequent changes deep inside affect the parents as little as possible.
As you can see, the infrastructure and its dependencies are revealed under a single folder. The referenced resources can be clearly identified in each state, but there are drawbacks. The full structure is too deeply nested, and it's hard to understand the entire infrastructure if you're a new member of the DevOps team.
I prefer a distributed design for workspaces, with separate repositories for each purpose. For example, if the EU team starts to manage only service A's infrastructure, we can suggest managing it in a separate repository, regardless of the environment. The EU DevOps team cares only about repository three, for service A, while the KR DevOps team covers all of the repositories for the global infrastructure. The permission to modify is given according to the appropriate role for each team. That is the important thing.
There are data sources in Terraform. Data sources refer directly to resources created on cloud providers. Terraform code is only one of the ways to create resources on cloud providers, so if a data source refers to the resources themselves, it won't have a direct dependency on existing Terraform code. This approach is also worth considering when you're working with an organization that makes its own decisions. But you have to be careful, because otherwise all of the network infrastructure could be changed. In our case, we were using a network infrastructure that another organization had already built with Terraform, so we couldn't erase it and make it new.
Nor could we easily refer to the complex combination of existing Terraform code. In that situation, a Terraform data source can be of some help. A data source based on a unique network ID refers to the resource created directly on the cloud. This approach is also worth considering when working with an organization that makes different decisions in a real enterprise.
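The `data` blocks described above (Terraform's data sources) might look like the following minimal sketch. It assumes the AWS provider and that the other organization hands over the ID of its existing network; the variable name, tag value, and security group name are hypothetical.

```hcl
variable "shared_vpc_id" {
  type        = string
  description = "ID of the existing VPC owned by another organization (hypothetical input)"
}

# Look up the network directly on the cloud provider by its unique ID,
# without depending on the Terraform code or state that originally created it.
data "aws_vpc" "shared" {
  id = var.shared_vpc_id
}

# Find one existing subnet inside that VPC by tag.
data "aws_subnet" "private" {
  vpc_id = data.aws_vpc.shared.id

  filter {
    name   = "tag:Name"
    values = ["private-a"] # hypothetical tag value
  }
}

# New service resources are attached to the existing network.
resource "aws_security_group" "service_a" {
  name   = "service-a" # hypothetical name
  vpc_id = data.aws_vpc.shared.id
}
```

Because the lookup only reads the network, applying this configuration creates the security group but leaves the other organization's resources untouched.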
There is not much of a problem if there is only one team in one place. However, imagine DevOps teams in the United States, South Korea, India, the EU, and China working together in the same repository to build infrastructure for their own environments at the same time. Managing infrastructure as code naturally becomes difficult. I think certain workflow rules are necessary to solve this problem.
So, we need a workflow that is adhered to by all members responsible for infrastructure as code. There are open source tools for this, and there is Terraform Enterprise. Infrastructure as code is usually managed through Git, which we are all familiar with. To prevent inconsistency between the infrastructure and the Terraform state, the basic policy should be to apply Terraform only after a change has been merged into the main branch. Usually, the main branch is master, right? Because plan does not affect the state, it is good to attach the plan to each PR, so that reviewers can get an idea of the shape of the infrastructure during the review.
If you're interested in open source tools, I believe you already know Atlantis: when you make a PR, you can type a comment to Atlantis, and it attaches the plan for the infrastructure that you want to modify. The person who reviews your pull request will find that much easier than just looking at code.
Terraform Enterprise is a paid service that can solve other problems as well. For what you pay, it provides various conveniences and features: remote state management, VCS integration with GitHub, Terraform policy enforcement, user management, and history management. If you can't afford it, it is important to maintain and manage the remote state yourself.
Consideration should be given to synchronization, to keep changes transactional when multiple people are working, and to encryption, because the state contains the entire contents of the infrastructure. In that case, we need the state encryption and synchronization provided by the cloud providers.
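As one possible way to manage remote state yourself with encryption and locking, the sketch below uses Terraform's S3 backend on AWS. The talk does not say which backend the team used, so this is an assumption; the bucket, key, region, and table names are hypothetical, and the locking arguments available can differ between Terraform versions.

```hcl
terraform {
  backend "s3" {
    bucket = "example-terraform-states"         # hypothetical bucket name
    key    = "service-a/prod/terraform.tfstate" # hypothetical state key
    region = "ap-northeast-2"                   # assumed region

    # Encrypt the state at rest, since it describes the whole infrastructure.
    encrypt = true

    # Use a DynamoDB table for state locking, so that two people
    # cannot apply changes to the same state at the same time.
    dynamodb_table = "example-terraform-locks" # hypothetical table name
  }
}
```

The bucket and the lock table have to exist before `terraform init` is run against this backend.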
Overly complex modules become the biggest enemy of those who are just getting into Terraform. The infrastructure should be clearly visible, rather than having various resources created by a single module with complex conditional statements. It is better to make several modules, each with a clear purpose, that differ from environment to environment.
An attribute of a resource cannot be set through a module when it is not defined in the module. Therefore, it's recommended to define the attribute up front, with the cloud provider's default as its default value, so that the module can be reused in the future. I recommend starting from an open source module design rather than from scratch, and please consider which attributes are likely to change.
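A minimal sketch of that idea, assuming a hypothetical single-purpose module for one AWS instance: attributes that may need to change later are exposed as variables up front, with defaults that match what the provider would do anyway. The module path and variable names are illustrative only.

```hcl
# modules/web_instance/main.tf (hypothetical module with one clear purpose)

variable "ami_id" {
  type        = string
  description = "AMI to launch (no sensible default, so the caller must set it)"
}

variable "subnet_id" {
  type        = string
  description = "Subnet to place the instance in"
}

variable "instance_type" {
  type        = string
  default     = "t3.micro" # assumed default for illustration
  description = "Likely to differ between environments"
}

variable "monitoring" {
  type        = bool
  default     = false # detailed monitoring is off unless a caller enables it
  description = "Exposed now so callers can change it later without editing the module"
}

resource "aws_instance" "this" {
  ami           = var.ami_id
  instance_type = var.instance_type
  subnet_id     = var.subnet_id
  monitoring    = var.monitoring
}
```

A caller that is happy with the defaults passes only `ami_id` and `subnet_id`, and one that needs detailed monitoring sets `monitoring = true` without touching the module code.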
I think the biggest potential advantage of Terraform is designing a project using various providers. The providers are written in Golang, so there isn't a dependency on a toolchain or on prerequisite packages.
Integrating the design and management of the various software layers and tools in infrastructure as code, everything except the code level of the service itself, enables a high level of clarity. If a Kubernetes cluster on a cloud provider, its metrics, its alarms, and even the deployments distributed onto it are all managed in Terraform, I'd expect a higher quality service to be possible as a result.
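To make the multi-provider idea concrete, here is a hedged sketch of one configuration that uses the AWS, Google, and Kubernetes providers together. It assumes Terraform 0.13 or later and an existing EKS cluster; the talk does not state which Kubernetes offering was used, and the variables, region, namespace, and bucket are hypothetical.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    google = {
      source = "hashicorp/google"
    }
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
  }
}

variable "cluster_name" {
  type        = string
  description = "Name of an existing EKS cluster (hypothetical input)"
}

variable "gcp_project" {
  type        = string
  description = "GCP project ID (hypothetical input)"
}

provider "aws" {
  region = "ap-northeast-2" # assumed region
}

provider "google" {
  project = var.gcp_project
}

# Read the cluster's connection details from AWS and use them
# to configure the Kubernetes provider.
data "aws_eks_cluster" "this" {
  name = var.cluster_name
}

data "aws_eks_cluster_auth" "this" {
  name = var.cluster_name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}

# A layer above the cloud provider, managed in the same code base.
resource "kubernetes_namespace" "service_a" {
  metadata {
    name = "service-a" # hypothetical namespace
  }
}

# A resource on the second cloud, declared next to the first.
resource "google_storage_bucket" "assets" {
  name     = "${var.gcp_project}-service-a-assets" # hypothetical bucket name
  location = "ASIA"
}
```

In practice each cloud would probably live in its own workspace, following the workspace design discussed earlier; they are shown together here only to keep the example short.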
When designing for multi-cloud, you should review each function before you work on it. Similar functions may differ in structure between clouds, and in folder layout, due to their dependencies.
The shape of the code will be different, so it is better to separate it accordingly. Finally, this is a quote from Conway that I keep in mind when I work on Terraform. Due to the nature of a company, there may be changes in your work team, and the design and working style of the infrastructure can vary with changes in the organization. I think we can manage this in a more flexible, better way with infrastructure as code. Thanks for watching.